Expand supported file formats #109

Open
phydev opened this issue Mar 13, 2023 · 0 comments

phydev commented Mar 13, 2023

A parser for different file formats is needed, especially for processing csv files that contain several trajectories.

| file format | description | priority |
|---|---|---|
| .xyz | file with several atoms and time points | 1 |
| .csv | csv file with several trajectories | 2 |
| LAMMPS | Molecular dynamics | 2 |
| .pdb | protein data bank | 3 |

.xyz file

Nparticles [integer]
comment [character]
X Y Z [repeat Nparticles]
[repeat Nframes] 
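
A minimal sketch of a reader for the layout above, not an existing trajpy function. The name `parse_xyz` is hypothetical. It assumes whitespace-separated values and takes the last three fields of each particle line, so files that put an element symbol before the coordinates should also work.

```python
from typing import Iterable, List, Tuple

Frame = List[Tuple[float, float, float]]


def parse_xyz(lines: Iterable[str]) -> List[Frame]:
    """Return one list of (x, y, z) tuples per frame (hypothetical helper)."""
    it = iter(lines)
    frames: List[Frame] = []
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between frames
        n_particles = int(header.split()[0])
        next(it, None)  # the comment line is read and ignored
        frame: Frame = []
        for _ in range(n_particles):
            line = next(it, None)
            if line is None:
                raise ValueError("file ended in the middle of a frame")
            # Assumption: coordinates are the last three fields of the line.
            x, y, z = (float(value) for value in line.split()[-3:])
            frame.append((x, y, z))
        frames.append(frame)
    return frames
```

The trajectory of particle `i` is then `[frame[i] for frame in frames]`, assuming the particle order is the same in every frame.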

CSV with several trajectories - format definition

The csv should contain 5 columns: time t, 3 spatial (x, y, z) components and the trajectory identifier id.
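
One possible way to split such a file into trajectories is sketched below. It assumes a header row naming the columns exactly `t, x, y, z, id`; that header and the function name `read_trajectories` are assumptions, not something the format definition above fixes.

```python
import csv
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

Point = Tuple[float, float, float, float]  # (t, x, y, z)


def read_trajectories(handle: TextIO) -> Dict[str, List[Point]]:
    """Group the rows of a multi-trajectory CSV by the `id` column."""
    trajectories: Dict[str, List[Point]] = defaultdict(list)
    # Assumption: the first row is a header with the names t, x, y, z, id.
    for row in csv.DictReader(handle):
        point = (float(row["t"]), float(row["x"]),
                 float(row["y"]), float(row["z"]))
        trajectories[row["id"]].append(point)
    return dict(trajectories)
```

Rows are kept in file order, so a file that interleaves trajectories still yields one time-ordered list per `id` as long as the rows themselves are ordered in time.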

LAMMPS data file format

Large-scale Atomic/Molecular Massively Parallel Simulator is a molecular dynamics program from Sandia National Laboratories.

More details about the file format: https://docs.lammps.org/read_data.html

The LAMMPS dump file can be written in YAML with the following structure:

---
creator: LAMMPS
timestep: 0
units: lj
time: 0
natoms: 3
boundary: [ p, p, p, p, p, p, ]
thermo:
  - keywords: [ Step, Temp, E_pair, E_mol, TotEng, Press, ]
  - data: [ 0, 0, -27093.472213010766, 0, 0, 0, ]
box:
  - [ 0, 16.795961913825074 ]
  - [ 0, 16.795961913825074 ]
  - [ 0, 16.795961913825074 ]
  - [ 0, 0, 0 ]
keywords: [ id, type, x, y, z, vx, vy, vz, ix, iy, iz,  ]
data:
  - [     1 , 1 ,  0.000000e+00 ,  0.000000e+00 ,  0.000000e+00 ,  -1.841579e-01 , -9.710036e-01 , -2.934617e+00 , 0 , 0 , 0, ]
  - [     2 , 1 ,  8.397981e-01 ,  8.397981e-01 ,  0.000000e+00 ,  -1.799591e+00 ,  2.127197e+00 ,  2.298572e+00 , 0 , 0 , 0, ]
  - [     3 , 1 ,  8.397981e-01 ,  0.000000e+00 ,  8.397981e-01 ,  -1.807682e+00 , -9.585130e-01 ,  1.605884e+00 , 0 , 0 , 0, ]
---
timestep: 100
...
---

Parsing this file format is straightforward with the yaml.load_all() function.
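
A rough sketch of what that parser could look like. It uses PyYAML's `yaml.safe_load_all()` in place of `yaml.load_all()`, since recent PyYAML versions require an explicit `Loader` argument for the latter. The function names are hypothetical, and using the `time` field (falling back to `timestep`) as the time coordinate is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple


def trajectories_from_frames(
    frames: Iterable[Optional[Dict[str, Any]]],
) -> Dict[Any, List[Tuple[Any, float, float, float]]]:
    """Collect (t, x, y, z) per atom id from already-loaded YAML documents."""
    trajectories = defaultdict(list)
    for frame in frames:
        if not frame:
            continue  # an empty document after a trailing '---' loads as None
        columns = frame["keywords"]
        i_id, i_x, i_y, i_z = (columns.index(k) for k in ("id", "x", "y", "z"))
        # Assumption: prefer the physical time, fall back to the step counter.
        t = frame.get("time", frame.get("timestep"))
        for row in frame["data"]:
            trajectories[row[i_id]].append((t, row[i_x], row[i_y], row[i_z]))
    return dict(trajectories)


def read_lammps_yaml(path: str):
    """Load a multi-document YAML dump and return per-atom trajectories."""
    import yaml  # PyYAML (third-party), imported here so the helper above works without it

    with open(path) as handle:
        # safe_load_all is lazy, so it must be consumed while the file is open.
        return trajectories_from_frames(yaml.safe_load_all(handle))
```

Looking the column positions up from `keywords` in every frame keeps the sketch independent of which per-atom fields were dumped, as long as `id`, `x`, `y` and `z` are among them.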

Protein Data Bank (PDB) format

Standard file format for protein structures. Each file contains several atoms, and different files can correspond to different time steps. Each pdb file can contain a snapshot of the system or several trajectories, so we need to process several pdb files at once to extract trajectories.

A possible workflow would be:

  1. Read each pdb file and extract the trajectories per atom
  2. Write a CSV file using the format (t, x, y, z, id), where id is the atom identifier.
  3. Use the CSV file to compute the features using trajpy
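
Steps 1 and 2 could be sketched as below, assuming one snapshot per pdb file and using the file index as the time `t` (both are assumptions, as are the function names). Coordinates and the atom serial number are read from the fixed columns of `ATOM`/`HETATM` records; step 3 is left to trajpy and is not shown.

```python
import csv
from typing import Iterable, Iterator, TextIO, Tuple


def atoms_in_pdb(lines: Iterable[str]) -> Iterator[Tuple[int, float, float, float]]:
    """Yield (serial, x, y, z) for every ATOM/HETATM record."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            serial = int(line[6:11])   # columns 7-11: atom serial number
            x = float(line[30:38])     # columns 31-38
            y = float(line[38:46])     # columns 39-46
            z = float(line[46:54])     # columns 47-54
            yield serial, x, y, z


def write_trajectory_csv(frames: Iterable[Iterable[str]], handle: TextIO) -> None:
    """Write (t, x, y, z, id) rows; `frames` holds the lines of each pdb file."""
    writer = csv.writer(handle)
    writer.writerow(["t", "x", "y", "z", "id"])
    # Assumption: files are given in time order, so the index serves as t.
    for t, lines in enumerate(frames):
        for serial, x, y, z in atoms_in_pdb(lines):
            writer.writerow([t, x, y, z, serial])
```

With real files, `frames` could be built as `(open(p) for p in sorted(paths))`, provided the file names sort in time order.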

More information about pdb file format: https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)

@phydev phydev added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers future-work Code that should be implemented soon labels Mar 13, 2023
@phydev phydev pinned this issue Mar 13, 2023
@phydev phydev self-assigned this Jul 19, 2023
phydev added a commit that referenced this issue Jul 21, 2023