This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

how to ? #1352

Closed · naveenmeena584 opened this issue Aug 30, 2018 · 17 comments

@naveenmeena584 commented Aug 30, 2018

Thanks

@cwehmeyer (Member)

Let's look at the tutorial notebook you mentioned.

The call

pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')

ensures that the file alanine-dipeptide-nowater.pdb exists in the directory data and returns the relative path as a string. If you know that the file already exists, you could also (on a Linux/Unix/OSX system) write

pdb = 'data/alanine-dipeptide-nowater.pdb'

Likewise,

files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')

would be equivalent to

files = [
    'data/alanine-dipeptide-0-250ns-nowater.xtc',
    'data/alanine-dipeptide-1-250ns-nowater.xtc',
    'data/alanine-dipeptide-2-250ns-nowater.xtc']

And that is exactly the kind of information you need to pass to pyemma's loading functions: the relative or absolute paths of your files as strings.

Once you have the location of your PDB file stored in the variable pdb and the location of one or more trajectories in the variable files, you can create a featurizer

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False) # load only backbone torsions

and load the selected molecular features into memory

data = pyemma.coordinates.load(files, features=feat)

or create a reader object (recommended for huge data sets)

reader = pyemma.coordinates.source(files, features=feat)
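Putting these pieces together, here is a minimal usage sketch; it assumes the standard pyemma reader interface, so double-check the method names against your installed version:

import mdshare
import pyemma

# fetch topology and trajectories (downloaded into ./data if not yet present)
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')

# featurizer that extracts only the backbone torsions
feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)

# lazy reader: nothing is loaded into memory yet
reader = pyemma.coordinates.source(files, features=feat)
print(reader.dimension())           # number of features
print(reader.trajectory_lengths())  # frames per trajectory

# materialize everything (only do this if it fits into memory)
data = reader.get_output()          # list of (n_frames, n_features) arrays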

@naveenmeena584 (Author) commented Aug 30, 2018

It's working now, but I got another error. I'm following this tutorial: https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/01-data-io-and-featurization.ipynb, and at this step

data_concatenated = np.concatenate(data)
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);

I get the following error:

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>()
      1 data_concatenated = np.concatenate(data)
----> 2 pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);

~/miniconda3/envs/lib/python3.6/site-packages/pyemma/plots/plots1d.py in plot_feature_histograms(xyzall, feature_labels, ax, ylog, outfile, n_bins, ignore_dim_warning, **kwargs)
     64         raise ValueError('Input data hast to be a numpy array. Did you concatenate your data?')
     65 
---> 66     if xyzall.shape[1] > 50 and not ignore_dim_warning:
     67         raise RuntimeError('This function is only useful for less than 50 dimensions. Turn-off this warning '
     68                            'at your own risk with ignore_dim_warning=True.')

IndexError: tuple index out of range

@cwehmeyer (Member)

Yes, that exception is raised if you want to plot the histograms of more than 50 features. You can either plot your features in batches, e.g., via

pyemma.plots.plot_feature_histograms(data_concatenated[:, 0:10])
pyemma.plots.plot_feature_histograms(data_concatenated[:, 10:20])
...

or use the option mentioned in the Traceback to suppress the exception:

pyemma.plots.plot_feature_histograms(
    data_concatenated, feature_labels=feat, ignore_dim_warning=True)

The latter, however, will most likely result in a completely unusable figure.
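If you prefer not to write the batches out by hand, the same batching can be done in a loop; a small sketch, assuming data_concatenated is the 2D array from the tutorial:

import pyemma.plots

batch = 10
n_features = data_concatenated.shape[1]
for start in range(0, n_features, batch):
    # one figure per batch of at most `batch` feature histograms
    pyemma.plots.plot_feature_histograms(data_concatenated[:, start:start + batch])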

@thempel (Member) commented Aug 30, 2018

Actually, can you show us what data_concatenated.shape returns? I suspect that this array is not set up correctly.

@cwehmeyer (Member)

Yes, you are right @thempel, I misread the Traceback.

@naveenmeena584 (Author)

type of data: <class 'numpy.ndarray'>
lengths: 250000
shape of elements: (2,)

@naveenmeena584 (Author)

alanine-dipeptide-0-250ns-nowater.xtc and alanine-dipeptide-nowater.pdb

@thempel (Member) commented Aug 30, 2018

Thanks, but unfortunately we are still having trouble following you. Could you please provide the code that you are trying to run? A minimal example would be great so we can reproduce the issue.

If you have only a single trajectory, you should probably not concatenate the data; in that case, try using the original data instead of the concatenated data. Concatenation only makes sense if you have multiple trajectories that need to be stacked, e.g., for histogram plotting (see the sketch below).
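To illustrate the difference (my own sketch, not part of the tutorial): np.concatenate expects a list of trajectories, and feeding it a single 2D array instead reproduces exactly the IndexError above:

import numpy as np

# two trajectories, each with 5 frames of 2 features
data = [np.random.rand(5, 2), np.random.rand(5, 2)]
print(np.concatenate(data).shape)    # (10, 2) -- what the plotting function expects

# a single 2D array is itself iterated over, so its rows (1D arrays)
# are concatenated into one long 1D array instead:
single = np.random.rand(5, 2)
print(np.concatenate(single).shape)  # (10,) -- xyzall.shape[1] then raises IndexError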

@naveenmeena584 (Author)

Actually, I want to analyze my own simulation files with this Jupyter notebook: https://github.com/markovmodel/deeptime/blob/master/vampnet/examples/Alanine_dipeptide_multiple_files.ipynb
I am confused by this part:

# Download alanine coordinates and dihedral angles data
mdshare.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
mdshare.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz')
alanine_files = np.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')

How did you get these two files for heavy-atom positions and backbone dihedrals in .npz format? Is it necessary to use 3 files? Currently I have one .xtc and one .pdb file, so how can I get an .npz file and use this code?

@cwehmeyer (Member)

OK, a few points on this:

mdshare.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
mdshare.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz')

mdshare.load() is deprecated and will not work if you have the latest version of mdshare. Please use mdshare.fetch() instead.

> How did you get these two files for heavy-atom positions and backbone dihedrals in .npz format? Is it necessary to use 3 files? Currently I have one .xtc and one .pdb file, so how can I get an .npz file and use this code?

The functions pyemma.coordinates.featurizer() and pyemma.coordinates.load() are used to extract molecular features (e.g., backbone dihedrals or heavy atom positions) from files which are stored in one of the usual molecular dynamics formats (e.g., .xtc or .dcd).

In the vampnet example you mentioned, we are using precomputed molecular features. In detail, we have run the code

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)
data = pyemma.coordinates.load(files, features=feat)
np.savez('alanine-dipeptide-3x250ns-backbone-dihedrals.npz', *data)

to extract the backbone dihedrals from the three .xtc files and saved the resulting three numpy.ndarrays in the file alanine-dipeptide-3x250ns-backbone-dihedrals.npz.

Now, if we want to run a vampnet calculation using backbone dihedrals, we can load this precomputed data via

with np.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz') as fh:
    data = [fh['arr_0'], fh['arr_1'], fh['arr_2']]

Unfortunately, pyemma cannot directly read .npz or .npy files and, thus, we use numpy to load the data into memory; this is explained in https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/01-data-io-and-featurization.ipynb, Case 1: preprocessed data (toy model).
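If the number of trajectories in the archive is not fixed at three, all arrays can also be loaded generically; a small sketch, assuming numpy's default arr_0, arr_1, ... keys:

import numpy as np

with np.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz') as fh:
    # sort by the integer suffix so that arr_10 does not come before arr_2
    keys = sorted(fh.files, key=lambda k: int(k.split('_')[1]))
    data = [fh[key] for key in keys]
print(len(data), [d.shape for d in data])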

@naveenmeena584 (Author)

I got that, but you have not cleared up another doubt: you used 3 .xtc files, and because of that there are 3 .npy files to use in this code:

# Save the files separately
np.save('traj0.npy', alanine_files['arr_0'])
np.save('traj1.npy', alanine_files['arr_1'])
np.save('traj2.npy', alanine_files['arr_2'])

# Separate data files between training data and validation data
train_data_files_list = [
    'traj0.npy',
    'traj1.npy',
]

valid_data_files_list = [
    'traj2.npy',
]

My doubt is: if I have only one .npy file, how do I define the training and validation data? And if I have more than 3 .npy files, is it necessary to use exactly 3?
Another doubt: can I use a .gro file as the topology file instead of a .pdb file?

@thempel (Member) commented Aug 30, 2018

Yes, you can use .gro files as topology files.

The number of files is arbitrary; you can structure the data as you like. The crucial part is that you subsample your data such that there is no overlap between training and validation data. In the above case, we had 3 independent trajectories and chose the first two for training and the third for validation. If you have multiple trajectories, you can take an arbitrary subset for training and the remainder for validation (sketched below). If you have only a single trajectory, you need to subsample this trajectory into blocks.

Generally, this split does not require the data to be in different files. More information on this kind of splitting can be found in introductions to cross-validation; it should also be explained in the PyEMMA tutorials that you already mentioned (notebooks 00 and 01). If you have further issues with VAMPNets in particular, please consider opening an issue in the deeptime repository.
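For instance, picking a random subset of files for training and keeping the rest for validation could look like this (a sketch with hypothetical file names, not code from the notebook):

import numpy as np

files = ['traj0.npy', 'traj1.npy', 'traj2.npy', 'traj3.npy']  # hypothetical file names
perm = np.random.RandomState(42).permutation(len(files))
n_train = int(0.8 * len(files))  # e.g. an 80/20 split
train_data_files_list = [files[i] for i in perm[:n_train]]
valid_data_files_list = [files[i] for i in perm[n_train:]]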

@naveenmeena584 (Author)

Thanks a lot, I understand everything now; I only have one last doubt. As you mentioned above, "If you have only a single trajectory, you need to subsample this trajectory into blocks." How do I do this? Do you have an example where you did this for a single trajectory?

@cwehmeyer (Member)

Let us assume your single trajectory is loaded into the variable data. Then, running

n = len(data) // 2          # index of the midpoint frame
data_train = data[:n]       # first half for training
data_validation = data[n:]  # second half for validation

would split your trajectory into two roughly equal-sized, non-overlapping parts. This is a crude but simple example.

If you want a more elaborate example, please consider working through this block subsampling function from deeptime's time-lagged autoencoder project: https://github.com/markovmodel/deeptime/blob/f2b97328baa1c38c92616f058195fa5803ff05d9/time-lagged-autoencoder/tae/utils.py#L190-L211
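For orientation, here is a minimal block-subsampling sketch of my own (much simpler than the linked utility); it assumes the single trajectory is a 2D numpy array named data:

import numpy as np

def block_split(traj, n_blocks=10, f_train=0.8, seed=42):
    # cut the trajectory into consecutive blocks and assign whole
    # blocks to training or validation, so the two sets never overlap
    blocks = np.array_split(traj, n_blocks)
    idx = np.random.RandomState(seed).permutation(n_blocks)
    n_train = int(f_train * n_blocks)
    train = np.concatenate([blocks[i] for i in sorted(idx[:n_train])])
    valid = np.concatenate([blocks[i] for i in sorted(idx[n_train:])])
    return train, valid

data_train, data_validation = block_split(data)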

thempel closed this as completed Sep 3, 2018
naveenmeena584 changed the title from "how to load my file to calculate dihedral and heavy atom positions" to "how to ?" Sep 6, 2018
@naveenmeena584 (Author) commented Apr 17, 2019 via email

@thempel (Member) commented Apr 17, 2019

Thank you very much for posting your tensorflow problem to the pyemma issue tracker, and for putting so much effort into formatting it. I assume you can resolve this by passing integers instead of floats to range().
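For reference, a one-line illustration of that hint:

n = 101
range(n // 2)  # fine: floor division yields an int
range(n / 2)   # raises TypeError: 'float' object cannot be interpreted as an integer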

@Wencesgiovanni

What is the difference between a working directory and a path?

I am at a loss.

I am trying to fetch the file pH10-amber-R1-dry.xtc via the command

files = fetch('pH10-amber-R1-dry.xtc', working_directory='C:/Users/giova/data/')

but I keep on obtaining the following error message:

pH10-amber-R1-dry.xtc [no match in repository]

I assure you that the file pH10-amber-R1-dry.xtc does exist in that data directory. Why do I get this message?

Thank you very much for your attentive reply!
