Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduced Trajectories #23

Merged
merged 22 commits into from
Mar 21, 2017
Merged

Reduced Trajectories #23

merged 22 commits into from
Mar 21, 2017

Conversation

jhprinz
Copy link
Contributor

@jhprinz jhprinz commented Mar 16, 2017

Allow trajectories to be saved by a frame-reduced full-atom trajectory + a atom-reduced full-frame trajectory.

This changes a Trajectory or its subclass to be a group of files. These should be handled like an ordinary file, but a name change will alter all files in the same way.

This needs some thought.

But finally you will have a ReducedTrajectory that is a group of 3 files.

  1. Restart
  2. frame-reduced traj
  3. atom-reduced traj

And a RestartTrajectory just

  1. Restart
  2. full trajectory

Or whatever combination you want. Copying, etc will work on the whole group and I think it makes sense, to use one filename + multiple extensions Instead of multiple unique filenames.

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 18, 2017

Well, I decided to implement this in the following way (~ 95% done). Please comment

So far, I have a strict single file approach. You have commands to express copy, move, link etc of a single file and a trajectory is considered a single file which you can copy, etc like any other file.

The restart file is treated also like a separate file and everywhere you are dealing with trajectories - in principle - you need to check if it is a trajectory with restart or not and copy the restart file as well. SInce we now want to use a their file for the reduced trajectory we always have to check if this is the case for a particular trajectory and act accordingly.

This behavior works but can lead to errors and requires a Task writer to know about the structure of a Trajectory and cover all the possible cases. Since we do not want the user to have a knowledge about the actual implementation of trajectories (unless he write a specific engine) we could propose multi-file File objects.

Multi-File

This is simply treated as a single file with a name, but can have additional files with it, that only differ by an extra extension. Like

00000000.dcd
00000000.dcd.restart
00000000.dcd.reduced

If you copy or move such a multi file, you will immediately copy and move all the files.

The MultiFile object has meta information and knows the basename and a list of extensions. THis also saves space in saving multiple related names. You can access the other locations using an attribute

file = Trajectory(...)
print file.restart  # `Location()`
print file.reduced  # `Location()`

It seems in general useful to treat some file groups as one, so I would add this to the File class.

How to treat trajectories

The main question to be answered it, whether we hard-code the reduced capabilities into the Trajectory object or a sub-class), or do we let the user deal with it.

1. Question

Assume you have some simple trajectories and some that use reduced (have a time-stride full trajectory and a full-time atom-reduced trajectory). Which one do you pass to the PyEMMAAnalysis?

Or do you pass the multi-file object and the pyemma task figures out what to do. At least, the generator knows all the trajectory files that are passed to it and it could pass the appropriate files like reduced topology, etc...

2. Question

When you call

traj = Trajectory()
frame = traj[20]

what is frame. Does it relate to the native time steps or does it relate to the steps in the time-stride full trajectory. And if the first, what do you get or do, when you use a frame from which you cannot start because this one is only in the atom-reduced trajectory.

3. Question

If we say that a ReducedTrajectory object exists, is the specific reduction something that is also fixed for a particular engine or is that an option of the task generator.

# engine attribute approach (recommended)
engine = OpenMMEngineReduced(
    ..., 
    selecion='{some selection str}', 
    time_restart_stride=200,  # every 200th saved frame also store full coordinates
    save_every_n_frames=1000  # save every 1000th simulated frame)

# task property approach
t = engine.run(
    ...,
    selecion='{some selection str}', 
    time_restart_stride=200,  # every 200th saved frame also store full coordinates
    save_every_n_frames=1000  # save every 1000th simulated frame)

The engine approach makes sense, but means to create a new engine if you want to change these settings. If so, then we should store the engine used with a trajectory.

There are more things to consider to really keep all of this consistent. I will continue tomorrow and think about it.

@franknoe
Copy link
Collaborator

franknoe commented Mar 19, 2017 via email

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 20, 2017

The example below suggests that you only consider variations of dcds.

Well, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class.

But, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and adaptivemd will only move the complete folder around. The specific content is not known to the framework in general, but is specific to the engine. You also do not register each file separately in the DB but only the folder.

A (probably too complex) example

00000000/
    traj.dcd  # or blablu.xtc, ..., e.g.full, but stride 100
    restart.npz
    reduced1.xtc  # e.g. without water, but stride 10
    reduced2.xtc  # e.g. only active site, but stride 1

So the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure.

This approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible.

in the Task code this would then look like

t = Task(...)
my_traj = t.get(traj_obj, 'traj')  # this link the whole folder to the working directory as `traj/`

# then load a file by concatenation
np_file = np.load(os.join(my_traj, 'restart.npz'))

# perhaps some helper method
np_file = np.load(my_traj.path_to('restart.npz'))

That's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist.

Another thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside.

@franknoe
Copy link
Collaborator

franknoe commented Mar 20, 2017 via email

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 20, 2017

You mean inside the directory? That would be possible but doubling data in the DB.
There is an entry in the DB, which depends on the class you use and you can always subclass a Trajectory and add more metadata, e.g. The current Trajectory supports restart already. I would leave that, but turn the restart into a bool. Marking if there is a restart file (but noone know the name other then the engine that wrote it).

If you want the Reduced Trajectory we could add these to the main class or you subclass into ReducedTrajectory and it has metadata like used stride the atom-selection string, etc. That is arbitrary and you can use all of that in your adaptive code. But it becomes less reusable because it gets engine specific eventually. So a gromacs user might need another subclass etc...

I would like to keep it as easy as possible. Maybe the following metadata by default. If you need more, say (multiple reduced trajectories, multiple strides, ...) you need to subclass and do that yourself.

class Trajectory(...):
    """
    location : `Location`
        the physical location of the file an the HPC
    initial : `Frame` or `File` or None
        the initial frame if known. Can be the frame of another file
    length : int
        the length in frames of the trajectory
    # either engine and/or task. I think task should be fine
    engine : `Engine` : None
        a reference to the engine used. The engine knows things like strides/ 
        save intervals, selections, topology, etc...
    task : `Task` or None
        the task used to generate the trajectory. Since a task knows the engine 
        you know which engine was used, too

@franknoe
Copy link
Collaborator

franknoe commented Mar 20, 2017 via email

@thempel
Copy link
Member

thempel commented Mar 20, 2017

  • Taking folders sounds perfect to me. It also reduces the trajectory folder size for huge projects with 1,000s of trajectories. In the previous framework, the trajectory files were organized like that (with the same names) and as I recall, we did not have any issues with that.

  • Concerning Question 1, as Frank already said, normally the analysis is done on the protein-only trajectory (atom-reduced in your terms) while the frames are drawn from the all-atom (time-reduced) trajectory.

  • Question 3: From my perspective, it makes sense to couple the different time-strides to the engine in a immutable way. If the user changes these settings, everything will break down. On the other hand however, it would be good to be able to add feature trajectories for the case of an adaptive strategy change. I was thinking of some kind of dictionary, maybe something like

output_dict = [{'name' : 'all_atom_traj', 
                        'selection' : 'all',
                         'save_every_frames' : 1000,
                         'dt' : 100,
                         'time_unit' : 'ps', 
                          'master': True, ...}, 
                          {'name' : 'protein_traj',
                            'selection' : 'protein', ...},
                          {'intramolecular_distances', ...}, ...]

engine = OpenMMEngine(..., output=output_dict)

One could reference the trajectories by their name and name the files like that, making it very flexible and also self-explanatory to the user. The trajectory that has to be used for frame-picking should have something like a 'master': True to be found.
We spoke about computing features for already finished trajectories, which in this formulation would mean something like engine.add_output_type({some dictionary...}). Probably it would make sense to embed this better in the project-object... Is that too general?

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 21, 2017

Question 3: From my perspective, it makes sense to couple the different time-strides to the engine in a immutable way. If the user changes these settings, everything will break down. On the other hand however, it would be good to be able to add feature trajectories for the case of an adaptive strategy change. I was thinking of some kind of dictionary, maybe something like

I like that idea. What about this:

engine = OpenMMEngine(..., )
engine.dt  # get the integrator stepping
>>> 2 fs
engine.add_output_type(
    name='master',
    file='output.dcd'  # default would be name + '.dcd'
    save_every=1000  # or use str `10ps`
)
engine.add_output_type(
    name='protein',
    save_every=100,  # or use str `1ps`
    selection='protein'
)
engine.add_output_type(
    name='intramoldist',
    save_every=100,  # or use `1ps
    pyemma='feature description'  # I guess that is overkill for now
)

Then the engine would just now what to do and what the structure of the folder is

traj/
  output.dcd  # should have a simple default setting option
  protein.dcd
  intramoldist.npy
  restart.npz  # always there

If you have a trajectory generated by that engine you can (already) do:

t = Trajectory(...)
t.file('output.dcd')  # get a location of the file

# simpler would be
t.output('master')

# and to get all files of a certain type run
project.trajectories.all.output('protein')  # this calls the output() on all trajectories

@thempel
Copy link
Member

thempel commented Mar 21, 2017

Looks good!

Now, how would you access the trajectory-related properties, or how is that stored? The user would probably need to extract save_every etc. for all or a given output type. Maybe something like the mentioned dictionary as a property, like engine.output_types?

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 21, 2017

We spoke about computing features for already finished trajectories, which in this formulation would mean something like engine.add_output_type({some dictionary...}). Probably it would make sense to embed this better in the project-object... Is that too general?

A few things to consider when you change (or only add) output types along the way

  1. To not violate immutability you need to create a new engine, but that would be possible and simple. Still you now have 2 engines and need to get the right one.

  2. What happens, when you want to extend a trajectory? All of the previous trajectories do not have the new output computed. So you would have either run all trajectories and compute the new output or forbid to do that with existing trajectories.

  3. As you said, you could have a project function that can enhance an existing engine and make sure that all is prepared after adding a new output.

All in all not trivial...

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 21, 2017

Now, how would you access the trajectory-related properties, or how is that stored? The user would probably need to extract save_every etc. for all or a given output type. Maybe something like the mentioned dictionary as a property, like engine.output_types?

Hmm, I would create a new OutputTypeDescription class which basically is like a dict, but you can access the attributes with a dot.

From the engine, maybe something like

engine = OpenMM(...)
engine.add_output_type(...)
# as short for
engine.outputs[name] = OutputTypeDecriptor(...)
and access it in the same way
# option 1
engine.outputs['protein'].save_every

# option 2 - access via attribute if the name is python compatible
engine.protein.save_every

# option 3 - rename the file access with __getitem__
engine['protein'].save_every
# and access the files (which is really no problem)
engine.files['my_file']
# this option could also be a shortcut for option 1

using the descriptor you can use it also as an input to other generators (pyemma) without storing it again.

modeller = PyEMMA(traj_descriptor=engine['protein'])

I think that looks pretty nice

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 21, 2017

The Descriptors can even be independent of the Engine and be used (to some extend) for other engines, too. At least options like the filename or saving interval work everywhere. Selection could be difficult.

@thempel
Copy link
Member

thempel commented Mar 21, 2017

New Trajectory Output Types
In practice, when deciding on a new feature for the analysis, the MSM needs to be build for all trajectories. Not using the trajectories that have been computed before would generally not be a good idea. But I see the problem that it is not necessarily known outside the actual MD script, how these features are actually computed. Hmm. The simplest (and probably most messy...) approach that comes to my mind would be something like this (related to your bullet point 3) with a string to be executed.

trajs = [trj['protein'] for trj in project.trajectories if trj.engine.name == 'old engine']
command = 'import mdtraj\n mdtraj.load(INPUT).save_xtc(OUTPUT)'  # INPUT and OUTPUT being called internally to be compatible to the DB
task = complete_existing_trajectories(trajs, command)
project.queue(task)

I'm not sure it this is a good idea, though. Since the command cannot be extracted from the engine, it is impossible to check whether anything is congruent at all. Maybe it's better if the user does this manually and there is only a function to add the files, that were generated, to the DB.

Output Type Descriptors

Hmm, I would create a new OutputTypeDescription class which basically is like a dict, but you can access the attributes with a dot.

I like that approach, looks nice and fits with the rest.

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 21, 2017

New Trajectory Output Types

Correct, this will usually be part of a strategy. Just running Pyemma with new features is trivial, of course. Create a new PyEmma generator and use this one instead. If you want to add a new reduced trajectory or a feature trajectory that is automatically used is another matter.

Not using the trajectories that have been computed before would generally not be a good idea.

of course, we always use what we have.

So, when you want to add new output types whcih produce .dcd (or whatever you are using) or feature trajcectories with numpy/pyemma you will probably have to proceed like this:

  1. Create a new enhanced engine (NEW), that has the old and new outputs and keeps the rest.

  2. So far all existing trajctories are compatible with the old engine (OLD), so if you have a traj of length 100, then all the outputs ran with the old engine conform to that length (depending on individual strides, of course)

  3. Now you can run an traj update, that will take an existing trajectory and update it to the new engine like

new_engine.update(trajectory)

This will create a new trajectory with the same name, but the updated engine and will update all outputs to match the new engine. Limitation is that you can only add outputs, not remove some.
That would lead to inconsistencies.

  1. if you want, run the update on all trajectories or whichever you want. This will update your project

  2. If you want to extend a trajectory then you have to first update.

Looks not too complicated and concise in that a trajectory and its engine fit together.

@thempel
Copy link
Member

thempel commented Mar 21, 2017

This looks much better than what I posted above, but I don't understand how this can internally be handled. How does engine know how to generate the output, e.g. if this is normally done by an openMM reporter?

@jhprinz
Copy link
Contributor Author

jhprinz commented Mar 21, 2017

How does engine know how to generate the output, e.g. if this is normally done by an openMM reporter?

Well, the engine needs to pass (somehow) the output types to openmmrun.py This is the most annoying part. Not difficult but requires some work.

As an example you need to call openmmrun.py with

--outputtypes="{'master: {'file': 'output.dcd', 'stride': 10, 'selection': 'protein'},  ... }"

I think just JSON would be good. Maybe replace " with ' to make it look like python and still have " for the bash. This is the easiest.

@jhprinz jhprinz merged commit ea87ad8 into markovmodel:master Mar 21, 2017
@jhprinz jhprinz mentioned this pull request Mar 22, 2017
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants