-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduced Trajectories #23
Conversation
Well, I decided to implement this in the following way (~ 95% done). Please comment So far, I have a strict single file approach. You have commands to express copy, move, link etc of a single file and a trajectory is considered a single file which you can copy, etc like any other file. The restart file is treated also like a separate file and everywhere you are dealing with trajectories - in principle - you need to check if it is a trajectory with restart or not and copy the restart file as well. SInce we now want to use a their file for the reduced trajectory we always have to check if this is the case for a particular trajectory and act accordingly. This behavior works but can lead to errors and requires a Multi-FileThis is simply treated as a single file with a name, but can have additional files with it, that only differ by an extra extension. Like
If you copy or move such a multi file, you will immediately copy and move all the files. The file = Trajectory(...)
print file.restart # `Location()`
print file.reduced # `Location()` It seems in general useful to treat some file groups as one, so I would add this to the How to treat trajectoriesThe main question to be answered it, whether we hard-code the reduced capabilities into the 1. QuestionAssume you have some simple trajectories and some that use reduced (have a time-stride full trajectory and a full-time atom-reduced trajectory). Which one do you pass to the Or do you pass the multi-file object and the pyemma task figures out what to do. At least, the generator knows all the trajectory files that are passed to it and it could pass the appropriate files like reduced topology, etc... 2. QuestionWhen you call traj = Trajectory()
frame = traj[20] what is frame. Does it relate to the native time steps or does it relate to the steps in the time-stride full trajectory. And if the first, what do you get or do, when you use a frame from which you cannot start because this one is only in the atom-reduced trajectory. 3. QuestionIf we say that a # engine attribute approach (recommended)
engine = OpenMMEngineReduced(
...,
selecion='{some selection str}',
time_restart_stride=200, # every 200th saved frame also store full coordinates
save_every_n_frames=1000 # save every 1000th simulated frame)
# task property approach
t = engine.run(
...,
selecion='{some selection str}',
time_restart_stride=200, # every 200th saved frame also store full coordinates
save_every_n_frames=1000 # save every 1000th simulated frame) The engine approach makes sense, but means to create a new engine if you want to change these settings. If so, then we should store the engine used with a trajectory. There are more things to consider to really keep all of this consistent. I will continue tomorrow and think about it. |
Nuria and Tim can better comment on the practicalities than me, but I
have a few general notes:
- It's useful to consider "bundles" of files generated by a simulation
task that belong together. But these could be files of different types.
The example below suggests that you only consider variations of dcds.
How about organizing the outputs of a simulation in a results folder
that can contain multiple files with arbitrary types / file extensions?
- The analysis task should at least *be able* to get a full list of all
files from a run, because we don't know yet what type of analysis we
wanna do in the future. The most typical analysis is to process the
full-frame trajectories without water, but we definitely shouldn't limit
ourselves to only being able to do that.
- I think indexes used literally as below (traj[20]) should always just
refer to the 20th element of that list, so it depends on traj what that
means in terms of time. However, it would be very useful to have the
times available as metadata. For example, if you have generated a
simulation file with 1000 Frames, saving at 100 ps intervals, and a
strided version with 10 Frames saving at 10 ns intervals, it would be
great to have these data available:
traj.dt
> 100
traj.time_unit
> 'ps'
traj.times
> [0, 1, ..., 999] # this also allows to encode when a trajectory
has been continued - e.g. we could start with a different time than 0.
traj_strided.dt
> 10000
traj_strided.time_unit
> 'ps'
traj_strided.times
> [0, 100, ..., 900] # with this information we can map between
trajectories
Does that help answer some of the questions?
Am 18/03/17 um 23:59 schrieb Jan-Hendrik Prinz:
…
Well, I decided to implement this in the following way (~ 95% done).
Please comment
So far, I have a strict single file approach. You have commands to
express copy, move, link etc of a single file and a trajectory is
considered a single file which you can copy, etc like any other file.
The restart file is treated also like a separate file and everywhere
you are dealing with trajectories - in principle - you need to check
if it is a trajectory with restart or not and copy the restart file as
well. SInce we now want to use a their file for the reduced trajectory
we always have to check if this is the case for a particular
trajectory and act accordingly.
This behavior works but can lead to errors and requires a |Task|
writer to know about the structure of a |Trajectory| and cover all the
possible cases. Since we do not want the user to have a knowledge
about the actual implementation of trajectories (unless he write a
specific engine) we could propose /multi-file/ |File| objects.
Multi-File
This is simply treated as a single file with a name, but can have
additional files with it, that only differ by an extra extension. Like
|00000000.dcd 00000000.dcd.restart 00000000.dcd.reduced |
If you copy or move such a multi file, you will immediately copy and
move all the files.
The |MultiFile| object has meta information and knows the basename and
a list of extensions. THis also saves space in saving multiple related
names. You can access the other locations using an attribute
file = Trajectory(...)
print file.restart# `Location()`
print file.reduced# `Location()`
It seems in general useful to treat some file groups as one, so I
would add this to the |File| class.
How to treat trajectories
The main question to be answered it, whether we hard-code the reduced
capabilities into the |Trajectory| object or a sub-class), or do we
let the user deal with it.
1. Question
Assume you have some simple trajectories and some that use /reduced/
(have a time-stride full trajectory and a full-time atom-reduced
trajectory). Which one do you pass to the |PyEMMAAnalysis|?
Or do you pass the multi-file object and the pyemma task figures out
what to do. At least, the generator knows all the trajectory files
that are passed to it and it could pass the appropriate files like
reduced topology, etc...
2. Question
When you call
traj= Trajectory()
frame= traj[20]
what is frame. Does it relate to the native time steps or does it
relate to the steps in the time-stride full trajectory. And if the
first, what do you get or do, when you use a frame from which you
cannot start because this one is only in the atom-reduced trajectory.
3. Question
If we say that a |ReducedTrajectory| object exists, is the specific
reduction something that is also fixed for a particular engine or is
that an option of the task generator.
# engine attribute approach (recommended)
engine= OpenMMEngineReduced(
...,
selecion='{some selection str}',
time_restart_stride=200,# every 200th saved frame also store full coordinates
save_every_n_frames=1000 # save every 1000th simulated frame)
# task property approach
t = engine.run(
...,
selecion='{some selection str}',
time_restart_stride=200,# every 200th saved frame also store full coordinates
save_every_n_frames=1000 # save every 1000th simulated frame)
The engine approach makes sense, but means to create a new engine if
you want to change these settings. If so, then we should store the
engine used with a trajectory.
There are more things to consider to really keep all of this
consistent. I will continue tomorrow and think about it.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#23 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGMeQgsfGYB8Bp0FfSqaCQNfBvPsVCgnks5rnGHhgaJpZM4Mfk3N>.
--
----------------------------------------------
Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354
Web: research.franknoe.de
Mail: Arnimallee 6, 14195 Berlin, Germany
----------------------------------------------
|
Well, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class. But, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and A (probably too complex) example
So the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure. This approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible. in the Task code this would then look like
That's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist. Another thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside. |
That looks good to me. What about storing metadata such as timepoints?
Sent from my T-Mobile 4G LTE Device
…-------- Original message --------
From: Jan-Hendrik Prinz <notifications@github.com>
Date: 3/20/17 3:22 AM (GMT-06:00)
To: markovmodel/adaptivemd <adaptivemd@noreply.github.com>
Cc: Frank Noe <frank.noe@fu-berlin.de>, Comment <comment@noreply.github.com>
Subject: Re: [markovmodel/adaptivemd] Reduced Trajectories (#23)
The example below suggests that you only consider variations of dcds.
Well, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class.
But, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and adaptivemd will only move the complete folder around. The specific content is not known to the framework in general, but is specific to the engine. You also do not register each file separately in the DB but only the folder.
A (probably too complex) example
00000000/
traj.dcd # or blablu.xtc, ..., e.g.full, but stride 100
restart.npz
reduced1.xtc # e.g. without water, but stride 10
reduced2.xtc # e.g. only active site, but stride 1
So the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure.
This approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible.
in the Task code this would then look like
t = Task(...)
my_traj = t.get(traj_obj, 'traj') # this link the whole folder to the working directory as `traj/`
# then load a file by concatenation
np_file = np.load(os.join(my_traj, 'restart.npz'))
# perhaps some helper method
np_file = np.load(my_traj.path_to('restart.npz'))
That's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist.
Another thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/markovmodel/adaptivemd","title":"markovmodel/adaptivemd","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/markovmodel/adaptivemd"}},"updates":{"snippets":[{"icon":"PERSON","message":"@jhprinz in #23: \u003e The example below suggests that you only consider variations of dcds.\r\n\r\nWell, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class. \r\n\r\nBut, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and `adaptivemd` will only move the complete folder around. The specific content is not known to the framework in general, but is specific to the engine. You also do not register each file separately in the DB but only the folder.\r\n\r\nA (probably too complex) example\r\n\r\n```\r\n00000000/\r\n traj.dcd # or blablu.xtc, ..., e.g.full, but stride 100\r\n restart.npz\r\n reduced1.xtc # e.g. without water, but stride 10\r\n reduced2.xtc # e.g. only active site, but stride 1\r\n```\r\n\r\nSo the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure.\r\n\r\nThis approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible. \r\n\r\nin the Task code this would then look like\r\n```\r\nt = Task(...)\r\nmy_traj = t.get(traj_obj, 'traj') # this link the whole folder to the working directory as `traj/`\r\n\r\n# then load a file by concatenation\r\nnp_file = np.load(os.join(my_traj, 'restart.npz'))\r\n\r\n# perhaps some helper method\r\nnp_file = np.load(my_traj.path_to('restart.npz'))\r\n```\r\n\r\nThat's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist.\r\n\r\nAnother thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside."}],"action":{"name":"View Pull Request","url":"#23 (comment)"}}}
|
You mean inside the directory? That would be possible but doubling data in the DB. If you want the Reduced Trajectory we could add these to the main class or you subclass into I would like to keep it as easy as possible. Maybe the following metadata by default. If you need more, say (multiple reduced trajectories, multiple strides, ...) you need to subclass and do that yourself. class Trajectory(...):
"""
location : `Location`
the physical location of the file an the HPC
initial : `Frame` or `File` or None
the initial frame if known. Can be the frame of another file
length : int
the length in frames of the trajectory
# either engine and/or task. I think task should be fine
engine : `Engine` : None
a reference to the engine used. The engine knows things like strides/
save intervals, selections, topology, etc...
task : `Task` or None
the task used to generate the trajectory. Since a task knows the engine
you know which engine was used, too |
I think adding some basic information which times are associated with
the frames is universal and can be done with using any engine as a
backend (that doesn't necessarily mean that information is in the
trajectory file itself, which is why I think the sampling framework
would be the natural place to take care of that).
Maybe it's sufficient to set a t0 and dt parameter (starting time and
timestep)? Assuming that the saving time is regular, it would then be
possible to relate different versions of this trajectory with different
saving intervals, or to concatenate continued trajectories (which may or
may not have overlaps that need to be removed) etc.
Am 20/03/17 um 10:09 schrieb Jan-Hendrik Prinz:
…
You mean inside the directory? That would be possible but doubling
data in the DB.
There is an entry in the DB, which depends on the class you use and
you can always subclass a |Trajectory| and add more metadata, e.g. The
current |Trajectory| supports restart already. I would leave that, but
turn the |restart| into a bool. Marking if there is a restart file
(but noone know the name other then the engine that wrote it).
If you want the Reduced Trajectory we could add these to the main
class or you subclass into |ReducedTrajectory| and it has metadata
like used stride the atom-selection string, etc. That is arbitrary and
you can use all of that in your adaptive code. But it becomes less
reusable because it gets engine specific eventually. So a gromacs user
might need another subclass etc...
I would like to keep it as easy as possible. Maybe the following
metadata by default. If you need more, say (multiple reduced
trajectories, multiple strides, ...) you need to subclass and do that
yourself.
class Trajectory(...):
"""
location : `Location`
the physical location of the file an the HPC
initial : `Frame` or `File` or None
the initial frame if known. Can be the frame of another file
length : int
the length in frames of the trajectory
# either engine and/or task. I think task should be fine
engine : `Engine` : None
a reference to the engine used. The engine knows things like strides/
save intervals, selections, topology, etc...
task : `Task` or None
the task used to generate the trajectory. Since a task knows the engine
you know which engine was used, too
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#23 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGMeQr8l-TRsUt85pXxYQY9mIah1EuaAks5rnkJkgaJpZM4Mfk3N>.
--
----------------------------------------------
Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354
Web: research.franknoe.de
Mail: Arnimallee 6, 14195 Berlin, Germany
----------------------------------------------
|
One could reference the trajectories by their |
I like that idea. What about this: engine = OpenMMEngine(..., )
engine.dt # get the integrator stepping
>>> 2 fs
engine.add_output_type(
name='master',
file='output.dcd' # default would be name + '.dcd'
save_every=1000 # or use str `10ps`
)
engine.add_output_type(
name='protein',
save_every=100, # or use str `1ps`
selection='protein'
)
engine.add_output_type(
name='intramoldist',
save_every=100, # or use `1ps
pyemma='feature description' # I guess that is overkill for now
) Then the engine would just now what to do and what the structure of the folder is
If you have a trajectory generated by that engine you can (already) do:
|
Looks good! Now, how would you access the trajectory-related properties, or how is that stored? The user would probably need to extract |
A few things to consider when you change (or only add) output types along the way
All in all not trivial... |
Hmm, I would create a new OutputTypeDescription class which basically is like a dict, but you can access the attributes with a dot. From the engine, maybe something like
using the descriptor you can use it also as an input to other generators (pyemma) without storing it again.
I think that looks pretty nice |
The Descriptors can even be independent of the Engine and be used (to some extend) for other engines, too. At least options like the filename or saving interval work everywhere. Selection could be difficult. |
New Trajectory Output Types
I'm not sure it this is a good idea, though. Since the command cannot be extracted from the engine, it is impossible to check whether anything is congruent at all. Maybe it's better if the user does this manually and there is only a function to add the files, that were generated, to the DB. Output Type Descriptors
I like that approach, looks nice and fits with the rest. |
Correct, this will usually be part of a strategy. Just running Pyemma with new features is trivial, of course. Create a new PyEmma generator and use this one instead. If you want to add a new reduced trajectory or a feature trajectory that is automatically used is another matter.
of course, we always use what we have. So, when you want to add new output types whcih produce
This will create a new trajectory with the same name, but the updated engine and will update all outputs to match the new engine. Limitation is that you can only add outputs, not remove some.
Looks not too complicated and concise in that a trajectory and its engine fit together. |
This looks much better than what I posted above, but I don't understand how this can internally be handled. How does |
Well, the engine needs to pass (somehow) the output types to As an example you need to call openmmrun.py with
I think just JSON would be good. Maybe replace " with ' to make it look like python and still have " for the bash. This is the easiest. |
Allow trajectories to be saved by a frame-reduced full-atom trajectory + a atom-reduced full-frame trajectory.
This changes a
Trajectory
or its subclass to be a group of files. These should be handled like an ordinary file, but a name change will alter all files in the same way.This needs some thought.
But finally you will have a
ReducedTrajectory
that is a group of 3 files.And a
RestartTrajectory
justOr whatever combination you want. Copying, etc will work on the whole group and I think it makes sense, to use one filename + multiple extensions Instead of multiple unique filenames.