[coor/save_traj]: building of an index file? #788

gph82 · 2016-04-28T16:34:33Z

Hi, I don't know where this was introduced, but this was unknown to me:
save_traj relies upon indexing xtcs before carrying out its frame-of-file extraction.

If the index has not been constructed before, this is definitely NOT the moment to do it, imho. Notice that I have to wait 20h+ to extract one single frame.

I think that at least the possibility should be offered to skip reader construction (

PyEMMA/pyemma/coordinates/data/util/frames_from_file.py

Line 35 in b5d9d6a

reader = source(files, top=top)

) and use the old implementation in cases where the trade-off is obviously worth it.

This is a REAL blocker for me now...

marscher · 2016-04-28T16:41:20Z

We have to find a suitable solution soon. In the meantime try reverting this change on your local branch: git revert 9363956

franknoe · 2016-04-28T16:43:29Z

I agree. We should build the index lazily (i.e. just remember what you
know), but not run into the danger of wasting time by indexing
everything that is being opened.

Is this in a release or just devel?

Am 28/04/16 um 18:34 schrieb Guillermo Pérez-Hernández:

Hi, I don't know where this was introduced, but this was unknown to me:
save_traj relies upon indexing xtcs before carrying out its
frame-of-file extraction.

snap1
https://cloud.githubusercontent.com/assets/7518004/14893228/cecac358-0d6e-11e6-81aa-397c431c59ba.png

If the index has not being constructed before, this is definitely NOT
the moment to do it, imho. Notice that I have to wait 20h+ to extract
one single frame.

I think that at least the possibility should be offered to skip reader
construction
(

PyEMMA/pyemma/coordinates/data/util/frames_from_file.py

Line 35 in b5d9d6a

reader = source(files, top=top)

)
and use the old implementation in cases where the trade-off is
obviously worth it.

This is a REAL blocker for me now...

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#788

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 · 2016-04-28T16:43:53Z

Actually, I can prepare a PR that gives to option to choose between frames_from_files and frames_from_file here:

PyEMMA/pyemma/coordinates/api.py

Line 643 in b5d9d6a

gph82 · 2016-04-28T16:47:48Z

That would cure it. Some optarg of save_traj along the lines of "no_cache=False".
""""
no_cache: boolean, default is False
If set to true, pyemma will skip the indexing of trajectories before accessing them. In some cases with short :py:obj:indexes arrays and long :py:obj:traj_inp" lists, this can be significantly faster. In others, with long :py:obj:indexes but short :py:obj:traj_inp" lists, it is worth waiting to build an index before extracting the frames.
"""

franknoe · 2016-04-28T16:51:25Z

I think defaults should behave well. Having bad behavior that you can
switch off optionally is not a good strategy.

Am 28/04/16 um 18:47 schrieb Guillermo Pérez-Hernández:

That would cure it. Some optarg of save_traj along the lines of
"no_cache=False".
""""
no_cache: boolean, default is False
If set to true, pyemma will skip the indexing of trajectories before
accessing them. In some cases with short :py:obj:indexes arrays and
long :py:obj:traj_inp" lists, this can be significantly faster. In
others, with long :py:obj:indexes but short :py:obj:traj_inp" lists,
it is worth waiting to build an index before extracting the frames.
"""

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 · 2016-04-28T16:54:53Z

Agree, but I the thing is we don't know a priory what's "good" or "bad". In my case is disastrous, but in other it may not. I don't have a strong opinion here...it's really hard to know a priori, i think

franknoe · 2016-04-28T16:56:30Z

I think it's clear. We are working with mass data, so no proactive
indexing of everything in addition to what is being read anyway. We
can't predict how the user wants to use the data.

Am 28/04/16 um 18:54 schrieb Guillermo Pérez-Hernández:

Agree, but I the thing is we don't know a priory what's "good" or
"bad". In my case is disastrous, but in other it may not. I don't have
a strong opinion here...it's really hard to know a priori, i think

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 · 2016-04-28T16:57:58Z

Got it!

What about sth like this?

    build_index : boolean, default is False
        If set to true, save_trajs will build a frame-index of the trajectory files before accessing them to extract the
        frames specified in :py:obj:`indexes`. For cases with long :py:obj:`indexes` but short :py:obj:`traj_inp` lists,
        it may be worth setting :py:obj:`build_index` to True and allow before extracting the frames 
        For cases with short :py:obj:`indexes` arrays and long :py:obj:`traj_inp` lists, the default behaviour is 
        significantly faster.

franknoe · 2016-04-28T16:59:30Z

This issue really has nothing to do with this particular function. It is
just triggered in this case by calling the function, but indexing should
not be controlled at this level.

Am 28/04/16 um 18:57 schrieb Guillermo Pérez-Hernández:

Got it!

What about sth like this?

|build_index : boolean, default is False If set to true, save_trajs
will build a frame-index of trajectory the files before accessing them
to extract the frames specified in :py:obj:indexes. For cases with
long :py:obj:indexes but short :py:obj:traj_inp lists, it may be
worth setting :py:obj:build_index to True and allow before
extracting the frames For cases with short :py:obj:indexes arrays
and long :py:obj:traj_inp lists, the default behaviour is
significantly faster. |

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 · 2016-04-28T17:01:53Z

Well, then that is much bigger issue, I guess you have to ping @clonker and talk design.

I wrote the original save_traj to be able to work with lists of files in case the user has not built (or doesn't want to build) readers...but that got changed along the way

gph82 · 2016-04-28T17:13:21Z

@marscher, I am not getting the revert to work...am I missing something?

gph82@southsea:  ~/programs/PyEmma$ git reset --hard origin/devel
HEAD is now at 3a79a22 latest state
gph82@southsea:  ~/programs/PyEmma$ git revert 9363956
error: could not revert 9363956... [coor.save_traj|FragmentedTrajReader] rewrote save_traj to use random access iterators.
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'

gph82 · 2016-04-28T17:17:16Z

plus, my proposed solution (#788 (comment)) would not be of any use, actually...

marscher · 2016-04-29T11:26:22Z

What if we only index the files from which actually frames are going to be extracted? So we just filter the given indices to kick out files, which are not desired. This would then only trigger a pre-indexing of the desired files, not the whole list.

franknoe · 2016-04-29T11:29:50Z

I suppose that makes sense. For xtc you'd anyway have to index the file
in order to extract frames - you could only save work on indexing in the
situation that only frames in the beginning of the file are requested.

Can you try it out and @gph82 checks how this behaves on his data?

Am 29/04/16 um 13:26 schrieb Martin K. Scherer:

What if we only index the files from which actually frames are going
to be extracted?

So we just filter the given indices to kick out files, which are not
desired.
This would then only trigger a pre-indexing of the desired files, not
the whole list.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 · 2016-04-29T11:41:08Z

Yeah I'll try it out.

Still think that it might be a bit overkill. I don't think there's a
general solution. Then one will be "sold" by whatever is in indexes...

I can think of scenarios where one would still choose not to build an
index and just stream through the needed files

On 04/29/2016 01:29 PM, Frank Noe wrote:

I suppose that makes sense. For xtc you'd anyway have to index the file
in order to extract frames - you could only save work on indexing in the
situation that only frames in the beginning of the file are requested.

Can you try it out and @gph82 checks how this behaves on his data?

Am 29/04/16 um 13:26 schrieb Martin K. Scherer:

What if we only index the files from which actually frames are going
to be extracted?

So we just filter the given indices to kick out files, which are not
desired.
This would then only trigger a pre-indexing of the desired files, not
the whole list.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#788 (comment)

Dr. Guillermo Pérez-Hernández
Freie Universität Berlin
Institute for Mathematics
Arnimallee 6
D-14195 Berlin
tel 0049 30 838 75775

http://userpage.fu-berlin.de/gph82/

franknoe · 2016-04-29T11:45:34Z

If files have been indexed, there is no reason not to use the
information. One just shouldn't waste time on indexing trajectories that
aren't used.

Am 29/04/16 um 13:41 schrieb Guillermo Pérez-Hernández:

Yeah I'll try it out.

Still think that it might be a bit overkill. I don't think there's a
general solution. Then one will be "sold" by whatever is in indexes...

I can think of scenarios where one would still choose not to build an
index and just stream through the needed files

On 04/29/2016 01:29 PM, Frank Noe wrote:

I suppose that makes sense. For xtc you'd anyway have to index the file
in order to extract frames - you could only save work on indexing in the
situation that only frames in the beginning of the file are requested.

Can you try it out and @gph82 checks how this behaves on his data?

Am 29/04/16 um 13:26 schrieb Martin K. Scherer:

What if we only index the files from which actually frames are going
to be extracted?

So we just filter the given indices to kick out files, which are not
desired.
This would then only trigger a pre-indexing of the desired files, not
the whole list.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub

#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

#788 (comment)

Dr. Guillermo Pérez-Hernández
Freie Universität Berlin
Institute for Mathematics
Arnimallee 6
D-14195 Berlin
tel 0049 30 838 75775

http://userpage.fu-berlin.de/gph82/

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

franknoe · 2016-05-09T08:59:15Z

What's the situation with this issue?

marscher · 2016-05-09T09:51:16Z

I'm resolving the last problem on my bug fixing branch.

franknoe · 2016-05-09T09:53:16Z

ok!

Am 09/05/16 um 11:51 schrieb Martin K. Scherer:

I'm resolving the last problem on my bug fixing branch.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

marscher · 2016-05-12T16:57:02Z

Still some hassle to solve with the FragmentedReader. It seems it collects the frames in wrong order. I've implemented a special case for the FeatureReader and also for FragmentedReader to return mdtraj.Trajectory objects (to have unit cell information in place).

I think @clonker and me can sort this out tomorrow.

franknoe · 2016-05-12T18:09:02Z

ok, thanks

Am 12/05/16 um 18:57 schrieb Martin K. Scherer:

Still some hassle to solve with the FragmentedReader. It seems it
collects the frames in wrong order. I've implemented a special case
for the FeatureReader and also for FragmentedReader to return
mdtraj.Trajectory objects (to have unit cell information in place).

I think @clonker https://github.com/clonker and me can sort this out
tomorrow.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)

Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

If a reader has to be constructed within coordinates.save_traj(s), we only pass the file names actually needed to extract frames from to the reader. This fixes #788 (index unneeded files). Involved changes: * [featurereader] added option to return plain mdtraj.Trajectory objects * [fragmentedreader] added reader_by_filename property * [coor/api] handle chunk_size argument correctly (0 evaluates to False). * default chunksize 1000 * [patches] added setitem method for Trajectory objs * [reader_utils] def chunksize=1000, added setitem method for empty_traj created objs. * [fragreader] use setitem method for traj objects collected by underlying featurereaders * [datasource] handle too large RA indices for stride * added create_traj in test.util.

If a reader has to be constructed within coordinates.save_traj(s), we only pass the file names actually needed to extract frames from to the reader. This fixes markovmodel#788 (index unneeded files). Involved changes: * [featurereader] added option to return plain mdtraj.Trajectory objects * [fragmentedreader] added reader_by_filename property * [coor/api] handle chunk_size argument correctly (0 evaluates to False). * default chunksize 1000 * [patches] added setitem method for Trajectory objs * [reader_utils] def chunksize=1000, added setitem method for empty_traj created objs. * [fragreader] use setitem method for traj objects collected by underlying featurereaders * [datasource] handle too large RA indices for stride * added create_traj in test.util.

marscher mentioned this issue May 14, 2016

Save traj file handling #803

Merged

marscher closed this as completed in #803 May 16, 2016

marscher added a commit that referenced this issue May 16, 2016

[doc] changelog fix formatting and added note about #788.

67f0298

marscher added a commit to marscher/PyEMMA that referenced this issue May 17, 2016

[doc] changelog fix formatting and added note about markovmodel#788.

5911801

gph82 mentioned this issue Jul 7, 2016

Sqlite db backend for TrajInfo (len, dim, offsets) #798

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[coor/save_traj]: building of an index file? #788

[coor/save_traj]: building of an index file? #788

gph82 commented Apr 28, 2016 •

edited

Loading

marscher commented Apr 28, 2016 via email

franknoe commented Apr 28, 2016

gph82 commented Apr 28, 2016

gph82 commented Apr 28, 2016

franknoe commented Apr 28, 2016

gph82 commented Apr 28, 2016

franknoe commented Apr 28, 2016

gph82 commented Apr 28, 2016 •

edited

Loading

franknoe commented Apr 28, 2016

gph82 commented Apr 28, 2016 •

edited

Loading

gph82 commented Apr 28, 2016

gph82 commented Apr 28, 2016

marscher commented Apr 29, 2016 via email

franknoe commented Apr 29, 2016

gph82 commented Apr 29, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

franknoe commented Apr 29, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

http://userpage.fu-berlin.de/gph82/

franknoe commented May 9, 2016

marscher commented May 9, 2016 via email

franknoe commented May 9, 2016

marscher commented May 12, 2016

franknoe commented May 12, 2016

[coor/save_traj]: building of an index file? #788

[coor/save_traj]: building of an index file? #788

Comments

gph82 commented Apr 28, 2016 • edited Loading

marscher commented Apr 28, 2016 via email

franknoe commented Apr 28, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented Apr 28, 2016

gph82 commented Apr 28, 2016

franknoe commented Apr 28, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented Apr 28, 2016

franknoe commented Apr 28, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented Apr 28, 2016 • edited Loading

franknoe commented Apr 28, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented Apr 28, 2016 • edited Loading

gph82 commented Apr 28, 2016

gph82 commented Apr 28, 2016

marscher commented Apr 29, 2016 via email

franknoe commented Apr 29, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented Apr 29, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

http://userpage.fu-berlin.de/gph82/

franknoe commented Apr 29, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

http://userpage.fu-berlin.de/gph82/

Mail: Arnimallee 6, 14195 Berlin, Germany

franknoe commented May 9, 2016

marscher commented May 9, 2016 via email

franknoe commented May 9, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

marscher commented May 12, 2016

franknoe commented May 12, 2016

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented Apr 28, 2016 •

edited

Loading

gph82 commented Apr 28, 2016 •

edited

Loading

gph82 commented Apr 28, 2016 •

edited

Loading