Skip to content
This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

[coor/save_traj]: building of an index file? #788

Closed
gph82 opened this issue Apr 28, 2016 · 21 comments
Closed

[coor/save_traj]: building of an index file? #788

gph82 opened this issue Apr 28, 2016 · 21 comments

Comments

@gph82
Copy link
Contributor

gph82 commented Apr 28, 2016

Hi, I don't know where this was introduced, but this was unknown to me:
save_traj relies upon indexing xtcs before carrying out its frame-of-file extraction.

snap1

If the index has not been constructed before, this is definitely NOT the moment to do it, imho. Notice that I have to wait 20h+ to extract one single frame.

I think that at least the possibility should be offered to skip reader construction (

reader = source(files, top=top)
) and use the old implementation in cases where the trade-off is obviously worth it.

This is a REAL blocker for me now...

@marscher
Copy link
Member

marscher commented Apr 28, 2016 via email

@franknoe
Copy link
Contributor

I agree. We should build the index lazily (i.e. just remember what you
know), but not run into the danger of wasting time by indexing
everything that is being opened.

Is this in a release or just devel?

Am 28/04/16 um 18:34 schrieb Guillermo Pérez-Hernández:

Hi, I don't know where this was introduced, but this was unknown to me:
save_traj relies upon indexing xtcs before carrying out its
frame-of-file extraction.

snap1
https://cloud.githubusercontent.com/assets/7518004/14893228/cecac358-0d6e-11e6-81aa-397c431c59ba.png

If the index has not being constructed before, this is definitely NOT
the moment to do it, imho. Notice that I have to wait 20h+ to extract
one single frame.

I think that at least the possibility should be offered to skip reader
construction
(

reader = source(files, top=top)
)
and use the old implementation in cases where the trade-off is
obviously worth it.

This is a REAL blocker for me now...


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#788


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

Actually, I can prepare a PR that gives to option to choose between frames_from_files and frames_from_file here:

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

That would cure it. Some optarg of save_traj along the lines of "no_cache=False".
""""
no_cache: boolean, default is False
If set to true, pyemma will skip the indexing of trajectories before accessing them. In some cases with short :py:obj:indexes arrays and long :py:obj:traj_inp" lists, this can be significantly faster. In others, with long :py:obj:indexes but short :py:obj:traj_inp" lists, it is worth waiting to build an index before extracting the frames.
"""

@franknoe
Copy link
Contributor

I think defaults should behave well. Having bad behavior that you can
switch off optionally is not a good strategy.

Am 28/04/16 um 18:47 schrieb Guillermo Pérez-Hernández:

That would cure it. Some optarg of save_traj along the lines of
"no_cache=False".
""""
no_cache: boolean, default is False
If set to true, pyemma will skip the indexing of trajectories before
accessing them. In some cases with short :py:obj:indexes arrays and
long :py:obj:traj_inp" lists, this can be significantly faster. In
others, with long :py:obj:indexes but short :py:obj:traj_inp" lists,
it is worth waiting to build an index before extracting the frames.
"""


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

Agree, but I the thing is we don't know a priory what's "good" or "bad". In my case is disastrous, but in other it may not. I don't have a strong opinion here...it's really hard to know a priori, i think

@franknoe
Copy link
Contributor

I think it's clear. We are working with mass data, so no proactive
indexing of everything in addition to what is being read anyway. We
can't predict how the user wants to use the data.

Am 28/04/16 um 18:54 schrieb Guillermo Pérez-Hernández:

Agree, but I the thing is we don't know a priory what's "good" or
"bad". In my case is disastrous, but in other it may not. I don't have
a strong opinion here...it's really hard to know a priori, i think


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

Got it!

What about sth like this?

    build_index : boolean, default is False
        If set to true, save_trajs will build a frame-index of the trajectory files before accessing them to extract the
        frames specified in :py:obj:`indexes`. For cases with long :py:obj:`indexes` but short :py:obj:`traj_inp` lists,
        it may be worth setting :py:obj:`build_index` to True and allow before extracting the frames 
        For cases with short :py:obj:`indexes` arrays and long :py:obj:`traj_inp` lists, the default behaviour is 
        significantly faster.

@franknoe
Copy link
Contributor

This issue really has nothing to do with this particular function. It is
just triggered in this case by calling the function, but indexing should
not be controlled at this level.

Am 28/04/16 um 18:57 schrieb Guillermo Pérez-Hernández:

Got it!

What about sth like this?

|build_index : boolean, default is False If set to true, save_trajs
will build a frame-index of trajectory the files before accessing them
to extract the frames specified in :py:obj:indexes. For cases with
long :py:obj:indexes but short :py:obj:traj_inp lists, it may be
worth setting :py:obj:build_index to True and allow before
extracting the frames For cases with short :py:obj:indexes arrays
and long :py:obj:traj_inp lists, the default behaviour is
significantly faster. |


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

Well, then that is much bigger issue, I guess you have to ping @clonker and talk design.

I wrote the original save_traj to be able to work with lists of files in case the user has not built (or doesn't want to build) readers...but that got changed along the way

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

@marscher, I am not getting the revert to work...am I missing something?

gph82@southsea:  ~/programs/PyEmma$ git reset --hard origin/devel
HEAD is now at 3a79a22 latest state
gph82@southsea:  ~/programs/PyEmma$ git revert 9363956
error: could not revert 9363956... [coor.save_traj|FragmentedTrajReader] rewrote save_traj to use random access iterators.
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'

@gph82
Copy link
Contributor Author

gph82 commented Apr 28, 2016

plus, my proposed solution (#788 (comment)) would not be of any use, actually...

@marscher
Copy link
Member

marscher commented Apr 29, 2016 via email

@franknoe
Copy link
Contributor

I suppose that makes sense. For xtc you'd anyway have to index the file
in order to extract frames - you could only save work on indexing in the
situation that only frames in the beginning of the file are requested.

Can you try it out and @gph82 checks how this behaves on his data?

Am 29/04/16 um 13:26 schrieb Martin K. Scherer:

What if we only index the files from which actually frames are going
to be extracted?

So we just filter the given indices to kick out files, which are not
desired.
This would then only trigger a pre-indexing of the desired files, not
the whole list.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@gph82
Copy link
Contributor Author

gph82 commented Apr 29, 2016

Yeah I'll try it out.

Still think that it might be a bit overkill. I don't think there's a
general solution. Then one will be "sold" by whatever is in indexes...

I can think of scenarios where one would still choose not to build an
index and just stream through the needed files

On 04/29/2016 01:29 PM, Frank Noe wrote:

I suppose that makes sense. For xtc you'd anyway have to index the file
in order to extract frames - you could only save work on indexing in the
situation that only frames in the beginning of the file are requested.

Can you try it out and @gph82 checks how this behaves on his data?

Am 29/04/16 um 13:26 schrieb Martin K. Scherer:

What if we only index the files from which actually frames are going
to be extracted?

So we just filter the given indices to kick out files, which are not
desired.
This would then only trigger a pre-indexing of the desired files, not
the whole list.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#788 (comment)

Dr. Guillermo Pérez-Hernández
Freie Universität Berlin
Institute for Mathematics
Arnimallee 6
D-14195 Berlin
tel 0049 30 838 75775

http://userpage.fu-berlin.de/gph82/

@franknoe
Copy link
Contributor

If files have been indexed, there is no reason not to use the
information. One just shouldn't waste time on indexing trajectories that
aren't used.

Am 29/04/16 um 13:41 schrieb Guillermo Pérez-Hernández:

Yeah I'll try it out.

Still think that it might be a bit overkill. I don't think there's a
general solution. Then one will be "sold" by whatever is in indexes...

I can think of scenarios where one would still choose not to build an
index and just stream through the needed files

On 04/29/2016 01:29 PM, Frank Noe wrote:

I suppose that makes sense. For xtc you'd anyway have to index the file
in order to extract frames - you could only save work on indexing in the
situation that only frames in the beginning of the file are requested.

Can you try it out and @gph82 checks how this behaves on his data?

Am 29/04/16 um 13:26 schrieb Martin K. Scherer:

What if we only index the files from which actually frames are going
to be extracted?

So we just filter the given indices to kick out files, which are not
desired.
This would then only trigger a pre-indexing of the desired files, not
the whole list.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub

#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

#788 (comment)

Dr. Guillermo Pérez-Hernández
Freie Universität Berlin
Institute for Mathematics
Arnimallee 6
D-14195 Berlin
tel 0049 30 838 75775

http://userpage.fu-berlin.de/gph82/


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@franknoe
Copy link
Contributor

franknoe commented May 9, 2016

What's the situation with this issue?

@marscher
Copy link
Member

marscher commented May 9, 2016 via email

@franknoe
Copy link
Contributor

franknoe commented May 9, 2016

ok!

Am 09/05/16 um 11:51 schrieb Martin K. Scherer:

I'm resolving the last problem on my bug fixing branch.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

@marscher
Copy link
Member

Still some hassle to solve with the FragmentedReader. It seems it collects the frames in wrong order. I've implemented a special case for the FeatureReader and also for FragmentedReader to return mdtraj.Trajectory objects (to have unit cell information in place).

I think @clonker and me can sort this out tomorrow.

@franknoe
Copy link
Contributor

ok, thanks

Am 12/05/16 um 18:57 schrieb Martin K. Scherer:

Still some hassle to solve with the FragmentedReader. It seems it
collects the frames in wrong order. I've implemented a special case
for the FeatureReader and also for FragmentedReader to return
mdtraj.Trajectory objects (to have unit cell information in place).

I think @clonker https://github.com/clonker and me can sort this out
tomorrow.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#788 (comment)


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

marscher added a commit that referenced this issue May 16, 2016
If a reader has to be constructed within coordinates.save_traj(s), we only pass the file names actually needed to extract frames from to the reader. This fixes #788 (index unneeded files).

Involved changes:
* [featurereader] added option to return plain mdtraj.Trajectory objects
* [fragmentedreader] added reader_by_filename property
* [coor/api] handle chunk_size argument correctly (0 evaluates to False).
* default chunksize 1000
* [patches] added setitem method for Trajectory objs
* [reader_utils] def chunksize=1000, added setitem method for empty_traj created objs.
* [fragreader] use setitem method for traj objects collected by underlying featurereaders
* [datasource] handle too large RA indices for stride
* added create_traj in test.util.
marscher added a commit to marscher/PyEMMA that referenced this issue May 17, 2016
If a reader has to be constructed within coordinates.save_traj(s), we only pass the file names actually needed to extract frames from to the reader. This fixes markovmodel#788 (index unneeded files).

Involved changes:
* [featurereader] added option to return plain mdtraj.Trajectory objects
* [fragmentedreader] added reader_by_filename property
* [coor/api] handle chunk_size argument correctly (0 evaluates to False).
* default chunksize 1000
* [patches] added setitem method for Trajectory objs
* [reader_utils] def chunksize=1000, added setitem method for empty_traj created objs.
* [fragreader] use setitem method for traj objects collected by underlying featurereaders
* [datasource] handle too large RA indices for stride
* added create_traj in test.util.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants