[ITS] fix lag setter and avoid recomputation #1030
Conversation
…a has not changed. Fixes markovmodel#999
Thanks, will do.
The tests passed locally. I don't know what's going on there... will check on Monday.
The values are indeed wrong if the estimator is not cloned, because in the case of MLMSM the RDL decomposition is cached between estimations. This seems to be a more serious bug.
Codecov Report

```diff
@@           Coverage Diff            @@
##           devel    #1030     +/-  ##
=========================================
- Coverage   84.5%   84.48%   -0.02%
=========================================
  Files        189      189
  Lines      18813    18913    +100
=========================================
+ Hits       15898    15979     +81
- Misses      2915     2934     +19
```

Continue to review the full report at Codecov.
Thanks @marscher. I think the functionality is something like the attached notebook and seems to do what it's supposed to. Also, I took a look at the test test_insert_lag_time; everything looks sane. What would you say about adding a method like "add_lagtimes" that does both the adding and the estimating?
Hi, thanks for the feedback. If we ensure the lags property is a list, we could also write something like its.lags.append(x). I'm not really sure whether we want another method for this just to save one line of code.
...or its.lags.extend([1, 2, 3]), but you get the idea.
I like lags.extend more, because appending to the list does not sort them. But isn't extend still a new method?
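To illustrate the behaviour being discussed, here is a minimal sketch of a lags property whose setter re-estimates only lag times it has not seen before. The class name `LagCache` and everything inside it are my assumptions for illustration; this is not the actual PyEMMA `ImpliedTimescales` code.

```python
class LagCache:
    """Hypothetical sketch: setting new lags triggers estimation
    only for lag times that were not estimated yet."""

    def __init__(self):
        self._models = {}       # lag -> cached model (placeholder here)
        self.n_estimations = 0  # counts how often "expensive" work ran

    @property
    def lags(self):
        # always expose the lag times sorted, regardless of insertion order
        return sorted(self._models)

    @lags.setter
    def lags(self, new_lags):
        for lag in new_lags:
            if lag not in self._models:
                self.n_estimations += 1          # expensive estimation here
                self._models[lag] = "model@%d" % lag


its = LagCache()
its.lags = [1, 2, 3]
its.lags = its.lags + [5, 10]   # akin to its.lags.extend([5, 10])
print(its.lags)                 # [1, 2, 3, 5, 10]
print(its.n_estimations)        # 5 -- the first three lags were reused
```

Appending directly to the returned list would bypass the setter, which is one reason a plain `its.lags.append(x)` shorthand is not entirely straightforward.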
I haven't followed this in detail. I think re-estimating without wasting time on redoing work already done is a useful functionality. But let's be careful not to handle general problems by adding very specialized API functions to specific estimators. For convenience reasons, ImpliedTimescales is currently an estimator, but really it is performing a grid search over hyperparameters (here: the lag times). I'm sure adding parameters and re-estimating is something the sklearn folks have worked out, so please check the docs to see how to do this in a general way. Here's a starting point, but hopefully there's something more specific:
http://scikit-learn.org/stable/modules/grid_search.html
--
----------------------------------------------
Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354
Web: research.franknoe.de
Mail: Arnimallee 6, 14195 Berlin, Germany
----------------------------------------------
I've read the sklearn code; they do not support partial estimation or avoiding re-estimation of the same thing in the GridSearch-based classes. If you want to change your hyperparameter grid, you have to call fit again. Also, the estimators need to implement the scoring interface in order to select the best model, to which the public predict/transform methods are delegated. So this is really different from ITS, because GridSearch does not maintain the list of all estimators, but only the "best" one. So we can think about a general solution for this (e.g. get the parameters of the already estimated estimators, compute a set difference against what has been requested, and only perform the missing ones).
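The "set difference" idea mentioned above can be sketched in a few lines. The function name `missing_params` is hypothetical, chosen only for this example:

```python
def missing_params(estimated, requested):
    """Return the requested parameter values that have no estimate yet,
    preserving the order of `requested`."""
    done = set(estimated)
    return [p for p in requested if p not in done]


already_estimated = [1, 2, 5]
new_grid = [1, 2, 5, 10, 20]
print(missing_params(already_estimated, new_grid))  # [10, 20]
```

Only the returned parameter values would then be passed to the expensive estimation step; everything else is served from the existing estimators.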
So basically we'd like to have something like GridSearch with persistence, e.g. a ParameterScan. The basic functionality of this would be just to have a list of parameters and associated estimators/models. It could behave similarly to GridSearch, but would need a way to redo estimates for new parameter sets.

This object is a bit "use at your own risk", because it cannot guarantee that the data that went into different parameter estimation rounds is consistent, as data should generally not be stored in the estimators or models. But I think that's fine.

So, bottom line, I think it'd be great to have this functionality in ImpliedTimescales if we make sure that we have a concept for generalization, e.g. a superclass pattern such as ParameterScan with function/attribute naming conventions that also work for parameter scanners other than ImpliedTimescales.
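The ParameterScan idea above could look something like the following minimal skeleton. The class name `ParameterScan` comes from this discussion; the constructor signature, the `estimate` method, and the toy `CountCalls` estimator are assumptions made for illustration. Unlike sklearn's GridSearchCV, it keeps one fitted estimator per parameter value instead of only the best one:

```python
class ParameterScan:
    """Hypothetical grid search with persistence: keeps all fitted
    estimators and skips parameter values that were already estimated."""

    def __init__(self, estimator_factory):
        self._factory = estimator_factory   # callable: param -> estimator
        self.models = {}                    # param -> fitted estimator

    def estimate(self, data, params):
        for p in params:
            if p not in self.models:        # only fit the missing ones
                est = self._factory(p)
                est.fit(data)
                self.models[p] = est
        return self


class CountCalls:
    """Toy estimator that counts how often fit() actually runs."""
    fits = 0

    def __init__(self, p):
        self.p = p

    def fit(self, data):
        CountCalls.fits += 1
        return self


scan = ParameterScan(CountCalls)
scan.estimate(data=None, params=[1, 2, 3])
scan.estimate(data=None, params=[2, 3, 4])  # only param 4 is fitted
print(CountCalls.fits)                      # 4, not 6
```

As noted above, this design trades memory for reuse: every parameter value keeps its estimator alive.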
On 01/31/2017 03:59 PM, Frank Noe wrote:

> So basically we'd like to have something like GridSearch with persistence, e.g. a ParameterScan. The basic functionality of this would be just to have a list of parameters and associated estimators/models. It could behave similarly to GridSearch, but would need a way to redo estimates for new parameter sets. This object is a bit "use at your own risk", because it cannot guarantee that the data that went into different parameter estimation rounds is consistent, as data should generally not be stored in the estimators or models. But I think that's fine.

This issue is still present for ML-MSM derived classes, which store the full dtrajs. In general it can be solved by storing a fingerprint of the last input and discarding already estimated results if it changes (this is how I've done it for ITS right now). Another risk is that one blows up memory when new models/estimators are added for every combination of the new grid.

> So, bottom line, I think it'd be great to have this functionality in ImpliedTimescales if we make sure that we have a concept for generalization, e.g. a superclass pattern such as ParameterScan.

We have no other use cases for this general class yet, right? Since the API of ITS is not being changed by this PR, we can just merge it for now and introduce this generalization later on (when needed).
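The fingerprint approach mentioned above can be sketched as follows. Using `hashlib` over the raw trajectory bytes is my assumption for illustration; PyEMMA may fingerprint its input differently, and `CachedScan` is a made-up name:

```python
import hashlib


def fingerprint(trajectories):
    """Hash a list of byte-like trajectories into one short digest."""
    h = hashlib.sha1()
    for traj in trajectories:
        h.update(bytes(traj))
    return h.hexdigest()


class CachedScan:
    """Hypothetical sketch: cached estimates are kept only as long as
    the input data's fingerprint stays the same."""

    def __init__(self):
        self._fp = None
        self.models = {}   # lag -> model placeholder

    def estimate(self, dtrajs, lags):
        fp = fingerprint(dtrajs)
        if fp != self._fp:          # input changed: old estimates are stale
            self.models.clear()
            self._fp = fp
        for lag in lags:
            self.models.setdefault(lag, "model@%d" % lag)
        return self


scan = CachedScan()
scan.estimate([b"\x00\x01\x02"], lags=[1, 2])
scan.estimate([b"\x00\x01\x02"], lags=[1, 2, 5])  # reuses lags 1 and 2
print(sorted(scan.models))  # [1, 2, 5]
scan.estimate([b"\x03\x04"], lags=[1])            # new data: cache dropped
print(sorted(scan.models))  # [1]
```

As Frank points out below, a checksum like this may be stricter than necessary, e.g. when fitting different random samples of the same large data set.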
On 31/01/17 at 16:19, Martin K. Scherer wrote:

> This issue is still present for ML-MSM derived classes, which store the full dtrajs. In general it can be solved by storing a fingerprint of the last input and discarding already estimated results if it changes (this is how I've done it for ITS right now).

Hmm, I don't think the data necessarily needs to be identical in order to be consistent. For example, you could fit with different random samples of a large data set, so checksums or similar might be too strict. I think it makes sense to leave the decision to the user about which data they want combined under the same scan.

> Another risk is that one blows up memory when new models/estimators are added for every combination of the new grid.

Definitely, but that can't be avoided without making it much more complicated (e.g. by writing things to a cache file and lazy loading). At least it's pretty obvious why memory blows up here, and it's in the hands of the user to avoid it.

> We have no other use cases for this general class yet, right? Since the API of ITS is not being changed by this PR, we can just merge it for now and introduce this generalization later on (when needed).

Yes, as long as we don't use function/attribute naming conventions that are so specific that they can't be generalized and have to be removed from the API again later. Removing things from the API is always a pain. ``add_lagtimes()`` would not generalize, for example.
BTW, I was talking to @marscher about this; it's really not that important, just a shorthand that is more intuitive. In the end we're talking about two lines of code.
I'm currently working on the concept of an abstract ParameterSearch estimator. In the meantime: can we merge this, since it also contains other useful fixes, and make ITS use the ParameterSearch class later on?
Since this doesn't seem to change the existing API but only extend it, yes.
Thanks.
These changes make it possible to avoid heavy re-estimation if only one lag time is added. @gph82, I guess this is what you desired. Please go ahead and test.