[ITS] fix lag setter and avoid recomputation #1030
Conversation
…a has not changed. Fixes markovmodel#999
Thanks, will do.
The tests passed locally. I don't know what's going on there... will check on Monday.
The values are indeed wrong if the estimator is not cloned, because in the case of MLMSM the RDL decomposition is cached between estimations. This seems to be a more serious bug.
Codecov Report

```diff
@@           Coverage Diff            @@
##           devel    #1030     +/-  ##
=========================================
- Coverage   84.5%   84.48%   -0.02%
=========================================
  Files        189      189
  Lines      18813    18913    +100
=========================================
+ Hits       15898    15979     +81
- Misses      2915     2934     +19
```

Continue to review the full report at Codecov.
Thanks @marscher. I think the functionality is something like the attached notebook and seems to do what it's supposed to. Also, I took a look at the test test_insert_lag_time; everything looks sane. What would you say about adding a method like "add_lagtimes" that does both the adding and the estimating?
Hi, thanks for the feedback. If we ensure the lags property is a list, we could also write something like its.lags.append(x). I'm not really sure whether we want another method for this just to save one line of code.
...or its.lags.extend([1, 2, 3]), but you get the idea.
I like lags.extend more, because appending to the list does not sort them. But isn't extend still a new method?
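To illustrate the behaviour being discussed, here is a minimal sketch of a lags property whose setter re-estimates only lag times it has not seen before. The class name `LagCache` and everything inside it are my assumptions for illustration; this is not the actual PyEMMA `ImpliedTimescales` code.

```python
class LagCache:
    """Hypothetical sketch: setting new lags triggers estimation
    only for lag times that were not estimated yet."""

    def __init__(self):
        self._models = {}       # lag -> cached model (placeholder here)
        self.n_estimations = 0  # counts how often "expensive" work ran

    @property
    def lags(self):
        # always expose the lag times sorted, regardless of insertion order
        return sorted(self._models)

    @lags.setter
    def lags(self, new_lags):
        for lag in new_lags:
            if lag not in self._models:
                self.n_estimations += 1          # expensive estimation here
                self._models[lag] = "model@%d" % lag


its = LagCache()
its.lags = [1, 2, 3]
its.lags = its.lags + [5, 10]   # akin to its.lags.extend([5, 10])
print(its.lags)                 # [1, 2, 3, 5, 10]
print(its.n_estimations)        # 5 -- the first three lags were reused
```

Appending directly to the returned list would bypass the setter, which is one reason a plain `its.lags.append(x)` shorthand is not entirely straightforward.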
I haven't followed this in detail. I think re-estimating without wasting time on redoing work already done is a useful functionality. But let's be careful not to handle general problems by adding very specialized API functions to specific estimators. For convenience reasons, ImpliedTimescales is currently an estimator, but really it is performing a grid search over hyperparameters (here: the lag times). I'm sure adding parameters and re-estimating is something the sklearn folks have worked out, so please check the docs to see how to do this in a general way. Here's a starting point, but hopefully there's something more specific:
http://scikit-learn.org/stable/modules/grid_search.html
--
----------------------------------------------
Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354
Web: research.franknoe.de
Mail: Arnimallee 6, 14195 Berlin, Germany
----------------------------------------------
I've read the sklearn code; they do not support partial estimation or avoiding re-estimation of the same thing in the GridSearch-based classes. If you want to change your hyperparameter grid, you have to call fit again. Also, the estimators need to implement the scoring interface in order to select the best model, to which the public predict/transform methods are delegated. So this is really different from ITS, because GridSearch does not maintain the list of all estimators, but only the "best" one. So we can think about a general solution for this (e.g. get the parameters of the already estimated estimators, compute a set difference against what has been requested, and only perform the missing ones).
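The "set difference" idea mentioned above can be sketched in a few lines. The function name `missing_params` is hypothetical, chosen only for this example:

```python
def missing_params(estimated, requested):
    """Return the requested parameter values that have no estimate yet,
    preserving the order of `requested`."""
    done = set(estimated)
    return [p for p in requested if p not in done]


already_estimated = [1, 2, 5]
new_grid = [1, 2, 5, 10, 20]
print(missing_params(already_estimated, new_grid))  # [10, 20]
```

Only the returned parameter values would then be passed to the expensive estimation step; everything else is served from the existing estimators.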
So basically we'd like to have something like GridSearch with persistence, e.g. a ParameterScan. The basic functionality of this would be just to have a list of parameters and associated estimators/models. It could behave similarly to GridSearch, but would need a way to redo estimates for new parameter sets.

This object is a bit "use at your own risk", because it cannot guarantee that the data that went into different parameter estimation rounds is consistent, as data should generally not be stored in the estimators or models. But I think that's fine.

So, bottom line, I think it'd be great to have this functionality in ImpliedTimescales if we make sure that we have a concept for generalization, e.g. a superclass pattern such as ParameterScan with function/attribute naming conventions that also work for parameter scanners other than ImpliedTimescales.
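The ParameterScan idea above could look something like the following minimal skeleton. The class name `ParameterScan` comes from this discussion; the constructor signature, the `estimate` method, and the toy `CountCalls` estimator are assumptions made for illustration. Unlike sklearn's GridSearchCV, it keeps one fitted estimator per parameter value instead of only the best one:

```python
class ParameterScan:
    """Hypothetical grid search with persistence: keeps all fitted
    estimators and skips parameter values that were already estimated."""

    def __init__(self, estimator_factory):
        self._factory = estimator_factory   # callable: param -> estimator
        self.models = {}                    # param -> fitted estimator

    def estimate(self, data, params):
        for p in params:
            if p not in self.models:        # only fit the missing ones
                est = self._factory(p)
                est.fit(data)
                self.models[p] = est
        return self


class CountCalls:
    """Toy estimator that counts how often fit() actually runs."""
    fits = 0

    def __init__(self, p):
        self.p = p

    def fit(self, data):
        CountCalls.fits += 1
        return self


scan = ParameterScan(CountCalls)
scan.estimate(data=None, params=[1, 2, 3])
scan.estimate(data=None, params=[2, 3, 4])  # only param 4 is fitted
print(CountCalls.fits)                      # 4, not 6
```

As noted above, this design trades memory for reuse: every parameter value keeps its estimator alive.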
On 01/31/2017 03:59 PM, Frank Noe wrote:

> So basically we'd like to have something like GridSearch with persistence, e.g. a ParameterScan. The basic functionality of this would be just to have a list of parameters and associated estimators/models. It could behave similarly to GridSearch, but would need a way to redo estimates for new parameter sets. This object is a bit "use at your own risk", because it cannot guarantee that the data that went into different parameter estimation rounds is consistent, as data should generally not be stored in the estimators or models. But I think that's fine.

This issue is still present for ML-MSM derived classes, which store the full dtrajs. In general it can be solved by storing a fingerprint of the last input and discarding already estimated results if it changes (this is how I've done it for ITS right now). Another risk is that one blows up memory when new models/estimators are added for every combination of the new grid.

> So, bottom line, I think it'd be great to have this functionality in ImpliedTimescales if we make sure that we have a concept for generalization, e.g. a superclass pattern such as ParameterScan.

We have no other use cases for this general class yet, right? Since the API of ITS is not being changed by this PR, we can just merge it for now and introduce this generalization later on (when needed).
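The fingerprint approach mentioned above can be sketched as follows. Using `hashlib` over the raw trajectory bytes is my assumption for illustration; PyEMMA may fingerprint its input differently, and `CachedScan` is a made-up name:

```python
import hashlib


def fingerprint(trajectories):
    """Hash a list of byte-like trajectories into one short digest."""
    h = hashlib.sha1()
    for traj in trajectories:
        h.update(bytes(traj))
    return h.hexdigest()


class CachedScan:
    """Hypothetical sketch: cached estimates are kept only as long as
    the input data's fingerprint stays the same."""

    def __init__(self):
        self._fp = None
        self.models = {}   # lag -> model placeholder

    def estimate(self, dtrajs, lags):
        fp = fingerprint(dtrajs)
        if fp != self._fp:          # input changed: old estimates are stale
            self.models.clear()
            self._fp = fp
        for lag in lags:
            self.models.setdefault(lag, "model@%d" % lag)
        return self


scan = CachedScan()
scan.estimate([b"\x00\x01\x02"], lags=[1, 2])
scan.estimate([b"\x00\x01\x02"], lags=[1, 2, 5])  # reuses lags 1 and 2
print(sorted(scan.models))  # [1, 2, 5]
scan.estimate([b"\x03\x04"], lags=[1])            # new data: cache dropped
print(sorted(scan.models))  # [1]
```

As Frank points out below, a checksum like this may be stricter than necessary, e.g. when fitting different random samples of the same large data set.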
On 31/01/17 at 16:19, Martin K. Scherer wrote:

> This issue is still present for ML-MSM derived classes, which store the full dtrajs. In general it can be solved by storing a fingerprint of the last input and discarding already estimated results if it changes (this is how I've done it for ITS right now).

Hmm, I don't think the data necessarily needs to be identical in order to be consistent. For example, you could fit with different random samples of a large data set, so checksums or similar might be too strict. I think it makes sense to leave the decision to the user about which data they want combined under the same scan.

> Another risk is that one blows up memory when new models/estimators are added for every combination of the new grid.

Definitely, but that can't be avoided without making it much more complicated (e.g. by writing things to a cache file and lazy loading). At least it's pretty obvious why memory blows up here, and it's in the hands of the user to avoid it.

> We have no other use cases for this general class yet, right? Since the API of ITS is not being changed by this PR, we can just merge it for now and introduce this generalization later on (when needed).

Yes, as long as we don't use function/attribute naming conventions that are so specific that they can't be generalized and have to be removed from the API again later. Removing things from the API is always a pain. ``add_lagtimes()`` would not generalize, for example.
BTW, I was talking to @marscher about this; it's really not that important, just a shorthand that is more intuitive. In the end we're talking about two lines of code.
I'm currently working on the concept of an abstract ParameterSearch estimator. In the meantime: can we merge this, since it also contains other useful fixes, and make ITS use the ParameterSearch class later on?
Since this doesn't seem to change the existing API but only extend it, yes.
Thanks.
These changes make it possible to avoid heavy re-estimation if only one lag time is added. @gph82, I guess this is what you desired. Please go ahead and test.