
Conversation

marscher
Member

In order to facilitate the loading of older class versions, which might have added, renamed, or deleted fields, we define an interpolation mapping like this:

    _serialize_interpolation_map = {1 : [('set', 'attr', value),
                                         ('mv', 'old', 'new'),
                                         ('rm', 'attr')]}

The key of the dict indicates the class version for which this mapping will be applied, i.e. when an instance saved at version 1 is loaded by version 2 of the class, the mapping stored under key 1 is applied.

Currently the following operations are supported:

  • Set an attribute to a given value (e.g. a new attribute has been added in a new version)
  • Rename an attribute
  • Delete an attribute

This way we can track (all?) changes made to classes via versioning. To track class rename refactorings, we need to provide a renaming dictionary in the load function that translates the old class name to its actual location; this just requires a simple rename in the JSON string.
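To make this concrete, here is a minimal sketch of how such a mapping could be applied when loading an older state dict (the apply_interpolation helper and the attribute names are made up for illustration, not the actual implementation):

    def apply_interpolation(state, interpolation_map, old_version, new_version):
        # apply the interpolation operations of every version between the
        # saved class version and the current class version
        for version in range(old_version, new_version):
            for op in interpolation_map.get(version, []):
                if op[0] == 'set':     # ('set', 'attr', value): add or overwrite an attribute
                    state[op[1]] = op[2]
                elif op[0] == 'mv':    # ('mv', 'old', 'new'): rename an attribute
                    state[op[2]] = state.pop(op[1])
                elif op[0] == 'rm':    # ('rm', 'attr'): delete an attribute
                    state.pop(op[1], None)
        return state

    # e.g. upgrading a state saved at class version 1 to the current version 2
    state = {'lag': 10, 'old_name': 42, 'obsolete': None}
    interpolation_map = {1: [('set', 'new_flag', True),
                             ('mv', 'old_name', 'new_name'),
                             ('rm', 'obsolete')]}
    apply_interpolation(state, interpolation_map, old_version=1, new_version=2)
    # state is now {'lag': 10, 'new_flag': True, 'new_name': 42}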

@franknoe
Contributor

That's a good idea. But this needs to be updated manually after a rename, right?

I think we should clean up class attributes and their names before releasing this feature, such that we're in a situation where attribute names appear stable.

@marscher
Member Author

On 30.06.2016 at 10:53, Frank Noe wrote:

> That's a good idea. But this needs to be updated manually after a rename, right?

Yes, this has to be maintained, but I'd guess it doesn't happen too often, right?

> I think we should clean up class attributes and their names before releasing this feature, such that we're in a situation where attribute names appear stable.

You mean we should clean up before merging model_estimators_serializable and not this PR?

I agree that we should tidy things up beforehand to minimize the need to use this feature. Do you have any suggestions as to which modules need some attention?

@franknoe
Contributor

On 30/06/16 at 12:20, Martin K. Scherer wrote:

> Yes, this has to be maintained, but I'd guess it doesn't happen too often, right?

Hopefully not.

> You mean we should clean up before merging model_estimators_serializable and not this PR?

Correct, before merging this branch into master.

> I agree that we should tidy things up beforehand to minimize the need to use this feature. Do you have any suggestions as to which modules need some attention?

The estimators I have worked on are generally quite full of hidden variables (with a leading '_'), but then there are getters without a leading '_'. Now I think this is a little redundant, and it might be better to have sklearn-style access, where estimator variables are just accessible (also settable if needed) under their names.

Generally, it may be useful to order variables into principal variables (information we really need to store) and secondary variables (information we can compute from the principal variables). For estimators we need to decide what to do with the data: saving the data used to parametrize the estimator may be excessive. If the data can be compressed to an efficient format which parametrizes the estimator (e.g. the count matrix for MaximumLikelihoodMSM), this is feasible, but in other cases (e.g. TRAM) we don't know such a format. I'm not sure how to solve this, so I guess we need a developer meeting soon.



@coveralls

coveralls commented Jun 30, 2016


Coverage decreased (-0.08%) to 85.184% when pulling 1b2cdbf on marscher:serialization_add_version_interpolation into 0f9b452 on markovmodel:models_estimators_serializeable.

@marscher
Member Author

I agree that it would be easier to implement this if we stuck to the sklearn pattern of storing everything as plain attributes. However, our classes are much more complex than a simple sklearn estimator, and we rely on some logic which requires private attributes. If we want a fully operational restored object, i.e. one which behaves exactly the same way as prior to saving, I think we have to store these internal parameters too.
This is not a problem for plain properties, by the way, since the serialization protocol will just invoke their getters and setters. So there is no extra overhead for properties; only the internal flow control variables need to be handled.
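As a toy illustration of the property case (hypothetical class, not the actual protocol): the getter/setter pair is already enough to round-trip the value, whereas a private flow-control flag would have to be stored explicitly:

    class Example(object):
        def __init__(self):
            self._lag = 1            # backing field of a public property
            self._estimated = False  # internal flow control variable

        @property
        def lag(self):
            return self._lag

        @lag.setter
        def lag(self, value):
            self._lag = value

    # serialization can simply call the getter ...
    saved = {'lag': Example().lag}
    # ... and restoration the setter; only _estimated would need to be stored
    # explicitly to fully recover the internal state.
    restored = Example()
    restored.lag = saved['lag']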

For TRAM I'd suggest omitting very large arrays, although the data format is at least space efficient, since it uses NumPy's npz format for arrays. But I'd guess these arrays are still too large to store. Maybe we should store the code of a function which generates/restores the user data in this case (which might be error prone)?

I agree that we should discuss this in detail in a meeting.

@franknoe
Contributor

I see generally two use cases:

  1. You want to store the estimated model.
  2. You want to also store the data used to estimate the model, and keep this relationship intact.

We might want to leave the choice to the user. However, the fact that we chose to use the model as a mixin to the estimator confuses this distinction from a user perspective, so I'm currently not clear what's the best design to solve this.


@marscher
Member Author

I think this choice can be made via a simple parameter in Estimator.save, e.g. save_estimation_data=False. Then we introduce another field in the Estimator to declare which fields are "estimation data".
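Roughly along these lines (hypothetical attribute and method names, just to sketch the idea, not the actual API):

    class Estimator(object):
        # hypothetical: declares which attributes count as "estimation data"
        _estimation_data_fields = ('dtrajs',)

        def __init__(self):
            self.lag = 10
            self.dtrajs = [[0, 1, 1, 0]]  # estimation data, potentially very large

        def save(self, save_estimation_data=False):
            state = dict(self.__dict__)
            if not save_estimation_data:
                # drop the declared estimation data unless the user asks for it
                for name in self._estimation_data_fields:
                    state.pop(name, None)
            return state  # a real implementation would serialize this to disk

    Estimator().save()                           # -> {'lag': 10}
    Estimator().save(save_estimation_data=True)  # -> also contains 'dtrajs'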

Since most models are currently encapsulated within the Estimator (e.g. the plain model is not trivially usable), I do not see the need to distinguish between them. If we want to store the model separately, the user can simply invoke:

    tica = pyemma.coordinates.tica(data, ...)
    tica.model.save("tica_model")

However, this is more difficult to handle, since I'm not sure whether just setting the model on the Estimator will fully restore the desired state.

@coveralls

coveralls commented Jun 30, 2016


Coverage decreased (-0.1%) to 85.113% when pulling c220b54 on marscher:serialization_add_version_interpolation into 0f9b452 on markovmodel:models_estimators_serializeable.

@marscher marscher merged commit e3f5b1c into markovmodel:models_estimators_serializeable Jul 1, 2016
@marscher marscher deleted the serialization_add_version_interpolation branch July 1, 2016 12:12