
Conversation

marscher
Member

In order to facilitate the loading of older class versions, which might have added, renamed, or deleted fields, we define an interpolation mapping like this:

    _serialize_interpolation_map = {1 : [('set', 'attr', value),
                                         ('mv', 'old', 'new'),
                                         ('rm', 'attr')]}

The key of the dict indicates the class version for which this mapping will be applied, i.e. when an instance saved at version 1 is loaded by version 2 of the class, the mapping stored under key 1 is applied.

Currently the following operations are supported:

  • Set an attribute to a given value (e.g. a new attribute has been added in a new version)
  • Rename an attribute
  • Delete an attribute

This way we can track (all?) changes made to classes via versioning. To track class rename refactorings, we need to provide a renaming dictionary in the load function that translates the old class name to its actual location; this just requires a simple rename in the JSON string.
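To make this concrete, here is a minimal sketch of how such a mapping could be applied when loading an older state dict (the apply_interpolation helper and the attribute names are made up for illustration, not the actual implementation):

    def apply_interpolation(state, interpolation_map, old_version, new_version):
        # apply the interpolation operations of every version between the
        # saved class version and the current class version
        for version in range(old_version, new_version):
            for op in interpolation_map.get(version, []):
                if op[0] == 'set':     # ('set', 'attr', value): add or overwrite an attribute
                    state[op[1]] = op[2]
                elif op[0] == 'mv':    # ('mv', 'old', 'new'): rename an attribute
                    state[op[2]] = state.pop(op[1])
                elif op[0] == 'rm':    # ('rm', 'attr'): delete an attribute
                    state.pop(op[1], None)
        return state

    # e.g. upgrading a state saved at class version 1 to the current version 2
    state = {'lag': 10, 'old_name': 42, 'obsolete': None}
    interpolation_map = {1: [('set', 'new_flag', True),
                             ('mv', 'old_name', 'new_name'),
                             ('rm', 'obsolete')]}
    apply_interpolation(state, interpolation_map, old_version=1, new_version=2)
    # state is now {'lag': 10, 'new_flag': True, 'new_name': 42}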

@franknoe
Contributor

That's a good idea. But this needs to be updated manually after a rename, right?

I think we should clean up class attributes and their names before releasing this feature, such that we're in a situation where attribute names appear stable.

@marscher
Member Author

On 30.06.2016 at 10:53, Frank Noe wrote:

> That's a good idea. But this needs to be updated manually after a rename, right?

Yes, this has to be maintained, but I'd guess it doesn't happen too often, right?

> I think we should clean up class attributes and their names before releasing this feature, such that we're in a situation where attribute names appear stable.

You mean we should clean up before merging model_estimators_serializable and not this PR?

I agree that we should tidy things up beforehand to minimize the need to use this feature. Do you have any suggestions as to which modules need some attention?

@franknoe
Contributor

On 30/06/16 at 12:20, Martin K. Scherer wrote:

> Yes, this has to be maintained, but I'd guess it doesn't happen too often, right?

Hopefully not.

> You mean we should clean up before merging model_estimators_serializable and not this PR?

Correct, before merging this branch into master.

> I agree that we should tidy things up beforehand to minimize the need to use this feature. Do you have any suggestions as to which modules need some attention?

The estimators I have worked on are generally quite full of hidden variables (with a leading '_'), but then there are getters without a leading '_'. Now I think this is a little redundant, and it might be better to have sklearn-style access, where estimator variables are just accessible (also settable if needed) under their names.

Generally, it may be useful to order variables into principal variables (information we really need to store) and secondary variables (information we can compute from the principal variables). For estimators we need to decide what to do with the data: saving the data used to parametrize the estimator may be excessive. If the data can be compressed to an efficient format which parametrizes the estimator (e.g. the count matrix for MaximumLikelihoodMSM), this is feasible, but in other cases (e.g. TRAM) we don't know such a format. I'm not sure how to solve this, so I guess we need a developer meeting soon.



@coveralls

coveralls commented Jun 30, 2016


Coverage decreased (-0.08%) to 85.184% when pulling 1b2cdbf on marscher:serialization_add_version_interpolation into 0f9b452 on markovmodel:models_estimators_serializeable.

@marscher
Member Author

I agree that it would be easier to implement this if we stuck to the sklearn pattern of storing everything as plain attributes. However, our classes are much more complex than a simple sklearn estimator, and we rely on some logic which requires private attributes. If we want a fully operational restored object, i.e. one which behaves exactly the same way as prior to saving, I think we have to store these internal parameters too.
This is not a problem for plain properties, by the way, since the serialization protocol will just invoke their getters and setters. So there is no extra overhead for properties; only the internal flow control variables need to be handled.
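As a toy illustration of the property case (hypothetical class, not the actual protocol): the getter/setter pair is already enough to round-trip the value, whereas a private flow-control flag would have to be stored explicitly:

    class Example(object):
        def __init__(self):
            self._lag = 1            # backing field of a public property
            self._estimated = False  # internal flow control variable

        @property
        def lag(self):
            return self._lag

        @lag.setter
        def lag(self, value):
            self._lag = value

    # serialization can simply call the getter ...
    saved = {'lag': Example().lag}
    # ... and restoration the setter; only _estimated would need to be stored
    # explicitly to fully recover the internal state.
    restored = Example()
    restored.lag = saved['lag']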

For TRAM I'd suggest omitting very large arrays, although the data format is at least space efficient, since it uses NumPy's npz format for arrays. But I'd guess these arrays are still too large to store. Maybe we should store the code of a function which generates/restores the user data in this case (which might be error prone)?

I agree that we should discuss this in detail in a meeting.

@franknoe
Contributor

I see generally two use cases:

  1. You want to store the estimated model.
  2. You want to also store the data used to estimate the model, and keep this relationship intact.

We might want to leave the choice to the user. However, the fact that we chose to use the model as a mixin to the estimator confuses this distinction from a user perspective, so I'm currently not clear what's the best design to solve this.


@marscher
Member Author

I think this choice can be made via a simple parameter in Estimator.save, e.g. save_estimation_data=False. Then we introduce another field in the Estimator to declare which fields are "estimation data".
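Roughly along these lines (hypothetical attribute and method names, just to sketch the idea, not the actual API):

    class Estimator(object):
        # hypothetical: declares which attributes count as "estimation data"
        _estimation_data_fields = ('dtrajs',)

        def __init__(self):
            self.lag = 10
            self.dtrajs = [[0, 1, 1, 0]]  # estimation data, potentially very large

        def save(self, save_estimation_data=False):
            state = dict(self.__dict__)
            if not save_estimation_data:
                # drop the declared estimation data unless the user asks for it
                for name in self._estimation_data_fields:
                    state.pop(name, None)
            return state  # a real implementation would serialize this to disk

    Estimator().save()                           # -> {'lag': 10}
    Estimator().save(save_estimation_data=True)  # -> also contains 'dtrajs'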

Since most models are currently encapsulated within the Estimator (e.g. the plain model is not trivially usable), I do not see the need to distinguish between them. If we want to store the model separately, the user can simply invoke:

    tica = pyemma.coordinates.tica(data, ...)
    tica.model.save("tica_model")

However, this is more difficult to handle, since I'm not sure whether just setting the model on the Estimator will fully restore the desired state.

@coveralls

coveralls commented Jun 30, 2016


Coverage decreased (-0.1%) to 85.113% when pulling c220b54 on marscher:serialization_add_version_interpolation into 0f9b452 on markovmodel:models_estimators_serializeable.

@marscher marscher merged commit e3f5b1c into markovmodel:models_estimators_serializeable Jul 1, 2016
@marscher marscher deleted the serialization_add_version_interpolation branch July 1, 2016 12:12