Save your models in PyEMMA
==========================

Most of the Estimators and Models in PyEMMA are serializable. If a given Estimator or Model can be saved to disk,
it provides a **save** method. In this notebook we will explain the basic concepts of file handling.

We try our best to provide **future** compatibility for already saved data. This means it should always be possible to load
data with a newer version of the software, but the reverse is not supported, e.g. you cannot load a model saved by a new version with an old version of PyEMMA.
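
The rule above can be pictured as a simple version comparison. The following is only a sketch of the idea, not PyEMMA's actual mechanism: the helper names `parse_version` and `can_load` and the version strings are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.5.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split('.'))


def can_load(saved_with, running):
    """Hypothetical check: a file is readable if it was written by the
    same or an older version than the one that is currently running."""
    return parse_version(saved_with) <= parse_version(running)


# illustrative version numbers only
print(can_load('2.5', '2.5.1'))    # old file, newer software
print(can_load('2.5.2', '2.5.1'))  # new file, older software
```

The first call reports that loading is fine, the second that it is not, which mirrors the one-directional guarantee described above.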

If you are interested in the technical background, go ahead and read the source code (there is actually not that much of it).

In [1]:
import pyemma
import numpy as np
import os
import pprint

In [2]:
# generate some synthetic data with 10 states
dtrajs = [np.random.randint(0, 10, size=10000) for _ in range(5)]

In [3]:
# estimate a Bayesian Markov state model
bmsm = pyemma.msm.bayesian_markov_model(dtrajs, lag=10)
print(bmsm)

BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)


We can now save the estimator (which contains the model) to disk. We first delete any existing file to avoid an exception.

In [4]:
try:
    os.unlink('my_models.h5')
except FileNotFoundError:
    pass  # nothing to delete on the first run

# now save our model
bmsm.save('my_models.h5')

We can now restore the model by calling the pyemma.load function with our file name.

In [5]:
pyemma.load('my_models.h5')

BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)

Note that we can save multiple models in one file. Because HDF5 acts like a file system, we have each model in a separate "folder", which is completely independent of the other models. We now change a parameter during estimation and save the estimator again in the same file, but in a different "folder".

In [6]:
bmsm.estimate(dtrajs, lag=100)
print(bmsm)
bmsm.save('my_models.h5', model_name='lag100')

BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)


Likewise, to restore the model saved under the new name, we pass that name to the load function.

In [7]:
pyemma.load('my_models.h5', model_name='lag100')

BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)
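
If several lag times are of interest, the estimate-then-save pattern from the cells above can be wrapped in a loop. This is a minimal sketch: the helper `save_lag_scan` and its naming scheme `lag<N>` are assumptions for illustration, and it relies only on the `estimate(dtrajs, lag=...)` and `save(filename, model_name=..., overwrite=...)` calls used in this notebook.

```python
def save_lag_scan(estimator, dtrajs, lags, filename, overwrite=False):
    """Re-estimate `estimator` for every lag time and store each result
    under its own model name ('lag10', 'lag100', ...) in one file.

    Returns the list of model names that were written.
    """
    names = []
    for lag in lags:
        # same call pattern as bmsm.estimate(dtrajs, lag=100) above
        estimator.estimate(dtrajs, lag=lag)
        name = 'lag{}'.format(lag)
        # same call pattern as bmsm.save('my_models.h5', model_name='lag100')
        estimator.save(filename, model_name=name, overwrite=overwrite)
        names.append(name)
    return names
```

With the objects from this notebook, a call could look like `save_lag_scan(bmsm, dtrajs, [10, 100], 'my_models.h5', overwrite=True)`. The `overwrite=True` is needed here because a model named "lag100" already exists in the file.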

As you may have noticed, passing a model name is optional. For convenience, the model is saved under the model_name "latest" if the argument is not provided. To check which models a file contains, we provide a command line tool named "pyemma_list_models".

In [8]:
! pyemma_list_models

usage: pyemma_list_models [-h] [--json] [--recursive] [-v] files [files ...]
pyemma_list_models: error: the following arguments are required: files


In [9]:
! pyemma_list_models my_models.h5

PyEMMA models

file: my_models.h5
--------------------------------------------------------------------------------
1. name: lag100
created: 1515632708.0470107
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)
2. name: latest
created: 1515632707.0779583
BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=10, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)
--------------------------------------------------------------------------------
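
The "created" values in this listing look like Unix timestamps (seconds since 1970-01-01 UTC). Assuming that interpretation is right, they can be turned into readable dates with the standard library. The helper name `readable` is made up for this sketch, and the two numbers are copied from the output above.

```python
from datetime import datetime, timezone


def readable(created):
    """Format a Unix timestamp (in seconds) as a UTC date and time string."""
    moment = datetime.fromtimestamp(created, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


# values taken from the pyemma_list_models output above
created = {'lag100': 1515632708.0470107, 'latest': 1515632707.0779583}

# print the models in the order they were written, oldest first
for name, stamp in sorted(created.items(), key=lambda item: item[1]):
    print(name, readable(stamp))
```

Both timestamps fall on 2018-01-11, about one second apart, with "latest" written first.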



You can also check the list of already stored models directly in PyEMMA.

In [10]:
content = pyemma.list_models('my_models.h5')
print("available models:", content.keys())
print("-" * 80)
print("detailed:")
pprint.pprint(content)

available models: dict_keys(['lag100', 'latest'])
--------------------------------------------------------------------------------
detailed:
{'lag100': {'class_repr': "BayesianMSM(conf=0.95, connectivity='largest', "
                          "count_mode='effective',\n"
                          "      dt_traj='1 step', lag=100, "
                          "mincount_connectivity='1/n', nsamples=100,\n"
                          '      nsteps=3, reversible=True, '
                          'show_progress=False, sparse=False,\n'
                          '      statdist_constraint=None)',
            'class_str': "BayesianMSM(conf=0.95, connectivity='largest', "
                         "count_mode='effective',\n"
                         "      dt_traj='1 step', lag=100, "
                         "mincount_connectivity='1/n', nsamples=100,\n"
                         '      nsteps=3, reversible=True, '
                         'show_progress=False, sparse=False,\n'
                    

Overwriting existing models is also possible, but we have to tell the save method explicitly that we want to overwrite.

In [11]:
# we expect a failure here, because a model named "latest" already exists in the file.
try:
    bmsm.save('my_models.h5')
except RuntimeError as e:
    print("can not save:", e)

11-01-18 02:05:11 pyemma.msm.estimators.bayesian_msm.BayesianMSM[0] ERROR    During saving the object BayesianMSM(conf=0.95, connectivity='largest', count_mode='effective',
      dt_traj='1 step', lag=100, mincount_connectivity='1/n', nsamples=100,
      nsteps=3, reversible=True, show_progress=False, sparse=False,
      statdist_constraint=None)") the following error occurred: model "latest" already exists. Either use overwrite=True, or use a different name/file.
Traceback (most recent call last):
  File "/home/marscher/workspace/pyemma/pyemma/_base/serialization/serialization.py", line 224, in save
    f.add_serializable(model_name, obj=self, overwrite=overwrite, save_streaming_chain=save_streaming_chain)
  File "/home/marscher/workspace/pyemma/pyemma/_base/serialization/h5file.py", line 120, in add_serializable
    self._set_group(name, overwrite)
  File "/home/marscher/workspace/pyemma/pyemma/_base/serialization/h5file.py", line 66, in _set_group
    ' or use a different name/file.

In [12]:
bmsm.save('my_models.h5', overwrite=True)

11-01-18 02:05:11 pyemma._base.serialization.h5file INFO     overwriting model "latest" in file /
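
The naming and overwrite rules shown in this notebook can be summarized with a toy, in-memory stand-in for a model file. This is not how PyEMMA stores data (it uses HDF5); the class `ToyModelFile` is hypothetical and only mimics the behaviour seen above: a default name of "latest", independent named entries, and a RuntimeError unless `overwrite=True` is passed.

```python
class ToyModelFile:
    """In-memory stand-in for one model file: maps model names to objects."""

    def __init__(self):
        self._models = {}

    def save(self, obj, model_name='latest', overwrite=False):
        # refuse to replace an existing entry unless explicitly asked to
        if model_name in self._models and not overwrite:
            raise RuntimeError(
                'model "{}" already exists. Either use overwrite=True, '
                'or use a different name/file.'.format(model_name))
        self._models[model_name] = obj

    def load(self, model_name='latest'):
        return self._models[model_name]

    def list_models(self):
        return sorted(self._models)


store = ToyModelFile()
store.save('MSM with lag=10')                        # stored as "latest"
store.save('MSM with lag=100', model_name='lag100')  # separate entry
try:
    store.save('MSM with lag=100')                   # "latest" is taken
except RuntimeError as e:
    print('can not save:', e)
store.save('MSM with lag=100', overwrite=True)       # replaces "latest"
print(store.list_models())
```

Strings stand in for the estimators here so that the sketch runs without PyEMMA installed.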


This concludes the storage tutorial of PyEMMA. Happy saving!