
Serialization of data #164

Closed
jabooth opened this issue Dec 4, 2013 · 4 comments

jabooth (Member) commented Dec 4, 2013

Right now there is no good way to serialize data in PyBug. You can manually strip the numpy data out of objects and save it to .mat files, but this is poorly understood and cumbersome.

In the long term it would be nice to develop a way to save models (which can be computationally very expensive to build) out to disk for rapid research. In the short term we should provide some guidance on how to achieve some form of serialization manually.

patricksnape (Contributor) commented:

Update: essentially, we need to override the appropriate __copy__ and pickling (__getstate__/__setstate__) methods so that copying/pickling works correctly. This will also help with backwards compatibility if our objects only pickle their important information (e.g. the pixel data).

The remaining decision is how we want to standardize saving our data out (pickle/HDF5/something else).
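
For illustration, a minimal sketch of the pickling half of this, assuming a hypothetical Image-like class whose only important state is its pixels array (this is not Menpo's actual implementation):

import numpy as np


class Image(object):
    def __init__(self, pixels):
        self.pixels = pixels
        self._cache = {}  # expensive derived data we don't want on disk

    def __getstate__(self):
        # only pickle the important information
        return {'pixels': self.pixels}

    def __setstate__(self, state):
        self.pixels = state['pixels']
        self._cache = {}  # rebuilt lazily after loading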

patricksnape (Contributor) commented:

@jabooth has some really nice ideas about this using HDF5.

jabooth (Member, Author) commented Aug 26, 2014

Yeah, we definitely need to re-address this shortly. Let me try and lay out the potential approach we could go for in Menpo.

#398 introduced Copyable - a simple interface that allows for efficient copying of Menpo datatypes in a consistent manner. In brief:

from copy import deepcopy

copy_of_a = a.copy()          # a is any Copyable Menpo object
deepcopy_of_a = deepcopy(a)
deepcopy_of_a == copy_of_a    # both routes produce equivalent results

.copy() is much faster than deepcopy() on large collections of small objects (I encourage you to refresh your memory on the approach taken in this PR - it's pretty simple once you grasp it!). This change, for instance, sped up GeneralizedProcrustesAnalysis by 33%. On large, simple objects (e.g. an image with a single very large numpy array) the difference is negligible. However, it's convenient that the method is recursive - calling .copy() on a Copyable type will .copy() all its instance attributes - so we use it everywhere to keep things simple, reliable, and fast.
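
For reference, a rough sketch of the kind of recursive copy #398 describes (not the exact Menpo code):

class Copyable(object):

    def copy(self):
        # Bypass __init__ entirely, then copy each instance attribute,
        # recursing into anything that itself exposes a .copy() method
        # (other Copyable objects, numpy arrays, dicts, ...).
        new = self.__class__.__new__(self.__class__)
        for k, v in self.__dict__.items():
            new.__dict__[k] = v.copy() if hasattr(v, 'copy') else v
        return new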

The problem of serialisation is very similar to the problem of copying - one just needs to happen in memory, whilst the other needs to write the 'copy' out to disk. Both carry the same penalty for 'forgetting' to copy/serialise a piece of data - everything breaks. Because of this, it makes sense to unify our approach to copying and serialisation - in the majority of cases, every item on an object that needs to be .copy()'ed should also be .serialize()'ed.

Generic serialisation is tricky

Writing a serialisation technique that is powerful enough to serialise any valid Python code is very tricky. That's what the pickle module tries to do, and even then it's not sufficient (it can't deal with closures/lambdas and is fragile to code changes), and the ways around this (dill) cause all sorts of problems with IPython.

Because of this, I think we should avoid using Pickle or related technologies for Menpo.

Desirable traits in a serialisation solution

Simplicity of format

Ideally, it would be possible to inspect the serialised data in a direct way and interpret it.

Matlab support

Without question, Menpo-produced models will need to be usable in Matlab. Rather than write a custom parser for each type that we want to 'export' to Matlab, it would be nice if the export format was natively readable in Matlab, so other users can at least get access to the raw arrays we are saving out.

Ability to save arbitrary Menpo collections

It ideally should be possible to save out a nested Python data structure of Menpo types. For example:

tosave = {}
tosave['AAM'] = my_aam  # a single menpo type
tosave['initial_images'] = images  # python list of Menpo type

save(tosave, 'myfile.lol')  # should save out this dict
loaded = load('myfile.lol')
if loaded == tosave:
    print("we succeeded!")

Doesn't need to be hand-coded for each Menpo type

We don't want to write serialize() or some equivalent on each of our classes - that is asking for trouble with bugs. Just like we did in #398, we need an approach that is low maintenance and simple.

The proposed solution - HDF5

HDF5 is an industry-standard array container file format. When you open an HDF5 file, you see an internal 'directory structure' just like the UNIX file system (/), and you are free to add what are effectively files and folders of array data in any structure you want. HDF5 is:

  • Natively supported by Matlab. Newer .mat files are just .hdf5 files of a particular structure, and you can load a general .hdf5 file in any recent version of Matlab.
  • Well supported in Python. See h5py (and the sketch after this list).
  • Fast. It's designed for raw blocks of C-style arrays, so reading and writing numpy arrays to and from .hdf5 files is very performant.
  • Flexible. It's actually possible to read in a subset of a large array from a HDF5 file. For instance, a very large (say 60GB) model could be saved to a .hdf5 file and stored on a shared drive. Many independent workers could read the file at once - each one only loading a small subset (say 4GB) of the huge array.
  • Familiar. Because the internal structure of any HDF5 file so strongly resembles the UNIX filesystem, it's natural to open an HDF5 file in any of the existing helper tools, such as HDFView, and just browse through the structure looking for data. Even if we don't write dedicated Matlab 'loaders' for our data format, you could still give an HDF5 file to someone for use in Matlab and they would manage - they would just have to look through the provided file for the array they need.
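
As a concrete illustration of the points above, here is a minimal h5py sketch (the layout shown is purely illustrative, not a proposed Menpo standard):

import h5py
import numpy as np

# Write arrays into a filesystem-like hierarchy of groups ('folders') and
# datasets ('files'). Intermediate groups are created automatically.
with h5py.File('model.hdf5', 'w') as f:
    f.create_dataset('aam/shape_model/components', data=np.random.rand(10, 128))
    f.create_dataset('initial_images/0/pixels', data=np.random.rand(3, 64, 64))

# Read back only a slice of a dataset - the rest stays on disk.
with h5py.File('model.hdf5', 'r') as f:
    first_two = f['aam/shape_model/components'][:2]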

Due to all the reasons above, and due to the fact that all data in Menpo is essentially just collections of numpy arrays, I propose that we develop a simple standard for serialising collections of Menpo types to an HDF5 file, and then a standard approach for reading such a file back in to rebuild the Menpo data structures. I am not ready to propose the full solution yet, but here are a few comments:

  1. The 'rebuilding' step will look a lot like how .copy() works - i.e. we will __new__ up objects and manually fill in their dictionaries (see the sketch after this list). We will in some way re-use the Copyable interface so that we have a kind of guarantee: if the type copies well, it will serialise/de-serialise well.
  2. The approach will not work for circular references. We already have this limitation in Copyable, so in a sense it's nice to be able to benefit from it here.
  3. Unlike pickle, the solution will be resilient to code changes. If we change the name of an attribute in a class, it will just mean that we need to specialise its serialization method. That isn't such a big problem, and it comes with the massive advantage that we can support the format well into the future.
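
To make item 1 concrete, a hypothetical sketch of the rebuilding step (no such helper exists in Menpo yet, and it assumes for simplicity that every attribute stored under a group is an array dataset):

def rebuild(cls, group):
    # Bypass __init__, exactly as Copyable's .copy() does, and repopulate
    # the instance dictionary from the datasets stored under an HDF5 group.
    obj = cls.__new__(cls)
    for name, dataset in group.items():
        obj.__dict__[name] = dataset[...]  # load the array into memory
    return obj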

I need this implemented in the fairly short term for GOSH work, so I'm going to give it some thought this week. The biggest open question is how we deal with any place where we have functional injection (namely, features passed into the fit framework). These objects will require some kind of strategy for serialisation. (This isn't a problem with Copyable, by the way, as functions are immutable types.) Solutions include encoding the raw source code of the function in the file (less crazy than it sounds in a language as dynamic as Python) or some form of wrapper that records both the function and the arguments passed to it.
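
One hedged possibility for the source-code-encoding idea, sketched with hypothetical helper names (serialize_function/deserialize_function are not real Menpo functions):

import inspect


def serialize_function(f):
    # Store the raw source text of the feature function in the file.
    # May fail for lambdas or functions defined interactively.
    return inspect.getsource(f)


def deserialize_function(source, name):
    # Rebuild the function by executing its source in a fresh namespace.
    # Caution: this executes arbitrary code read from the file.
    namespace = {}
    exec(source, namespace)
    return namespace[name]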

jabooth (Member, Author) commented Oct 2, 2014

closed by #427
