
Serialization of data #164

Closed
jabooth opened this issue Dec 4, 2013 · 4 comments

jabooth (Member) commented Dec 4, 2013

Right now there is no good way to serialize data in PyBug. You can manually strip the numpy data out of objects and save it to .mat files, but this is poorly understood and cumbersome.

In the long term it would be nice to develop a way to save models (which can be computationally very expensive to build) out to disk for rapid research. In the short term we should provide some guidance on how to achieve some form of serialization manually.

patricksnape (Contributor) commented:

Update: essentially, we need to override the appropriate __copy__ and pickling (__getstate__/__setstate__) methods so that copying/pickling works correctly. This will also help with backwards compatibility if our objects only pickle their important information (e.g. the pixel data).

The remaining decision is how we want to standardize saving our data out (pickle/HDF5/something else).
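
For illustration, a minimal sketch of the pickling half of this, assuming a hypothetical Image-like class whose only important state is its pixels array (this is not Menpo's actual implementation):

import numpy as np


class Image(object):
    def __init__(self, pixels):
        self.pixels = pixels
        self._cache = {}  # expensive derived data we don't want on disk

    def __getstate__(self):
        # only pickle the important information
        return {'pixels': self.pixels}

    def __setstate__(self, state):
        self.pixels = state['pixels']
        self._cache = {}  # rebuilt lazily after loading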

patricksnape (Contributor) commented:

@jabooth has some really nice ideas about this using HDF5.

jabooth (Member, Author) commented Aug 26, 2014

Yeah, we definitely need to re-address this shortly. Let me try and lay out the potential approach we could go for in Menpo.

#398 introduced Copyable - a simple interface that allows for efficient copying of Menpo datatypes in a consistent manner. In brief:

from copy import deepcopy

copy_of_a = a.copy()          # a is any Copyable Menpo object
deepcopy_of_a = deepcopy(a)
deepcopy_of_a == copy_of_a    # both routes produce equivalent results

.copy() is much faster than deepcopy() on large collections of small objects (I encourage you to refresh your memory on the approach taken in this PR - it's pretty simple once you grasp it!). This change, for instance, sped up GeneralizedProcrustesAnalysis by 33%. On large, simple objects (e.g. an image with a single very large numpy array) the difference is negligible. However, it's convenient that the method is recursive - calling .copy() on a Copyable type will .copy() all its instance attributes - so we use it everywhere to keep things simple, reliable, and fast.
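
For reference, a rough sketch of the kind of recursive copy #398 describes (not the exact Menpo code):

class Copyable(object):

    def copy(self):
        # Bypass __init__ entirely, then copy each instance attribute,
        # recursing into anything that itself exposes a .copy() method
        # (other Copyable objects, numpy arrays, dicts, ...).
        new = self.__class__.__new__(self.__class__)
        for k, v in self.__dict__.items():
            new.__dict__[k] = v.copy() if hasattr(v, 'copy') else v
        return new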

The problem of serialisation is very similar to the problem of copying - one just needs to happen in memory, whilst the other needs to write the 'copy' out to disk. Both carry the same penalty for 'forgetting' to copy/serialise a piece of data - everything breaks. Because of this, it makes sense to unify our approach to copying and serialisation - in the majority of cases, every item on an object that needs to be .copy()'ed should also be .serialize()'ed.

Generic serialisation is tricky

Writing a serialisation technique that is powerful enough to serialise any valid Python code is very tricky. That's what the pickle module tries to do, and even then it's not sufficient (it can't deal with closures/lambdas and is fragile to code changes), and the ways around this (dill) cause all sorts of problems with IPython.

Because of this, I think we should avoid using Pickle or related technologies for Menpo.

Desirable traits in a serialisation solution

Simplicity of format

Ideally, it would be possible to inspect the serialised data in a direct way and interpret it.

Matlab support

Without question, Menpo-produced models will need to be usable in Matlab. Rather than write a custom parser for each type that we want to 'export' to Matlab, it would be nice if the export format was natively readable in Matlab, so other users can at least get access to the raw arrays we are saving out.

Ability to save arbitrary Menpo collections

It ideally should be possible to save out a nested Python data structure of Menpo types. For example:

tosave = {}
tosave['AAM'] = my_aam  # a single menpo type
tosave['initial_images'] = images  # python list of Menpo type

save(tosave, 'myfile.lol')  # should save out this dict
loaded = load('myfile.lol')
if loaded == tosave:
    print("we succeeded!")

Doesn't need to be hand-coded for each Menpo type

We don't want to write serialize() or some equivalent on each of our classes - that is asking for trouble with bugs. Just like we did in #398, we need an approach that is low maintenance and simple.

The proposed solution - HDF5

HDF5 is an industry-standard array container file format. When you open an HDF5 file, you see an internal 'directory structure' just like the UNIX file system (/), and you are free to add what are effectively files and folders of array data in any structure you want. HDF5 is:

  • Natively supported by Matlab. Newer .mat files are just .hdf5 files of a particular structure, and you can load a general .hdf5 file in any recent version of Matlab.
  • Well supported in Python. See h5py (and the sketch after this list).
  • Fast. It's designed for raw blocks of C-style arrays, so reading and writing numpy arrays to and from .hdf5 files is very performant.
  • Flexible. It's actually possible to read in a subset of a large array from a HDF5 file. For instance, a very large (say 60GB) model could be saved to a .hdf5 file and stored on a shared drive. Many independent workers could read the file at once - each one only loading a small subset (say 4GB) of the huge array.
  • Familiar. Because the internal structure of any HDF5 file so strongly resembles the UNIX filesystem, it's natural to open an HDF5 file in any of the existing helper tools, such as HDFView, and just browse through the structure looking for data. Even if we don't write dedicated Matlab 'loaders' for our data format, you could still give an HDF5 file to someone for use in Matlab and they would manage - they would just have to look through the provided file for the array they need.
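
As a concrete illustration of the points above, here is a minimal h5py sketch (the layout shown is purely illustrative, not a proposed Menpo standard):

import h5py
import numpy as np

# Write arrays into a filesystem-like hierarchy of groups ('folders') and
# datasets ('files'). Intermediate groups are created automatically.
with h5py.File('model.hdf5', 'w') as f:
    f.create_dataset('aam/shape_model/components', data=np.random.rand(10, 128))
    f.create_dataset('initial_images/0/pixels', data=np.random.rand(3, 64, 64))

# Read back only a slice of a dataset - the rest stays on disk.
with h5py.File('model.hdf5', 'r') as f:
    first_two = f['aam/shape_model/components'][:2]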

Due to all the reasons above, and due to the fact that all data in Menpo is essentially just collections of numpy arrays, I propose that we develop a simple standard for serialising collections of Menpo types to an HDF5 file, and then a standard approach for reading such a file back in to rebuild the Menpo data structures. I am not ready to propose the full solution yet, but here are a few comments:

  1. The 'rebuilding' step will look a lot like how .copy() works - i.e. we will __new__ up objects and manually fill in their dictionaries (see the sketch after this list). We will in some way re-use the Copyable interface so that we have a kind of guarantee: if the type copies well, it will serialise/de-serialise well.
  2. The approach will not work for circular references. We already have this limitation in Copyable, so in a sense it's nice to be able to benefit from it here.
  3. Unlike pickle, the solution will be resilient to code changes. If we change the name of an attribute in a class, it will just mean that we need to specialise its serialization method. That isn't such a big problem, and it comes with the massive advantage that we can support the format well into the future.
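
To make item 1 concrete, a hypothetical sketch of the rebuilding step (no such helper exists in Menpo yet, and it assumes for simplicity that every attribute stored under a group is an array dataset):

def rebuild(cls, group):
    # Bypass __init__, exactly as Copyable's .copy() does, and repopulate
    # the instance dictionary from the datasets stored under an HDF5 group.
    obj = cls.__new__(cls)
    for name, dataset in group.items():
        obj.__dict__[name] = dataset[...]  # load the array into memory
    return obj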

I need this implemented in the fairly short term for GOSH work, so I'm going to give it some thought this week. The biggest open question is how we deal with any place where we have functional injection (namely, features passed into the fit framework). These objects will require some kind of strategy for serialisation. (This isn't a problem with Copyable, by the way, as functions are immutable types.) Solutions include encoding the raw source code of the function in the file (less crazy than it sounds in a language as dynamic as Python) or some form of wrapper that records both the function and the arguments passed to it.
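
One hedged possibility for the source-code-encoding idea, sketched with hypothetical helper names (serialize_function/deserialize_function are not real Menpo functions):

import inspect


def serialize_function(f):
    # Store the raw source text of the feature function in the file.
    # May fail for lambdas or functions defined interactively.
    return inspect.getsource(f)


def deserialize_function(source, name):
    # Rebuild the function by executing its source in a fresh namespace.
    # Caution: this executes arbitrary code read from the file.
    namespace = {}
    exec(source, namespace)
    return namespace[name]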

jabooth (Member, Author) commented Oct 2, 2014

closed by #427
