Serialization of data #164
Update: essentially what needs doing is overriding the correct methods; the final decision is in what way we want to standardize saving our data out (pickle/HDF5/something else).
@jabooth has some really nice ideas about this using HDF5.
Yeah, we definitely need to re-address this shortly. Let me try and lay out the potential approach we could go for in Menpo. #398 introduced copying:

```python
from copy import deepcopy

copy_of_a = a.copy()
deepcopy_of_a = deepcopy(a)
deepcopy_of_a == copy_of_a
```

With copying solved, note that the problem of serialisation is very similar to the problem of copying - just one needs to happen in memory, whilst one needs to write the 'copy' out to disk. Both have the same penalty for 'forgetting' to copy/serialise a piece of data - everything breaks. Because of this, it makes sense to unify our approach to copying and serialisation - in the majority of cases, every item on an object that needs to be copied also needs to be serialised.

### Generic serialisation is tricky

Writing a serialisation technique that is powerful enough to serialise any valid Python code is very tricky. That's what the pickle module tries to do, and even then it's not sufficient (it can't deal with closures/lambdas and is fragile to code changes), and the ways around it (dill) cause all sorts of problems with IPython. Because of this, I think we should avoid using pickle or related technologies for Menpo.

### Desirable traits in a serialisation solution

#### Simplicity of format

Ideally, it would be possible to inspect the serialised data in a direct way and interpret it.

#### Matlab support

Without question, Menpo-produced models will need to be usable in Matlab. Rather than write a custom parser for each type that we want to 'export' to Matlab, it would be nice if the exporting format was natively readable in Matlab, so other users can at least get access to the raw arrays we are saving out.

#### Ability to save arbitrary Menpo collections

It should ideally be possible to save out a nested Python data structure of Menpo types. For example:

```python
tosave = {}
tosave['AAM'] = my_aam             # a single Menpo type
tosave['initial_images'] = images  # a Python list of Menpo types
save(tosave, 'myfile.lol') # should save out this dict
loaded = load('myfile.lol')
if loaded == tosave:
print("we succeeded!") Doesn't need to be hand-coded for each Menpo typeWe don't want to write The proposed solution - HDF5HDF5 is an industry standard array container file format. When you open a HDF5 file up, you see an internal 'directory structure' that is just like the UNIX file system (
Due to all the reasons above, and due to the fact that all data in Menpo is essentially just collections of numpy arrays, I propose that we develop a simple standard for serialising collections of Menpo types to an HDF5 file, and then a standard approach for reading in such a file to rebuild the Menpo data structures. I am not ready to propose the full solution yet, but here are a few comments:
I need this implemented in the fairly short term for GOSH work, so I'm going to give it some thought this week. The biggest open question is how we deal with any place where we have functional injection (namely, features into the fit framework). These objects will require some kind of strategy for serialisation (this isn't a problem with copying, as function references can simply be shared in memory).
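One possible strategy (a sketch only, not a settled design - the `serialise_function`/`deserialise_function` helpers below are hypothetical) is to store importable functions by their module and name and resolve them again at load time. This covers module-level feature functions, though by construction not closures or lambdas:

```python
import importlib

def serialise_function(f):
    # Record an importable path instead of the function object itself.
    return {'module': f.__module__, 'name': f.__name__}

def deserialise_function(d):
    # Re-import the module and look the function back up by name.
    return getattr(importlib.import_module(d['module']), d['name'])

# e.g. for a module-level feature function 'some_feature':
#   serialise_function(some_feature)
#   -> {'module': 'mypackage.features', 'name': 'some_feature'}
```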
Closed by #427.
Right now there is no good way to serialize data in PyBug. You can manually strip numpy data out of objects and save it to .mat files, but this is not well understood and is cumbersome.
In the long term it would be nice to develop a way to save out models (which can be very computationally expensive to build) to disk for rapid research. In the short term we should provide some guidance on how to achieve some form of serialization manually.
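As a stop-gap, manual serialisation along these lines is possible today. This is a sketch using scipy; the stand-in arrays would in practice be pulled off a built model:

```python
import numpy as np
from scipy.io import savemat, loadmat

# Manually strip the numpy arrays out of an object and save them to a .mat file
# (stand-in arrays here; in practice these come from a trained model).
arrays = {
    'shape_components': np.random.rand(10, 136),
    'mean_shape': np.random.rand(136)
}
savemat('model.mat', arrays)

# Later (or from Matlab via load('model.mat')) the raw arrays are recoverable,
# but the surrounding objects have to be rebuilt by hand.
restored = loadmat('model.mat')
print(restored['shape_components'].shape)  # -> (10, 136)
```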