<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Saving-and-restoring-data-with-Python" data-toc-modified-id="Saving-and-restoring-data-with-Python-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Saving and restoring data with Python</a></span><ul class="toc-item"><li><span><a href="#np.save" data-toc-modified-id="np.save-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><code>np.save</code></a></span></li><li><span><a href="#MacOSFile" data-toc-modified-id="MacOSFile-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><code>MacOSFile</code></a></span></li><li><span><a href="#np.savez" data-toc-modified-id="np.savez-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span><code>np.savez</code></a></span></li><li><span><a href="#h5py" data-toc-modified-id="h5py-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>h5py</a></span><ul class="toc-item"><li><span><a href="#h5py" data-toc-modified-id="h5py-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span><code>h5py</code></a></span></li><li><span><a href="#DeepDish" data-toc-modified-id="DeepDish-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>DeepDish</a></span><ul class="toc-item"><li><span><a href="#Dealing-with-large-datasets:" data-toc-modified-id="Dealing-with-large-datasets:-1.4.2.1"><span class="toc-item-num">1.4.2.1&nbsp;&nbsp;</span>Dealing with large datasets:</a></span></li></ul></li></ul></li></ul></li></ul></div>

**@juliaroquette: In this notebook I present some ways of saving data in Python formats. Here, by Python format I mean saving arrays, dictionaries, lists, etc. inside a single file.**

**- last modified: 17May19**

# Saving and restoring data with Python

In [None]:
import numpy as np

Suppose we have a dictionary of dictionaries:

In [2]:
d1={"a":np.random.random(size=100),"b":np.random.random(size=100),"c":np.random.random(size=100)}
d2={"a":np.random.random(size=100),"b":np.random.random(size=100),"c":np.random.random(size=100)}
d3={"a":np.random.random(size=100),"b":np.random.random(size=100),"c":np.random.random(size=100)}
D={"d1":d1,"d2":d2,"d3":d3}
D.keys()

dict_keys(['d1', 'd2', 'd3'])

And a second version of it, which includes one very large object (an 8 GB `bytearray`):

In [3]:
L={"d1":d1,"d2":d2,"d3":d3,"d4":bytearray(8 * 1000 * 1000 * 1000)}
L.keys()

dict_keys(['d1', 'd2', 'd3', 'd4'])

## `np.save`
Documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.save.html#numpy.save

To save:

In [5]:
np.save('save',D)

In [6]:
%timeit np.save('save',D)

466 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


To restore:

In [7]:
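D1=np.load('save.npy', allow_pickle=True)  # allow_pickle=True is required on newer NumPy versions, where it defaults to False

In [8]:
%timeit D1=np.load('save.npy', allow_pickle=True)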

317 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Problem: it doesn't restore the dictionary with its original structure. `np.load` returns the dictionary wrapped in a 0-dimensional object array, so one cannot use:

In [9]:
D1.keys()

AttributeError: 'numpy.ndarray' object has no attribute 'keys'

But `.item()` extracts the dictionary from the array:

In [10]:
D1.item().keys()

dict_keys(['d1', 'd2', 'd3'])
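
The behaviour above can be reproduced in isolation. The following is a minimal sketch, assuming a recent NumPy (where `allow_pickle=True` has to be passed explicitly to `np.load`). The file name `demo_save.npy` and the small dictionary `demo` are placeholders and not part of the original notebook.

```python
import os
import tempfile

import numpy as np

# A small nested dictionary, similar in shape to D above (placeholder data).
demo = {"d1": {"a": np.arange(3)}, "d2": {"a": np.ones(2)}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_save.npy")  # hypothetical file name
    np.save(path, demo)                        # the dict is wrapped in a 0-d object array and pickled
    loaded = np.load(path, allow_pickle=True)  # returns that 0-d object array, not a dict

wrapped_shape = loaded.shape          # empty tuple for a 0-d array
restored = loaded.item()              # .item() unwraps the original dictionary
restored_keys = sorted(restored.keys())
```

Calling `.item()` once right after loading and keeping only `restored` avoids having to repeat it at every access.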

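Major issue: in Python 3, `np.save` crashes for very large objects. Since the dictionary is not a plain numeric array, `np.save` stores it through `pickle`, and that is where the error comes from. For example, if we try to save the large version: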

In [11]:
np.save('saveL',L)

OverflowError: cannot serialize a string larger than 4GiB

## `MacOSFile`

StackOverflow presents a workaround for this problem using `pickle` directly: https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb

It was implemented by creating a class, following that answer, saved in `.../work/prog/MacOSFile`
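
The contents of `MacOSFile` are not shown in this notebook. A minimal sketch of the idea from the linked answer, assuming the module exposes `pickle_dump` and `pickle_load` as they are called in this section, could look as follows. The class name `ChunkedFile`, the `chunk_size` parameter and its 1 GiB default are assumptions made for illustration, and the progress printing is left out.

```python
import pickle


class ChunkedFile:
    """Wrap a binary file so that large read() and write() calls are split into chunks."""

    def __init__(self, f, chunk_size=2**30):
        self.f = f
        self.chunk_size = chunk_size  # assumed default: 1 GiB per call

    def __getattr__(self, name):
        # Delegate every other attribute (readline, close, ...) to the real file.
        return getattr(self.f, name)

    def read(self, n=-1):
        if n is None or n < 0 or n <= self.chunk_size:
            return self.f.read(n)
        parts = []
        remaining = n
        while remaining > 0:
            part = self.f.read(min(remaining, self.chunk_size))
            if not part:  # reached end of file
                break
            parts.append(part)
            remaining -= len(part)
        return b"".join(parts)

    def write(self, buffer):
        view = memoryview(buffer).cast("B")  # byte-wise view, no copy
        n = view.nbytes
        for start in range(0, n, self.chunk_size):
            self.f.write(view[start:start + self.chunk_size])
        return n


def pickle_dump(obj, file_path, chunk_size=2**30):
    """Pickle obj to file_path, writing at most chunk_size bytes per call."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, ChunkedFile(f, chunk_size), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(file_path, chunk_size=2**30):
    """Load a pickled object from file_path through the chunked wrapper."""
    with open(file_path, "rb") as f:
        return pickle.load(ChunkedFile(f, chunk_size))
```

Using the highest pickle protocol matters here: protocols below 4 cannot store a single object larger than 4 GiB, which is the limit that `np.save` runs into above.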

And that can be used by importing it:

In [12]:
import MacOSFile
MacOSFile.pickle_dump(L,'D1save.npy')

writing total_bytes=7708...
writing bytes [0, 7708)... done.
writing total_bytes=8000000000...
writing bytes [0, 1073741824)... done.
writing bytes [1073741824, 2147483648)... done.
writing bytes [2147483648, 3221225472)... done.
writing bytes [3221225472, 4294967296)... done.
writing bytes [4294967296, 5368709120)... done.
writing bytes [5368709120, 6442450944)... done.
writing bytes [6442450944, 7516192768)... done.
writing bytes [7516192768, 8000000000)... done.
writing total_bytes=16...
writing bytes [0, 16)... done.


In [13]:
%timeit MacOSFile.pickle_dump(L,'D1save.npy')

writing total_bytes=7708...
writing bytes [0, 7708)... done.
writing total_bytes=8000000000...
writing bytes [0, 1073741824)... done.
writing bytes [1073741824, 2147483648)... done.
writing bytes [2147483648, 3221225472)... done.
writing bytes [3221225472, 4294967296)... done.
writing bytes [4294967296, 5368709120)... done.
writing bytes [5368709120, 6442450944)... done.
writing bytes [6442450944, 7516192768)... done.
writing bytes [7516192768, 8000000000)... done.
writing total_bytes=16...
writing bytes [0, 16)... done.
[same log repeated for each timing run; output truncated]

In [14]:
L1=MacOSFile.pickle_load('D1save.npy')

In [15]:
%timeit MacOSFile.pickle_load('D1save.npy')

59.2 s ± 1.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


And it does solve the problem of restoring the dictionaries:

In [16]:
L1.keys()

dict_keys(['d1', 'd2', 'd3', 'd4'])

## `np.savez`
Documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.savez.html

To save:

In [17]:
np.savez('savez',D)

In [18]:
%timeit np.savez('savez',D)

1.34 ms ± 671 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


To open:

In [19]:
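D2=np.load('savez.npz', allow_pickle=True)  # needed on newer NumPy versions to read the pickled 'arr_0' entry

In [20]:
%timeit np.load('savez.npz', allow_pickle=True)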

82.7 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [21]:
D2.files

['arr_0']

In [22]:
D2d=D2['arr_0']

In [23]:
D2d.keys()

AttributeError: 'numpy.ndarray' object has no attribute 'keys'
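
`np.savez` is designed for several named arrays rather than for one nested object. A minimal sketch, assuming the nested dictionary can be flattened into keys such as `d1__a`, stores every array natively, so no pickling is involved. The `__` separator, the helper names `flatten` and `unflatten`, and the file name are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def flatten(nested, sep="__"):
    """Turn {"d1": {"a": arr}} into {"d1__a": arr} (one level of nesting)."""
    return {f"{outer}{sep}{inner}": arr
            for outer, sub in nested.items()
            for inner, arr in sub.items()}


def unflatten(flat, sep="__"):
    """Inverse of flatten: rebuild the two-level dictionary."""
    nested = {}
    for key, arr in flat.items():
        outer, inner = key.split(sep, 1)
        nested.setdefault(outer, {})[inner] = arr
    return nested


# Placeholder data with the same two-level layout as D.
demo = {"d1": {"a": np.arange(3.0), "b": np.zeros(2)}, "d2": {"a": np.ones(4)}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_savez.npz")  # hypothetical file name
    np.savez(path, **flatten(demo))             # each array is stored under its own name
    with np.load(path) as archive:              # plain arrays only, so no allow_pickle
        restored = unflatten({name: archive[name] for name in archive.files})
```

This only works when the keys never contain the separator and every leaf is an array.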

It also doesn't work for large datasets:

In [26]:
np.savez('savez',L)

OverflowError: cannot serialize a string larger than 4GiB

## h5py

Another workaround found on StackOverflow (https://stackoverflow.com/questions/30811918/saving-dictionary-of-numpy-arrays/30812215) is to use the `HDF5` format (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).


I found two `Python` packages for that and will compare them.

Some discussion about the two formats: http://tdeboissiere.github.io/h5py-vs-npz.html

### `h5py`
(Documentation: http://docs.h5py.org/en/latest/index.html) It looks to me like a low-level package for working with HDF5, a bit like using only Python's built-in functions to write an ASCII file.
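
To give an idea of what "low-level" means here, the following is a minimal sketch of saving and restoring a dictionary of dictionaries with `h5py`, assuming every leaf is a NumPy array. The helper names `save_nested` and `load_nested` and the file name `demo.h5` are hypothetical. Each inner dictionary becomes an HDF5 group and each array a dataset.

```python
import os
import tempfile

import h5py
import numpy as np


def save_nested(group, nested):
    """Write a (possibly nested) dictionary of arrays into an HDF5 group."""
    for key, value in nested.items():
        if isinstance(value, dict):
            save_nested(group.create_group(key), value)  # sub-dictionary -> sub-group
        else:
            group.create_dataset(key, data=value)        # array -> dataset


def load_nested(group):
    """Read an HDF5 group back into a plain dictionary of NumPy arrays."""
    return {key: load_nested(item) if isinstance(item, h5py.Group) else item[()]
            for key, item in group.items()}


# Placeholder data with the same two-level layout as D.
demo = {"d1": {"a": np.arange(3.0), "b": np.zeros(2)}, "d2": {"a": np.ones(4)}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        save_nested(f, demo)
    with h5py.File(path, "r") as f:
        restored = load_nested(f)        # item[()] reads each dataset fully into memory
```

The recursion has to be written by hand, which is what packages such as `deepdish` do on the user's behalf.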

### DeepDish

(http://deepdish.readthedocs.io/en/latest/tutorials.html)
(https://github.com/uchicago-cs/deepdish) 

`deepdish` is a "series of data science tools from the University of Chicago" that includes a `save/load` function similar to `np.save` or `pickle` and works with the `HDF5` format.


In [29]:
!pip install deepdish

Collecting deepdish
  Using cached https://files.pythonhosted.org/packages/6e/39/2a47c852651982bc5eb39212ac110284dd20126bdc7b49bde401a0139f5d/deepdish-0.3.6-py2.py3-none-any.whl
Installing collected packages: deepdish
Successfully installed deepdish-0.3.6


In [30]:
import deepdish as dd

In [31]:
dd.io.save('saveDdd.h5',D)  # no extension is appended automatically, so give the full file name

In [32]:
%timeit dd.io.save('saveDdd.h5',D)

10.3 ms ± 466 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [33]:
Ddd=dd.io.load('saveDdd.h5')

In [None]:
%timeit dd.io.load('saveDdd.h5')

In [None]:
Ddd.keys()

The `deepdish` package deals well with the dictionary structure, but it crashes for datasets larger than 4GB.  

##### Some comparison:


The larger dataset (`L1`) saved with `MacOSFile` takes 7.5G.

The smaller dataset saved with `np.save` takes 12K.

The smaller dataset saved with `np.savez` takes 12K.


A medium dataset saved with `MacOSFile` takes 1.6G; it took 19.1 s to save and 9.96 s to re-open.

A medium dataset saved with `deepdish` takes 1.6G; it took 83.8 s to save and 0.186 s to re-open.
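
Numbers like the ones above can be collected with a small helper. A minimal sketch, using plain `pickle` as a stand-in for the save/load pair being measured: the names `time_roundtrip`, `pickle_save` and `pickle_read` are hypothetical, and any of the functions used in this notebook could be passed in instead.

```python
import os
import pickle
import tempfile
import time


def time_roundtrip(save, load, obj, path):
    """Return (seconds to save, seconds to load, file size in bytes)."""
    t0 = time.perf_counter()
    save(obj, path)
    t1 = time.perf_counter()
    load(path)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, os.path.getsize(path)


def pickle_save(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def pickle_read(path):
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    save_s, load_s, size_bytes = time_roundtrip(
        pickle_save, pickle_read,
        {"a": list(range(1000))},        # placeholder payload
        os.path.join(tmp, "demo.pkl"),   # hypothetical file name
    )
```

A single run like this is noisier than `%timeit`, but it is practical for files that take tens of seconds to write.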

#### Dealing with large datasets:

`deepdish` seems to be the fastest option to reload datasets, but as long as `deepdish` is unable to deal with files larger than 2GB, the best alternative is to keep using the `MacOSFile` module.