<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Saving-and-restoring-data-with-Python" data-toc-modified-id="Saving-and-restoring-data-with-Python-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Saving and restoring data with Python</a></span><ul class="toc-item"><li><span><a href="#np.save" data-toc-modified-id="np.save-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><code>np.save</code></a></span></li><li><span><a href="#MacOSFile" data-toc-modified-id="MacOSFile-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><code>MacOSFile</code></a></span></li><li><span><a href="#np.savez" data-toc-modified-id="np.savez-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span><code>np.savez</code></a></span></li><li><span><a href="#h5py" data-toc-modified-id="h5py-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>h5py</a></span><ul class="toc-item"><li><span><a href="#h5py" data-toc-modified-id="h5py-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span><code>h5py</code></a></span></li><li><span><a href="#DeepDish" data-toc-modified-id="DeepDish-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>DeepDish</a></span><ul class="toc-item"><li><span><a href="#Dealing-with-large-datasets:" data-toc-modified-id="Dealing-with-large-datasets:-1.4.2.1"><span class="toc-item-num">1.4.2.1&nbsp;&nbsp;</span>Dealing with large datasets:</a></span></li></ul></li></ul></li></ul></li></ul></div>

**@juliaroquette: In this notebook I present some ways of saving data in Python formats. Here, by Python format I mean saving arrays, dictionaries, lists, etc. inside a single file.**

**- last modified: 17May19**

# Saving and restoring data with Python

In [3]:
import numpy as np
import time

Suppose we have a dictionary of dictionaries:

In [4]:
d1={"a":np.random.random(size=100),"b":np.random.random(size=100),"c":np.random.random(size=100)}
d2={"a":np.random.random(size=100),"b":np.random.random(size=100),"c":np.random.random(size=100)}
d3={"a":np.random.random(size=100),"b":np.random.random(size=100),"c":np.random.random(size=100)}
D={"d1":d1,"d2":d2,"d3":d3}
D.keys()

dict_keys(['d1', 'd2', 'd3'])

And a second version of it, which includes a very large array:

In [14]:
L={"d1":d1,"d2":d2,"d3":d3,"d4":bytearray(8 * 1000 * 1000 * 1000)}
L.keys()

dict_keys(['d1', 'd2', 'd3', 'd4'])

## `np.save`
Documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.save.html#numpy.save

To save:

In [4]:
t0=time.time()
np.save('save',D)
ts=time.time()-t0
print(ts)

0.00414586067199707


To restore:

In [5]:
t0=time.time()
D1=np.load('save.npy',allow_pickle=True)  # allow_pickle=True is required on NumPy >= 1.16.3
tr=time.time()-t0
print(tr)

0.004454135894775391


Problem: `np.load` doesn't restore the dictionary with its original structure; it returns the dictionary wrapped in a NumPy array. For example, one cannot use:

In [6]:
D1.keys()

AttributeError: 'numpy.ndarray' object has no attribute 'keys'

But:

In [7]:
D1.item().keys()

dict_keys(['d1', 'd2', 'd3'])
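
The reason `.item()` is needed is that `np.save` wraps a non-array object such as a dictionary in a zero-dimensional object array and pickles it. A minimal sketch of the round trip follows; the file name and the small dictionary are made up for illustration, and `allow_pickle=True` is assumed to be needed because NumPy 1.16.3 and later disable loading pickled objects by default.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the dictionary of dictionaries used above.
demo = {"d1": {"a": np.arange(3)}, "d2": {"a": np.ones(2)}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npy")  # hypothetical file name
    np.save(path, demo)  # the dict is pickled inside a 0-d object array
    loaded = np.load(path, allow_pickle=True)

print(loaded.shape, loaded.dtype)  # shape and dtype of the wrapper array
restored = loaded.item()           # unwrap the single element: the dict
print(sorted(restored.keys()))
```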

Major issue: in Python 3 (here on macOS), `np.save` crashes for very large objects. For example, if we try to save the large version:

In [6]:
np.save('saveL',L)

OSError: [Errno 22] Invalid argument

## `MacOSFile`

StackOverflow presents a workaround for this problem that uses `pickle` directly: https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb
I implemented it as a class, saved in `.../work/prog/MacOSFile`
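
The idea behind that workaround is to wrap the file object so that no single read or write exceeds a fixed size. The following is only a minimal sketch of that idea, not the actual contents of the `MacOSFile` module: the class name `ChunkedFile` and the `chunk_size` parameter are assumptions introduced here so that the behaviour can be tried on small data.

```python
import os
import pickle
import tempfile


class ChunkedFile:
    """Wrap a binary file so no single read/write exceeds chunk_size bytes."""

    def __init__(self, f, chunk_size=2**30):
        self.f = f
        self.chunk_size = chunk_size

    def write(self, data):
        view = memoryview(data)  # slicing a memoryview does not copy
        for start in range(0, len(view), self.chunk_size):
            self.f.write(view[start:start + self.chunk_size])

    def read(self, n):
        if n <= self.chunk_size:
            return self.f.read(n)
        out = bytearray()
        while len(out) < n:
            piece = self.f.read(min(n - len(out), self.chunk_size))
            if not piece:  # end of file reached early
                break
            out += piece
        return bytes(out)

    def readline(self):
        return self.f.readline()


def pickle_dump(obj, file_path, chunk_size=2**30):
    with open(file_path, "wb") as f:
        # Protocol 4 supports very large objects.
        pickle.dump(obj, ChunkedFile(f, chunk_size), protocol=4)


def pickle_load(file_path, chunk_size=2**30):
    with open(file_path, "rb") as f:
        return pickle.load(ChunkedFile(f, chunk_size))


# Round trip on small data, with a tiny chunk size to force chunking.
original = {"d1": {"a": list(range(1000))}, "d4": bytearray(100_000)}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")  # hypothetical file name
    pickle_dump(original, path, chunk_size=4096)
    restored = pickle_load(path, chunk_size=4096)

print(sorted(restored.keys()))
```

With the default `chunk_size` of 1 GiB, each write stays below the 2 GiB limit that triggers the error above.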

The `MacOSFile` module can then be imported and used:

In [15]:
import MacOSFile
t0=time.time()
MacOSFile.pickle_dump(L,'D1save.npy')
trL=time.time()-t0
print(trL)

writing total_bytes=1600007711...
writing bytes [0, 1073741824)... done.
writing bytes [1073741824, 1600007711)... done.
19.134106159210205


In [16]:
t0=time.time()
L1=MacOSFile.pickle_load('D1save.npy')
trLo=time.time()-t0
print(trLo)

9.956901788711548


And it does solve the problem of restoring the dictionaries:

In [11]:
L1.keys()

dict_keys(['d1', 'd2', 'd3', 'd4'])

## `np.savez`
Documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.savez.html

To save:

In [12]:
t0=time.time()
np.savez('savez',D)
tsz=time.time()-t0
print(tsz)

0.25655317306518555


To open:

In [13]:
t0=time.time()
D2=np.load('savez.npz',allow_pickle=True)  # allow_pickle=True is required on NumPy >= 1.16.3
tzr=time.time()-t0
print(tzr)

0.11436891555786133


In [14]:
D2.files

['arr_0']

In [15]:
D2d=D2['arr_0']

In [16]:
D2d.keys()

AttributeError: 'numpy.ndarray' object has no attribute 'keys'

In [17]:
D2d.items().keys()

AttributeError: 'numpy.ndarray' object has no attribute 'items'
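
As with `np.save`, the dictionary ends up inside a zero-dimensional object array, so `D2d.item().keys()` (with `item`, singular) would return the keys. Alternatively, `np.savez` can store each array under its own name when the arrays are passed as keyword arguments. A minimal sketch, assuming the nested dictionary is first flattened with a made-up `__` separator; the helper names `flatten` and `unflatten` are hypothetical and not part of NumPy:

```python
import os
import tempfile

import numpy as np


def flatten(nested, sep="__"):
    """Turn {"d1": {"a": arr}} into {"d1__a": arr}."""
    return {f"{outer}{sep}{inner}": arr
            for outer, inner_dict in nested.items()
            for inner, arr in inner_dict.items()}


def unflatten(flat, sep="__"):
    """Rebuild the two-level dictionary from the flat keys."""
    nested = {}
    for key, arr in flat.items():
        outer, inner = key.split(sep, 1)
        nested.setdefault(outer, {})[inner] = arr
    return nested


demo = {"d1": {"a": np.arange(3), "b": np.zeros(2)}, "d2": {"a": np.ones(4)}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npz")  # hypothetical file name
    np.savez(path, **flatten(demo))       # one named array per entry
    with np.load(path) as archive:        # plain arrays, so no pickling is involved
        restored = unflatten({name: archive[name] for name in archive.files})

print(sorted(restored.keys()))
```

This only covers dictionaries whose leaves are arrays; it does nothing about the size limit discussed next.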

It also doesn't work for large datasets:

In [18]:
t0=time.time()
np.savez('savez',L)
tzL=time.time()-t0
print(tzL)

OverflowError: cannot serialize a string larger than 4GiB

## h5py

Another workaround found on StackOverflow (https://stackoverflow.com/questions/30811918/saving-dictionary-of-numpy-arrays/30812215) is to use the `HDF5` format (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).


I found two packages in `Python` for that and will compare them.

Some discussion about the two formats: http://tdeboissiere.github.io/h5py-vs-npz.html

### `h5py`
(Documentation: http://docs.h5py.org/en/latest/index.html) It looks to me like a low-level package for working with HDF5, a bit like using only Python's built-in functions to write an ASCII file.
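
To illustrate what working at that level looks like, here is a minimal sketch that stores a nested dictionary as HDF5 groups and datasets. The helper names `save_dict` and `load_dict` are hypothetical, and the sketch assumes every leaf value is a NumPy array.

```python
import os
import tempfile

import h5py
import numpy as np


def save_dict(group, data):
    """Recursively write a nested dict of arrays into an HDF5 group."""
    for key, value in data.items():
        if isinstance(value, dict):
            save_dict(group.create_group(key), value)
        else:
            group.create_dataset(key, data=value)


def load_dict(group):
    """Recursively read an HDF5 group back into a nested dict."""
    return {key: load_dict(item) if isinstance(item, h5py.Group) else item[()]
            for key, item in group.items()}


demo = {"d1": {"a": np.arange(3), "b": np.zeros(2)}, "d2": {"a": np.ones(4)}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        save_dict(f, demo)
    with h5py.File(path, "r") as f:
        restored = load_dict(f)

print(sorted(restored.keys()))
```

The dictionary structure has to be walked by hand in both directions, which is the bookkeeping a higher-level `save`/`load` function hides.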

### DeepDish

(http://deepdish.readthedocs.io/en/latest/tutorials.html)
(https://github.com/uchicago-cs/deepdish) 

DeepDish is a "series of data science tools from the University of Chicago" that includes `save`/`load` functions similar to `np.save` or `pickle` and works with the `HDF5` format.


In [8]:
import deepdish as dd



In [17]:
t0=time.time()
dd.io.save('saveDdd',D)
tDdd=time.time()-t0
print(tDdd)

83.79772400856018


In [18]:
t0=time.time()
Ddd=dd.io.load('saveDdd.h5')
tDddl=time.time()-t0
print(tDddl)

0.1864032745361328


In [23]:
Ddd.keys()

dict_keys(['d1', 'd2', 'd3'])

The `deepdish` package deals well with the dictionary structure, but it crashes for datasets larger than 4GB.  

##### Some comparison:


Larger dataset version of L1 saved with `MacOSFile` has 7.5G.

Smaller dataset version saved with `np.save` has 12K.

Smaller dataset version saved with `np.savez` has 12K.


Medium dataset version saved with `MacOSFile` has 1.6G, took 19.13 s to be saved and 9.96 s to be re-opened.

Medium dataset version saved with `deepdish` has 1.6G, took 83.80 s to be saved and 0.19 s to be re-opened.

#### Dealing with large datasets:

`DeepDish` seems to be the fastest option for reloading datasets, but the conclusion here is that, as long as `DeepDish` is unable to deal with files larger than 2GB, the best alternative is to keep using the `MacOSFile` package.