# HDF5 and other serialization methods

pandas can serialize a DataFrame to several on-disk formats. This notebook tries pickle and HDF5 and compares them with plain CSV.

In [1]:
# Imports and plot styling

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats
import matplotlib.style as mplstyle
%matplotlib inline
mplstyle.use('ggplot')

In [5]:
frame = pd.read_csv('DATEADDED_by_year.csv')

frame.head()

Unnamed: 0,1983-12-31,322
0,1984-12-31,43
1,1985-12-31,203
2,1986-12-31,28
3,1987-12-31,70
4,1988-12-31,55
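The column labels in the output above (`1983-12-31` and `322`) look like a data row rather than a header, which suggests `read_csv` used the first record as column names. Below is a minimal sketch of reading such a file, assuming it really has no header row (the file itself is not shown here). The column names `date_added` and `count` are made up for illustration, and the inline sample only reuses the first rows displayed above.

```python
import io

import pandas as pd

# Stand-in for DATEADDED_by_year.csv, ASSUMING the file has no header row.
sample_csv = io.StringIO(
    "1983-12-31,322\n"
    "1984-12-31,43\n"
    "1985-12-31,203\n"
)

# header=None keeps the first record as data; names= supplies the labels.
# 'date_added' and 'count' are hypothetical column names.
fixed = pd.read_csv(
    sample_csv,
    header=None,
    names=["date_added", "count"],
    parse_dates=["date_added"],
)
print(fixed)
```

If the real file does have a header, this is unnecessary; checking the first line of the CSV in a text editor settles it.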


In [6]:
frame.to_pickle('dateadded_pickle')

In [9]:
pd.read_pickle('dateadded_pickle').head()

Unnamed: 0,1983-12-31,322
0,1984-12-31,43
1,1985-12-31,203
2,1986-12-31,28
3,1987-12-31,70
4,1988-12-31,55


## Is it smaller than CSV?

In this case it is not: the pickle is 1,508 bytes, while the CSV is only 577 bytes. It would be nice to see how larger files compare.

In [11]:
!ls -la

total 17
drwxr-xr-x 1 mhemken 1049089     0 Nov  1 12:39 .
drwxr-xr-x 1 mhemken 1049089     0 Nov  1 12:32 ..
drwxr-xr-x 1 mhemken 1049089     0 Nov  1 12:36 .ipynb_checkpoints
-rw-r--r-- 1 mhemken 1049089   577 Oct 30 17:09 DATEADDED_by_year.csv
-rw-r--r-- 1 mhemken 1049089  1508 Nov  1 12:39 dateadded_pickle
-rw-r--r-- 1 mhemken 1049089 10192 Nov  1 12:39 hdf5_study.ipynb
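To follow up on how larger files compare, here is a rough sketch that writes a bigger, synthetic frame to both formats and reads the sizes back with `os.path.getsize`. The row count and the file names `big.csv` and `big.pkl` are arbitrary choices for illustration, not part of the original data. Since pickle stores each `float64` in 8 bytes while CSV writes a text representation, the balance seen in the listing above should be expected to shift for numeric data of this size.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic data; 100,000 rows is an arbitrary size chosen for illustration.
big = pd.DataFrame({"a": np.random.randn(100_000)})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "big.csv")   # hypothetical file names
    pkl_path = os.path.join(tmp, "big.pkl")

    big.to_csv(csv_path)
    big.to_pickle(pkl_path)

    # Record the on-disk size of each file before the directory is removed.
    sizes = {
        "csv": os.path.getsize(csv_path),
        "pickle": os.path.getsize(pkl_path),
    }

for name, size in sizes.items():
    print(f"{name:>6}: {size:,} bytes")
```

The result depends on the data: long strings or low-precision numbers would change the ratio, so this is only one data point.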


## Other formats

- *bcolz*
- *Feather*
- *HDF5*

# HDF5

In [13]:
frame = pd.DataFrame({'a': np.random.randn(100)})

frame.head()

Unnamed: 0,a
0,-0.429027
1,0.131689
2,1.545729
3,-1.118845
4,-1.71018


In [17]:
store = pd.HDFStore('mydata.h5')

In [20]:
!ls -la | grep mydata

-rw-r--r-- 1 mhemken 1049089    0 Nov  1 13:04 mydata.h5


In [21]:
store['obj1'] = frame

In [22]:
store['obj1_col'] = frame['a']

In [24]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
/obj1                frame        (shape->[100,1])
/obj1_col            series       (shape->[100])  

In [26]:
store['obj1'].head()

Unnamed: 0,a
0,-0.429027
1,0.131689
2,1.545729
3,-1.118845
4,-1.71018


In [27]:
store.put('obj2', frame, format='table')

In [28]:
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,0.608553
11,0.444984
12,1.374219
13,0.235131
14,1.343744
15,-0.555501
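The `select` call above filters on the index, which is possible because `obj2` was stored with `format='table'`. A small sketch of a few other `where` forms follows, assuming PyTables is installed. The frame, the key `df`, and the file name `where_demo.h5` are invented for this example. Passing `data_columns` when writing is what makes ordinary columns queryable in addition to the index.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo frame: 'a' counts 0..99, 'b' alternates 0/1.
df = pd.DataFrame({
    "a": np.arange(100, dtype="float64"),
    "b": np.arange(100) % 2,
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "where_demo.h5")  # hypothetical file name

    # The context manager closes the store even if a query raises.
    with pd.HDFStore(path) as store:
        # data_columns makes 'a' and 'b' usable in where clauses.
        store.put("df", df, format="table", data_columns=["a", "b"])

        # Range query on the index.
        by_index = store.select("df", where="(index >= 10) & (index <= 15)")

        # Query on column values.
        by_value = store.select("df", where="(a > 90) & (b == 1)")

        # columns= restricts which columns are read back.
        only_a = store.select("df", where="index < 5", columns=["a"])

print(by_index.shape, by_value.shape, only_a.shape)
```

Objects written in the default fixed format cannot be queried this way; they are read back whole.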


In [29]:
store.close()

In [31]:
frame.to_hdf('mydata.h5', key='obj3', format='table')

pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

Unnamed: 0,a
0,-0.429027
1,0.131689
2,1.545729
3,-1.118845
4,-1.71018


## Next steps

This is cool. I'd like to take this in a couple of directions.

1. What kinds of queries can I make with the `where` clause?
2. Is there a best-practice way to turn .csv into .h5?
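On the second question, one possible approach (not necessarily the best practice) is to read the CSV in chunks and append each chunk to a table-format node, so the whole file never has to fit in memory. The sketch below assumes PyTables is installed. It uses a synthetic CSV, and the names `input.csv`, `output.h5`, and the key `data` are placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "input.csv")   # hypothetical file names
    h5_path = os.path.join(tmp, "output.h5")

    # Create a synthetic CSV so the example is self-contained.
    pd.DataFrame({"a": np.random.randn(1000)}).to_csv(csv_path, index=False)

    with pd.HDFStore(h5_path, mode="w") as store:
        # chunksize makes read_csv yield DataFrames of at most 250 rows.
        for chunk in pd.read_csv(csv_path, chunksize=250):
            # append() writes in table format and extends the node 'data'.
            store.append("data", chunk)

    # Read the converted data back to check the round trip.
    result = pd.read_hdf(h5_path, "data")

print(len(result))
```

For files with string columns, the width of each string column is fixed by the first appended chunk, so `min_itemsize` may need to be passed to `append` when later chunks contain longer values.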