# Binary Data Formats


One of the easiest ways to store data efficiently in binary format is using Python’s built-in
pickle serialization. Conveniently, pandas objects all have a to_pickle method, which
writes the data to disk as a pickle:




In [8]:
from pandas import DataFrame, Series

import pandas as pd

import sys

import numpy as np

import json

In [4]:
frame = pd.read_csv('ex1.csv')

In [5]:
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [10]:
frame.to_pickle('frame_pickle')

You can read the data back into Python with pandas.read_pickle, another pickle convenience
function:


In [11]:
pd.read_pickle('frame_pickle')

pickle is only recommended as a short-term storage format. The problem
is that it is hard to guarantee that the format will be stable over time;
an object pickled today may not unpickle with a later version of a library.
I have made every effort to ensure that this does not occur with pandas,
but at some point in the future it may be necessary to “break” the pickle
format.
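As a minimal sketch of the round trip described above, the following writes a DataFrame with to_pickle and reads it back with pandas.read_pickle. The frame is rebuilt inline as a stand-in for the contents of ex1.csv, and the temporary directory is an assumption made only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the frame read from ex1.csv (rebuilt inline for this sketch)
frame = pd.DataFrame({
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "c": [3, 7, 11],
    "d": [4, 8, 12],
    "message": ["hello", "world", "foo"],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "frame_pickle"
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back into memory

# Compare the restored object with the original
print(restored.equals(frame))
```

Because the pickle is tied to the library versions that wrote it, a check like this is only meaningful within one environment; it says nothing about whether a later pandas release can still read the file.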

## Using HDF5 Format

There are a number of tools that facilitate efficiently reading and writing large amounts
of scientific data in binary format on disk. A popular industry-grade library for this is
HDF5, which is a C library with interfaces in many other languages like Java, Python,
and MATLAB. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5
file contains an internal file system-like node structure enabling you to store multiple
datasets and supporting metadata. Compared with simpler formats, HDF5 supports
on-the-fly compression with a variety of compressors, enabling data with repeated patterns
to be stored more efficiently. For very large datasets that don’t fit into memory,
HDF5 is a good choice as you can efficiently read and write small sections of much
larger arrays.

There are not one but two interfaces to the HDF5 library in Python, PyTables and h5py,
each of which takes a different approach to the problem. h5py provides a direct but
high-level interface to the HDF5 API, while PyTables abstracts many of the details of
HDF5 to provide multiple flexible data containers, table indexing, querying capability,
and some support for out-of-core computations.
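To make the node structure, compression, and partial reads described above more concrete, here is a small sketch using h5py. The group name experiment, the dataset name values, the units attribute, and the chunk shape are all hypothetical choices for illustration, and the example assumes h5py is installed.

```python
import os
import tempfile

import h5py
import numpy as np

# 1000 x 100 array of consecutive integers, used as sample data
data = np.arange(100_000, dtype="int64").reshape(1000, 100)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.h5")

    with h5py.File(path, "w") as f:
        # Groups act like directories inside the file
        grp = f.create_group("experiment")
        # Chunked storage with gzip compression applied per chunk
        dset = grp.create_dataset(
            "values", data=data, chunks=(100, 100), compression="gzip"
        )
        # Metadata is attached to the dataset as attributes
        dset.attrs["units"] = "counts"

    with h5py.File(path, "r") as f:
        dset = f["experiment/values"]
        shape = dset.shape
        units = dset.attrs["units"]
        # Slicing the dataset reads only the requested region from disk
        block = dset[10:12, :5]

print(shape, units)
print(block)
```

Slicing the dataset object rather than loading it whole is what lets this pattern scale to arrays that do not fit into memory.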


pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas
objects:

In [12]:
store = pd.HDFStore('mydata.h5')

In [13]:
store['obj1'] = frame

In [14]:
store['obj1_col'] = frame['a']

In [15]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
/obj1                frame        (shape->[3,5])
/obj1_col            series       (shape->[3])  

Objects contained in the HDF5 file can be retrieved in a dict-like fashion:

In [16]:
store['obj1']

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
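The dict-like access shown above can be combined with a context manager and the queryable table format. The sketch below assumes PyTables is installed; the key obj2, the choice of a as a data column, and the temporary file location are illustrative assumptions, and the frame is rebuilt inline as a stand-in for ex1.csv.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the frame read from ex1.csv (rebuilt inline for this sketch)
frame = pd.DataFrame({
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "c": [3, 7, 11],
    "d": [4, 8, 12],
    "message": ["hello", "world", "foo"],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mydata.h5")

    # The context manager closes the file even if an error occurs
    with pd.HDFStore(path, mode="w") as store:
        store["obj1"] = frame            # default "fixed" format
        store["obj1_col"] = frame["a"]

        # The "table" format supports queries on declared data columns
        store.put("obj2", frame, format="table", data_columns=["a"])

        keys = sorted(store.keys())
        col = store["obj1_col"]
        subset = store.select("obj2", where="a > 1")

print(keys)
print(subset)
```

Storing with format="table" costs some write speed compared with the default fixed format, but it allows select to pull back only the matching rows instead of the whole object.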


If you work with huge quantities of data, I would encourage you to explore PyTables
and h5py to see how they can suit your needs. Since many data analysis problems are
IO-bound (rather than CPU-bound), using a tool like HDF5 can massively accelerate
your applications.

HDF5 is not a database. It is best suited for write-once, read-many datasets.
While data can be added to a file at any time, if multiple writers
do so simultaneously, the file can become corrupted.

## Reading Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using
the ExcelFile class. Internally, ExcelFile uses the xlrd package for xls files and the
openpyxl package for xlsx files, so you may have to install them first. To use ExcelFile,
create an instance by passing a path to an xls or xlsx file:

In [17]:
xls_file = pd.ExcelFile('data.xls')

Data stored in a sheet can then be read into a DataFrame using parse:


In [18]:
table = xls_file.parse('Sheet1')

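Since data.xls is not available here, the following sketch first writes a small workbook and then reads it back with ExcelFile and parse. It assumes openpyxl is installed; the file name data.xlsx and the inline frame are stand-ins chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in data (rebuilt inline for this sketch)
frame = pd.DataFrame({
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "c": [3, 7, 11],
    "d": [4, 8, 12],
    "message": ["hello", "world", "foo"],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.xlsx")
    # Write a workbook so there is something to read back
    frame.to_excel(path, sheet_name="Sheet1", index=False)

    # ExcelFile opens the workbook once; parse reads a single sheet from it
    with pd.ExcelFile(path) as xls_file:
        sheet_names = xls_file.sheet_names
        table = xls_file.parse("Sheet1")

    # read_excel is a one-call alternative when only one sheet is needed
    same = pd.read_excel(path, sheet_name="Sheet1")

print(sheet_names)
print(table)
```

ExcelFile is most useful when several sheets are read from the same workbook, since the file is opened only once.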