## Task
Explore I/O in Python std lib, NumPy & pandas

## Notebook summary 
* **[CSV](#csv)**
 * Python std lib: `csv` module (converts all fields to strings)
 * NumPy: `numpy.loadtxt` & `numpy.savetxt` (can handle only numeric fields)
 * pandas: `read_csv` (infers field types automatically), `to_csv`, `from_csv`
* **[JSON](#json)** - `loads`, `dumps`, `read_json`
* **[HDF5](#hdf5)** 
 * [`h5py`](#h5py)
 * [`PyTables`](#pytables)
 * [`pandas.HDFStore`](#pdhdfstore)
* **[Binary Serialization](#binary)**
 * [`pickle`](#pickle)
 * [`msgpack`](#msgpack)
* **[DB](#db)** - import data from DB; requires SQL modules

## References
* *Python for Data Analysis*, Wes McKinney, O'Reilly, 2012
* *Numerical Python*, Robert Johansson, APress, 2015
* *Python Data Science Handbook*, Jake VanderPlas, O'Reilly, 2016


In [2]:
# display output from all cmds just like Python shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import platform
print 'python.version = ', platform.python_version()
import IPython
print 'ipython.version =', IPython.version_info

import pandas as pd
print 'pandas.version = ', pd.__version__
from pandas import Series, DataFrame

import sys
import json
print 'json.version =', json.__version__

import h5py
print 'h5py.version =', h5py.__version__

import msgpack
print 'msgpack.version =', msgpack.version

import tables
print 'PyTables.version =', tables.__version__

import pandas.io.sql as sql
import sqlite3
print 'sqlite3.version =', sqlite3.version

from datetime import datetime


python.version =  2.7.10
ipython.version = (5, 1, 0, '')
pandas.version =  0.19.2
json.version = 2.0.9
h5py.version = 2.6.0
msgpack.version = (0, 4, 8)
PyTables.version = 3.3.0
sqlite3.version = 2.6.0


In [8]:
print '\n----- Data in input file'
!cat sample_data.csv



----- Data in input file
,Col1,Col2,Col3
Row1,Val11,Val12,Val13
Row2,Val21,Val22,Val23
Row3,Val31,Val32,Val33


<a id='csv' />
## CSV
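The summary lists the standard-library `csv` module, but the cells in this section only exercise pandas. A minimal sketch of the `csv` route follows, written for Python 3 and using a hypothetical in-memory string (`SAMPLE`) instead of `sample_data.csv`. It shows that every field arrives as a string and needs explicit conversion.

```python
import csv
import io

# Hypothetical stand-in for a CSV file, so the sketch runs without any file on disk.
SAMPLE = ",Col1,Col2,Col3\nRow1,1,2.5,x\nRow2,3,4.5,y\n"


def read_rows(text):
    """Parse CSV text into a header list and a list of row lists."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)          # first line holds the column names
    rows = [row for row in reader]
    return header, rows


header, rows = read_rows(SAMPLE)
print(header)
print(rows)

# Every field is a str, so numeric columns must be converted by hand.
col1_total = sum(int(row[1]) for row in rows)
print(col1_total)
```

The explicit `int(...)` call is the step that `read_csv` performs automatically when it infers column types.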

In [7]:
# read_csv - import data from CSV file

print '\n----- read_csv with default options'
pd.read_csv('sample_data.csv')

print '\n----- read_csv with header=None'
pd.read_csv('sample_data.csv', header=None)

print '\n----- read_csv with custom header names'
pd.read_csv('sample_data.csv', names=['H1','H2','H3','H4'])



----- read_csv with default options


Unnamed: 0.1,Unnamed: 0,Col1,Col2,Col3
0,Row1,Val11,Val12,Val13
1,Row2,Val21,Val22,Val23
2,Row3,Val31,Val32,Val33



----- read_csv with header=None


Unnamed: 0,0,1,2,3
0,,Col1,Col2,Col3
1,Row1,Val11,Val12,Val13
2,Row2,Val21,Val22,Val23
3,Row3,Val31,Val32,Val33



----- read_csv with custom header names


Unnamed: 0,H1,H2,H3,H4
0,,Col1,Col2,Col3
1,Row1,Val11,Val12,Val13
2,Row2,Val21,Val22,Val23
3,Row3,Val31,Val32,Val33


In [8]:
# read_csv - import data from CSV file

print '\n----- read_csv with row names in col 0'
pd.read_csv('sample_data.csv', index_col=0)

print '\n----- read_csv with row names in col 1'
pd.read_csv('sample_data.csv', index_col=1)

print '\n----- read_csv with row names in col 0, skip first 2 rows'
# Note: skiprows=[2] skips only the line with 0-based index 2 (Row2), not the first 2 rows
pd.read_csv('sample_data.csv', index_col=0, skiprows=[2])
    
# Note: 
# Both Series and DataFrame have a from_csv() function that reads data from a CSV file into the Series or DataFrame.
# Its use is discouraged in favor of read_csv().

# Use DataFrame.info() or DataFrame.dtypes to see the type to which each column in the input file has been converted
# ToDo



----- read_csv with row names in col 0


Unnamed: 0,Col1,Col2,Col3
Row1,Val11,Val12,Val13
Row2,Val21,Val22,Val23
Row3,Val31,Val32,Val33



----- read_csv with row names in col 1


Col1,Unnamed: 0,Col2,Col3
Val11,Row1,Val12,Val13
Val21,Row2,Val22,Val23
Val31,Row3,Val32,Val33



----- read_csv with row names in col 0, skip first 2 rows


Unnamed: 0,Col1,Col2,Col3
Row1,Val11,Val12,Val13
Row3,Val31,Val32,Val33
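
The ToDo in the cell above asks how to see which type each column was converted to. A small sketch, assuming Python 3 and a hypothetical mixed-type CSV string (`MIXED`), since `sample_data.csv` holds only strings:

```python
import io

import pandas as pd

# Hypothetical data with one text, one integer and one float column.
MIXED = "name,count,price\nA,1,2.5\nB,2,3.5\n"

df_mixed = pd.read_csv(io.StringIO(MIXED))
print(df_mixed.dtypes)   # one inferred dtype per column
df_mixed.info()          # dtypes plus non-null counts and memory usage

# The inference can be overridden per column with the dtype argument.
as_float = pd.read_csv(io.StringIO(MIXED), dtype={'count': 'float64'})
print(as_float.dtypes)
```

`dtypes` is the quicker check; `info()` adds non-null counts, which helps spot columns with missing values.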


In [10]:
# read csv file in chunks

print '\n----- read in chunks'
pd.read_csv('sample_data.csv', index_col=0, nrows=2)
part = pd.read_csv('sample_data.csv', index_col=0, chunksize=2)
print 'part = ', part

for i, p in enumerate(part):
    print 'Part ', i
    p



----- read in chunks


Unnamed: 0,Col1,Col2,Col3
Row1,Val11,Val12,Val13
Row2,Val21,Val22,Val23


part =  <pandas.io.parsers.TextFileReader object at 0x10940cc10>
Part  0


Unnamed: 0,Col1,Col2,Col3
Row1,Val11,Val12,Val13
Row2,Val21,Val22,Val23


Part  1


Unnamed: 0,Col1,Col2,Col3
Row3,Val31,Val32,Val33
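
Chunked reading is mostly useful when a file does not fit in memory and only a running result is needed. The sketch below assumes Python 3 and a hypothetical numeric CSV string (`BIG`); it accumulates a row count and a column sum one chunk at a time.

```python
import io

import pandas as pd

# Hypothetical numeric data standing in for a file too large to load at once.
BIG = "id,value\n" + "".join("%d,%d\n" % (i, i * 10) for i in range(5))

total = 0
n_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(BIG), chunksize=2):
    total += chunk['value'].sum()
    n_rows += len(chunk)

print(n_rows, total)
```

Only one chunk of at most two rows is held in memory at any point in the loop.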


In [14]:
# to_csv

print '\n----- Write df to CSV file'
pd.read_csv('sample_data.csv').to_csv('out.csv')
! cat out.csv



----- Write df to CSV file
,Unnamed: 0,Col1,Col2,Col3
0,Row1,Val11,Val12,Val13
1,Row2,Val21,Val22,Val23
2,Row3,Val31,Val32,Val33


In [3]:
# to_csv

print '\n----- Write df to stdout'
pd.read_csv('sample_data.csv').to_csv(sys.stdout)
print '---'

print '\n----- Write df to stdout w/o row names & header'
pd.read_csv('sample_data.csv').to_csv(sys.stdout, index=False, header=False)
print '---' 

print '\n----- Write only Col1 to stdout w/o row names'
pd.read_csv('sample_data.csv').to_csv(sys.stdout, index=False, columns=['Col1'])


# Save to binary format
pd.read_csv('sample_data.csv').to_pickle('csv-pickled.out')



----- Write df to stdout


IOError: File sample_data.csv does not exist
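
The `to_csv` options can also be tried without `sample_data.csv` on disk by writing to an in-memory buffer. This is a sketch under that assumption (Python 3, hypothetical frame `df_out`); it also shows that reading back with `index_col=0` avoids the extra `Unnamed: 0` column seen in `out.csv`.

```python
import io

import pandas as pd

# Hypothetical frame mirroring part of the sample data.
df_out = pd.DataFrame({'Col1': ['Val11', 'Val21'], 'Col2': ['Val12', 'Val22']},
                      index=['Row1', 'Row2'])

buf = io.StringIO()
df_out.to_csv(buf)                      # row labels go into an unnamed first column
text = buf.getvalue()
print(text)

# index_col=0 turns that first column back into the row labels.
df_back = pd.read_csv(io.StringIO(text), index_col=0)
print(df_back)

# With no target given, to_csv returns the CSV text; index=False drops the labels.
no_index = df_out.to_csv(index=False)
print(no_index)
```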

<a id='json' />
## JSON

In [7]:
# loads, read_json, to_json

myjson = """
[{
"name": "MyName",
"age": 99,
"city": "MyCity",
"country": "MyCountry"
},
{
"name": "YourName",
"age": 100,
"city": "YourTown",
"country": "YourRepublic"
}
]
"""

print '----- import JSON using DataFrame(json.loads)'
json.loads(myjson)
DataFrame(json.loads(myjson), index=['Me', 'You']) 

print '\n----- Import JSON using pd.read_json'
df = pd.read_json(myjson, typ='frame')
df


----- import JSON using DataFrame(json.loads)


[{u'age': 99,
  u'city': u'MyCity',
  u'country': u'MyCountry',
  u'name': u'MyName'},
 {u'age': 100,
  u'city': u'YourTown',
  u'country': u'YourRepublic',
  u'name': u'YourName'}]

Unnamed: 0,age,city,country,name
Me,99,MyCity,MyCountry,MyName
You,100,YourTown,YourRepublic,YourName



----- Import JSON using pd.read_json


Unnamed: 0,age,city,country,name
0,99,MyCity,MyCountry,MyName
1,100,YourTown,YourRepublic,YourName


In [48]:

print '\n----- convert DataFrame to JSON'
df.to_json()

print '\n----- Import JSON time series into pd.Series object'

myjson_ts="""
[
{"2016-01-01": 1.0},
{"2016-01-02": 2.1},
{"2016-01-03": 3.2},
{"2016-01-04": 4.3},
{"2016-01-05": 5.4}
]"""

# using json.loads
Series(json.loads(myjson_ts))
    
# using read_json
pd.read_json(myjson_ts, typ='series')
# ToDo: convert JSON keys into Series index



----- convert DataFrame to JSON


'{"age":{"0":99,"1":100},"city":{"0":"MyCity","1":"YourTown"},"country":{"0":"MyCountry","1":"YourRepublic"},"name":{"0":"MyName","1":"YourName"}}'


----- Import JSON time series into pd.Series object


0    {u'2016-01-01': 1.0}
1    {u'2016-01-02': 2.1}
2    {u'2016-01-03': 3.2}
3    {u'2016-01-04': 4.3}
4    {u'2016-01-05': 5.4}
dtype: object

0    {u'2016-01-01': 1.0}
1    {u'2016-01-02': 2.1}
2    {u'2016-01-03': 3.2}
3    {u'2016-01-04': 4.3}
4    {u'2016-01-05': 5.4}
dtype: object
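
One possible way to address the ToDo above (turning the JSON keys into the Series index) is to merge the one-item dicts into a single mapping before building the Series. This is a sketch for Python 3, not the only approach; the names `merged` and `ts` are introduced here for illustration.

```python
import json

import pandas as pd

myjson_ts = """
[
{"2016-01-01": 1.0},
{"2016-01-02": 2.1},
{"2016-01-03": 3.2},
{"2016-01-04": 4.3},
{"2016-01-05": 5.4}
]"""

# Merge the list of one-item dicts into a single {date string: value} mapping.
merged = {}
for rec in json.loads(myjson_ts):
    merged.update(rec)

ts = pd.Series(merged)
ts.index = pd.to_datetime(ts.index)   # date strings -> DatetimeIndex
ts = ts.sort_index()
print(ts)
```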

<a id='hdf5'></a>

## HDF5
* Hierarchical Data Format: cross-platform storage, efficient I/O, a metadata system, and scaling to very large files
* A file consists of groups (like directories) and datasets (like files); both can carry attributes that store metadata
* Groups can be nested to form a tree, hence "hierarchical"
* Two Python libraries
 * **h5py**: lower-level; maps HDF5 onto NumPy and exposes almost all of the HDF5 C API
 * **PyTables**: adds a database-like data model on top of HDF5 & NumPy
* pandas uses `HDFStore`, which is built on PyTables


<a id='h5py' />
h5py

In [2]:
# h5py - create groups in HDF file
f = h5py.File('h5py-file.h5', mode='w')

f.filename
f.mode
f.name

g1 = f.create_group('g1')
g1.name

g21 = f.create_group('g2/g21') # parent groups are created automatically
g21.name

f.flush()
f.close()


u'h5py-file.h5'

'r+'

u'/'

u'/g1'

u'/g2/g21'

In [3]:
# h5py - access groups in HDF file - like python dictionary
f = h5py.File('h5py-file.h5', mode='r')
f.mode

f['g1']
g2 = f['g2']
g2
g2['g21']


'r'

<HDF5 group "/g1" (0 members)>

<HDF5 group "/g2" (1 members)>

<HDF5 group "/g2/g21" (0 members)>

In [4]:
# h5py - iterate over all groups in HDF file

f.keys() # return names only
f.items() # return (name, value) tuples

print '-----'
def printlist(x):
    print x

print 'visit():'
f.visit(lambda x: printlist([x]))

print '\nvisititems():'
f.visititems(lambda name, val: printlist([name, val]))


[u'g1', u'g2']

[(u'g1', <HDF5 group "/g1" (0 members)>),
 (u'g2', <HDF5 group "/g2" (1 members)>)]

-----
visit():
[u'g1']
[u'g2']
[u'g2/g21']

visititems():
[u'g1', <HDF5 group "/g1" (0 members)>]
[u'g2', <HDF5 group "/g2" (1 members)>]
[u'g2/g21', <HDF5 group "/g2/g21" (0 members)>]


In [5]:
# h5py - test group membership

'g1' in f
'g3' in f

'g21' in f['g2']

f.flush()
f.close()


True

False

True

In [26]:
# h5py - attributes of groups

f = h5py.File('h5py-file.h5', mode='r+')
f.attrs
f.attrs['desc'] = 'My first group'

f['g2']['g21'].attrs['desc'] = 'G21 is a sub-group of G2'
f['g2']['g21'].attrs['date'] = str(datetime.now())

f.attrs.keys() # keys only
f.attrs.items() # (key, val) tuples

f['g2']['g21'].attrs.items()

'desc' in f['g2']['g21'].attrs
del f['g2']['g21'].attrs['desc']
'desc' in f['g2']['g21'].attrs

f.flush()
f.close()


<Attributes of HDF5 object at 4608806336>

[u'desc']

[(u'desc', 'My first group')]

[(u'desc', 'G21 is a sub-group of G2'),
 (u'date', '2017-01-27 12:40:32.104677')]

True

False

In [35]:
# h5py - create dataset

f = h5py.File('h5py-file.h5', mode='r+')
f.visititems(lambda name, val: printlist([name, val]))

g1 = f['g1']
g1['g1data'] = [1,2,3,4]

g21 = f['g2']['g21']
g21['g21data'] = ['a','b','c']

f.visititems(lambda name, val: printlist([name, val]))
f.flush()
f.close()

# See also Group.create_dataset() - more control over dataset including dtype, compression, etc.


[u'g1', <HDF5 group "/g1" (1 members)>]
[u'g1/g1data', <HDF5 dataset "g1data": shape (4,), type "<i8">]
[u'g2', <HDF5 group "/g2" (1 members)>]
[u'g2/g21', <HDF5 group "/g2/g21" (0 members)>]
[u'g1', <HDF5 group "/g1" (1 members)>]
[u'g1/g1data', <HDF5 dataset "g1data": shape (4,), type "<i8">]
[u'g2', <HDF5 group "/g2" (1 members)>]
[u'g2/g21', <HDF5 group "/g2/g21" (1 members)>]
[u'g2/g21/g21data', <HDF5 dataset "g21data": shape (3,), type "|S1">]
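
The note about `Group.create_dataset()` can be illustrated with a short sketch. It assumes Python 3 with `h5py` and NumPy installed, and writes to a hypothetical scratch file rather than `h5py-file.h5`; the dataset name and the `units` attribute are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file so the notebook's own h5py-file.h5 is left alone.
path = os.path.join(tempfile.mkdtemp(), 'h5py-sketch.h5')

with h5py.File(path, 'w') as f:
    # Explicit dtype (via the NumPy array) and gzip compression;
    # the intermediate group 'g1' is created automatically.
    dset = f.create_dataset('g1/measurements',
                            data=np.arange(100, dtype='int32'),
                            compression='gzip')
    dset.attrs['units'] = 'counts'

with h5py.File(path, 'r') as f:
    dset = f['g1/measurements']
    first_five = dset[:5]            # slicing reads only these elements
    dtype_name = str(dset.dtype)
    compression = dset.compression
    units = dset.attrs['units']

print(first_five, dtype_name, compression, units)
```

Using `with` closes the file even if an error occurs, which replaces the explicit `flush()` / `close()` pair.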


In [43]:
# h5py - access datasets

f = h5py.File('h5py-file.h5', mode='r')
arr = f['g1']['g1data']
arr # this is a Dataset object, not a list or NumPy array

arr.name, arr.dtype, arr.shape, arr.len()
arr.value
type(arr.value)

# most operations on Datasets are similar to those on NumPy arrays
# slicing a Dataset (e.g. arr[:2]) reads only the requested elements from disk, whereas arr.value loads the whole dataset into memory as a NumPy array


<HDF5 dataset "g1data": shape (4,), type "<i8">

(u'/g1/g1data', dtype('int64'), (4,), 4)

array([1, 2, 3, 4])

numpy.ndarray

<a id='pytables' />
PyTables


In [None]:
# PyTables - ?

# ToDo


<a id='pdhdfstore' />
pd.HDFStore

In [54]:
# HDF5 - pandas uses PyTables module to read/write HDF5

print '----- Empty HDF5 file'
myHDF5Store = pd.HDFStore('pandas-hdfstore-file.h5')
myHDF5Store

print '\n----- HDF5 file with items'
myHDF5Store['s'] = Series(range(5))
myHDF5Store['df'] = DataFrame(json.loads(myjson), index=['Me', 'You'])

myHDF5Store
myHDF5Store.root
myHDF5Store.keys()
's' in myHDF5Store

print '-----'
print 's:'
myHDF5Store['s']

print 'df:'
myHDF5Store['df']

del myHDF5Store['s']
myHDF5Store

myHDF5Store.close()


----- Empty HDF5 file


<class 'pandas.io.pytables.HDFStore'>
File path: pandas-hdfstore-file.h5
/df            frame        (shape->[2,2])


----- HDF5 file with items


your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->axis0] [items->None]

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block0_items] [items->None]

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block1_items] [items->None]



<class 'pandas.io.pytables.HDFStore'>
File path: pandas-hdfstore-file.h5
/df            frame        (shape->[2,2])
/s             series       (shape->[5])  

/ (RootGroup) ''
  children := ['df' (Group), 's' (Group)]

['/df', '/s']

True

-----
s:


0    0
1    1
2    2
3    3
4    4
dtype: int64

df:


Unnamed: 0,age,city,country,name
Me,99,MyCity,MyCountry,MyName
You,100,YourTown,YourRepublic,YourName


<class 'pandas.io.pytables.HDFStore'>
File path: pandas-hdfstore-file.h5
/df            frame        (shape->[2,2])

In [20]:
# read_hdf, to_hdf

df = pd.read_hdf('MyData.h5', 'df')
df

df.to_hdf('MyData.h5', 'df/again')

pd.HDFStore('MyData.h5')


Unnamed: 0,age,city,country,name
Me,99,MyCity,MyCountry,MyName
You,100,YourTown,YourRepublic,YourName


<class 'pandas.io.pytables.HDFStore'>
File path: MyData.h5
/df                  frame        (shape->[2,2])
/df/again            frame        (shape->[2,2])

<a id='binary' />
## Binary Serialization

<a id='pickle'>pickle</a>

* Any type of Python object can be serialized 
* However, pickled objects cannot be read outside Python, and sometimes not even by a different version of Python
* Larger compressed size than `msgpack`

In [13]:
# pickle

s = Series(range(6), index=['A','B','C','D','E','F'])
s
s.to_pickle('MySeries.pkl')

s2 = pd.read_pickle('MySeries.pkl')
s2


A    0
B    1
C    2
D    3
E    4
F    5
dtype: int64

A    0
B    1
C    2
D    3
E    4
F    5
dtype: int64
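
The same round trip works for plain Python objects with the standard-library `pickle` module. A minimal sketch (Python 3; `record` is a hypothetical object) that also contrasts it with JSON, which does not preserve tuples:

```python
import json
import pickle

# Hypothetical object containing a tuple, which JSON cannot represent as such.
record = {'name': 'MyName', 'age': 99, 'tags': ('a', 'b')}

blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
print(restored)

# JSON round trip for comparison: the tuple comes back as a list.
via_json = json.loads(json.dumps(record))
print(via_json)
```

Pickle data should only be loaded from trusted sources, because unpickling can execute arbitrary code.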

<a id='msgpack'>msgpack</a>
* Binary protocol for storing JSON-like data efficiently
* Operates on byte strings (`packb` / `unpackb`) & file handles (`pack` / `unpack`)
* Requires much less space than equivalent JSON format and `pickle`

In [14]:
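# msgpack
# Note: myjson is a str, so packb() below wraps the raw JSON text unchanged;
# pass json.loads(myjson) instead to serialize the parsed structure compactly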
myjson
b = msgpack.packb(myjson)
b
j = msgpack.unpackb(b)
j


'\n[{\n"name": "MyName",\n"age": 99,\n"city": "MyCity",\n"country": "MyCountry"\n},\n{\n"name": "YourName",\n"age": 100,\n"city": "YourTown",\n"country": "YourRepublic"\n}\n]\n'

'\xda\x00\xa1\n[{\n"name": "MyName",\n"age": 99,\n"city": "MyCity",\n"country": "MyCountry"\n},\n{\n"name": "YourName",\n"age": 100,\n"city": "YourTown",\n"country": "YourRepublic"\n}\n]\n'

'\n[{\n"name": "MyName",\n"age": 99,\n"city": "MyCity",\n"country": "MyCountry"\n},\n{\n"name": "YourName",\n"age": 100,\n"city": "YourTown",\n"country": "YourRepublic"\n}\n]\n'
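
In the output above the packed bytes are the JSON text with a 3-byte header, because a string was packed. To see the space saving, the parsed structure has to be packed instead. A sketch, assuming Python 3 and a `msgpack` version that accepts the `use_bin_type` and `raw` arguments (0.5.2 or later):

```python
import json

import msgpack

records = [
    {"name": "MyName", "age": 99, "city": "MyCity", "country": "MyCountry"},
    {"name": "YourName", "age": 100, "city": "YourTown", "country": "YourRepublic"},
]

# Compact JSON (no whitespace) as the baseline.
as_json = json.dumps(records, separators=(',', ':')).encode('utf-8')

# Pack the Python structure itself, not its JSON text.
as_msgpack = msgpack.packb(records, use_bin_type=True)
restored = msgpack.unpackb(as_msgpack, raw=False)   # raw=False decodes strings to str

print(len(as_json), len(as_msgpack))
print(restored == records)
```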

<a id='db' />
## DB

In [23]:
# SQL 

# Create DB with SQLite
query = """
CREATE TABLE MyTable (
Col1 INT,
Col2 VARCHAR(50),
Col3 FLOAT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(query)
conn.commit()


# Load data to DB
data = [
    (1, 'This is Row 1', 3.14),
    (2, 'This is Row 2', 4.15),
    (3, 'This is Row 3', 5.16),
    (4, 'This is Row 4', 6.17)
]
statement = "INSERT INTO MyTable VALUES(?,?,?)"
conn.executemany(statement, data)
conn.commit()


<sqlite3.Cursor at 0x109a6ee30>

<sqlite3.Cursor at 0x109a6ef80>

In [24]:
# Get data from this DB
cursor = conn.execute('Select * from MyTable')
print 'cursor description:'
cursor.description

colnames = zip(*cursor.description)[0]
colnames

rows = cursor.fetchall()
rows

DataFrame(rows, columns=colnames)

print '\n----- Alternately, get DataFrame from SQL using read_sql_query()'
# Using pandas function requires only a single statement
df = sql.read_sql_query('SELECT * from MyTable', conn)
type(df)
df


cursor description:


(('Col1', None, None, None, None, None, None),
 ('Col2', None, None, None, None, None, None),
 ('Col3', None, None, None, None, None, None))

('Col1', 'Col2', 'Col3')

[(1, u'This is Row 1', 3.14),
 (2, u'This is Row 2', 4.15),
 (3, u'This is Row 3', 5.16),
 (4, u'This is Row 4', 6.17)]

Unnamed: 0,Col1,Col2,Col3
0,1,This is Row 1,3.14
1,2,This is Row 2,4.15
2,3,This is Row 3,5.16
3,4,This is Row 4,6.17



----- Alternately, get DataFrame from SQL using read_sql_query()


pandas.core.frame.DataFrame

Unnamed: 0,Col1,Col2,Col3
0,1,This is Row 1,3.14
1,2,This is Row 2,4.15
2,3,This is Row 3,5.16
3,4,This is Row 4,6.17
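
The manual cursor route can be written so that it also runs on Python 3, where `zip(...)[0]` no longer works because `zip` returns an iterator. A stdlib-only sketch using the same table layout; the `WHERE` filter and the `records` list are additions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Used as a context manager, the connection commits on success
# and rolls back if an exception is raised.
with conn:
    conn.execute('CREATE TABLE MyTable (Col1 INT, Col2 VARCHAR(50), Col3 FLOAT)')
    conn.executemany('INSERT INTO MyTable VALUES (?, ?, ?)',
                     [(1, 'This is Row 1', 3.14), (2, 'This is Row 2', 4.15)])

# Parameterized query: values are passed separately from the SQL text.
cursor = conn.execute('SELECT * FROM MyTable WHERE Col1 >= ?', (2,))
colnames = [d[0] for d in cursor.description]   # first item of each 7-tuple is the name
rows = cursor.fetchall()
records = [dict(zip(colnames, row)) for row in rows]
conn.close()

print(colnames)
print(records)
```

The `colnames` and `rows` values can be passed to `DataFrame(rows, columns=colnames)` exactly as in the cell above.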
