# Chapter 6. Data Loading, Storage, and File Formats
<a id='index'></a>
## Table of Contents
- [6.1 Reading and Writing Data in Text Format](#61)
    - [6.1.1 Reading Text Files in Pieces](#611)
    - [6.1.2 Writing Data to Text Format](#612)
    - [6.1.3 Working with Delimited Formats](#613)
    - [6.1.4 JSON Data](#614)
    - [6.1.6 XML and HTML: Web Scraping](#616)
        - [6.1.6.1 Parsing XML with lxml.objectify](#6161)
- [6.2 Binary Data Formats](#62)
    - [6.2.1 Using HDF5 Format](#621)
    - [6.2.2 Reading Microsoft Excel Files](#622)

## 6.1 Reading and Writing Data in Text Format
<a id='61'></a>
#### Category
* Indexing
* Type inference and data conversion
* Datetime parsing
* Iterating
* Unclean data issues

In [97]:
import numpy as np
import pandas as pd

In [98]:
try: 
    df = pd.read_csv('examples/ex1.csv')
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [99]:
try: 
    df = pd.read_table('examples/ex1.csv', sep=',')
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


> A file will not always have a header row

In [100]:
!cat examples/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [101]:
try: 
    df = pd.read_csv('examples/ex2.csv', header=None)
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [102]:
# Alternatively, supply the column names yourself with names
try: 
    df = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [103]:
# To use one of the columns as the index, pass its name to index_col
try: 
    df = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [104]:
parsed = pd.read_csv('examples/csv_mindex.csv', index_col=['key1', 'key2'])
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
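
Passing a list of column names to `index_col` builds a hierarchical index. The sketch below shows how such a frame can then be queried by level. The `csv_text` string is made-up data that only mimics the layout of `csv_mindex.csv`, so that the example runs without the file:

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory data that mimics the layout of csv_mindex.csv.
csv_text = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,9,10
two,b,11,12
"""

# A list of column names in index_col produces a two-level MultiIndex.
parsed = pd.read_csv(StringIO(csv_text), index_col=['key1', 'key2'])

# Select every row under the outer key 'one' ...
outer = parsed.loc['one']

# ... or a single cell by its full (key1, key2) pair and column name.
cell = parsed.loc[('two', 'b'), 'value2']
```

Selecting on the outer level drops that level from the result, so `outer` is indexed by `key2` alone.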


In [105]:
# In some cases, a table might not have a fixed delimiter, using whitespace or 
# some other pattern to separate fields. Consider a text file that looks like this
list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

> The fields here are separated by a variable amount of whitespace, so splitting on a single fixed character will not work. While you could do some munging by hand, you can instead pass a regular expression as the delimiter for ***read_table***. The pattern ***\s+*** matches one or more whitespace characters, so we have:

In [106]:
result = pd.read_table('examples/ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
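
In the result above, the labels `aaa`, `bbb`, and so on ended up as the index rather than as a column. That happens because the header row has one fewer field than the data rows, so the parser treats the first column as the index. A minimal sketch of the same behavior on an in-memory string (the two data rows are copied from the listing above; `read_csv` is used here, which accepts the same `sep` argument):

```python
from io import StringIO

import pandas as pd

# Header has three names, data rows have four fields.
text = """            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
"""

# r'\s+' splits on any run of whitespace; the extra leading field
# in each data row becomes the index.
result = pd.read_csv(StringIO(text), sep=r'\s+')

columns = list(result.columns)
index_labels = list(result.index)
```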


In [107]:
# To skip certain rows, pass their zero-based line numbers to skiprows
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [108]:
# Handling missing values: a value is missing if it is either not present (empty string) or marked by some sentinel value
result = pd.read_csv('examples/ex5.csv')
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [109]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


> The **na_values** option takes either a list or a set of strings to treat as missing values:

In [110]:
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


> Different NA sentinels can be specified for each column in a dict:

In [111]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

pd.read_csv('examples/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
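
Per-column sentinels are applied in addition to the default ones (such as `NA` and the empty string). If only the listed strings should count as missing, `keep_default_na=False` switches the defaults off. The sketch below contrasts the two; the `csv_text` data is made up for illustration and is not the contents of `ex5.csv`:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: 'NA' is one of pandas' default missing-value markers.
csv_text = """something,a,message
one,1,NA
two,5,world
three,9,foo
"""

sentinels = {'message': ['foo'], 'something': ['two']}

# Defaults stay active: both 'NA' and 'foo' become NaN in 'message'.
default = pd.read_csv(StringIO(csv_text), na_values=sentinels)

# Defaults disabled: only 'foo' is treated as missing, 'NA' stays a string.
strict = pd.read_csv(StringIO(csv_text), na_values=sentinels,
                     keep_default_na=False)

default_mask = default['message'].isna().tolist()
strict_mask = strict['message'].isna().tolist()
```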


### 6.1.1 Reading Text Files in Pieces
<a id='611'></a>

In [112]:
# If you want to only read a small number of rows (avoiding reading the entire file), specify that with nrows
pd.read_csv('examples/ex5.csv', nrows=2)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world


In [113]:
# To read a file in pieces, specify a chunksize as a number of rows:
chunker = pd.read_csv('examples/ex1.csv', chunksize=1)

# The TextFileReader object returned by read_csv allows you to iterate 
# over the parts of the file according to the chunksize.
chunker

<pandas.io.parsers.TextFileReader at 0x10f178dd8>

In [114]:
tot = pd.Series([], dtype='float64')  # explicit dtype: an empty Series has no data to infer it from

for piece in chunker:
    tot = tot.add(piece['a'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
tot

9    1.0
5    1.0
1    1.0
dtype: float64
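
The same accumulate-per-chunk pattern can be tried without any file on disk. This is a sketch on generated data (the `key`/`value` columns and the chunk size of 4 are arbitrary choices for illustration):

```python
from io import StringIO

import pandas as pd

# Build a small CSV in memory: nine rows with keys a, b, c repeated unevenly.
csv_text = "key,value\n" + "".join(
    f"{k},{i}\n" for i, k in enumerate("abcabcaab")
)

# Start from an empty float Series and add each chunk's counts to it.
counts = pd.Series(dtype='float64')
n_chunks = 0
for piece in pd.read_csv(StringIO(csv_text), chunksize=4):
    # fill_value=0 treats keys missing on either side as zero.
    counts = counts.add(piece['key'].value_counts(), fill_value=0)
    n_chunks += 1

counts = counts.sort_values(ascending=False)
```

Only one chunk is held in memory at a time, which is the point of `chunksize` for files too large to load whole.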

### 6.1.2 Writing Data to Text Format
<a id='612'></a>

In [115]:
data = pd.read_csv('examples/ex5.csv')
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [116]:
data.to_csv('examples/out.csv')
!cat examples/out.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [117]:
import sys

# Other delimiters can be used, of course (writing to sys.stdout so it prints the text result to the console)
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


In [118]:
# Missing values appear as empty strings in the output. 
# You might want to denote them by some other sentinel value:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


In [119]:
# With no other options specified, both the row and column labels are written. Both of these can be disabled:
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


In [120]:
# You can also write only a subset of the columns, and in an order of your choosing
data.to_csv(sys.stdout, index=False, columns=['a','c','message'])

a,c,message
1,3.0,
5,,world
9,11.0,foo


In [121]:
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('examples/tseries.csv')

!cat examples/tseries.csv

2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
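
Reading such a file back is where datetime parsing comes in. The sketch below round-trips the series through an in-memory buffer instead of `examples/tseries.csv`, and writes an explicit header so the columns can be referred to by name (the names `date` and `value` are chosen here for illustration):

```python
from io import StringIO

import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates, name='value')

# Write to a text buffer with a header row: "date,value".
buf = StringIO()
ts.to_csv(buf, header=True, index_label='date')
buf.seek(0)

# index_col picks the date column as the index; parse_dates=True
# converts that index from strings to timestamps.
restored = pd.read_csv(buf, index_col='date', parse_dates=True)['value']
```

Without `parse_dates=True` the index would come back as plain strings.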


### 6.1.3 Working with Delimited Formats
<a id='613'></a>

In [122]:
!cat examples/ex7.csv

"a","b","c"
"1","2","3"
"1","2","3"


In [123]:
import csv

# For any file with a single-character delimiter, you can use Python’s 
# built-in csv module. To use it, pass any open file or file-like object 
# to csv.reader
with open('examples/ex7.csv') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


In [124]:
# From there, it’s up to you to do the wrangling necessary to put the data in the form that 
# you need it. Let’s take this step by step. First, we read the file into a list of lines
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))

In [125]:
# Then, we split the lines into the header line and the data lines:
header, values = lines[0], lines[1:]

In [126]:
# Then we can create a dictionary of data columns using a dictionary comprehension
# and the expression zip(*values), which transposes rows to columns:
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
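
As an alternative to transposing with `zip(*values)`, `csv.DictReader` reads the header for you and yields one dict per row. The sketch below also writes the columns back out with a different delimiter using `csv.writer`. It works on an in-memory copy of the file contents shown above rather than on `examples/ex7.csv`:

```python
import csv
from io import StringIO

# Same contents as ex7.csv, held in memory.
raw = '"a","b","c"\n"1","2","3"\n"1","2","3"\n'

# DictReader takes the first row as field names.
rows = list(csv.DictReader(StringIO(raw)))

# Collect the row dicts into a dict of columns.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Write the data back out, semicolon-delimited, quoting only when needed.
out = StringIO()
writer = csv.writer(out, delimiter=';', quoting=csv.QUOTE_MINIMAL,
                    lineterminator='\n')
writer.writerow(columns.keys())
writer.writerows(zip(*columns.values()))
written = out.getvalue()
```

All values are still strings at this point; the csv module does no type inference.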

### 6.1.4 JSON Data
<a id='614'></a>
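JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. The built-in ***json*** module converts between JSON strings and Python objects.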

In [127]:
obj = """
    {
     "name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 30, 
                   "pets": ["Zeus", "Zuko"]},
                  {"name": "Katie", "age": 38,
                   "pets": ["Sixes", "Stache", "Cisco"]}]
    } 
"""

In [128]:
import json
result = json.loads(obj)
result

{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
  {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [129]:
# json.dumps, on the other hand, converts a Python object back to JSON
asjson = json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
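
A quick way to check how `json.loads` and `json.dumps` map types is to round-trip a small object. This sketch uses a shortened version of the record above; the `point` example is an addition for illustration:

```python
import json

record = {"name": "Wes", "pet": None,
          "places_lived": ["United States", "Spain"]}

# Python None is written as JSON null; indent/sort_keys only affect layout.
text = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(text)

# JSON has no tuple type, so a tuple comes back as a list.
point = json.loads(json.dumps({"xy": (1, 2)}))
```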

In [130]:
# A list of dicts can be passed to the DataFrame constructor, selecting a subset of the fields
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


In [131]:
# pandas.read_json can automatically convert JSON datasets in specific arrangements 
# into a Series or DataFrame. For example:
!cat examples/example.json

[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]


In [132]:
# The default options for pandas.read_json assume that each object in the JSON array is a row in the table
data = pd.read_json('examples/example.json')
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [133]:
print(data.to_json())

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
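
The default `to_json` output is column-oriented, which does not match the row-per-object layout of `example.json`. Passing `orient='records'` gives that layout back. A sketch, rebuilding the frame in memory instead of reading the file:

```python
from io import StringIO

import pandas as pd

data = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# One JSON object per row, in a single array.
records = data.to_json(orient='records')

# Wrapping the string in StringIO lets read_json treat it as a file.
restored = pd.read_json(StringIO(records), orient='records')
```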


### 6.1.6 XML and HTML: Web Scraping
<a id='616'></a>
pandas has a built-in function, ***read_html***, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.

The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within `<table>` tags.

In [134]:
tables = pd.read_html('examples/fdic_failed_bank_list.html')

len(tables)

1

In [135]:
failures = tables[0]
failures.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,"December 15, 2017","December 15, 2017"
1,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,"October 13, 2017","October 20, 2017"
2,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb","May 26, 2017","July 26, 2017"
3,"Guaranty Bank, (d/b/a BestBank in Georgia & Mi...",Milwaukee,WI,30003,First-Citizens Bank & Trust Company,"May 5, 2017","July 26, 2017"
4,First NBC Bank,New Orleans,LA,58302,Whitney Bank,"April 28, 2017","December 5, 2017"
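
The date columns come back from `read_html` as text, so a typical next step is converting them to timestamps. The sketch below does this for the five closing dates shown above, rebuilt by hand into a small frame so that no HTML file or parser is needed; the explicit `format` string is an assumption about how the dates are written:

```python
import pandas as pd

# The City and Closing Date values from the five rows shown above.
failures = pd.DataFrame({
    'City': ['Chicago', 'Argonia', 'Saint Elmo', 'Milwaukee', 'New Orleans'],
    'Closing Date': ['December 15, 2017', 'October 13, 2017',
                     'May 26, 2017', 'May 5, 2017', 'April 28, 2017'],
})

# Parse "Month day, year" strings into timestamps.
close_timestamps = pd.to_datetime(failures['Closing Date'],
                                  format='%B %d, %Y')

# Count closings per calendar year.
per_year = close_timestamps.dt.year.value_counts()
```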


#### 6.1.6.1 Parsing XML with lxml.objectify
<a id='6161'></a>
XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. The book you are currently reading was actually created from a series of large XML documents.

In [136]:
from lxml import objectify

path = 'examples/Performance_MNR.xml'
# Parse inside a with block so the file handle is closed afterwards
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()


data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root: 
    el_data = {}
    for child in elt.iterchildren():  # getchildren() is deprecated in lxml
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
    
perf = pd.DataFrame(data)
perf.head()

Unnamed: 0,AGENCY_NAME,CATEGORY,DESCRIPTION,FREQUENCY,INDICATOR_NAME,INDICATOR_UNIT,MONTHLY_ACTUAL,MONTHLY_TARGET,PERIOD_MONTH,PERIOD_YEAR,YTD_ACTUAL,YTD_TARGET
0,Metro-North Railroad,Service Indicators,Percent of the time that escalators are operat...,M,Escalator Availability,%,,97.0,12,2011,,97.0


In [137]:
# XML data can get much more complicated than this example. Each tag can have metadata, too. 
# Consider an HTML link tag, which is also valid XML:

from io import StringIO
tag = '<a href="http://www.google.com">Google</a>' 
root = objectify.parse(StringIO(tag)).getroot()

root

<Element a at 0x10f662648>

In [138]:
root.get('href')

'http://www.google.com'

In [139]:
root.text

'Google'
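
If lxml is not installed, the standard library's `xml.etree.ElementTree` offers a similar interface for attributes and text. This is a swapped-in alternative, not what the cells above use, and it does not convert values to Python types the way `pyval` does. The `<record>` wrapper element is made up for illustration:

```python
import xml.etree.ElementTree as ET

tag = '<a href="http://www.google.com">Google</a>'
root = ET.fromstring(tag)

href = root.get('href')      # attribute lookup, same call as in lxml
text = root.text             # text content of the element
attrs = dict(root.attrib)    # all attributes as a plain dict

# Child elements can be turned into a dict of tag -> text.
# <record> is a hypothetical wrapper; values stay as strings here.
doc = ('<record><AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>'
       '<PERIOD_YEAR>2011</PERIOD_YEAR></record>')
record = {child.tag: child.text for child in ET.fromstring(doc)}
```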

## 6.2 Binary Data Formats
<a id='62'></a>
One of the easiest ways to store data (also known as *serialization*) efficiently in binary format is using Python’s built-in **pickle** serialization. pandas objects all have a ***to_pickle*** method that writes the data to disk in pickle format:

In [140]:
frame = pd.read_csv('examples/ex1.csv')

frame.to_pickle('examples/frame_pickle')

In [141]:
# You can read any “pickled” object stored in a file by using the built-in pickle directly,
# or even more conveniently using pandas.read_pickle:
pd.read_pickle('examples/frame_pickle')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


> * pickle is only recommended as a short-term storage format.
> * The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
> * pandas has built-in support for two more binary data formats: HDF5 and MessagePack (MessagePack support was deprecated in pandas 0.25 and removed in 1.0).
> * There are also others such as: bcolz and Feather
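
A pickle round trip can be checked without leaving files behind by writing into a temporary directory. This is a sketch with a small hand-built frame (not `ex1.csv`); the file name `frame_pickle` is reused from above:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9],
                      'message': ['hello', 'world', 'foo']})

# The directory and everything in it are removed when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'frame_pickle')
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# Values, dtypes, and index all survive the round trip.
same = restored.equals(frame)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.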

### 6.2.1 Using HDF5 Format
<a id='621'></a>
HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays.

In [163]:
# While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, 
# pandas provides a high-level interface that simplifies storing Series and DataFrame objects. 
# The HDFStore class works like a dict and handles the low-level details:
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

In [164]:
store['obj1']

Unnamed: 0,a
0,0.260328
1,0.063828
2,-1.873897
3,1.242962
4,1.122369
5,-0.486322
6,-0.699152
7,0.624996
8,0.746038
9,0.674452


In [165]:
# HDFStore supports two storage schemas, 'fixed' and 'table'. 
# The latter is generally slower, but it supports query operations using a special syntax:
store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,0.145903
11,2.226336
12,-0.923274
13,-0.92816
14,-0.070557
15,0.112741


In [166]:
# Close the store to release the underlying file handle
store.close()

In [168]:
# The pandas.read_hdf function gives you a shortcut to these tools:
frame.to_hdf('mydata2.h5', key='obj3', format='table')  # key passed by keyword
pd.read_hdf('mydata2.h5', 'obj3', where=['index < 5'])

Unnamed: 0,a
0,0.260328
1,0.063828
2,-1.873897
3,1.242962
4,1.122369


> *HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.*

### 6.2.2 Reading Microsoft Excel Files
<a id='622'></a>
pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the *ExcelFile* class or the *pandas.read_excel* function. Internally these tools use the add-on packages *xlrd* and *openpyxl* to read **XLS** and **XLSX** files, respectively.

[Back to top](#index)