# Data Loading, Storage, and File Formats

The tools in this book are of little use if you can’t easily import and export data in
Python. I’m going to be focused on input and output with pandas objects, though there
are of course numerous tools in other libraries to aid in this process. NumPy, for example,
features low-level but extremely fast binary data loading and storage, including
support for memory-mapped arrays. See Chapter 12 for more on those.

Input and output typically fall into a few main categories: reading text files and other
more efficient on-disk formats, loading data from databases, and interacting with network
sources like web APIs.

## Reading and Writing Data in Text Format


Python has become a beloved language for text and file munging due to its simple syntax
for interacting with files, intuitive data structures, and convenient features like tuple
packing and unpacking.


pandas features a number of functions for reading tabular data as a DataFrame object.
Table 6-1 has a summary of all of them, though read_csv and read_table are likely the
ones you’ll use the most.


Table 6-1. Parsing functions in pandas

| Function | Description |
|---|---|
| read_csv | Load delimited data from a file, URL, or file-like object. Uses comma as the default delimiter |
| read_table | Load delimited data from a file, URL, or file-like object. Uses tab ('\t') as the default delimiter |
| read_fwf | Read data in fixed-width column format (that is, no delimiters) |
| read_clipboard | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages |
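As a brief illustration of read_fwf, the least familiar entry in the table, the sketch below parses a small fixed-width table held in memory. The sample text and the column widths are made up for this example, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical fixed-width data: columns are 4, 7, and 5 characters wide
text = (
    "id  name   score\n"
    "1   alice  3.5\n"
    "2   bob    4.0\n"
)

# widths gives the number of characters occupied by each column
df = pd.read_fwf(io.StringIO(text), widths=[4, 7, 5])
print(df)
```

The first line is used as the header, as with read_csv, and surrounding whitespace is stripped from each field.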

I’ll give an overview of the mechanics of these functions, which are meant to convert
text data into a DataFrame. The options for these functions fall into a few categories:

* Indexing: can treat one or more columns as the index of the returned DataFrame, and whether to get column names from the file, the user, or not at all.

* Type inference and data conversion: this includes user-defined value conversions and custom lists of missing value markers.

* Datetime parsing: includes the ability to combine date and time information spread over multiple columns into a single column in the result.

* Iterating: support for iterating over chunks of very large files.

* Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Type inference is one of the more important features of these functions; that means you
don’t have to specify which columns are numeric, integer, boolean, or string. Handling
dates and other custom types requires a bit more effort, though. Let’s start with a small
comma-separated (CSV) text file:


In [3]:
!cat ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since this is comma-delimited, we can use read_csv to read it into a DataFrame:

In [57]:
from pandas import DataFrame, Series

import pandas as pd

import sys

In [7]:
df = pd.read_csv('ex1.csv')

In [8]:
df

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


We could also have used read_table and specified the delimiter:

In [12]:
pd.read_table('ex1.csv',  sep=',')

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


A file will not always have a header row. Consider this file:

In [13]:
!cat ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

To read this in, you have a couple of options. You can allow pandas to assign default
column names, or you can specify names yourself:

In [14]:
pd.read_csv('ex2.csv', header=None)

,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [15]:
pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose you wanted the message column to be the index of the returned DataFrame.
You can either indicate you want the column at index 4 or named 'message' using the
index_col argument:

In [16]:
names = ['a', 'b', 'c', 'd', 'message']

In [17]:
pd.read_csv('ex2.csv', names=names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
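The two ways of naming the index column can be compared directly. The following is a minimal sketch that assumes the same contents as ex2.csv, held in an in-memory string so that it runs without the file; the variable names by_name and by_position are just for illustration.

```python
import io

import pandas as pd

# Same contents as ex2.csv, held in memory so the snippet is self-contained
text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ['a', 'b', 'c', 'd', 'message']

# Select the index column by name...
by_name = pd.read_csv(io.StringIO(text), names=names, index_col='message')
# ...or by its zero-based position
by_position = pd.read_csv(io.StringIO(text), names=names, index_col=4)

# Both calls should produce the same DataFrame
print(by_name.equals(by_position))
```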


In the event that you want to form a hierarchical index from multiple columns, just
pass a list of column numbers or names:

In [18]:
!cat csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [19]:
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])

In [20]:
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
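Column numbers work the same way as names here. The sketch below uses a shortened, in-memory copy of the data above (an assumption made so the example runs on its own) and builds the hierarchical index from positions 0 and 1.

```python
import io

import pandas as pd

# Shortened copy of csv_mindex.csv, held in memory for illustration
text = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
    "two,b,11,12\n"
)

# Columns 0 and 1 (key1 and key2) become the two levels of the index
parsed = pd.read_csv(io.StringIO(text), index_col=[0, 1])

print(list(parsed.index.names))
# A row is addressed by a (key1, key2) tuple
print(parsed.loc[('two', 'a'), 'value2'])
```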


In some cases, a table might not have a fixed delimiter, using whitespace or some other
pattern to separate fields. In these cases, you can pass a regular expression as a delimiter
for read_table. Consider a text file that looks like this:

In [21]:
list(open('ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491']

While you could do some munging by hand, in this case fields are separated by a variable
amount of whitespace. This can be expressed by the regular expression \s+, so we have
then:

In [22]:
result = pd.read_table('ex3.txt', sep=r'\s+')

In [23]:
result

,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


Because there was one fewer column name than the number of data columns, read_table
infers that the first column should be the DataFrame’s index in this special case.
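The index inference can be checked on an in-memory copy of the first few lines. This is a small sketch, not the original session: the text is pasted into a string, and read_csv is used with an explicit regular-expression separator, written as a raw string so the backslash is not treated as a Python escape.

```python
import io

import pandas as pd

# First lines of the whitespace-delimited file, held in memory
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# Three column names but four fields per row, so the first field
# of each row is taken as the index
result = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(list(result.columns))
print(list(result.index))
```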

The parser functions have many additional arguments to help you handle the wide
variety of exceptional file formats that occur (see Table 6-2). For example, you can skip
the first, third, and fourth rows of a file with skiprows:

In [24]:
!cat ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [25]:
pd.read_csv('ex4.csv', skiprows=[0, 2, 3])

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
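When the unwanted rows all begin with the same marker, the comment option listed in Table 6-2 may be a convenient alternative to listing row numbers. The sketch below assumes the same contents as ex4.csv, held in a string, and compares the two approaches.

```python
import io

import pandas as pd

# Same contents as ex4.csv, held in memory so the snippet is self-contained
text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip rows by position...
skipped = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])
# ...or ignore every line that starts with the comment character
commented = pd.read_csv(io.StringIO(text), comment='#')

print(skipped.equals(commented))
```

The skiprows version depends on knowing exactly where the stray lines are, while the comment version only depends on how they start.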


Handling missing values is an important and frequently nuanced part of the file parsing
process. Missing data is usually either not present (empty string) or marked by some
sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as
NA, -1.#IND, and NULL:

In [26]:
!cat ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [27]:
result = pd.read_csv('ex5.csv')

In [28]:
result

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [29]:
pd.isnull(result)

,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The na_values option can take either a list or set of strings to consider missing values:

In [30]:
result = pd.read_csv('ex5.csv', na_values=['NULL'])

In [31]:
result

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
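Since ex5.csv contains no sentinel beyond the defaults, the effect of na_values is easier to see on data that does. The following sketch uses made-up data in which the string 'missing' marks an absent number; the data and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical data: 'missing' is a custom sentinel, NA and NULL are defaults
text = (
    "something,a,message\n"
    "one,1,NA\n"
    "two,missing,world\n"
    "three,9,NULL\n"
)

# Without na_values, 'missing' is kept as an ordinary string
default = pd.read_csv(io.StringIO(text))
# With na_values, it is treated as missing in addition to the defaults
custom = pd.read_csv(io.StringIO(text), na_values=['missing'])

print(default['a'].dtype, custom['a'].dtype)
print(custom.isnull().sum())
```

Treating the sentinel as missing also lets the column be parsed as numeric rather than as strings.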


Different NA sentinels can be specified for each column in a dict:

In [33]:
sentinels = {'message': ['foo', 'NA'], 'something':['two']}

In [34]:
pd.read_csv('ex5.csv', na_values=sentinels)

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


Table 6-2. read_csv/read_table function arguments

| Argument | Description |
|---|---|
| path | String indicating filesystem location, URL, or file-like object |
| sep or delimiter | Character sequence or regular expression to use to split fields in each row |
| header | Row number to use as column names. Defaults to 0 (first row), but should be None if there is no header row |
| index_col | Column numbers or names to use as the row index in the result. Can be a single name/number or a list of them for a hierarchical index |
| names | List of column names for result, combine with header=None |
| skiprows | Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip |
| na_values | Sequence of values to replace with NA |
| comment | Character or characters to split comments off the end of lines |
| parse_dates | Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse to date (for example, if date/time is split across two columns) |
| keep_date_col | If joining columns to parse date, keep the joined columns. Default False |
| converters | Dict containing column number or name mapping to functions. For example, {'foo': f} would apply the function f to all values in the 'foo' column |
| dayfirst | When parsing potentially ambiguous dates, treat as international format (e.g. 7/6/2012 -> June 7, 2012). Default False |
| date_parser | Function to use to parse dates |
| nrows | Number of rows to read from beginning of file |
| iterator | Return a TextFileReader object for reading file piecemeal |
| chunksize | For iteration, size of file chunks |
| skip_footer | Number of lines to ignore at end of file |
| verbose | Print various parser output information, like the number of missing values placed in non-numeric columns |
| encoding | Text encoding for unicode. For example, 'utf-8' for UTF-8 encoded text |
| squeeze | If the parsed data only contains one column, return a Series |
| thousands | Separator for thousands, e.g. ',' or '.' |
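A few of these arguments can be combined in a single call. The sketch below is illustrative only: the data is made up, and it exercises parse_dates, thousands, and converters together on an in-memory string.

```python
import io

import pandas as pd

# Hypothetical data: ISO dates, quoted amounts with thousands separators,
# and lowercase codes
text = (
    "date,amount,code\n"
    '2012-06-07,"1,234",ab\n'
    '2012-06-08,"5,678",cd\n'
)

df = pd.read_csv(
    io.StringIO(text),
    parse_dates=['date'],             # parse this column to datetime
    thousands=',',                    # read "1,234" as the number 1234
    converters={'code': str.upper},   # apply str.upper to each value in 'code'
)

print(df.dtypes)
print(df)
```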

## Reading Text Files in Pieces


When processing very large files or figuring out the right set of arguments to correctly
process a large file, you may only want to read in a small piece of a file or iterate through
smaller chunks of the file.

In [35]:
result = pd.read_csv('ex6.csv')

In [36]:
result

,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.817480,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.358480,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.637830,2.172201,G


If you want to only read out a small number of rows (avoiding reading the entire file),
specify that with nrows:

In [37]:
pd.read_csv('ex6.csv', nrows=5)

,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read out a file in pieces, specify a chunksize as a number of rows:

In [38]:
chunker = pd.read_csv('ex6.csv', chunksize=1000)

In [39]:
chunker

<pandas.io.parsers.TextFileReader at 0x10ef4bc18>

The TextFileReader object returned by read_csv allows you to iterate over the parts of the
file according to the chunksize. For example, we can iterate over ex6.csv, aggregating
the value counts in the 'key' column like so:

In [48]:
chunker = pd.read_csv('ex6.csv', chunksize=1000)

In [49]:
tot = Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    
tot = tot.sort_values(ascending=False)

We have then:


In [50]:
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64


TextFileReader is also equipped with a get_chunk method which enables you to read pieces
of an arbitrary size.
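Both styles of piecewise reading can be tried on a small in-memory table. The data below is made up (ten rows with keys cycling through a, b, c), and the chunk sizes are arbitrary; this is a sketch of the pattern rather than a reproduction of the ex6.csv session.

```python
import io

import pandas as pd

# Hypothetical ten-row table with a 'key' column cycling through a, b, c
text = "key,value\n" + "\n".join(
    f"{k},{i}" for i, k in enumerate("abcabcabca")
)

# Iterate in chunks of 4 rows, accumulating value counts across chunks
reader = pd.read_csv(io.StringIO(text), chunksize=4)
counts = pd.Series(dtype='float64')
for piece in reader:
    counts = counts.add(piece['key'].value_counts(), fill_value=0)
counts = counts.sort_values(ascending=False)
print(counts)

# With iterator=True, get_chunk reads pieces of whatever size is requested
reader = pd.read_csv(io.StringIO(text), iterator=True)
first = reader.get_chunk(3)
rest = reader.get_chunk(5)
print(len(first), len(rest))
```

The fill_value=0 argument matters because a key that is absent from one chunk would otherwise turn the running total for that key into a missing value.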

## Writing Data Out to Text Format

Data can also be exported to delimited format. Let’s consider one of the CSV files read
above:

In [51]:
data = pd.read_csv('ex5.csv')

In [53]:
data

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:

In [54]:
data.to_csv('out.csv')

In [55]:
!cat out.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Other delimiters can be used, of course (writing to sys.stdout so it just prints the text
result):

In [58]:
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
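Because to_csv writes the row index as a first column with an empty header, reading the file back needs index_col=0 to recover the original shape. The sketch below uses a small made-up DataFrame and an in-memory buffer in place of out.csv to show the round trip.

```python
import io

import pandas as pd

# Hypothetical frame with a couple of missing values
data = pd.DataFrame({
    'something': ['one', 'two'],
    'c': [3.0, None],
    'message': [None, 'world'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
data.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read back without and with index_col to compare the resulting columns
plain = pd.read_csv(io.StringIO(csv_text))
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(plain.columns.tolist())
print(restored.columns.tolist())
```

Without index_col, the written index comes back as an ordinary column that pandas labels 'Unnamed: 0'.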
