# Reading and Writing Data in Text Format 

pandas provides many functions for reading data stored in text
formats. These functions also take many optional arguments, since
different files call for different parsing setups.

Let's start with a small CSV file:

In [2]:
!cat ../pydata-book/examples/ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since it is a comma-separated file, we can read it with `pandas.read_csv`:

In [1]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv("../pydata-book/examples/ex1.csv")
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Not all files have a header. Consider this file:

In [5]:
!cat ../pydata-book/examples/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Here we have two options: let pandas assign default column names
with `header=None`, or supply our own with the `names` argument:

In [7]:
pd.read_csv("../pydata-book/examples/ex2.csv", header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [8]:
pd.read_csv("../pydata-book/examples/ex2.csv", names=["a", 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose we **want a specific column, such as `message`, as the index**.
We can do that by passing either the integer position 4 or the string
`"message"` to the `index_col` argument:

In [9]:
names=["a", 'b', 'c', 'd', 'message']
pd.read_csv("../pydata-book/examples/ex2.csv", names=names, index_col="message")

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
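
The text mentions that `index_col` accepts either a position or a label, but only the label form is shown. The following is a minimal sketch of both forms, assuming the three-row contents of `ex2.csv` shown earlier are pasted into an in-memory string (`csv_text` is an illustrative name, not part of the original notebook):

```python
import io

import pandas as pd

# Inline copy of the headerless ex2.csv contents shown earlier.
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# Select the index column by label ...
by_label = pd.read_csv(io.StringIO(csv_text), names=names, index_col="message")

# ... or by zero-based position (position 4 is the "message" column).
by_position = pd.read_csv(io.StringIO(csv_text), names=names, index_col=4)

# Both calls should produce the same DataFrame.
print(by_label.equals(by_position))
```
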


The `index_col` argument also accepts a list of column names or
positions, which produces a hierarchical index:

In [10]:
!cat ../pydata-book/examples/csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [12]:
parsed = pd.read_csv("../pydata-book/examples/csv_mindex.csv", index_col=["key1", "key2"])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


Sometimes the delimiter is something other than a comma. Consider
the following file:

In [13]:
!cat ../pydata-book/examples/ex3.txt

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491


Although one could clean it up by hand, we can instead pass a regular
expression such as `\s+` as the separator, which matches any run of whitespace:

In [14]:
result = pd.read_csv("../pydata-book/examples/ex3.txt", sep=r"\s+")
result


Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


Because the header line has one fewer field than the data rows,
pandas infers that the first column is the index.
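
As a small self-contained sketch of this behavior, the snippet below parses two lines of the whitespace-delimited file from an in-memory string (the name `txt` is illustrative). Writing the pattern as a raw string, `r"\s+"`, keeps Python from treating the backslash as an escape sequence:

```python
import io

import pandas as pd

# First lines of the whitespace-delimited ex3.txt shown earlier.
txt = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# The raw-string regex matches any run of whitespace between fields.
frame = pd.read_csv(io.StringIO(txt), sep=r"\s+")

# The header has three names but each row has four fields,
# so the first field of each row becomes the index.
print(frame.index.tolist())
print(frame.columns.tolist())
```
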

One useful parsing argument is `skiprows`, which accepts
a list of row numbers to skip:

In [1]:
!cat ../pydata-book/examples/ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [4]:
pd.read_csv("../pydata-book/examples/ex4.csv", skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
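
When every unwanted line starts with the same marker, the `comment` argument of `read_csv` is an alternative to listing row numbers. This is a sketch under the assumption that `#` never appears inside real data values, since everything after the marker on a line would be discarded as well. The file contents are inlined as `csv_text` for illustration:

```python
import io

import pandas as pd

# Inline copy of ex4.csv as shown earlier.
csv_text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip the noisy lines by their zero-based line numbers ...
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2, 3])

# ... or ignore every line that begins with the comment character.
commented = pd.read_csv(io.StringIO(csv_text), comment="#")

print(skipped.equals(commented))
```
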


*Handling missing data* deserves special attention. Missing data is usually
represented in a CSV file by a placeholder, such as NULL or NONE, or
simply by an empty string. pandas recognizes a default set of commonly
occurring sentinels as missing values:

In [5]:
!cat ../pydata-book/examples/ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In this file, the `NA` entry and the empty field are both read as missing values:

In [6]:
result = pd.read_csv("../pydata-book/examples/ex5.csv")
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


We can disable the default NA sentinels with the `keep_default_na=False`
argument, and then list only the strings we want treated as missing in `na_values`:

In [8]:
result2 = pd.read_csv("../pydata-book/examples/ex5.csv", keep_default_na=False)
result2

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,NA
1,two,5,6,,8,world
2,three,9,10,11,12,foo


In [9]:
result2.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [10]:
result3 = pd.read_csv("../pydata-book/examples/ex5.csv", keep_default_na=False, na_values=["NA"])
result3

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,
1,two,5,6,,8,world
2,three,9,10,11,12,foo


Finally, we can specify different sentinels for each column by passing a dictionary:

In [11]:
sentinels = {"message" : ["foo", 'NA'], 'something' : ['two']}
pd.read_csv('../pydata-book/examples/ex5.csv', keep_default_na=False, na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,
1,,5,6,,8,world
2,three,9,10,11,12,
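
Because an empty string and a missing value look identical in printed output, counting missing cells is a more reliable way to compare these settings. The sketch below inlines the contents of `ex5.csv` shown earlier (the variable names are illustrative) and counts the missing values under each configuration:

```python
import io

import pandas as pd

# Inline copy of ex5.csv as shown earlier.
csv_text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default sentinels: both "NA" and the empty field count as missing.
default = pd.read_csv(io.StringIO(csv_text))

# No sentinels at all: "NA" and "" are kept as ordinary strings.
no_defaults = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Only "NA" is a sentinel: the empty field stays an empty string.
only_na = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=["NA"]
)

for label, frame in [
    ("default", default),
    ("keep_default_na=False", no_defaults),
    ("na_values=['NA']", only_na),
]:
    # Total number of missing cells in the whole frame.
    print(label, int(frame.isna().sum().sum()))
```
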


## Reading Text Files in Pieces

To process very large files, such as those common in
biological data, we may wish to read only a small part of
the file or to iterate through it in smaller chunks.

Before looking at a large file, we make the pandas display settings more compact:

In [10]:
pd.options.display.max_rows = 10

Now, reading the whole file displays only a truncated view:

In [3]:
result = pd.read_csv("../pydata-book/examples/ex6.csv")
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


To read only a small number of rows, we pass the
`nrows` argument:

In [4]:
pd.read_csv("../pydata-book/examples/ex6.csv", nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, we use the `chunksize` argument:

In [12]:
chunker = pd.read_csv("../pydata-book/examples/ex6.csv", chunksize=1000)
type(chunker)

pandas.io.parsers.readers.TextFileReader

This object is iterable. Let's say we want to iterate through it, counting
the number of times each key appears:

In [13]:
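# Start from an empty Series and accumulate the counts chunk by chunk.
tot = pd.Series([], dtype='int64')

for piece in chunker:
    # fill_value=0 treats a key missing from either side as a zero count.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
tot[:10]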

key
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

The `TextFileReader` object also has a `get_chunk` method that enables you
to read pieces of an arbitrary size.
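
A minimal sketch of `get_chunk`, assuming a small made-up dataset built in memory (`csv_text`, its `key` and `value` columns, and the chunk sizes are all illustrative). Passing `iterator=True` returns a `TextFileReader` without fixing a chunk size in advance:

```python
import io

import pandas as pd

# Hypothetical 10-row dataset, generated only for this illustration.
csv_text = "key,value\n" + "".join(f"k{i % 3},{i}\n" for i in range(10))

# The reader can be used as a context manager so the underlying
# handle is closed when we are done.
with pd.read_csv(io.StringIO(csv_text), iterator=True) as reader:
    first = reader.get_chunk(3)   # the first 3 rows
    second = reader.get_chunk(4)  # the next 4 rows, continuing from there

print(first["value"].tolist())
print(second["value"].tolist())
```
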

## Writing Data to Text Format

Data can also be exported to a delimited format. Let's consider one of the files
used in the earlier examples:


In [2]:
data = pd.read_csv("../pydata-book/examples/ex5.csv")
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


To export it with a custom delimiter:

In [3]:
data.to_csv('../pydata-book/examples/out.csv', sep='|')

In [4]:
!cat '../pydata-book/examples/out.csv'

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


By default, missing values are written as empty strings. If
we wish to use another sentinel value, `na_rep` comes
to the rescue!

In [5]:
import sys  # write to sys.stdout to avoid creating a file
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


By default, both row and column labels are written. We
can disable either or both:

In [6]:
data.to_csv(sys.stdout, index=False, header=False) 

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


You may also write a subset of the columns in a chosen order:

In [7]:
data.to_csv(sys.stdout, index=False, columns=['c', 'message', 'd']) 

c,message,d
3.0,,4
,world,8
11.0,foo,12
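
These options can be combined, and calling `to_csv` without a path returns the text as a string, which is handy for quick checks. The sketch below uses a small made-up frame (`frame` and its values are illustrative, loosely modeled on `ex5.csv`) and reads the text back to confirm that the `NULL` sentinel is recognized as missing again:

```python
import io

import numpy as np
import pandas as pd

# Small illustrative frame with one missing value in each of two columns.
frame = pd.DataFrame(
    {
        "something": ["one", "two"],
        "c": [3.0, np.nan],
        "message": [np.nan, "world"],
    }
)

# With no path argument, to_csv returns the CSV text as a string.
text = frame.to_csv(sep="|", index=False, na_rep="NULL")
print(text)

# "NULL" is one of the default NA sentinels, so reading the text back
# restores the missing values.
restored = pd.read_csv(io.StringIO(text), sep="|")
print(restored.isna())
```
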


## JSON Data

Python has a built-in `json` library, which we will use to parse an
example JSON string:

In [8]:
obj = """ {"name": "Wes",  "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],  "pet": null,  "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},  {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}] } """

In [9]:
import json

result = json.loads(obj)

result

{'name': 'Wes',
 'cities_lived': ['Akron', 'Nashville', 'New York', 'San Francisco'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 34, 'hobbies': ['guitars', 'soccer']},
  {'name': 'Katie', 'age': 42, 'hobbies': ['diving', 'art']}]}

`json.dumps()` is the function that converts a Python object
back to JSON.
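
As a brief sketch of the round trip, the snippet below serializes a small dictionary (`record` is an illustrative, shortened version of the object above) and parses it back. Note how Python's `None` becomes JSON `null`:

```python
import json

# A shortened, illustrative version of the object parsed above.
record = {"name": "Wes", "pet": None, "cities_lived": ["Akron", "Nashville"]}

# Serialize the Python object to a JSON string.
as_json = json.dumps(record)
print(as_json)

# Parsing the string again gives back an equal Python object.
round_tripped = json.loads(as_json)
print(round_tripped == record)
```
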

How to convert a JSON object to a DataFrame is up to the user. One convenient
way is to pass a list of dictionaries (which were previously JSON objects) to the
DataFrame constructor and select a subset of the data fields:

In [10]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'hobbies'])
siblings

Unnamed: 0,name,hobbies
0,Scott,"[guitars, soccer]"
1,Katie,"[diving, art]"


pandas has a `read_json` function that automatically converts JSON
datasets, provided the JSON is "well-behaved". In the
following example we can see that, by default, `read_json` assumes each
object in the JSON array is a row of the table:

In [11]:
!cat ../pydata-book/examples/example.json

[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]


In [12]:
data = pd.read_json("../pydata-book/examples/example.json")
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


To convert back to JSON, pandas also has a `to_json()` method:

In [13]:
data.to_json(sys.stdout)

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}

In [14]:
data.to_json(sys.stdout, orient='records')

[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
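
The `orient` argument matters when the JSON has to be read back. Below is a minimal round-trip sketch, assuming the same three-row table as above rebuilt in memory (`records_json` and `restored` are illustrative names). Called without a path, `to_json` returns a string:

```python
import io

import pandas as pd

# Same values as the example.json table above, rebuilt in memory.
data = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# One JSON object per row.
records_json = data.to_json(orient="records")
print(records_json)

# Read the string back, wrapping it in StringIO so pandas treats it
# as file contents rather than a path.
restored = pd.read_json(io.StringIO(records_json), orient="records")
print(restored.equals(data))
```
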

## XML and HTML: Web Scraping

Python has many libraries for reading and writing data in HTML
and XML formats. pandas leverages these libraries to make it
possible to efficiently extract data from HTML with
the `read_html()` function. It has many options, but its default
behavior is to search for and parse the tabular data inside `<table>` tags,
returning a list of DataFrames.

Let's analyse an example provided by the book:

In [15]:
tables = pd.read_html("../pydata-book/examples/fdic_failed_bank_list.html")
len(tables)

1

In [17]:
failures = tables[0]
failures.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"


As we will learn in the next chapters, from here it is possible to do some
data cleaning and analysis, for example counting the bank failures
per year:

In [18]:
close_timestamps = pd.to_datetime(failures["Closing Date"])
close_timestamps.dt.year.value_counts()

Closing Date
2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: count, dtype: int64
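
The same computation can be tried without the HTML file. The sketch below uses three closing dates from the table above plus one made-up date (`"October 2, 2015"`, added only so that two years appear), and passes an explicit `format`, which is an assumption about how the dates are written rather than something the original notebook specifies:

```python
import pandas as pd

# Three dates from the table above and one hypothetical date (the last).
closing = pd.Series(
    ["September 23, 2016", "August 19, 2016", "May 6, 2016", "October 2, 2015"],
    name="Closing Date",
)

# Parse month-name dates; the explicit format avoids any guessing.
stamps = pd.to_datetime(closing, format="%B %d, %Y")

# Extract the year from each timestamp and count occurrences.
per_year = stamps.dt.year.value_counts()
print(per_year)
```
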

pandas also has a `read_xml()` function, which makes reading XML files
a lot easier than parsing them by hand with the `lxml` library. The example
table is on page 192 of the book.
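
As a rough sketch of `read_xml`, the snippet below parses a small made-up XML document from memory (the `<data>`/`<row>` layout and its values are hypothetical, not the book's example). It passes `parser="etree"` so that only the standard library parser is needed. This assumes a pandas version that includes `read_xml`:

```python
import io

import pandas as pd

# Hypothetical XML document: each <row> element becomes one table row.
xml_text = """<data>
  <row><a>1</a><b>2</b><c>3</c></row>
  <row><a>4</a><b>5</b><c>6</c></row>
</data>
"""

# parser="etree" uses Python's built-in XML parser instead of lxml.
frame = pd.read_xml(io.StringIO(xml_text), parser="etree")
print(frame)
```
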