# Chapter 6: Data Loading, Storage, and File Formats

Overview:
    * Reading and Writing Data in Text Format
    * Binary Data Formats
    * Interacting with HTML and Web APIs
    * Interacting with Databases

In [4]:
import pandas as pd
from pandas import DataFrame, Series

In [5]:
import numpy as np

# Reading and writing data in text format

Pandas provides several functions for reading tabular data into a DataFrame:

| Function | Description |
|----------|-------------|
| read_csv | Load delimited data from a file, URL, or file-like object. Use comma as default delimiter|
| read_table | Load delimited data from a file, URL, or file-like object. Use tab ( '\t' ) as default delimiter |
| read_fwf | Read data in fixed-width column format (that is, no delimiters)|
| read_clipboard | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages|
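
The table lists **read_fwf**, which is not demonstrated elsewhere in this chapter. The sketch below is only an illustration: the text and column widths are made up, and it is written for Python 3 (`io.StringIO` stands in for a file on disk).

```python
import io
import pandas as pd

# Made-up fixed-width text: the columns are 5, 6 and 5 characters wide.
fwf_text = (
    "id   name  score\n"
    "1    ann   3.5\n"
    "2    bob   4.0\n"
)

# widths gives the character width of each field; the first line is the header.
df_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[5, 6, 5])
print(df_fwf)
```

When the fields are not contiguous, the **colspecs** argument (a list of start/end offsets) can be used instead of **widths**.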

* **read_csv**: use comma as default delimiter

In [6]:
df =pd.read_csv('data/ex1.csv')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


* **read_table**: uses tab as the default delimiter, which can be changed with **delimiter** (or **sep**)

In [10]:
df = pd.read_table('data/ex1.csv', delimiter=',')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Some files have no header row. In that case we can let pandas assign default column names:

In [13]:
df = pd.read_csv('data/ex2.csv', header=None)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Or we can specify names ourselves

In [14]:
df = pd.read_csv('data/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose you wanted the **message** column to be the index of the returned DataFrame.
You can either indicate you want the column at index 4 or named 'message' using the
**index_col** argument:

In [15]:
names = ['a', 'b', 'c', 'd', 'message']
names

['a', 'b', 'c', 'd', 'message']

In [18]:
pd.read_csv('data/ex2.csv', names=names, index_col='message')

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In the event that you want to form a **hierarchical index** from multiple columns, just
pass a list of column numbers or names:

We have a **csv_minindex.csv** file:
```
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
```

In [19]:
parsed = pd.read_csv('data/csv_minindex.csv', index_col=['key1', 'key2'])
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
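
Once the hierarchical index is in place, rows can be selected by the outer key or by a full (key1, key2) pair. A minimal sketch, assuming Python 3 and reading the same content from an in-memory string instead of **data/csv_minindex.csv**:

```python
import io
import pandas as pd

csv_text = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
"""

parsed = pd.read_csv(io.StringIO(csv_text), index_col=['key1', 'key2'])

# All rows whose outer key is 'one'; the result is indexed by key2 only.
outer = parsed.loc['one']

# A single cell addressed by the full (key1, key2) pair and a column name.
single = parsed.loc[('two', 'c'), 'value1']
print(outer)
print(single)
```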


In some cases the delimiter is not a comma. We can pass a regular expression as the delimiter to **read_table**, such as \s+ for one or more whitespace characters:

In [23]:
df = pd.read_table('data/ex3.csv', sep=r'\s+')
df

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
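
The same idea can be tried without a file on disk. This is a small sketch with made-up values, assuming Python 3. Because the header line has one field fewer than the data lines, pandas uses the first column as the index, which is also what happens in the output above.

```python
import io
import pandas as pd

# Made-up whitespace-separated text with a variable number of spaces.
text = (
    "A B C\n"
    "aaa 1   2  3\n"
    "bbb 4 5     6\n"
)

# A raw string keeps the backslash in the regular expression intact.
df_ws = pd.read_csv(io.StringIO(text), sep=r'\s+')
print(df_ws)
```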


The parser functions have many additional arguments to help you handle the wide
variety of exceptional file formats that occur. For example, you can skip
the first, third, and fourth rows of a file with **skiprows**:

In [25]:
pd.read_csv('data/ex4.csv', skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
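
The content of **data/ex4.csv** is not shown here, so the sketch below assumes a plausible layout in which the skipped rows are comment lines. It is written for Python 3 and compares **skiprows** with the **comment** argument, which drops such lines without needing their positions.

```python
import io
import pandas as pd

# Assumed layout: rows 0, 2 and 3 are comments, row 1 is the header.
text = (
    "# first comment\n"
    "a,b,c,d,message\n"
    "# second comment\n"
    "# third comment\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

by_position = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])
by_marker = pd.read_csv(io.StringIO(text), comment='#')
print(by_position)
```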


**Handling missing data**. By default, pandas treats a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL, as missing values:

In [27]:
result = pd.read_csv('data/ex5.csv')
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [28]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


We can specify different NA sentinels for each column in a dict:

In [29]:
sentinels = {
    'message': ['foo', 'NA'], 
    'something': ['two']
}
pd.read_csv('data/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
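
To see how the per-column sentinels interact with the defaults, here is a small sketch (Python 3, in-memory text modelled on the table above; the exact content of **data/ex5.csv** is an assumption). It also shows **keep_default_na=False**, which turns the built-in sentinels off.

```python
import io
import pandas as pd

# Assumed file content: 'NA' marks the missing message, an empty field the missing c.
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
with_defaults = pd.read_csv(io.StringIO(text), na_values=sentinels)

# With the defaults disabled, only 'foo' is treated as missing in message.
custom_only = pd.read_csv(
    io.StringIO(text), na_values={'message': ['foo']}, keep_default_na=False
)
print(with_defaults.isnull())
```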


## Reading Text Files in Pieces

Suppose we want to read small pieces of a large file, or we want to iterate through smaller chunks of the file. 

In [32]:
df = pd.read_csv('data/ex6.csv')
df

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,...,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
0,1871,BS1,,barnero01,31.0,,31,31.0,0,0,...,16,0,15,0,0,0,0,,,
1,1871,BS1,,barrofr01,18.0,,18,18.0,0,0,...,1,0,0,13,0,4,17,,,
2,1871,BS1,,birdsda01,29.0,,29,29.0,0,7,...,0,0,0,0,0,27,27,,,
3,1871,BS1,,conefr01,19.0,,19,19.0,0,0,...,0,0,0,18,0,1,18,,,
4,1871,BS1,,gouldch01,31.0,,31,31.0,0,0,...,0,0,0,0,0,1,1,,,
5,1871,BS1,,jackssa01,16.0,,16,16.0,0,0,...,14,0,1,0,1,0,1,,,
6,1871,BS1,,mcveyca01,29.0,,29,29.0,0,29,...,0,1,0,0,0,5,5,,,
7,1871,BS1,,schafha01,31.0,,31,31.0,0,0,...,0,31,0,0,0,0,0,,,
8,1871,BS1,,spaldal01,31.0,,31,31.0,31,0,...,0,0,0,0,9,0,9,,,
9,1871,BS1,,wrighge01,16.0,,16,16.0,0,0,...,0,0,15,0,0,0,0,,,


If we want to read only a small number of rows, we can use the **nrows** option:

In [33]:
pd.read_csv('data/ex6.csv', nrows=5)

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,...,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
0,1871,BS1,,barnero01,31,,31,31,0,0,...,16,0,15,0,0,0,0,,,
1,1871,BS1,,barrofr01,18,,18,18,0,0,...,1,0,0,13,0,4,17,,,
2,1871,BS1,,birdsda01,29,,29,29,0,7,...,0,0,0,0,0,27,27,,,
3,1871,BS1,,conefr01,19,,19,19,0,0,...,0,0,0,18,0,1,18,,,
4,1871,BS1,,gouldch01,31,,31,31,0,0,...,0,0,0,0,0,1,1,,,


If we want to iterate through a file in smaller pieces, we can use the **chunksize** option:

In [34]:
chunker = pd.read_csv('data/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.TextFileReader at 0x7f5e0e949610>

The **TextFileReader** object returned by **read_csv** allows you to iterate over the parts of the
file according to the **chunksize**:

In [35]:
tot = Series([], dtype='float64')

In [37]:
for piece in chunker:
    tot = tot.add(piece['playerID'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)  # Series.order was removed from pandas
tot

mcguide01    31.0
henderi01    29.0
newsobo01    29.0
johnto01     28.0
kaatji01     28.0
moyerja01    27.0
carltst01    27.0
ryanno01     27.0
baineha01    27.0
mulhote01    26.0
niekrph01    26.0
weathda01    26.0
houghch01    26.0
oroscje01    26.0
wilheho01    26.0
dempsri01    25.0
wallabo01    25.0
morgami01    25.0
perryga01    25.0
francju01    25.0
davisha01    25.0
darwida01    25.0
collied01    25.0
reussje01    25.0
eckerde01    25.0
maddugr01    25.0
sierrru01    25.0
bucknbi01    25.0
niekrjo01    25.0
hoytwa01     25.0
             ... 
sweigha01     1.0
headra01      1.0
headje01      1.0
hazledo01     1.0
hawked01      1.0
hatchch01     1.0
hathara01     1.0
hatlema01     1.0
hattijo01     1.0
taffjo01      1.0
haugear01     1.0
haughch01     1.0
haughga01     1.0
hautzch01     1.0
hawblry01     1.0
hawesro01     1.0
taborgr01     1.0
hazewdr01     1.0
hawlesc01     1.0
haworho01     1.0
haydele01     1.0
taberjo01     1.0
hayesji01     1.0
haynefr01     1.0
haynehe01 
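
The chunked aggregation can be reproduced on a tiny in-memory file. This is a sketch under assumptions: Python 3, made-up data, and a chunk size of 2 so that the loop runs several times.

```python
import io
import pandas as pd

# Made-up data: six rows, read two at a time.
csv_text = "key,value\na,1\nb,2\na,3\nc,4\na,5\nb,6\n"
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=2)

tot = pd.Series(dtype='float64')
for piece in chunker:
    # Each piece is an ordinary DataFrame holding at most two rows.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)
```

Only one chunk is held in memory at a time, so the same loop works for files that are too large to load in full.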

## Writing Data Out to Text Format

Data can also be exported to a delimited format:

In [45]:
data = pd.read_csv('data/ex5.csv')
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


* **to_csv**: writes data to a comma-separated file:

In [46]:
data.to_csv('data/out.csv')

pd.read_csv('data/out.csv')

Other delimiters can be used via the **sep** (separator) option:

In [54]:
import sys
data.to_csv(sys.stdout, sep = '|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


If we don't want to write the header and index to the file, we can pass **index=False** and **header=False**:

In [55]:
data.to_csv(sys.stdout, sep=',', index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


We can also write only a subset of the columns, in an order of our choosing:

In [63]:
data.to_csv(sys.stdout, sep=',', index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0
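
Writing to an in-memory buffer makes it easy to inspect exactly what **to_csv** produces. The sketch below assumes Python 3 and rebuilds the frame by hand; **na_rep** replaces the empty string that is written for missing values by default.

```python
import io
import pandas as pd

data = pd.DataFrame({
    'something': ['one', 'two', 'three'],
    'a': [1, 5, 9],
    'b': [2, 6, 10],
    'c': [3.0, None, 11.0],
})

buf = io.StringIO()
# Write only three columns, in the given order, and mark missing values as NULL.
data.to_csv(buf, index=False, columns=['c', 'a', 'b'], na_rep='NULL')
lines_out = buf.getvalue().splitlines()
print(lines_out)
```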


Series also has a **to_csv** method:

In [65]:
dates = pd.date_range('1/1/2000', periods=7)
dates

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07'],
              dtype='datetime64[ns]', freq='D')

In [66]:
ts = Series(np.arange(7), index=dates)
ts

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Freq: D, dtype: int64

In [67]:
ts.to_csv('data/tseries.csv')

In [70]:
!cat data/tseries.csv

2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
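
Reading such a file back into a Series takes a few arguments. A minimal round-trip sketch, assuming Python 3 and a recent pandas (where **Series.to_csv** writes a header unless **header=False** is passed):

```python
import io
import pandas as pd

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(range(7), index=dates)

buf = io.StringIO()
ts.to_csv(buf, header=False)  # two columns: date, value
buf.seek(0)

# No header row, first column is the index, and the index holds dates.
restored = pd.read_csv(
    buf, header=None, index_col=0, parse_dates=True
).squeeze("columns")  # one remaining column, so squeeze it into a Series
print(restored)
```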


## Manually Working with Delimited Formats

> Most forms of tabular data can be loaded from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary.
It’s not uncommon to receive a file with one or more malformed lines that trip up
read_table.

For any file with a single-character delimiter, you can use Python’s built-in csv module.
To use it, pass any open file or file-like object to **csv.reader** :

In [72]:
import csv

In [75]:
f = open('data/ex7.csv')
reader = csv.reader(f)
for line in reader:
    print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']


From there, it’s up to you to do the wrangling necessary to put the data in the form
that you need it. For example:

In [76]:
lines = list(csv.reader(open('data/ex7.csv')))
lines

[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3', '4']]

In [78]:
header, values = lines[0], lines[1:]
header

['a', 'b', 'c']

In [79]:
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
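
One possible way to deal with the malformed line is to drop rows whose width does not match the header before building the columns. This is only a sketch (Python 3, with the file content inlined as a string); whether ragged rows should be dropped, padded, or truncated depends on the data.

```python
import csv
import io

raw = "a,b,c\n1,2,3\n1,2,3,4\n"
rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

# Keep only the rows that have exactly one value per header field.
clean = [row for row in body if len(row) == len(header)]
rejected = [row for row in body if len(row) != len(header)]

# Transpose the clean rows into one list per column.
columns = {h: list(col) for h, col in zip(header, zip(*clean))}
print(columns)
print(rejected)
```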

## JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web browsers and other applications. It is
a much more flexible data format than a tabular text form like CSV.

In [87]:
obj = """
{
    "name": "Wes",
    "places_lived": ["United States", "Spain", "Germany"],
    "pet": null,
    "siblings": [
        {
            "name": "Scott", 
            "age": 25, 
            "pet": "Zuko"
        },
        {
        "name": "Katie", 
        "age": 33, 
        "pet": "Cisco"
        }
    ]
}
"""
obj

'\n{\n    "name": "Wes",\n    "places_lived": ["United States", "Spain", "Germany"],\n    "pet": null,\n    "siblings": [\n        {\n            "name": "Scott", \n            "age": 25, \n            "pet": "Zuko"\n        },\n        {\n        "name": "Katie", \n        "age": 33, \n        "pet": "Cisco"\n        }\n    ]\n}\n'

In [88]:
import json

* **json.loads()**: converts a JSON string to a Python object

In [96]:
result = json.loads(obj)
result

{u'name': u'Wes',
 u'pet': None,
 u'places_lived': [u'United States', u'Spain', u'Germany'],
 u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},
  {u'age': 33, u'name': u'Katie', u'pet': u'Cisco'}]}

In [97]:
type(result)

dict

* **json.dumps()**: converts a Python object back to JSON

In [99]:
asjson = json.dumps(result)
asjson

'{"pet": null, "siblings": [{"pet": "Zuko", "age": 25, "name": "Scott"}, {"pet": "Cisco", "age": 33, "name": "Katie"}], "name": "Wes", "places_lived": ["United States", "Spain", "Germany"]}'

To convert a **JSON** object or list of objects to a **DataFrame**, pass the parsed list of dicts to the **DataFrame** constructor and select the fields you want:

In [103]:
siblings = DataFrame(result['siblings'], columns=['name', 'age'])
siblings

Unnamed: 0,name,age
0,Scott,25
1,Katie,33


## XML and HTML: Web Scraping
> Python has many libraries for reading and writing data in the ubiquitous HTML and
XML formats. lxml (http://lxml.de) is one that has consistently strong performance in
parsing very large files. lxml has multiple programmer interfaces.

To get started, find the URL you want to extract data from, open it with urllib2 and
parse the stream with lxml like so:

In [104]:
from lxml.html import parse
from urllib2 import urlopen

In [153]:
parsed = parse(urlopen('http://finance.yahoo.com/'))
parsed

<lxml.etree._ElementTree at 0x7f5e0d424518>

In [154]:
doc = parsed.getroot()
doc

<Element html at 0x7f5e0d393aa0>

To extract all **a** (link) tags, we can use **findall**:

In [155]:
links = doc.findall('.//a')
links

[<Element a at 0x7f5e0d393d60>,
 <Element a at 0x7f5e0d393ec0>,
 <Element a at 0x7f5e0d393f70>,
 <Element a at 0x7f5e0d393fc8>,
 <Element a at 0x7f5e0d3ae050>,
 <Element a at 0x7f5e0d3ae0a8>,
 <Element a at 0x7f5e0d3ae100>,
 <Element a at 0x7f5e0d3ae158>,
 <Element a at 0x7f5e0d3ae1b0>,
 <Element a at 0x7f5e0d3ae208>,
 <Element a at 0x7f5e0d3ae260>,
 <Element a at 0x7f5e0d3ae2b8>,
 <Element a at 0x7f5e0d3ae310>,
 <Element a at 0x7f5e0d3ae368>,
 <Element a at 0x7f5e0d3ae3c0>,
 <Element a at 0x7f5e0d3ae418>,
 <Element a at 0x7f5e0d3ae470>,
 <Element a at 0x7f5e0d3ae4c8>,
 <Element a at 0x7f5e0d3ae520>,
 <Element a at 0x7f5e0d3ae578>,
 <Element a at 0x7f5e0d3ae5d0>,
 <Element a at 0x7f5e0d3ae628>,
 <Element a at 0x7f5e0d3ae680>,
 <Element a at 0x7f5e0d3ae6d8>,
 <Element a at 0x7f5e0d3ae730>,
 <Element a at 0x7f5e0d3ae788>,
 <Element a at 0x7f5e0d3ae7e0>,
 <Element a at 0x7f5e0d3ae838>,
 <Element a at 0x7f5e0d3ae890>,
 <Element a at 0x7f5e0d3ae8e8>,
 <Element a at 0x7f5e0d3ae940>,
 <Elemen

In [156]:
lik = links[3]
lik.get('href')

'https://www.tumblr.com/'

In [157]:
lik.text_content()

'Tumblr'

Get all URLs:

In [158]:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls

['https://www.yahoo.com/',
 'https://mail.yahoo.com/?.intl=us&.lang=en-US',
 'https://www.flickr.com/',
 'https://www.tumblr.com/',
 'https://www.yahoo.com/news/',
 'http://sports.yahoo.com/',
 'http://finance.yahoo.com/',
 'https://www.yahoo.com/celebrity/',
 'https://answers.yahoo.com/',
 'https://groups.yahoo.com/',
 'https://mobile.yahoo.com/',
 'http://everything.yahoo.com/',
 'https://www.mozilla.org/firefox/new/?utm_source=yahoo&utm_medium=referral&utm_campaign=y-uh&utm_content=y-finance-try',
 'https://finance.yahoo.com/',
 'https://login.yahoo.com/config/login?.intl=us&.lang=en-US&.src=finance&.done=http%3A%2F%2Ffinance.yahoo.com%2F',
 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance',
 '/',
 '/personal-finance',
 'https://www.yahoo.com/tech',
 '/screener',
 '/portfolios?bypass=true',
 '/chart/^GSPC',
 '/quote/^GSPC?p=^GSPC',
 '/chart/^DJI',
 '/quote/^DJI?p=^DJI',
 '/chart/^IXIC',
 '/quote/^IXIC?p=^IXIC',
 'http://yahoofinanceallmarketssummit.splashthat
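
Several of the extracted links are relative (for example '/screener'). They can be resolved against the page address with **urljoin** from the standard library. The sketch uses the Python 3 module name; in Python 2 the same function lives in the **urlparse** module.

```python
from urllib.parse import urljoin

base = 'http://finance.yahoo.com/'
# A few hrefs taken from the list above: two relative, one already absolute.
hrefs = ['/screener', '/personal-finance', 'https://www.yahoo.com/tech']

# Relative paths are resolved against base; absolute URLs pass through unchanged.
absolute = [urljoin(base, href) for href in hrefs]
print(absolute)
```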

Now, finding the right tables in the document can be a matter of trial and error; some
websites make it easier by giving a table of interest an id attribute

* **findall**: finds all elements matching a tag name

In [178]:
tables = doc.findall('.//table')
calls = tables[0]

In [180]:
rows = calls.findall('.//tr')
rows

[<Element tr at 0x7f5e0d399730>]

In [183]:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
_unpack(rows[0], kind = 'td')

['', 'Search']

* **TextParser**

In [162]:
from pandas.io.parsers import TextParser
def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()
call_data = parse_options_data(calls)  # the function needs the table element to parse
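
The header-row-plus-data-rows idea can be tried on a small literal table without fetching a live page. The sketch below swaps in the standard library's **html.parser** for lxml so that it has no third-party dependencies; the table content and the **TableRows** class are made up for illustration, and the code is written for Python 3.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up table: one header row followed by two data rows.
html = """<table>
<tr><th>Strike</th><th>Symbol</th><th>Last</th></tr>
<tr><td>10.0</td><td>X1</td><td>1.5</td></tr>
<tr><td>20.0</td><td>X2</td><td>0.7</td></tr>
</table>"""

parser = TableRows()
parser.feed(html)
header, data = parser.rows[0], parser.rows[1:]
print(header)
print(data)
```

The resulting **header** and **data** lists have the same shape as the values handed to **TextParser** in the function above.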

## Parsing XML with lxml.objectify

In [189]:
from lxml import objectify
path = 'data/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

'Escalator Availability'

**root.INDICATOR**: returns a generator yielding each `<INDICATOR>` XML element

In [188]:
data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ','DESIRED_CHANGE', 'DECIMAL_PLACES']
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
data

AttributeError: no such child: INDICATOR
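
The record-building loop can be tested on a small literal document. The sketch below uses the standard library's **xml.etree.ElementTree** instead of lxml.objectify, and the XML content is made up (only the tag names follow the code above). Looking up the elements with **findall** returns an empty list when no `<INDICATOR>` child exists, instead of raising an error.

```python
import xml.etree.ElementTree as ET

# Made-up document that follows the tag names used above.
xml_text = """<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>1</INDICATOR_SEQ>
    <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
  </INDICATOR>
  <INDICATOR>
    <INDICATOR_SEQ>2</INDICATOR_SEQ>
    <INDICATOR_NAME>Elevator Availability</INDICATOR_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
  </INDICATOR>
</PERFORMANCE>"""

root = ET.fromstring(xml_text)
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

data = []
for elt in root.findall('INDICATOR'):  # direct children only
    # ElementTree keeps values as text; objectify's pyval would convert them.
    el_data = {child.tag: child.text for child in elt if child.tag not in skip_fields}
    data.append(el_data)
print(data)
```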