<h3>Reading and Writing Data in Text Format</h3>

Function Description<br>
read_csv Load delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter<br>
read_table Load delimited data from a file, URL, or file-like object. Uses a tab ('\t') as the default delimiter<br>
read_fwf Read data in fixed-width column format (that is, no delimiters)<br>
read_clipboard Version of read_table that reads data from the clipboard. Useful for converting tables from web pages<br>

In [16]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv('ex1.csv')

In [8]:
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [9]:
pd.read_table('ex1.csv', sep=',')


Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [23]:
pd.read_csv('ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [24]:
pd.read_csv('ex2.csv', names=['a','b','c','d','message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [25]:
names = ['a', 'b', 'c', 'd', 'message']

In [26]:
pd.read_csv('ex2.csv', names=names, index_col='message')

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
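
The header, names, and index_col options shown above can be tried without any file on disk. The following is a minimal sketch that assumes a header-less input like ex2.csv; the in-memory string raw is a hypothetical stand-in for that file, not its actual contents.

```python
import io
import pandas as pd

# Hypothetical stand-in for a header-less file such as ex2.csv
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

names = ['a', 'b', 'c', 'd', 'message']

# header=None: pandas assigns integer column labels 0..4
auto = pd.read_csv(io.StringIO(raw), header=None)

# names=...: supply the labels; index_col promotes one column to the row index
indexed = pd.read_csv(io.StringIO(raw), names=names, index_col='message')

print(list(auto.columns))
print(indexed.loc['world', 'b'])
```

Passing a list to index_col instead of a single name would build a hierarchical index, as in the csv_mindex.csv cell.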


In [28]:
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])

In [29]:
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [30]:
list(open('ch06/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The fields in this file are separated by a variable amount of whitespace, so a
single-character delimiter will not work. The regular expression \s+ matches one or
more whitespace characters and can be passed as the separator:

In [31]:
result = pd.read_csv('ch06/ex3.csv', sep=r'\s+')  # raw string avoids an invalid-escape warning

In [32]:
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
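
As a self-contained sketch of the whitespace-delimited case, the snippet below parses two of the lines printed earlier from an in-memory string (an assumption made so the example runs without the ch06 files). Because the header row has one fewer field than the data rows, read_csv treats the first column as the index.

```python
import io
import pandas as pd

# Two data rows copied from the listing above, held in memory instead of a file
raw = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r'\s+' matches any run of whitespace between fields
result = pd.read_csv(io.StringIO(raw), sep=r'\s+')

print(result.index.tolist())
print(result.columns.tolist())
```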


In [33]:
pd.read_csv('ch06/ex4.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
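
The contents of ex4.csv are not shown here, so the sketch below uses a hypothetical file body with stray comment lines at rows 0, 2, and 3 to illustrate what skiprows=[0, 2, 3] does. When the unwanted lines share a marker character, the comment option is an alternative worth considering.

```python
import io
import pandas as pd

# Hypothetical file body: rows 0, 2 and 3 are noise, row 1 is the header
raw = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip the noise rows by position (0-based line numbers)
skipped = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

# Alternative: treat everything after '#' as a comment
commented = pd.read_csv(io.StringIO(raw), comment='#')

print(skipped.columns.tolist(), len(skipped))
```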


In [34]:
result = pd.read_csv('ch06/ex5.csv')

In [35]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [36]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The na_values option takes a list or set of strings to treat as missing values, in addition to the defaults:

In [37]:
result = pd.read_csv('ch06/ex5.csv', na_values=['NULL'])

In [38]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Different NA sentinels can be specified for each column in a dict:

In [39]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

In [40]:
pd.read_csv('ch06/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
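
To see the per-column sentinels at work without the ch06 files, here is a small sketch. The string raw is an assumed reconstruction of a file like ex5.csv (the real file may differ); the default missing-value markers such as the empty string and NA still apply on top of the dict.

```python
import io
import pandas as pd

# Assumed reconstruction of a file like ex5.csv
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Each key names a column; each value lists extra strings to read as NaN there
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
by_col = pd.read_csv(io.StringIO(raw), na_values=sentinels)

print(by_col.isnull().sum())
```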


Argument Description<br>
path String indicating filesystem location, URL, or file-like object<br>
sep or delimiter Character sequence or regular expression to use to split fields in each row<br>
header Row number to use as column names. Defaults to 0 (first row), but should be None if there is no header
row<br>
index_col Column numbers or names to use as the row index in the result. Can be a single name/number or a list
of them for a hierarchical index<br>
names List of column names for result, combine with header=None<br>
skiprows Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip<br>
na_values Sequence of values to replace with NA<br>
comment Character or characters to split comments off the end of lines<br>
parse_dates Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise
can specify a list of column numbers or names to parse. If element of list is tuple or list, will combine
multiple columns together and parse to date (for example if date/time split across two columns)<br>
keep_date_col If joining columns to parse date, keep the joined columns. Default False<br>
converters Dict mapping column numbers or names to functions. For example {'foo': f} would apply
the function f to all values in the 'foo' column<br>
dayfirst When parsing potentially ambiguous dates, treat as international format (e.g. 7/6/2012 -> June 7,
2012). Default False<br>
date_parser Function to use to parse dates<br>
nrows Number of rows to read from beginning of file<br>
iterator Return a TextFileReader object for reading file piecemeal<br>
chunksize For iteration, size of file chunks<br>
skipfooter Number of lines to ignore at end of file (spelled skip_footer in older pandas releases)<br>
verbose Print various parser output information, like the number of missing values placed in non-numeric
columns<br>
encoding Text encoding for unicode. For example 'utf-8' for UTF-8 encoded text<br>
squeeze If the parsed data only contains one column return a Series<br>
thousands Separator for thousands, e.g. ',' or '.'<br>
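
A few of the options in the table can be combined in one call. The sketch below is illustrative only: the column names date, amount, and code and the data values are hypothetical, chosen to exercise parse_dates, thousands, converters, and nrows together.

```python
import io
import pandas as pd

# Hypothetical data: quoted amounts use ',' as a thousands separator
raw = (
    'date,amount,code\n'
    '2012-06-07,"1,234",ab\n'
    '2012-06-08,"2,500",cd\n'
    '2012-06-09,"10,000",ef\n'
)

df = pd.read_csv(
    io.StringIO(raw),
    parse_dates=['date'],            # parse this column to datetime
    thousands=',',                   # strip thousands separators from numbers
    converters={'code': str.upper},  # apply a function to every value in 'code'
    nrows=2,                         # read only the first two data rows
)

print(df.dtypes)
print(df)
```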

<h3>Reading Text Files in Pieces</h3>

In [2]:
result = pd.read_csv('ch06/ex6.csv')
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.817480,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.358480,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.637830,2.172201,G


If you want to read only a small number of rows (avoiding reading the entire file),
specify that with nrows:

In [4]:
pd.read_csv('ch06/ex6.csv', nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read out a file in pieces, specify a chunksize as a number of rows:

In [5]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)

In [6]:
chunker

<pandas.io.parsers.TextFileReader at 0x7faf5d4869b0>

The TextFileReader object returned by read_csv allows you to iterate over the parts of the
file according to the chunksize. For example, we can iterate over ex6.csv, aggregating
the value counts in the 'key' column like so:

In [9]:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)

tot = pd.Series([], dtype='float64')  # explicit dtype avoids the empty-Series dtype warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

In [10]:
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
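
The same chunked aggregation can be sketched on a tiny in-memory input, which makes the result easy to check by hand. The seven-row key column below is hypothetical data, and chunksize=3 is chosen only so that the loop runs more than once.

```python
import io
import pandas as pd

# Hypothetical one-column input: A appears 3 times, B and C twice each
raw = "key\n" + "\n".join(["A", "B", "A", "C", "A", "B", "C"]) + "\n"

# chunksize=3 yields DataFrames of at most 3 rows each
chunker = pd.read_csv(io.StringIO(raw), chunksize=3)

tot = pd.Series([], dtype='float64')
for piece in chunker:
    # fill_value=0 treats keys missing from either side as a count of zero
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)
```

The totals come back as floats because add with fill_value works through NaN alignment before filling.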

<h3>Writing Data Out to Text Format</h3>

In [13]:
data = pd.read_csv('ch06/ex5.csv', nrows=5)
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:

In [20]:
data.to_csv('ch06/out1.csv')

In [21]:
!cat ch06/out1.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Other delimiters can be used, of course (writing to sys.stdout so it just prints the text
result):

In [24]:
import sys
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


Missing values appear as empty strings in the output. You might want to denote them
by some other sentinel value:

In [26]:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


With no other options specified, both the row and column labels are written. Both of
these can be disabled:

In [28]:
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


You can also write only a subset of the columns, and in an order of your choosing:

In [31]:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0
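
The to_csv options above can also be combined, and the output captured in memory rather than printed. This is a minimal sketch; the small DataFrame is hypothetical, and io.StringIO is used here as one convenient file-like target.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
data = pd.DataFrame({
    'something': ['one', 'two'],
    'a': [1, 5],
    'c': [3.0, np.nan],
})

buf = io.StringIO()
# Pipe-delimited, NaN written as NULL, no row labels, two columns only
data.to_csv(buf, sep='|', na_rep='NULL', index=False, columns=['a', 'c'])

lines = buf.getvalue().splitlines()
print(lines)
```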


Series also has a to_csv method:

In [33]:
dates = pd.date_range('1/1/2000', periods=7)

In [35]:
ts = pd.Series(np.arange(7), index=dates)

In [36]:
ts.to_csv('ch06/tseries1.csv')

In [37]:
!cat ch06/tseries1.csv

2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6


With a bit of wrangling (no header, first column as index), you can read a CSV version
of a Series with read_csv. There is also a from_csv convenience method that makes
it a bit simpler, although it is deprecated in later pandas releases in favor of read_csv:

In [39]:
pd.Series.from_csv('ch06/tseries1.csv', parse_dates=True)


2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
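
The read_csv route mentioned above (no header, first column as index) might look like the following sketch. It assumes a two-column, header-less layout like the tseries1.csv listing and reads it from an in-memory string so that it runs on its own.

```python
import io
import pandas as pd

# Same layout as the tseries1.csv listing: date,value with no header row
raw = "2000-01-01,0\n2000-01-02,1\n2000-01-03,2\n"

frame = pd.read_csv(
    io.StringIO(raw),
    header=None,       # the file has no column names
    index_col=0,       # first column becomes the index
    parse_dates=True,  # parse the index as dates
)

# A one-column DataFrame; take that column to get a Series
ts = frame.iloc[:, 0]
print(ts)
```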

<h3>Manually Working with Delimited Formats</h3>

In [3]:
import csv

In [4]:
f = open('ch06/ex7.csv')

In [5]:
reader = csv.reader(f)

In [6]:
for line in reader:
    print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']


In [7]:
lines = list(csv.reader(open('ch06/ex7.csv')))

In [8]:
header, values = lines[0], lines[1:]

In [9]:
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
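
The same steps can be written with a context manager so the file handle is closed promptly. The sketch below reads from an in-memory string that mirrors the three rows printed above. Note that zip stops at the shortest row, so the extra '4' on the last line is silently dropped.

```python
import csv
import io

# Mirrors the rows printed above; the last row has one extra field
raw = "a,b,c\n1,2,3\n1,2,3,4\n"

# The with-block closes the handle as soon as the rows are read
with io.StringIO(raw) as f:
    lines = list(csv.reader(f))

header, values = lines[0], lines[1:]

# zip(*values) transposes rows into columns, truncating to the shortest row
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
```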

Argument Description<br>
delimiter One-character string to separate fields. Defaults to ','.<br>
lineterminator Line terminator for writing, defaults to '\r\n'. Reader ignores this and recognizes
cross-platform line terminators.<br>
quotechar Quote character for fields with special characters (like a delimiter). Default is '"'.<br>
quoting Quoting convention. Options include csv.QUOTE_ALL (quote all fields),
csv.QUOTE_MINIMAL (only fields with special characters like the delimiter),
csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting). See Python’s
documentation for full details. Defaults to QUOTE_MINIMAL.<br>
skipinitialspace Ignore whitespace after each delimiter. Default False.<br>
doublequote How to handle quoting character inside a field. If True, it is doubled. See online
documentation for full detail and behavior.<br>
escapechar String to escape the delimiter if quoting is set to csv.QUOTE_NONE. Disabled by
default<br>
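
One way to bundle the options in this table is a subclass of csv.Dialect. The sketch below is illustrative: the class name my_dialect and the semicolon delimiter are arbitrary choices, not something the csv module prescribes.

```python
import csv
import io

class my_dialect(csv.Dialect):
    # Illustrative settings; any option from the table can be set here
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

buf = io.StringIO()
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
# The middle field contains the delimiter, so QUOTE_MINIMAL quotes it
writer.writerow(('1', 'has;semicolon', '3'))

text = buf.getvalue()
rows = list(csv.reader(io.StringIO(text), dialect=my_dialect))
print(text)
print(rows)
```

Individual options can also be passed as keyword arguments to csv.reader or csv.writer without defining a class.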

<h3>JSON Data</h3>

In [10]:
import json

In [11]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""

In [12]:
result = json.loads(obj)

In [13]:
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'},
  {'name': 'Katie', 'age': 33, 'pet': 'Cisco'}]}

In [14]:
asjson = json.dumps(result)

In [17]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])

In [18]:
siblings

Unnamed: 0,name,age
0,Scott,25
1,Katie,33
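
As a compact recap of the cells above, the following sketch round-trips a shortened version of the same JSON document through json.loads and json.dumps, then builds the DataFrame from the list of sibling dicts. The shortened string is an assumption made to keep the example brief.

```python
import json
import pandas as pd

# Shortened version of the JSON document used above
obj = """
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""

result = json.loads(obj)        # JSON text -> Python dicts, lists, None
asjson = json.dumps(result)     # Python object -> JSON text
roundtrip = json.loads(asjson)  # parse again to confirm nothing was lost

# A list of dicts maps to rows; columns= selects and orders the fields
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
print(siblings)
```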


<h3>XML and HTML: Web Scraping</h3>