# Chapter 6. Data Loading, Storage, and File Formats
<a id='index'></a>
## Table of Contents
- [6.1 Reading and Writing Data in Text Format](#61)

## 6.1 Reading and Writing Data in Text Format
<a id='61'></a>
Categories of parsing options:
- Indexing
- Type inference and data conversion
- Datetime parsing
- Iterating
- Unclean data issues

In [11]:
import pandas as pd

In [13]:
try: 
    df = pd.read_csv('examples/ex1.csv')
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [15]:
try: 
    df = pd.read_table('examples/ex1.csv', sep=',')
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
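
> The two cells above produce the same table because ***read_table*** with **sep=','** parses the file the same way ***read_csv*** does. A minimal sketch of that equivalence follows. It assumes an inline string (`csv_text`, a hypothetical stand-in for `examples/ex1.csv`) so that it runs without the example files.

```python
import io

import pandas as pd

# Hypothetical stand-in for examples/ex1.csv, reconstructed from the output above
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# read_csv defaults to a comma delimiter; read_table defaults to a tab,
# so the comma has to be passed explicitly
from_csv = pd.read_csv(io.StringIO(csv_text))
from_table = pd.read_table(io.StringIO(csv_text), sep=",")

# Both calls yield identical DataFrames
print(from_csv.equals(from_table))  # True
```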


> A file will not always have a header row. Pass **header=None** to let pandas assign default integer column names:

In [16]:
try: 
    df = pd.read_csv('examples/ex2.csv', header=None)
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [17]:
# Alternatively, specify the column names yourself
try: 
    df = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
except FileNotFoundError as e:
    print(e)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [18]:
# To use one of the columns as the index, pass its name to index_col
try: 
    df = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')
except FileNotFoundError as e:
    print(e)
df

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
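
> The three ways of reading a headerless file shown above can be compared side by side. This is a sketch only: `raw` is an assumed inline stand-in for `examples/ex2.csv`, not the file itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for examples/ex2.csv (no header row)
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# header=None: pandas labels the columns 0..4
default_names = pd.read_csv(io.StringIO(raw), header=None)

# names=...: supply the labels yourself
named = pd.read_csv(io.StringIO(raw), names=names)

# index_col: promote one of the named columns to the row index
indexed = pd.read_csv(io.StringIO(raw), names=names, index_col="message")

print(list(default_names.columns))
print(list(named.columns))
print(indexed.loc["world", "b"])  # look up a row by its message label
```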


In [19]:
parsed = pd.read_csv('examples/csv_mindex.csv', index_col=['key1', 'key2'])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
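
> Passing a list of column names to **index_col** builds a hierarchical index. The sketch below assumes a shortened inline version of `examples/csv_mindex.csv` (the variable `text` is hypothetical) and shows how rows can then be selected by one or both key levels.

```python
import io

import pandas as pd

# Shortened, hypothetical stand-in for examples/csv_mindex.csv
text = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
    "two,b,11,12\n"
)

parsed = pd.read_csv(io.StringIO(text), index_col=["key1", "key2"])

# The row index now has two levels: key1 (outer) and key2 (inner)
print(parsed.index.nlevels)

# Select every row under the outer key "one"
print(parsed.loc["one"])

# Select a single cell with a full (key1, key2) tuple
print(parsed.loc[("two", "b"), "value2"])
```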


In [23]:
# In some cases, a table might not have a fixed delimiter, using whitespace or 
# some other pattern to separate fields. Consider a text file that looks like this
list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

> The fields here are separated by a variable amount of whitespace. Rather than munging the file by hand, you can pass a regular expression as the delimiter to ***read_table***. Here the regular expression ***\s+*** matches one or more whitespace characters:

In [25]:
result = pd.read_table('examples/ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
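
> In the result above, the first column became the index because the header line has one fewer field than the data lines. A self-contained sketch of the same behavior, assuming an inline copy of the first lines of `examples/ex3.txt` (held in the hypothetical variable `text`):

```python
import io

import pandas as pd

# First lines of the whitespace-delimited file shown above
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r"\s+" splits on any run of whitespace; the raw string keeps the
# backslash from being treated as a Python escape sequence
result = pd.read_table(io.StringIO(text), sep=r"\s+")

# Three header fields, four data fields: the extra leading field
# is inferred to be the row index
print(list(result.columns))
print(list(result.index))
```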


In [26]:
# Skip the first, third, and fourth lines of the file (skiprows uses zero-based line numbers)
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
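
> To make the effect of **skiprows** visible, the sketch below uses an assumed inline file whose first, third, and fourth lines are comments. The contents of `text` are hypothetical and are not taken from `examples/ex4.csv`.

```python
import io

import pandas as pd

# Hypothetical file: lines 0, 2, and 3 are comments that should be ignored
text = (
    "# a comment line\n"
    "a,b,c,d,message\n"
    "# another comment\n"
    "# and one more\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
)

# skiprows takes zero-based line numbers; the first line left over
# (line 1) is then used as the header
df = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])
print(df)
```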


In [27]:
# Missing values are either absent (an empty string) or marked by a sentinel value
result = pd.read_csv('examples/ex5.csv')
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [28]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False
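
> The `3.0` and `11.0` in column `c` above appear because a numeric column that contains a missing value is read as floating point. A small sketch of both kinds of missing value, assuming an inline stand-in for `examples/ex5.csv` in which the first `message` is the sentinel string `NA` and one `c` field is empty:

```python
import io

import pandas as pd

# Hypothetical stand-in for examples/ex5.csv
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

result = pd.read_csv(io.StringIO(text))

# Both the "NA" sentinel and the empty field are read as NaN
mask = result.isna()
print(mask.sum())  # count of missing values per column

# The missing value forces column c to a float dtype
print(result["c"].dtype)
```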


> The **na_values** option accepts a list or set of strings to treat as missing values:

In [29]:
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
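
> The output above is unchanged because pandas already treats strings such as `NA` and `NULL` as missing by default. The sketch below therefore uses a made-up sentinel, `unknown`, in an assumed inline file to show what **na_values** adds.

```python
import io

import pandas as pd

# Hypothetical data using a custom sentinel string
text = "something,a,message\none,1,unknown\ntwo,2,world\n"

# Without na_values, "unknown" is kept as an ordinary string
plain = pd.read_csv(io.StringIO(text))

# With na_values, "unknown" is read as NaN in addition to the defaults
flagged = pd.read_csv(io.StringIO(text), na_values=["unknown"])

print(plain["message"].tolist())
print(flagged["message"].isna().tolist())
```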


> Different NA sentinels can be specified for each column in a dict:

In [30]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

pd.read_csv('examples/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
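
> With a dict, each sentinel applies only to the column it is listed under. A self-contained sketch of the call above, again assuming an inline stand-in (`text`, hypothetical) for `examples/ex5.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for examples/ex5.csv
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Keys are column names; values are the strings to treat as missing there
sentinels = {"message": ["foo", "NA"], "something": ["two"]}

result = pd.read_csv(io.StringIO(text), na_values=sentinels)

# "foo" is missing in message, "two" is missing in something,
# and other values in those columns are untouched
print(result[["something", "message"]])
```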


[Back to top](#index)