# Working With CSV Files

**Learning Objectives:** Learn how to import and export data as CSV files using Pandas.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## About CSV

A CSV file is a simple textual format:

* One row per line
* Column values separated by a delimiter: usually a comma, but sometimes a space, tab, etc.
* CSV = "comma-separated values"

CSV is one of the most common and widely supported formats for exchanging tabular data across programming languages and tools.
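
Because the delimiter is not always a comma, here is a minimal sketch of reading the same table written with two different separators. The inline strings are made-up sample data, and `io.StringIO` is used only so the example does not need files on disk; `sep` is the `read_csv` argument shown in its signature.

```python
import io
import pandas as pd

# Hypothetical sample data: the same table with two different delimiters.
comma_text = "one,76,3.4\ntwo,74,5.6\n"
semicolon_text = "one;76;3.4\ntwo;74;5.6\n"

# sep defaults to ','; pass it explicitly for any other delimiter.
df_comma = pd.read_csv(io.StringIO(comma_text), header=None)
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';', header=None)

# Both calls should produce the same 2x3 table.
same = df_comma.equals(df_semicolon)
print(same)
```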

## A CSV file with no header

Here is a simple CSV file that doesn't have a header that declares the column names:

In [2]:
%%writefile data1.csv
one,76,3.4
two,74,5.6
three,73,2.9

Overwriting data1.csv


We can look at its contents by using the `%pycat` magic:

In [3]:
%pycat data1.csv

one,76,3.4
two,74,5.6
three,73,2.9


CSV files can be read into Pandas `DataFrame` objects using the `read_csv` function. Here is the documentation for `read_csv`:

In [4]:
pd.read_csv?

Signature: pd.read_csv(filepath_or_buffer, sep=',', dialect=None, compression='infer', doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=...

Usually the first step is to try reading the file with no extra arguments:

In [5]:
pd.read_csv('data1.csv')

Unnamed: 0,one,76,3.4
0,two,74,5.6
1,three,73,2.9


In this case, we don't want the first row to be treated as the column names, so we pass `header=None`:

In [6]:
df = pd.read_csv('data1.csv', header=None)
df

Unnamed: 0,0,1,2
0,one,76,3.4
1,two,74,5.6
2,three,73,2.9
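
To make the difference between the two calls concrete, the following sketch compares the column labels and row counts they produce. It reads the same three lines from an in-memory string (an assumption made so the example is self-contained) instead of `data1.csv`.

```python
import io
import pandas as pd

# Same content as data1.csv, held in memory for this sketch.
text = "one,76,3.4\ntwo,74,5.6\nthree,73,2.9\n"

# Default: the first line is consumed as the header.
default = pd.read_csv(io.StringIO(text))
# header=None: every line is data, columns get integer labels.
no_header = pd.read_csv(io.StringIO(text), header=None)

print(list(default.columns), len(default))
print(list(no_header.columns), len(no_header))
```

With the default settings one row of data is silently lost to the header, which is why checking the row count after loading is a useful habit.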


To supply our own column names, pass the `names` argument. If the number of names matches the number of columns, a default integer index will be created:

In [7]:
pd.read_csv('data1.csv', header=None, names=['item','temp','rainfall'])

Unnamed: 0,item,temp,rainfall
0,one,76,3.4
1,two,74,5.6
2,three,73,2.9


If we pass one fewer name than there are columns, the first column will be used as the row index:

In [8]:
pd.read_csv('data1.csv', header=None, names=['temp','rainfall'])

Unnamed: 0,temp,rainfall
one,76,3.4
two,74,5.6
three,73,2.9


We can also use the `index_col` argument to specify which column to use as the index:

In [9]:
pd.read_csv('data1.csv', header=None, names=['item','temp','rainfall'], index_col=1)

temp,item,rainfall
76,one,3.4
74,two,5.6
73,three,2.9
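
As a small sketch, `index_col` can also be given a column label instead of a position, which tends to be easier to read. The data is again held in an in-memory string for the sake of a self-contained example.

```python
import io
import pandas as pd

# Same content as data1.csv, held in memory for this sketch.
text = "one,76,3.4\ntwo,74,5.6\nthree,73,2.9\n"
names = ['item', 'temp', 'rainfall']

# Select the index column by position...
by_position = pd.read_csv(io.StringIO(text), header=None, names=names, index_col=1)
# ...or by the label supplied through names.
by_name = pd.read_csv(io.StringIO(text), header=None, names=names, index_col='temp')

print(by_name.index.name)
print(by_position.equals(by_name))
```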


## A CSV file with a header

Here is a CSV file that has a header giving default column names:

In [10]:
%%writefile data2.csv
Person,Age,Income,Zip
one,20,55000,95050
two,24,65000,95054
three,26,75000,95014

Writing data2.csv


In this case, `read_csv` will use those as the column names:

In [11]:
pd.read_csv('data2.csv')

Unnamed: 0,Person,Age,Income,Zip
0,one,20,55000,95050
1,two,24,65000,95054
2,three,26,75000,95014


To override the column names, pass a `names` argument together with `header=0`, which tells `read_csv` that the first line is a header to be replaced. If there is one fewer name than there are columns, the first column will be used as the index. Alternatively, use `index_col` to specify which column is used as the index.

In [12]:
pd.read_csv('data2.csv', names=['age','income','zip'], header=0)

Unnamed: 0,age,income,zip
one,20,55000,95050
two,24,65000,95054
three,26,75000,95014
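
The role of `header=0` is easiest to see by leaving it out. In this sketch (using an in-memory copy of the first lines of `data2.csv` and lowercase names chosen for illustration), omitting `header=0` while passing `names` causes the original header line to be read as a row of data.

```python
import io
import pandas as pd

# First lines of data2.csv, held in memory for this sketch.
text = "Person,Age,Income,Zip\none,20,55000,95050\ntwo,24,65000,95054\n"
names = ['person', 'age', 'income', 'zip']

# header=0: the original header line is discarded and replaced by names.
replaced = pd.read_csv(io.StringIO(text), names=names, header=0)
# No header argument: the original header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(text), names=names)

print(len(replaced), len(kept))
print(kept.iloc[0].tolist())
```

In the second frame the numeric columns are read as text, because each now contains a header string.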


## Handling missing data

Here is an example of a CSV file with missing and non-tidy data:

In [13]:
%%writefile data3.csv
age,income,zip
32,NA,95050
-1,4500,
,50000,95054
24,NULL,0

Writing data3.csv


By default, `read_csv` converts empty fields and certain strings (such as `NA` and `NULL`) to NaN. Note that `-1` is not treated as missing by default:

In [14]:
pd.read_csv('data3.csv')

Unnamed: 0,age,income,zip
0,32.0,,95050.0
1,-1.0,4500.0,
2,,50000.0,95054.0
3,24.0,,0.0
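
A quick way to check what was treated as missing is to count NaNs per column. This sketch reads the same content as `data3.csv` from an in-memory string and uses `DataFrame.isna`.

```python
import io
import pandas as pd

# Same content as data3.csv, held in memory for this sketch.
text = "age,income,zip\n32,NA,95050\n-1,4500,\n,50000,95054\n24,NULL,0\n"

df = pd.read_csv(io.StringIO(text))

# Number of NaN entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Columns containing NaN are stored as floats, which is why 32 shows up as 32.0.
print(df.dtypes)
```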


The `na_values` argument can be used to specify additional values for all columns that will be treated as NaNs:

In [15]:
pd.read_csv('data3.csv', na_values=[0])

Unnamed: 0,age,income,zip
0,32.0,,95050.0
1,-1.0,4500.0,
2,,50000.0,95054.0
3,24.0,,


Or you can pass `na_values` a dictionary providing NaN sentinels for each column:

In [16]:
pd.read_csv('data3.csv', na_values={'zip':[0,'NULL'], 'age': [-1]})

Unnamed: 0,age,income,zip
0,32.0,,95050.0
1,,4500.0,
2,,50000.0,95054.0
3,24.0,,
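
The learning objectives also mention exporting. As a minimal sketch, assuming `DataFrame.to_csv` with its `index` and `na_rep` arguments is the intended tool, a frame containing NaNs can be written back to CSV text and read in again. The two-row input is a shortened, in-memory version of `data3.csv`.

```python
import io
import pandas as pd

# Shortened version of data3.csv, held in memory for this sketch.
text = "age,income,zip\n32,NA,95050\n-1,4500,\n"
df = pd.read_csv(io.StringIO(text), na_values={'age': [-1]})

# With no path given, to_csv returns the CSV text as a string.
# index=False omits the row index; na_rep sets how NaN is written.
csv_text = df.to_csv(index=False, na_rep='NA')
print(csv_text)

# Reading the text back should reproduce the same frame.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.equals(df))
```

Passing a filename as the first argument to `to_csv` writes to disk instead of returning a string.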


## Resources

* Pandas [CSV Documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files).