# Working With CSV Files

**Learning Objectives:** Learn how to import and export data as CSV files using Pandas.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## About CSV

A CSV file is a simple textual format:

* One row per line
* Column values separated by a delimiter (a comma, space, tab, etc.)
* CSV = "comma-separated values"

CSV is one of the most widely supported formats for exchanging tabular data between programs and programming languages.
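
To make the format concrete, here is a minimal sketch that splits a small piece of comma-separated text by hand. The sample text is made up for illustration, and naive splitting like this does not handle quoted fields that contain the delimiter, which is one reason to use a real CSV reader:

```python
# Hypothetical sample text: one row per line, values separated by commas.
text = "one,76,3.4\ntwo,74,5.6\n"

# Split into lines, then split each line on the delimiter.
rows = [line.split(',') for line in text.splitlines()]

print(rows)  # [['one', '76', '3.4'], ['two', '74', '5.6']]
```

Every value comes back as a string; converting columns to numbers is part of what a CSV reader does for us.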

## A CSV file with no header

Here is a simple CSV file that doesn't have a header that declares the column names:

In [None]:
%%writefile data1.csv
one,76,3.4
two,74,5.6
three,73,2.9

We can look at its contents by using the `%pycat` magic:

In [None]:
%pycat data1.csv

CSV files can be read into Pandas `DataFrame` objects using the `read_csv` function. Here is the documentation for `read_csv`:

In [None]:
pd.read_csv?

Usually the first step is to try reading the file with no extra arguments:

In [None]:
pd.read_csv('data1.csv')

In this case the first row is data, not column names, so we pass `header=None` to keep it from being used as the header:

In [None]:
df = pd.read_csv('data1.csv', header=None)
df
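
The difference between the two calls can be checked directly. This is a small sketch that assumes the same three rows as `data1.csv`, held in an in-memory string via `io.StringIO` so that it runs without the file:

```python
import io
import pandas as pd

# Same rows as data1.csv, kept in memory (an assumption for self-containment).
csv_text = "one,76,3.4\ntwo,74,5.6\nthree,73,2.9\n"

# Default: the first row is consumed as the header, leaving two data rows.
with_default = pd.read_csv(io.StringIO(csv_text))
print(list(with_default.columns))  # ['one', '76', '3.4']
print(len(with_default))           # 2

# header=None: all three rows are data and the columns are labeled 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))     # [0, 1, 2]
print(len(no_header))              # 3
```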

To supply our own column names, pass the `names` argument. If the number of names matches the number of columns, a default integer row index is created:

In [None]:
pd.read_csv('data1.csv', header=None, names=['item','temp','rainfall'])

If we pass one fewer name than there are columns, the first column is used as the row index:

In [None]:
pd.read_csv('data1.csv', header=None, names=['temp','rainfall'])
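
The two cases can be compared side by side. This sketch again assumes the `data1.csv` rows in an in-memory string rather than the file on disk:

```python
import io
import pandas as pd

# Same rows as data1.csv, kept in memory (an assumption for self-containment).
csv_text = "one,76,3.4\ntwo,74,5.6\nthree,73,2.9\n"

# Three names for three columns: a default integer index is created.
same_count = pd.read_csv(io.StringIO(csv_text), header=None,
                         names=['item', 'temp', 'rainfall'])
print(list(same_count.index))   # [0, 1, 2]

# Two names for three columns: the first column becomes the row index.
one_fewer = pd.read_csv(io.StringIO(csv_text), header=None,
                        names=['temp', 'rainfall'])
print(list(one_fewer.index))    # ['one', 'two', 'three']
print(list(one_fewer.columns))  # ['temp', 'rainfall']
```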

We can also use the `index_col` argument to specify which column to use as the index:

In [None]:
pd.read_csv('data1.csv', header=None, names=['item','temp','rainfall'], index_col=1)
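
Note that `index_col` positions are zero-based, so `index_col=1` above selects the second column (`temp`). As a sketch of the alternatives, the example below selects the first column both by position and by label, again using an in-memory copy of the `data1.csv` rows:

```python
import io
import pandas as pd

# Same rows as data1.csv, kept in memory (an assumption for self-containment).
csv_text = "one,76,3.4\ntwo,74,5.6\nthree,73,2.9\n"
names = ['item', 'temp', 'rainfall']

# Select the index column by zero-based position...
by_position = pd.read_csv(io.StringIO(csv_text), header=None,
                          names=names, index_col=0)

# ...or by column label.
by_label = pd.read_csv(io.StringIO(csv_text), header=None,
                       names=names, index_col='item')

print(by_position.equals(by_label))  # True
print(by_label.loc['two', 'temp'])   # 74
```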

## A CSV file with a header

Here is a CSV file that has a header giving default column names:

In [None]:
%%writefile data2.csv
Person,Age,Income,Zip
one,20,55000,95050
two,24,65000,95054
three,26,75000,95014

In this case, `read_csv` will use those as the column names:

In [None]:
pd.read_csv('data2.csv')

To override the column names, provide a `names` argument together with `header=0`, which tells `read_csv` that row 0 is a header to be replaced. If there is one fewer name than there are columns, the first column is used as the index, as in the example below. You can also use `index_col` to specify which column is used as the index.

In [None]:
pd.read_csv('data2.csv', names=['age','income','zip'], header=0)
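
The role of `header=0` is easiest to see by leaving it out. In this sketch (using an in-memory copy of the `data2.csv` contents and four lowercase names chosen for illustration), omitting `header=0` keeps the original header row as a row of data:

```python
import io
import pandas as pd

# Same contents as data2.csv, kept in memory (an assumption for self-containment).
csv_text = ("Person,Age,Income,Zip\n"
            "one,20,55000,95050\n"
            "two,24,65000,95054\n"
            "three,26,75000,95014\n")
names = ['person', 'age', 'income', 'zip']

# header=0: the file's header row is discarded and replaced by `names`.
renamed = pd.read_csv(io.StringIO(csv_text), names=names, header=0)
print(list(renamed.columns))  # ['person', 'age', 'income', 'zip']
print(len(renamed))           # 3

# Without header=0, the original header line is read as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=names)
print(len(kept))                 # 4
print(kept.iloc[0]['person'])    # Person
```

In the second case the numeric columns are also read as strings, because the header text is mixed in with the numbers.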

## Handling missing data

Here is an example of a CSV file with missing and non-tidy data:

In [None]:
%%writefile data3.csv
age,income,zip
32,NA,95050
-1,4500,
,50000,95054
24,NULL,0

By default, `read_csv` converts empty fields and certain strings (such as `NA` and `NULL`) to NaN. Numeric placeholders such as `-1` and `0` are not treated as missing and are read as ordinary values:

In [None]:
pd.read_csv('data3.csv')
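
One way to check what was treated as missing is to count NaNs per column. This sketch assumes the same contents as `data3.csv`, held in an in-memory string:

```python
import io
import pandas as pd

# Same contents as data3.csv, kept in memory (an assumption for self-containment).
csv_text = "age,income,zip\n32,NA,95050\n-1,4500,\n,50000,95054\n24,NULL,0\n"

df = pd.read_csv(io.StringIO(csv_text))

# Count the NaNs in each column: empty fields, NA and NULL are all missing.
na_counts = df.isna().sum()
print(na_counts['age'], na_counts['income'], na_counts['zip'])  # 1 2 1

# The -1 in `age` and the 0 in `zip` are kept as ordinary numbers.
print(df.loc[1, 'age'])  # -1.0
print(df.loc[3, 'zip'])  # 0.0
```

The integer-looking columns come back as floats because NaN is a floating-point value.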

The `na_values` argument can be used to specify additional values for all columns that will be treated as NaNs:

In [None]:
pd.read_csv('data3.csv', na_values=[0])
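
As a quick check of the effect, the sketch below compares NaN counts with and without `na_values=[0]`, using an in-memory copy of the `data3.csv` contents. Only the `0` in the `zip` column changes; the `-1` in `age` is still read as a number:

```python
import io
import pandas as pd

# Same contents as data3.csv, kept in memory (an assumption for self-containment).
csv_text = "age,income,zip\n32,NA,95050\n-1,4500,\n,50000,95054\n24,NULL,0\n"

default = pd.read_csv(io.StringIO(csv_text))
zero_as_na = pd.read_csv(io.StringIO(csv_text), na_values=[0])

# The extra sentinel is added to the defaults, so NA and NULL still count.
print(default['zip'].isna().sum())     # 1
print(zero_as_na['zip'].isna().sum())  # 2
print(zero_as_na.loc[1, 'age'])        # -1.0
```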

Or you can pass `na_values` a dictionary providing NaN sentinels for each column:

In [None]:
pd.read_csv('data3.csv', na_values={'zip':[0,'NULL'], 'age': [-1]})
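
The per-column behavior can be verified by counting NaNs, as in this sketch on an in-memory copy of the `data3.csv` contents. With the default settings, the sentinels in the dictionary are added to the built-in ones, so the `income` column (not listed in the dictionary) still has its `NA` and `NULL` entries converted:

```python
import io
import pandas as pd

# Same contents as data3.csv, kept in memory (an assumption for self-containment).
csv_text = "age,income,zip\n32,NA,95050\n-1,4500,\n,50000,95054\n24,NULL,0\n"

per_column = pd.read_csv(io.StringIO(csv_text),
                         na_values={'zip': [0, 'NULL'], 'age': [-1]})

# -1 is now missing in `age`, and 0 is now missing in `zip`.
counts = per_column.isna().sum()
print(counts['age'], counts['income'], counts['zip'])  # 2 2 2
```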

## Resources

* Pandas [CSV Documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files).