# Reading CSV with Pandas

If it is available in your environment, the Pandas package provides a versatile and fast reader and writer for CSV and other delimited files.  Delimited files are read into a flexible data structure called a DataFrame, which has numerous useful methods.  Pandas can do a great deal of data processing and manipulation, but most of that is outside the scope of this lesson.

Let us start by loading Pandas, NumPy, and the standard library `csv` module.  Pandas is conventionally imported under the short name `pd`; similarly, NumPy is conventionally imported as `np`.

In [1]:
import numpy as np
import pandas as pd
import csv

# Basic Reading

Pandas provides a large number of options for reading CSV and other delimited files, and it has readers for many entirely different data formats as well.  The simple case, however, could hardly be simpler.  Let us look at a CSV file, then read it into a DataFrame.

In [2]:
!cat data/movie.csv

"Name","Evaluation","Rating","Age"
"Mia Johnson","The movie was excellent",9.5,25
"Liam Lopez","Didn't really like it",3.0,35
"Isabella Lee","Wow! That was great",8.0,45
"Olivia Gupta","Meh, not my thing",2.5,55


In [3]:
df = pd.read_csv("data/movie.csv")
df

|   | Name | Evaluation | Rating | Age |
|---|------|------------|--------|-----|
| 0 | Mia Johnson | The movie was excellent | 9.5 | 25 |
| 1 | Liam Lopez | Didn't really like it | 3.0 | 35 |
| 2 | Isabella Lee | Wow! That was great | 8.0 | 45 |
| 3 | Olivia Gupta | Meh, not my thing | 2.5 | 55 |
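`pd.read_csv()` also accepts any file-like object, which makes it easy to experiment without a file on disk.  The following is a minimal sketch, assuming the content of `data/movie.csv` is pasted into a string; the names `raw` and `movies` are illustrative only.

```python
import io

import pandas as pd

# Same content as data/movie.csv, held in memory for experimentation
raw = '''"Name","Evaluation","Rating","Age"
"Mia Johnson","The movie was excellent",9.5,25
"Liam Lopez","Didn't really like it",3.0,35
"Isabella Lee","Wow! That was great",8.0,45
"Olivia Gupta","Meh, not my thing",2.5,55
'''

# read_csv accepts a path or any file-like object
movies = pd.read_csv(io.StringIO(raw))

print(movies.shape)          # (rows, columns)
print(list(movies.columns))  # names taken from the header line
```

The comma inside the quoted field "Meh, not my thing" is kept as part of the value rather than treated as a delimiter.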


## Data Types

Of particular interest here is the type inference that Pandas performed.  Values that look like integers are converted to integers, and values that look like floats are converted to floats.

In [4]:
df.dtypes

Name           object
Evaluation     object
Rating        float64
Age             int64
dtype: object
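Inference applies to a column as a whole, so a single value that does not look like a number is enough to keep the column from becoming numeric.  A small sketch of that effect, using made-up data that is not part of the lesson files:

```python
import io

import pandas as pd

# Hypothetical data: identical except for one non-numeric Age value
clean = "Name,Age\nMia,25\nLiam,35\n"
messy = "Name,Age\nMia,25\nLiam,unknown\n"

ages_clean = pd.read_csv(io.StringIO(clean))["Age"]
ages_messy = pd.read_csv(io.StringIO(messy))["Age"]

print(ages_clean.dtype)   # an integer dtype
print(ages_messy.dtype)   # a non-numeric, string-like dtype
```

Checking `df.dtypes` right after loading is a quick way to spot columns that contain unexpected text.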

## Explicit Typing

Pandas lets you specify column types explicitly, insofar as a datatype makes sense for a given column.  For the most part, this is useful only to encode values in fewer bits or to force floats where a column would otherwise be inferred as integer.

In [5]:
# Override inference: store Age as a 16-bit float and Rating as a 32-bit float
df = pd.read_csv("data/movie.csv",
                 dtype={'Age': np.float16, 'Rating': np.float32})
df.dtypes

Name           object
Evaluation     object
Rating        float32
Age           float16
dtype: object
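To see what "fewer bits" buys, the size of each column's underlying array can be compared before and after.  This is a rough sketch on made-up in-memory data; the choice of `np.int8` for `Age` is an assumption made for illustration, not something the lesson prescribes.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample
raw = "Name,Rating,Age\nMia,9.5,25\nLiam,3.0,35\nIsabella,8.0,45\n"

inferred = pd.read_csv(io.StringIO(raw))
compact = pd.read_csv(io.StringIO(raw),
                      dtype={"Age": np.int8, "Rating": np.float32})

# nbytes reports the size of the underlying array for each column
print(inferred["Age"].nbytes, compact["Age"].nbytes)
print(inferred["Rating"].nbytes, compact["Rating"].nbytes)
```

A narrow type must still be wide enough for every value in the column; `int8`, for example, only covers -128 to 127.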

## Parsing Dates

Pandas goes further than the standard library `csv` module by optionally parsing dates.  The next example not only has dates in an extra field, but encodes them in several different formats.  If a column is designated for date parsing, Pandas applies heuristic rules to guess what format was intended.  One of the dates is accompanied by a time as well, down to a fraction of a second.

In [6]:
!cat data/movie-date.csv

Name,Evaluation,Rating,Age,Date
Mia Johnson,The movie was excellent,9.5,25,2020-06-08
Liam Lopez,Didn't really like it,3.0,35,"June 8, 2020"
Isabella Lee,Wow! That was great,8.0,45,2020-06-07
Olivia Gupta,"Meh, not my thing",2.5,55,2020-06-06T12:12:12.1212


In [7]:
pd.read_csv('data/movie-date.csv', parse_dates=['Date'])

|   | Name | Evaluation | Rating | Age | Date |
|---|------|------------|--------|-----|------|
| 0 | Mia Johnson | The movie was excellent | 9.5 | 25 | 2020-06-08 00:00:00.000000 |
| 1 | Liam Lopez | Didn't really like it | 3.0 | 35 | 2020-06-08 00:00:00.000000 |
| 2 | Isabella Lee | Wow! That was great | 8.0 | 45 | 2020-06-07 00:00:00.000000 |
| 3 | Olivia Gupta | Meh, not my thing | 2.5 | 55 | 2020-06-06 12:12:12.121200 |
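How much guessing `parse_dates` does for a column that mixes formats may differ between Pandas versions, so it is worth checking the resulting dtype after loading.  As a fallback, each value can be parsed on its own.  The sketch below assumes a cut-down, in-memory version of the file above and uses `pd.Timestamp` on each string; it is one possible approach, not necessarily what `read_csv` does internally.

```python
import io

import pandas as pd

raw = '''Name,Date
Mia Johnson,2020-06-08
Liam Lopez,"June 8, 2020"
Olivia Gupta,2020-06-06T12:12:12.1212
'''

# Read the Date column as plain text first
df_dates = pd.read_csv(io.StringIO(raw))

# Parse each string independently, so no single format has to fit all rows
parsed = [pd.Timestamp(text) for text in df_dates["Date"]]

for text, stamp in zip(df_dates["Date"], parsed):
    print(f"{text!r:28} -> {stamp}")
```

Parsing row by row is slower than a vectorized parse with one known format, so it is best kept for small or genuinely irregular columns.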


# Format Variations

By choosing among the available parameters, the same `pd.read_csv()` function can read most delimited formats.  For example, we can read the pipe-delimited (`|`) file created in the last lesson, which also has embedded newlines and quotes.  In the resulting DataFrame, the special value `NaN` (Not a Number) marks missing data.

In [8]:
!cat data/movie.txt

Name|Rating|Age
"Mia
Johnson"|9.5|25
Liam Lopez||35
"Isabella ""Bella"" Lee"|8.0|45


Some of the parameters used in the example below are simply set to their default values; they are shown to illustrate the range of options.

In [9]:
df = pd.read_csv('data/movie.txt', 
                 sep="|", 
                 nrows=100, 
                 skip_blank_lines=True, 
                 decimal='.', 
                 quotechar='"')
df

|   | Name | Rating | Age |
|---|------|--------|-----|
| 0 | Mia\nJohnson | 9.5 | 25 |
| 1 | Liam Lopez | NaN | 35 |
| 2 | Isabella "Bella" Lee | 8.0 | 45 |
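The three special cases in this file (an embedded newline, doubled quotes, and an empty field) can be checked directly on the loaded values.  A minimal sketch, assuming the file content is held in a string and relying on default values for everything except `sep`:

```python
import io

import pandas as pd

raw = '''Name|Rating|Age
"Mia
Johnson"|9.5|25
Liam Lopez||35
"Isabella ""Bella"" Lee"|8.0|45
'''

people = pd.read_csv(io.StringIO(raw), sep="|")

print(repr(people.loc[0, "Name"]))       # the newline is kept inside the value
print(people.loc[2, "Name"])             # doubled quotes collapse to single ones
print(people["Rating"].isna().tolist())  # True marks the missing rating
```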


Sometimes you will encounter CSV or other delimited files without headers.  A few options can handle that.  If we do not pass parameters to indicate this, Pandas will mistake the first row of data for the column names.

In [10]:
pd.read_csv('data/movie-noheader.csv')

|   | Mia Johnson | The movie was excellent | 9.5 | 25 |
|---|-------------|-------------------------|-----|----|
| 0 | Liam Lopez | Didn't really like it | 3.0 | 35 |
| 1 | Isabella Lee | Wow! That was great | 8.0 | 45 |
| 2 | Olivia Gupta | Meh, not my thing | 2.5 | 55 |


In [11]:
pd.read_csv('data/movie-noheader.csv', 
             names=['Person', 'Description', 'Score', 'Age'])

|   | Person | Description | Score | Age |
|---|--------|-------------|-------|-----|
| 0 | Mia Johnson | The movie was excellent | 9.5 | 25 |
| 1 | Liam Lopez | Didn't really like it | 3.0 | 35 |
| 2 | Isabella Lee | Wow! That was great | 8.0 | 45 |
| 3 | Olivia Gupta | Meh, not my thing | 2.5 | 55 |
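Passing `names=` is one option; another is `header=None`, which tells the reader that every line is data and numbers the columns instead.  A short sketch on a two-line, in-memory excerpt of the headerless file (the variable names are illustrative):

```python
import io

import pandas as pd

raw = '''Mia Johnson,The movie was excellent,9.5,25
Liam Lopez,Didn't really like it,3.0,35
'''

# header=None: treat every line as data and number the columns 0, 1, 2, ...
unnamed = pd.read_csv(io.StringIO(raw), header=None)
numbered = list(unnamed.columns)
print(numbered)

# Names can also be attached after the fact
unnamed.columns = ['Person', 'Description', 'Score', 'Age']
print(unnamed.loc[0, 'Person'])
```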


# Exporting to CSV

Once you *have* a Pandas DataFrame, whether constructed from scratch, read from any of numerous data formats, or modified and filtered using Pandas methods, it is easy to export it to a new CSV file.  This is not as general purpose as the Python `csv` module: only a DataFrame can do the writing, not arbitrary arrangements of data that you have manually programmed to write as records.  However, it is extremely straightforward, and it accepts many of the same parameters as the reader.

In [12]:
# Compression is chosen automatically from the .gz extension
df.to_csv('data/movie.tsv.gz', 
          sep='\t', 
          na_rep="N/A",
          quotechar="'")

In [13]:
!zcat data/movie.tsv.gz

	Name	Rating	Age
0	'Mia
Johnson'	9.5	25
1	Liam Lopez	N/A	35
2	Isabella "Bella" Lee	8.0	45
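The leading tab in the header line above comes from the unnamed index column, which is written by default.  The sketch below uses a small made-up DataFrame to show `index=False` suppressing that column, and checks that the `N/A` marker is read back as missing data; it writes to a string rather than a file so that it stays self-contained.

```python
import io

import pandas as pd

# Hypothetical two-row frame with one missing rating
frame = pd.DataFrame({
    "Name": ["Mia Johnson", "Liam Lopez"],
    "Rating": [9.5, None],      # None becomes a missing value (NaN)
    "Age": [25, 35],
})

# With no path argument, to_csv returns the text instead of writing a file
text = frame.to_csv(sep="\t", na_rep="N/A", index=False)
print(text)

# "N/A" is among the strings read_csv treats as missing by default
back = pd.read_csv(io.StringIO(text), sep="\t")
print(back["Rating"].isna().tolist())
```

If a custom marker is used for `na_rep` that is not recognized by default, it can be passed to the reader through the `na_values` parameter.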
