# Reading CSV with the Standard Library

Python provides a module in its standard library for reading and writing CSV and other delimited files.  It can be tempting to create or read such files using only Python's powerful string manipulation functionality.  Indeed, the author of this tutorial has done so far more often than he wishes to admit.  However, it is a mistake to eschew the `csv` module, which handles many edge cases that are easy to overlook in quick scripts.

Let us start by loading the Python standard library modules that this lesson uses.

In [1]:
import csv
from pprint import pprint

# Doing it Wrong

In Python, the string methods `.split()` and `.join()` do 90% of what we need when working with CSV.  The problem is that they do not do the other 10%.  Let's try a naive approach that goes bad.

In [2]:
fields = ["Name","Evaluation","Age"]
data = [
    ["Mia Johnson", "The movie was excellent", 25],
    ["Liam Lopez", "Didn't really like it", 35],
    ["Isabella Lee", "Wow! That was great", 45]
]

This is unremarkable data about several movie evaluations.  Let us try to serialize it.

In [3]:
with open('data/movie.csv', 'w') as movie:
    try:
        print(",".join(fields), file=movie)
        for record in data:
            print(",".join(record), file=movie)
    except Exception as err:
        print(err)

sequence item 2: expected str instance, int found


It is easy to see what went wrong.  The `.join()` method accepts only strings in its iterable argument, and the ages are integers.  We can fix that fairly easily, since Python knows how to *stringify* all its objects.

In [4]:
with open('data/movie.csv', 'w') as movie:
    try:
        print(",".join(fields), file=movie)
        for record in data:
            print(",".join(str(r) for r in record), file=movie)
    except Exception as err:
        print(err)

Success! At least for now. Perhaps we want to read it back as a list of dictionaries.

In [5]:
with open('data/movie.csv') as movie:
    newdata = []
    keys = next(movie).split(',') # Header
    for line in movie:
        newdata.append(dict(zip(keys, line.split(','))))
    
pprint(newdata)

[{'Age\n': '25\n',
  'Evaluation': 'The movie was excellent',
  'Name': 'Mia Johnson'},
 {'Age\n': '35\n', 'Evaluation': "Didn't really like it", 'Name': 'Liam Lopez'},
 {'Age\n': '45\n', 'Evaluation': 'Wow! That was great', 'Name': 'Isabella Lee'}]


We did *pretty well*.  However, the last field of both the header and each data row has a trailing newline character that we do not want.  We can strip that, but other problems still arise.

In [6]:
with open('data/movie.csv') as movie:
    newdata = []
    line = next(movie).rstrip()  # Header
    keys = line.split(',') 
    for line in movie:
        line = line.rstrip()
        newdata.append(dict(zip(keys, line.split(','))))
    
pprint(newdata)

[{'Age': '25', 'Evaluation': 'The movie was excellent', 'Name': 'Mia Johnson'},
 {'Age': '35', 'Evaluation': "Didn't really like it", 'Name': 'Liam Lopez'},
 {'Age': '45', 'Evaluation': 'Wow! That was great', 'Name': 'Isabella Lee'}]


So far, our ad hoc reader and writer behave well enough.

Let us add an additional record and try again.

In [7]:
new_eval = ["Olivia Gupta", "Meh, not my thing", 55]
data.append(new_eval)

with open('data/movie.csv', 'w') as movie:
    try:
        print(",".join(fields), file=movie)
        for record in data:
            print(",".join(str(r) for r in record), file=movie)
    except Exception as err:
        print(err)

We can see that something is going to go wrong when a field can legitimately contain the delimiter.

In [8]:
!cat data/movie.csv

Name,Evaluation,Age
Mia Johnson,The movie was excellent,25
Liam Lopez,Didn't really like it,35
Isabella Lee,Wow! That was great,45
Olivia Gupta,Meh, not my thing,55


Let's use the identical ad hoc reader to read the data on disk again.

In [9]:
with open('data/movie.csv') as movie:
    newdata = []
    line = next(movie).rstrip()
    keys = line.split(',') # Header
    for line in movie:
        line = line.rstrip()
        newdata.append(dict(zip(keys, line.split(','))))
    
pprint(newdata)

[{'Age': '25', 'Evaluation': 'The movie was excellent', 'Name': 'Mia Johnson'},
 {'Age': '35', 'Evaluation': "Didn't really like it", 'Name': 'Liam Lopez'},
 {'Age': '45', 'Evaluation': 'Wow! That was great', 'Name': 'Isabella Lee'},
 {'Age': ' not my thing', 'Evaluation': 'Meh', 'Name': 'Olivia Gupta'}]


As written, nothing crashed.  But the comma inside the last evaluation was treated as a delimiter, so that record has data in the wrong fields and its actual age was silently dropped.  Another likely problem is handling embedded newlines in strings; a few other edge cases also occur.  We could complicate matters with some additional code, and eventually get it right.  But the Python standard library does that for us.
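
The embedded newline problem can be shown with the same ad hoc writer and reader.  The following is only a sketch: it uses an in-memory `io.StringIO` buffer rather than a file on disk, and the record for "Noah Kim" is a hypothetical one invented for the illustration.

```python
import io

# Hypothetical record whose evaluation contains an embedded newline
record = ["Noah Kim", "Loved it.\nWould watch again", 65]

# Naive writer: join the fields with commas, one print() per record
buffer = io.StringIO()
print(",".join(["Name", "Evaluation", "Age"]), file=buffer)
print(",".join(str(r) for r in record), file=buffer)

# Naive reader: one physical line is assumed to be one record
buffer.seek(0)
keys = next(buffer).rstrip().split(",")
rows = [dict(zip(keys, line.rstrip().split(","))) for line in buffer]

print(len(rows))
print(rows)
```

One record was written, but the reader reports two rows, neither of which has an `Age` key.  As before, nothing raises an exception to warn us.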

# The `csv` Module

In the basic case, using the `csv` module gives us a largely file-like interface.  It automatically handles several things that can go wrong.

In [10]:
# newline='' lets the csv module control line endings itself
with open('data/movies.csv', 'w', newline='') as fh:
    movies = csv.writer(fh, quoting=csv.QUOTE_MINIMAL)
    for record in [fields]+data:
        movies.writerow(record)
        
!cat data/movies.csv

Name,Evaluation,Age
Mia Johnson,The movie was excellent,25
Liam Lopez,Didn't really like it,35
Isabella Lee,Wow! That was great,45
Olivia Gupta,"Meh, not my thing",55


Reading the data back is similar, with quoting and escaping handled properly.

In [11]:
with open('data/movies.csv', newline='') as fh:
    movies = csv.reader(fh)
    for record in movies:
        print(record)

['Name', 'Evaluation', 'Age']
['Mia Johnson', 'The movie was excellent', '25']
['Liam Lopez', "Didn't really like it", '35']
['Isabella Lee', 'Wow! That was great', '45']
['Olivia Gupta', 'Meh, not my thing', '55']
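
Earlier we wanted the records back as a list of dictionaries.  The standard library's `csv.DictReader` does that directly, using the first row as keys.  The sketch below works on an in-memory `io.StringIO` buffer instead of `data/movies.csv`, and adds a hypothetical record ("Noah Kim") containing an embedded newline to show that quoted newlines also survive the round trip.

```python
import csv
import io

fields = ["Name", "Evaluation", "Age"]
rows = [
    ["Olivia Gupta", "Meh, not my thing", 55],
    # Hypothetical record with an embedded newline
    ["Noah Kim", "Loved it.\nWould watch again", 65],
]

# Write to an in-memory buffer; newline="" avoids newline translation
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(fields)
writer.writerows(rows)

# Read back as dictionaries keyed by the header row
buffer.seek(0)
records = list(csv.DictReader(buffer))

for record in records:
    print(record)
```

Both the embedded comma and the embedded newline come back inside the `Evaluation` field.  The ages come back as the strings `'55'` and `'65'`.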


## Data Typing

Unlike some other tools, the standard library `csv` module makes little attempt to impose datatypes.  When writing, it will, of course, stringify objects that are not strings.  When reading, it returns fields as strings by default and leaves the decision of casting to other types up to the programmer.

In [12]:
with open('data/movies.csv', 'w', newline='') as fh:
    movies = csv.writer(fh, quoting=csv.QUOTE_NONNUMERIC)
    for record in [fields]+data:
        movies.writerow(record)
        
!cat data/movies.csv

"Name","Evaluation","Age"
"Mia Johnson","The movie was excellent",25
"Liam Lopez","Didn't really like it",35
"Isabella Lee","Wow! That was great",45
"Olivia Gupta","Meh, not my thing",55


The `csv` module provides one limited option, `csv.QUOTE_NONNUMERIC`: the writer quotes every non-numeric field, and the reader infers that anything unquoted is a number.  Under this rule the numeric type is always `float`.  If you wish to read in an `int`, a `Decimal`, a `Fraction`, or another numeric type, you still need to write custom code.

In [13]:
with open('data/movies.csv', newline='') as fh:
    movies = csv.reader(fh, quoting=csv.QUOTE_NONNUMERIC)
    for record in movies:
        print(record)

['Name', 'Evaluation', 'Age']
['Mia Johnson', 'The movie was excellent', 25.0]
['Liam Lopez', "Didn't really like it", 35.0]
['Isabella Lee', 'Wow! That was great', 45.0]
['Olivia Gupta', 'Meh, not my thing', 55.0]
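
One way to write the custom code needed for other types is a per-column converter table.  This is a minimal sketch, not a feature of the `csv` module: the `converters` dictionary is a hypothetical helper, and the data is an in-memory string standing in for a file written with default quoting.

```python
import csv
import io

# In-memory stand-in for a CSV file written with default quoting
text = (
    "Name,Evaluation,Age\r\n"
    "Mia Johnson,The movie was excellent,25\r\n"
    'Olivia Gupta,"Meh, not my thing",55\r\n'
)

# Hypothetical mapping of column name to conversion function;
# columns not listed here stay as strings
converters = {"Age": int}

records = []
for row in csv.DictReader(io.StringIO(text, newline="")):
    records.append(
        {key: converters.get(key, str)(value) for key, value in row.items()}
    )

for record in records:
    print(record)
```

Here `Age` comes back as an `int` rather than a `float`.  Swapping `int` for `decimal.Decimal` or `fractions.Fraction` in the mapping should work the same way, since both accept a numeric string.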
