# Reading a CSV File with the CSV Module

In [1]:
import csv
def read_csv_lists(filename):
    with open(filename, newline="") as file:
        reader = csv.reader(file)
        for row in reader:
            yield row

- Compared with the earlier read_tsv_lists, the single csv.reader object replaces both stripping newlines and splitting each line into values.
    - Splitting CSV lines by hand would have been much more complicated than in the simple TSV case, because fields can be quoted and can contain commas.
- csv.reader has several options for tweaking the dialect it reads.
    - Any changes to the dialect options are made in the csv.reader call; the loop that follows is unchanged.
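
As a minimal sketch of tweaking the dialect, the following reads semicolon-delimited data. The sample text and the `read_rows` helper are hypothetical, and `io.StringIO` stands in for an open file so the example is self-contained.

```python
import csv
import io

def read_rows(file, **dialect_options):
    """Yield each row as a list; dialect options go straight to csv.reader."""
    reader = csv.reader(file, **dialect_options)
    for row in reader:  # the loop is the same whatever the dialect
        yield row

# Hypothetical sample data using semicolons instead of commas.
sample = io.StringIO('name;note\nAtaulfo;"sweet; small"\n')
rows = list(read_rows(sample, delimiter=";"))
print(rows)
```

Only the keyword arguments differ from the comma-separated case; the quoted field keeps its embedded semicolon.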

# Reading a CSV File into Dictionaries

In [2]:
import csv
def read_csv_dictionaries(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield row

- The only change from the previous read_csv_lists function is replacing the call to csv.reader with csv.DictReader.
- csv.DictReader can be similarly configured with options to read different dialects.
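
To illustrate what the dictionaries look like, here is a small sketch. The sample data is hypothetical (only the mango_id column name appears elsewhere in these notes), and `io.StringIO` stands in for an open file.

```python
import csv
import io

# Hypothetical sample data; the first line supplies the dictionary keys.
sample = io.StringIO("mango_id,weight\nm1,201.5\nm2,187.0\n")
rows = list(csv.DictReader(sample))

for row in rows:
    # Values are still strings at this point; no type conversion happens.
    print(row["mango_id"], row["weight"])
```

Each row is keyed by the header names, so columns are looked up by name instead of by position.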

# Reading TSV Files with the CSV Module

In [3]:
import csv
 
def read_tsv_dictionaries_2(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        for row in reader:
            yield row

- The only change is adding the dialect="excel-tab" option.
- You could also get this effect with delimiter="\t" since \t is the tab character.
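
A quick check of that equivalence, assuming a small hypothetical tab-separated sample held in a string:

```python
import csv
import io

text = "mango_id\tweight\nm1\t201.5\n"  # hypothetical sample data

# Same text parsed twice: once by dialect name, once by explicit delimiter.
with_dialect = list(csv.DictReader(io.StringIO(text), dialect="excel-tab"))
with_delimiter = list(csv.DictReader(io.StringIO(text), delimiter="\t"))

print(with_dialect == with_delimiter)
```

Both calls produce the same rows for this input.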


In [4]:
import csv
csv.list_dialects()

['excel', 'excel-tab', 'unix']

In [5]:
dialect = csv.get_dialect("excel-tab")
dialect

<_csv.Dialect at 0x7181cda6d640>

In [6]:
dir(dialect)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'delimiter',
 'doublequote',
 'escapechar',
 'lineterminator',
 'quotechar',
 'quoting',
 'skipinitialspace',
 'strict']

In [7]:
for attribute in dir(dialect):
    if not attribute.startswith("_"):
        print(attribute, repr(getattr(dialect, attribute)))

delimiter '\t'
doublequote True
escapechar None
lineterminator '\r\n'
quotechar '"'
quoting 0
skipinitialspace False
strict False


- csv.list_dialects returns a list of the available dialects. "excel-tab" is included.
- csv.get_dialect gives us a dialect object, but its default string representation is not very helpful.
- The built-in function dir was mentioned in week 1, but we have not used it since. It returns a list of the attributes (including methods) of an object, and is handy when trying to learn about an unfamiliar object.
- The built-in function getattr returns an object's named attribute. What's an attribute? Anything you can access from the object using the dot notation. So getattr(o, "foo") is the same as o.foo. Here, it was used to programmatically look at an unfamiliar object where the attribute names weren't known beforehand.
- repr was used for clarity because some of these attributes are invisible characters; repr displays them as backslash escapes such as '\t'.
- This dialect is a little fancier than a plain TSV file but is probably fine for most purposes.
    - Of particular note, it supports quoting fields with double quotes, and a doubled double quote inside a quoted field stands for one literal quote (doublequote is True), as we saw in the example CSV encodings.
    - The quoting value 0 is the constant csv.QUOTE_MINIMAL.
    - Most of the time, quoted fields just do not come up in TSV data.
    - When they do, you'll have to decide case by case whether you want this support.
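
The quoting support can be seen in a short sketch. The input line is hypothetical, and passing `quoting=csv.QUOTE_NONE` is shown as one possible way to turn the support off, not necessarily the right choice for your data.

```python
import csv
import io

# Hypothetical line: the second field contains a tab and doubled quotes.
text = 'm1\t"large\t""ripe"""\n'

# excel-tab honors the quotes, so the embedded tab stays inside one field.
quoted = list(csv.reader(io.StringIO(text), dialect="excel-tab"))

# With QUOTE_NONE, quote characters are ordinary text and every tab splits.
plain = list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))

print(quoted)
print(plain)
```

With quoting enabled the line yields two fields; with it disabled the same line yields three, and the quote characters remain in the data.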

# Handling Different Data Types

In [8]:
import csv
 
def read_mango_data(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        for row in reader:
            for column_name in row:
                if column_name != "mango_id":
                    try:
                        row[column_name] = float(row[column_name])
                    except (TypeError, ValueError):  # missing or non-numeric value
                        row[column_name] = None
 
            yield row

- You may want to convert individual columns instead of looping over all columns, depending on what your data looks like.
- float is a reasonable default for most numeric columns, but occasionally int will be preferred.
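
A minimal sketch of converting individual columns, assuming a hypothetical mapping from column names to converters. The `CONVERTERS` table, the `convert_row` helper, and the count and weight columns are illustrations only; they are not part of the mango data described above.

```python
import csv
import io

# Hypothetical mapping: which converter to apply to which column.
CONVERTERS = {"count": int, "weight": float}

def convert_row(row, converters):
    """Return a copy of row with the listed columns converted; bad values become None."""
    converted = dict(row)
    for column_name, convert in converters.items():
        try:
            converted[column_name] = convert(row[column_name])
        except (KeyError, TypeError, ValueError):
            converted[column_name] = None
    return converted

# Hypothetical sample data; the second row has an empty and a non-numeric value.
sample = io.StringIO("mango_id\tcount\tweight\nm1\t3\t201.5\nm2\t\tn/a\n")
rows = [convert_row(row, CONVERTERS)
        for row in csv.DictReader(sample, dialect="excel-tab")]
print(rows)
```

Columns not named in the mapping, such as mango_id, are left as strings, and int is used only where whole numbers are expected.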