# Using the `csv` Module

When you are designing an analysis program, you will need to read information from a file. There are many ways to do this; in CPSC 103 we recommend that you use Python's `csv` module. 

This notebook explains how to use the `csv` module. For our example, we'll use the CSV ("Comma-Separated Values") file named `csv_module_example.csv`. The file contains some information about children: their names, their ages, and their favourite fruit. You should download it from edX (it's in the same unit as this notebook) and upload it to Jupyter. Make sure it's in the same Jupyter folder as this notebook; otherwise, you'll get an error when you try to read the file.
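If you aren't sure whether the file ended up in the right folder, a quick check like the one below can help. This is only a minimal sketch using Python's standard `os` module; the name `FILENAME` is our own choice for illustration and is not part of the course template.

```python
import os

# The sample file described above (FILENAME is a name chosen for this sketch).
FILENAME = 'csv_module_example.csv'

# os.path.exists looks for the path relative to the current working
# directory, which for a notebook is usually the notebook's own folder.
if os.path.exists(FILENAME):
    print('Found', FILENAME)
else:
    print('Could not find', FILENAME, 'in', os.getcwd())
```

If the second message is printed, the file is probably in a different folder than the notebook.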

Here is the information that is in the file:

    Name        Age    Favourite Fruit
    Alyaa       12     apple
    Benjamin    3      
    Qian        7      banana
    Ying        4      nectarine
    Jae         9      cantaloupe
    Helen       10.2   watermelon

By the time you get to Step 2b (design a function to read the information and store it as data in your program), you will already have done a lot of work for your analysis program. For example, you will have already defined the data definitions that you'll use to store the information as data in your program. 

Note that this sample information is not perfectly clean: Benjamin's favourite fruit is missing, and Helen's age is listed as `10.2` whereas all the other ages are whole numbers. When we designed our data definitions, we had to decide how to store the information as data. We chose to store the age as an `Optional[int]`; with this choice, if the age isn't the kind of data we expect, we'll store `None`. We chose to store `fav_fruit` as a string; if it is missing (as it is for Benjamin), we'll store the empty string. We could have made different choices. For example, if the information were essential to the problem that we were trying to solve, we could have chosen to discard any row that was missing it.
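To make that last alternative concrete, here is a rough sketch of discarding rows with a missing favourite fruit. It is not what we do in this notebook. The rows are hypothetical and are written out by hand as lists of strings, so the sketch doesn't depend on the file or on any course library.

```python
from typing import List

# Hypothetical rows, already split into strings (one inner list per row).
rows = [['Alyaa', '12', 'apple'],
        ['Benjamin', '3', ''],
        ['Qian', '7', 'banana']]

# complete_rows contains the rows seen so far that have a favourite fruit
complete_rows = []  # type: List[List[str]]

for row in rows:
    # The favourite fruit is the third piece of information in each row.
    if row[2] != '':
        complete_rows.append(row)

print(complete_rows)  # [['Alyaa', '12', 'apple'], ['Qian', '7', 'banana']]
```

With this choice, Benjamin would not appear in our data at all, which is only reasonable if the analysis really can't proceed without the favourite fruit.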

Here are the two data definitions that we'll need.

In [1]:
from typing import List, NamedTuple, Optional
import csv
from cs103 import *

Child = NamedTuple('Child', [('name', str),
                             ('age', Optional[int]),
                             ('fav_fruit', str)])
# interp. a child with his/her name, age, and favourite fruit

ALYAA = Child('Alyaa', 12, 'apple')
BENJAMIN = Child('Benjamin', 3, '')

# template based on compound
@typecheck
def fn_for_child(c: Child) -> ...:
    return ...(c.name, 
               c.age,
               c.fav_fruit)


# List[Child]
# interp. a list of children

LOC0 = []
LOC1 = [ALYAA, BENJAMIN]

# template based on arbitrary-sized and the reference rule
@typecheck
def fn_for_loc(loc: List[Child]) -> ...:
    # description of the acc
    acc = ... # type: ...
    for c in loc:
        ...(acc, fn_for_child(c))
    return ...(acc)

Now that we have some sample data definitions that we can use to store the information from our file as data in our program, let's read the data!

Here is the template for `read` from the How to Design Analysis Programs (HtDAP) recipe:

```python
@typecheck
def read(filename: str) -> List[Consumed]:
    """    
    reads information from the specified file and returns ...
    """
    #return []  #stub
    # Template from HtDAP
    # loc contains the result so far
    loc = [] # type: List[Consumed]

    with open(filename) as csvfile:
        
        reader = csv.reader(csvfile, delimiter=',')
        next(reader) # skip header line

        for row in reader:
            # you may not need to store all the strings in the 
            # current row, and you may need to convert some
            # of the strings to other types
            c = Consumed(row[0], ... ,row[n])
            loc.append(c)
    
    return loc
```

**Note:** If you run that code in a cell, you will get an error because the type `Consumed` isn't defined. That's OK: when we fill in the template, we'll replace `Consumed` with the type that matches our specific example.

We'll take a few steps to fill out this template for our particular information source.

**Set Up the Accumulator**

The first thing we need to update is the type, and possibly the name, of the accumulator. This is the variable that you'll use to store the data as you read the information in from the file. In our case, we're reading information about children and want our function to return a list of children.

So, we'll use the type `List[Child]` for our accumulator.

**Skip Header Lines**

We next need to look at our information file and check how many header lines it contains. A header line is a line that does not contain information that we want to store as data in our program. There is usually one header line that contains the headings for each column in the csv file, but some files have more header lines and some have none. If our file had no header lines, we'd want to remove `next(reader) # skip header line` from the template. If our file had three header lines, we'd need to add in two more calls to `next(reader)` so that we'd skip all the header lines.

Our csv file has one header line; so, we'll leave the single `next(reader)` line in place.
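For comparison, here is a small sketch of what skipping three header lines could look like. The file contents are hypothetical and are held in a string; `io.StringIO` from the standard library lets us treat that string as if it were an open file, so the sketch runs without any file on disk.

```python
import csv
import io

# Hypothetical file contents with three header lines before the data.
text = ('Survey of favourite fruit\n'
        'Collected in class\n'
        'Name,Age,Favourite Fruit\n'
        'Alyaa,12,apple\n'
        'Qian,7,banana\n')

rows = []
with io.StringIO(text) as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip header line 1
    next(reader)  # skip header line 2
    next(reader)  # skip header line 3

    for row in reader:
        rows.append(row)

print(rows)  # [['Alyaa', '12', 'apple'], ['Qian', '7', 'banana']]
```

Each call to `next(reader)` consumes exactly one row, so the loop starts at the first row of actual information.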

**Convert Each Needed Piece of Information**

Now we need to decide what to do with each individual piece of information. The csv reader allows us to loop over the rows in the file with a for loop, just as we usually do with arbitrary-sized data. It gives us access to each row in the file as a `List[str]`, and we can access individual pieces of information by indexing into the list. For example, we can access the name of a child with `row[0]`, since the name is the first piece of information in each row. Similarly, we can access the age of a child, represented as a string, with `row[1]`.

Since the age of a child is a number, we need to use the `parse_int` function to convert the string in `row[1]` to an `int`. If it can't be converted to an `int`, `parse_int` will return `None`.

There may be more pieces of information in each row of the file than you actually need. For example, even if your information file contains 10 columns, you may only want to store four of them as data in your program. If that's the case, you just need to carefully choose which elements of the list to access.
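To see what a row looks like, here is a short sketch that reads two rows and indexes into them. The contents are a hypothetical excerpt of our sample file held in a string (via the standard library's `io.StringIO`), so it runs without the file itself.

```python
import csv
import io

# Hypothetical excerpt of the sample file, held in a string.
text = ('Name,Age,Favourite Fruit\n'
        'Alyaa,12,apple\n'
        'Benjamin,3,\n')

with io.StringIO(text) as csvfile:
    reader = csv.reader(csvfile)
    next(reader)                # skip header line
    first_row = next(reader)    # the row for Alyaa
    second_row = next(reader)   # the row for Benjamin

print(first_row)         # ['Alyaa', '12', 'apple']
print(first_row[0])      # Alyaa
print(first_row[1])      # 12 (still a string, not an int)
print(second_row[2] == '')  # True: a missing value is an empty string
```

Notice that every element is a string, including the age, which is why a conversion step is needed before we can store it as a number.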

In our case, we need all three columns, and we use `parse_int` to convert the second column (`row[1]`) into an `Optional[int]`.
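The following sketch illustrates the behaviour described above. `parse_int_sketch` is a hypothetical stand-in that we wrote for illustration; it is an assumption about how such a conversion could work, not the actual `parse_int` from the `cs103` library, which may differ in its details.

```python
from typing import Optional

def parse_int_sketch(s: str) -> Optional[int]:
    """
    Hypothetical stand-in (NOT the cs103 implementation):
    returns s converted to an int, or None if s can't be converted
    """
    try:
        return int(s)
    except ValueError:
        # int() raises ValueError for strings such as '10.2' or ''
        return None

print(parse_int_sketch('12'))    # 12
print(parse_int_sketch('10.2'))  # None
print(parse_int_sketch(''))      # None
```

This matches the choice we made in our data definition: an age that isn't a whole number, or isn't there at all, is stored as `None`.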

**Create a `Consumed` and Append It to the Accumulator**

Finally, for each row in the file, we need to create data of the type we use to represent one row.

In our case, our version of the `Consumed` type is `Child`. We make one and append it to our accumulator.

In [2]:
@typecheck
def read(filename: str) -> List[Child]:
    """    
    reads information from the specified file and returns a list of children
    """
    #return []  #stub
    # Template from HtDAP
    # loc contains the result so far
    loc = [] # type: List[Child]

    with open(filename) as csvfile:
        
        reader = csv.reader(csvfile)
        next(reader) # skip header line

        for row in reader:
            # Create a Child with the first column left as a string,
            # the second column (age) converted to an Optional[int],
            # and the third column also left as a string, which might
            # be empty.
            c = Child(row[0], parse_int(row[1]), row[2])
            
            # Append the Child to the accumulator.
            loc.append(c)
    
    return loc

You may run into other issues when you're reading information from your file (e.g. unexpected formatting). If you have trouble with your read function, please come to office hours or post on the discussion forum so that we can help you.

Don't forget to test your read function! Create two or three small csv files that illustrate the kinds of data, and the variation in that data, that your program will have to handle. For example, your files should contain some missing values, as it's very common for some information to be missing.
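One way to create such a small file is to write it from Python with `csv.writer`. The sketch below is only an illustration: the file name `read_test_small.csv` and the rows are hypothetical, and it writes to the system's temporary folder so that it can run anywhere. In your notebook, you would more likely create the file in the same Jupyter folder as the notebook, or simply type it in a text editor and upload it.

```python
import csv
import os
import tempfile

# Hypothetical test rows: one clean row, one with a missing fruit,
# and one with an age that is not a whole number.
test_rows = [['Name', 'Age', 'Favourite Fruit'],
             ['Alyaa', '12', 'apple'],
             ['Benjamin', '3', ''],
             ['Helen', '10.2', 'watermelon']]

# Hypothetical file name, placed in the system's temporary folder.
test_filename = os.path.join(tempfile.gettempdir(), 'read_test_small.csv')

# The csv documentation recommends opening csv files with newline=''.
with open(test_filename, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(test_rows)

# Read the file back to confirm that it contains what we intended.
with open(test_filename, newline='') as csvfile:
    rows_read = list(csv.reader(csvfile))

print(rows_read == test_rows)  # True
```

Once the file exists, you can pass its name to your `read` function inside an `expect` call.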

For example, here is a test for the sample file distributed with this notebook:

In [3]:
start_testing()

expect(read('csv_module_example.csv'), [Child('Alyaa', 12, 'apple'),
                                        Child('Benjamin', 3, ''),
                                        Child('Qian', 7, 'banana'),
                                        Child('Ying', 4, 'nectarine'),
                                        Child('Jae', 9, 'cantaloupe'),
                                        Child('Helen', None, 'watermelon')])

summary()

1 of 1 tests passed
