# How to use and manipulate CSV files in Python

## Part 1 - What is a CSV file

A CSV (comma-separated values) file is a type of file where the data is structured into rows and columns, using commas and new lines. CSV files are commonly used to represent data from a spreadsheet, such as one from Microsoft Excel or Google Sheets.

For example, if I had this spreadsheet on Google Sheets:

![http://i.imgur.com/9ig18Kt.png](http://i.imgur.com/9ig18Kt.png)

And then I went to File->Download As->Comma-separated values and downloaded the file, it would look like this:

Item,Quantity,Price per item
Bag of carrots,3,$3
Box of cookies,2,$4
Brie Cheese,1,$4

In this example, the first row has three things separated by commas. For each of the following rows, the first item corresponds to the first item in the first row, the second item to the second, and so on (so Bag of carrots is an item, 3 is a quantity, and $3 is a price per item).
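To make the relationship between the header row and a data row concrete, here is a minimal sketch that pairs them up by position. The two strings are copied from the shopping list; the variable names are only for illustration.

```python
# The header row and one data row from the shopping list, as plain strings
header = "Item,Quantity,Price per item".split(",")
row = "Bag of carrots,3,$3".split(",")

# zip pairs the first header with the first value, the second with the second, and so on
labelled = dict(zip(header, row))
print(labelled)
```

Every value is still a string at this point, which is why the later examples convert the quantity and price with int.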

P.S. If you can't remember which are rows and which are columns, you can think of columns like Roman columns (going up and down), and rows like running (left to right).

This shows probably the biggest reason we care about CSV files. They're structured, so they're easy to read programmatically, but they're also easy to hand to someone who knows nothing about code, who can just open them in their favorite spreadsheet program. Large data sets that you download online are often provided as CSV files.

## Part 2 - Reading CSV Files

### Part 2a - Reading them manually

Remember that CSV files are just regular files with a standard formatting.
Therefore, you can just read them like a regular file, and get the data you want.

For example, if we want to figure out how much we will have to pay in the end, we have to go through the rows, and multiply the price by the quantity, and add them all up.

In [2]:
csv_text = open("shopping_list.csv").read() # How to turn a file into a string
print(csv_text)

Item,Quantity,Price per item
Bag of carrots,3,$3
Box of cookies,2,$4
Brie Cheese,1,$4


In [3]:
# Now we have to split it on the new lines, so each line becomes its own string in a list
csv_text_split = csv_text.split("\n")
print(csv_text_split)

['Item,Quantity,Price per item', 'Bag of carrots,3,$3', 'Box of cookies,2,$4', 'Brie Cheese,1,$4']


In [4]:
# We can get rid of the first line, since that's just the headers. We know that quantity is index 1, and price is index 2.
csv_text_split.pop(0) # Delete the line at index 0 - since it's the headers

'Item,Quantity,Price per item'

In [5]:
# And now we just iterate through each line, and take out the information we want. 
total_price = 0
for line in csv_text_split: # For each line
    new_line = line.split(",") # split the line on the comma, so the line is now a list [Item,Quantity,Price]
    quantity = int(new_line[1]) # turn the thing at index 1 into an int
    price_per_item = int(new_line[2][1:]) # take the dollar sign off the thing at index 2, and turn it into an int
    
    total_price += quantity*price_per_item

print("Total Price: ${}".format(total_price)) 

Total Price: $21


### Part 2b - Using the CSV library

That wasn't awful - but there's a lot of code in there that would be repeated in all CSVs. In addition, our code doesn't handle some special cases (What if there are commas in the item, for example?)
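To make the comma special case concrete, here is a small sketch. The row below is hypothetical (it is not in shopping_list.csv): the item name contains a comma, so a spreadsheet program would wrap it in quotes when exporting.

```python
import csv
import io

# Hypothetical data row: the item name itself contains a comma, so it is quoted
data_line = '"Carrots, organic",3,$3'

# Splitting by hand cuts the item name in two and leaves stray quote characters
naive = data_line.split(",")
print(naive)

# csv.reader understands the quotes and keeps the item name in one piece
parsed = next(csv.reader(io.StringIO(data_line)))
print(parsed)
```

The manual split produces four pieces instead of three, so index 1 would no longer be the quantity.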

Because of this, Python comes with a CSV library that makes it extremely easy to turn a CSV file into a list of lists, so that you can parse it more easily. Let me show you how it works:

In [6]:
# First step: import the library
import csv

# Second step: pass the open file into csv.reader
csv_lists = csv.reader(open("shopping_list.csv"))

# Third step: iterate through the file you created:
for line in csv_lists: 
    print(line)

['Item', 'Quantity', 'Price per item']
['Bag of carrots', '3', '$3']
['Box of cookies', '2', '$4']
['Brie Cheese', '1', '$4']


If we want to rewrite the code for finding the total price:

In [7]:
csv_lists = csv.reader(open("shopping_list.csv"))
total_cost = 0
next(csv_lists) # this advances the csv_lists by one

for line in csv_lists:
    quantity = int(line[1])
    price = int(line[2][1:]) # the 1: takes off the dollar sign
    total_cost += quantity*price
print("Total cost is: ${}".format(total_cost))

Total cost is: $21


Something to keep in mind is that csv.reader doesn't give you a list of lists. It gives you an iterator that hands back one line at a time, and is empty once it reaches the end of the file. So you can't read from csv_lists twice - the second time it will just be empty, which is why the loop below prints nothing. This also means you can't do indexing on it.

In [8]:
for line in csv_lists:
    print(line)

The reason the csv library does this is in case you had a very large CSV file - this way, you don't have to store it all in memory, you can just read it line by line.
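Here is a small self-contained sketch of that behaviour. It uses io.StringIO with a shortened copy of the shopping list in place of a real file, which is an assumption made only so the snippet runs on its own.

```python
import csv
import io

# A shortened in-memory copy of the shopping list, standing in for a real file
sample_text = "Item,Quantity,Price per item\nBag of carrots,3,$3\nBox of cookies,2,$4\n"
reader = csv.reader(io.StringIO(sample_text))

first_pass = list(reader)   # consumes every row
second_pass = list(reader)  # nothing is left to read
print(len(first_pass))
print(second_pass)

# Indexing the reader object directly is not supported
try:
    reader[0]
    indexing_works = True
except TypeError:
    indexing_works = False
print(indexing_works)
```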

To read from it more than once, you can convert it into a list after reading it. This will store the entire list in your computer's memory, and allow you to use it like a list of lists. 

In [18]:
csv_lists = list(csv.reader(open("shopping_list.csv")))
print(csv_lists)

print(csv_lists[1][2])

[['Item', 'Quantity', 'Price per item'], ['Bag of carrots', '3', '$3'], ['Box of cookies', '2', '$4'], ['Brie Cheese', '1', '$4']]
$3


With it as a list, we can also use a list comprehension to get the total price.

In [19]:
csv_lists = csv_lists[1:] # Take off the headers

In [21]:
totals = [int(line[1])*int(line[2][1:]) for line in csv_lists]
print(totals)
print(sum(totals))

[9, 8, 4]
21


Now you have a list of lists, which is the data from your CSV.

Try writing code that goes through csv_lists, and prints out the item you're spending the most money on.

In [11]:
# answer - will be blank for students:
def most_expensive(csv_lists):
    max_item = ""
    max_price = -1
    for line in csv_lists:
        total_price = int(line[1])*int(line[2][1:])
        if total_price>max_price:
            max_price = total_price
            max_item = line[0]
    
    return max_item
most_expensive(csv_lists) # the headers were already taken off csv_lists above

'Bag of carrots'
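As an alternative sketch of the same exercise, the built-in max function can do the comparison if you give it a key function. The rows are written out inline here (copied from the shopping list, headers already removed) so the snippet runs on its own; line_total is a helper name made up for this example.

```python
# Data rows from the shopping list, with the header row already removed
rows = [
    ["Bag of carrots", "3", "$3"],
    ["Box of cookies", "2", "$4"],
    ["Brie Cheese", "1", "$4"],
]

def line_total(row):
    # quantity times price, with the dollar sign stripped off the price
    return int(row[1]) * int(row[2][1:])

# max returns the row whose line_total is the largest
most_expensive_row = max(rows, key=line_total)
print(most_expensive_row[0])
```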

You might have noticed a few annoying things while working with this library. For one, you have to drop the first row, since it doesn't contain any data you want. Secondly, you have to refer to the items by index, which means you have to know the index of what you want.

These issues can be solved with the DictReader class of the csv library. Let me show you how that one works, and what it produces:

In [13]:
# csv library is already imported
csv_file = csv.DictReader(open("shopping_list.csv"))

# now let's see what's inside
for line in csv_file:
    print(line)

OrderedDict([('Item', 'Bag of carrots'), ('Quantity', '3'), ('Price per item', '$3')])
OrderedDict([('Item', 'Box of cookies'), ('Quantity', '2'), ('Price per item', '$4')])
OrderedDict([('Item', 'Brie Cheese'), ('Quantity', '1'), ('Price per item', '$4')])


As you can see, the DictReader takes in a CSV file, and gives you a bunch of dictionaries, where the key is the header, and the value is the value at that line. This makes it easy to write very readable code, as you can use the name of the header to get what you want. For example, to rewrite the "total cost" code:

In [15]:
csv_file = csv.DictReader(open("shopping_list.csv"))
total_cost = 0
for line in csv_file:
    quantity = int(line['Quantity'])
    price = int(line['Price per item'][1:])
    total_cost += quantity*price
print("Total cost is: ${}".format(total_cost))

Total cost is: $21


The list comprehension version looks like this:

In [24]:
csv_file = csv.DictReader(open("shopping_list.csv"))
print(sum(
    [int(line['Quantity'])*int(line['Price per item'][1:]) for line in csv_file]
))

21


It's up to you which version you want to use - whatever you're more comfortable with and you think looks the best.
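The examples above pass open(...) straight into the reader, which leaves it to Python to close the file at some later point. Here is a minimal sketch of the same total using a "with" block, so the file is closed as soon as the block ends. It first recreates the shopping list in a temporary folder, which is an assumption made only so the snippet runs on its own.

```python
import csv
import os
import tempfile

# Recreate the shopping list in a temporary folder so this sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "shopping_list.csv")
with open(path, "w", newline="") as f:
    f.write("Item,Quantity,Price per item\n"
            "Bag of carrots,3,$3\n"
            "Box of cookies,2,$4\n"
            "Brie Cheese,1,$4\n")

with open(path, newline="") as f:
    reader = csv.DictReader(f)
    headers = reader.fieldnames  # the column names taken from the first row
    total_cost = sum(
        int(row["Quantity"]) * int(row["Price per item"][1:]) for row in reader
    )

print(headers)
print("Total cost is: ${}".format(total_cost))
```

The sum has to happen inside the "with" block, because the reader can only pull rows from the file while it is still open.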

## Part 3 - Writing a CSV file

Like reading CSV files, we don't necessarily need the CSV library to create a CSV. However, it makes it a lot easier. In fact, I will only go over how to create one using the CSV library.

Let's say we want to create a CSV of the first 100 numbers, along with their values of x^2, x^3, and sqrt(x).

Just like there's a csv.reader and a csv.DictReader, there's also a csv.writer and a csv.DictWriter. I'll show both ways of using them.

This also shows the "with" statement for opening a file, which closes the file automatically when the block ends.

In [18]:
import math
# First way, using CSV writer
# we add the extra 'w' parameter for saying this file will be written to
# newline="" lets the csv library handle line endings itself; without it, Windows writes a blank line after every row
with open("number_values.csv", "w", newline="") as new_csv:
    writer = csv.writer(new_csv)
    writer.writerow(["Number", "Number Squared", "Number Cubed", "Square Root of Number"])
    for i in range(1,101):
        writer.writerow([i, i**2, i**3, round(math.sqrt(i),2)])
# Notice there's no "close" statement - the "with" block closes the file for us

In [19]:
with open("number_values.csv") as f:
    print(f.read())

Number,Number Squared,Number Cubed,Square Root of Number
1,1,1,1.0
2,4,8,1.41
3,9,27,1.73
4,16,64,2.0
5,25,125,2.24
6,36,216,2.45
7,49,343,2.65
8,64,512,2.83
9,81,729,3.0
10,100,1000,3.16
11,121,1331,3.32
12,144,1728,3.46
13,169,2197,3.61
14,196,2744,3.74
15,225,3375,3.87
16,256,4096,4.0
17,289,4913,4.12
18,324,5832,4.24
19,361,6859,4.36
20,400,8000,4.47
21,441,9261,4.58
22,484,10648,4.69
23,529,12167,4.8
24,576,13824,4.9
25,625,15625,5.0
26,676,17576,5.1
27,729,19683,5.2
28,784,21952,5.29
29,841,24389,5.39
30,900,27000,5.48
31,961,29791,5.57
32,1024,32768,5.66
33,1089,35937,5.74
34,1156,39304,5.83
35,1225,42875,5.92
36,1296,46656,6.0
37,1369,50653,6.08
38,1444,54872,6.16
39,1521,59319,6.24
40,1600,64000,6.32
41,1681,68921,6.4
42,1764,74088,6.48
43,1849,79507,6.56
44,1936,85184,6.63
45,2025,91125,6.71
46,2116,97336,6.78
47,2209,103823,6.86
48,2304,110592,6.93
49,2401,117649,7.0
50,2500,125000,7.07
51,2601,132651,7.14
52,2704,140608,7.

The other way to do this is to use a DictWriter - I'll show you how to do that below.
Remember that the way the DictReader worked was that each line was a dictionary mapping the header to its value at that line. The writer works similarly: for each line, we write a dictionary.

In [20]:
# newline="" again, so that Windows doesn't write a blank line after every row
with open("number_values2.csv", "w", newline="") as new_csv:
    # we have to tell the writer what our top fields are
    writer = csv.DictWriter(new_csv, fieldnames=["num", "squared", "cubed", "sqrt"])
    writer.writeheader() # to write the header
    for i in range(1,101):
        writer.writerow({"num": i, "squared": i**2, "cubed": i**3, "sqrt": round(math.sqrt(i), 2)})
with open("number_values2.csv") as f:
    print(f.read())

num,squared,cubed,sqrt
1,1,1,1.0
2,4,8,1.41
3,9,27,1.73
4,16,64,2.0
5,25,125,2.24
6,36,216,2.45
7,49,343,2.65
8,64,512,2.83
9,81,729,3.0
10,100,1000,3.16
11,121,1331,3.32
12,144,1728,3.46
13,169,2197,3.61
14,196,2744,3.74
15,225,3375,3.87
16,256,4096,4.0
17,289,4913,4.12
18,324,5832,4.24
19,361,6859,4.36
20,400,8000,4.47
21,441,9261,4.58
22,484,10648,4.69
23,529,12167,4.8
24,576,13824,4.9
25,625,15625,5.0
26,676,17576,5.1
27,729,19683,5.2
28,784,21952,5.29
29,841,24389,5.39
30,900,27000,5.48
31,961,29791,5.57
32,1024,32768,5.66
33,1089,35937,5.74
34,1156,39304,5.83
35,1225,42875,5.92
36,1296,46656,6.0
37,1369,50653,6.08
38,1444,54872,6.16
39,1521,59319,6.24
40,1600,64000,6.32
41,1681,68921,6.4
42,1764,74088,6.48
43,1849,79507,6.56
44,1936,85184,6.63
45,2025,91125,6.71
46,2116,97336,6.78
47,2209,103823,6.86
48,2304,110592,6.93
49,2401,117649,7.0
50,2500,125000,7.07
51,2601,132651,7.14
52,2704,140608,7.21
53,2809,148877,7.28
54,2916,1

As you can see, apart from the header names we chose, they produce the same output. It's up to you which one you want to use, depending on whether your data is more naturally a list or a dictionary for each row.
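As a quick check of that claim, here is a small sketch that writes the same five rows with both writers and compares the results. It writes into io.StringIO buffers instead of files and uses the same header names for both writers; both choices are assumptions made to keep the comparison short and self-contained.

```python
import csv
import io
import math

fields = ["num", "squared", "cubed", "sqrt"]

# First way: csv.writer, where each row is a list
list_buffer = io.StringIO()
list_writer = csv.writer(list_buffer)
list_writer.writerow(fields)
for i in range(1, 6):
    list_writer.writerow([i, i**2, i**3, round(math.sqrt(i), 2)])

# Second way: csv.DictWriter, where each row is a dictionary
dict_buffer = io.StringIO()
dict_writer = csv.DictWriter(dict_buffer, fieldnames=fields)
dict_writer.writeheader()
for i in range(1, 6):
    dict_writer.writerow({"num": i, "squared": i**2, "cubed": i**3, "sqrt": round(math.sqrt(i), 2)})

same_output = list_buffer.getvalue() == dict_buffer.getvalue()
print(same_output)
```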

## Exercises 

### Number 1:

Download this CSV file: http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv

Found from here: https://support.spatialkey.com/spatialkey-sample-csv-data/

"The Sacramento real estate transactions file is a list of 985 real estate transactions in the Sacramento area reported over a five-day period, as reported by the Sacramento Bee. Note that this file has address level information that you can choose to geocode, or you can use the existing latitude/longitude in the file."

Finish these functions - the parameter will be a string which is the file name.
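Before writing these functions, it helps to know what the columns in the file are called. Here is a minimal sketch of a helper that returns the header names of any CSV file. The name column_names is made up for this example, and the demo file with its columns is a hypothetical stand-in, not the real Sacramento data.

```python
import csv
import os
import tempfile

def column_names(csv_file):
    """Return the header names of the CSV file with the given file name."""
    with open(csv_file, newline="") as f:
        return csv.DictReader(f).fieldnames

# Demonstration on a tiny stand-in file; these column names are made up
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["address", "zipcode", "price"])
    writer.writerow(["1 Example St", "00000", "100000"])

print(column_names(demo_path))
```

Calling it on the file you downloaded shows you which keys to use with DictReader.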

In [None]:
# What is the average price of a house sold?
def average_price(csv_file):
    pass # write your code here

In [21]:
# Which zipcode has the most sales?
def zipcode_with_most_sales(csv_file):
    pass # write your code here

In [22]:
# Which zipcode has the most expensive house sold?
def zipcode_of_most_expensive_house(csv_file):
    pass # write your code here

In [23]:
# Which house had the best ratio of square feet to price?
# In other words: which house was the cheapest per square foot?
# Return the address of the house
def address_of_cheapest_per_square_foot(csv_file):
    pass # write your code here

### Number 2:

Download an interesting CSV file online, and write code to find an interesting fact about it!

Here is an example of somewhere you can get an interesting CSV file: https://catalog.data.gov/dataset?res_format=CSV