## Reading and Writing Files 

Let's practice reading and writing files with Python. We'll use a data file that I downloaded from here: <https://www.ssa.gov/oact/babynames/limits.html>. It's a database of (almost) all the baby names in the US since the year 1880. First, we will load a file from 2012 and examine the information in it. 

The full file path is: R:\Psy407_9\Class_Data\datasets\baby_names_2012.txt

Read the file into Python using `with` and the `readlines` method. Save all lines into a variable called `lines`. What is the data type of `lines`? Print the first 10 lines. 

In [1]:

#you would have to do the full path R:\Psy407_9\Class_Data\datasets\baby_names_2012.txt
with open('../../datasets/baby_names_2012.txt') as f:
    lines = f.readlines()
    
    
print lines[:10]

['name    count_2012\r\n', 'Aabha   13\r\n', 'Aabriella   5\r\n', 'Aaden   5\r\n', 'Aadhira 6\r\n', 'Aadhya  218\r\n', 'Aadi    10\r\n', 'Aadison 11\r\n', 'Aaditri 10\r\n', 'Aadya   292\r\n']
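
To check the data type without the class drive, here is a minimal sketch that writes a tiny stand-in file to a temporary folder and reads it back the same way. The file name `tiny.txt` and its contents are made up for illustration; they are not the real data.

```python
import os
import tempfile

# Hypothetical two-line file written to a temporary folder (not the real data)
path = os.path.join(tempfile.mkdtemp(), 'tiny.txt')
with open(path, 'w') as f:
    f.write('name    count_2012\nAabha   13\n')

# Read it back; the file is closed automatically when the with block ends
with open(path) as f:
    tiny_lines = f.readlines()

print(type(tiny_lines))  # readlines gives back a list, one string per line
print(len(tiny_lines))
print(f.closed)          # the with block has already closed the file
```

So `lines` is a list of strings, and each string still carries its line ending.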


Ugh, messy, right? Let's try to clean it up. Remember when we cleaned up the data from the story? Use the `strip` method to clean up all elements in `lines`. Save the result in `cleanlines`. 

In [2]:
cleanlines = []

for l in lines:
    cleanlines.append(l.strip())

print cleanlines[:10]

['name    count_2012', 'Aabha   13', 'Aabriella   5', 'Aaden   5', 'Aadhira 6', 'Aadhya  218', 'Aadi    10', 'Aadison 11', 'Aaditri 10', 'Aadya   292']
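
If you want to see exactly what `strip` removes, here is a small sketch on a few made-up lines (the `sample` list is hypothetical, not read from the file). It also writes the same loop as a list comprehension, which is just a shorter way to build the list.

```python
# Hypothetical sample lines that mimic the raw file (not the real data)
sample = ['name    count_2012\r\n', 'Aabha   13\r\n', '  Aaden   5  \r\n']

# strip() with no arguments removes leading and trailing whitespace,
# including the '\r\n' line ending; spaces inside the line are kept
cleaned = [line.strip() for line in sample]

print(cleaned)
```

Notice that the spaces between the name and the number survive; only the ends of each string are trimmed.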


Notice that each element consists of 2 bits of text, separated by some spaces. The first bit of text is a name, and the second bit is a number. The number is how many times the name occurred in the year 2012. 

Let's figure out how this file is configured. We can tell it's not a csv file because things aren't separated by commas. Maybe there are a certain number of spaces between the name and the frequency? We can use the `count` method to count the number of space characters in each element. Give it a try. Loop through `cleanlines` and count the number of spaces that occur in each element. Save the result into `spaces`. Do we have the same number of spaces between the pieces of information? (Check the first 20 elements.)

In [3]:
spaces = []

for l in cleanlines:
    spaces.append(l.count(' ')) #notice the space!
    
print spaces[:20]


[4, 3, 3, 3, 1, 2, 4, 1, 1, 3, 3, 1, 3, 2, 4, 3, 3, 4, 1, 3]


Nope! The number of spaces varies from line to line. Why don't you open the file with Notepad++ and take a look at it? Does it look like a delimited file? 

In [4]:
#No, it's just a name, some spaces (which are variable), and a number on each line

It turns out this is a weird file that doesn't match a predefined format. Try opening it in Excel. What happens?

We can work with it in Python, but let's save it as a standard format so we can use it in other programs. Let's process the data and save it as a .csv file.

Even though each line has a varying number of spaces, I bet that the `split` method can split the names from the frequencies. Give it a try!

In [5]:
#Excel doesn't format it well either. 

splitlines = []

for l in cleanlines:
    splitlines.append(l.split()) #with no arguments, it splits on any run of whitespace

print splitlines[:10]



[['name', 'count_2012'], ['Aabha', '13'], ['Aabriella', '5'], ['Aaden', '5'], ['Aadhira', '6'], ['Aadhya', '218'], ['Aadi', '10'], ['Aadison', '11'], ['Aaditri', '10'], ['Aadya', '292']]
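
Why does `split` work here when counting spaces didn't help? A minimal sketch on one made-up line (the variable names are just for illustration) compares `split()` with no arguments to `split(' ')` with an explicit space.

```python
# Hypothetical line with three spaces between the name and the count
line = 'Aabha   13'

# No argument: split on any run of whitespace and drop empty strings
by_whitespace = line.split()

# Explicit single space: every space is a separator, so the extra
# spaces produce empty strings in the result
by_single_space = line.split(' ')

print(by_whitespace)    # ['Aabha', '13']
print(by_single_space)  # ['Aabha', '', '', '13']
```

That is why the no-argument form is the one to use when the amount of whitespace varies.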


Now use the `csv` module and `csv.writer` to write the file to your folder. Save it here: 

R:\Psy407_9\Student_Data\MYDUCKID\baby_names_2012.csv

After it's done, open it with Excel and see if it worked the way you want. 

In [6]:
import csv

with open('baby_names_2012.csv','wb') as f:
    writer = csv.writer(f,delimiter=',')
    
    writer.writerow(splitlines[0]) #write the header!
    
    for row in splitlines[1:]: #skip the first line (the header)
        writer.writerow(row)
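
Besides opening the file in Excel, you can check it from Python by reading it back with `csv.reader`. This is only a sketch under a few assumptions: it uses made-up rows and a temporary file called `check.csv`, and it is written for Python 3, where the file is opened in text mode with `newline=''` rather than the `'wb'` mode used in the Python 2 cell above.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for splitlines
rows = [['name', 'count_2012'], ['Aabha', '13'], ['Aabriella', '5']]

# Write to a temporary folder so nothing in your own folder is touched
path = os.path.join(tempfile.mkdtemp(), 'check.csv')

# Python 3: text mode with newline='' (Python 2 would use 'wb')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # writes every row, header included

# Read the file back and compare it with what we wrote
with open(path, newline='') as f:
    rows_back = list(csv.reader(f))

print(rows_back == rows)
```

`csv.reader` always gives back strings, so the comparison only matches because the counts were still strings when we wrote them.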

Let's organize our information a different way, using a Python dictionary. Here we will make a dictionary where each key is a name, and its corresponding value is the frequency. Call the new dictionary `dnames`. You'll notice a problem, though: some of the names don't have matching frequencies (I'm not sure why; this is what you get with real data!). Give these names the value `None`, Python's special value for "null" or "missing". Also make sure to: 

* Capitalize the names (all uppercase)
* Save the frequency as an integer, not a string!

Careful, if you just use `print dnames` to check your work, it will be very slow! It's a huge dictionary! Try printing just the first few keys and the first few values.

In [7]:

dnames = {}

for row in splitlines[1:]: #skipping the header!
    name = row[0].upper()
    
    if len(row)<2:
        freq = None
    else:
        freq = int(row[1])
       
    dnames[name] = freq
    
    
keys = dnames.keys()
values = dnames.values()

print keys[:10]
print values[:10]



['KARMELA', 'KOLLEEN', 'LEONARA', 'SANGITA', 'EMMAROSE', 'JOHNNY', 'DYANDRA', 'THALASSA', 'DANELLA', 'GENECIS']
[108, 2621, 58, 187, 353, 13668, 41, 12, 2101, 170]
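
To see the missing-frequency case in isolation, here is a small sketch on made-up rows (the names `sample_rows`, `sample_names`, and `missing` are hypothetical). It builds the dictionary the same way and then lists the names whose value is `None`.

```python
# Hypothetical split rows; 'Aaden' has no count
sample_rows = [['Aabha', '13'], ['Aaden'], ['Aadhya', '218']]

sample_names = {}
for row in sample_rows:
    name = row[0].upper()
    # A row with only one element has no frequency
    freq = int(row[1]) if len(row) > 1 else None
    sample_names[name] = freq

# Names whose frequency is missing
missing = [name for name, freq in sample_names.items() if freq is None]

print(sample_names['AABHA'])
print(missing)
```

Testing with `is None` is the usual way to look for missing values, because a real frequency of `0` would also count as false in a plain `if`.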


Now let's save the data for later. Remember, it's a dictionary, so a JSON file will work nicely. Use the `json.dump` function to save it to the file: 

R:\Psy407_9\Student_Data\MYDUCKID\baby_names_2012.json

Then look at the file in Notepad++ to see what it looks like. 



In [8]:
import json
#obviously the file path will correspond to your own folder
with open('baby_names_2012.json','w') as f:
    json.dump(dnames,f)

Now try reading the same JSON file using `json.load`, and save the result into a variable called `dnames2`. Using `==`, see if `dnames` and `dnames2` are the same.

In [9]:
with open('baby_names_2012.json','r') as f:
    dnames2 = json.load(f)
    
    
dnames==dnames2

True
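
The round trip works here because the keys are strings and the values are integers or `None`. A minimal sketch with made-up dictionaries (`small` and `by_year` are hypothetical) shows what JSON does with `None`, and one case where the round trip would not give back the same dictionary.

```python
import json

# Hypothetical small dictionary in the same shape as dnames
small = {'AABHA': 13, 'AADEN': None}

# dumps/loads work on strings instead of files
text = json.dumps(small, sort_keys=True)
print(text)  # None is written as null

restored = json.loads(text)
print(restored == small)

# Caveat: JSON keys are always strings, so integer keys do not survive
by_year = {2012: 'AABHA'}
restored_by_year = json.loads(json.dumps(by_year))
print(restored_by_year == by_year)
```

The last comparison is `False` because the key comes back as the string `'2012'`.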

Now let's come full circle: take the dictionary that we loaded from the JSON file and save it as a csv file. We can do this with the `csv` module, using `csv.writer` and `writerow`. Use the `items` method to loop through the dictionary and write one row per name.

Save the new file as:

R:\Psy407_9\Student_Data\MYDUCKID\baby_names_2012_capital.csv

In [10]:
with open('baby_names_2012_capital.csv', 'wb') as f:
    writer = csv.writer(f,delimiter=',')
    
    for key,value in dnames2.items():
        writer.writerow([key,value])
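
Two things you might notice in the output file: the rows come out in no particular order, and the names with `None` have an empty second column. Here is a sketch of one possible variation, assuming Python 3 and a made-up dictionary `small`, that adds a header row and sorts the names. It writes to an in-memory `io.StringIO` buffer instead of a real file so you can look at the text directly.

```python
import csv
import io

# Hypothetical values in the same shape as dnames2
small = {'KOLLEEN': 26, 'AADEN': None, 'AABHA': 13}

buffer = io.StringIO()  # in-memory stand-in for a file (Python 3)
writer = csv.writer(buffer)

writer.writerow(['name', 'count_2012'])  # header row
for name, freq in sorted(small.items()):  # sorted by name
    writer.writerow([name, freq])         # None is written as an empty field

print(buffer.getvalue())
```

Whether you want a header or sorted rows depends on what the next program expects; the cell above is fine if all you need is the name and frequency pairs.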
