## Reading and Writing Files

It is pretty rare that you type all the data you want into Python. Most of the time, you will be getting data from different sources. Likewise, you will need to be able to save your data so you can use it in other programs (Excel, SPSS, R, etc.). This section will teach you the basics of reading and writing files so you can move data between Python and these other sources. 

### Common File Formats
There are various formats that people use for saving data. The easiest one to deal with is plain text, because it can be used everywhere. But people follow different conventions for laying out data in plain text.

* The most common is probably **Comma Separated Values (.csv)** files. These are for saving tabular data. The rule is very simple: each row of the data is on a separate line, and each column of data is separated by a comma character ",".

* Another common format is a **tab delimited** file. These usually have the .txt or .dat file extension. They are like csv files, except each column is separated by a tab character (like when you hit tab on the keyboard). In Python strings, a tab is written as `\t`. When you open a file in Notepad, Word, etc., the tab just shows up as blank space, but the character is still there!

* A popular format for passing data over the internet is **JavaScript Object Notation (JSON)**. These files are still plain text, but the information is stored a lot like a Python dictionary. There are named fields with matching values. They can also store lists. The benefit of this format is that it can handle data that is not a simple spreadsheet.
    * Think about the David Bowie dataset-- you have a single album with a single title and year, then a track list that contains several titles. This can easily be represented in a JSON file. 
    
* Another format that's popular on the web is **XML** or **HTML**. Again, this is plain text, but information can be represented hierarchically. The basic rule is that different types of information are surrounded by "tags". Each tag is inside of brackets `<tag>` and has a matching closing tag with a slash `</tag>`. For instance, I could organize the music information like this: 

```xml
<Artist>
    <name>David Bowie</name>
    <album>Heroes
        <year>1977</year>
        <tracks>
            <song>Heroes</song>
            <song>Beauty and the Beast</song>
            ...
        </tracks>
    </album>
</Artist>
```
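
To make these conventions concrete, here is a small sketch that writes one row of the baby-name data as a comma-separated line and as a tab-separated line, then stores the album information as JSON. The dictionary keys (`artist`, `album`, `year`, `tracks`) are just my own illustrative choices that mirror the XML example above, not a required layout.

```python
import json

# One row of the baby-name data, written two ways
row = ["Sophia", "F", "22267"]
csv_line = ",".join(row)     # columns separated by commas
tsv_line = "\t".join(row)    # columns separated by tabs

print(csv_line)
print(repr(tsv_line))        # repr() makes the hidden tab visible as \t

# Nested data fits naturally in JSON: one album, one year, a list of tracks
album = {
    "artist": "David Bowie",
    "album": "Heroes",
    "year": 1977,
    "tracks": ["Heroes", "Beauty and the Beast"],
}
json_text = json.dumps(album)
print(json_text)
```

Notice that the JSON text looks almost exactly like the dictionary you typed in.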

### Taking a Look

* First let's look at some data. Navigate to the folder **R:\Class_Data\datasets\sample_data**, and double-click on the "sample_data.csv" file. This will open it in Excel. That's because Excel knows how to read csv files. Notice that it looks like a normal Excel sheet. The data are from a set of baby names that we will use later. There are 3 columns of data: a name, a gender (M/F) and the number of occurrences of that name in the year 2012.

    * Now open the file using the program Notepad++ (open the program first, then drag the file into it). 

    * Notice how in Excel it displays things in different columns, but in Notepad++ you just see commas between values. 

* Now do the same thing with sample_data.txt. How is this different? 


### Loading files into Python

There are 2 basic steps to loading or saving files. For the first step, you open the file (or create it, if it doesn't exist yet) and you create a **file handle**. This is just a variable, but it tells Python how to find the file on your hard drive. 

Then you use other functions to read or write information to/from the file. This varies based on the file type and what you want to do with it. 

First see how we open a file below, using the `open` function. We call our file handle `f`. 

The second argument to `open` is the *mode*. This tells Python what we want to do with the file. 

* 'r' means we will only read from the file
* 'w' means we will only write to the file
    * if the file exists already, it is erased!
* 'a' (think "append") also opens the file for writing only, BUT anything you write is added to the end of the file if it already exists.

* you may notice code where people specify 'w+' or 'a+'. The `+` opens the file for reading as well as writing. You probably won't need this very often. You can get more detail [here](http://stackoverflow.com/questions/1466000/python-open-built-in-function-difference-between-modes-a-a-w-w-and-r) though.
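
If you want to see the difference between the modes for yourself, here is a minimal sketch. It assumes nothing about the class folders: it writes to a throwaway file in a temporary folder (the name `modes_demo.txt` is made up for this demo).

```python
import os
import tempfile

# A throwaway file in a temporary folder (made-up name, just for this demo)
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

f = open(path, "w")        # 'w' creates the file
f.write("first\n")
f.close()

f = open(path, "a")        # 'a' adds to the end of the existing file
f.write("second\n")
f.close()

f = open(path, "r")
after_append = f.read()    # both lines are there
f.close()

f = open(path, "w")        # 'w' again: the old contents are erased
f.write("third\n")
f.close()

f = open(path, "r")
after_rewrite = f.read()   # only the new line is left
f.close()

print(repr(after_append))
print(repr(after_rewrite))
```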


In [None]:
f = open('../datasets/sample_data/sample_data.csv','r') #open the file for reading

#do some stuff with the file

f.close() #close the file when you're done

Kind of uneventful, right? Let's do something else. Let's use the `readlines` function to read in all lines of the file, which are stored in a list. Take a look at `content` as well as the first 10 elements of it. How is the information stored? 

In [None]:
f = open('../datasets/sample_data/sample_data.csv','r') #open the file for reading

content = f.readlines()
content = content[:10] #let's just grab the first 10 lines


f.close() #close the file when you're done

See if you can loop through `content` and remove the newline characters (`\r\n`) from each element using `strip`. Save the result back into the `content` variable. 

Next take each line of `content` and split it into separate columns based on the delimiter, which is comma ",". Just print the result. Notice how many elements are in each list now. 
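
If you get stuck, here is one possible approach. To keep the sketch self-contained, `content` is typed in by hand as a short stand-in for what `readlines` gives you, so the loop is the part to pay attention to.

```python
# Stand-in for what readlines() returns (a few rows typed in by hand)
content = ["Sophia,F,22267\r\n", "Emma,F,20902\r\n", "Isabella,F,19058\r\n"]

# strip() with no arguments removes whitespace, including \r and \n, from both ends
cleaned = []
for line in content:
    cleaned.append(line.strip())
content = cleaned

# split(",") cuts each line into a list of columns
for line in content:
    columns = line.split(",")
    print(columns)
```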

### The `with` statement

Notice in the example above, we have to call `f.close()` when we're done with the file. If we don't do this, the file stays open, which ties up system resources and can leave data you wrote not fully saved to disk. You only want files open as long as you need them. If you do a complicated operation with your file, it is easy to forget to close it.

That's why Python has the `with`...`as` statement. It is laid out like a loop or an `if` statement, but instead of repeating anything it just marks out the block of code that uses the file. When the block is done, Python closes the file for you, so you don't have to call `f.close()`.

See the example below. Notice how it has the colon and the lines are indented, just like in an `if` statement or a `for` loop. In English, the statement reads like this: 

"With this file that I'm opening, call it `f`, do stuff to it (readlines), then close it when you're done."



In [None]:
with open('../datasets/sample_data/sample_data.csv','r') as f:
    content = f.readlines()
    
print(content[:10])


This is just a cleaner and more concise way to open your files, and you don't have to remember to close the file when you're done. 
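
You can check that the file really does get closed. Every file handle has a `closed` attribute, and this small sketch (again using a made-up file in a temporary folder) looks at it inside and outside the `with` block.

```python
import os
import tempfile

# A throwaway file in a temporary folder (made-up name, just for this demo)
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, "w") as f:
    f.write("hello\n")
    inside = f.closed    # still open while we are inside the block

outside = f.closed       # closed automatically once the block ends

print(inside)
print(outside)
```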

### The csv package

Since csv files are so common, Python has a `csv` package for dealing with these files. In this example, we read in a file row-by-row (instead of all-at-once with `readlines`). 

In [4]:
import csv

with open('./datasets/sample_data/sample_data.csv','r') as f:
    csvreader = csv.reader(f,delimiter=',')
    
    for row in csvreader:
        print(row)

['Sophia', 'F', '22267']
['Emma', 'F', '20902']
['Isabella', 'F', '19058']
['Olivia', 'F', '17277']
['Ava', 'F', '15512']
['Emily', 'F', '13619']
['Abigail', 'F', '12662']
['Mia', 'F', '11998']
['Madison', 'F', '11374']
['Elizabeth', 'F', '9674']
['Chloe', 'F', '9641']
['Ella', 'F', '9177']
['Avery', 'F', '8309']
['Addison', 'F', '8165']
['Aubrey', 'F', '8037']
['Lily', 'F', '7945']
['Natalie', 'F', '7885']
['Sofia', 'F', '7820']
['Charlotte', 'F', '7468']
['Zoey', 'F', '7452']
['Grace', 'F', '7359']
['Hannah', 'F', '7261']
['Amelia', 'F', '7235']
['Harper', 'F', '7179']
['Lillian', 'F', '7143']
['Samantha', 'F', '6913']
['Evelyn', 'F', '6869']
['Victoria', 'F', '6849']
['Brooklyn', 'F', '6769']
['Zoe', 'F', '6435']
['Layla', 'F', '6251']
['Hailey', 'F', '5901']
['Leah', 'F', '5758']
['Kaylee', 'F', '5602']
['Anna', 'F', '5600']
['Aaliyah', 'F', '5495']
['Gabriella', 'F', '5487']
['Allison', 'F', '5415']
['Nevaeh', 'F', '5356']
['Alexis', 'F', '5337']
['Audrey', 'F', '5291']
...

Notice that we first make a variable `csvreader` using `csv.reader`. We can then loop through each row using this variable. See how it already splits up our data into a list with 3 elements? This saves us from doing `strip` and `split`. 
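
There is another reason to prefer `csv.reader` over splitting lines yourself: it understands quoted values that contain commas. Here is a minimal sketch, assuming Python 3 and a made-up line of data (it is not from the sample file). `io.StringIO` just lets us treat a string as if it were an open file.

```python
import csv
import io

# A made-up line where the first value contains a comma inside quotes
text = '"Smith, Jane",F,12\n'

naive = text.strip().split(",")                  # also splits inside the quotes
parsed = next(csv.reader(io.StringIO(text)))     # respects the quotes

print(naive)
print(parsed)
```

The `split` version ends up with 4 pieces, while `csv.reader` gives back the 3 columns we actually want.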

Now try the same thing with our tab-delimited file `sample_data.txt`. Even though this isn't a .csv file, it's OK. The package is made for reading any delimited file. The only difference is that the delimiter is a tab (\t) instead of a comma (,). 

Notice that we add a "U" to the mode when we call `open`. This enables "universal newline mode". This has to do with the way that Windows and Mac sometimes use different characters to specify a new line, which can cause problems in particular scenarios. I don't expect you to remember this. You'll know if you need it because you'll get an error that says: 

"Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?"

So, if you use "r" and you get that error, try "rU". 
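
A note if you are running Python 3 rather than the Python 2 this lesson was written for: there, plain 'r' already translates the different newline styles for you, and the "U" flag was removed in Python 3.11. The sketch below assumes Python 3 and a made-up temporary file. It writes three different line endings as raw bytes and reads them back in text mode.

```python
import os
import tempfile

# A throwaway file in a temporary folder (made-up name, just for this demo)
path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")

# Write raw bytes using three different line endings: \r\n, \r, and \n
with open(path, "wb") as f:
    f.write(b"a\r\nb\rc\n")

# In Python 3, plain 'r' translates all of them to \n while reading
with open(path, "r") as f:
    text = f.read()

print(repr(text))
```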

In [7]:
with open('./datasets/sample_data/sample_data.txt','rU') as f:
    csvreader = csv.reader(f,delimiter=',')
    for row in csvreader:
        print(row)

['Sophia\tF\t22267']
['Emma\tF\t20902']
['Isabella\tF\t19058']
['Olivia\tF\t17277']
['Ava\tF\t15512']
['Emily\tF\t13619']
['Abigail\tF\t12662']
['Mia\tF\t11998']
['Madison\tF\t11374']
['Elizabeth\tF\t9674']
['Chloe\tF\t9641']
['Ella\tF\t9177']
['Avery\tF\t8309']
['Addison\tF\t8165']
['Aubrey\tF\t8037']
['Lily\tF\t7945']
['Natalie\tF\t7885']
['Sofia\tF\t7820']
['Charlotte\tF\t7468']
['Zoey\tF\t7452']
['Grace\tF\t7359']
['Hannah\tF\t7261']
['Amelia\tF\t7235']
['Harper\tF\t7179']
['Lillian\tF\t7143']
['Samantha\tF\t6913']
['Evelyn\tF\t6869']
['Victoria\tF\t6849']
['Brooklyn\tF\t6769']
['Zoe\tF\t6435']
['Layla\tF\t6251']
['Hailey\tF\t5901']
['Leah\tF\t5758']
['Kaylee\tF\t5602']
['Anna\tF\t5600']
['Aaliyah\tF\t5495']
['Gabriella\tF\t5487']
['Allison\tF\t5415']
['Nevaeh\tF\t5356']
['Alexis\tF\t5337']
['Audrey\tF\t5291']
['Savannah\tF\t5176']
['Sarah\tF\t5167']
['Alyssa\tF\t5078']
['Claire\tF\t4941']
['Taylor\tF\t4852']
['Riley\tF\t4816']
['Camila\tF\t4789']
['Arianna\tF\t4708']
...

Notice it doesn't look the way we want. It has all these `\t` characters, and each row is one long string instead of 3 separate elements. This is because we still told it that columns are delimited by a comma. Change the `delimiter` argument to `"\t"` and see if that works. 
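
For comparison, here is a minimal sketch of the tab version, assuming Python 3. It reads from a short string typed in by hand instead of `sample_data.txt`, so you can run it anywhere.

```python
import csv
import io

# Two tab-delimited rows typed in by hand, standing in for sample_data.txt
text = "Sophia\tF\t22267\nEmma\tF\t20902\n"

rows = []
for row in csv.reader(io.StringIO(text), delimiter="\t"):
    rows.append(row)     # each row is now a list of 3 separate strings

print(rows)
```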

### Writing csv files

Now let's write our data to a file. First, read in the file sample_data.csv, but this time only read in the first 25 lines. You will have to modify the `for` loop to loop 25 times, and at each loop call `csvreader.next()` and append the result to a list called `subdata`. The `next` method just reads the next line in the file. When it's done, print the contents of `subdata`.

In [None]:
subdata = []

with open('../datasets/sample_data/sample_data.csv','r') as f:
    #fill in the rest
    pass #placeholder so the cell runs; replace it with your own code
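
Here is a sketch of the general pattern on a tiny hand-typed dataset, reading 2 rows instead of 25. It uses the built-in `next(csvreader)`, which does the same job as `csvreader.next()` and also works in Python 3. The variable `text` is a stand-in for the real file.

```python
import csv
import io

# Four rows typed in by hand, standing in for sample_data.csv
text = "Sophia,F,22267\nEmma,F,20902\nIsabella,F,19058\nOlivia,F,17277\n"

subdata = []
csvreader = csv.reader(io.StringIO(text))
for i in range(2):                      # 2 here, 25 in the exercise
    subdata.append(next(csvreader))     # next() pulls one more row from the reader

print(subdata)
```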

Now copy and paste your code from above and modify it slightly to save the contents of `subdata` to your own csv file. You will save the data into a csv file in **R:\Student_Data\your_duckid\test_data.csv**. 

You will make 3 changes. 
* First, change the mode for `open` to `w`
* Instead of csvreader, you will create a variable `csvwriter` using the `csv.writer` function
* Loop through each row of `subdata`, and use the function `csvwriter.writerow(rowdata)`

Open the file in Excel to confirm that it worked. 
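
If you want something to compare against, here is a minimal sketch of the writing step, assuming Python 3. It saves to a temporary folder rather than your student folder, and `subdata` is typed in by hand. The `newline=""` argument is what the Python 3 `csv` documentation recommends when opening files for this module.

```python
import csv
import os
import tempfile

# Hand-typed stand-in for the subdata list from the previous step
subdata = [["Sophia", "F", "22267"], ["Emma", "F", "20902"]]

# A throwaway file in a temporary folder (swap in your own path)
path = os.path.join(tempfile.mkdtemp(), "test_data.csv")

with open(path, "w", newline="") as f:
    csvwriter = csv.writer(f)
    for rowdata in subdata:
        csvwriter.writerow(rowdata)     # writes one line per row

# Read the file back in to confirm it worked
with open(path, "r", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back)
```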

### JSON Files

Now let's load the same data that's stored in a JSON file. First, open the file in Notepad++ to see how it's stored. Notice it looks a lot like how you would type it into Python! Now we will use the `json` package to load from that file into a Python variable. What is the datatype of `namedata`? 

In [None]:
import json

with open('../datasets/sample_data/sample_data.json','r') as f:
    namedata = json.load(f)

JSON files are very useful for saving Python dictionary and list data. Let's use our contact list example from a previous lecture. Notice that `contacts` is a list, and each element in the list is a dictionary. 

In [None]:
person1 = {'Name' : 'John Q. Taxpayer', 
         'Phone' : '541-555-1234',
         'Email' : 'johnq@yahoo.com'}

person2 = {'Name' : 'Barack Obama', 
        'Email' : 'president@whitehouse.gov',
        'Phone' : '555-123-4567'}

person3 = {'Name': 'Jason Hubbard',
          'Email': 'hubbard3@uoregon.edu',
          'Phone': '555-123-7712'}

contacts = [person1,person2,person3]


Now let's save it into a JSON file under **R:\Student_Data\your_duckid\test_data.json** using `json.dump`. Now open the file in Notepad++ and see what it looks like. 

In [None]:
#the r in front of the string keeps the backslashes literal (otherwise \t would turn into a tab)
with open(r'R:\Student_Data\MYDUCKID\test_data.json','w') as f:
    json.dump(contacts,f)
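
To convince yourself that nothing is lost along the way, you can dump the data and load it straight back. This is a minimal sketch using a shortened contact list and a made-up file in a temporary folder. The `indent` argument is optional and just makes the saved file easier to read.

```python
import json
import os
import tempfile

# A shortened version of the contact list
contacts = [
    {"Name": "John Q. Taxpayer", "Phone": "541-555-1234"},
    {"Name": "Jason Hubbard", "Phone": "555-123-7712"},
]

# A throwaway file in a temporary folder (swap in your own path)
path = os.path.join(tempfile.mkdtemp(), "test_data.json")

with open(path, "w") as f:
    json.dump(contacts, f, indent=2)

with open(path, "r") as f:
    loaded = json.load(f)

print(type(loaded))
print(loaded == contacts)
```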

### Binary Files

Plain text is great because you can move your data to other programs easily. But sometimes you want to save your data for Python only. Plain text files are also not efficient for storing large amounts of information. So if you have lots of data that you know you only need in Python, you can save it using the **pickle** package (yes, pickle). Files have the `.pkl` extension. Because these are binary files, we add a "b" to the mode when we call `open` ('rb' for reading, 'wb' for writing). Here we load in a .pkl file that has the same info as the contact list we created above. We save it as a new variable name, `loaded_contacts`.

In [None]:
import pickle
with open('../datasets/sample_data/sample_data.pkl','rb') as f: #'rb' = read in binary mode
    loaded_contacts = pickle.load(f)

We save the data in a similar way, just using `pickle.dump`. Here we just save a variable called `x` that's a list. 

In [None]:
x = [1,2,3]

#r'...' keeps the backslashes literal, and 'wb' writes in binary mode
with open(r'R:\Student_Data\MYDUCKID\saved_x.pkl','wb') as f:
    pickle.dump(x,f)




In [None]:
#now load the saved data, saving the result into a variable y
with open(r'R:\Student_Data\MYDUCKID\saved_x.pkl','rb') as f: #'rb' = read in binary mode
    y = pickle.load(f)
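
One reason you might pick pickle over JSON for Python-only data is that it keeps Python's own data types. Here is a small sketch with made-up data containing a set and a tuple. It uses `pickle.dumps` and `pickle.loads`, which work like `dump` and `load` but go to and from memory instead of a file.

```python
import json
import pickle

# Made-up data containing a set and a tuple
data = {"tags": {"a", "b"}, "point": (1, 2)}

# pickle gives back exactly the same Python types
restored = pickle.loads(pickle.dumps(data))
print(restored == data)
print(type(restored["point"]))

# json cannot store a set, so this raises a TypeError
try:
    json.dumps(data)
    json_ok = True
except TypeError:
    json_ok = False
print(json_ok)
```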


Now try to open your .pkl file in Notepad++. What does it look like? 