# Files

So far we've focused on learning how to do things entirely within Python.  In the real world, though, we need to interact with data outside of our programs.  One of the most direct ways to interact with outside data is through files.  But files are tricky, both in how they are represented and in how Python accesses them.

## Working example -- working with open text on Project Gutenberg

There are a lot of books available for free, in the public domain.  We can use this text to explore working with files in Python. For instance, Fitzgerald's Gatsby is available on Project Gutenberg AU:

http://gutenberg.net.au/ebooks02/0200041.txt

In [None]:
fileHandle = open('gatsby.txt')

In [None]:
pwd

In [None]:
fileHandle = open('files/gatsby.txt')

In [None]:

numberOfLines = 0
for line in fileHandle :
    numberOfLines += 1
    
print(numberOfLines)

This is an awful error message, but let's look closely... the error comes from a "codec" called utf-8.  A codec is geek speak for the specification of how the file's text is encoded in binary... meaning the format of the 1s and 0s of the file.
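To make the codec idea concrete, here is a minimal sketch. The word "café" is just an illustrative string, not something taken from the Gatsby file. It shows the same text turning into different bytes under two encodings, and a decode failing when the wrong codec is used.

```python
# The same text becomes different bytes depending on the encoding.
text = "café"

latin1_bytes = text.encode("iso-8859-1")   # 'é' is a single byte here
utf8_bytes = text.encode("utf-8")          # 'é' takes two bytes in UTF-8

print(latin1_bytes)
print(utf8_bytes)

# Decoding ISO-8859-1 bytes as UTF-8 fails with a UnicodeDecodeError,
# the same family of error that reading the file with the wrong codec gives.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print("decode failed:", error)

# Decoding with the codec that was used to encode gets the text back.
roundtrip = latin1_bytes.decode("iso-8859-1")
print(roundtrip)
```

The bytes on disk never change; only the rule used to turn them back into characters does.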

On a Mac, we can find the encoding by running `file -I {filename}` in the terminal (on Linux the flag is lowercase, `file -i`), or by doing it directly in the notebook:

In [None]:
!file -I files/gatsby.txt

WTF is [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) and why do we care?

Not everything was... or is Unicode.  While the situation will get better, older encodings will never go away entirely.

In [None]:
help(open)

In [None]:
fileHandle = open('files/gatsby.txt', encoding='iso-8859-1')

In [None]:
numberOfLines = 0
for line in fileHandle :
    numberOfLines += 1
    
print(numberOfLines)

![VHS](https://media4.giphy.com/media/pWsz9pgd1X1Re/giphy.gif?cid=ecf05e47m6aq07s21e9wqy9o5aqvmzru61pj94cctpnneypx&rid=giphy.gif)

The fileHandle is like a VHS tape... once the iterator has played to the end, it has to be rewound:

In [None]:
fileHandle.seek(0)
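The rewinding behavior can be seen in isolation with a small sketch. Here `io.StringIO` is an in-memory stand-in for an open text file (an assumption made so the example runs without any file on disk), and the three lines of text are made up.

```python
import io

# In-memory stand-in for an open text file (hypothetical contents).
handle = io.StringIO("line one\nline two\nline three\n")

first_pass = sum(1 for _ in handle)    # plays the "tape" to the end
second_pass = sum(1 for _ in handle)   # nothing left to read
handle.seek(0)                         # rewind to the beginning
third_pass = sum(1 for _ in handle)    # all lines are available again

print(first_pass, second_pass, third_pass)
```

The second pass counts nothing because the handle is already at the end; only after `seek(0)` do the lines come back.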

In [None]:
numberOfChapters = 0
for line in fileHandle :
    line = line.lstrip()
    if line.startswith('Chapter') :
        numberOfChapters += 1

print(numberOfChapters)

Wait, wait, wait... what's a "line" and all these line methods?

In [None]:
help(str)

In [None]:
someString = "                        this is a string with a lot of whitespace"
print(someString)
print(someString.lstrip())

In [None]:
countJay = 0
countDaisy = 0

fileHandle.seek(0)
for line in fileHandle :
    if line.find('Jay') != -1 :
        countJay += 1
    if line.find('Daisy') != -1 :   # separate if, so a line naming both counts for both
        countDaisy += 1

print(countJay)
print(countDaisy)

When done using a file, be sure to close it.  You do that by calling close() on the file handle:

In [None]:
fileHandle.close()
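Forgetting to call close() is easy, so one common alternative is Python's `with` statement, which closes the file when the indented block ends. The sketch below is not part of the lecture's own code; the scratch path and file contents are hypothetical and only there so the example runs anywhere.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not depend on the lecture files.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, mode="w", encoding="utf-8") as handle:
    handle.write("first line\nsecond line\n")
# Leaving the with block closes the file, even if an error occurred inside it.

with open(path, encoding="utf-8") as handle:
    lines = handle.readlines()

print(handle.closed)
print(len(lines))
```

The name `handle` still exists after the block, but the file behind it is already closed.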

## Writing files

In [None]:
help(open)

In [None]:
fileHandle = open('files/lectureFile.txt', mode='w')

In [None]:
fileHandle.write("This is a test file write")

In [None]:
fileHandle.seek(0)
for line in fileHandle :
    print(line)

File was not open for "reading"!!  Let's close it and reopen to read.

In [None]:
fileHandle.close()

fileHandle = open('files/lectureFile.txt', mode='r')
fileHandle.seek(0)
for line in fileHandle :
    print(line)

In [None]:
fileHandle = open('files/lectureFile.txt', mode='w')
fileHandle.write("Careful with your modes!! the W is DESTRUCTIVE!!\n")
fileHandle.close()

fileHandle = open('files/lectureFile.txt', mode='r')
fileHandle.seek(0)
for line in fileHandle :
    print(line)

In [None]:
fileHandle = open('files/lectureFile.txt', mode='a')
fileHandle.write("Using mode 'a' is an append, it will add to the end of the file\n")
fileHandle.close()

fileHandle = open('files/lectureFile.txt', mode='r')
fileHandle.seek(0)
for line in fileHandle :
    print(line)
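The three modes used above can be compared side by side in one short sketch. It writes to a hypothetical scratch file rather than `files/lectureFile.txt`, and the strings written are placeholders.

```python
import io
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch the lecture files.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

handle = open(path, mode="w")
handle.write("first\n")
try:
    handle.read()              # a handle opened with 'w' cannot be read
    read_failed = False
except io.UnsupportedOperation:
    read_failed = True
handle.close()

handle = open(path, mode="w")  # 'w' again: the old contents are erased
handle.write("second\n")
handle.close()

handle = open(path, mode="a")  # 'a': new text goes after what is already there
handle.write("third\n")
handle.close()

handle = open(path, mode="r")  # 'r': read only
contents = handle.read()
handle.close()

print(read_failed)
print(repr(contents))
```

The line "first" is gone because the second `open(..., mode="w")` emptied the file before anything was written.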

## Structured text files, such as CSV files

Data that we put into files can also be structured, for example as a table.

Census data, for instance, is available online:

http://census.ire.org/data/bulkdata.html

In [None]:
import csv

csvHandle = open('files/all_050_in_42.P17.csv')
csvReader = csv.reader(csvHandle)
row0 = next(csvReader)
print(row0)
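What `csv.reader` hands back is easier to see on a tiny table. The sketch below uses made-up column names and values held in an `io.StringIO` object (all hypothetical, not taken from the Census file) so it runs without the data file.

```python
import csv
import io

# Made-up table standing in for a CSV file on disk.
sample = io.StringIO("NAME,VALUE\nCounty A,2.5\nCounty B,3.25\n")

reader = csv.reader(sample)
header = next(reader)   # the first row holds the column names
rows = list(reader)     # the remaining rows, each a list of strings

print(header)
print(rows)
```

Each row is a plain list, and every field is a string, including the ones that look like numbers.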

In [None]:
highest = -float('inf')
lowest = float('inf')

largestFamiliesCounty = ''
smallestFamiliesCounty = ''

for row in csvReader :
    if float(row[13]) > highest :
        highest = float(row[13])
        largestFamiliesCounty = row[8] + ' ' + row[13]
    
    if float(row[13]) < lowest :
        lowest = float(row[13])
        smallestFamiliesCounty = row[8] + ' ' + row[13]
        

print(largestFamiliesCounty)
print(smallestFamiliesCounty)      


In [None]:
csvHandle.close()

In [None]:
csvHandle = open('files/all_050_in_42.P17.csv')
reader = csv.reader(csvHandle)
d = {}  # create the new dictionary
next(reader)  # skip the header
for row in reader:
    k = row[8]          # county name
    v = float(row[13])  # convert to a number so max/min compare numerically
    d[k] = v
csvHandle.close()

In [None]:
max(d.items())

In [None]:
import operator
max(d.items(), key=operator.itemgetter(1))

In [None]:
min(d.items(), key=operator.itemgetter(1))
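One thing to watch when ranking values read from a CSV file: fields arrive as strings, and strings compare character by character rather than by numeric size. A minimal sketch, with made-up values:

```python
# Hypothetical values, in the form a csv reader would hand them over: strings.
values = ["9.5", "10.25", "100.0"]

largest_as_text = max(values)               # compares character by character
largest_as_number = max(values, key=float)  # compares the numeric values

print(largest_as_text)
print(largest_as_number)
```

As text, "9.5" wins because the character "9" sorts after "1"; converting with `float` gives the answer a person would expect.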