# Files

You met these in COMP103.

While a program is running, its data is stored in random access memory (RAM). RAM is fast and inexpensive, but it is also volatile, which means that when the program ends, or the computer shuts down, data in RAM disappears. To make data available the next time the computer is turned on and the program is started, it has to be written to a non-volatile storage medium, such as a hard drive, USB drive, or CD-RW.

Data on non-volatile storage media is stored in named locations on the media called files. By reading and writing files, programs can save information between program runs.

Working with files is a lot like working with a notebook. To use a notebook, it has to be opened. When done, it has to be closed. While the notebook is open, it can either be read from or written to. In either case, the notebook holder knows where they are. They can read the whole notebook in its natural order or they can skip around.

All of this applies to files as well. To open a file, we specify its name and indicate whether we want to read or write.


###  Writing our first file

Let’s begin with a simple program that writes three lines of text into a file (the file will be created in the directory you were in when you started Python from the command line):

In [None]:
myfile = open("test.txt", "w")
myfile.write("My first file written from Python\n") # python will convert \n to os.linesep
myfile.write("---------------------------------\n")
myfile.write("Hello, world!\n")
myfile.close()

In [None]:
%cat test.txt 


Opening a file creates what we call a file handle. In this example, the variable myfile refers to the new handle object. Our program calls methods on the handle, and this makes changes to the actual file which is usually located on our disk.

On line 1, the open function takes two arguments. The first is the name of the file, and the second is the mode. Mode "w" means that we are opening the file for writing.

With mode "w", if there is no file named test.txt on the disk, it will be created. If there already is one, it will be replaced by the file we are writing.
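The replace-on-open behaviour of mode "w" can be checked with a small sketch. It assumes a throwaway directory and a hypothetical file name, `demo.txt`, so that no real file is touched:

```python
import os
import tempfile

# Work in a throwaway directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")  # hypothetical file name

    handle = open(path, "w")      # the file does not exist yet, so it is created
    handle.write("first version\n")
    handle.close()

    handle = open(path, "w")      # same name, mode "w" again: old contents are discarded
    handle.write("second version\n")
    handle.close()

    handle = open(path, "r")
    contents = handle.read()
    handle.close()

print(repr(contents))
```

Only the text from the second write should remain in `contents`.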

To put data in the file we invoke the write method on the handle, shown in lines 2, 3 and 4 above. In bigger programs, lines 2–4 will usually be replaced by a loop that writes many more lines into the file.

Closing the file handle (line 5) tells the system that we are done writing and makes the disk file available for reading by other programs (or by our own program).

### Reading a file line-at-a-time

Now that the file exists on our disk, we can open it, this time for reading, and read all the lines in the file, one at a time. This time, the mode argument is "r" for reading:

### The longish way to do it:

In [None]:
fin = open('test.txt', "r")
for line in fin:
    print (line, end="") # suppress end of line 
fin.close()

### The nicer way to do it:
Use the 'with' keyword. It handles closing the file after use.

In [None]:
with open('test.txt', "r") as myfile:
    for line in myfile:
        print (line, end="")

This is a handy pattern for our toolbox. 

In bigger programs, we’d squeeze more extensive logic into the body of the loop.

Note that the ```end=""``` argument suppresses the newline character that ```print``` usually appends to our strings. Why? Because each line read from the file already ends with its own newline.
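To make the newline at the end of each line visible, here is a small sketch. It uses `io.StringIO` as a stand-in for an open text file (an assumption made only so the example needs no file on disk), and shows `rstrip("\n")` as an alternative to `end=""`:

```python
import io

# io.StringIO behaves like a text file opened for reading.
fake_file = io.StringIO("Hello, world!\nGoodbye!\n")

lines_seen = []
for line in fake_file:
    lines_seen.append(line)
    print(repr(line))           # repr makes the trailing "\n" visible
    print(line.rstrip("\n"))    # strip it, then let print add its own newline
```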

### Turning a file into a list of lines

It is often useful to fetch data from a disk file and turn it into a list of lines. Suppose we have a file containing our friends and their email addresses, one per line in the file. But we’d like the lines sorted into alphabetical order. A good plan is to read everything into a list of lines, then sort the list, and then write the sorted list back to another file:

In [None]:
# first let's create a file to sort; note that if it already exists we overwrite it!
with open("friends.txt", "w") as myfile:
    myfile.write("Harley\n")
    myfile.write("Olivia\n")
    myfile.write("Charlotte\n")
    myfile.write("Emily\n")
    myfile.write("Isla\n")
    myfile.write("Vincent\n")
    myfile.write("Phoenix\n")
    myfile.write("Cohen\n")

In [None]:
%cat friends.txt

In [11]:
# now let's sort it
with open("friends.txt", "r") as f:
    xs = f.readlines()

print (xs)
xs.sort()

with open("sortedfriends.txt", "w") as g:
    for v in xs:
        g.write(v)

['Harley\n', 'Olivia\n', 'Charlotte\n', 'Emily\n', 'Isla\n', 'Vincent\n', 'Phoenix\n', 'Cohen\n']


In [12]:
%cat sortedfriends.txt

Charlotte
Cohen
Emily
Harley
Isla
Olivia
Phoenix
Vincent


*Useful tip!*

The readlines method reads all the lines and returns them as a list of strings.

We could have used the template from the previous section to read each line one-at-a-time, and to build up the list ourselves, but it is a lot easier to use the method that the Python implementors gave us!
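For comparison, here is a sketch that builds the list by hand with the line-at-a-time loop and checks it against `readlines`. The file name `friends_demo.txt` and the temporary directory are assumptions made to keep the example self-contained:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "friends_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("Harley\nOlivia\nCharlotte\n")

    # The manual way: loop over the file and collect each line ourselves.
    manual = []
    with open(path, "r") as f:
        for line in f:
            manual.append(line)

    # The shortcut: readlines does the same job in one call.
    with open(path, "r") as f:
        shortcut = f.readlines()

print(manual == shortcut)
```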

### Reading the whole file at once

Another way of working with text files is to read the complete contents of the file into a string, and then to use our string-processing skills to work with the contents.

We’d normally use this method of processing files if we were not interested in the line structure of the file. 

We read the whole file into a string and then use the split method to convert it into a list of words.

In [14]:
# this loads the entire file into a string
with open("friends.txt") as f:
    content = f.read()

print(content)

words = content.split()
print("There are {0} words in the file.".format(len(words)))
for word in words:
    print(word)

Harley
Olivia
Charlotte
Emily
Isla
Vincent
Phoenix
Cohen

There are 8 words in the file.
Harley
Olivia
Charlotte
Emily
Isla
Vincent
Phoenix
Cohen


Notice here that we left out the "r" mode when calling open. By default, if we don’t supply the mode, Python opens the file for reading.

Your file paths may need to be explicitly named.

In the above example, we’re assuming that the file friends.txt is in the current directory (the directory Python was started from). If this is not the case, you may need to provide a full or a relative path to the file. On Windows, a full path could look like "C:\\temp\\somefile.txt", while on a Unix system the full path could be "/home/jimmy/somefile.txt".

### Directories

Files on non-volatile storage media are organized by a set of rules known as a file system. File systems are made up of files and directories, which are containers for both files and other directories.

When we create a new file by opening it and writing, the new file goes in the current directory (wherever we were when we ran the program). Similarly, when we open a file for reading, Python looks for it in the current directory.

If we want to open a file somewhere else, we have to specify the path to the file, which is the name of the directory (or folder) where the file is located:

In [15]:
with open("/home/ryan/Dropbox/University/Lecturing/NWEN241/notebooks/friends.txt", "r") as wordsfile:
    wordlist = wordsfile.readlines()
    print(type(wordlist))
    print(wordlist[:5])

<class 'list'>
['Harley\n', 'Olivia\n', 'Charlotte\n', 'Emily\n', 'Isla\n']


This (Unix) example opens a file named friends.txt that resides in a directory named notebooks, which resides somewhere in my filesystem. It then reads every line into a list using readlines, and prints the first 5 elements of that list.

A Windows path might be "c:/temp/words.txt" or "c:\\temp\\words.txt". Because backslashes are used to escape things like newlines and tabs, we need to write two backslashes in a literal string to get one! So the length of these two strings is the same!
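The claim about string lengths is easy to verify. The sketch below also shows a raw string (the `r"..."` prefix), which is not used elsewhere in these notes but is another standard way to write backslashes without doubling them:

```python
forward = "c:/temp/words.txt"
escaped = "c:\\temp\\words.txt"   # each \\ in the source is one backslash in the string
raw = r"c:\temp\words.txt"        # raw string: backslashes are taken literally

print(len(forward), len(escaped), len(raw))
print(escaped == raw)
print(escaped)
```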

We cannot use / as part of a filename, and on Windows we cannot use \ either; they are reserved as delimiters between directory names and filenames.

The file /home/ryan/Dropbox/University/Lecturing/NWEN241/notebooks/friends.txt should exist and contain a list of words.

### OS Module

The ```os``` module provides functionality for determining the current directory and navigating between directories in an operating-system-independent manner.

In [16]:
import os
cwd = os.getcwd()
print (cwd)

/home/ryan/Dropbox/University/Lecturing/NWEN241/notebooks


Useful functions:

```os.path.abspath``` returns the absolute path to a given file.

```os.path.exists``` checks if the file exists.

In [17]:
os.path.abspath("test.txt")

'/home/ryan/Dropbox/University/Lecturing/NWEN241/notebooks/test.txt'

In [18]:
os.path.exists("test.txt")

True

In [19]:
os.path.exists("bad_file")

False
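As a further sketch of the operating-system-independent style, `os.path.join` builds a path using the right separator for the current platform, and `os.path.dirname` and `os.path.basename` take it apart again. The file name `test.txt` is reused from the earlier examples; the file does not need to exist for these calls:

```python
import os

cwd = os.getcwd()
full_path = os.path.join(cwd, "test.txt")   # uses "/" or "\" as appropriate

print(full_path)
print(os.path.dirname(full_path) == cwd)    # the directory part
print(os.path.basename(full_path))          # the file name part
```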

### Handling Exceptions

What if a file doesn't exist?

We can catch errors by wrapping the code in a try and except clause.

In this example we don't care exactly what the error is ...

In [20]:
try:
    fin = open('test.txt', "r")
    for line in fin:
        print (line)
    line.this_is_a_bad_methid()
    fin.close()
except:
    print("Something went wrong.")

My first file written from Python

---------------------------------

Hello, world!

Something went wrong.


In [21]:
try:
    fin = open('test.txt', "r")
    for line in fin:
        print (line)
    line.this_is_a_bad_methid()
    fin.close()
except Exception as e: 
    print("Something went wrong.")
    print (e)

My first file written from Python

---------------------------------

Hello, world!

Something went wrong.
'str' object has no attribute 'this_is_a_bad_methid'


If we try to open a file that doesn’t exist for reading, Python raises an error (a FileNotFoundError), which we can catch in the same way.
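A minimal sketch of catching that specific error follows. It assumes a path inside a freshly created temporary directory, which guarantees the file is missing; the name `bad_file` is only illustrative:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # A path inside a brand-new, empty directory cannot exist yet.
    missing_path = os.path.join(tmpdir, "bad_file")
    try:
        with open(missing_path, "r") as fin:
            contents = fin.read()
        outcome = "read"
    except FileNotFoundError as e:
        outcome = "missing"
        print("Could not open the file:", e)
```

Naming the exception type means that other, unexpected errors are not silently swallowed.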

### Structured Text Files

With simple text files, the only level of organization is the line. Sometimes, you want more structure than that. You might want to save data for our program to use later, or send data to another program.

There are many formats, and here’s how you can distinguish them:

* A separator, or delimiter, character like tab ('\t'), comma (','), or vertical bar ('|'). This is an example of the comma-separated values (CSV) format.
* '<' and '>' around tags. Examples include XML and HTML.
* Punctuation. An example is JavaScript Object Notation (JSON).
* Indentation. An example is YAML (which depending on the source you use means “YAML Ain’t Markup Language;” you’ll need to research that one yourself).
* Miscellaneous, such as configuration files for programs. 

Each of these structured file formats can be read and written by at least one Python module.

Let's look at ONE example: CSV

#### CSV

Delimited files are often used as an exchange format for spreadsheets and databases. You could read CSV files manually, a line at a time, splitting each line into fields at comma separators, and adding the results to data structures such as lists and dictionaries. But it’s better to use the standard csv module, because parsing these files can get more complicated than you think.

* Some have alternate delimiters besides a comma: '|' and '\t' (tab) are common.
* Some have escape sequences. If the delimiter character can occur within a field, the entire field might be surrounded by quote characters or preceded by some escape character.
* Files have different line-ending characters. Unix uses '\n', Microsoft uses '\r\n', and Apple used to use '\r' but now uses '\n'.
* There can be column names in the first line. 
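Two of these complications can be sketched briefly. The rows below are made up for illustration, and `io.StringIO` stands in for an open file:

```python
import csv
import io

# A field that contains the delimiter, protected by quote characters.
raw_line = '"Blofeld, Ernst",SPECTRE\n'

naive = raw_line.strip().split(",")               # splits inside the quoted field
parsed = next(csv.reader(io.StringIO(raw_line)))  # respects the quotes

# An alternate delimiter is handled with the delimiter argument.
piped = next(csv.reader(io.StringIO("Auric|Goldfinger\n"), delimiter="|"))

print(naive)
print(parsed)
print(piped)
```

The naive split produces three pieces, while the csv module keeps the quoted name together as one field.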

First, we’ll see how to read and write a list of rows, each containing a list of columns:

In [22]:
import csv
villains = [['Doctor', 'No'],['Rosa', 'Klebb'], ['Mister', 'Big'], ['Auric', 'Goldfinger'], ['Ernst', 'Blofeld'] ]
print(villains)
with open('villains.csv', 'w', newline='') as fout:  # a context manager; newline='' lets csv control line endings
    csvout = csv.writer(fout)
    csvout.writerows(villains)
%cat villains.csv

[['Doctor', 'No'], ['Rosa', 'Klebb'], ['Mister', 'Big'], ['Auric', 'Goldfinger'], ['Ernst', 'Blofeld']]
Doctor,No
Rosa,Klebb
Mister,Big
Auric,Goldfinger
Ernst,Blofeld


Now, we’ll try to read it back in:

In [23]:
import csv
villains = []
with open('villains.csv', 'r', newline='') as fin:  # context manager
    cin = csv.reader(fin)
    for row in cin:
        print(row)
        villains.append(row)  # each row is a list of column values
print(villains)

['Doctor', 'No']
['Rosa', 'Klebb']
['Mister', 'Big']
['Auric', 'Goldfinger']
['Ernst', 'Blofeld']
[['Doctor', 'No'], ['Rosa', 'Klebb'], ['Mister', 'Big'], ['Auric', 'Goldfinger'], ['Ernst', 'Blofeld']]


The data can be a list of dictionaries rather than a list of lists. Let’s read the villains file again, this time using ```csv.DictReader``` and specifying the column names:

In [24]:
import csv
villains = []
with open('villains.csv', 'r', newline='') as fin:
    cin = csv.DictReader(fin, fieldnames=['first', 'last'])
    for row in cin:
        print(row)
        villains.append(row)  # each row is a dictionary keyed by column name
print(villains)

{'first': 'Doctor', 'last': 'No'}
{'first': 'Rosa', 'last': 'Klebb'}
{'first': 'Mister', 'last': 'Big'}
{'first': 'Auric', 'last': 'Goldfinger'}
{'first': 'Ernst', 'last': 'Blofeld'}
[{'first': 'Doctor', 'last': 'No'}, {'first': 'Rosa', 'last': 'Klebb'}, {'first': 'Mister', 'last': 'Big'}, {'first': 'Auric', 'last': 'Goldfinger'}, {'first': 'Ernst', 'last': 'Blofeld'}]


In [25]:
villains[4]['last']

'Blofeld'

Let’s rewrite the CSV file, this time using ```csv.DictWriter```. We’ll also call ```writeheader()``` to write an initial line of column names to the file:

In [26]:
import csv
villains = [{'first': 'Doctor', 'last': 'No'}, {'first': 'Rosa', 'last': 'Klebb'},{'first': 'Mister', 'last': 'Big'},{'first': 'Auric', 'last': 'Goldfinger'},{'first': 'Ernst', 'last': 'Blofeld'},]
with open('villains.txt', 'w', newline='') as fout:
    cout = csv.DictWriter(fout, ['first', 'last'])
    cout.writeheader()
    cout.writerows(villains)
%more villains.txt

That creates a villains.txt file with a header line:

```
first,last
Doctor,No
Rosa,Klebb
Mister,Big
Auric,Goldfinger
Ernst,Blofeld
```
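When a file starts with a header line like this, ```csv.DictReader``` can pick up the column names by itself if we leave out the fieldnames argument. A short sketch, using `io.StringIO` with a shortened copy of the data in place of the real file:

```python
import csv
import io

text = "first,last\nDoctor,No\nRosa,Klebb\n"

reader = csv.DictReader(io.StringIO(text))   # no fieldnames: first row is the header
rows = [dict(row) for row in reader]

print(reader.fieldnames)
print(rows)
```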

### What about fetching something from the web?

The Python libraries are pretty messy in places. But here is a very simple example that copies the contents at some web URL to a local file.

In [27]:
import urllib.request

url = "https://ecs.victoria.ac.nz/foswiki/pub/Courses/NWEN241_2015T1/LectureSchedule/rfc2616.txt"
destination_filename = "rfc2616.txt"

urllib.request.urlretrieve(url, destination_filename)
%more rfc2616.txt

The urlretrieve function — just one call — could be used to download any kind of content from the Internet.

We’ll need to get a few things right before this works:

* The resource we’re trying to fetch must exist! Check this using a browser.

* We’ll need permission to write to the destination filename, and the file will be created in the “current directory”, i.e. the directory Python was started from.

* ECS use a proxy server for requests that go beyond our local web (all our requests go via this, it avoids downloading things twice). We've avoided that complication by using a file we placed with ECS. This isn't a problem when connected via ITS wifi.

Here is a slightly different example. Rather than save the web resource to our local disk, we read it directly and print it line by line:

In [28]:
import urllib.request

URL = "https://ecs.victoria.ac.nz/foswiki/pub/Courses/NWEN241_2015T1/LectureSchedule/rfc2616.txt"

# get a socket handle
response = urllib.request.urlopen(URL)

# print out line by line
for line in response:
    print(line.decode('UTF-8'))





<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-us" lang="en-us">

<head>



<title>Login | ECS | Victoria University of Wellington</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<link rel="icon" href="/favicon.ico" type="image/x-icon" />

<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />

<meta name="TEXT_NUM_TOPICS" content="Number of topics:" />

<meta name="TEXT_MODIFY_SEARCH" content="Modify search" />

<link rel="stylesheet" type="text/css" href="https://www.victoria.ac.nz/__data/assets/file/0014/86/reset.css" media="all" />

<link rel="stylesheet" type="text/css" href="https://www.victoria.ac.nz/__data/assets/file/0009/90/subsite-layout.css" media="screen" />

<link rel="stylesheet" type="text/css" href="https://www.victoria.ac.nz/__data/assets/file/0016/88/subsite-base.css" media="all" />

<link 

Opening the remote URL returns a response object, which wraps what we call a socket. This is a handle to our end of the connection between our program and the remote web server. We can call the read and close methods on the response object in much the same way as we can work with a file handle opened for reading.

Note the use of the ```decode``` method. This is required because iterating over the response gives us byte strings. We have to convert each one into a human-readable string, and this is done using decode. It takes an "encoding", basically the mapping between the bytes and the character set. In *many* cases the encoding is UTF-8, which has been used here.
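The difference between byte strings and strings can be seen without a network connection. This sketch encodes a short made-up string to bytes and decodes it back:

```python
text_in = "caf\u00e9\n"            # "café" followed by a newline

raw = text_in.encode("utf-8")      # str -> bytes
print(raw)                         # the accented letter takes two bytes in UTF-8
print(raw[0])                      # indexing bytes gives an integer

text_out = raw.decode("utf-8")     # bytes -> str
print(text_out == text_in)
print(len(raw), len(text_out))
```

The byte string is one element longer than the string because the accented character is stored as two bytes.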

More info about byte strings versus strings is here (optional -- from https://docs.python.org/3.1/library/stdtypes.html):

>>> While string objects are sequences of characters (represented by strings of length 1), bytes and bytearray objects are sequences of integers (between 0 and 255), representing the ASCII value of single bytes. That means that for a bytes or bytearray object b, b[0] will be an integer, while b[0:1] will be a bytes or bytearray object of length 1. The representation of bytes objects uses the literal format (b'...') since it is generally more useful than e.g. bytes([50, 19, 100]). You can always convert a bytes object into a list of integers using list(b).

>>> Also, while in previous Python versions, byte strings and Unicode strings could be exchanged for each other rather freely (barring encoding issues), strings and bytes are now completely separate concepts. There’s no implicit en-/decoding if you pass an object of the wrong type. A string always compares unequal to a bytes or bytearray object.

### Headers and Body 

A response from a web server is made up of two parts, the headers and the body. The headers tell the client useful information such as the type and encoding of the body, while the body is the content of the web page.

You can access the headers using the ```.info()``` method. In the example below, the Content-Type header shows the encoding used for the web page.

In [29]:
import urllib.request

#URL = "https://ecs.victoria.ac.nz/foswiki/pub/Courses/NWEN241_2015T1/LectureSchedule/rfc2616.txt"
URL="http://www.stuff.co.nz"

# display the headers
response = urllib.request.urlopen(URL)
print(response.info())


Server: NZCMS
Content-Length: 212354
Content-Type: text/html;charset=utf-8
X-Come-Hack-With-Us: https://technology.fairfaxmedia.co.nz/work-here/
Vary: Accept-Encoding
X-Varnish: 423740787 429588772
x-url: /content/desktop/stuff.html
x-host: www.stuff.co.nz
X-FFX-B: azcmsppu189a
X-ESI-Enable: 1
Cache-Control: max-age=30
Expires: Mon, 30 May 2016 07:56:56 GMT
Date: Mon, 30 May 2016 07:56:26 GMT
Connection: close
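Individual headers can be looked up by name. For an HTTP URL, the object returned by ```.info()``` belongs to the same family as `email.message.Message`, so the sketch below builds such an object by hand as a stand-in (an assumption made so that no network access is needed) and copies two header values from the output above:

```python
from email.message import Message

# Stand-in for response.info(): a header object filled in by hand.
headers = Message()
headers["Content-Type"] = "text/html;charset=utf-8"
headers["Content-Length"] = "212354"

charset = headers.get_content_charset()   # the charset parameter of Content-Type
length = int(headers["Content-Length"])   # header values are strings

print(charset, length)
```

The charset found this way could be passed to decode instead of hard-coding 'UTF-8'.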


