# Files

## External data
Programs that only dealt with values we **hard coded** into the code, or which could only take text input from the user (e.g. using `input()`), would be of very limited use. To do more, programs need a way to get data in and out. The one we look at here is:

* **Files**, which provide a universal way to store and transfer information between programs.

# Files: Why?


###  To persist data
When you turn off a computer, the data in memory disappears. In the shorter term, when you exit a program, the information stored in memory by that program is freed by the operating system and is lost.

The way to store information so that it can be used again later (to make it **persistent**, i.e. not go away) is to write it into a file. The file can be read at any later date and used as needed.

### To interchange data
Files can be sent to other programs (e.g. you might write a program to produce an image, which you could load in a photo editor). Files can be uploaded to servers and downloaded from them. You can share files on GitHub or Dropbox or via email.

### To organise data
The **filesystem** of a computer is a very simple but very powerful and flexible way of organising information. By placing data into individual files, and then putting these into directories, we have a simple, universal way of organising data.

### When computations won't fit in memory
The computer I'm using has quite a lot of memory (16GB). But I often have to run computations that need much more than this. A collection of images could easily be 100s of GB. If I want to, for example, crop all of them to remove the border, I can't possibly load them all into memory and crop them. The only way to do this sensibly is to keep the data on disk in files, and process it a bit at a time.

### Why not files?
File access is slower than accessing data stored in memory (e.g. values in variables). Much, much, much slower. Sometimes a **million times slower** or more.

And, most importantly, files are not just big collections of variables you can read and write. Files provide a **stream** of data, essentially just a very long string, which you can only work with in order.  We (to some extent) only have **serial** access to files: we can read one thing after another, and that's it.

This means that converting from this string into values you can use requires work. Doing this well is a major aspect of computing; this is why we have databases and standardised file formats `(.jpg, .png, .doc, .wav, .zip)`. Each of these lays out information in a precise way so it can be used again later.

Converting variables into a form where they can be written into a file (or sent over a network) is called **serialization** and converting back to variables is called **deserialization**.
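As a minimal sketch of the idea, the two functions below turn a list of whole numbers into a single string and back again. The names `serialize` and `deserialize` and the comma-separated layout are assumptions made for illustration, not a standard format.

```python
def serialize(numbers):
    """Convert a list of integers into one comma-separated string."""
    return ",".join(str(n) for n in numbers)


def deserialize(text):
    """Convert a comma-separated string back into a list of integers."""
    if text == "":
        return []
    return [int(part) for part in text.split(",")]


scores = [12, 7, 30]
as_text = serialize(scores)       # this string could be written to a file
restored = deserialize(as_text)   # and this rebuilds the original list
print(as_text)
print(restored)
```

Real file formats do the same job, but with much more careful rules about how each value is laid out.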

### Streams
The stream model of file I/O (I/O means **input-output** and file I/O just means reading and writing files) dates back to the days when computers used big tape reels for file storage.

### Using context managers
Files must be **opened** to read from them or write to them. Once a program is done with a file, it must be **closed**. This is so the operating system can keep track of which files are being used (e.g. to assist caching, or to prevent multiple processes writing to the same file at once).

Imagine **open**ing a bottle, pouring out a **stream** of water, then **closing** the lid back on. Forgetting to put the lid back on isn't immediately harmful, but you'd want to remember eventually or everything will evaporate!

In [4]:
f = open("words.txt")
print(f)
print("here")
print(f.readline())
f.close()

<_io.TextIOWrapper name='words.txt' mode='r' encoding='cp1252'>
here
a



This is the same as doing

In [6]:
with open("words.txt") as f:
    print(f) # f will be open here
    print("here")
    print(f.readline()) # we can read from it

# it'll be closed here
print(f)

<_io.TextIOWrapper name='words.txt' mode='r' encoding='cp1252'>
here
a

<_io.TextIOWrapper name='words.txt' mode='r' encoding='cp1252'>

In [None]:
# this will raise a ValueError, because the file is already closed
print(f.readline())

ValueError: I/O operation on closed file.

Let's look at this in more detail:

* The `open()` call actually opens the file. 
* `as f` assigns the result of the `open` call to a new variable `f`. Everything inside the `with` block has access to the file. After the `with` block, the file is automatically closed.
* `readline()` reads one line of text from the file.
* `read()` will read the entire file in one go.

It is the same as doing this:

    f = open("sentences.txt")
    print(f)
    print(f.readline())
    f.close()
    
but no matter what happens inside the block, it will always close the file correctly, and the indentation will show you where the file is open.
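The "no matter what happens" part can be checked directly. In this sketch an error is deliberately caused inside the `with` block, and the file is still closed afterwards. The file name `with_demo.txt` and the use of the temporary folder are assumptions made only so the example can run on its own.

```python
import os
import tempfile

# Set-up only: create a small file to read from (hypothetical name).
path = os.path.join(tempfile.gettempdir(), "with_demo.txt")
with open(path, "w") as f:
    f.write("first line\n")

try:
    with open(path) as f:
        first = f.readline()
        number = int(first)  # fails: "first line\n" is not a number
except ValueError:
    print("Something went wrong inside the with block")

# The with statement closed the file even though an error occurred.
print(f.closed)
```

`f.closed` is `True` once a file has been closed, so it is a handy way to check what state a file object is in.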


### Reading/writing
We're only going to consider reading and writing text files in order in CS1P. You'll see random access (skipping to different parts of files) and binary files in CS1PX.

As far as we're concerned, a file is a big long string, and we can read it starting from the start, or write to the end of it. But not both at the same time (this is possible, but we aren't going to look at it here).

To open a file so we can write to it, we use a second parameter to `open()`. This is called the **mode** and it should be `"w"` to write to a file. By default it is `"r"`, which means to read from a file.

**Note: opening a file for writing will overwrite the contents of that file!**

You can also **append** to an existing file by opening in mode `"a"` (append). This opens for writing, but *does not* overwrite the contents, and will start writing data at the end of the file.

`write()` writes data to a file. Successive calls to `write()` add data at the end of the file.

In [17]:
# open in write mode, resetting the file
with open("write_out.txt", "w") as out_file:
    out_file.write("hello")
    out_file.write("from")

# now open in append mode and append another line
with open("write_out.txt", "a") as out_file:
    out_file.write("CS1P")

In [20]:
# now read the file back and print it out
with open("write_out.txt", "r") as out_file:
    print(out_file.read())




In the example below, the file `"JackJill.txt"` is opened for reading using mode `"r"`. The variable `f` now refers to that file. A line can be read from the file using `f.readline()`. Once we have finished with the file, it should be closed.

In [None]:
f = open("JackJill.txt","r")
line = f.readline()
print(line)
f.close()

Alternatively, in Python the file can be opened using the keyword `with`. The file remains open inside the indented block. This code reads a line from the file just like the code above.

In [None]:
with open("JackJill.txt","r") as f:
    line = f.readline()
    print(line)

The code below causes an error, because we attempt to read a line outside the indented `with` block, after the file has been closed.

In [None]:
with open("JackJill.txt","r") as f:
    line = f.readline()
    print(line)
line = f.readline()

* `f.readline()` reads a single line from a file.
* `f.readlines()` reads all the lines from a file and returns them as a list.

Each line can then be accessed from the list.

In [None]:
with open("JackJill.txt","r") as f:
    lines = f.readlines()
    for line in lines:
        print(line)

We can also loop over the file directly, which gives us each line in turn without first reading them all into a list.

In [None]:
with open("JackJill.txt","r") as f:
    for line in f:
        print(line)

In the following example we are checking to see if `dog` is in `animals.txt`. Open `animals.txt` yourself first to check whether `dog` is there.

In [None]:
animal = "dog"
with open("animals.txt") as f:
    found = False
    for line in f:
        if animal == line:
            found = True

if found:
    print("%s is in the file." %(animal))
else:
    print("%s was not found." %(animal))

Why was it not found?

It is because there is a newline symbol at the end of every line in the file, and `"dog"` is not the same as `"dog\n"`.

Before we can compare, we need to strip any whitespace from each line in the file (using `.strip()`).

In [None]:
animal = "dog"
with open("animals.txt") as f:
    found = False
    for line in f:
        if animal == line.strip():
            found = True

if found:
    print("%s is in the file." %(animal))
else:
    print("%s was not found." %(animal))

Using a `while` loop allows the code to stop searching as soon as the dog is found.

In [None]:
animal = "dog"
with open("animals.txt") as f:
    line = f.readline()
    # readline() returns "" only at the end of the file; a blank line is "\n",
    # so we test the raw line for the end and the stripped line for a match
    while line != "" and line.strip() != animal:
        line = f.readline()

if line.strip() == animal:
    print("%s is in the file." %(animal))
else:
    print("%s was not found." %(animal))
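A `for` loop can stop early too, using `break`. The sketch below wraps the search in a function; the name `contains_line` and the small sample file are made up for this example, and the file is written to the temporary folder only so that the code runs on its own.

```python
import os
import tempfile


def contains_line(path, target):
    """Return True if any line in the file equals target once stripped."""
    found = False
    with open(path) as f:
        for line in f:
            if line.strip() == target:
                found = True
                break  # stop reading as soon as we have a match
    return found


# Set-up only: a tiny animals file, including a blank line.
path = os.path.join(tempfile.gettempdir(), "animals_demo.txt")
with open(path, "w") as f:
    f.write("cat\n\ndog\nemu\n")

print(contains_line(path, "dog"))
print(contains_line(path, "zebra"))
```

Which form to use is mostly a matter of taste; the `for` version avoids having to call `readline()` in two places.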

### Newlines in writing

Note: unlike `print()`, which automatically adds a newline at the end of its output, we have to explicitly tell `write()` to write a newline. If we don't do this, all of the output will appear as a single line!

Including the character sequence `\n` in a string will insert a newline. For example, this works if I print a string with \n inside it:

In [None]:
print("one\ntwo\nthree")

So to write a line of output to a file, you would use something like:

    f.write("some output here\n")
    f.write("this is the next line\n")
    
If you forget to put in the newline, you will end up with 

    some output herethis is the next line
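`print()` can also send its output to a file if it is given the open file as its `file` argument, and in that case it adds the newline for you as usual. The sketch below compares the two approaches; the file name is a made-up one in the temporary folder.

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "print_demo.txt")

with open(path, "w") as f:
    f.write("written with write\n")       # newline added by hand
    print("written with print", file=f)   # print adds the newline itself

with open(path) as f:
    contents = f.read()

print(contents, end="")
```

Both lines end up on separate lines in the file.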

### Iterating by lines
There is a nice **idiom** (idiom: a common pattern of speech or a common way of doing things) in Python to read a file line by line, because we very often want to work with text files in this form.  For example, we might have a list of movies from IMDB, one movie per line:

    ...
    Melody of Clock and Arrow (2006)		2006
    Melody of Cradle (2013)					2013
    Melody of Death (1922)					1922
    Melody of Fate (1911)					1911
    Melody of Funhouse (2013)				2013
    Melody of Life (2004) (V)				2004
    Melody of Love (1928)					1928
    Melody of My Heart (1936)				1936
    Melody of Noise (2016)					2016
    Melody of San Francisco (2009)			2009
    Melody of Subways (2012)				2012
    Melody of the Plains (1937)				1937
    Melody on Parade (1936)					1936
    ...


We can use a `for` loop, and use the file directly as the iterator:

In [10]:
with open("elements.txt") as f:
    for line in f:
        # each line already ends with a newline, so don't add another
        print("-->", line, end="")
        
print("End of file")

This will return one line at a time until the end of the file.
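Each line usually needs to be taken apart before it is useful. As a sketch, assume the movie list above separates the title from the year with tab characters (the listing suggests this, but it is an assumption about the file). The function name `parse_movie` is made up for this example.

```python
def parse_movie(line):
    """Split one line into (title, year), assuming tabs separate the two fields."""
    fields = line.strip().split("\t")
    title = fields[0]
    year = int(fields[-1])   # the year is the last field on the line
    return title, year


# Two lines in the same layout as the movie list
sample_lines = [
    "Melody of Death (1922)\t\t\t\t\t1922\n",
    "Melody of Love (1928)\t\t\t\t\t1928\n",
]

for line in sample_lines:
    title, year = parse_movie(line)
    print(year, title)
```

The same function could be applied to each `line` inside a `for line in f:` loop over the real file.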

### readlines and writelines
We can also read an entire file into a list of lines, using `readlines()`. This gives us a list of strings.

In [None]:
with open("elements.txt") as f:
    print(f.readlines())

`readlines()` is exactly the same as iterating through the lines of the file and storing them in a list as we go.

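To do the opposite, we can write out a list of strings to a file using `writelines()`. Note that `writelines()` does not add newlines itself, so each string must end with `\n` if it is to appear on its own line.

In [None]:
with open("line_test.txt", "w") as f:
    f.writelines(["one\n", "two\n", "three\n"])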
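When the list holds plain strings without newlines, they can be added on the way out and stripped again on the way back in. A small sketch, using a made-up file name in the temporary folder:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "line_test_demo.txt")
words = ["one", "two", "three"]

with open(path, "w") as f:
    # writelines() writes the strings exactly as given, so add "\n" to each
    f.writelines([word + "\n" for word in words])

with open(path) as f:
    lines = f.readlines()

print(lines)                                  # newlines still attached
stripped = [line.strip() for line in lines]
print(stripped)                               # back to the original list
```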

## One line at a time
We can also read one line at a time, using `readline()`. Instead of `for`, we can use `while`, but it is more awkward. However, if we need to conditionally read lines (unusual but possible) this is a more flexible way of reading in data.

In [14]:
with open("words.txt") as f:
    line = f.readline()
    # readline() returns an empty string at the end of the file
    while len(line) > 0:
        # process the line here, e.g. print(line, end='')
        line = f.readline()

### Stripping

One annoying thing that happens when reading in text from files is extraneous **whitespace**. For example, reading a line includes the **newline** at the end. If you have a file `chocolate_products.txt` which contains lines like:

    dark chocolate
    milk chocolate
    white chocolate
    cocoa 
    cocoa butter

you might try and iterate over each line and see if any line matches `cocoa`

In [None]:
# won't work
with open("chocolate_products.txt") as choc_file:
    for line in choc_file:
        print(line)
        if line=="cocoa":
            # why isn't this ever called?!
            print("    ***Found cocoa***")

This is because there is a **newline character** at the end of the line. There must be one, in fact, because otherwise it wouldn't be a separate line at all!

We can see it easily if we print out the character code for the whole line:

In [None]:
# notice the 0A at the end of each line? That's the newline (character 10)
# 20 is a space (character 32)
with open("chocolate_products.txt") as choc_file:
    for line in choc_file:
        for char in line:
            print("%02X" % ord(char), end=' ')
        print()

Notice all the 0A at the end of each line? Those are character 10 (0x0A in hex), the newline character.

We can strip off all whitespace from both ends of a string using `strip()`. This trims off all whitespace characters (spaces, newlines, tabs). If we only want to trim whitespace from the end, we use `rstrip()` (right-strip); to trim only from the start, we use `lstrip()` (left-strip).

In [None]:
# will work
with open("chocolate_products.txt") as choc_file:
    for line in choc_file:
        print(line.strip())
        if line.strip()=="cocoa":
            # this works now
            print("    ***Found cocoa!***")
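The three stripping methods can be compared on a single string. This is a small illustration using a made-up string, with `repr()` used so that any remaining whitespace is visible.

```python
text = "  cocoa \n"

both = text.strip()            # removes whitespace from both ends
left = text.lstrip()           # removes whitespace from the start only
right = text.rstrip()          # removes whitespace from the end only
newline_only = text.rstrip("\n")  # removes only newlines from the end

print(repr(both))
print(repr(left))
print(repr(right))
print(repr(newline_only))
```

Passing `"\n"` to `rstrip()` is useful when spaces at the end of a line are meaningful and only the newline should go.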

## Finding patterns
We can, for example, find every line that contains a string and write it to another file:

In [1]:
# note that we nest the two with statements.
# the order isn't important here, because we never do anything to the files 
# until both files are open

pattern = "saw"
with open("matched.txt", "w") as out_file:
    with open("words.txt") as in_file:
        for line in in_file:
            # check if we match the pattern
            if pattern in line:
                out_file.write(line)

# find out what we wrote to the matched.txt file
with open("matched.txt") as in_file:
    print(in_file.read())

foresaw
fretsaw
fretsaws
hacksaw
hacksaws
handsaw
jigsaw
jigsaws
oversaw
ripsaw
saw
sawdust
sawed
sawfish
sawfly
sawing
sawmill
sawmills
sawn
saws
sawtooth
sawyer
seesaw
seesaws
unsawed
warsaw
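Python also allows both files to be opened in a single `with` statement, separated by a comma, which saves one level of indentation. The sketch below repeats the pattern search on a tiny made-up word list; the file names are hypothetical and live in the temporary folder so the example runs on its own.

```python
import os
import tempfile

folder = tempfile.gettempdir()
in_path = os.path.join(folder, "pattern_words_demo.txt")
out_path = os.path.join(folder, "pattern_matched_demo.txt")

# Set-up only: a tiny word list to search.
with open(in_path, "w") as f:
    f.write("jigsaw\nhammer\nsawdust\n")

pattern = "saw"
# both files are open inside this block, and both are closed after it
with open(in_path) as in_file, open(out_path, "w") as out_file:
    for line in in_file:
        if pattern in line:
            out_file.write(line)

with open(out_path) as f:
    matched = f.read()

print(matched, end="")
```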



Write a program to write out which lines the word `"Jack"` appears in `"JackJill.txt"`.

There are two modes for writing to a file:

* `"w"` opens a file and writes from the beginning of it. If the file doesn't exist, it is created. If the file exists, its previous contents are overwritten.
* `"a"` appends to an existing file, adding to the end of the file. If the file doesn't exist, it is created.

Just as each line read from a file has a newline character at its end, a newline character must be written at the end of every line when writing.

In [None]:
# helper functions for reading, writing and appending to a file
def readFile(inFile):
    with open(inFile, "r") as inF:
        for line in inF:
            print(line)
    
def writeNoNewline(outFile, message):
    with open(outFile, "w") as outF:
        outF.write(message)
    
def writeNewline(outFile, message):
    with open(outFile, "w") as outF:
        outF.write(message)
        outF.write("\n")
        
def appendFile(outFile, message):
    with open(outFile, "a") as outF:
        outF.write(message)
        outF.write("\n")

In [None]:
writeNoNewline("testing.txt","Write no newline. ")
readFile("testing.txt")

In [None]:
appendFile("testing.txt","Append to same line. ")
readFile("testing.txt")

In [5]:
writeNewline("testing.txt","Overwrites previous file but newline added. ")
readFile("testing.txt")

In [None]:
appendFile("testing.txt","So append to a new line.")
readFile("testing.txt")

Write a program to add the following list to `"animals.txt"`, one animal per line.

In [None]:
moreAnimals = ["kitten","lamb","monkey","nit","owl"]


Write a function that takes a `file` and a `number` and writes the times tables of that number to the file.

Call the function to write the times tables for all the numbers from 1 to 10.

Write a program that writes the first verse of `Jack and Jill` to a separate file.

Run the code below. What happens? Can you fix it so that it does not crash when trying to read from a file that doesn't exist?

Hint: use the `try`...`except` from yesterday.

In [None]:
with open("noFile.txt", "r") as in_file:
    for line in in_file:
        print(line)

In the code below, information regarding all the birthdays is read in from a file, converted into a dictionary and then written to a file.

`.split()` splits a string into pieces and stores the result in a list.

In [6]:
# split() splits the string at each occurrence of the parameter.
# if no parameter is provided, it splits on any run of whitespace.

date = "27/6/18"
splitDate = date.split("/")
print(splitDate)

['27', '6', '18']
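The difference between calling `split()` with no argument and with an explicit `" "` shows up as soon as a line has repeated spaces, tabs or a trailing newline. The sample line below is made up for illustration.

```python
line = "Anna   14\tMarch\n"

no_arg = line.split()       # splits on any run of whitespace
by_space = line.split(" ")  # splits at every single space character

print(no_arg)
print(by_space)
```

With no argument the result is the three clean words; splitting on `" "` leaves empty strings between the repeated spaces and keeps the tab and newline inside the last piece.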


In [None]:
# function to open any file and return the whole contents of the file.

def readFile(inFile):
    try:
        with open(inFile, "r") as inF:
            fileData = inF.readlines()
            return fileData
    except FileNotFoundError:
        print("File does not exist")
        return []  # no lines, so the calling code can carry on
        
# This function takes the data from the file and builds a dictionary similar 
# to the one used in the dictionary exercises.
    
def extractData(data):
    bdays = {}
    for line in data:
        print("Line as it is read in - " + line)
        
        cleanLine = line.strip()
        words = cleanLine.split()
        print("Split line - ", end = "")
        print(words)
        print()
              
        bdays[words[0]] = {"day":words[1],"month":words[2]}
              
    return bdays

# This function extracts all the data from the dictionary and writes it as a string to a file.

def dataFile(outFile,data):
    with open(outFile, "w") as outF:
        for name in data:
            #output = ""
            outF.write("%s has a birthday on %s %s." %(name,data[name]["month"],data[name]["day"]))
            outF.write("\n")

In [None]:
d = readFile("birthdays.txt")
b = extractData(d)
print(b)
b["Fred"] = {"day":"3","month":"June"}
dataFile("new.txt",b)

Write a program that reads in the information from the file `"exams.txt"`, stores it in a data structure and then writes the total mark and average mark of each student to a file (include the name with the results).

### Spellcheck
* There is a dictionary of (lowercase) words in the file `words.txt`.
* There is one word per line.
* Write a program that asks for a word from the user and, if the word is in the dictionary, prints "Spelled OK"; otherwise it prints "Misspelled".
* Note: to compare the strings correctly, you must **strip** the newline off each line:

        line = line.strip()

  will return just the printable text of the line.
* Your spell checker should not care about the case of the word the user enters.


### A.4 Matching lines in a file
* Write a program which checks each line of text in a file to see if it begins with a given string.
* You can use `long_string.startswith("pattern")` to see if `pattern` occurs at the start of `long_string`.
* If the line begins with the string, print it out.

Apply your program to the file `romeo_juliet.txt`, looking for the pattern `"  Rom."` (note there are two spaces at the start!) and writing each matching line to `Romeos_lines.txt`. This will only copy the *first* of Romeo's lines in any passage.

Repeat this process to copy Juliet's (first) lines to `Juliets_lines.txt`.

*Note that we did a very similar example in the lecture notes!*

**Optional extension (slightly hard):**
Make the code copy *all* of Romeo's (or Juliet's) lines, not just the first line in any block of dialogue. You will need to look at `romeo_juliet.txt` to see how you could find this from the structure of the file.