# Working with Files

The real power of a programming language such as Python is its ability to process large amounts of data quickly. In some instances, you want to run code to calculate a single number and can simply print this out, but more likely, you want to process a large amount of data and compute a large number of individual data points. In this case, it often makes sense to save the output of your code to a file. In particular, since any data stored in computer memory is deleted once your program stops, files provide a means of persistent storage (i.e. storage that persists across time, including when a computer is switched off and on again), which you will be able to access later, send to others, etc.

The ability to work with files is very important when dealing with larger datasets, in terms of reading data in from files, and writing data back out to files. As such, file manipulation is often called "file input/output" or **file IO**.

A **file** is a linear sequence of data that is stored on persistent storage such as a hard drive. To process the data in a file, you have to perform the following steps:

1. Open the file;
2. Read data from a file or write data to a file;
3. Close the file.

The following analogy might help: to process a document in a drawer, you have to first open the drawer, then process the document (i.e. read it), and finally close the drawer. We will discuss each of the file operations in the following slides.
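
As a rough sketch of these three steps in code (each step is covered in detail later), the following program first creates a small temporary file so that it is self-contained. The file name `demo.txt` and the use of the `tempfile` module are assumptions made purely for illustration.

```python
import os
import tempfile

# Set up a small file to work with (illustrative only).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo.txt")
setup = open(path, "w")
setup.write("hello, file\n")
setup.close()

fp = open(path)    # 1. open the file
text = fp.read()   # 2. read data from the file
fp.close()         # 3. close the file

print(text, end="")
```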


# Opening Files

It is necessary to open a file before performing other operations like reading, writing or appending to a file. Python provides a built-in function, `open()`, which takes the file name as an argument (in the form of a string) and returns a file object that you can manipulate to read the actual content of the file:


In [None]:
file_object = open("quotes.txt")


`open()` takes an optional second argument in the form of a string, which can be used to stipulate the "mode" in which to open the file. By default (without the second argument) the mode is **read only**, meaning you can read the content of the file but not modify it in any way. There is a close analogy with immutable types such as strings, in that you can access the content of the file but not change the original (but can of course create a new file based on the content of the original, as we will come to in a bit). In practice, you can stipulate read mode with an `"r"`, as in:


In [None]:
file_object = open("quotes.txt", "r")


which is identical in functionality to the first call.

The other two commonly used modes are **write** (`"w"` as a mode string) and **append** (`"a"` as a mode string). In write mode, if a file of that name already exists, its original contents are deleted as soon as the file is opened, and if it doesn't exist, it is created first. In append mode, if the file already exists, we leave the original content intact and write extra content to the end of the file, and if it doesn't exist, it is created first (in which case append functions identically to write). Both the write and append modes can be combined with the read mode.
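
To make the difference between the modes concrete, here is a minimal sketch. In Python, a mode is combined with reading by adding `+` to the mode string (e.g. `"a+"`). The file name `modes_demo.txt` and the temporary folder are assumptions for illustration only.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file (or empties it if it already exists).
fp = open(path, "w")
fp.write("first line\n")
fp.close()

# "a" keeps the existing content and adds to the end.
fp = open(path, "a")
fp.write("second line\n")
fp.close()

# "a+" appends and also allows reading. We move back to the start
# of the file with seek(0) before reading.
fp = open(path, "a+")
fp.write("third line\n")
fp.seek(0)
combined = fp.read()
fp.close()

print(combined, end="")
```

All three lines should be present at the end, since only the first `open()` call used write mode.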

We cover each of these modes in more detail in the following slides, with examples.

# Closing Files

Having opened a file, it is good to get into the habit of closing it when you have finished with it, with the `.close()` method:


In [None]:
file_object = open("quotes.txt")
file_object.close()


When you close a file, two things happen: (1) all data that has been written/appended to the file is "flushed" through to the file, and it is closed on the computer's file system; and (2) the file object associated with the file can no longer be used to manipulate the file. This second thing can be a [good](http://www.independent.co.uk/news/boaty-mcboatface-could-be-the-name-of-200m-research-vessel-after-public-vote-a6942551.html) way of safeguarding against inadvertently modifying a file after you have finished with it.

When a file is opened, some system resources such as memory are allocated to allow for file processing. It is important to free these system resources on completion of the file processing.

The `.close()` method in Python closes the file and writes the actual data to the disk. It prevents further access to the content of the file until it is opened again.
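
The sketch below illustrates both effects of closing a file: the written data ends up in the file, and the file object refuses further operations. It also shows Python's `with` statement, which closes the file automatically at the end of the indented block and is a common alternative to calling `.close()` by hand. The file name `close_demo.txt` is a hypothetical one chosen for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")

fp = open(path, "w")
fp.write("some text\n")
fp.close()

# The file object records that it has been closed.
print(fp.closed)

# Using a closed file object raises a ValueError.
raised = False
try:
    fp.write("more text\n")
except ValueError:
    raised = True
    print("Cannot write to a closed file")

# The with statement closes the file for us when the block ends.
with open(path) as fp2:
    content = fp2.read()

print(fp2.closed)
print(content, end="")
```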

# Reading Files

To read the contents of a file once you have opened it, Python file objects provide a range of methods. We discuss the `.read()` method below. You can find general information on other methods in the [Python IO object documentation](https://docs.python.org/3/library/io.html#io.IOBase) and information specific to reading plain text files in the [Text IO documentation](https://docs.python.org/3/library/io.html#io.TextIOBase).

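Say we have a text file on our computer named `"quotes.txt"`, containing the following text on a single line: `'Twas brillig, and the slithy toves, Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.`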



The `.read()` method reads the entire content of the file object associated with that file and returns it as a `str`, as we see in the following program:


In [2]:
fp = open("quotes.txt")
content = fp.read()
fp.close()
print(content)

'Twas brillig, and the slithy toves, Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.


# Reading Files a Line at a Time

Another useful and commonly used method for reading files is `.readlines()`, which returns a list of the lines in the file that we can iterate over.

Let's modify the content of `"quotes.txt"` slightly, as follows, so that it is split over 4 lines, and save it as `"jabberwocky1.txt"`:


```
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
```


We can now iterate over the lines one at a time as follows (note that `.readlines()` first reads the whole file into memory as a list of lines):


In [3]:
fp = open("jabberwocky1.txt")

lineno = 1
for line in fp.readlines():
    print(f"{lineno}: {line}", end="")
    lineno += 1
    
fp.close()

1: 'Twas brillig, and the slithy toves
2: Did gyre and gimble in the wabe:
3: All mimsy were the borogoves,
4: And the mome raths outgrabe.


Note the use of the `end` keyword in each call to `print()`, to suppress the insertion of a newline character, as each line in the file is, by definition, terminated by a newline.

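You can also iterate over the lines in the file by iterating directly over the file handle. This reads one line at a time rather than loading the whole file at once, which can be much more memory-efficient for large files. The following example uses a second file, `"jabberwocky2.txt"`, with this content: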


```
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
```

In [4]:
fp = open("jabberwocky2.txt")

lineno = 1
for line in fp:  # note no readlines()
    print(f"{lineno}: {line}", end="")
    lineno += 1
    
fp.close()

1: "Beware the Jabberwock, my son!
2: The jaws that bite, the claws that catch!
3: Beware the Jubjub bird, and shun
4: The frumious Bandersnatch!"
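
The manual line counter used in the examples above can also be written with the built-in `enumerate()` function. The following is a minimal sketch of that alternative. It creates its own two-line temporary file (the name `poem_demo.txt` is an assumption for illustration) and strips the trailing newline from each line rather than using `end=""`.

```python
import os
import tempfile

# Create a small file to read back (illustrative only).
path = os.path.join(tempfile.mkdtemp(), "poem_demo.txt")
fp = open(path, "w")
fp.write("Beware the Jubjub bird, and shun\nThe frumious Bandersnatch!\n")
fp.close()

numbered = []
fp = open(path)
# enumerate() pairs each line with a counter, starting from 1 here.
for lineno, line in enumerate(fp, start=1):
    text = line.rstrip("\n")  # drop the newline at the end of the line
    numbered.append(f"{lineno}: {text}")
fp.close()

for entry in numbered:
    print(entry)
```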

# Appending to Files

A file must be opened in either write or append mode if you want to alter its contents. In append mode (denoted by `'a'`), the new data is added to the end of the existing contents of the file, whereas in write mode (denoted by `'w'`), the new data overwrites the old data resulting in the loss of the original contents of the file. You should be careful when choosing the mode to edit a file. In the following discussion, we use the term "editing" to refer to both writing and appending data to a file.

Having opened the file, you can then use the `.write(text)` method over a file handle to write the string `text` to that file, or `.writelines(str_list)` to write the list of strings `str_list` to the file.
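
As a small sketch of `.writelines()`, note that it writes the strings exactly as given and does not add newline characters between them, so each string in the list below ends with its own `\n`. The file name `writelines_demo.txt` is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "writelines_demo.txt")

# Each string carries its own newline; writelines() adds none.
str_list = ["first line\n", "second line\n", "third line\n"]

fp = open(path, "w")
fp.writelines(str_list)
fp.close()

fp = open(path)
written = fp.read()
fp.close()

print(written, end="")
```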

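Let's put this into practice. Say there is a file named `quotes2.txt` containing the single line `If A is success in life, then A equals x plus y plus z. Work is x; y is play; and z is keeping your mouth shut.`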



Based on what we told you above, think about what the following code will do, and then try running it (noting what happens to the original file on running the code):


In [5]:
fp = open("quotes2.txt", "a")
fp.write("\n\n-Albert Einstein")
fp.close()

fp = open("quotes2.txt", "r")
print(fp.read())
fp.close()

If A is success in life, then A equals x plus y plus z. Work is x; y is play; and z is keeping your mouth shut.

-Albert Einstein



> ## End-of-line Characters
> `\n` is a special character called a **newline** character that, when printed, generates a line break. In this case, the string `\n\n-Albert Einstein` translates into *print two line breaks (ending the current line and leaving one blank line), then print `-Albert Einstein`*.


# Writing Files

In the previous slide, the file was opened in append mode, which means that the string was added to the end of the file. If the file were opened in write mode, its original content would be overwritten and, after the execution of the `.write()` method, the resulting file would only contain the string `\n\n-Albert Einstein`.

Say a file named `"quotes3.txt"` has the following content:


```If A is success in life, then A equals x plus y plus z. Work is x; y is play; and z is keeping your mouth shut.```


Try to predict what will happen when you run the following code, then run it to test your hypothesis. Again, pay careful attention to what happens to the original file as part of this:


In [6]:
fp = open("quotes3.txt", "w")
fp.write("\n\n-Albert Einstein")
fp.close()

fp = open("quotes3.txt", "r")
print(fp.read())
fp.close()



-Albert Einstein


# File Creation

If you try to open a file that doesn't exist for reading, you get an error (a `FileNotFoundError`), because you cannot read a non-existent file.


In [None]:
# Try running the last example here
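
If you would rather handle this situation than let the program stop, one possible approach is to catch the error with `try`/`except`. This is only a sketch: the file name `no_such_file.txt` is hypothetical, and it is placed in a fresh temporary folder so that it is guaranteed not to exist.

```python
import os
import tempfile

# A path inside a brand-new empty folder, so the file cannot exist.
missing_path = os.path.join(tempfile.mkdtemp(), "no_such_file.txt")

try:
    fp = open(missing_path, "r")
except FileNotFoundError:
    found = False
    print("The file does not exist, so it cannot be read.")
else:
    # Only runs if open() succeeded.
    found = True
    fp.close()
```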


How do you create a new file then? Python does not provide a special function for file creation. Instead, it uses the `open()` function in write or append mode:


In [None]:
fp = open("a_text_file.txt", "w")
fp.close()
fp = open("a_text_file.txt", "r")
fp.close()
print("See, no error!")


Note that you do not get any feedback on whether or not the file you opened for editing is a new file. You can add data to the new file as discussed above or leave the file empty by closing the file immediately after its creation.
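
If you do need to know whether the file is new, one option is to check for it yourself before opening it, for example with `os.path.exists()` from the standard library. The sketch below assumes a file named `a_text_file.txt` inside a temporary folder, chosen only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "a_text_file.txt")

# Check before opening: open() itself will not tell us.
existed_before = os.path.exists(path)

fp = open(path, "a")  # creates the file if it is missing
fp.close()

exists_after = os.path.exists(path)
size = os.path.getsize(path)  # size in bytes; 0 for an empty file

print(existed_before, exists_after, size)
```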


# Summary
For opening files:
- `open(filename, 'r')`: Read-only
- `open(filename, 'w')`: Write (overwrites the file if it exists)
- `open(filename, 'a')`: Append (append at the end of the file like `list.append()`)

Special Cases:
- `open(filename, 'wb')`: Write binary outputs (covered later with `XML`)

Reading file values:
- `f.read()`: Read the whole file as a **single string**
- `f.readlines()`: Returns a `list` of `str`, where each `str` represents a single line in the file.

Although `pandas` will cover most data formats in an easy-to-code way, you will need to use `open()` from time to time for other formats such as `XML` and `JSON`!