# Working with Files

The real power of a programming language such as Python is its ability to process large amounts of data quickly. In some instances, you want to run code to calculate a single number and can simply print this out, but more likely, you want to process a large amount of data and compute a large number of individual data points. In this case, it often makes sense to save the output of your code to a file. In particular, since any data stored in computer memory is deleted once your program stops, files provide a means of persistent storage (i.e. storage that persists across time, including when a computer is switched on and off), which you will be able to access later, send to others, etc.

The ability to work with files is very important when dealing with larger datasets, in terms of reading data in from files, and writing data back out to files. As such, file manipulation is often called "file input/output" or **file IO**.

A **file** is a linear sequence of data that is stored on persistent storage such as a hard drive. To process the data in a file, you have to perform the following steps:

1. Open the file;
2. Read data from a file or write data to a file;
3. Close the file.

The following analogy might help: to process a document in a drawer, you have to first open the drawer, then process the document (i.e. read it), and finally close the drawer. We will discuss each of the file operations in the following slides.


# Opening Files

It is necessary to open a file before performing other operations like reading, writing or appending to a file. Python provides a built-in function `open()` to open a file. It takes the file name as an argument (in the form of a string) and returns a file object, which you can manipulate to read the actual content of the file:

```norun;
file_object = open("quotes.txt")
```

`open()` takes an optional second argument in the form of a string, which can be used to stipulate the "mode" in which to open the file. By default (without the second argument) the mode is **read only**, meaning you can read the content of the file but not modify it in any way. There is a close analogy with immutable types such as strings, in that you can access the content of the file but not change the original (but can of course create a new file based on the content of the original, as we will come to in a bit). In practice, you can stipulate read mode with an `"r"`, as in:

```norun;
file_object = open("quotes.txt", "r")
```

which is identical in functionality to the first call.

The other two commonly-used modes are **write** (= `"w"` as a mode string) and **append** (= `"a"` as a mode string). With write mode, if a file of that name already exists, its original contents are deleted as soon as it is opened, and if it doesn't exist, it is created first. In append mode, if the file already exists, we leave the original content intact and write extra content to the end of the file, and if it doesn't exist, it is created first (so append functions identically to write). Both the write and append modes can be combined with the read mode.
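As a rough illustration of how the write and append modes differ, the following sketch writes to the same file three times and then reads it back. The file name `modes_demo.txt` and the temporary folder are assumptions made purely for this example, so that no real file is touched.

```python
import os
import shutil
import tempfile

# Work in a throwaway folder; "modes_demo.txt" is a hypothetical file name.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")

fp = open(path, "w")   # the file does not exist yet, so it is created
fp.write("first")
fp.close()

fp = open(path, "w")   # write mode again: the old content is discarded
fp.write("second")
fp.close()

fp = open(path, "a")   # append mode: the old content is kept
fp.write(" third")
fp.close()

fp = open(path)        # no mode given, so the file is opened read only
content = fp.read()
fp.close()
print(content)         # second third

shutil.rmtree(folder)  # tidy up the throwaway folder
```

Only the text from the second write survives the second `"w"` call, while the `"a"` call adds to it.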

We cover each of these modes in more detail in the following slides, with examples.

# Closing Files

Having opened a file, it is good to get into the habit of closing it when you have finished with it, with the `.close()` method:

```
file_object = open("quotes.txt")
file_object.close()
```

When you close a file, two things happen: (1) all data that has been written/appended to the file is "flushed" through to the file, and it is closed on the computer's file system; and (2) the file object associated with the file can no longer be used to manipulate the file. This second thing can be a [good](http://www.independent.co.uk/news/boaty-mcboatface-could-be-the-name-of-200m-research-vessel-after-public-vote-a6942551.html) way of safeguarding against inadvertently modifying a file after you have finished with it.

When a file is opened, some system resources such as memory are allocated to allow for file processing. It is important to free these system resources on completion of the file processing.

The `.close()` method in Python closes the file and writes the actual data to the disk. It prevents further access to the content of the file until it is opened again.
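The effect of closing can be seen in a small sketch like the one below. It assumes a throwaway file named `closing_demo.txt` (a made-up name) and uses the `.closed` attribute of the file object, along with a `try`/`except` block to catch the error raised when we try to write after closing.

```python
import os
import shutil
import tempfile

# Throwaway folder and hypothetical file name, used only for this sketch.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "closing_demo.txt")

fp = open(path, "w")
fp.write("some text")
open_before = fp.closed    # False: the file is still open
fp.close()
closed_after = fp.closed   # True: the file object can no longer be used

try:
    fp.write("more text")  # attempting to use the closed file object
    error_message = None
except ValueError as err:
    error_message = str(err)

print(open_before, closed_after)
print(error_message)

shutil.rmtree(folder)      # tidy up the throwaway folder
```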

> ## Function vs. Method
> The question of whether something should be a function or a method has come up a couple of times already, but there is an interesting asymmetry here, with `open()` being a function and `.close()` a method. Why do you think this is? Would it make sense for `open()` to be a method or `.close()` to be a function?
>
> One reason why `.close()` is useful as a method is because it works on a specific file object. This way, there's no need to add any arguments or worry about any errors specifying which file you need to close. Maybe it would be nice if `open()` were a method too, but there's only one problem: we need to open the file before we can use the file object!

# Reading Files

To read the contents of a file once you have opened it, Python file objects provide a range of methods. We discuss the `.read()` method below. You can find general information on other methods in the [Python IO object documentation](https://docs.python.org/3/library/io.html#io.IOBase) and information specific to reading plain text files in the [Text IO documentation](https://docs.python.org/3/library/io.html#io.TextIOBase).

Say we have a text file on our computer named `"quotes.txt"`, as follows.

```path:quotes.txt;
'Twas brillig, and the slithy toves, Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.
```

The `.read()` method reads the entire content of the file object associated with that file and returns it as a `str`, as we see in the following program:

```eg:last;
fp = open("quotes.txt")
content = fp.read()
fp.close()
print(content)
```


# Reading Files a Line at a Time

Another useful and commonly-used method for reading files is `.readlines()`, which returns a list of the lines in the file, allowing us to iterate over them.

Let's modify the text slightly and save it in a file named `"jabberwocky1.txt"`, split over 4 lines:

```path:jabberwocky1.txt;
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
```

We can now iterate over the lines one at a time as follows:

```eg:last;
fp = open("jabberwocky1.txt")
lineno = 1
for line in fp.readlines():
    print(f"{lineno}: {line}", end="")
    lineno += 1
fp.close()
```

Note the use of the `end` keyword in each call to `print()`, to suppress the insertion of a newline character, as each line read from the file already ends with a newline (with the possible exception of the last line).
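To see the trailing newlines for yourself, here is a minimal sketch that first creates a small two-line file (writing is covered later in this worksheet) and then inspects what `.readlines()` returns. The file name `two_lines.txt` is hypothetical, and `.rstrip("\n")` is just one way of removing the newline if you don't want it.

```python
import os
import shutil
import tempfile

# Throwaway folder and hypothetical file name, used only for this sketch.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "two_lines.txt")

fp = open(path, "w")
fp.write("first line\nsecond line\n")
fp.close()

fp = open(path)
lines = fp.readlines()
fp.close()
print(lines)      # ['first line\n', 'second line\n']

# Remove the trailing newline from each line
stripped = []
for line in lines:
    stripped.append(line.rstrip("\n"))
print(stripped)   # ['first line', 'second line']

shutil.rmtree(folder)  # tidy up the throwaway folder
```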

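You can also iterate over the lines in the file by iterating directly over the file handle. This reads one line at a time rather than loading the whole file first, which can be much more memory-efficient for large files: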

```path:jabberwocky2.txt;
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
```

```eg:last;
fp = open("jabberwocky2.txt")
lineno = 1
for line in fp:  # note no readlines()
    print(f"{lineno}: {line}", end="")
    lineno += 1
fp.close()
```


# Problem: Forgetful Karaoke

**Life hack:** if you're really bad at karaoke and can't remember the words, you can just repeatedly sing one word.
If it's the most common word in the song, you'll be right more often than you might think (and may get away with it!).

Write a function `approximate_song(filename)` that reads the lyrics of the song in the file of name `filename`, and returns the most common word in the song.
In the event of a tie, your function should return the word that comes first alphabetically. Assume that words are whitespace-delimited, and use `.split()` with no stripping of punctuation or folding of case to extract the words from the text.

We have provided lyrics for three songs for you to test your function: `somebody.txt`, `barbrastreisand.txt`, and `fakesong.txt`.
Note these are not the only files we will use to test your code. You can see the contents of these files by clicking on the tabs at the top-right of the page.
Outputs should be as below:

```norun;readonly;
>>> approximate_song('somebody.txt')
'that'
>>> approximate_song('fakesong.txt')
'dum'
>>> approximate_song('barbrastreisand.txt')
'whooo-oo'
```

> ## Dictionaries
> This is very similar to the Top-5 Frequent words problem in Worksheet 11. Feel free to reuse your solution!

# Appending to Files

A file must be opened in either write or append mode if you want to alter its contents. In append mode (denoted by `'a'`), the new data is added to the end of the existing contents of the file, whereas in write mode (denoted by `'w'`), the new data overwrites the old data resulting in the loss of the original contents of the file. You should be careful when choosing the mode to edit a file. In the following discussion, we use the term "editing" to refer to both writing and appending data to a file.

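Having opened the file, you can then use the `.write(text)` method on a file handle to write the string `text` to that file, or `.writelines(str_list)` to write the list of strings `str_list` to the file. Neither method adds newline characters for you, so include `\n` in the strings wherever you want a line break.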
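The difference between the two methods can be sketched as follows, again using a throwaway file with the made-up name `write_demo.txt`. Note in particular that `.writelines()` writes the strings exactly as given, without inserting line breaks between them.

```python
import os
import shutil
import tempfile

# Throwaway folder and hypothetical file name, used only for this sketch.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "write_demo.txt")

fp = open(path, "w")
fp.write("one\n")                    # a single string
fp.writelines(["two\n", "three\n"])  # a list of strings, each with its own newline
fp.writelines(["four", "five"])      # no newlines supplied, so none are written
fp.close()

fp = open(path)
content = fp.read()
fp.close()
print(content)

shutil.rmtree(folder)  # tidy up the throwaway folder
```

The last two strings end up joined together as `fourfive` on the final line of the file.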

Let's put this into practice. Say there is a file named `quotes.txt` with the following content:

```path:quotes.txt;
If A is success in life, then A equals x plus y plus z. Work is x; y is play; and z is keeping your mouth shut.
```

Based on what we told you above, think about what the following code will do, and then try running it (noting what happens to the original file on running the code):

```eg:last;
fp = open("quotes.txt", "a")
fp.write("\n\n-Albert Einstein")
fp.close()
fp = open("quotes.txt", "r")
print(fp.read())
fp.close()
```

> ## End-of-line Characters
> `\n` is a special character called a **newline** character that, when printed, generates a line break. In this case, the string `\n\n-Albert Einstein` translates into *insert two line breaks, then `-Albert Einstein`*.


# Writing Files

In the previous slide, the file was opened in append mode, which means that the string was added to the end of the file. If the file were opened in write mode, its original content would be erased as soon as it was opened, and after the execution of the `.write()` method the resulting file would only contain the string `\n\n-Albert Einstein`.

Say a file named `"quotes.txt"` has the following content:

```path:quotes.txt;
If A is success in life, then A equals x plus y plus z. Work is x; y is play; and z is keeping your mouth shut.
```

Try to predict what will happen when you run the following code, then run it to test your hypothesis. Again, pay careful attention to what happens to the original file as part of this:

```eg:last;
fp = open("quotes.txt", "w")
fp.write("\n\n-Albert Einstein")
fp.close()
fp = open("quotes.txt", "r")
print(fp.read())
fp.close()
```


# File Creation

If you try to open a file that doesn't exist for reading, you get an error. The reason is that you cannot read a non-existent file.

```
fp = open("a_text_file.txt", "r")
```

```eg:last;terminal;
Traceback (most recent call last):
  File "program.py", line 1, in <module>
    fp = open("a_text_file.txt", "r")
FileNotFoundError: [Errno 2] No such file or directory: 'a_text_file.txt'
```

How do you create a new file then? Python does not provide a special function for file creation. Instead, it uses the `open()` function in write or append mode:

```
fp = open("a_text_file.txt", "w")
fp.close()
fp = open("a_text_file.txt", "r")
fp.close()
print("See, no error!")
```

Note that you do not get any feedback on whether or not the file you opened for editing is a new file. You can add data to the new file as discussed above or leave the file empty by closing the file immediately after its creation.
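If you do need to know whether the file is new, one possible approach (an assumption on our part, not something `open()` does for you) is to check with `os.path.exists()` before opening. The sketch below uses a throwaway folder so that the file is guaranteed not to exist beforehand, and opens it in append mode so that any existing content would be preserved.

```python
import os
import shutil
import tempfile

# Throwaway folder, used only so the file is guaranteed not to exist yet.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "a_text_file.txt")

existed_before = os.path.exists(path)  # check before opening the file
fp = open(path, "a")                   # creates the file if it is missing
fp.close()
existed_after = os.path.exists(path)

print(existed_before)  # False
print(existed_after)   # True

shutil.rmtree(folder)  # tidy up the throwaway folder
```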


# Problem: Concatenate Files

Write a function `concatenate_files(filename1, filename2, new_filename)` that concatenates the text from two source files such that the text from the file named by argument `filename2` follows the text from `filename1`.
The concatenated text is written to a new file with the name given by `new_filename`.
Your function must not return anything.

We have provided sample input files named `part1.txt` and `part2.txt` containing a portion of the text from the novel *Alice in Wonderland* to test your function.

> ## Closing Files
> Do not forget to close your files!

# The CSV Format

The Comma-Separated Values (**CSV**) data format is a well-known format for representing tabular data. It is one of the most common formats for importing and exporting data between spreadsheets and databases. Its name emanates from the fact that each field or value in the data file is separated by a comma (and there is a closely related format called TSV where values are separated by tab characters, although CSV is often used to refer to this file type too). Here is an example CSV file containing patient records:

```path:patients.csv;norun;
Patient ID,Name,Date admitted,Contact number
126780,Alex McLeod,20/02/08,93761837
126781,Cynthia Roberts,20/02/08,98557624
```

Each line in the data is called a **row** or a **record**. The values in each row (called **field values**) are separated by commas. The first row is generally a header row, which lays out the number and order of the field values in the data, along with a textual description of each. For example, the above data has four fields in the order `Patient ID`, `Name`, `Date admitted` and `Contact number`. Therefore, the last record corresponds to the patient with ID `126781`, name `Cynthia Roberts`, and so on.

# The <code data-lang="py3">csv</code> Library

The easiest and most robust way to process CSV files is with the `csv` library.

The simplest way to iterate through the rows in a CSV file is with the `csv.reader()` function. It takes an open file containing CSV data as its argument, and returns a special object called a CSV reader. The `reader` object is a bit like `.readlines()` for iterating through files in that each row corresponds to a line in the CSV file, but it additionally automatically splits lines into a list of strings. As part of this, it has the ability to determine whether a comma is a delimiter or part of the text within a field, as in the example:

```path:patients.csv;norun;
Name,Patient ID
"Smith, Kim",123456
```

In this file, the comma in `Smith, Kim` isn't considered to be a delimiter as it is contained within a quoted string. The ability to identify this gives `csv.reader()` a big advantage over manual processing using standard file and string processing methods such as `.readlines()` and `.split()`.
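A quick comparison makes the difference concrete. The sketch below processes the same line once with `.split(",")` and once with `csv.reader()`. It relies on the fact that `csv.reader()` accepts any iterable of strings, so a list holding a single line can stand in for an open file here.

```python
import csv

line = '"Smith, Kim",123456'

# Manual processing: the comma inside the quotes is wrongly treated as a delimiter
naive = line.split(",")
print(naive)    # ['"Smith', ' Kim"', '123456']

# csv.reader: the quoted field is kept together and the quotes are removed
rows = list(csv.reader([line]))
parsed = rows[0]
print(parsed)   # ['Smith, Kim', '123456']
```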

The following is an example CSV file named `"vic_visitors.csv"` that stores the number of visitors to regions in Victoria from 2004 to 2007 (this data was obtained from [Tourism Victoria](https://www.tourism.vic.gov.au/images/stories/international-visitation-regions-of-victoria-ye-Dec-2007.pdf)):

```path:vic_visitors.csv;
Victoria's Regions,2004,2005,2006,2007
Gippsland,63354,47083,51517,54872
Goldfields,42625,36358,30358,36486
Grampians,64092,41773,29102,38058
Great Ocean Road,185456,153925,150268,167458
Melbourne,1236417,1263118,1357800,1377291
```

Let's first print the contents of the CSV file:

```eg:last;
import csv
visitors = open("vic_visitors.csv")
for line in csv.reader(visitors):
    print(line)
visitors.close()
```

Note that each `line` is a list of strings holding the field values of a single row of the CSV data.


# Column Headings

The first row of a CSV file is often a header. In our example in the previous slide, the CSV data described the number of visitors to different regions of Victoria for the years 2004 to 2007. The first line, however, was different to the other lines. It is called the **header row** and provides a textual label for each column of the file. According to the header row, the first column lists the regions of Victoria, and the subsequent columns indicate the number of visitors in each year. If you wanted to retrieve the header, you could use the `next()` function, which returns the next row from the reader and advances it, so that subsequent iteration continues from the following row of data.

Initially, no row has been processed, so the first call of `next()` will simply return the header as in the following example:

```path:vic_visitors.csv;
Victoria's Regions,2004,2005,2006,2007
Gippsland,63354,47083,51517,54872
Goldfields,42625,36358,30358,36486
Grampians,64092,41773,29102,38058
Great Ocean Road,185456,153925,150268,167458
Melbourne,1236417,1263118,1357800,1377291
```

```eg:last;
import csv
visitors = open("vic_visitors.csv")
data = csv.reader(visitors)
print(next(data))
visitors.close()
```
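To confirm that iteration carries on from the row after the header, here is a small sketch that calls `next()` once and then loops over what remains. For brevity it feeds `csv.reader()` a list of strings (a cut-down copy of the visitor data) rather than an open file.

```python
import csv

# A cut-down copy of the visitor data, held in a list instead of a file
lines = [
    "Victoria's Regions,2004,2005",
    "Gippsland,63354,47083",
    "Goldfields,42625,36358",
]

data = csv.reader(lines)
header = next(data)     # consumes the first row only

remaining = []
for row in data:        # the loop starts from the second row
    remaining.append(row)

print(header)           # ["Victoria's Regions", '2004', '2005']
print(remaining)        # [['Gippsland', '63354', '47083'], ['Goldfields', '42625', '36358']]
```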


# A Complete Example

We can also process the data in a CSV file, i.e., perform some computation over the individual entries. Assume, for example, that you would like to know the average number of visitors per annum for each region. In this case, we have to add the number of visitors in a row for each year, and divide the resulting number by the total number of years. Note that we have to exclude the first row (the header), and the first column of every subsequent row, as it contains the name of the region. As you have seen a number of times already, you can exclude the first element of a list by slicing the row from item 1, as in the following example:

```path:vic_visitors.csv;
Victoria's Regions,2004,2005,2006,2007
Gippsland,63354,47083,51517,54872
Goldfields,42625,36358,30358,36486
Grampians,64092,41773,29102,38058
Great Ocean Road,185456,153925,150268,167458
Melbourne,1236417,1263118,1357800,1377291
```

```eg:last;
import csv
visitors = open("vic_visitors.csv")
data = csv.reader(visitors)
header = next(data)
for row in data:
    total = 0
    cols = 0
    for entry in row[1:]:
        total = total + float(entry)
        cols += 1
    print(f"{row[0]}: {total/cols:.0f}")
visitors.close()
```


# CSV Files as 2D Data

You can also interpret CSV data as two-dimensional data. Two-dimensional data has a row and a column index (in that order). If you would like to print out the number of visitors to the Grampians in the year 2005, you would have to access the row at index 3 and the column at index 2 (remember that we start counting from 0, so the header is row 0 and the region names are column 0). To convert the variable `data` into two-dimensional form, you have to convert the CSV `reader` into a `list`, as in the following example:

```path:vic_visitors.csv;
Victoria's Regions,2004,2005,2006,2007
Gippsland,63354,47083,51517,54872
Goldfields,42625,36358,30358,36486
Grampians,64092,41773,29102,38058
Great Ocean Road,185456,153925,150268,167458
Melbourne,1236417,1263118,1357800,1377291
```

```eg:last;
import csv
visitors = open("vic_visitors.csv")
data = csv.reader(visitors)
data_2d = list(data)
print(data_2d[3][2])
visitors.close()
```
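Rather than working out the column index by hand, one option is to look it up in the header row with the list method `.index()`. The following sketch shows this on a cut-down copy of the data, held in a list of strings for brevity. Remember that the values are strings, so they need converting before doing any arithmetic.

```python
import csv

# A cut-down copy of the visitor data, held in a list instead of a file
lines = [
    "Victoria's Regions,2004,2005,2006,2007",
    "Gippsland,63354,47083,51517,54872",
    "Goldfields,42625,36358,30358,36486",
    "Grampians,64092,41773,29102,38058",
]

data_2d = list(csv.reader(lines))
print(len(data_2d))       # 4 rows, including the header
print(data_2d[3][0])      # Grampians
print(data_2d[3][2])      # 41773

# Find the column for 2005 from the header row instead of hard-coding it
col = data_2d[0].index("2005")
visitors_2005 = int(data_2d[3][col])
print(col, visitors_2005)  # 2 41773
```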


# Naming the Elements of a Row

One issue with `csv.reader()` is that you have to hard-code which column is which by its column index, or save the header row as a list and double-check which column is which via the labels in the list. For CSV files with header rows, a more convenient and direct way of accessing the individual elements within a row is via `csv.DictReader()`, whereby each row is returned as a dictionary rather than a list, and a value can be referenced directly by its column label (assuming each column is uniquely labelled; you might like to think why this is).

To see why this can be useful, consider the following example involving the monthly rainfall measurements for the different state and territory capitals in a given year:

```path:city_rainfall.csv;
city,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
Melbourne,32.8,24.6,42.2,23,38.8,46.8,62.6,15,18.6,17.4,66.2,71.2
Brisbane,184.4,216.8,20.6,3.2,39,112.6,1,91.2,32.4,55.2,61.8,79.2
Darwin,515.2,670,689.8,18,18.8,4.2,0,0.4,43.6,13.2,143.2,247.8
Perth,0,47.6,6.2,76.4,61.4,60,179.4,130.8,100.6,42.4,5.2,18
Adelaide,9,3,24.6,82.4,45,64.8,65.8,25.8,24,25.2,29.6,37.6
Canberra,43.8,64.6,36.6,27.8,41.6,92.8,18.8,11.6,14.6,22.6,94,101
Hobart,2.8,64.8,24.2,26.2,31.2,29.6,36.8,59,62.8,44,7,83.4
Sydney,57,258.4,65.4,179.6,9.8,510.6,67.2,152.2,41.2,27,170,123.2
```

If we wanted to calculate the average monthly rainfall over the winter months in each case, we could do this as follows:

```eg:last;
import csv
cities = open("city_rainfall.csv")
for row in csv.DictReader(cities):
    total = 0
    for month in ("Jun", "Jul", "Aug"):
        total = total + float(row[month])
    print(f"{row['city']}: {total/3:.0f}")
cities.close()
```
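If you are curious what `csv.DictReader()` gives you, the sketch below runs it over a cut-down copy of the rainfall data (held in a list of strings for brevity) and collects the winter totals into a dictionary. The `fieldnames` attribute of the reader holds the column labels taken from the header row.

```python
import csv

# A cut-down copy of the rainfall data, held in a list instead of a file
lines = [
    "city,Jun,Jul,Aug",
    "Melbourne,46.8,62.6,15",
    "Perth,60,179.4,130.8",
]

reader = csv.DictReader(lines)
winter_totals = {}
for row in reader:
    # Each row is a dictionary keyed by column label; the values are strings
    total = float(row["Jun"]) + float(row["Jul"]) + float(row["Aug"])
    winter_totals[row["city"]] = total

print(reader.fieldnames)   # ['city', 'Jun', 'Jul', 'Aug']
for city in winter_totals:
    print(f"{city}: {winter_totals[city]:.1f}")
```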


# Writing CSV Files

If you need to store your CSV data in a file, you can use the `csv.writer()` function. It takes a file object opened in write mode (`'w'`) and returns a special object called a CSV writer. The CSV writer has methods for converting your lists of values to string lines. One such method is the `.writerows()` method. This method assumes that your data is represented as a nested list. The following example writes a small 2D data structure to the CSV file `"2d-data.csv"`. It then prints out the contents of the file so that you can see how the 2D data is represented in a CSV file.

```eg:eg-g1-writing-csv-files-0;
import csv
data_2d = [[1, 2, 3], [4, 5, 6]]
csv_file = open("2d-data.csv", "w", newline="")  # newline="" lets the csv module control line endings
writer = csv.writer(csv_file)
writer.writerows(data_2d)
csv_file.close()

csv_file = open("2d-data.csv", "r")
print(csv_file.read())
csv_file.close()

```

The CSV writer also has a `.writerow()` method that takes a single list as its argument for writing a single row.
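A minimal sketch of `.writerow()` in use is given below. The file name `rainfall_demo.csv` and the second data row are made up for illustration. The file is opened with `newline=""`, as the `csv` module documentation recommends, so that the writer's own line endings are not altered.

```python
import csv
import os
import shutil
import tempfile

# Throwaway folder and hypothetical file name, used only for this sketch.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "rainfall_demo.csv")

csv_file = open(path, "w", newline="")
writer = csv.writer(csv_file)
writer.writerow(["city", "Jan", "Feb"])              # header row
writer.writerow(["Melbourne", 32.8, 24.6])           # numbers are converted to text
writer.writerow(["Alice Springs, NT", 36.4, 35.1])   # made-up row containing a comma
csv_file.close()

csv_file = open(path)
content = csv_file.read()
csv_file.close()
print(content)

shutil.rmtree(folder)  # tidy up the throwaway folder
```

The field containing a comma is written inside double quotes, which is what allows `csv.reader()` to read it back as a single value.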


# Problem: Sorting CSV Records

Write a function `sort_records(csv_filename, new_filename)` that sorts the records of a CSV file and writes the results as a new CSV file.
The first column of the CSV file will be the city name.
The rest of the columns will be months of the year.
The first row of the CSV file will take the form of the column headings.
Here is an example file fragment:

```path:max_temp.csv;
city/month,Jan,Feb,Mar,Apr
Melbourne,41.2,35.5,37.4,29.3
Brisbane,31.3,40.2,37.9,29
Darwin,34,34,33.2,34.5
```

Note that your code will be tested over different CSV files, with different ranges of months in them.
Irrespective of the exact months contained in the file, you may assume that the city name will always be in the first column, and the months in the remaining columns.

You must sort the data in alphabetical order according to the city name (stored in the first column).
Your program should write the sorted records to a new file with the name given by the argument `new_filename`.

Here is an example of how `sort_records()` should work. `program.py` is the program and below is its terminal output.

```eg:last;norun;
sort_records('max_temp.csv', 'sorted.csv')
result = open('sorted.csv')
print(result.read())
result.close()
```

```eg:last;terminal;
city/month,Jan,Feb,Mar,Apr
Brisbane,31.3,40.2,37.9,29
Darwin,34,34,33.2,34.5
Melbourne,41.2,35.5,37.4,29.3

```

Note that the row for Melbourne has been sorted below the rows for Brisbane and Darwin because `Melbourne` comes later than `Brisbane` and `Darwin`, based on alphabetical ordering.

> ## Test File
> So you can test your answer, we have provided a full year of data for many Australian cities in a file called `max_temp.csv`. The data was obtained from the [Bureau of Meteorology website](http://www.bom.gov.au/climate/averages/).

# Problem: Hottest Month

Write a function `max_city_temp(csv_filename, city)` which analyses temperatures recorded in a CSV file, and returns the maximum temperature recorded for the named city.

The first column of the CSV file will be the city name.
The rest of the columns will be months of the year.
The first row of the CSV file will provide the column headings.
Here is an example file fragment (the actual file has all of the months of the year):

```norun;path:max_temp.csv;
city/month,Jan,Feb,Mar,Apr
Melbourne,41.2,35.5,37.4,29.3
Brisbane,31.3,40.2,37.9,29
Darwin,34,34,33.2,34.5
```

Here is an example of how `max_city_temp()` should work:

```norun;readonly
>>> max_city_temp('max_temp.csv', 'Brisbane')
40.2
```

> ## Test File
> So you can test your code, we have provided a full year of data for many Australian cities in a file called `max_temp.csv`.
> The data was obtained from the [Bureau of Meteorology website](http://www.bom.gov.au/climate/averages/).

# Problem: Hottest City

Write a function `hottest_city(csv_filename)` that analyses the temperatures recorded in a CSV file, and returns a 2-tuple made up of the maximum temperature in the whole dataset along with a sorted list of the names of cities where that temperature was recorded.

The first column of the CSV file will contain the city name.
The rest of the columns will be months of the year.
The first row of the CSV file will provide column headings.
Here is an example file (with an incomplete set of months):

```norun;path:max_temp_tiny.csv;
city/month,Jan,Feb,Mar,Apr
Melbourne,41.2,35.5,37.4,29.3
Brisbane,31.3,40.2,37.9,29
Darwin,34,34,33.2,34.5
```

Here is an example of how `hottest_city()` should work:

```norun;readonly;
>>> hottest_city('max_temp_tiny.csv')
(41.2, ['Melbourne'])
```

> ## Test File
> So you can test your answer, we have provided a full year of data for many Australian cities in a file called `max_temp.csv`.
> The data was obtained from the [Bureau of Meteorology website](http://www.bom.gov.au/climate/averages/).