# Week 7 - Reading and Writing Files

This week we're going to look at how to get your data in and out of Python. If you're analysing data, moving files around, or saving your output for later - Python has some clever tools to allow this to happen.

It may sound obvious, but all the files on your computer are just bytes, and readable by Python in one way or another. The file format should have a specification that defines how the file is organised; sometimes this can give away some interesting properties. For example, if you change the extension on a Word `.docx` file to `.zip`, you'll find that it's actually just a zip archive of (mostly) `.xml` files!

Later on in your programming journey, you might have to study and/or write these filetype specifications (I have!). In this session, we'll look at some of the easiest to work with - text, csv and json files. The file types used in programming applications are usually pretty simple and human readable anyway!

## Vanilla Python Reading and Writing of Data

Without any external packages, we have a few options for opening files in Python. The following syntax used to be the way to do it:

SAM NOTE: In the session, time for a `!wget` intro! Otherwise how are the students going to get the files 🤔

In [1]:
my_file = open("./hello.txt")

data = my_file.read()
print(data)

my_file.close()

Hello world!

This is a test document for the Practical Python course.


Simple, right? It's just a case of using the `open` function to open the file, calling the `.read()` method to get its contents into a Python string, printing it, then closing the file.

We have one problem however - it's very easy to forget to close the file once we're done with it! This can cause multiple problems - the most obvious being running out of resources (imagine accidentally opening 1000 files at once instead of one at a time!). Therefore, it's best practice to use the following syntax:

In [2]:
with open("./hello.txt") as my_file:
    data = my_file.read()
    print(data)

Hello world!

This is a test document for the Practical Python course.


The two code snippets above do the same thing - the second just auto-closes the file for us when the `with` block ends, even if an error occurs part-way through!
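
To make the comparison concrete, here is a minimal sketch of roughly what the `with` version does for us, written out by hand with `try`/`finally`. The temporary folder and the variable names are just for illustration, so the example doesn't depend on `hello.txt` being present.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "hello.txt")
with open(path, "w") as my_file:
    my_file.write("Hello world!")

# Manual version: try/finally guarantees close() runs even on an error.
manual_file = open(path)
try:
    manual_data = manual_file.read()
finally:
    manual_file.close()

# with version: the file is closed when the indented block ends.
with open(path) as with_file:
    with_data = with_file.read()

print(manual_data == with_data)              # True
print(manual_file.closed, with_file.closed)  # True True
```

The `with` version is shorter and harder to get wrong, which is why it's the one to reach for.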

### A Note About Absolute vs Relative File Paths

In all of programming, we have two kinds of filepaths - absolute and relative. 

*Absolute* filepaths give the full location of the file - for example, `C:/Users/stmball/Documents/Code/Continuing Education/Fundamentals/Week 7 - Reading and Writing Files/hello.txt`. As you can see, even in a well organised system this can be a bit unwieldy.

*Relative* filepaths give the location of the target file relative to the current working directory, which for a notebook is normally the folder the notebook is in. For example, as this file is in the same folder as `hello.txt`, we can either reference it just by the filename or by `./hello.txt`. Two rules for relative filepaths are that `./` refers to the current folder and `../` refers to the parent folder - so I can sneakily navigate my filesystem by chaining `../` and folder names together!
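
If you're ever unsure where a relative path actually points, the `os.path` helpers can tell you. This is a small sketch rather than part of the original session; it only inspects path strings, so `hello.txt` doesn't need to exist for it to run.

```python
import os

# Relative paths are resolved against the current working directory.
cwd = os.getcwd()

relative_path = "./hello.txt"
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True

# "../" steps up to the parent of the current working directory.
parent = os.path.abspath("../")
print(parent)
```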

### A Note on File Permissions

On your operating system, it's extremely likely you have different file permissions for different users and user groups - for example some system files will only be readable and writable by administrators (or the `root` user), whereas other files might be read-only for all users. As such, we need to tell Python what we are doing with the file we are opening. By default, Python opens the file in read mode - meaning we can view the file but not write to it. It's also good practice to only give Python the minimum permissions it needs for working with files to avoid overwriting data!

For example, let's see what happens if we try and write data to a file normally:

In [1]:
with open("./hello.txt") as my_file:
    my_file.write("Hello world!")

UnsupportedOperation: not writable

We get an error!

We actually have a number of modes for changing the way we read a file - the `open` function accepts an additional argument that tells Python what we want to do with the file. The `"rb"` argument will mean the file gets read as raw bytes:

In [2]:
with open("./hello.txt", "rb") as my_file:
    data = my_file.read()
    print(data)

b'Hello world!\n\nThis is a test document for the Practical Python course.'


You'll see this includes the `\n` escape sequence, which stands for a newline; when the bytes are decoded into a string and printed, it shows up as a line break. Reading raw bytes is particularly helpful when building custom parsers.
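
As a quick sketch of the step from bytes to text, the snippet below decodes a bytes object by hand. The UTF-8 encoding is an assumption here (it's a very common choice, but a file could use something else).

```python
raw = b"Hello world!\n\nThis is a test document."

# Decoding turns the bytes into a str; UTF-8 is assumed here.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)  # bytes str
print(raw.count(b"\n"))                         # 2
print(text.splitlines())
# ['Hello world!', '', 'This is a test document.']
```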

More applicably, when writing to files we have two options - `"w"` and `"a"`.

`"w"` will create or overwrite the file completely - when we open the file using the `with` context, if the file exists it is emptied even if we don't make any `write` calls in the code. We can call the `write` method multiple times in the code and it will write multiple things to the file, but if we closed the file and opened it again (perhaps as part of a loop), we'd keep emptying the contents of the file!

In [9]:
with open("./new_file.txt", "w") as my_file:
    my_file.write("Hello world!\n")
    my_file.write("Hello world again!\n")

The `"a"` argument stands for "append" and will make the `write` method just add text to the end of the file. This means your changes are non-destructive, but you can end up with much larger files if you are not careful.

In [10]:
with open("./new_file.txt", "a") as my_file:
    my_file.write("Hello world from an append method!")
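
To see the two modes side by side, here's a minimal sketch that writes to the same file three times and then reads it back. It uses a temporary folder (an assumption for the sake of a self-contained example) rather than `new_file.txt`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "new_file.txt")

with open(path, "w") as my_file:
    my_file.write("first\n")

# Opening with "w" again empties the file before anything is written.
with open(path, "w") as my_file:
    my_file.write("second\n")

# Opening with "a" keeps what is there and adds to the end.
with open(path, "a") as my_file:
    my_file.write("third\n")

with open(path) as my_file:
    contents = my_file.read()

print(repr(contents))  # 'second\nthird\n'
```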

Which mode you use will depend on what you are trying to do, and there are usually ways of rewriting your code to use the other one. For example, if we are training a deep learning model and want to keep a log of how well this model is doing, we have two options: we can either keep the file open in append mode and write to it periodically, or we can save the metrics in a Python list and write them out at the end. There are pros and cons to both - the append approach keeps a file handle open for the whole run, but if training is interrupted halfway through, we still have access to the metrics written up to that point.
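
Here's a rough sketch of both logging approaches. The loss values are made-up stand-ins (there's no real model here), and the file names are hypothetical.

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "metrics.log")
fake_losses = [0.9, 0.5, 0.3]  # stand-in values, not real training output

# Option 1: append as we go, so earlier lines survive an interruption.
with open(log_path, "a") as log_file:
    for step, loss in enumerate(fake_losses):
        log_file.write(f"{step},{loss}\n")
        log_file.flush()  # push the line out of Python's buffer

# Option 2: collect in a list and write once at the end.
collected = [f"{step},{loss}\n" for step, loss in enumerate(fake_losses)]
final_path = os.path.join(os.path.dirname(log_path), "metrics_final.log")
with open(final_path, "w") as final_file:
    final_file.writelines(collected)
```

Both files end up with the same contents; the difference is only in what you'd have left if the loop crashed halfway.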

## Three Helper Libraries - `csv`, `json`, `pickle`

We now know how to read and write from raw text files - this is helpful when working with raw text but very often our files will be in some kind of format that needs *parsing* by our code before being usable. *Parsing* refers to the process of taking this raw data and converting it into something useful by the rest of our code - sometimes this is as simple as reading the file (as we've seen with text), and other times it's a significant amount of work processing raw bytes and interpreting a header to decode the data.

A common file format used all over the place is the `csv` file format. `csv` stands for Comma Separated Values, with each line being a row in a table with values separated by commas.

We can actually build our own parser with raw Python with a neat bit of code, using some of the tools we've seen before:

In [16]:
with open("./test_csv.csv") as my_file:
    data = my_file.read().split("\n")
    data = list(map(lambda x: x.split(","), data))
    print(data)

[['Name', 'Colour', 'Age'], ['Sam', 'Blue', '26'], ['Tom', 'Purple', '34'], ['Alex', 'Pink', '40'], ['Sarah', 'Red', '19'], ['Jenny', 'Green', '65']]


This is some pretty compact but dense code that reads the file, splits it into a list with each line being an item in that list, then splits those elements by entry using the comma as a separator.

The problem with this method (and the need to import a library for such a seemingly simple task) goes back to the history of the csv file format; the format was around before file type standards were established, and therefore there are some subtle differences between files (a different separator, say, or quoted values that themselves contain commas) that can cause some problems. Perhaps more importantly, it would simply be nicer to have some code that was a bit clearer as to what it was doing - it's not obvious that those two lines are parsing a csv and that can be a problem for readability later.

By using the `csv` library, the above code becomes:

In [19]:
import csv

with open("./test_csv.csv") as my_file:
    reader = csv.reader(my_file)
    print([line for line in reader])

[['Name', 'Colour', 'Age'], ['Sam', 'Blue', '26'], ['Tom', 'Purple', '34'], ['Alex', 'Pink', '40'], ['Sarah', 'Red', '19'], ['Jenny', 'Green', '65']]


It's subtle, but there are a number of benefits to using the library:

* The code is more readable - we know exactly what this is doing (reading a csv).
* Under the hood, the reader is an *iterator* that hands over one row at a time, which is kinder on memory than reading and splitting the whole file in one go.
* For different csv specifications, it's a lot easier to fix our code using this method by applying different arguments to the `reader` function.

These are all marginal benefits, but are still important!
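
To make the last two points a bit more concrete, here's a small sketch using made-up data held in `io.StringIO` objects (which behave like open text files). It shows a quoted value tripping up the hand-rolled `split` approach, and the `delimiter` argument handling a semicolon-separated file.

```python
import csv
import io

# A quoted field that contains a comma (made-up data for illustration).
quoted = 'Name,Colour\n"Ball, Sam",Blue\n'

naive = [line.split(",") for line in quoted.strip().split("\n")]
parsed = list(csv.reader(io.StringIO(quoted)))

print(naive[1])   # ['"Ball', ' Sam"', 'Blue']
print(parsed[1])  # ['Ball, Sam', 'Blue']

# A file that uses semicolons instead of commas.
semicolons = "Name;Age\nSam;26\nTom;34\n"

# The reader is an iterator, so rows arrive one at a time in the loop.
ages = {}
for row in csv.reader(io.StringIO(semicolons), delimiter=";"):
    if row[0] != "Name":  # skip the header row
        ages[row[0]] = int(row[1])

print(ages)  # {'Sam': 26, 'Tom': 34}
```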

Another example of a file specific helper library is the `json` library. The `json` file type stands for JavaScript Object Notation and has now formed a bit of a standard for sending labelled data across the web. As such, you'll find it all the time when working with data originating from the web; and its popularity there has somewhat spread to programming in general when you need a general hierarchical data format.

Json files look a lot like how we write Python dictionaries - but converting from one to the other by hand is non-trivial, so in this case using a library is certainly the way to go.

In [22]:
import json

with open("./test_json.json") as my_file:
    data = json.load(my_file)
    print(data)
    print(type(data))

{'Sam': {'Age': 26, 'Colour': 'Blue', 'Languages': ['Python', 'Javascript', 'Rust']}, 'Tom': {'Age': 26, 'Colour': 'Green', 'Languages': ['French', 'German', 'Polish']}}
<class 'dict'>


The great thing about the json format is it gives us a way to easily save dictionaries in a sort-of native format that we can load straight back into a dictionary, while keeping it pretty standard and human-readable.

For example, if we have a dictionary we want to save, we can use `json.dump` to simply save the dictionary to a json file:

In [23]:
test_data = {"numbers": [1,2,3,4,5], "letters": "abcdefg"}

with open("./json_dump.json", "w") as my_file:
    json.dump(test_data, my_file)
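
The `json` library also has string versions of these functions, `json.dumps` and `json.loads`, which are handy for checking a round trip without touching the disk. A minimal sketch:

```python
import json

test_data = {"numbers": [1, 2, 3, 4, 5], "letters": "abcdefg"}

# dumps/loads work on strings instead of files.
as_text = json.dumps(test_data)
print(as_text)  # {"numbers": [1, 2, 3, 4, 5], "letters": "abcdefg"}

restored = json.loads(as_text)
print(restored == test_data)  # True

# JSON has no tuple type, so tuples come back as lists.
print(json.loads(json.dumps({"point": (1, 2)})))  # {'point': [1, 2]}
```

That last line is worth remembering: the round trip is only exact for types that json knows about.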

Finally for our helper libraries, we have perhaps the best named library in the Python standard library - `pickle`, a library dedicated to *serializing* objects and saving them to disk.

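*Serialization* means converting an object in memory into a stream of bytes that can be saved and later turned back into the object. `pickle` is more general than a format like csv or json, as it can handle most Python objects (including instances of our own classes), at the cost of the output not being human readable. For example, look at the following code: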

In [26]:
class Car:

    def __init__(self, colour):

        self.colour = colour

    def beep_horn(self):
        
        print("Beep Beep!")
        return 0

my_car = Car("pink")
my_car.beep_horn()

Beep Beep!


0

Fantastic, we have this very practical class we can use in our projects - but what happens if we have 1000s of Cars that we want to save to file? Well, we can use the `pickle` library:

In [28]:
import pickle

# Saving as a txt just because - we'll see it's def not text!
# Also have to use wb as we are writing bytes, not a string
with open("./test_car.txt", "wb") as my_file:
    pickle.dump(my_car, my_file)

You'll see if you open this file it looks like a load of rubbish - but we can recover the object using `pickle.load`:

In [31]:
with open("./test_car.txt", "rb") as my_file:
    new_car = pickle.load(my_file)
    print(new_car.colour)

pink


Excellent!
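
As a small sketch of what "more general" means in practice, the snippet below tries to save the same dictionary with both `json` and `pickle`. The dictionary is made up for illustration; it contains a set and a tuple, which json has no way of representing. It uses `pickle.dumps` and `pickle.loads`, the in-memory cousins of `dump` and `load`.

```python
import json
import pickle

record = {"colours": {"pink", "blue"}, "position": (3, 4)}

# json cannot represent a set, so this raises TypeError.
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

# pickle turns the object into bytes and back again.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(json_ok)              # False
print(type(blob).__name__)  # bytes
print(restored == record)   # True
```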

### A **VERY** Important Note About `pickle`

Cybersecurity experts will always tell you not to download files from strangers, and this is particularly true for `pickle`ed objects. The "unpickling" process calls some code within the class as defined, allowing an attacker to execute any code they want on your system if you unpickle their object. Take this example:

In [33]:
class InnocentLookingCar:

    def __init__(self, colour):
        self.colour = colour

    def beep_horn(self):
        print("Beep Beep!")

    def __setstate__(self, val):
        self.__dict__ = val
        print("Hacking your system!!!")

my_car = InnocentLookingCar("pink")
my_car.beep_horn()

with open("./bad_car.txt", "wb") as my_file:
    pickle.dump(my_car, my_file)
        
with open("./bad_car.txt", "rb") as my_file:
    new_car = pickle.load(my_file)
    print(new_car.colour)

Beep Beep!
Hacking your system!!!
pink


Moral of the story is - don't unpickle things you see lying around!

## Introducing the `os` library

When working with files, it's very helpful to have some functionality to explore our system, maybe to iterate through the files in a folder, create folders for files to go into, delete files, or find the size of files.

Before we look at some of the helpful functions, we start with another security note!

### Another **VERY** Important Note About `os`

There's a pretty common function `os.system` that can be tempting to use as it can solve quite a few problems quickly. `os.system` runs its input as a terminal command - for example:

In [34]:
import os

# ls lists the files in the current directory
os.system("ls")

Files.ipynb
bad_car.txt
hello.txt
json_dump.json
new_file.txt
test_car.txt
test_csv.csv
test_json.json


0

This function has its uses but can be extremely dangerous. Be **VERY** careful if you use this function (I'd recommend steering clear of it altogether!) as you can get into a situation like this (I have seen this code numerous times!):

In [37]:
name = input("Please enter your name")

# The "echo" terminal command will print out text
os.system("echo Hello " + name)

Hello Sam
Here's another echo!


0

Looks fine right? We are just taking in a name and using the terminal to print it rather than Python. However, what happens if we enter our name as `Sam && echo "Here's another echo!"`? Well we get the print as before, but we also execute the second command. If that second command was `rm -rf /` we would wipe the server! This *code injection* is a huge security problem and should be avoided at all costs!
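
If you really do need to run another program, one commonly recommended alternative (not covered in this session, so treat this as a pointer rather than the course's method) is `subprocess.run` from the standard library with the command given as a list. Each list item is handed over as a single argument and no shell gets to interpret the `&&`. The sketch below runs a tiny Python one-liner as the "other program" so that it behaves the same on any operating system.

```python
import subprocess
import sys

name = 'Sam && echo "Here is another echo!"'  # hostile input

# Each list item is passed as one argument; no shell parses the string.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('Hello', sys.argv[1])", name],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
# Hello Sam && echo "Here is another echo!"
```

The whole "name" is printed back as plain text, and the second command never runs.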

That being said, almost everything we would want to do with `os.system` has a safer option through the `os` library. For example to list the files in a folder, instead of using `os.system("ls")`, we can use the following: 

In [38]:
print(os.listdir("./"))

['test_csv.csv', 'new_file.txt', 'test_json.json', 'json_dump.json', 'Files.ipynb', 'bad_car.txt', 'test_car.txt', 'hello.txt']


From here you can iterate over the files with a for loop, parsing them differently based on their extensions, and then put them into your analysis pipeline.
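
A minimal sketch of that loop might look like the following. The folder and file names are hypothetical (they're created in a temporary folder so the example runs anywhere), and the "parsing" is reduced to just deciding which parser you would use.

```python
import os
import tempfile

# Set up a temporary folder with a few empty files in it.
folder = tempfile.mkdtemp()
for filename in ["a.csv", "b.json", "notes.txt"]:
    with open(os.path.join(folder, filename), "w") as my_file:
        my_file.write("")

plan = {}
for filename in sorted(os.listdir(folder)):
    # splitext splits "a.csv" into ("a", ".csv")
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".csv":
        plan[filename] = "csv"
    elif extension == ".json":
        plan[filename] = "json"
    else:
        plan[filename] = "other"

print(plan)  # {'a.csv': 'csv', 'b.json': 'json', 'notes.txt': 'other'}
```

Note that `os.listdir` gives back just the file names, so `os.path.join` is needed to rebuild a path you can pass to `open`.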

Another common task to do with the `os` library is creating some folders for new files to go in - we can create new files with just Python but to give our project some organisation we need to use `os.makedirs`:

In [39]:
# This command will just make a folder in the current directory with the name test_folder
os.makedirs("test_folder")

The `os` library can do a lot more, but finally for now let's look at how to delete files:

In [44]:
# os.remove gets rid of a single file
os.remove("./bad_car.txt")

# os.removedirs gets rid of an empty directory (folder), then any parent folders left empty
os.removedirs("./test_folder/")
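
Putting these pieces together, here's a short sketch that creates a folder, checks a file's size, and then tidies up after itself. It works inside a temporary folder so nothing of yours gets deleted. Two things here go slightly beyond the cells above: the `exist_ok=True` argument, which stops `os.makedirs` complaining if the folder already exists, and `os.rmdir`, which removes exactly one empty folder.

```python
import os
import tempfile

base = tempfile.mkdtemp()
folder = os.path.join(base, "test_folder")

# exist_ok=True means the second call is harmless rather than an error.
os.makedirs(folder, exist_ok=True)
os.makedirs(folder, exist_ok=True)

path = os.path.join(folder, "hello.txt")
with open(path, "w") as my_file:
    my_file.write("Hello world!")

size = os.path.getsize(path)
print(size)  # 12 (bytes)

os.remove(path)   # delete the file
os.rmdir(folder)  # delete the now-empty folder
print(os.path.exists(folder))  # False
```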

For now, that's enough for the `os` library!



## A note on other libraries

Finally - just a note on other libraries in Python. It's very common for a library to be centred around an object that defines what the library does - think about the `datetime` object for `datetime` or the plot object in matplotlib. If this object requires you to load some data into it from a file, very often that library will have some built in methods for loading the file. The easiest example here is `pandas` - a library that brings R's excellent dataframes into Python to manage labelled data in an intelligent fashion. `pandas` has a plethora of tools for importing data in a concise way - for example for csv data we have:

In [46]:
import pandas as pd

data = pd.read_csv("./test_csv.csv")
print(data)

    Name  Colour  Age
0    Sam    Blue   26
1    Tom  Purple   34
2   Alex    Pink   40
3  Sarah     Red   19
4  Jenny   Green   65


It even formats it in a nice way!

Another example is the Python Imaging Library, `PIL` (these days installed as the `Pillow` package):

In [49]:
from PIL import Image

im = Image.open("./sam_ball.JPG")
print(im.format, im.size, im.mode)
im.show()

JPEG (1920, 1080) RGB


Both of these libraries are loaded with tools for working with their respective type - `PIL` is full of cool tools for working with images in Python and `pandas` is fundamental for data analysis!

## Exercises

For this week's problems - all the files are in the `exercises` folder you can find here. For questions with multiple files, the files are zipped into a folder. You will need to use `!wget` and `!unzip` to download and unzip the files respectively.

### Reading and Writing from a File

Read the file `my_secrets.txt`. What's my Duolingo password? Save it to a file for later!

### Word Counter

Load the `book.txt` file into Python and write a program that counts each instance of each word. What are the five most common words? (Note: This book is taken from Project Gutenberg and may need some cleaning up of newlines (`\n`) and tabs (`\t`) before counting!)

### JSON APIs

APIs are web services for delivering data for other web applications. For example [this API](https://v2.jokeapi.dev/joke/Any?safe-mode) returns some information about a joke from a database. To automatically make requests to an API from Python, we can use the `requests` library in the following way:

In [4]:
import requests
joke = requests.get("https://v2.jokeapi.dev/joke/Any?safe-mode")
joke.json()

{'error': False,
 'category': 'Programming',
 'type': 'single',
 'joke': 'The generation of random numbers is too important to be left to chance.',
 'flags': {'nsfw': False,
  'religious': False,
  'political': False,
  'racist': False,
  'sexist': False,
  'explicit': False},
 'id': 39,
 'safe': True,
 'lang': 'en'}

**PLEASE DON'T RUN A FOR OR WHILE LOOP ON AN API WITHOUT PUTTING A DELAY IN!**
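
If you do want more than one joke, a sketch of a "polite" loop is below. The helper name `fetch_politely` and the stand-in `fake_fetch` function are made up for this example - in real use you would swap `fake_fetch` for something that calls `requests.get(...).json()`, and probably use a longer delay.

```python
import time

def fetch_politely(fetch, count, delay_seconds=1.0):
    """Call fetch() count times, sleeping between calls."""
    results = []
    for i in range(count):
        results.append(fetch())
        if i < count - 1:  # no need to wait after the last call
            time.sleep(delay_seconds)
    return results

# Stand-in for requests.get(...).json(), so no network is needed here.
def fake_fetch():
    return {"joke": "placeholder"}

jokes = fetch_politely(fake_fetch, count=3, delay_seconds=0.1)
print(len(jokes))  # 3
```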

Your job is to write a Python program to get a joke from the jokes API above, and save it to a json file.

### School Grades 

`school_grades.zip` below has all the grades for a class - using Python and the `os` library, find the following information:

* How many students are there in the class?
* Which student's favourite colour is red?
* Who scored highest in the maths exam?
* What was the average score in english?