# Lab 4A - File I/O
*Day 4 - August 1, 2024*

*I School Python Bootcamp*

*Author: Lauren Chambers<br>Modified from notebook by George McIntire*


In Python, reading and writing files is a fundamental operation for handling data. Python provides built-in functions and methods to easily perform these tasks. To work with files, you typically follow these steps:

## Opening a File

To read from or write to a file, you first need to open it. You can use the built-in `open()` function for this purpose. The `open()` function is usually called with two arguments: the file path and the mode. The mode specifies whether you want to read, write, or append to the file; if you leave it out, Python assumes read mode. The common modes are:

`r`: Read mode (default). It allows you to read from the file.

`w`: Write mode. It allows you to write to the file. If the file exists, it will be truncated (i.e., its content will be erased). If the file does not exist, a new file will be created.

`a`: Append mode. It allows you to write to the file, but the data will be added at the end of the file, preserving the existing content. If the file does not exist, a new file will be created.
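As a rough side-by-side comparison, the sketch below runs the three modes against one scratch file. The file name `modes_demo.txt` and the temporary directory are assumptions made for illustration, chosen so that nothing in the lab folder gets overwritten.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(demo_path, "w") as f:   # "w" creates the file (or truncates an existing one)
    f.write("first line\n")

with open(demo_path, "a") as f:   # "a" keeps the existing content and adds to the end
    f.write("second line\n")

with open(demo_path, "r") as f:   # "r" reads the file back
    after_append = f.read()

with open(demo_path, "w") as f:   # opening in "w" again erases what was there
    f.write("replaced\n")

with open(demo_path) as f:        # the mode defaults to "r" when omitted
    after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

Printing with `repr()` makes the `\n` characters visible, which helps when checking exactly what ended up in the file.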

Here's an example of opening a file in read mode:

In [None]:
filepath = "zen_of_python.txt"

with open(filepath, "r") as f:
    zen = f.read()
    

Once the file is open in read mode, you can use the file object's `read()` method to read its content. This method returns the entire content of the file as a single string.

In [None]:
print(zen)

The `with` statement in Python is used for what is called "context management." It is mainly used when working with resources that need to be properly managed, such as files, network connections, and databases. The `with` statement ensures that the resource is properly set up, used, and cleaned up when it is no longer needed, even if an exception occurs.

We are using the `with` statement to open up the text file, read its contents and then close it.
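One way to check the cleanup behavior is to look at the file object's `closed` attribute after the block ends. The sketch below assumes a hypothetical scratch file, `with_demo.txt`, and a deliberately raised `RuntimeError` standing in for whatever might go wrong in real code.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(demo_path, "w") as f:
    f.write("hello")

# Normal case: the file is closed as soon as the block ends
with open(demo_path) as f:
    content = f.read()
closed_after_block = f.closed

# Error case: the file is still closed, even though the block raised
try:
    with open(demo_path) as g:
        raise RuntimeError("something went wrong while the file was open")
except RuntimeError:
    closed_after_error = g.closed

print(closed_after_block, closed_after_error)
```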

Here's the other way to open and close files in Python: doing each step by hand.

Open the file

In [None]:
f = open("zen_of_python.txt")

Extract the content

In [None]:
zen = f.read()

Close the file

In [None]:
f.close()

`.readlines()` is a method that splits the text into its individual lines and returns them as a list of strings. Each line keeps its trailing newline character (`\n`), if it has one.

In [None]:
with open("zen_of_python.txt") as f:
    zen = f.readlines()

`zen` is a list of lines

In [None]:
zen
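A file object can also be looped over directly, which gives the same lines one at a time. The sketch below compares the two approaches on a hypothetical three-line scratch file (`lines_demo.txt` is an assumed name, not one of the lab files).

```python
import os
import tempfile

# Hypothetical three-line scratch file standing in for a real text file
demo_path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(demo_path, "w") as f:
    f.write("one\ntwo\nthree\n")

with open(demo_path) as f:
    raw_lines = f.readlines()             # a list; each item keeps its trailing "\n"

with open(demo_path) as f:
    looped_lines = [line for line in f]   # looping over the file yields the same lines

print(raw_lines)
print(raw_lines == looped_lines)
```

Looping is often the gentler choice for large files, since it does not need to build the whole list before you start working with the first line.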

Run `!ls` to list the files in this directory.

In [None]:
!ls

## Writing and Appending to Files

To write data to a file, you need to open the file in write or append mode. You can use the file object's `write()` method to write data to the file.

Let's write the following text passage to a new file.

In [None]:
text = """By memorizing & compressing the data, the NN generalizes, and at scale, 
like other self-supervised architectures, will learn to meta-learn datapoints, 
becoming a compact encoding of the distribution which rapidly learns each new datapoint 
in a single gradient descent step (like Hopfield nets or Reptile). 
Because of the uniformity and small dimensionality of input/output, 
the NN can be a deep MLP rather than a CNN, Transformer, or MoE."""

`open("text.txt", "w")` creates `text.txt` if it does not already exist (and erases its contents if it does).

In [None]:
with open("text.txt", "w") as f:
    f.write(text)

Let's verify that it actually exists by opening and reading the `text.txt` file.

In [None]:
with open("text.txt") as f:
    print(f.read())

<div class="alert alert-block alert-danger">
    BE CAREFUL WITH WRITE MODE
</div>

The following code, if run, would erase the contents of `zen_of_python.txt` and replace them with `text`:

```python
with open("zen_of_python.txt", "w") as f:
    f.write(text)
```
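If accidental overwriting is a concern, one possible safeguard is Python's `x` (exclusive creation) mode, which is not covered elsewhere in this lab: it creates a new file but raises `FileExistsError` when the file already exists. The sketch below assumes a hypothetical scratch file named `precious.txt`.

```python
import os
import tempfile

# Hypothetical scratch file standing in for a file you do not want to lose
demo_path = os.path.join(tempfile.mkdtemp(), "precious.txt")
with open(demo_path, "w") as f:
    f.write("original content")

try:
    # "x" creates a new file and refuses to touch one that already exists
    with open(demo_path, "x") as f:
        f.write("this would have replaced the original")
    overwrite_blocked = False
except FileExistsError:
    overwrite_blocked = True

with open(demo_path) as f:
    surviving_content = f.read()

print(overwrite_blocked, surviving_content)
```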

What if we want to add to an existing file? Then we use `a`, or append mode.

Let's append this second text passage to `text.txt`

In [None]:
text2 = """Training can be done by ordinary SGD, 
but also by any local learning rule, or any mix thereof
(eg. SGD at training time in compute-optimal large batches for offline training datasets, 
and then a local learning rule at ‘runtime’ when conditioning online with new inputs)."""

We're adding a `\n` character to separate the two text passages.

In [None]:
with open("text.txt", "a") as f:
    f.write("\n" + text2)

Let's look at the results.

In [None]:
with open("text.txt") as f:
    print(f.read())

Let's try using iteration to write data line by line to a text file.

We're going to write the full URL for each of these case IDs to a text file.

In [None]:
url = "https://scholar.google.com/scholar_case?case={}"
case_ids = [6768454, 7712102, 6448738, 5374866, 6167299, 4683704, 3160303,
       3379031, 5507176, 7024792, 9332965, 3999836, 1314612, 5918388,
       1994897, 6497745, 4011633, 2837473, 7121246, 2001872]

In [None]:
with open("case_urls.txt", "a") as f:
    # Note: because this is append mode, re-running this cell adds the URLs again
    for case_id in case_ids:
        f.write(url.format(case_id) + "\n") # \n is the newline character

Load that file back in to see our handiwork:

In [None]:
with open("case_urls.txt", "r") as f:
    urls = f.readlines()

urls

How might you remove the newline characters from the URLs when reading the data back in?

## Tabular data

Our process for loading and writing tabular (AKA table-based or two-dimensional) data is a bit different. In order to load comma-separated value (CSV) files, we must import a new package. There are a few options, but first let's look at the built-in `csv` package.

### Using `csv`

In [None]:
import csv

The `csv.reader()` function creates a reader object that yields one row of the file at a time (remember our first lazy friend, `range()`?). You can view the rows by putting the reader through a for loop like so:

*Note: I didn't generate this list of countries and do not necessarily endorse any political implications of which countries are and are not included!*

In [None]:
with open('countries.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

However, just like `range()`, if we try to view the reader object itself we won't see the data that it points to. The reader is an iterator: it produces rows on demand and does not hold all of the data in memory like a list does.

In [None]:
reader
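Because the reader hands out rows on demand, it can only be walked through once. The sketch below illustrates this with made-up rows held in an in-memory file (`io.StringIO`), an assumption that lets it run without `countries.csv`.

```python
import csv
import io

# Made-up rows in an in-memory file, so this runs without countries.csv
fake_file = io.StringIO("name,score\nA,1\nB,2\n")

reader = csv.reader(fake_file)
first_pass = list(reader)    # consumes every row
second_pass = list(reader)   # nothing is left to read

print(first_pass)
print(second_pass)
```

Note that every value comes back as a string, including the numbers.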

If we need to be able to see all the data at once, we can convert this into a list of lists. However, we still have to put that code within a `with open()` block. What happens if we don't?

In [None]:
countries_data = []

for row in reader:
    countries_data.append(row)

countries_data

The reader can no longer read from the CSV file, because the file was closed when the `with` block ended! Let's try again:

In [None]:
countries_data = []

with open('countries.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        countries_data.append(row)

countries_data

Now let's play with these data just a little. Can we list all the countries?

In [None]:
all_countries = {row[0] for row in countries_data[1:]} # skip the header row
all_countries

What about the population of each country in 2007?

In [None]:
pops = {}
for row in countries_data[1:]: # don't include the header row!
    # Pull out the relevant data from the row as local variables
    country = row[0]
    year = row[1]
    pop = row[2]
    
    if year == '2007': # Note we have to use a string here, not an integer
        pops[country] = pop

pops
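If remembering column positions gets tedious, the `csv` module also offers `csv.DictReader`, which uses the header row as dictionary keys. The sketch below is only an illustration: the rows are made-up placeholder values (not real statistics) held in memory, and only the column names `country`, `year`, and `pop` are borrowed from this lab.

```python
import csv
import io

# Made-up placeholder rows (not real statistics); only the column names
# country, year and pop come from the examples in this lab
fake_csv = io.StringIO(
    "country,year,pop\n"
    "Country A,2002,100\n"
    "Country A,2007,110\n"
    "Country B,2007,250\n"
)

pops_2007 = {}
for row in csv.DictReader(fake_csv):    # each row is a dict keyed by the header
    if int(row["year"]) == 2007:        # csv gives strings, so convert before comparing
        pops_2007[row["country"]] = int(row["pop"])

print(pops_2007)
```

Converting with `int()` at read time means later code can compare and do arithmetic on numbers instead of strings.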

What about the GDP per capita of Senegal over time?

In [None]:
gdp_senegal = {}
for row in countries_data[1:]: # don't include the header row!
    # Pull out the relevant data from the row as local variables
    country = row[0]
    year = row[1]
    gdp = row[5]
    
    if country == "Senegal":
        gdp_senegal[year] = gdp

gdp_senegal

### Using `pandas`

Perhaps the more popular option is to use `pandas`. It has a lot of very intuitive functions!

In [None]:
import pandas as pd

In [None]:
countries_df = pd.read_csv("countries.csv")
countries_df

We can do all the same things using `pandas` that we did using `csv`, and much more easily in my opinion.

In [None]:
# List all countries
set(countries_df.country)

In [None]:
# Population of each country in 2007
countries_df[countries_df.year == 2007][["country", "pop"]]
# Note that pandas implicitly converted the year column into an integer!

In [None]:
# GDP per capita of Senegal over time
countries_df[countries_df.country == "Senegal"][["year", "gdpPercap"]]

## JSON / attribute-value pair data

Similarly to tabular data, loading JSON data requires another built-in package: `json`.

In [None]:
import json

In [None]:
with open("page_visits.json", "r") as fo:
    json_dicts = json.load(fo) # parse the JSON straight from the open file

json_dicts
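The `json` module has two closely related parsing functions: `json.loads()` parses a string of JSON text, while `json.load()` reads from an open file object. The sketch below uses `json.loads()` on a made-up string so that it runs without `page_visits.json`; the keys `userId`, `title`, and `completed` mirror the ones used in this lab, but the values are invented for illustration.

```python
import json

# Made-up records; the keys mirror this lab's data, the values are invented
raw_text = (
    '[{"userId": 1, "title": "Page A", "completed": true},'
    ' {"userId": 2, "title": "Page B", "completed": false}]'
)

records = json.loads(raw_text)               # JSON text -> Python list of dicts
first_completed = records[0]["completed"]    # JSON true becomes Python True

# json.dumps goes the other way: Python objects -> JSON text
round_trip = json.loads(json.dumps(records))

print(type(records), first_completed)
```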

Loading this file produces a list of dictionaries. We can manipulate that to answer questions about the data. For example: how many pages did each user finish reading?

In [None]:
complete_reads = {}

for visit in json_dicts:
    userId = visit['userId']
    
    if userId not in complete_reads:
        complete_reads[userId] = 0 # Initialize entry for new users
    
    if visit['completed']:
        complete_reads[userId] += 1

complete_reads

# Exercises

## Exercise 1
*Reading and writing text files*

Open and load `zen_of_python.txt`, reverse the order of the words in it, and then save the result to a new file called `zen_of_python_reversed.txt`.

## Exercise 2
*Reading and analyzing JSON data*

Load in the `page_visits.json` file. Iterate over it and collect into a list the `title` of every visit whose `completed` value is True.

## Exercise 3
*Reading and analyzing tabular data*

Load the `countries.csv` file (you can use `csv` or `pandas`, whichever you prefer). Calculate the average life expectancy for each continent in 1952.

## Exercise 4
*Self-teaching skills using package documentation*

Load `countries.csv` into a pandas dataframe.

Review this tutorial on the `pandas` website: https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html

Use it to figure out how to add a new column to the dataframe: the total GDP, as opposed to the GDP per capita. (`gdp = pop * gdpPercap`)