# Working with Files

This notebook introduces file handling in Python.

You will need to run the following cell to copy the necessary files into a local data directory:

In [None]:
# copy all data files from the 3010 repo
import urllib.request
import json
from pathlib import Path

# Config
owner = "olearydj"
repo = "INSY3010"
branch = "main"
folder = "notebooks/data"
destination = "data"

# Get list of files from GitHub API
api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{folder}?ref={branch}"
response = urllib.request.urlopen(api_url)
files = json.loads(response.read())

# Download each file
Path(destination).mkdir(exist_ok=True)
for file in files:
    if file['type'] == 'file':  # Skip directories
        print(f"Downloading {file['name']}...")
        urllib.request.urlretrieve(file['download_url'], f"{destination}/{file['name']}")
        print(f"✓ {file['name']}")

## Reading Files

So far, input has come from the keyboard via `input()`, but we'll often want to read data in from a file. There are two main types of files:

- text files: human readable (txt, csv, html, json, log, etc.)
- binary files: not human readable (exe, zip, jpg, etc.)

Text files range from completely unstructured to very structured. Base Python includes tools suitable for both. We will cover plain text (e.g. TXT) and CSV formats. The methods we describe can be applied to any non-binary format, though specific packages are available for most structured formats (e.g. JSON).

### Plain Text Formats

The general steps are to open the file, read from it, and close the file. You may choose to process data while reading from it, or after closing the file.

### Open, Read, and Close

To open a file, use the aptly named `open` function. Its only required argument is the file name, given as a string. The file name may include a relative or absolute path. If no path is included, Python looks for the file in the current working directory, which is usually the directory containing the notebook or `.py` file. `open` returns a *file object* that is used to "handle" the data.

In [None]:
file_name = './data/the_file.txt'
file_handle = open(file_name)
print(file_handle)

Here, `file_handle` is the name of the object used to manipulate `the_file.txt`. File objects like `file_handle` have several methods, including:

- `.read()` - returns the full file contents as a single string
- `.readlines()` - returns a list of strings, one for each line in the file
- `.readline()` - returns a single line at a time
- `.close()` - closes the file object

In this context, a line is a string of text ending with a newline (`\n`) character.

The following code reads the data associated with the file and closes the handle before printing the contents.

In [None]:
contents = file_handle.read()
file_handle.close()
print(contents)

### Contents as a List of Strings

It is often useful to read in the content as a list of strings, where each string represents a line in the file. This is accomplished with the `readlines` method.

In [None]:
file_name = './data/the_file.txt'
file_handle = open(file_name)

lines = file_handle.readlines()
file_handle.close()

print(lines)

This provides direct access to individual lines via indexing or groups via slicing.

In [None]:
# first line
print(lines[0])

# list of all lines, in reverse order
print(lines[::-1])

### One Line at a Time

Alternatively, you can read lines one at a time.

This is typically used when you want to process each line as it is read, and avoid loading the whole file first. Why might you want to do that?

In [None]:
file_name = './data/the_file.txt'
file_handle = open(file_name)
file_handle.readline()

Note the trailing newline on this output. We've not used `print`, so the cell output is simply the last value **returned** by its contents, a string (denoted by the single quotes), including the trailing newline character.

We haven't closed the file, so more readline operations can be performed:

In [None]:
file_handle.readline()

In [None]:
file_handle.readline()

Once we reach the end of a file, `readline` will always return an empty string.
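
The difference between a blank line and the end of the file is easy to miss, so here is a minimal sketch of it. It uses `io.StringIO` as a stand-in for an open text file (an assumption made only so the example runs without any data files); a handle returned by `open` should behave the same way.

```python
import io

# io.StringIO stands in for an open text file so the sketch is self-contained.
fake_file = io.StringIO("first\n\nthird\n")

# Call readline five times: three real lines, then two calls past the end.
results = [fake_file.readline() for _ in range(5)]

for value in results:
    print(repr(value))
```

A blank line in the file comes back as `'\n'`, which is not empty, while every call after the last line returns `''`. That is why comparing against the empty string is a safe end-of-file test.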

### Looping through Lines

Reading line by line is typically done in a loop. We can use the empty string returned by `readline` at the end of a file to trigger the end of a `while` loop.

In [None]:
# close previous handle
file_handle.close()

file_name = './data/the_file.txt'
file_handle = open(file_name)

# loop through all lines
while True:
    line = file_handle.readline()
    if line == '':
        break
    print(line)

# close it again
file_handle.close()

Note the blank line introduced by this approach. Where does it come from? How would you correct for it?

However, file handle objects are iterable, so it is more common to use a `for` loop. This approach avoids the need for an explicit `readline`, making it simple and easy to read.

In [None]:
file_name = './data/the_file.txt'
file_handle = open(file_name)

for line in file_handle:
    print(line, end='')

file_handle.close()

### Exercise - Word Counts

Write a function `count_words` that takes a filename and returns a dictionary of all the words that appear in the file (keys) and the number of times each appears in it (values).

In [None]:
# have at it


Test it with the following cell:

In [None]:
file_name = './data/auburn_creed.txt'
count_words(file_name)

#### Solution

In [None]:
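def count_words(fn):
    fh = open(fn)
    d = {}  # maps each word to the number of times it appears
    for line in fh:
        for word in line.split():
            if word in d:
                d[word] += 1
            else:
                d[word] = 1
    fh.close()  # release the file handle before returning
    return d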

How would you modify this code to ignore "stop words" (commonly used words that convey little meaning, e.g. “the”, “a”, “an”, or “in”) and eliminate punctuation at the end of words?
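
One possible approach to that question is sketched below. The stop-word list is a tiny hypothetical one, the name `count_words_filtered` is made up for illustration, and lowercasing the words is an extra assumption that the exercise does not ask for. The function takes any iterable of lines, so an open file handle can be passed in directly; `io.StringIO` is used here only to keep the example self-contained.

```python
import io
import string

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "in", "and", "of"}

def count_words_filtered(lines, stop_words=STOP_WORDS):
    """Count words in an iterable of lines, skipping stop words.

    Words are lowercased and stripped of leading and trailing punctuation.
    """
    counts = {}
    for line in lines:
        for raw in line.split():
            word = raw.strip(string.punctuation).lower()
            if not word or word in stop_words:
                continue
            counts[word] = counts.get(word, 0) + 1
    return counts

# io.StringIO stands in for an open file; a real file handle works the same way.
sample = io.StringIO("I believe in work, hard work.\nThe work of a day.\n")
word_counts = count_words_filtered(sample)
print(word_counts)
```

Note that `str.strip(string.punctuation)` removes punctuation from both ends of a word, not just the end, and leaves punctuation inside a word (such as an apostrophe) alone.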

## Discussion

- Structure in text files varies
  - Impacts how you process resulting data
  - Line breaks can be used
    - to separate lines or paragraphs
    - inconsistently
    - not at all
  - Tools exist to help with this, e.g. `textwrap.wrap()`
- Interacting with file systems
  - Specify a **path** or keep files in the current working directory (usually the same directory as the `.py` file)
  - Path names are different between Windows and macOS / Linux
    - Windows: `c:\dev\insy3010`
    - macOS / Linux: `~/dev/insy3010`
  - To write Python that works in both, use the `pathlib` module
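
The two tools named in this list can be sketched briefly. The file name and sample sentence below are placeholders chosen for illustration; nothing is read from disk.

```python
import textwrap
from pathlib import Path

# Build a path from parts; pathlib inserts the right separator for the OS.
data_file = Path("data") / "the_file.txt"
print(data_file)         # data/the_file.txt on macOS / Linux, data\the_file.txt on Windows
print(data_file.name)    # the_file.txt
print(data_file.suffix)  # .txt

# textwrap.wrap splits a long string into a list of lines no wider than `width`.
long_text = "Text files do not always use line breaks consistently, so rewrapping can help."
wrapped = textwrap.wrap(long_text, width=30)
for line in wrapped:
    print(line)
```

A `Path` object can be passed to `open` in place of a string, so code written this way does not need to hard-code `/` or `\`.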

## Context Managers

In Python, a **context manager** can be used to handle *setup and teardown* operations associated with a task. When working with files, a context manager will automatically close the handle when access is complete. This is accomplished by the `with ... as` syntax.

In [None]:
file_name = './data/the_file.txt'

with open(file_name) as file_handle:
    lines = file_handle.readlines()

In [None]:
for idx, line in enumerate(lines):
    print(f'{idx:02d}: {line.strip()}')

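Confirm that the file was closed automatically. Reading from the closed handle raises a `ValueError`:

In [None]:
# no explicit close was needed; this read fails because the handle is already closed
line = file_handle.readline()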

`enumerate` in the code above wraps any iterable and returns a tuple of `index, value` for each iteration:

In [None]:
lst = list('abcdefg')

for idx, val in enumerate(lst):
    print(f"{idx}: {val}")

It is good practice to always use `with` when reading files.
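
A quieter way to see what `with` does is to inspect the handle's `closed` attribute rather than triggering an error. The sketch below uses `io.StringIO` as a stand-in for `open(...)`, an assumption made only so it runs without data files.

```python
import io

# io.StringIO stands in for open(...) so the sketch runs without any data files.
with io.StringIO("line one\nline two\n") as handle:
    first = handle.readline()
    closed_inside = handle.closed   # still open inside the block

closed_after = handle.closed        # closed once the block ends

print(closed_inside, closed_after)
```

The name `handle` still exists after the block, but the object it refers to has been closed.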

Context managers are available in a variety of contexts, including managing database connections, e.g.:

```python
with database.connect() as connection:
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM users")
```

## Error Handling

It is the programmer's responsibility to prevent and/or handle all errors that may occur. When working with files, this is particularly important to avoid large scale data loss or corruption.

In some cases the most effective way to do this is to let them happen and "catch" them.

Contrasting approaches:

- anticipate and correct errors (look before you leap, lbyl)
- accept they may happen and catch them (ask forgiveness not permission, afnp; often written EAFP, "easier to ask forgiveness than permission")

Here is an example of the former:

In [None]:
# Check if the input is valid BEFORE trying to convert it
def lbyl_approach(user_input):
    '''check if input is numeric before converting, warn if not'''
    if user_input.isdigit() or (user_input.startswith('-') and user_input[1:].isdigit()):
        number = int(user_input)
        print(f"LBYL: Successfully converted '{user_input}' to {number}")
        return number
    else:
        print(f"LBYL: '{user_input}' is not a valid integer")
        return None

test_inputs = ["42", "-17", "abc", "12.5", "  "]

print("=== LBYL Approach ===")
for inp in test_inputs:
    lbyl_approach(inp)

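This works, but it is relatively complex and fragile. For example, it rejects inputs such as `"+5"` or `" 42 "`, both of which `int()` converts without complaint.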

To implement the second option, Python provides the `try ... except` mechanism:

```python
try:
    # if any error occurs in this code block...
except:
    # this code runs
```

Here is the same problem implemented with try / except:

In [None]:
# Try to convert it and handle the exception if it fails
def afnp_approach(user_input):
    '''try to convert and warn if it fails'''
    try:
        number = int(user_input)
        print(f"AFNP: Successfully converted '{user_input}' to {number}")
        return number
    except ValueError:
        print(f"AFNP: '{user_input}' is not a valid integer")
        return None


test_inputs = ["42", "-17", "abc", "12.5", "  "]

print("\n=== AFNP Approach ===")
for inp in test_inputs:
    afnp_approach(inp)


This is a simple and straightforward approach that is very robust.

Try / except can be extended to handle different error types differently. For example, the success of the `open` statement depends on the existence of the specified file, among other things. The following code sample uses a specific `except` block to catch `FileNotFoundError` and another to catch anything else that might go wrong:


```python
try:
    with open('missing.txt') as f:
        ...  # do stuff with f
except FileNotFoundError:
    print("File not found!")
    # handle this gracefully - get different filename?
except:
    # handle any other error
    print("Other error in file open")
    raise  # allows the error to crash the program
```
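
As a runnable variation on the same idea, the sketch below wraps the pattern in a small helper that falls back to a default value when the file is missing. The name `read_or_default` and the file name are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

def read_or_default(path, default=""):
    """Return the file's text, or `default` if the file does not exist."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        print(f"{path} not found, using default")
        return default

# A path that does not exist: a made-up name inside a fresh temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "missing.txt"
    result = read_or_default(missing, default="(empty)")

print(result)
```

Only `FileNotFoundError` is caught here, so any other problem (for example, a permissions error) still surfaces as a normal exception.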

## Writing Files

Data can be written to an open file using the `write` method, which writes strings. The file must be opened in a mode that permits writing; the mode is set when the handle is created by `open`.

```text
>>> help(open)

Help on function open in module _io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise OSError upon failure.
```

A variety of modes exist, but we will focus on these:

- mode 'r' is the default, allows read only
- mode 'w' gives write permissions, will create a new file or **overwrite** existing
- mode 'a' opens for appending, will create a new file or add to an existing one

The general steps are to open the file in the appropriate mode, write to the file, and close it. You can either prepare the data to write before those steps or as part of the write process.

In [None]:
with open("./data/write_test.txt", "a") as f:
    f.write("File opened for append.")
    for idx in range(5):
        f.write(f"Writing line {idx}")
    f.write("Closing file.")

After a single run on a new file, the result looks like this (running the cell again appends another copy):

```text
File opened for append.Writing line 0Writing line 1Writing line 2Writing line 3Writing line 4Closing file.
```

Why is everything on one line?
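
A minimal sketch of one answer, which also contrasts the `'w'` and `'a'` modes described above. It writes to a temporary directory with a made-up file name so that nothing in `./data` is touched.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "write_demo.txt"  # hypothetical file name

    # 'w' creates the file (or overwrites it); write() adds no newline of its own.
    with open(target, "w") as f:
        f.write("first line\n")
        f.write("second line\n")

    # 'a' keeps the existing contents and adds to the end.
    with open(target, "a") as f:
        f.write("appended line\n")

    with open(target) as f:
        after_append = f.readlines()

    # Opening with 'w' again discards everything written so far.
    with open(target, "w") as f:
        f.write("fresh start\n")

    with open(target) as f:
        after_overwrite = f.readlines()

print(after_append)
print(after_overwrite)
```

Unlike `print`, `write` outputs exactly the string it is given, so each line needs an explicit `\n`.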

### Exercise - Read and Write Files

Write a function `add_line` that takes two arguments: a filename and a string. Update the file to append that string. Use try / except to catch any errors that occur in the process. If no errors occur, print a success message.

In [None]:
# code here


In [None]:
add_line('./data/write_test.txt', '\nthis is added text')

#### Solution

In [None]:
def add_line(filename, line):
    try:
        with open(filename, 'a') as f:
            f.write(line)
        print("Line added successfully.")
    except Exception as err:
        # report what went wrong instead of hiding it
        print(f"Error occurred in add_line: {err}")

Test with valid line value (string type):

In [None]:
add_line("./data/write_test.txt", "\nAdded line.")

Test with invalid line value (int type):

In [None]:
add_line("./data/write_test.txt", 42)

## CSV Files

Comma Separated Values (CSV) files use a structured, plain text format. They provide a simple way to represent tabular data. Each line in a CSV represents a row in the table, with the value of each column in a comma separated list.

For example, here is part of a CSV of housing data from Zillow (spaces added for clarity):

```text
 1, 2222, 3, 3.5, 32312, 1981, 250000
 2, 1628, 3, 2,   32308, 2009, 185000
 3, 3824, 5, 4,   32312, 1954, 399000
 4, 1137, 3, 2,   32309, 1993, 150000
 5, 3560, 6, 4,   32309, 1973, 315000
```

Here, each line has values corresponding to the line number (index), square footage, bedrooms, bathrooms, zipcode, year built, and list price.

The CSV format is very common, especially for export / import. It is not usually created or edited by hand, but generated, e.g. by Excel. There are better formats for large or complex datasets, but CSV is often *good enough*.

Python provides a `csv` module for manipulating this file type.

### Reading CSV

Import the module, create a file handle, wrap it in a `csv.reader` object. The resulting interface is an iterable that yields each row of the CSV as a list of strings. Each item in the list is a field from the row.

For example, reading the first row of the Zillow data:

`1,2222,3,3.5,32312,1981,250000`

will return:

`['1','2222','3','3.5','32312','1981','250000']`

Note that each element is a string.

In [None]:
import csv

with open('./data/zillow.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Sometimes CSV files will include a header line with the column names. This can be read separately using the `next` function, which returns the next value from an iterator such as the reader.

Variants of CSV exist where tab or other characters are used to separate fields. To handle those, `csv.reader` has a `delimiter` parameter, which can be set to any one-character string.

For example, consider this tab separated version of the Zillow file that includes a header line:

```text
Index	Sq Ft	Beds	Baths	Zip	Year	List Price ($)
1	2222	3	3.5	32312	1981	250000
2	1628	3	2	32308	2009	185000
3	3824	5	4	32312	1954	399000
4	1137	3	2	32309	1993	150000
5	3560	6	4	32309	1973	315000
```

In [None]:
import csv
# function for pretty printing data structures
from pprint import pprint as pp

with open('./data/zillow-tabs.csv') as f:
    # specify tab-delimited format
    reader = csv.reader(f, delimiter='\t')

    # build a list of rows, starting with the header
    zillow = []
    zillow.append(next(reader))
    for row in reader:
        zillow.append(row)

pp(zillow)

### Exercise - Type Conversion

Data loaded from a CSV is all strings. This is not usually correct for every column. Write a function `convert_type` that takes a row from the Zillow data and converts every value to an appropriate type. Modify the code above to call that function on each row.

In [None]:
import csv
# function for pretty printing data structures
from pprint import pprint as pp



#### Solution

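First attempt, brute force. This fails with a `ValueError` on the first data row, because the string `'3.5'` cannot be converted directly with `int()`.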

In [None]:
import csv
from pprint import pprint as pp

def convert_type(data):
    idx = 0
    for val in data:
        data[idx] = int(val)
        idx += 1
    return data

with open('./data/zillow-tabs.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    zillow = []
    zillow.append(next(reader))
    for row in reader:
        zillow.append(convert_type(row))

pp(zillow)

Second attempt, try / except.

In [None]:
import csv
from pprint import pprint as pp

def convert_type(data):
    idx = 0
    for val in data:
        try:
            data[idx] = int(val)
        except ValueError:
            # int() rejects strings like '3.5', so fall back to float
            print(f'Int conversion error in row {data[0]}')
            print('Converting to float')
            data[idx] = float(val)
        idx += 1
    return data

with open('./data/zillow-tabs.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    zillow = []
    zillow.append(next(reader))
    for row in reader:
        zillow.append(convert_type(row))

pp(zillow)
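
A more explicit alternative, offered as a sketch and not as the intended solution, is to state the desired type of each column up front. The `CONVERTERS` list and the name `convert_row` are hypothetical, the column order is assumed from the header shown earlier, and keeping the zip code as a string is a judgment call. `io.StringIO` stands in for the tab-separated file so the example is self-contained.

```python
import csv
import io

# Assumed column order: Index, Sq Ft, Beds, Baths, Zip, Year, List Price ($)
# Zip stays a string here because zip codes are labels, not quantities.
CONVERTERS = [int, int, int, float, str, int, int]

def convert_row(row, converters=CONVERTERS):
    """Return a new list with each field passed through its column's converter."""
    return [convert(value) for convert, value in zip(converters, row)]

# Two sample rows copied from the tab-separated listing above.
sample = io.StringIO(
    "Index\tSq Ft\tBeds\tBaths\tZip\tYear\tList Price ($)\n"
    "1\t2222\t3\t3.5\t32312\t1981\t250000\n"
    "2\t1628\t3\t2\t32308\t2009\t185000\n"
)

reader = csv.reader(sample, delimiter="\t")
header = next(reader)
rows = [convert_row(row) for row in reader]
print(header)
print(rows)
```

Compared with the try / except version, this makes every bathroom count a float (so `2` becomes `2.0`) and returns a new list instead of modifying the row in place.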

### Writing CSV Files

A `csv.writer` object is used to write CSV formatted data to a file handle opened in a writable mode (e.g. `'w'` or `'a'`). There are two relevant methods:

- `writerow(row_data)` writes a single row, where `row_data` is a list of values
- `writerows(table_data)` writes multiple rows, where `table_data` is a list of lists of values

Everything is written in plain text, without any information about the underlying data types. This information will have to be reconstructed when read into Python, Excel, etc.

In [None]:
import csv

zillow_data = [
    [1, 2222, 3, 3.5, 32312, 1981, 250000],
    [2, 1628, 3, 2, 32308, 2009, 185000],
    [3, 3824, 5, 4, 32312, 1954, 399000],
    [4, 1137, 3, 2, 32309, 1993, 150000],
    [5, 3560, 6, 4, 32309, 1973, 315000]
]

header = ['Index', 'Sq Ft', 'Beds', 'Baths', 'Zip', 'Year', 'List Price ($)']

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('./data/zillow-new.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(zillow_data)

print("Export complete.")
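
To illustrate the point about lost type information, here is a small round-trip sketch. It writes to an in-memory `io.StringIO` buffer instead of a real file (an assumption made to keep the example self-contained) and uses a shortened, made-up table.

```python
import csv
import io

rows = [
    ["Index", "Sq Ft", "Baths"],
    [1, 2222, 3.5],
    [2, 1628, 2],
]

# io.StringIO stands in for a file opened with open(..., 'w', newline='').
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
print(repr(buffer.getvalue()))

# Read the same text back: every field is now a string.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back)
```

The integers and floats that went in come back as strings such as `'2222'` and `'3.5'`, so any type conversion has to be repeated after reading.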