# Week 5 Lab


## Instructions

For each task, write in the provided cell.


## Due date
Labs are due each week on Wednesday at 8pm (this week: **Oct 11, 8pm**).

# Assignment
This assignment goes through some familiar code, using a new set of texts: a folder collecting all the Presidential Inaugural Addresses by U.S. presidents, from George Washington in 1789 to Joe Biden in 2021 (as they are collected [here](https://archive.org/details/Inaugural-Address-Corpus-1789-2009), supplemented with more recent ones from [here](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/inaugural-addresses)).✼

## Tasks 1 and 2: Modify and Comment the Code

### Task 1

Below is the code from lecture. It calculates the overall and standardized type-token ratios for a set of files in a folder named `sot4chaps`, outputting its results into two CSV spreadsheet files. 
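As a quick refresher on what the two measures mean, here is a minimal sketch on a made-up sentence (not a text from the corpus). The helper name `type_token_ratio` and the sample sentence are illustrative assumptions, not part of the lecture code. The sketch also swaps in `re.findall` and a `set` where the lecture code uses `re.sub`, `split()`, and a list-based loop; the counts should come out the same.

```python
import re

def type_token_ratio(text, sample_size=None):
    """Return (types, tokens, ttr) for text.

    If sample_size is given, only the first sample_size words are counted,
    which is the idea behind the "standardized" TTR.
    """
    # Pull out every run of letters/digits, lowercased, as one word
    words = re.findall("[a-zA-Z0-9]+", text.lower())
    if sample_size is not None:
        words = words[:sample_size]
    tokens = len(words)           # total number of words
    types = len(set(words))       # number of distinct words
    ttr = (types / tokens) * 100  # expressed as a percentage
    return types, tokens, ttr

# Hypothetical sample sentence, invented for this illustration
sample = "The people and the government of the people."

overall = type_token_ratio(sample)
standardized = type_token_ratio(sample, sample_size=4)

print(overall)       # (5, 8, 62.5)
print(standardized)  # (3, 4, 75.0)
```

Notice that the same sentence gets a higher TTR when only its first four words are counted. That is why the standardized version cuts every text down to the same number of words before comparing them.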

However, the folder containing the presidential speeches whose TTRs we want to calculate is named `speeches`, not `sot4chaps`. Find the line of code that specifies which folder contains the texts for which the TTRs will be calculated, and modify it to reference the `speeches` folder instead. (Note: if you don't modify this code correctly, the code below will generate two CSV files that contain nothing but a header row.)
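If you want to check that Python can actually see your text files before running the full cell, one option is to count what the `glob` call finds. The sketch below is only an illustration: the helper `count_text_files` and the folder names `demo_texts` and `no_such_folder` are made up, and the temporary folder exists only so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def count_text_files(folder):
    """Count the .txt files that a glob over `folder` would find."""
    return len(list(Path(folder).glob("*.txt")))

# Build a throwaway folder with two text files so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp) / "demo_texts"
    demo_folder.mkdir()
    (demo_folder / "a.txt").write_text("one two", encoding="utf-8")
    (demo_folder / "b.txt").write_text("three four", encoding="utf-8")

    # A folder that exists and holds two .txt files
    found = count_text_files(demo_folder)
    # A folder name that does not exist: glob simply finds nothing
    missing = count_text_files(Path(tmp) / "no_such_folder")

print(found)
print(missing)
```

A count of zero means the `for` loop would have no files to work through, so nothing but the header line would ever be written to the CSV.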

### Task 2

For **every single line of code** in the cell below, **add a comment (using `#`) explaining what that line of code does**. Some lines of code appear twice in identical or near-identical form; comment them all, cutting and pasting your explanations if necessary. Comments are not necessary for blank lines with no code in them.

In [None]:
# Starter code is provided below.
# Modify it so that Python looks in the correct folder for texts to analyze
# Then add a comment to EVERY SINGLE LINE OF CODE to explain what it does

import re #Import regex package
from pathlib import Path

folder_path = "sot4chaps/"

sample_size = 0

file = open("ttr-overall.csv", mode="w", encoding="utf-8")

file.write('"Text","Types","Tokens","TTR"\n')

for file_path in sorted(Path(folder_path).glob('*.txt')):
    
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    tokens = len(text_words)
    
    if sample_size == 0 or tokens < sample_size:
        sample_size = tokens
    
    unique_words = []
    
    for word in text_words:
        word = word.lower()
        if word not in unique_words:
            unique_words.append(word)
            
    types = len(unique_words)
    
    ttr = (types / tokens) * 100
    
    file.write(f'"{file_path.stem}",{types},{tokens},{ttr}\n')

file.close()



file = open("ttr-standardized.csv", mode="w", encoding="utf-8")

file.write('"Text","Types","Tokens","TTR"\n')

for file_path in sorted(Path(folder_path).glob('*.txt')):
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    text_words_standardized = text_words[:sample_size]
    tokens_standardized = len(text_words_standardized)

    unique_words_standardized = []
    
    for word in text_words_standardized:
        word = word.lower()
        if word not in unique_words_standardized:
            unique_words_standardized.append(word)
            
    types_standardized = len(unique_words_standardized)
    
    ttr_standardized = (types_standardized / tokens_standardized) * 100
    
    file.write(f'"{file_path.stem}",{types_standardized},{tokens_standardized},{ttr_standardized}\n')

file.close()

## Task 3: Make Some Predictions

Before you look at the CSV files that the code generates, make some predictions about what you think you might see. Do you expect the TTRs of presidential addresses to change over time? Are there particular US presidents you'd expect to have a higher or lower TTR? Do you think that Republican or Democratic presidents will tend to have higher or lower TTRs? Write 1-2 sentences in the Markdown cell below with your guesses and predictions.

(Note: If you are immune from US cultural imperialism and don't know anything about the history of our neighbour to the south, that is absolutely fine, and you can base your predictions on something other than your intimate knowledge of US history...)

(Replace this text and enter your answers here)



## Task 4: Interpret the Results (Sorted by Year)

Open the `ttr-standardized.csv` file that is created when the code above is run, where the results are sorted by year. In the cell below, reflect on how these results compare with your predictions.

(Replace this text and enter your answers here)

## Task 5: Interpret the Results (Sorted by TTR)

The code cell below uses a package called `pandas` — which we will be meeting after the midterm — to generate a pretty table that sorts all the presidential speeches by their TTRs, from lowest to highest.

(Note that you are not expected to know anything about `pandas` for the midterm itself.)

In [None]:
import pandas as pd    # imports the pandas package
presidential_speeches = pd.read_csv('ttr-standardized.csv')    # loads the "ttr-standardized.csv" file you created above into a pandas "dataframe" object
presidential_speeches.sort_values(by='TTR')   # sorts the rows from your CSV by TTR from smallest to largest, then displays this as a pretty table


Without worrying for now about how `pandas` works (we'll dig into that after the midterm), use the sorted table above to reflect on the TTR experiment we have just conducted.

In the Markdown cell below, reflect on whether the sorted results help you to notice any trends in the data. What further insights does the sorted table provide into your predictions, or into the texts themselves?
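If you would like to look at the table from the other direction, `sort_values` also accepts `ascending=False`, and `head` keeps only the first few rows. The sketch below is optional and uses a tiny invented table rather than your real results: the speech names and numbers are made up purely for illustration.

```python
import io
import pandas as pd

# Invented rows for illustration only; these are NOT results from the corpus
made_up_csv = io.StringIO(
    '"Text","Types","Tokens","TTR"\n'
    '"speech_a",300,1000,30.0\n'
    '"speech_b",420,1000,42.0\n'
    '"speech_c",350,1000,35.0\n'
)

# read_csv accepts a file-like object as well as a filename
demo = pd.read_csv(made_up_csv)

# ascending=False puts the highest TTR first
highest_first = demo.sort_values(by="TTR", ascending=False)

# head(2) keeps only the first two rows of the sorted table
top_two = highest_first.head(2)
print(top_two)
```

With your own data, you would replace `demo` with the `presidential_speeches` dataframe from the cell above.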

(Replace this text and enter your answers here)

✼ Pedantic footnote: in fact, this corpus does not include Washington's 1793 address, since it is only 140 words long and using it as the shortest text obscured the TTR trends that were visible with a larger sample. The file with this address is included in this week's homework folder, in a subfolder named `excluded_speeches`, if you want to explore adding it back in. You can also discuss in tutorial whether this was an appropriate way to handle this outlier, and what other options there might have been.
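To see why one very short text can hide differences between the others, here is a rough sketch with invented "speeches" (none of this is corpus data). The names `standardized_ttr` and `all_ttrs` are hypothetical helpers, and the texts are deliberately artificial: one reuses 5 distinct words, the other 10.

```python
def standardized_ttr(words, sample_size):
    """TTR (as a percentage) over the first sample_size words."""
    sample = [word.lower() for word in words[:sample_size]]
    return len(set(sample)) / len(sample) * 100

def all_ttrs(texts):
    """Use the shortest text's length as the sample size for every text."""
    sample_size = min(len(words) for words in texts.values())
    ttrs = {name: standardized_ttr(words, sample_size)
            for name, words in texts.items()}
    return sample_size, ttrs

# Two invented texts: 450 words and 360 words long
texts = {
    "repetitive": ("we will build we will grow we will lead " * 50).split(),
    "varied": ("liberty union justice peace labor faith hope duty trade law " * 36).split(),
}

size_without, ttrs_without = all_ttrs(texts)

# Add a three-word outlier and recompute everything
texts_with_outlier = dict(texts, tiny=["thank", "you", "all"])
size_with, ttrs_with = all_ttrs(texts_with_outlier)

print(size_without, ttrs_without)
print(size_with, ttrs_with)
```

Without the outlier, the sample size is 360 words and the two texts get different TTRs. With the three-word outlier, every text is cut down to its first three words, all of which happen to be distinct, so every TTR becomes 100 and the difference disappears.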

### Optional: Questions about this week's material

In the Markdown cell below, please feel free to ask any question(s) you have about this week’s lecture material and/or the material in this lab.

(Replace with any questions you have!)

## Marking Rubric

Two points are awarded for labs: one point for submitting the completed lab on time, and one point for making an honest effort at completing it.


## How to Submit
1. Download this notebook to your local computer and save it as `W05_lab.ipynb`.

2. Log in here: https://markus-ds.teach.cs.toronto.edu

3. Submit your homework to `lab5: Lab 5`.