In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw06.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 6

# SQL, Regular Expressions, and GPTEECS

### EECS 398-003: Practical Data Science, Fall 2024

#### Due Thursday, October 17th at 11:59PM (due in **two** weeks)
    
</div>

## Instructions

Welcome to Homework 6! In this homework, you will practice writing SQL queries, use regular expressions to extract meaning out of messy text data, and apply your knowledge of cosine similarity and TF-IDF to implement a supercharged ChatGPT-like bot. See the [Readings section of the Resources tab on the course website](https://practicaldsc.org/resources/#readings) for supplemental resources.

You are given six slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/fa24/). The [⚙️ Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps. Once you're done, you'll submit your completed notebook to Gradescope.

Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll grade your **most recent** submission.

<div class="alert alert-success" markdown="1">
Unlike other homeworks, Homework 6 <b>has no hidden tests</b>, because of its proximity to the Midterm Exam. This means the tests you see in your notebook are the exact same as the ones that will be used to grade your work on Gradescope. When you submit on Gradescope, you'll see your score shortly after you submit, once the autograder finishes running.
<br><br>
<b>Even though Homework 6 is due after the Midterm Exam, you should work on it before, since everything in the homework is in scope for the exam!</b> In particular, we recommend working on Questions 1-3 before the exam, because they provide core practice with SQL and regular expressions, both of which will appear on the exam. Question 4 is more "applied", and while cosine similarity, bag of words, and TF-IDF will appear on the exam, the best way to practice with those ideas is by working on relevant old exam problems at the <a href="https://study.practicaldsc.org"><b>study site</b></a>.
</div>

If you do fail a test in your notebook, look for a brief failure message that describes the error.

This homework is worth a total of **56 points**, all of which come from the autograder. The number of points each question is worth is listed at the start of each question. **The four questions in the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

<!-- <a name='like-dataframe'>

</a>

<div class="alert alert-warning" markdown="1">
    
**Note**: Throughout this homework, you'll see statements like this frequently:

<blockquote>Complete the implementation of the function ____, which takes in a DataFrame <code>df</code> like <code>other_df</code> and _____.</blockquote>

What this means is that you should assume that `df` has the same number of columns as `other_df`, with the same column titles and data types, but potentially a different number of rows in a different order, with a potentially different index. You should always also assume that `df` has at least one row.

We have you implement functions like this to prevent you from hard-coding your answers to one specific dataset.

</div>
 -->
 
<div class="alert alert-danger" markdown="1">
<tt>for</tt>-loops are <strong>allowed</strong> throughout this entire homework.

</div>

To get started, run the **two** cells below, plus the cell at the top of the notebook that imports and initializes `otter`. The first cell below installs a few packages that we'll need throughout the assignment but that weren't included in the `pds` conda environment.

In [None]:
!pip install duckdb
!pip install groq

In [None]:
import duckdb
import pandas as pd
import numpy as np
import os
import re
import groq
from IPython.display import Markdown

## Question 1: LoansQL 💵

---

In this question, you'll practice writing SQL queries involving the LendingClub loans dataset from [Lecture 7](https://practicaldsc.org/resources/lectures/lec07/lec07-filled.html) and Homework 4. Run the cell below to load in our dataset as the DataFrame `loans` and clean it using the same steps from lecture.

In [None]:
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )

def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )

loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)

loans

As we did in [Lecture 10](https://practicaldsc.org/resources/lectures/lec10/lec10-filled.html#SQL), we will use the Python module `duckdb` to execute SQL queries within our Jupyter Notebook. We've included code to install and import `duckdb` at the top of this notebook.

Specifically, we will use the `run_sql` function we defined in lecture.
- `run_sql` takes in a string containing a SQL query.
- `run_sql` outputs the result of running the query, treating all DataFrames mentioned in the query as if they were SQL tables. The result it outputs (here) is a **DataFrame**.

In [None]:
def run_sql(query_str):
    # Treat an empty query or the unedited '...' placeholder as "not yet implemented".
    if query_str == ... or query_str.strip() in ('', '...'):
        raise NotImplementedError('The input passed to run_sql is empty. Update it to include your query.')
    out = duckdb.query(query_str)
    return out.to_df()

For example, the following call to `run_sql` references `loans`, a DataFrame already defined in our notebook. It returns a new DataFrame (and doesn't modify the original `loans` DataFrame).

In [None]:
run_sql('''
SELECT * FROM loans
WHERE term = 60 AND loan_amnt > 20000
''')

The above query happens to be equivalent to the following:

In [None]:
loans[(loans['term'] == 60) & (loans['loan_amnt'] > 20000)]

All of the questions below will ask you to assign your answer to a **string**. To test your code, we will call `run_sql` on the string that you define, and make sure the resulting DataFrame has the right properties. We suggest you use multi-line strings (defined by `'''` triple quotes `'''`) like in the example call to `run_sql` above.

Most of the syntax you need to answer the queries here was covered in Lecture 10. The chart at the start of the [SQL section of Lecture 10](https://practicaldsc.org/resources/lectures/lec10/lec10-filled.html#SQL) contains a nice summary of the necessary keywords.
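To see how the main keywords fit together, here is a minimal sketch on a made-up table. It deliberately uses Python's built-in `sqlite3` module rather than `duckdb` and `run_sql`, so that it runs without any of this notebook's DataFrames; the table `orders` and its columns are hypothetical and have nothing to do with `loans`. The clause order shown (`WHERE`, then `GROUP BY`, then `HAVING`, then `ORDER BY`) is standard SQL, so it should carry over to your DuckDB queries.

```python
import sqlite3

# Hypothetical toy table, unrelated to the loans data.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (city TEXT, item TEXT, price REAL)')
conn.executemany(
    'INSERT INTO orders VALUES (?, ?, ?)',
    [
        ('Ann Arbor', 'book', 20.0),
        ('Ann Arbor', 'pen', 5.0),
        ('Ann Arbor', 'lamp', 45.0),
        ('Detroit', 'book', 30.0),
        ('Detroit', 'lamp', 50.0),
        ('Detroit', 'pen', 4.0),
        ('Lansing', 'book', 25.0),
    ],
)

toy_query = '''
SELECT city, SUM(price) AS total_spent
FROM orders
WHERE item != 'pen'          -- filter rows before grouping
GROUP BY city                -- one output row per city
HAVING COUNT(*) >= 2         -- filter groups after aggregating
ORDER BY total_spent DESC    -- sort the final result
'''
toy_result = conn.execute(toy_query).fetchall()
print(toy_result)  # [('Detroit', 80.0), ('Ann Arbor', 65.0)]
```

Lansing is dropped by `HAVING` because only one of its rows survives the `WHERE` filter. Note the difference: `WHERE` filters individual rows, while `HAVING` filters whole groups.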

### Question 1.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `query_1` to a string containing a SQL query that finds **the total (sum) of all loan amounts for each loan purpose, only among loans given in Michigan (`'MI'`)**. The DataFrame that results from calling `run_sql(query_1)` should have two columns, **`'purpose'`** and **`'total_loans'`**, and should be sorted in **descending order** of `'total_loans'`.

In [None]:
query_1 = '''
...
'''
run_sql(query_1)

In [None]:
grader.check("q01_01")

### Question 1.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `query_2` to a string containing a SQL query that finds **the average credit score per state, among states _having_ at least 150 loans**. The DataFrame that results from calling `run_sql(query_2)` should have two columns, **`'state'`** and **`'average_credit'`**, and should be sorted in **increasing order** of `'average_credit'`.

Some guidance:
- Extract credit scores from the `'fico_range_low'` column in `loans`.
- One of the SQL keywords you need to use was _italicized_ in the first sentence above.

In [None]:
query_2 = '''
...
'''
run_sql(query_2)

In [None]:
grader.check("q01_02")

### Question 1.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

LendingClub uses **simple interest** to calculate loan payments. Using simple interest, the total amount due for a loan is:

$$\text{Total Amount Due} = \text{Loan Amount} \cdot \left(1 + (\text{Interest Rate} \cdot \text{Loan Length in Years}) \right)$$

For example, a \\$10,000 loan at an interest rate of 15% for 5 years would have a total amount due of \\$17,500:

$$10000 \cdot (1 + 0.15 \cdot 5) = 17500$$

Since there are 60 months in 5 years, this borrower would pay $\frac{\$17,500}{60} = \$291.67$ per month for 60 months. **Note that the percentage 15% is equivalent to the decimal 0.15.**

More generally:

$$\text{Monthly Payments} = \frac{\text{Total Amount Due}}{\text{Loan Length in Months}} = \frac{\text{Loan Amount} \cdot \left(1 + (\text{Interest Rate} \cdot \text{Loan Length in Years}) \right)}{\text{Loan Length in Months}}$$
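If you'd like to sanity-check the formula before translating it into SQL, the sketch below computes it in plain Python. The function name `monthly_payment` and its parameter names are made up for illustration, and it assumes the interest rate is passed as a decimal (0.15, not 15) and the term is given in months.

```python
def monthly_payment(loan_amount, annual_rate, term_months):
    """Monthly payment under simple interest; annual_rate is a decimal (0.15 for 15%)."""
    years = term_months / 12
    total_due = loan_amount * (1 + annual_rate * years)
    return total_due / term_months

# The worked example from above: $10,000 at 15% for 60 months.
example = monthly_payment(10000, 0.15, 60)
print(round(example, 2))  # 291.67
```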

Assign `query_3` to a string containing a SQL query that finds **the loan amount, term, interest rate, and monthly payment amount of the single loanholder with the highest monthly payments**. The DataFrame that results from calling `run_sql(query_3)` should have four columns, `'amount'`, `'term'`, `'interest'`, and `'monthly'`, and should only have a **single row**.

In [None]:
query_3 = '''
...
'''
run_sql(query_3)

In [None]:
grader.check("q01_03")

## Question 2: Practice with Regular Expressions 📕

---

Regular expressions can be tricky, and the best way to gain familiarity with them is through lots of practice.

In this question, you will work through 10 parts, **each of which requires you to write a regular expression that matches strings that satisfy certain criteria**. You will do this by – as usual – completing the implementation of a function. In Questions 2.1 through 2.9, your function will take in a string and return `True` if the string follows the pattern and `False` otherwise.

- Make sure to take a close look at the examples for each function, as they provide useful guidance for the types of strings you should and shouldn't match.
- Make sure to refer to the [Regular Expression Resources](https://practicaldsc.org/resources/#regular-expressions) on the course website. In particular, we recommend having [regex101.com](https://regex101.com/) open while working, along with the [cheat sheet](https://practicaldsc.org/resources/other/berkeley-regex-reference.pdf).
- The number of points each part is worth tells you its relative difficulty level – some are worth <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div> and some are worth <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>. If you're spending lots of time on exercises worth <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>, take a close look at the syntax from [Lecture 9](https://practicaldsc.org/resources/lectures/lec09/lec09-filled.html), as there is probably an easier way of writing the necessary pattern!
- The 10 parts are all independent, and are **not** sorted by difficulty – some of the easiest parts are in the middle or towards the end!
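One thing to keep in mind: the provided functions return `True` whenever `re.findall` finds a match **anywhere** in the string. The sketch below, which uses a made-up helper `is_match` and a toy pattern that isn't from any part of this question, shows how the anchors `^` and `$` force a pattern to account for the entire string.

```python
import re

def is_match(pattern, string):
    # Same check the homework's functions use: any match at all counts.
    return re.findall(pattern, string) != []

unanchored = r'\d{3}'    # three digits anywhere in the string
anchored = r'^\d{3}$'    # the whole string must be exactly three digits

print(is_match(unanchored, 'abc123def'))  # True
print(is_match(anchored, 'abc123def'))    # False
print(is_match(anchored, '123'))          # True
```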

### Question 2.1  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Write a regular expression that matches strings that have `'['` as the third character and `']'` as the sixth character. Example behavior is given below.

```python
>>> match_1("abcde]")
False

>>> match_1("ab[cde")
False

>>> match_1("ab[cd]ef")
True
```

In [None]:
def match_1(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_01")

### Question 2.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Write a regular expression that matches strings that are phone numbers that start with `'(734)'` and follow the format `'(xxx) xxx-xxxx'` (`'x'` represents a digit). Example behavior is given below.

```python
>>> match_2("(734) 456-7890")
True

>>> match_2("(123) 456-7890")
False

>>> match_2("(734) 456-7890b")
False
```

Note that there is a space between `'(xxx)'` and `'xxx-xxxx'`!

In [None]:
def match_2(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_02")

### Question 2.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Write a regular expression that matches strings that:
- are between 6 and 10 characters long (inclusive),
- contain only alphanumeric characters, whitespace and `'?'`, and
- end with `'?'`.

Example behavior is given below.

```python
>>> match_3('qw?ertsd?')
True

>>> match_3("ab   c ?")
True

>>> match_3(" adf!qes ?")
False

>>> match_3('wwwWW .? ')
False
```

Note that `'?'` is a special character.
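As a quick illustration of what "special" means here, the sketch below uses toy strings that are unrelated to the examples above. It shows `?` acting as a quantifier when left unescaped, and two ways of matching a literal question mark.

```python
import re

text = 'ready? set? go'

# Escaped: \? matches a literal question mark.
literal_marks = re.findall(r'\?', text)

# Inside a character class, ? is literal without a backslash.
class_marks = re.findall(r'[?]', text)

# Unescaped: ? is a quantifier meaning "zero or one of the previous token".
optional_u = re.findall(r'colou?r', 'color colour')

print(literal_marks)  # ['?', '?']
print(class_marks)    # ['?', '?']
print(optional_u)     # ['color', 'colour']
```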

In [None]:
def match_3(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_03")

### Question 2.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Write a regular expression that matches strings with exactly two `'$'`, one of which is at the start of the string, such that:
- the characters between the two `'$'` can be anything (including nothing) except the lowercase letters `'a'`, `'b'`, and `'c'` (and `'$'`), and
- the characters after the second `'$'` can only be the **lowercase or uppercase** letters `'a'`/`'A'`, `'b'`/`'B'`, and `'c'`/`'C'`, with every `'a'`/`'A'` before every `'b'`/`'B'`, and every `'b'`/`'B'` before every `'c'`/`'C'`. There **must be** at least one `'a'` or `'A'`, at least one `'b'` or `'B'`, and at least one `'c'` or `'C'`.

Example behavior is given below.

```python
>>> match_4("$!@#$aABc")
True

>>> match_4('$qw!!  $aaBC')
True

>>> match_4('$a$aABc')
False

>>> match_4('$!@$')
False
```

In [None]:
def match_4(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_04")

### Question 2.5 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Write a regular expression that matches strings that represent valid Python file names, including the extension. For simplicity, assume that file names only contain letters, numbers, and underscores. Example behavior is given below.

```python
>>> match_5("eecs398.py")
True

>>> match_5('eecs398_.py')
True

>>> match_5("here is a Python file eecs398.py")
False

>>> match_5("eecs398+.py")
False
```

In [None]:
def match_5(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_05")

### Question 2.6 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Write a regular expression that matches strings that:
- are made up of only lowercase letters and exactly one underscore (`'_'`), and
- have at least one lowercase letter on both sides of the underscore.

Example behavior is given below.

```python
>>> match_6("aab_cbbbc")
True

>>> match_6("zebra_d")
True

>>> match_6("aab_Abbbc")
False

>>> match_6("zebra_")
False
```

In [None]:
def match_6(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_06")

### Question 2.7 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Write a regular expression that matches strings that start with and end with an underscore (`'_'`). Example behavior is given below.

```python
>>> match_7("_abc_")
True

>>> match_7("_ZeBr@45Din000!!!\b_")
True

>>> match_7("abc")
False

>>> match_7("_ncde")
False

>>> match_7("_") # Need at least two underscores!
False
```

In [None]:
def match_7(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_07")

### Question 2.8 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Apple serial numbers are strings of length 1 or more that are made up of any characters, other than
- the uppercase letter `'O'`, 
- the lowercase letter `'i'`, and 
- the number `'1'`.

Write a regular expression that matches strings that are valid Apple serial numbers. Example behavior is given below.

```python
>>> match_8('ASDJKL9380JKAL')
True

>>> match_8("ASJDKLFK0ASDo!!!!!!! !!!!!!!!!")
True

>>> match_8('iPhone 10')
False

>>> match_8("hi ASDJKL9380JKAL")
False
```

In [None]:
def match_8(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_08")

### Question 2.9 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Suppose DataID numbers are formatted as `'SC-NN-CCC-NNNN'`, where 
- SC represents state code in uppercase (e.g. `'MI'`),
- NN represents a number with 2 digits (e.g. `'98'`),
- CCC represents a three letter city code in uppercase (e.g. `'DTW'`), and
- NNNN represents a number with 4 digits (e.g. `'1998'`).

Write a regular expression that matches strings that are DataID numbers corresponding to the cities of `'DTW'` (Detroit) or `'LAN'` (Lansing), or the state of `'TX'` (Texas). Assume that there is only one city named `'DTW'` and only one city named `'LAN'`.

Example behavior is given below.

```python
>>> match_9('MI-32-LAN-1232')
True

>>> match_9('TX-32-DTW-1232')
True

# Lansing is not in California!
>>> match_9('CA-32-LAN-1232')
False

>>> match_9('mI-32-LAN-1232')
False
```

In [None]:
def match_9(string):
    pattern = ...

    # Do not edit the following code.
    return re.findall(pattern, string) != []

In [None]:
grader.check("q02_09")

### Question 2.10 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

In this final part, your task involves more than writing a single regular expression.

Complete the implementation of the function `match_10`, which takes in a string (`string`) and:
- converts the string to lowercase,
- removes all non-alphanumeric characters (i.e. everything that is not in the `\w` character class) as well as the letter `'a'`, and
- returns a list of every **non-overlapping** three-character substring in the remaining string, starting from the beginning of the string.
   
Example behavior is given below.

```python
>>> match_10('Ab..DEF')
['bde']

>>> match_10('FINALS are COMING A')
['fin', 'lsr', 'eco', 'min']

>>> match_10('h9i9hOWW44areY@')
['h9i', '9ho', 'ww4', '4re']
```

Here's how `match_10` should process `'Ab..DEF'`:

1. Convert to lowercase: `'ab..def'`.
2. Remove non-alphanumeric characters and the letter `'a'`: `'bdef'`.
3. Starting from the beginning of the string, there is only a single non-overlapping three character substring: `'bde'`. Hence, we return `['bde']`.

Some guidance: 
- Perform your operations in the exact order described above, otherwise your code may not pass all the tests.
- Don't use a `for`-loop. You'll need to use `re.sub` in addition to `re.findall`.
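If you haven't used these two functions together before, here is a small sketch of each one's behavior on toy inputs. The task shown is a different one from `match_10`: `re.sub` replaces every match of a pattern, and `re.findall` returns matches that never overlap.

```python
import re

raw = 'a1b22c333d'

# re.sub replaces every match of the pattern; here, each digit is deleted.
letters_only = re.sub(r'\d', '', raw)

# re.findall scans left to right and returns non-overlapping matches,
# so leftover characters that can't complete a match are dropped.
pairs = re.findall(r'..', 'abcde')

print(letters_only)  # abcd
print(pairs)         # ['ab', 'cd']
```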

In [None]:
def match_10(string):
    ...

# Feel free to change the input below to test out your implementation of match_10.
match_10('FINALS are COMING A')

In [None]:
grader.check("q02_10")

## Question 3: Capture Groups 📡

---

The dataset stored in `data/messy.txt` contains personal information from a fictional website, scraped by a user from the site's web server logs. Within this dataset, there are four fields that are of interest to you:
1. Social Security Numbers
1. Bitcoin Addresses
1. Email Addresses 
1. Street Addresses

Your job is to use `re.findall` to extract the relevant pieces of information from `messy.txt`. **Since this data is very messy, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `'@'` for emails) should help ensure correctness!** As usual, your functions will be tested on a sample of the file `messy.txt`.

Note that there are multiple "delimiters" (separators) in use in the file; there are few enough of them that you can safely determine what they are. **Before attempting any of the parts here, open `data/messy.txt` in your favorite text editor and explore!**
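As a reminder of how capture groups change what `re.findall` returns, here is a minimal sketch. The string `log_line` and its `user:`/`zip:` format are made up for illustration, so don't assume `messy.txt` uses the same labels or delimiters.

```python
import re

log_line = 'user:ana,zip:48109|user:raj,zip:null|user:li,zip:90210'

# Without a group, findall returns the whole match, label included.
whole = re.findall(r'zip:\d{5}', log_line)

# With one capture group, findall returns only the text inside the parentheses.
zips = re.findall(r'zip:(\d{5})', log_line)

# With two groups, findall returns a list of tuples, one per match.
pairs = re.findall(r'user:(\w+),zip:(\d{5})', log_line)

print(whole)  # ['zip:48109', 'zip:90210']
print(zips)   # ['48109', '90210']
print(pairs)  # [('ana', '48109'), ('li', '90210')]
```

Because the pattern requires five digits after `zip:`, the record with `null` is skipped without any extra filtering.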

### Question 3.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `extract_ssns`, which takes in a string (`string`) containing the contents of a server log file and returns the Social Security Numbers in the file as a list. Example behavior is given below.

```python
>>> extract_ssns('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
['423-01-9575']

>>> out = extract_ssns(open('data/messy.txt', encoding='utf8').read())
>>> out[0]
'380-09-9403'
```

Some guidance:
- For our purposes, an SSN is a string of the form 3 digits-2 digits-4 digits.
- The returned list should not contain any empty strings or the string `'null'`.

In [None]:
def extract_ssns(string):
    ...
    
# To test your work, first run:
extract_ssns('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
# Then, once that works, uncomment:
# extract_ssns(open('data/messy.txt', encoding='utf8').read())

In [None]:
grader.check("q03_01")

### Question 3.2  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `extract_bitcoin_addresses`, which takes in a string (`string`) containing the contents of a server log file and returns the Bitcoin addresses in the file as a list. Example behavior is given below. 

```python
>>> extract_bitcoin_addresses('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
['1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2']

>>> out = extract_bitcoin_addresses(open('data/messy.txt', encoding='utf8').read())
>>> out[0]
'18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv'
```

Some guidance:
- Assume Bitcoin addresses are alphanumeric strings.
- The returned list should not contain any empty strings or the string `'null'`.

In [None]:
def extract_bitcoin_addresses(string):
    ...
    
# To test your work, first run:
extract_bitcoin_addresses('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
# Then, once that works, uncomment:
# extract_bitcoin_addresses(open('data/messy.txt', encoding='utf8').read())

In [None]:
grader.check("q03_02")

### Question 3.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `extract_emails`, which takes in a string (`string`) containing the contents of a server log file and returns the email addresses in the file as a list. Example behavior is given below.

```python
>>> extract_emails('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
['test@test55.umich.edu']

>>> out = extract_emails(open('data/messy.txt', encoding='utf8').read())
>>> out[0]
'dottewell0@gnu.org'
```

Some guidance:
- Assume that the usernames and domain names in an email address are alphanumeric. Domain names don't need to end in `'.com'` – assume that all parts of a domain name, including the very end, can be made up of any alphanumeric characters.
- The returned list should not contain any empty strings or the string `'null'`. (It likely won't by default, but we've included this instruction in all four parts of this question.)

In [None]:
def extract_emails(string):
    ...
    
# To test your work, first run:
extract_emails('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
# Then, once that works, uncomment:
# extract_emails(open('data/messy.txt', encoding='utf8').read())

In [None]:
grader.check("q03_03")

### Question 3.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `extract_street_addresses`, which takes in a string (`string`) containing the contents of a server log file and returns the street addresses in the file as a list. Example behavior is given below.

```python
>>> extract_street_addresses('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
['530 High Street']

>>> out = extract_street_addresses(open('data/messy.txt', encoding='utf8').read())
>>> out[0]
'814 Monterey Court'
```

As before, the returned list should not contain any empty strings or the string `'null'`.

In [None]:
def extract_street_addresses(string):
    ...

# To test your work, first run:
extract_street_addresses('bitcoin:1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2$jk%^3\t,test@test55.umich.edu,lkj5r%ji|ssn:423-01-9575,530 High Street')
# Then, once that works, uncomment:
# extract_street_addresses(open('data/messy.txt', encoding='utf8').read())

In [None]:
grader.check("q03_04")

## Question 4: GPTEECS 🤖

---

### Overview

Large Language Models (LLMs), like GPT-4 by OpenAI, Claude by Anthropic, or Llama by Meta, are statistical models that were trained on massive datasets for the purpose of generating useful new text. [ChatGPT](https://chat.openai.com) and other similar chat interfaces make calls to an LLM API under the hood, and show you the results in a text message-like format.

Open ChatGPT or your favorite other LLM chat interface, and ask it:

> What's the difference between the late submission policy in EECS 467 and EECS 492?

Until very recently (when ChatGPT started being able to search the internet), ChatGPT would tell you that it doesn't know what EECS 467 and EECS 492 are. And even if it did give you an answer, it's not necessarily clear whether it pulled the answer from a reliable source, or whether it's still true today (it may have found syllabi online from many years ago, and could be hallucinating). 

### Retrieval-Augmented Generation (RAG)

A solution to this issue is **Retrieval-Augmented Generation (RAG)**. **In this question, we will use RAG to implement GPTEECS, a chat interface designed to answer questions about EECS syllabi.** Here's the general idea behind RAG, and how we'll use it in this question:

1. We want to implement a chat bot that can answer questions about something specific.<br><small>**Here**, we want our chat bot to answer questions about EECS classes' syllabi.</small>
1. To do so, we download and store documents that contain the relevant context that we wish our LLM knew about.<br><small>**Here**, we'll download the syllabi of various EECS classes and store them as `.txt` files. We've already done this for you.</small>
1. Then, when the user asks a question – called a **query** – we determine which of our locally-stored documents are most relevant in answering their question.<br><small>**Here**, when a user asks a question about EECS class(es), we'll determine which syllabus documents are most likely to have the answer.</small>
1. Once we find the most relevant documents, we send the user's query, **along with** the most relevant documents, to our language model, allowing it to find the answer for us with the context it needs.

<center><img src="imgs/retrieval-augmented-generation.png" width=700><br>(<a href="https://towhee.io/tasks/detail/pipeline/retrieval-augmented-generation">image source</a>)</center>
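The final step, sending the query along with the retrieved documents, mostly comes down to string assembly. The sketch below shows one possible way to do it; the function `build_rag_prompt`, the prompt wording, and the document names are all hypothetical and are not the format this homework requires.

```python
def build_rag_prompt(query, documents):
    """Combine retrieved documents and the user's query into one prompt string."""
    context = '\n\n'.join(
        f'--- {name} ---\n{text}' for name, text in documents.items()
    )
    return (
        'Answer the question using only the context below.\n\n'
        f'{context}\n\n'
        f'Question: {query}'
    )

# Hypothetical retrieved documents (the retrieval step would pick these).
retrieved = {
    'course_a.txt': 'Late work loses 10% per day.',
    'course_b.txt': 'No late work is accepted.',
}
prompt = build_rag_prompt('Which course accepts late work?', retrieved)
print(prompt)
```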

RAG enables organizations to create customized chat interfaces that are better equipped to answer questions about the organization than an out-of-the-box language model. For instance, if you operated a store and wanted an AI-powered customer support chat, you may use RAG to create a chat bot that knows about your store's catalog, return policies, etc. ChatGPT even allows you to make custom GPTs [yourself](https://openai.com/index/introducing-gpts/) by uploading customized knowledge bases, and these (likely) use a process similar to RAG.

### FAQs

- **How do we determine which documents are most relevant to the user's query?** Here, we'll implement this using TF-IDF and cosine similarity, as we've seen in [Lecture 12](https://practicaldsc.org/resources/lectures/lec12/lec12-filled.html)! In practice, more sophisticated, state-of-the-art techniques for converting text to numbers are used (if you're curious, look into "word embeddings").
- **Why not just send all of the documents to our language model, instead of finding the documents that are most relevant?** LLMs have a [context window](https://www.hopsworks.ai/dictionary/context-window-for-llms), which is a limit on the length of the input query they can take in. If your query is too long, an LLM may not be able to process it. (And, if it includes unnecessary information, it can be hard for the LLM to give you an accurate response.)
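To make the first answer above concrete, here is a rough, standard-library-only sketch of ranking documents by TF-IDF and cosine similarity. It rests on several assumptions: text is tokenized by lowercasing and splitting on whitespace, term frequency is a word's count divided by the document's length, and inverse document frequency is the natural log of (number of documents / number of documents containing the word). The definitions from lecture may differ in details such as the logarithm base, and the function names here are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector (list) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for words in tokenized for word in words))
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(w in words for words in tokenized) for w in vocab}
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([
            (counts[w] / len(words)) * math.log(n_docs / df[w])
            for w in vocab
        ])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, not the real syllabi.
docs = [
    'exams and homework policy',
    'homework late policy',
    'robots and sensors',
]
vocab, vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])
print(sim_01 > sim_02)  # True
```

The first two toy documents share `'homework'` and `'policy'`, while the first and third share only `'and'`, so the first pair ends up more similar.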

### Your Task
The folder `data/syllabi` contains syllabi for several EECS classes. These documents together comprise our **corpus**.

In [1]:
!ls data/syllabi

183.txt 280.txt 373.txt 390.txt 465.txt 471.txt 481.txt 484.txt 489.txt 493.txt
203.txt 281.txt 376.txt 445.txt 467.txt 473.txt 482.txt 485.txt 490.txt 494.txt
270.txt 370.txt 388.txt 453.txt 470.txt 475.txt 483.txt 487.txt 492.txt


Shortly, using the ideas from Lecture 12, you will develop a working implementation of the following function:

```python
>>> top_n_similar_documents('C++ programming and systems design', 4, bow)
['482.txt', '473.txt', '370.txt', '281.txt']
```

And even cooler, you'll implement a function that can fully answer questions, like:

```python
>>> ask_gpteecs("I really want to learn theoretical probability and math, what should I take?")
'Based on your interest in theoretical probability and math, I recommend taking EECS 445: Introduction to Machine Learning. This course covers the foundational algorithms and "tricks of the trade" in machine learning, including regression, classification...'
```

### Question 4.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

First, let's figure out how to call a Large Language Model directly from our notebook.

OpenAI does have a Python API, but it's relatively limited on the free plan. Instead, we'll use tools from [Groq](https://groq.com/). Groq is a hardware company that designs processors for running LLMs efficiently, and it offers fast, free access to open-source LLMs through an API. We'll use the [Groq API](https://console.groq.com/docs/quickstart) to make calls to Meta's Llama 3 model. (As mentioned at the start of this section, Llama is Meta's competitor to GPT. So technically, we're not implementing GPTEECS, but EECSLlama?)

Go [**here**](https://console.groq.com/docs/quickstart) and create a Groq API key. Then, complete the implementation of `query_llama`, a function that takes in a string (`query_string`) and returns the text response that results from passing `query_string` to Groq. The function has largely been implemented for you; most of what you need to do is create an API key and put it in the right place below.

(Yes, confusingly, we're using the word "query" in this homework to refer to slightly different, but related, ideas: in SQL, queries are used to extract information from a **database**, and here, our queries pull information from an API.)
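
If you'd rather not paste your key directly into the notebook while experimenting locally (for example, to avoid accidentally committing it to a repository), one option is to read it from an environment variable. The sketch below assumes a variable named `GROQ_API_KEY`; both that name and the helper `load_api_key` are hypothetical and not something this homework requires. Keep in mind that an autograder running elsewhere would presumably not see your local environment variables, so treat this as a pattern for local use only.

```python
import os

def load_api_key(env_var="GROQ_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Demo with a placeholder value; a real key would be set outside the notebook.
os.environ["GROQ_API_KEY"] = "placeholder-not-a-real-key"
print(load_api_key())
```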

In [None]:
def query_llama(query_string):
    client = groq.Groq(
        api_key= ...
    )
    
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": query_string
            }
        ],
        model="llama3-8b-8192",
        # temperature=0 # Try uncommenting this and running the call to query_llama below many times. What do you notice? Recomment it out afterwards.
    )

    return chat_completion.choices[0].message.content

# Feel free to change the input below to test out your implementation of query_llama.
# The Markdown function behaves like the print function,
# but renders text formatting (e.g. bolding, bullet points) when the output from Llama
# contains these elements.
Markdown(query_llama('Tell me a joke about data science'))

In [None]:
grader.check("q04_01")

Now, we can call `query_llama`! Run the cell below.

In [None]:
Markdown(query_llama('Tell me about EECS 485 at Michigan, but keep it concise: just one paragraph.'))

To experiment:
- Run the cell many times. You'll notice that the response is very different every time – and it's almost never accurate! (Click [here](https://eecs485.org) to see what EECS 485 is actually about.)
- Uncomment the line that says `temperature=0` in your definition of `query_llama`, and then run the above cell many times again. What do you notice now? (To see what this argument is doing, go to the [documentation](https://console.groq.com/docs/api-reference#chat-create) and search for "temperature".) Recomment out the line before proceeding.
- If you remove "but keep it concise: just one paragraph.", what do you notice?
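
To build some intuition for the `temperature` experiment: language models choose each next token by sampling from a probability distribution, and temperature is commonly described as rescaling the model's raw scores before they're turned into probabilities. The sketch below is a generic illustration of that idea with made-up scores. It is not Groq's or Llama's actual implementation, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, after dividing by a positive temperature."""
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens

for t in [1.0, 0.5, 0.05]:
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature shrinks toward 0, nearly all of the probability lands on the highest-scoring option, which is one way to think about why responses become more repeatable. (This helper divides by the temperature, so it can't be called with exactly 0.)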

Now we have a way of passing queries to a Large Language Model and getting back results. Right now, it's not knowledgeable enough to answer questions about EECS classes. Soon, we'll change that.

We'll get back to using `query_llama` in the final part of this question. For now, we need to switch our attention to implementing RAG – that is, being able to find the syllabus documents that are most similar to our input query. Once we implement it, when we pass our (new) function the input `'Tell me about EECS 485 at Michigan, but keep it concise: just one paragraph.'`, it'll provide accurate, up-to-date information about EECS 485, since we'll send the syllabus for EECS 485 to Llama along with the original input. **Keep this goal in mind. The next few parts may seem unrelated, but they all come together at the end!**

### Question 4.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

A **token** is an alphanumeric string. In Lecture 12, we referred to tokens as "terms". Before computing any numbers, we need to find the terms in each syllabus, i.e. we need to **tokenize** each syllabus.

Complete the implementation of the function `tokenize`, which takes in a string (`string`) of text and returns a list containing all of the tokens in `string`. Convert all characters to lowercase before extracting tokens.

Example behavior is given below.

```python
>>> tokenize("EECS 398-003 Practical Data Science's about data management and applied machine learning.")
['eecs',
 '398',
 '003',
 'practical',
 'data',
 'science',
 's',
 'about',
 'data',
 'management',
 'and',
 'applied',
 'machine',
 'learning']

>>> tokenize(open('data/syllabi/485.txt').read())[:20]
['eecs',
 '485',
 'web',
 'systems',
 'syllabus',
 'the',
 'university',
 'of',
 'michigan',
 'fall',
 '2024',
 'a',
 'holistic',
 'course',
 'of',
 'modern',
 'web',
 'systems',
 'and',
 'technologies']
```

Note that this part is only worth 1 point, so it shouldn't take very long!

In [None]:
def tokenize(string):
    ...

# Feel free to change the input below to test out your implementation of tokenize.
tokenize("EECS 398-003 Practical Data Science's about data management and applied machine learning.")

In [None]:
grader.check("q04_02")

Before we move on to Question 4.3, it's worth mentioning that in practice, we'd put a bit more care into tokenizing our documents. For one, we might **lemmatize** our tokens, which would allow us to group words like `'eating'`, `'ate'`, and `'eats'` all to `'eat'`. We've omitted such steps here for simplicity.
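
As a rough picture of what lemmatization does, the sketch below maps a few inflected forms to a base form using a tiny hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are made up for illustration; real lemmatizers rely on much larger dictionaries and on information about each word's part of speech.

```python
# A toy lookup table: inflected form -> base form. Not a real lemmatizer.
TOY_LEMMAS = {'eating': 'eat', 'ate': 'eat', 'eats': 'eat'}

def toy_lemmatize(tokens):
    # Replace a token if it's in the table; otherwise, leave it unchanged.
    return [TOY_LEMMAS.get(token, token) for token in tokens]

print(toy_lemmatize(['she', 'ate', 'while', 'he', 'eats']))
```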

### Question 4.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `files_to_bow`, which takes in a string describing the **path** to a folder with syllabus files (`path`) and returns the corresponding **bag of words matrix** as a DataFrame, with:
- One row per file, indexed by the file name. The DataFrame should be sorted by the index in ascending order.
- One column per unique word (token) among all syllabi (i.e. across the entire corpus). The order of the columns in the DataFrame does not matter.
- Values corresponding to the number of occurrences of each word in each file.

Example behavior is given below.

```python
>>> out = files_to_bow('data/syllabi')
>>> out.shape
(29, 4306)

>>> out.loc['280.txt', 'computer']
12
```

Some guidance:
- You must implement all of the steps by hand, i.e. no using `sklearn`'s `CountVectorizer`.
- To find all of the files in a folder, use `os.listdir` (we've already imported `os`). Make sure to verify that the files you're processing end in `.txt` – there may be other files in `path` that aren't valid syllabi, and we don't want to process those.
- Our solution involved creating an intermediate helper function that read in the necessary files, tokenized them, and stored them in an appropriate data structure. You can design your implementation however you'd like, but it's a good idea to break it down into smaller pieces.
- Since we've already tokenized each file, it's not necessary to use regular expressions to count the number of occurrences of particular words in each document. Look into the list `count` method, which you can use in conjunction with a `for`-loop or the Series `apply` method. Our solution follows the work in Lecture 12 closely.
- Our solution only takes ~5 seconds to run on `files_to_bow('data/syllabi')`. Make sure yours is similarly quick.
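
Two of the building blocks mentioned above, filtering a folder's contents by extension and counting occurrences with the list `count` method, can be tried out on throwaway data first. The sketch below creates a temporary folder with made-up files (the file names and contents are invented), and it splits on whitespace only as a stand-in for real tokenization.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Made-up files: two "syllabi" and one file we should ignore.
    contents = {
        '101.txt': 'data data science',
        '102.txt': 'science of systems',
        'notes.md': 'not a syllabus',
    }
    for name, text in contents.items():
        with open(os.path.join(folder, name), 'w') as f:
            f.write(text)

    # Keep only .txt files, in sorted order.
    txt_files = sorted(name for name in os.listdir(folder) if name.endswith('.txt'))

    # Whitespace splitting stands in for a real tokenizer here.
    tokens_by_file = {}
    for name in txt_files:
        with open(os.path.join(folder, name)) as f:
            tokens_by_file[name] = f.read().split()

# list.count gives the number of occurrences of one word in one file.
data_counts = {name: tokens.count('data') for name, tokens in tokens_by_file.items()}
print(txt_files)
print(data_counts)
```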

In [None]:
def files_to_bow(path):
    ...

# Uncomment the line below once you've implemented files_to_bow.
# files_to_bow('data/syllabi')

In [None]:
grader.check("q04_03")

Since we'll need it in all of our future calculations, we'll create a globally-defined instance of `bow` below. **Make sure that throughout the rest of your notebook, `bow` is defined exactly as below!**

In [None]:
bow = files_to_bow('data/syllabi')
bow.head()

### Question 4.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `bow_to_tfidf`, which takes in a bag of words matrix (`bow`) returned by `files_to_bow`. `bow_to_tfidf` should return a DataFrame with the same row labels and column labels as `bow`, but with all values converted to TF-IDFs – that is, the outputted DataFrame should contain the TF-IDF of every word in every file.

Example behavior is given below.

```python
# Here, we're referring to the globally-defined bow.
>>> out = bow_to_tfidf(bow)
>>> out.shape == bow.shape
True

>>> out.loc['485.txt', 'science']
0.0005272966438261625
```

Some guidance:
- Follow our logic from Lecture 12 to convert `bow` to a TF-IDF matrix. Your implementation here should be relatively short (< 10 lines).
- While not strictly required (in that we won't test it), we recommend you implement `compute_idfs`, which takes in a DataFrame like `bow` and returns a **Series** containing the inverse document frequency (IDF) of each word in `bow`. Not only will this help compartmentalize your work for this question, but it'll make your life much easier in Question 4.5, when you'll again need to use the IDFs of every word in the corpus.
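
For reference, here is the TF-IDF arithmetic on a tiny made-up corpus, written with plain dictionaries rather than DataFrames. It assumes that term frequency is the proportion of a document's tokens equal to the word, and that IDF is the natural log of the total number of documents over the number of documents containing the word. The corpus and variable names are invented for illustration, and a DataFrame-based version would be organized differently.

```python
import math

# Made-up corpus: document name -> list of tokens.
corpus = {
    'a.txt': ['data', 'science', 'data'],
    'b.txt': ['computer', 'science'],
    'c.txt': ['data', 'systems'],
}

vocabulary = sorted({token for tokens in corpus.values() for token in tokens})

# IDF: log(total # of documents / # of documents containing the word).
idf = {
    word: math.log(len(corpus) / sum(word in tokens for tokens in corpus.values()))
    for word in vocabulary
}

# TF-IDF: (proportion of the document's tokens equal to the word) * IDF.
tfidf = {
    name: {word: tokens.count(word) / len(tokens) * idf[word] for word in vocabulary}
    for name, tokens in corpus.items()
}

print(idf)
print(tfidf['a.txt'])
```

Note that the IDFs depend only on the corpus, not on any one document, and that a word appearing in every document would get an IDF of $\log(1) = 0$.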

In [None]:
def compute_idfs(bow):
    # Not required, but suggested!
    ...

def bow_to_tfidf(bow):
    ...

# Uncomment the line below once you've implemented bow_to_tfidf.
# bow_to_tfidf(bow).head()

In [None]:
grader.check("q04_04")

Before we move forward, it's worth stopping and looking at what we've already accomplished. Run the cell below to see the 5 words with the highest TF-IDFs in each syllabus.

In [None]:
def five_largest(row):
    return ', '.join(row.index[row.argsort()][-5:])

bow_to_tfidf(bow).apply(five_largest, axis=1)

Compare that to the 5 words with the highest frequencies in each syllabus:

In [None]:
bow.apply(five_largest, axis=1)

Hopefully, the value of TF-IDF is clear, though it's also clear that TF-IDF isn't perfect at summarizing documents. But, as we'll soon see, it'll serve our purposes well!

Before you move to Question 4.5, there's one piece of syntax that you'll find useful: the Series `reindex` method. Here's an example of how it works:

In [None]:
things = pd.Series({'a': 2, 'b': 5, 'c': 1})
things

In [None]:
stuff = pd.Series({'a': 'hello', 'b': 'hi', 'x': 9})
stuff

In [None]:
things.reindex(stuff.index)

In [None]:
things.reindex(stuff.index).fillna(0)

### Question 4.5 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `new_query_to_tfidf`, which takes in a string (`query_string`) and a bag of words matrix (`bow`) and returns a Series such that:
- The index contains the same labels as `bow`'s columns (meaning that if `bow` has 4306 columns, the outputted Series should have 4306 elements).
- The values contain the TF-IDF of each word, using `query_string` to compute TFs and **the entire corpus of syllabi (not including the new query)** to compute IDFs.

Example behavior is given below.

```python
>>> out = new_query_to_tfidf('yooo I am very very very interested in a practical machine learning course', bow)
>>> out.shape
(4306,)

# Most of the values in out are 0, since
# "yooo I am very very very interested in a practical machine learning course"
# doesn't contain most of the 4306 words in bow.
# Since 'yooo' is not in bow.columns, it doesn't appear in the index of out, either.
>>> out[out > 0]
machine       0.090005
interested    0.174514
very          0.328012
am            0.152385
i             0.032527
practical     0.109337
learning      0.050711
dtype: float64
```

To be clear, the TF-IDF of a word $t$ in a new query string $q$ is:

$$\text{tfidf}(t, q) = \underbrace{\frac{\text{\# of occurrences of $t$ in $q$}}{\text{total \# of tokens in $q$}}}_{\text{computed using } q \: (\texttt{query\_string})} \cdot \underbrace{\log \left(\frac{\text{total \# of syllabi}}{\text{\# of syllabi in which $t$ appears}} \right)}_{\text{computed solely using \texttt{bow}}}$$

Note that this means that the IDFs of each word have nothing to do with the `query_string` that is passed in. This is precisely why we suggested you implement `compute_idfs(bow)` in the previous part – because it would help your implementation of `bow_to_tfidf`, and also help your implementation of `new_query_to_tfidf`.
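
As a sanity check on the formula, parts of the example output above can be reproduced by hand. The example query has 13 tokens, and `'very'` appears 3 times, so its TF is $3/13$. The printed value of 0.328012 is then consistent with `'very'` appearing in 7 of the 29 syllabi and with a natural logarithm; likewise, 0.032527 for `'i'` is consistent with 19 of 29. Those document counts (7 and 19) are inferred from the printed numbers, not read from the data, so treat them as assumptions.

```python
import math

n_query_tokens = 13   # tokens in the example query
n_syllabi = 29        # number of rows in bow

# The document counts (7 and 19) are inferred from the printed example,
# not verified against the actual syllabi.
very_tfidf = (3 / n_query_tokens) * math.log(n_syllabi / 7)
i_tfidf = (1 / n_query_tokens) * math.log(n_syllabi / 19)

print(round(very_tfidf, 6), round(i_tfidf, 6))
```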

Some additional guidance:
- This function should only take a few lines to implement, but requires combining several steps, going all the way back to Question 4.2. Think about how the `reindex` method might be useful.
- In the function signature below, you'll see `new_query_to_tfidf(query_string, bow=bow)`. `bow=bow` sets the default value of the `bow` argument to the globally-defined value of `bow`, meaning if we only pass one argument (`query_string`) to `new_query_to_tfidf`, it will automatically use the global `bow`. It's important for our function to be able to take in bag of words matrices other than our globally-defined `bow`, in case we want to use it on a different corpus of documents. But, most of the time we will call it on the global `bow`, so this is done for convenience.
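
One subtlety of `bow=bow` worth knowing: Python evaluates default argument values once, when the `def` statement runs, not each time the function is called. The toy example below (with made-up names `corpus` and `corpus_size`) shows that reassigning the global name afterward does not change the default. In this notebook, that would mean that if you redefine `bow`, you'd need to rerun the cells that define functions with `bow=bow` for them to pick up the new value.

```python
corpus = ['a.txt', 'b.txt']

def corpus_size(corpus=corpus):
    # The default is bound to the list that `corpus` referred to at definition time.
    return len(corpus)

corpus = ['a.txt', 'b.txt', 'c.txt']  # reassign the global name

print(corpus_size())        # uses the original two-element list
print(corpus_size(corpus))  # explicitly passes the new three-element list
```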

In [None]:
def new_query_to_tfidf(query_string, bow=bow):
    ...

# Feel free to change the input below to test out your implementation of new_query_to_tfidf.
out = new_query_to_tfidf('yooo I am very very very interested in a practical machine learning course')
out[out > 0]

In [None]:
grader.check("q04_05")

Let's take stock of what we have so far.
- We have the TF-IDFs of every word in every document in our corpus. This means that we have a **vector representation** of each syllabus.
- We have a function that can take any query string and turn it into a **vector** of TF-IDF scores, as well.

Now, we can use techniques from Lecture 12 – specifically, cosine similarity – to find the syllabi that are most similar (and, hence, most relevant) to our query string!

### Question 4.6 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `top_n_similar_documents`, which takes in a string (`query_string`), a positive integer `n`, and a bag of words matrix (`bow`) and returns a list containing the names of the `n` most similar documents to `query_string`. 

Use cosine similarity to measure the similarity between two vectors; you can implement cosine similarity however you'd like. Remember that document names are stored in the index of `bow`. The documents in the returned list should be sorted in **decreasing order of similarity**.

Example behavior is given below.

```python
>>> top_n_similar_documents('yooo I am very very very interested in a practical machine learning course', 3, bow)
['467.txt', '445.txt', '453.txt']

>>> top_n_similar_documents('C++ programming and systems design', 4, bow)
['482.txt', '473.txt', '370.txt', '281.txt']
```
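
As a reminder of the underlying computation, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to made-up three-dimensional vectors and ranks them against a made-up query. The vectors, file names, and the helper `cosine_similarity` are invented for illustration, and a pandas or NumPy version would be organized differently.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "document" vectors and a made-up query vector.
documents = {
    'x.txt': [1.0, 0.0, 0.0],
    'y.txt': [1.0, 1.0, 0.0],
    'z.txt': [0.0, 0.0, 1.0],
}
query = [2.0, 0.0, 0.0]

similarities = {name: cosine_similarity(query, vec) for name, vec in documents.items()}

# Sort names by similarity, largest first, and keep the top 2.
top_2 = sorted(similarities, key=similarities.get, reverse=True)[:2]
print(top_2)
```

This helper divides by the vector lengths, so it would fail on an all-zero vector (for instance, a query that shares no words with the corpus); that case needs separate handling.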

In [None]:
def top_n_similar_documents(query_string, n, bow=bow):
    ...

# Feel free to change the inputs below to test out your implementation of top_n_similar_documents.
top_n_similar_documents('yooo I am very very very interested in a practical machine learning course', 3)

In [None]:
grader.check("q04_06")

Awesome! You've implemented the retrieval step in RAG. That is, given a query, you're able to automatically find the most relevant documents in our "knowledge database" for answering that query.

It's time for the final step: passing a `query_string`, along with the contents of the most relevant documents, to a Large Language Model (which we already learned how to access, using `query_llama`).

### Question 4.7 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `ask_gpteecs`, which takes in a string (`query_string`) containing a question about EECS courses, a positive integer `n`, and a bag of words matrix `bow`. `ask_gpteecs` should return a **string** containing the result of:

- querying Llama 3 using `query_llama` from Question 4.1,
- where the query contains **both** the contents of `query_string` and
- the **top `n`** most similar syllabus documents,
- stitched together in a way that you deem appropriate.

Here's what we mean by "in a way that you deem appropriate." Suppose our query is `'yooo I am very very very interested in a practical machine learning course'`, and suppose `n=3` (the default).
- The top 3 most similar documents are `'467.txt'`, `'445.txt'`, and `'453.txt'`.
- If we just ask Llama, `'yooo I am very very very interested in a practical machine learning course'`, it won't know anything about EECS 467, EECS 445, or EECS 453. If we ask it, `'yooo I am very very very interested in a practical machine learning course, tell me about them: 467.txt, 445.txt, and 453.txt'`, it also won't know anything about those courses.
- Instead, once we identify which (3) documents are most relevant, we need to read them in as strings once again using `open`, then create a new `query_string` that looks something like:

```python
'''
Hi! I'm looking to answer this query that a student sent me, regarding EECS courses at the University of Michigan:

yooo I am very very very interested in a practical machine learning course

Here are some relevant courses from my knowledge base.

here's EECS 467
EECS 467: Autonomous Robots
Software methods and implementation for robot perception, world mapping, ...
...

here's EECS 445
Syllabus
Introduction to Machine LearningFall 2016
The course is a programming-focused introduction to Machine Learning.
...

here's EECS 453
Course Instructor: Prof. Qing Qu
Course Time: Mon/Wed 12:00 PM – 1:30 PM
...
'''
```

- You can structure your final query string however you'd like, and you're encouraged to experiment with different phrasings to see if they influence your results; you can start by copying the example format above, but then try and make it your own. (This is called **prompt engineering**.)
- In the example above, we only included the first few lines of the relevant syllabi, but in your actual prompts, you'd include the entire text. You'll need to figure out a way of programmatically adding the course numbers and course syllabi text to your prompt string – remember, `n` might be something other than 3.
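
The general pattern for programmatically stitching text together is to collect the pieces in a list and join them at the end. The sketch below shows that pattern with made-up documents held in a dictionary; `build_prompt`, the wording of the prompt, and the document names are all hypothetical, and reading the actual syllabus files is left out.

```python
def build_prompt(question, documents):
    """Combine a question with labeled document texts (hypothetical helper)."""
    parts = [
        "Here is a question from a student:",
        question,
        "Here are some relevant documents.",
    ]
    # One labeled section per document, however many there are.
    for name, text in documents.items():
        parts.append(f"Document {name}:\n{text}")
    return "\n\n".join(parts)

# Made-up documents; in the homework, the text would come from the syllabus files.
documents = {
    'x.txt': 'Course X covers topic one.',
    'y.txt': 'Course Y covers topic two.',
}
print(build_prompt('Which course covers topic two?', documents))
```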

In [None]:
def ask_gpteecs(query_string, n=3, bow=bow):
    ...

# Feel free to change the inputs below to test out your implementation of ask_gpteecs.
# The Markdown function behaves like the print function,
# but renders text formatting (e.g. bolding, bullet points) when the output from Llama
# contains these elements.
Markdown(ask_gpteecs('yooo I am very very very interested in a practical machine learning course'))

In [None]:
grader.check("q04_07")

**Great work!** You've now implemented Retrieval-Augmented Generation, and have your very own ChatGPT-like interface that knows about EECS classes.

Unfortunately, it's not perfect. Look what happens with the following query:

In [None]:
Markdown(ask_gpteecs('what is different about eecs 280 and eecs 281'))

At the bottom, the error says:

```
BadRequestError: Error code: 400 - {'error': {'message': 'Please reduce the length of the messages or completion.', 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
```

This is telling us that the query sent to Llama is longer than its **context window**, an idea we discussed at the start of Question 4. Since we're using the model `'llama3-8b-8192'`, the largest possible query we can send is 8192 tokens. If we take a look at the length of each syllabus, we see that the 280 and 281 syllabi together are longer than 8192 of our tokens. This is only a rough proxy for Llama's count, since Llama splits text using its own tokenizer, and it doesn't include punctuation, the input query, or our instructions:

In [None]:
bow.sum(axis=1).sort_values(ascending=False)

Feel free to keep toying with `query_llama` (see the [Groq documentation here](https://console.groq.com/docs/quickstart) to see what you can customize) and `ask_gpteecs` to try and improve the performance of your implementation.
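
One simple mitigation to experiment with is trimming each document to a fixed budget before building the prompt. The sketch below uses whitespace-separated words as a rough proxy for LLM tokens; the model's own tokenizer counts differently, so any budget would need a safety margin. The helper `truncate_words` and the numbers used are hypothetical.

```python
def truncate_words(text, max_words):
    """Keep at most max_words whitespace-separated words of text."""
    words = text.split()
    return " ".join(words[:max_words])

long_text = "word " * 5000          # a made-up 5000-word "document"
short_text = truncate_words(long_text, 1500)
print(len(short_text.split()))
```

Note that this approach collapses line breaks and simply discards everything past the cutoff, so important details near the end of a syllabus would be lost. Smarter strategies, like keeping only the most relevant sections, are worth exploring.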

## Finish Line 🏁

Congratulations! You're ready to submit Homework 6.

To submit your homework:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 6".
5. Stick around while the Gradescope autograder grades your work. **Remember that Homework 6 has no hidden tests! This means the tests you see in your notebook are the exact same as the ones that will be used to grade your work on Gradescope. When you submit on Gradescope, you'll see your score shortly after you submit, once the autograder finishes running.** 
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()