In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw02.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 2

# Arrays, Probability, and DataFrames

### EECS 398: Practical Data Science, Winter 2025

#### Due Tuesday, January 28th at 11:59PM
    
</div>

## Instructions

Welcome to Homework 2! In this homework, you will practice with `numpy` arrays, linear algebra, and random simulations, as in Lecture 3. You'll also start to understand the connection between probability and machine learning, using your understanding of EECS 203 material. Finally, you'll start to work with real, tabular data using the `pandas` library.

You are given 8 slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps. Once you're done, you'll submit your completed notebook to Gradescope.

<div class="alert alert-warning">
This homework features a mix of autograded programming questions and manually-graded questions.

- Questions 1.4 and 3 are **manually graded**, and say **[Written ✏️]** in their titles. For these questions, **do not write your answers in this notebook**! Instead, write **all** of your answers in a separate PDF. You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in. Submit this separate PDF to the **Homework 2 (Questions 1.4, 3; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**! Make sure to show your work for all written questions, as answers without work shown may not receive full credit.
    
- Questions 1.1-1.3, 2, and 4-6 are **fully autograded**, and say **[Autograded 💻]** in the title. For these questions, all you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 2 (Questions 1.1-1.3, 2, 4-6; autograder problems)** assignment on Gradescope to have your code graded by the hidden autograder. Remember that the public `grader.check` tests in your notebook are not comprehensive, and that your work will also be graded on hidden test cases on Gradescope after the submission deadline.
    
Your Homework 2 submission time will be the **later** of your two individual submissions. Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll take your **most recent** submission. 
</div>

This homework is worth a total of **63 points**, 49 of which come from the autograder, **and 14 of which are manually graded by us** (Questions 1.4, 3). The number of points each question is worth is listed at the start of each question. **The three parts of the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

import networkx as nx

## Question 1: PageRank 🔗

---

In this part of the homework, you'll replicate the PageRank algorithm, the algorithm that Google uses to decide how to rank search results. Along the way, you'll develop proficiency with using arrays in the context of linear algebra.

<div class="alert alert-success">

To get started, read [**this guide**](https://practicaldsc.org/guides/linear-algebra/pagerank) we've written about the PageRank algorithm. Think of it as an extension of the homework spec. It involves linear algebra but is self-contained.

</div>

We'll start by defining a 2D array, `A`, representing the adjacency matrix defined in the guide.

In [None]:
A = np.array([[0, 1 / 2, 1 / 2, 1 / 3],
              [1, 0, 0, 1 / 3],
              [0, 0, 0, 1 / 3],
              [0, 1 / 2, 1 / 2, 0]])
A

The function below draws a network given an adjacency matrix. Run the cell below to see a visualization of `A`'s network.

In [None]:
def plot_from_adjacency(adjacency_matrix, node_sizes=0.25):
    np.random.seed(25)
    plt.figure(figsize=(8, 5))
    G = nx.from_numpy_array(adjacency_matrix.T, create_using=nx.DiGraph)
    layout = nx.spring_layout(G)
    labels_dict={i: i+1 for i in range(adjacency_matrix.shape[0])}
    nx.draw(G, layout, 
            node_size=15000 * node_sizes, labels=labels_dict, with_labels=True, font_color='white', font_weight='bold', font_size=15, 
            connectionstyle='arc3, rad = 0.1')
    plt.show()

plot_from_adjacency(A)

This graph should resemble the one shown in the aforementioned guide, but was generated using code. Cool!

### Question 1.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Having to specify an adjacency matrix manually is slightly cumbersome. It is more natural and convenient for us to describe the links between webpages using a dictionary. One such example is as follows:

In [None]:
example_net = {
    1: [2],
    2: [1, 4],
    3: [1, 4],
    4: [1, 2, 3]
}

In the above "network dictionary", we are told that:
- Page 1 links to Page 2.
- Page 2 links to Pages 1 and 4.
- Page 3 links to Pages 1 and 4.
- Page 4 links to Pages 1, 2, and 3.

**Note that this dictionary describes the same network that the adjacency matrix `A` does.**
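
One way to read the correspondence, sketched below under the assumption (consistent with `example_net` and `A` above) that column $j$ of the matrix describes the outgoing links of Page $j$: each column splits a total weight of 1 evenly among the pages it links to, so every column sums to 1. The name `A_demo` is just a copy of `A` so that the snippet runs on its own; this check is illustrative and is not part of the graded work.

```python
import numpy as np

# Same values as A above, repeated so this snippet is self-contained.
A_demo = np.array([[0, 1 / 2, 1 / 2, 1 / 3],
                   [1, 0, 0, 1 / 3],
                   [0, 0, 0, 1 / 3],
                   [0, 1 / 2, 1 / 2, 0]])

# Column 4 (index 3) corresponds to Page 4, which links to Pages 1, 2, and 3.
page_4_column = A_demo[:, 3]

# Each column distributes a total weight of 1 across that page's outgoing links.
column_sums = A_demo.sum(axis=0)
print(page_4_column)
print(column_sums)
```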

Below, complete the implementation of the function `create_adjacency`, which takes in a network dictionary, `network` (formatted similarly to `example_net`) and returns the adjacency matrix that corresponds to `network`. Example behavior is given below.

```python
>>> create_adjacency(example_net)
array([[0.        , 0.5       , 0.5       , 0.33333333],
       [1.        , 0.        , 0.        , 0.33333333],
       [0.        , 0.        , 0.        , 0.33333333],
       [0.        , 0.5       , 0.5       , 0.        ]])
```

A few notes:
- It is **not** guaranteed that there are 4 pages in the network. 
- It **is** guaranteed that all pages link to at least one page – potentially itself – and that there is at least one page.
- Remember that adjacency matrices are 1-indexed, like in math, but Python is 0-indexed.

Some guidance:
- Look into `np.zeros`.
- You can use (multiple) `for`-loops.
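
As a small illustration of the `np.zeros` hint and the indexing note above, here is a sketch on a made-up 3-by-3 matrix (the names `M`, `row`, and `col` are hypothetical and unrelated to the solution):

```python
import numpy as np

# A 3x3 matrix of zeros (floats by default).
M = np.zeros((3, 3))

# In math notation, "row 2, column 3" is 1-indexed.
# Python is 0-indexed, so subtract 1 from each position.
row, col = 2, 3
M[row - 1, col - 1] = 0.5
print(M)
```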

In [None]:
def create_adjacency(network):
    ...

# Feel free to change this input to make sure your function works correctly.
create_adjacency(example_net)

In [None]:
grader.check("q01_01")

### Question 1.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `compute_scores`, which takes in an adjacency matrix `matrix` of shape `(n, n)`, where `n` is a positive integer, and returns an array of length `n` containing the PageRank scores of all pages. Compute the PageRank scores using a matrix power of 100, as we did in the linked [**guide**](https://practicaldsc.org/guides/linear-algebra/pagerank). Example behavior is given below.

```python
# Remember, compute_scores should work for any adjacency matrix, not just A!
>>> compute_scores(A)
array([0.30769231, 0.38461538, 0.07692308, 0.23076923])
```
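
To see what a matrix power does on a smaller case, here is a sketch on a hypothetical 2-page network. It assumes that the guide's approach amounts to multiplying a starting vector of equal scores by a high power of the adjacency matrix; the guide remains the authoritative reference for the exact procedure.

```python
import numpy as np

# Hypothetical 2-page network: Page 1 links to Pages 1 and 2; Page 2 links to Page 1.
M = np.array([[0.5, 1.0],
              [0.5, 0.0]])

# Start with equal scores for every page.
start = np.ones(2) / 2

# np.linalg.matrix_power(M, 100) multiplies M by itself 100 times.
scores = np.linalg.matrix_power(M, 100) @ start
print(scores)  # approximately [0.6667, 0.3333]
```

In this toy network, Page 1 receives links from both pages, so it ends up with the larger score.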

In [None]:
def compute_scores(matrix):
    ...

# Feel free to change this input to make sure your function works correctly.
compute_scores(A)

In [None]:
grader.check("q01_02")

We can change the sizes of the pages in our network to be proportional to their PageRank scores. To do this, use the `node_sizes` argument in `plot_from_adjacency`.

In [None]:
plot_from_adjacency(A, node_sizes=compute_scores(A))

### Question 1.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `pagerank`, which takes in an adjacency matrix `matrix` of shape `(n, n)`, where `n` is a positive integer, and returns a list containing the **numbers** of the pages in the matrix, in **decreasing** order of PageRank score as computed by `compute_scores`. If there are ties in scores, return the pages in any order. Example behavior is given below.

```python
>>> pagerank(A)
[2, 1, 4, 3]
```

Some guidance:
- Look into `np.argsort`.
- This should only take 2-3 lines; do not write a `for`-loop.
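
As a small illustration of the `np.argsort` hint, here is a sketch using made-up values rather than PageRank scores (the names `values` and `labels` are hypothetical):

```python
import numpy as np

values = np.array([0.2, 0.5, 0.3])

# np.argsort returns the positions that would sort the array in increasing order.
increasing = np.argsort(values)      # positions 0, 2, 1
decreasing = increasing[::-1]        # positions 1, 2, 0

# Convert 0-indexed positions to 1-indexed labels, then to a plain list.
labels = (decreasing + 1).tolist()
print(labels)  # [2, 3, 1]
```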

In [None]:
def pagerank(matrix):
    ...

# Feel free to change this input to make sure your function works correctly.
pagerank(A)

In [None]:
grader.check("q01_03")

Once you've completed `pagerank`, run the following cell to visualize your work.

In [None]:
# Here's another example network, which you can use to test your code.
test_net = {1: [6], 2: [1, 3, 4, 6], 3: [2, 5, 6], 4: [1], 5: [1, 2, 6], 6: [2, 4]}
print(test_net)
test_net_adjacency = create_adjacency(test_net)
test_net_scores = compute_scores(test_net_adjacency)
plot_from_adjacency(test_net_adjacency, node_sizes=test_net_scores)

### Question 1.4: Sinkholes [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Let's wrap up this question with a puzzle. Consider the following network:

In [None]:
weird_net = {
    1: [1],
    2: [1, 3, 5],
    3: [2, 4],
    4: [1, 2, 3],
    5: [3]
}

Note that Page 1 links to itself and to no other pages. Practically speaking, we can interpret Page 1 as being a "dead end" or "sinkhole", with no outgoing links.

Run the cells below to compute the adjacency matrix and scores for the above network and to visualize it.

In [None]:
weird_matrix = create_adjacency(weird_net)
weird_scores = compute_scores(weird_matrix)
print(weird_scores)
plot_from_adjacency(weird_matrix, node_sizes=weird_scores)

<!-- BEGIN QUESTION -->

It appears that Page 1's score is 1, and all of the other pages' scores are 0! (`9.6409e-10` means $9.64 \cdot 10^{-10}$, which is a really, really small number.)

In your PDF submission, write your answers to the following two manually-graded prompts:

- Why do you think Page 1's score is so high, and the other pages' scores are so low? (***Hint:*** Think about how we interpret the score of a page.)
- Read the [Damping factor](https://en.wikipedia.org/wiki/PageRank#Damping_factor) section of the Wikipedia article on PageRank. In two sentences, describe (to the best of your ability) how using damping would prevent the score of Page 1 from becoming 1. (If you read the article closely, the answer is there – describe it in your own words.)

<!-- END QUESTION -->

## Question 2: Euchre Returns! 🃏

---

In this question, we'll practice simulating various probabilities in the context of the popular card game, [Euchre](https://www.wikihow.com/Play-Euchre)!

Euchre, as you may have seen in EECS 280, is a 4-player card game, in which the players are named Player 0, Player 1, Player 2, and Player 3. Euchre is played with a deck of **24 cards**. Each card has two attributes:
- a **suit**, which is either hearts (<span style="color:red">❤️</span>), diamonds (<span style="color:red">♦️</span>), clubs (♣), or spades (♠).
- a **face value**, which is either 9, 10, Jack (J), Queen (Q), King (K), or Ace (A).

The full deck of 24 cards is shown below.

<div align="center">

<span style="color:red">❤️: 9, 10, J, Q, K, A</span><br>
<span style="color:red">♦️: 9, 10, J, Q, K, A</span><br>
<span style="color:black">♣: 9, 10, J, Q, K, A</span><br>
<span style="color:black">♠: 9, 10, J, Q, K, A</span><br>

</div>

The 24 cards are all different. The list `deck`, defined below, represents our deck of 24 cards. Each element in deck is a string representing a card, in the format `'{face} {suit}'`.

Note that you don't need to know how Euchre is played, beyond what is explained above, to work on the following question. (Suraj has never taken EECS 280 or played Euchre before, either.)

In [None]:
# Do NOT edit this cell! 
# In our hidden tests, we will assume deck is defined exactly as below.
suits = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
faces = ['9', '10', 'J', 'Q', 'K', 'A']
deck = [f'{face} {suit}' for suit in suits for face in faces]
deck

At the start of a round of Euchre, the 24 cards are randomly distributed to the 4 players such that each player gets 5 cards, and the remaining 4 cards ($24 - 5 \cdot 4 = 4$) are put to the side. 

You should use functions built into `np.random` to simulate the act of shuffling and distributing cards, but be careful about the arguments you provide to these functions and the assumptions they make. You're free to define helper functions to use throughout your work, too – **we did this ourselves** – and you'll need to research various string, list, array, and `np.random` functions and methods to get this all to work.

Your answers will be (very) slightly different than those of your peers, and slightly different each time you run your notebook, due to randomness. This is expected, and our autograder tests will account for this. **Don't** work out the math by hand – that isn't the point of these questions. Rather, the point is to build your fluency in simulating.
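
If the general pattern of a simulation is unfamiliar, here is a minimal sketch on an unrelated problem: estimating the probability that two fair dice sum to 7. The seed, the variable names, and the use of `np.random.default_rng` are choices made for this illustration only; the seed makes the demo repeatable, which is not required in your own answers.

```python
import numpy as np

# Seeded generator so this demo is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(398)

repetitions = 100_000
# Each row is one trial: two independent rolls of a fair six-sided die.
rolls = rng.integers(1, 7, size=(repetitions, 2))

# The proportion of trials in which the event occurred estimates its probability.
estimate = (rolls.sum(axis=1) == 7).mean()
print(estimate)  # close to 1/6
```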

### Question 2.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Complete the implementation of the function `prob_at_least_k_same_suit`, which takes in an integer `k` and returns a simulated estimate of the **probability that Player 0 ends up with at least $k$ cards of the same suit**. Remember, the 24 cards in the Euchre deck (`deck`) are distributed such that Player 0, Player 1, Player 2, and Player 3 each receive 5 cards, and 4 cards are put to the side.

Example behavior is given below.

```python
>>> prob_at_least_k_same_suit(4)
0.02693

# Each time we call prob_at_least_k_same_suit with the same input,
# we should see a slightly different result.
>>> prob_at_least_k_same_suit(4)
0.02591
```

Some extra guidance:
- Use 100,000 repetitions in your simulation.
- If `k` is negative or `0`, return `0`. If `k` is greater than the number of cards given to Player 0, return `0`. (Note that there are some values of `k` for which you'll always end up returning `1` – it's useful to think about which values of `k` these are. Think back to the pigeonhole principle from EECS 203!)
- Look at the default `replace` argument in `np.random.choice` – you may need to set this to something else.
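
To illustrate the `replace` hint, here is a sketch on a made-up list of letters (not the Euchre deck). It contrasts sampling with and without replacement, and also shows `np.random.permutation` as one possible way to shuffle.

```python
import numpy as np

letters = ['a', 'b', 'c', 'd', 'e']

# Default behavior (replace=True): the same item can be drawn more than once.
with_replacement = np.random.choice(letters, size=3)

# replace=False: every drawn item is distinct, like dealing from a deck.
without_replacement = np.random.choice(letters, size=3, replace=False)

# np.random.permutation returns a shuffled copy, leaving the input untouched.
shuffled = np.random.permutation(letters)
print(with_replacement, without_replacement, shuffled)
```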

In [None]:
def prob_at_least_k_same_suit(k):
    ...

# Feel free to change this input to make sure your function works correctly.
# Each time you run this cell, you should see a slightly different result!
prob_at_least_k_same_suit(4)

In [None]:
grader.check("q02_01")

### Question 2.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Complete the implementation of the function `prob_all_one_jack`, which takes no arguments and returns a simulated estimate of the **probability that each player ends up with exactly one Jack (J)**.

Again, use 100,000 repetitions in your simulation.

In [None]:
def prob_all_one_jack():
    ...

# Each time you run this cell, you should see a slightly different result!
prob_all_one_jack()

In [None]:
grader.check("q02_02")

## Question 3: Probability 🤝 Machine Learning

---

In this question, we'll introduce you to a key idea in machine learning and statistics, called **maximum likelihood estimation**.

<div class="alert alert-success">

To get started, read [**this guide**](https://practicaldsc.org/guides/machine-learning/mle) we've written about the method of maximum likelihood estimation. Think of it as an extension of the homework spec.

</div>

<!-- BEGIN QUESTION -->

### Question 3.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

When you step on campus, each person you see has a $0.1$ chance of saying "Go Blue!" to you, independent of all other people.

Tomorrow, what's the probability that the first person to say "Go Blue!" to you is the **6th** person you see?

Leave your answer in unsimplified form. This question should not take very long; think back to the probability distributions you learned in EECS 203 (other than the binomial distribution).

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Again, assume that each person you see has a $0.1$ chance of saying "Go Blue!" to you, independent of all other people. 

What's the probability that:
- the first person to say "Go Blue!" to you tomorrow is the **6th** person you see, **and**
- the first person to say "Go Blue!" to you the day after tomorrow is the **10th** person you see, **and**
- the first person to say "Go Blue!" to you the day after that is the **2nd** person you see?

Again, leave your answer in unsimplified form. Note that we're asking for a single probability, not three separate probabilities.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Now, suppose that the probability that each person you see says "Go Blue!" to you is some **unknown parameter**, $\pi$. (That is, $0.1$ will not appear in the rest of this question.)

Suppose you go to campus on $n$ straight days, and you collect a dataset $x_1, x_2, ..., x_n$, where:
- On Day 1, the first person to say "Go Blue!" to you is the $x_1$th person you saw (or "person $x_1$"), **and**
- On Day 2, the first person to say "Go Blue!" to you is person $x_2$, **and**
- On Day 3, the first person to say "Go Blue!" to you is person $x_3$, **and** so on.
- In general, for $i = 1, 2, ..., n$, on Day $i$, the first person to say "Go Blue!" to you is person $x_i$.

For example, the dataset $x_1 = 5, x_2 = 10, x_3 = 2$ would mean that on Day 1, person 5 was the first to say "Go Blue!"; on Day 2, person 10 was the first to say "Go Blue!"; and on Day 3, person 2 was the first to say "Go Blue!".

Show that $\log L(\pi)$, the log of the likelihood function for $\pi$, is:

$$\log L(\pi) = \log(1 - \pi) \sum_{i = 1}^n (x_i - 1) + n \log \pi$$

Some guidance: Try and generalize the calculation you made in Question 3.2.
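
If it helps to get a feel for the expression before deriving it, the sketch below simply evaluates the stated formula for $\log L(\pi)$ on the example dataset $x_1 = 5, x_2 = 10, x_3 = 2$ at two arbitrary values of $\pi$. The function name `log_likelihood` and the chosen values of $\pi$ are hypothetical; this is a numerical illustration, not a substitute for the written derivation.

```python
import numpy as np

def log_likelihood(pi, x):
    # Direct translation of the expression stated above:
    # log(1 - pi) * sum(x_i - 1) + n * log(pi)
    x = np.asarray(x)
    n = len(x)
    return np.log(1 - pi) * np.sum(x - 1) + n * np.log(pi)

x_example = [5, 10, 2]
ll_tenth = log_likelihood(0.1, x_example)
ll_half = log_likelihood(0.5, x_example)
print(ll_tenth, ll_half)
```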

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Using the result to Question 3.3, find $\pi^*$, the maximum likelihood estimate of $\pi$ given the dataset $x_1, x_2, ..., x_n$. Once you've done that, give a brief English explanation of why the value of $\pi^*$ makes intuitive sense.

<!-- END QUESTION -->

## Questions 4-6: Movies 🎥

---

In the final three questions of the homework, we'll build our skills in working with DataFrames in `pandas`.

<div class="alert alert-info">

Questions 4-6 will rely on knowing the basics of `pandas`. As of this homework being released, we haven't yet covered `pandas`. We will cover the material necessary for these questions in Lecture 4 on Wednesday, January 22nd, which will give you enough time to complete them before the deadline of Tuesday, January 28th. If you'd like to get a head start, you can navigate to last semester's course website (see the [Archive](https://practicaldsc.org/archive) tab of the course website) to find the relevant materials.

</div>

Run the cell below to load in a dataset with information about various movies from [IMDb](https://www.imdb.com/), the Internet Movie Database. (The dataset comes from [Kaggle](https://www.kaggle.com/datasets/parthdande/imdb-dataset-2024-updated)).

In [None]:
imdb = pd.read_csv('data/imdb-2024-cleaned.csv')
imdb

As the preview above shows us, `imdb` has 255 rows, and has some recent movies, like `'A Quiet Place: Day One'`, released in 2024.

Not sure what one of the columns means? Google it, as data scientists do – you'll find helpful information directly on IMDb's website.

In lecture, we'll learn that it's good practice to set the index of a DataFrame to a unique identifier for each row, if one exists. At first glance, it may seem like the values in the `'Title'` column are unique, but upon further investigation we see that there are actually duplicate `'Title'`s:

In [None]:
imdb['Title'].value_counts()

There are 2 rows for `'Planet of the Apes'`, for example. We can query to see just those rows:

In [None]:
imdb[imdb['Title'] == 'Planet of the Apes']

It seems like there are indeed multiple different `'Planet of the Apes'` movies, released in different years. So, for now, we'll leave the index of `imdb` as-is.
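
The two operations used above, counting values and querying with a Boolean condition, can be tried on a tiny made-up DataFrame first. The column names and values below are hypothetical and do not come from `imdb`.

```python
import pandas as pd

# Hypothetical toy data; column names are made up for this illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C'],
    'Year': [1968, 1999, 2001, 2010],
})

# value_counts tallies how often each value appears, most frequent first.
counts = toy['Name'].value_counts()

# A Boolean Series inside [] keeps only the rows where the condition is True.
rows_for_a = toy[toy['Name'] == 'A']
print(counts)
print(rows_for_a)
```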

Time to get to work!

<div class="alert alert-danger">
    
**DO NOT** modify the `imdb` DataFrame at any point in this notebook! If you do, your code won't be graded correctly when you submit it to Gradescope.

</div>

### Question 4: Exploring the Data 🤔

In this question, which is made up of 7 smaller subparts, we'll ask you to answer various questions involving the `imdb` DataFrame. **Don't** hard-code your answers; use `pandas` code to find them programmatically.

#### Question 4.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

Assign `highest_rating_ever` to the highest `'IMDb Rating'` of any movie in `imdb`. Your answer should be a float.

In [None]:
highest_rating_ever = ...
highest_rating_ever

In [None]:
grader.check("q04_01")

#### Question 4.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `movie_with_highest_rating_ever` to the `'Title'` of the movie with the highest `'IMDb Rating'` of any movie in `imdb`. Assume there are no ties. Your answer should be a string.

In [None]:
movie_with_highest_rating_ever = ...
movie_with_highest_rating_ever

In [None]:
grader.check("q04_02")

#### Question 4.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

Assign `num_comedy` to the number of movies in `imdb` with a `'Genre'` of `'Comedy'`. Your answer should be an integer.

In [None]:
num_comedy = ...
num_comedy

In [None]:
grader.check("q04_03")

#### Question 4.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `most_common_genre_2024` to the name of the most common `'Genre'` among all movies in `imdb` released in 2024. Assume there are no ties. Your answer should be a string.

In [None]:
most_common_genre_2024 = ...
most_common_genre_2024

In [None]:
grader.check("q04_04")

#### Question 4.5 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Assign `prop_big_directors` to the **proportion** of movies in `imdb` directed by the top 4 `'Director'`s, when `'Director'`s are ranked by the number of movies they've directed in descending order. Assume that the top 4 `'Director'`s are unambiguous; i.e. that there isn't a 5th director that is tied with one of the top 4. Your answer should be a float between 0 and 1.

In [None]:
prop_big_directors = ...
prop_big_directors

In [None]:
grader.check("q04_05")

#### Question 4.6 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `average_harry_potter` to the mean `'IMDb Rating'` of all movies with the string `'Harry Potter'` in the `'Title'`. Your answer should be a float.

In [None]:
average_harry_potter = ...
average_harry_potter

In [None]:
grader.check("q04_06")

#### Question 4.7 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Assign `num_hours_family_friendly` to the number of **hours** it would take you to watch the 10 highest-rated `'PG-13'` movies whose `'Genre'` is either `'Action'` or `'Adventure'`. By highest-rated, we're referring to movies with the highest `'IMDb Rating'`s. Assume that there are no ties in `'IMDb Rating'`s. Your answer should be a float.

This is a sophisticated problem, but one that only requires knowledge from Lecture 4. Remember to break the problem into steps, and write one step/method at a time.

In [None]:
num_hours_family_friendly = ...
num_hours_family_friendly

### Question 5: Star Struck 🤩

Take a look at the `'Star Cast'` column of `imdb`, which we didn't use in Question 4:

In [None]:
imdb['Star Cast']

Right now, actors' names aren't separated by spaces. With `'Star Cast'` in this form, we can still perform _some_ queries, like looking at all of the movies Margot Robbie was in:

In [None]:
imdb[imdb['Star Cast'].str.contains('Margot Robbie')]

But we **can't** answer questions like, which actor was in the most movies? Let's change that by cleaning the data 🧹.

#### Question 5.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Complete the implementation of the function `extract_names`, which takes in `s`, a single string formatted like those in the `'Star Cast'` column of `imdb`, and returns a **list** of strings with the names of all of the actors in `s`. Example behavior is given below.

```python
>>> extract_names('Makoto ShinkaiClark Cheng')
['Makoto Shinkai', 'Clark Cheng']

>>> extract_names('Anya Taylor-JoyChris HemsworthTom Burke')
['Anya Taylor-Joy', 'Chris Hemsworth', 'Tom Burke']

>>> extract_names('Santa Ono')
['Santa Ono']
```

You **must** assume that:
- Each actor only has a single first name and a single last name.
- Actors' names are "split" when you see a lowercase character followed immediately by an uppercase character. For instance, in the input<br><center>"Anya Taylor-Jo**yC**hris Hemswort**hT**om Burke"</center><br>the only two occurrences of a lowercase character followed by an uppercase character are at the boundaries of names. You **must** split names in this way, even though it means that `extract_names` will work incorrectly for cases like the one below. (We're adding this assumption because it makes the implementation of `extract_names` simpler.)

```python
>>> extract_names('Leonardo DiCaprioJonah HillMargot Robbie')
['Leonardo Di', 'Caprio', 'Jonah Hill', 'Margot Robbie']
```

Note that the implementation of `extract_names` doesn't involve any `pandas` methods; it's a pure Python problem.
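
As one possible starting point, the sketch below shows how to detect a lowercase character followed immediately by an uppercase character in plain Python. The input string and variable names are hypothetical, and this only locates the boundaries; it is not a full implementation of `extract_names`, and other approaches work too.

```python
# Hypothetical input, unrelated to the dataset.
s = 'camelCaseWord'

# Pair each character with the one after it.
pairs = list(zip(s, s[1:]))

# Record the index of the uppercase character in every lowercase-uppercase pair.
boundaries = [i + 1 for i, (a, b) in enumerate(pairs)
              if a.islower() and b.isupper()]
print(boundaries)  # [5, 9]
```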

In [None]:
def extract_names(s):
    ...

# Feel free to change this input to make sure your function works correctly.
extract_names('Makoto ShinkaiClark Cheng')

In [None]:
grader.check("q05_01")

Now, using syntax we'll see in Week 4, we've defined for you a Series, named `star_names`, containing the names of all individual actors in `'Star Cast'`. This involves using the Series `apply` method to call your `extract_names` function on every value in `imdb['Star Cast']`, and then combining the resulting lists into one massive list, and finally converting that to a Series.

In [None]:
star_names = pd.Series(imdb['Star Cast'].apply(extract_names).sum())
star_names

Note that the values in `star_names` are not unique!
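
To see why `apply` followed by `sum` produces one long list, here is a sketch on a made-up Series of comma-separated strings (the data and the splitting rule are hypothetical and unrelated to `extract_names`):

```python
import pandas as pd

# Hypothetical Series of comma-separated strings.
toy = pd.Series(['a,b', 'c', 'd,e'])

# apply calls the function once per element, producing a Series of lists.
lists = toy.apply(lambda s: s.split(','))

# Adding Python lists concatenates them, so summing the Series joins every list.
combined = lists.sum()
print(combined)  # ['a', 'b', 'c', 'd', 'e']
```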

#### Question 5.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `most_common`, which takes in a Series, `ser`, and returns a **list** containing the mode(s) of `ser`. The order of the elements in the returned list does not matter. Assume `ser` contains at least one element. Example behavior is given below.

```python
>>> most_common(pd.Series([1, 2, 2]))
[2]

>>> most_common(pd.Series([1, 2]))
[1, 2]

# This works strangely because of our assumptions in 5.1.
# We're intentionally not showing you the entire output.
>>> star_names_output = most_common(star_names)
>>> 'Caprio' in star_names_output and 'Leonardo Di' in star_names_output
True
```

In [None]:
def most_common(ser):
    ...

# Feel free to change this input to make sure your function works correctly.
most_common(pd.Series([1, 2, 2]))

In [None]:
grader.check("q05_02")

After you've implemented `most_common`, run the cell below.

In [None]:
most_common(star_names)

You may recognize some of these names, but not others. You should explore! For instance, query `imdb` for all the movies that `'Pete Docter'` starred in. What do you notice?

In [None]:
# You're NOT required to do this, but you should – it's fun!
...

### Question 6: Diving Deeper 🤿

To wrap up, let's answer a few more involved questions involving the original DataFrame, `imdb`.

Remember that DataFrames are **mutable**, which means that they can be modified as a side effect of calling a function. Make sure the functions you define **don't** have any side effects, i.e. they don't modify their input DataFrames directly, but rather just return the output that is asked of them.
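
To make the idea of a side effect concrete, here is a sketch with two hypothetical functions on a made-up DataFrame. The first modifies its input; the second uses `assign`, which returns a new DataFrame and leaves the input as it was.

```python
import pandas as pd

def add_double_in_place(df):
    # Side effect: this assignment modifies the caller's DataFrame.
    df['double'] = df['x'] * 2
    return df

def add_double_safely(df):
    # assign returns a new DataFrame and leaves the input unchanged.
    return df.assign(double=df['x'] * 2)

original = pd.DataFrame({'x': [1, 2, 3]})
result = add_double_safely(original)
print('double' in original.columns)  # False

add_double_in_place(original)
print('double' in original.columns)  # True
```

Calling `df.copy()` at the top of a function and working on the copy is another common way to avoid modifying the input.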

#### Question 6.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `longest_title`, which takes in a DataFrame `df` and returns the longest movie `'Title'`. Assume that `df` has the same 9 column titles as `imdb`, with the same data types, **but potentially a different number of rows in a different order, with a potentially different index**. Assume that `df` has at least one row, and assume there are no ties. Example behavior is given below.

```python
# Remember, imdb.head(5) is a DataFrame with just the first 5 rows of imdb.
>>> longest_title(imdb.head(5))
'10 Things I Hate About You'
```

In [None]:
def longest_title(df):
    ...

# Feel free to change this input to make sure your function works correctly.
longest_title(imdb.head(5))

In [None]:
grader.check("q06_01")

Once you've implemented `longest_title`, run the cell below to see the longest movie title in `imdb` 🦁.

In [None]:
longest_title(imdb)

#### Question 6.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Typically, it seems that a movie's `'IMDb Rating'`, which is determined by user ratings, and its `'MetaScore'`, which is a weighted average of scores given by trusted movie critics, are correlated. That is, movies with high `'IMDb Rating'`s tend to have high `'MetaScore'`s, and movies with low `'IMDb Rating'`s tend to have low `'MetaScore'`s.

We can see this trend in the following scatter plot. **Hover over points to see the names of the movies!**

In [None]:
imdb.plot(kind='scatter', 
          x='IMDb Rating', 
          y='MetaScore', 
          hover_name='Title', 
          width=800, 
          height=600,
          title='MetaScore vs. IMDb Rating')

As you can see, there are some outliers, i.e. movies that had relatively high `'MetaScore'`s but relatively low `'IMDb Rating'`s, or vice versa.

We define the `'MetaScore'` to `'IMDb Rating'` ratio of a movie as follows:

$$\frac{\text{Movie's MetaScore}}{\text{Movie's IMDb Rating}}$$
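
As a quick worked example of the ratio defined above, the sketch below computes it for two hypothetical movies. The titles and scores are made up for illustration and are not rows of `imdb`.

```python
# Hypothetical movies and scores, made up for illustration (not from imdb).
movies = {
    'Critic Darling': {'MetaScore': 90.0, 'IMDb Rating': 6.0},
    'Crowd Pleaser': {'MetaScore': 40.0, 'IMDb Rating': 8.0},
}

# Ratio = MetaScore / IMDb Rating, computed separately for each movie.
ratios = {
    title: scores['MetaScore'] / scores['IMDb Rating']
    for title, scores in movies.items()
}

print(ratios)  # {'Critic Darling': 15.0, 'Crowd Pleaser': 5.0}
```

A larger ratio means critics scored the movie highly relative to users, so here the hypothetical `'Critic Darling'` has the larger ratio of the two.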

Complete the implementation of the function `metascore_to_rating_outlier`, which takes in a DataFrame `df` and returns the `'Title'` of the movie with the **largest** `'MetaScore'` to `'IMDb Rating'` ratio. Note that this is only one of the two types of outliers mentioned in the paragraph at the top of this cell, but use this definition in your solution.

Assume that `df` has the same 9 column titles as `imdb`, with the same data types, but potentially a different number of rows in a different order, with a potentially different index. Assume that `df` has at least one row, and assume there are no ties. Example behavior is given below.


```python
>>> metascore_to_rating_outlier(imdb.head(5))
'A Quiet Place'
```

In [None]:
def metascore_to_rating_outlier(df):
    ...

# Feel free to change this input to make sure your function works correctly.
metascore_to_rating_outlier(imdb.head(5))

In [None]:
grader.check("q06_02")

#### Question 6.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

And finally, now that you're almost done with Homework 2, it's time to find new movies in your favorite genre that the data thinks you might like.

Complete the implementation of the function `genre_specific`, which takes in a DataFrame `df` and a string, `genre`, corresponding to a `'Genre'` in `imdb`. `genre_specific(df, genre)` should return a DataFrame that:
- Contains all of the movies in `df` of `'Genre'` `genre` **released in 2024**. Assume `genre` is a valid value in `df['Genre']`.
- Is sorted in descending order of `'IMDb Rating'`, with ties broken by `'MetaScore'` in descending order. Assume no two movies have both the same `'IMDb Rating'` and the same `'MetaScore'`. (There are a few cases where this happens, but you won't be tested on them in the hidden test cases.)
- Is indexed by `'Title'`, and only has two columns: `'IMDb Rating'` and `'MetaScore'`.

Example behavior is given below.

```python
>>> genre_specific(imdb, 'Horror')
```

<div style="text-align: left;">
  <table border="1" class="dataframe">
    <thead>
      <tr>
        <th style="text-align: left;"></th>
        <th style="text-align: left;">IMDb Rating</th>
        <th style="text-align: left;">MetaScore</th>
      </tr>
      <tr>
        <th>Title</th>
        <th></th>
        <th></th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <th>Alien: Romulus</th>
        <td>7.1</td>
        <td>66.9</td>
      </tr>
      <tr>
        <th>Exhuma</th>
        <td>7.0</td>
        <td>66.9</td>
      </tr>
      <tr>
        <th>The First Omen</th>
        <td>6.7</td>
        <td>65.0</td>
      </tr>
      <tr>
        <th>Abigail</th>
        <td>6.7</td>
        <td>62.0</td>
      </tr>
      <tr>
        <th>Immaculate</th>
        <td>5.8</td>
        <td>57.0</td>
      </tr>
      <tr>
        <th>Sting</th>
        <td>5.7</td>
        <td>57.0</td>
      </tr>
      <tr>
        <th>The Strangers: Chapter 1</th>
        <td>5.0</td>
        <td>43.0</td>
      </tr>
      <tr>
        <th>Tarot</th>
        <td>4.9</td>
        <td>36.0</td>
      </tr>
    </tbody>
  </table>
</div>
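
In the example above, The First Omen and Abigail share an `'IMDb Rating'` of 6.7, and the higher `'MetaScore'` comes first. The sketch below shows one way this kind of tie-breaking sort can work, using a small made-up DataFrame. The DataFrame `toy` and its values are assumptions for illustration; the sketch does not filter by genre or year, so it is not a full solution.

```python
import pandas as pd

# A tiny, made-up DataFrame (not imdb); 'P' and 'R' tie on 'IMDb Rating'.
toy = pd.DataFrame({
    'Title': ['P', 'Q', 'R'],
    'IMDb Rating': [6.7, 7.0, 6.7],
    'MetaScore': [62.0, 66.0, 65.0],
})

ranked = (
    toy
    # Sort by the first column; ties are broken by the second. Both descending.
    .sort_values(['IMDb Rating', 'MetaScore'], ascending=False)
    # Use the titles as the index, leaving the two score columns.
    .set_index('Title')
)

print(ranked.index.tolist())  # ['Q', 'R', 'P']
```

Passing a list of columns to `sort_values` sorts by the first column and uses later columns only to order rows that tie on the earlier ones.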


In [None]:
def genre_specific(df, genre):
    ...

In [None]:
grader.check("q06_03")

## Finish Line 🏁

Congratulations! You're ready to submit Homework 2.

You need to submit Homework 2 twice:

### To submit the manually graded problems (Questions 1.4, 3; marked [Written ✏️])

- Make sure your answers **are not** in this notebook, but rather in a separate PDF.
    - You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in.
- Submit this separate PDF to the **Homework 2 (Questions 1.4, 3; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

### To submit the autograded problems (Questions 1.1-1.3, 2, 4-6; marked [Autograded 💻])

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under **Homework 2 (Questions 1.1-1.3, 2, 4-6; autograded problems)**.
5. Stick around while the Gradescope autograder grades your work.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

Your Homework 2 submission time will be the **later** of your two individual submissions.