In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("cs109a_hw1.ipynb")

<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science </h1>

## Homework 1: Web Scraping, Data Parsing, and EDA

**Harvard University**<br/>
**Fall 2022**<br/>
**Instructors**: Pavlos Protopapas and Natesh Pillai



<hr style='height:2px'>

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

#### Instructions
- To submit your assignment follow the instructions given in Canvas.
- Exercise **responsible scraping**. Web servers can become slow or unresponsive if they receive too many requests from the same source in a short amount of time. Use a delay of 2 seconds between requests in your code.  
- Web scraping requests can take several minutes. This is another reason why you should not wait until the last minute to do this homework.
- Plots should be legible and interpretable without having to refer to the code that generated them.
- When asked to interpret a visualization, do not simply describe it (e.g., "the curve has a steep slope up"), but instead explain what you think the plot *means*.
- The use of 'hard-coded' values to try and pass tests rather than solving problems programmatically will not receive credit.
- The use of *extremely* inefficient or error-prone code (e.g., copy-pasting nearly identical commands rather than looping) may result in only partial credit.
- Enable scrolling output on cells with very long output.
- Feel free to add additional code or markdown cells.
- Ensure your code runs top to bottom without error and passes all tests by restarting the kernel and running all cells. This is how the notebook will be evaluated (this can take a few minutes).
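
The 2-second delay mentioned above can be sketched as a small loop. This is only an illustration, not a required structure: `fetch_all`, `fetch`, and `delay` are hypothetical names, and `fetch` stands in for whatever callable performs the request (for example `requests.get`).

```python
from time import sleep

def fetch_all(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` is any callable taking a URL (hypothetical parameter; in this
    notebook it could be `requests.get`).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause between consecutive requests, not before the first
        results.append(fetch(url))
    return results
```

Passing the fetching function as an argument keeps the delay logic separate from the network call, so the loop can be tried out with a stand-in function before any real requests are sent.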

In [None]:
import json
from time import sleep
import re
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

<hr style='height:2px'>

## ⭐ Follow the stars in IMDb's list of "The Top 100 Stars for 2021" 

### Overview
In this assignment you'll practice scraping, parsing, and analyzing HTML data pulled from the web.

Specifically, you'll extract information about each person on IMDb's list of "Top 100 Stars for 2021" (https://www.imdb.com/list/ls577894422/), perform some EDA, ask some questions of the data, and interpret your findings.

For example, we might like to know: 
- What is the relationship between when a performer started their career and their total number of acting credits? 
- How many stars started as child actors?
- How does the distribution of ages compare across genders?
- Who is the most prolific actress or actor in IMDb's list of the Top 100 Stars for 2021? 

These questions and more are addressed in detail below.

## Part 1 - Scraping and Parsing

<div class='exercise'><b>Q1 - Scrape Top Stars List</b></div>

Scrape the HTML from the webpage of the "Top 100 Stars for 2021" (https://www.imdb.com/list/ls577894422/) into a `requests.Response` object and name it `my_page`.

_Points:_ 2.5

In [None]:
my_page = ...

In [None]:
grader.check("q1")

<div class='exercise'><b>Q2 - Making BeautifulSoup</b></div>

Create a Beautiful Soup object named `star_soup` from the HTML content in `my_page`.


_Points:_ 2.5

In [None]:
star_soup = ...

In [None]:
# check your code - you should see familiar HTML code
print(star_soup.prettify()[:1000])

In [None]:
grader.check("q2")

<div class='exercise'><b>Q3 - Parse Stars</b></div>

Write a function called `parse_stars` that accepts `star_soup` as its input and returns a list of dictionaries to be saved in a variable called `star_list` (see the function definition below for details).

IMDb star pages do not have a 'sex' or 'gender' field. Some roles are gender-neutral (e.g., "writer"), and relying on actor/actress distinctions will also give results inconsistent with the more detailed data available on the site. You should infer gender based on the frequency of the personal pronouns used in each star's truncated bio that appears on the main "Top 100 Stars for 2021" page.

You may find a data structure like this useful:

```python
pronouns = {'woman': ['she','her'],
            'man': ['he', 'him', 'his'],
            'non-binary': ['they', 'them', 'their']}
```

Simply count the occurrences of the different pronouns in the bio and make the classification based on the grouping with the majority count.

>**Hint:** Throughout this assignment you will likely find it useful to create small 'helper' functions which are then used by your larger functions like `parse_stars`
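
As an illustration of the pronoun-counting idea and of a small helper function, here is a minimal sketch. The helper name `infer_gender` and the example bio are invented for this sketch, and the tie-breaking behavior (the first group in dictionary order wins) is an assumption, since the instructions do not say how ties should be handled.

```python
import re
from collections import Counter

PRONOUNS = {'woman': ['she', 'her'],
            'man': ['he', 'him', 'his'],
            'non-binary': ['they', 'them', 'their']}

def infer_gender(bio, pronouns=PRONOUNS):
    """Return the label whose pronouns occur most often in `bio`."""
    # Split into whole lowercase words so that 'the' is not counted as 'he'
    words = Counter(re.findall(r"[a-z]+", bio.lower()))
    totals = {label: sum(words[p] for p in group)
              for label, group in pronouns.items()}
    # max() returns the first label with the largest total (assumed tie rule)
    return max(totals, key=totals.get)

# Invented bio text, for illustration only
example_bio = ("She began acting at four. Her sisters are also actors, "
               "and they often appeared together.")
print(infer_gender(example_bio))
```

Matching whole words matters here: a plain substring count would find "he" inside "the", "she", and "they" and skew the totals.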

```
Function
--------
parse_stars

Input
------
star_soup: the soup object with the scraped page
   
Returns
-------
a list of dictionaries; each dictionary corresponds to a star profile and has the following data:

    name: (str) the name of the star
    role: (str) role in film designated on top 100 page (e.g., 'actress', 'writer')
    gender: (str) 'man', 'woman', or 'non-binary' based on pronoun counts in top 100 page bio
    url: (str) the url of the link under star's name that leads to a page with more details
    page: (bs4.BeautifulSoup) BS object acquired by scraping and parsing the above 'url' page

Example:
--------
{'name': 'Elizabeth Olsen',
 'role': 'actress',
 'gender': 'woman',
 'url': 'https://www.imdb.com/name/nm0647634',
 'page': <!DOCTYPE html>
 <html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
 <meta charset="utf-8"/>
 <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
 <script>
 ...
}

```

_Points:_ 25

In [None]:

def parse_stars(star_soup) -> list:
    ...

star_list = ...

In [None]:
# check your code
# this list is large because of the HTML code in the `page` field
# to get a better picture, print only the first element
star_list[0]

In [None]:
grader.check("q3")

<div class='exercise'><b>Q4 - Create Star Table</b></div>

Write a function called `create_star_table`, which takes `star_list` as an input and returns a *new* list of dictionaries, `star_table`, which includes more extensive information about each star extracted from their `page`. 

See the function definition below for more details.

>**Note:** The years of some credits are ranges (e.g., 2001-2002). You should use only the starting year.

>**Hint:** Carefully note the ordering, case, and data type of the values in each dictionary.
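
For the note about year ranges, one possible helper is sketched below. The name `starting_year` is hypothetical, and the exact text format of the year on a star's page is an assumption; the sketch simply takes the first four-digit run it finds.

```python
import re

def starting_year(year_text):
    """Return the first four-digit year in `year_text` as an int, or None."""
    match = re.search(r"\d{4}", year_text)
    return int(match.group()) if match else None

print(starting_year("2001-2002"))  # a range: only the starting year is kept
print(starting_year("1994"))       # a single year is returned unchanged
```

Returning `None` when no year is found makes missing values explicit, so they can be handled deliberately rather than raising an error partway through a long scrape.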

```
Function
--------
create_star_table

Input
------
star_list (list of dictionaries)
   
Returns
-------

a list of dictionaries; each dictionary corresponds to a star profile and has the following data:

    name: (str) the name of the star
    role: (str) 'actor', 'actress', 'writer', etc. (note the case)
    gender: (str) 'woman', 'man', or 'non-binary' (based on pronouns in bio)  
    year_born: (int) year star was born (some pages do not have a full date so we'll just use year)
    first_credit: (str) title of their first credit in their capacity designated by 'role'
    year_first_credit: (int) the year they made their first movie or TV show
    num_credits: (int) number of movies or TV shows they have made over their career in their capacity designated by 'role'
    
Example:
--------

{'name': 'Elizabeth Olsen',
  'role': 'actress',
  'gender': 'woman',
  'year_born': 1989,
  'first_credit': 'How the West Was Fun',
  'year_first_credit': 1994,
  'num_credits': 27}
  
```

_Points:_ 25

In [None]:
def create_star_table(star_list: list) -> list:
    ...

In [None]:
star_table = ...

In [None]:
# check your code
star_table

In [None]:
grader.check("q4")

### 💡 Saving and Restoring Our List of Dictionaries

It's good practice to save your data structure to disk once you've done all of your scraping. This way you can often avoid repeating all the HTTP requests which can be slow (and taxing on servers!).

We had to wait until this stage to save our data structure because the `bs4.BeautifulSoup` objects in our original `star_list` (the `page` values) cannot be easily [serialized](https://docs.python-guide.org/scenarios/serialization/).

The code provided below will save the data structure to a [JSON](https://www.json.org/json-en.html) file named `starinfo.json` in the data subdirectory.

In [None]:
# your code here
with open("data/starinfo.json", "w") as f:
    json.dump(star_table, f)

To confirm this worked as intended, open the JSON file and load its contents into a variable for viewing.

In [None]:
with open("data/starinfo.json", "r") as f:
    star_table = json.load(f)
    
# output should be the same
star_table

This method of saving and restoring data structures will likely be useful to you in the future!

## Part 2 - Pandas and EDA

<div class='exercise'><b>Q5 - Creating a DataFrame</b></div>

For the sake of consistency, we've provided our own JSON file, `data/starinfo_2021_staff.json`, which you should use for the rest of the notebook. Load the contents of this JSON file and use it to create a Pandas DataFrame called `frame`.

>**Hint:** Remember, the data structure in the JSON file is a list of dictionaries.

_Points:_ 2.5

In [None]:
frame = ...

In [None]:
# Take a peek
frame.head(20)

In [None]:
grader.check("q5")

<div class='exercise'><b>Q6 - Creating a New Column</b></div>

Add a new column to your dataframe with the *approximate* age of each star at the time of their first credit. Name this new column `age_at_first_credit`.

>**NOTE:** We say *approximate* age because we've only collected the year of birth as several star pages do not include a full birth date. **The approximate age of a star in a given year should be the difference between that year and the star's birth year.**

_Points:_ 2.5

In [None]:
# You should visually inspect some of your results
frame.head()

<div class='exercise'><b>Q7 - Subsetting and Sorting</b></div>

In this section you'll subset and sort the DataFrame to answer a pair of questions:

<div class='exercise'><b>Q7.1 - Child Stars</b></div>

Which stars received their first credit **before the age of 11?**

Store the resulting dataframe as `child_stars`, sorted by `age_at_first_credit` in **ascending** order.\
Store the number of such "child stars" in `num_child_stars`.

_Points:_ 2.5

In [None]:
child_stars = ...
num_child_stars = ...

print ("{} stars received their first credit before the age of 11.".format(num_child_stars))
display(child_stars)

In [None]:
grader.check("q7.1")

<div class='exercise'><b>Q7.2 - Late Bloomers</b></div>

Which stars received their first credit at **26-years-old or older?**

Store the resulting dataframe as `late_bloomers`, sorted by `age_at_first_credit` in **descending** order.\
Store the number of such "late bloomers" in `num_late_bloomers`.

_Points:_ 2.5

In [None]:
late_bloomers = ...
num_late_bloomers = ...

print ("{} stars received their first credit at 26 or older.".format(num_late_bloomers))
display(late_bloomers)

In [None]:
grader.check("q7.2")

<div class='exercise'><b>Q8 - Visualization  </b></div>

In this section you'll use your Python visualization skills to further explore the data:

<!-- BEGIN QUESTION -->

<div class='exercise'><b>Q8.1  - Exploring Trends</b></div>

Create 2 [scatter plots](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.scatter.html?highlight=scatter#matplotlib.axes.Axes.scatter): one showing the relationship between **age at first credit** and number of credits, the other between **year born** and number of credits.

What can you say about these relationships? Are there any apparent outliers? Please limit your written responses to 4 sentences or fewer.

_Points:_ 5

_Type your answer here, replacing this text._

In [None]:
# your code here


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class='exercise'><b>Q8.2 - Age Distributions</b></div>

Let's look at the distribution of movie and TV performers' ages by gender.

Create two plots, each plot consisting of **two overlaid [histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html?highlight=hist#matplotlib.pyplot.hist)** comparing the distribution of men's current ages to women's current ages.

In the first plot, the distributions should be normalized to show the *proportion* of each gender at each age.

The second plot should show the *counts* of each gender at each age. 

Interpret the resulting plots. (4 sentences or fewer)

>**NOTE 1:** Again, we are dealing with *approximate* ages as defined above.

>**NOTE 2:** You should exclude those whose `role` is not 'actor' or 'actress' from your analysis
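
One way to read "proportion" is that the bar heights for each gender sum to 1. The sketch below shows the difference between counts and proportions using NumPy on invented ages (not taken from the IMDb data); the bin edges are arbitrary. The same `weights` argument is accepted by `plt.hist`. Note that `density=True` normalizes to unit *area* instead, which matches proportions only when every bin has width 1.

```python
import numpy as np

# Invented ages, for illustration only
ages = np.array([25, 31, 34, 38, 45, 52])
bin_edges = np.arange(20, 61, 10)  # edges at 20, 30, 40, 50, 60

# Raw counts per bin
counts, _ = np.histogram(ages, bins=bin_edges)

# Proportions per bin: weight each observation by 1 / (group size)
weights = np.full(len(ages), 1 / len(ages))
proportions, _ = np.histogram(ages, bins=bin_edges, weights=weights)

print(counts)
print(proportions.sum())
```

Normalizing each group separately makes the shapes of the two distributions comparable even when one group has many more members than the other, while the count version shows the difference in group sizes.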

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
# your code here


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class='exercise'><b>Q8.3 - Credits Per Year</b></div>

Create a [box plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.boxplot.html?highlight=boxplot#matplotlib.axes.Axes.boxplot) or [violin plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.violin.html?highlight=violin#matplotlib.axes.Axes.violin) comparing the **credits-per-year-active** for men and women performers.

Here we assume all stars in the list are still active. 

Do these distributions look the same across genders? Can you identify the stars corresponding to any outliers? Comment on these points and anything else of interest gleaned from your plot. (6 sentences or fewer)

>**NOTE:** Again, you should exclude those whose `role` is not 'actor' or 'actress' from your analysis.

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
# your code here


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class='exercise'><b>Q9 - Most Prolific Stars</b></div>

Make a plot visualizing the number of credits received by each star. Who is the most prolific person in IMDb's list of the Top 100 Stars for 2021? Define **most prolific** as the person with the most credits.

>**Note 1:** Your analysis should include all 100 stars

>**Note 2:** The stars in the plots should be sorted based on number of credits to make the plot easier to read.

_Points:_ 10

In [None]:
# your code here


In [None]:
highest_performer_name = ...
highest_performer_credits = ...
print ("{} had the most credits with {}".format(highest_performer_name, highest_performer_credits))

<!-- END QUESTION -->

**This concludes HW1. Thank you!**

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()