In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

# Lab 05: Pandas Continued

Welcome to Advanced Topics in Data Science for High School! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

**Collaboration Policy:**

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask a neighbor or an instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** _just_ copy/paste someone else's code, but rather work together to gain understanding of the task you need to complete. 

**Due Date:**

## Today's Assignment 

In this lab we will continue discussion of [Pandas](https://pandas.pydata.org/). In today's assignment, you'll learn about:

* visualizing data 

* grouping dataframes

* merging dataframes

First, set up the imports by running the cell below.

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

## Babynames

In Lab 04 we learned how to filter and slice `pandas` dataframes. In today's lab we'll continue working with dataframes. Let's load the `babynames` dataset.

In [None]:
baby = pd.read_csv('data/baby_names.csv')
baby.head()

**Question 1.** You should see a column named **Unnamed: 0**. Drop this column from the `baby` dataframe. 

**Hint:** The documentation for the `.drop` command is [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html).
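
If the documentation feels abstract, here is a minimal sketch of `.drop` on a toy dataframe. The dataframe `toy` and its column names are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Toy dataframe (hypothetical data) with an extra column we do not need
toy = pd.DataFrame({
    "row_id": [0, 1, 2],
    "fruit": ["apple", "banana", "cherry"],
    "price": [1.5, 0.25, 4.0],
})

# .drop returns a new dataframe; `toy` itself is unchanged unless we reassign it
trimmed = toy.drop(columns=["row_id"])
print(trimmed.columns.tolist())
```

Because `.drop` returns a new dataframe by default, you need to assign the result to a variable to keep the change.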

In [None]:
baby = ...
baby

In [None]:
grader.check("q1")

## Name Popularity

Some names gain or lose popularity because of cultural phenomena, such as a political figure coming to power or a successful athlete or entertainer in the prime years of their career.

Let's look at an example.

**Question 2.** Subset the `baby` dataframe to include only the observations where the name is *Kanye*.

In [None]:
kanye = ...
kanye.head()

In [None]:
grader.check("q2")

We want to get an idea of how the popularity of this name has changed over time. To do this we need the number of babies born that were named *Kanye* for each year.

**Question 3.** Create a dataframe with two columns, **Year** and **Count**. In the **Year** column we'll have the year and in the **Count** column we'll have the number of babies born in that year that were named *Kanye*.

For example, your dataframe should look like this:

|     | Year | Count |
| --- | ---- | -----:|
|**0**| 2003 | 5     |
|**1**| 2004 | 124   |
|**2**| 2005 | 42    |
|**3**| 2006 | 18    |
|**4**| 2014 | 5     |

In [None]:
kanye_count = ...
kanye_count

In [None]:
grader.check("q3")

<!-- BEGIN QUESTION -->

**Question 4.** Create a line plot to visualize how the popularity of the name *Kanye* has changed over time. Be sure to give your plot a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 5.** What do you notice about this plot? What might be the cause of increases or decreases in the popularity of the name *Kanye*?

**Note:** Be sure to mention any information you may have gotten from other sources (make sure you cite your sources).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Another cultural phenomenon was the "Karen" meme. Read the BBC article [What exactly is a "Karen" and where did the meme come from?](https://www.bbc.com/news/world-53588201). Then we'll use the `babynames` dataset to investigate the change in popularity of the name *Karen*.

<!-- BEGIN QUESTION -->

**Question 6.** Create a line plot to visualize how the popularity of the name *Karen* has changed over time. Be sure to give your plot a title and label your axes. 

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 7.** What do you notice about this plot? What might be the cause of increases or decreases in the popularity of the name *Karen*? 

**Note:** Be sure to mention any information you may have gotten from other sources (make sure you cite your sources).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## North Carolina Names

**Question 8.** Create a dataframe named `nc` that only contains the names from North Carolina. 

In [None]:
nc = ...
nc.head()

In [None]:
grader.check("q8")

To count the number of instances of each unique value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. 

**Note:** We are not computing the number of babies but instead the number of names (rows in the table) for each year.
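
To see what `value_counts()` returns, here is a small sketch on a toy dataframe. The `pets` dataframe is hypothetical and only meant to show that each row counts once, no matter what the other columns contain.

```python
import pandas as pd

# Toy dataframe (hypothetical data): one row per pet
pets = pd.DataFrame({
    "owner": ["Ana", "Ben", "Cy", "Dee", "Eli"],
    "animal": ["cat", "dog", "cat", "fish", "cat"],
})

# Count how many rows have each unique value in the "animal" column
animal_counts = pets["animal"].value_counts()
print(animal_counts)
```

The result is a `Series` whose index holds the unique values and whose values are the row counts, sorted from most to least frequent by default.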

Run the cell below.

In [None]:
nc["Year"].value_counts().sort_values()

<!-- BEGIN QUESTION -->

**Question 9.** As the years go by, there seems to be an increase in the variety of names given to babies born in North Carolina. Why do you think this is happening?

**Note:** Be sure to mention any information you may have gotten from other sources (make sure you cite your sources).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Is this trend the same for male and female baby names? Are we more likely to be creative with male names or female names? Let's find out.

<!-- BEGIN QUESTION -->

**Question 10.** Count the number of different names for each `Sex` in the `nc` dataframe.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 11.** Based on your results from **Question 10**, do you think we are more likely to be creative with male names or female names? Why do you think this is the case?

**Note:** Be sure to mention any information you may have gotten from other sources (make sure you cite your sources).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Groupby

Before we jump into using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function in Pandas, let's recap how grouping works in general for tabular data through a guided set of questions based on a small toy dataset of movies and genres. 

**Note:** If you want to see a visual of how grouping of data works, here is a link to an animation [Groupby Animation](http://www.ds100.org/sp18/assets/lectures/lec03/03-groupby_and_pivot.pdf)

**Problem Setting:** In the summer of 2018, there were a lot of good and bad movies that came out. Below is a dataframe with 5 columns: name of the movie as a `string`, the genre of the movie as a `string`, the first name of the director of the movie as a `string`, the average rating out of 10 on Rotten Tomatoes as an `integer`, and the total gross revenue made by the movie as an `integer`. The point of the guided questions below is to understand how the grouping of data works in general, **not** how grouping works in code. 

Below is the `movies` dataframe we are using, imported from the `movies.csv` file.

Run the cell below.

In [None]:
movies = pd.read_csv('data/movies.csv')
movies.head()

If we grouped the `movies` dataframe above by `genre`, how many groups would be in the output and what would be the groups? 

**Question 12.**  Assign `num_groups` to the number of groups created and fill in `genre_list` as a list containing the names of genres as strings that represent the groups. Make sure your list is sorted.

In [None]:
...
num_groups = ...
genre_list = ...

In [None]:
grader.check("q12")

**Question 13.**  Whenever we group tabular data, it's usually the case that we need to aggregate values from the ungrouped column(s). If we were to group the `movies` dataframe above by `genre`, which column(s) in the `movies` dataframe would it make sense to aggregate if we were interested in finding how well each genre did in the eyes of people? Fill in `agg_cols` with the column name(s) as a list. Make sure your list is sorted alphabetically.

In [None]:
agg_cols = ...
agg_cols

In [None]:
grader.check("q13")

Now, let's see `groupby` in action, instead of keeping everything abstract. To aggregate data in Pandas, we use the `.groupby()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). 

The code below will group the `movies` dataframe by `genre` and find the average revenue and rating for each genre.

Run the cell below.

In [None]:
movies.loc[:, ['genre', 'rating', 'revenue']].groupby('genre').mean()

Notice that the index of the dataframe is genre. If we wanted to change it back to integers (the default index), we can use the `.reset_index()` method.
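
The same pattern can be sketched on a toy dataframe. The `sales` dataframe below is invented for illustration; it shows the grouped result with the grouping column as the index, and then the same result after `.reset_index()`.

```python
import pandas as pd

# Toy dataframe (hypothetical data): units sold and price per store visit
sales = pd.DataFrame({
    "store": ["north", "south", "north", "south", "east"],
    "units": [10, 4, 20, 6, 7],
    "price": [2.0, 3.0, 4.0, 5.0, 6.0],
})

# Group by store and average the remaining numeric columns;
# the group labels become the index of the result
by_store = sales.groupby("store").mean()

# Move the group labels back into a regular column with a default integer index
flat = by_store.reset_index()
print(flat)
```

After `.reset_index()`, `store` is an ordinary column again, which makes it easier to sort, filter, or merge on later.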

In [None]:
movies.loc[:, ['genre', 'rating', 'revenue']].groupby('genre').mean().reset_index()

**Question 14.** Let's move back to baby names and specifically, the `nc` dataframe. Find the sum of `Count` for each `Name` in the `nc` table. Make sure the dataframe is sorted by the **Name** column.

For example, the dataframe should look like this:

| | Name | Count |
|-|------|-------:|
|**0**|Aaden|132|
|**1**|Aadhya|74|
|**2**|Aadya|33|
|**3**|Aahana|6|
|**4**|Aaiden|5|

**Note:** In this question we are now computing the number of registered babies with a given name.

In [None]:
nc_name_count = ...
nc_name_count.head()

In [None]:
grader.check("q14")

## Grouping Multiple Columns

Let's move back to the `movies` dataframe. 

Which of the following lines of code 

1. `movies.groupby('revenue')[['genre', 'rating']].mean()`

2. `movies.groupby(['genre', 'rating'])['revenue'].mean()`

3. `pd.pivot_table(index = 'rating', columns = 'genre', values = 'revenue', aggfunc = np.mean)`

4. `pd.pivot_table(movies, index = 'genre', columns = 'rating', values = 'revenue', aggfunc = np.mean)`


will output the following dataframe? 

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>rating</th>
      <th>5</th>
      <th>6</th>
      <th>7</th>
      <th>8</th>
    </tr>
    <tr>
      <th>genre</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Action &amp; Adventure</th>
      <td>208681866.0</td>
      <td>129228350.0</td>
      <td>318344544.0</td>
      <td>6708147.0</td>
    </tr>
    <tr>
      <th>Animation</th>
      <td>374408165.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Comedy</th>
      <td>55383976.0</td>
      <td>30561590.0</td>
      <td>NaN</td>
      <td>111705055.0</td>
    </tr>
    <tr>
      <th>Drama</th>
      <td>NaN</td>
      <td>17146165.5</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Horror</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>68765655.0</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Science Fiction &amp; Fantasy</th>
      <td>NaN</td>
      <td>312674899.0</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>

**Question 15.** Assign your answer (hard-coded as a string) as either 1, 2, 3, or 4 to the variable `q15_ans`. 

**Note:** Recall that the arguments to `pd.pivot_table` are as follows: `data` is the input dataframe, `index` includes the values we use as rows, `columns` are the columns of the pivot table, `values` are the values in the pivot table, and `aggfunc` is the aggregation function that we use to aggregate `values`.
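
As a rough illustration of how these arguments fit together, here is a sketch on a toy dataframe. The `orders` dataframe and its column names are hypothetical, and the aggregation is passed as the string `"mean"`.

```python
import pandas as pd

# Toy dataframe (hypothetical data): one row per drink order
orders = pd.DataFrame({
    "city": ["Raleigh", "Raleigh", "Durham", "Durham", "Raleigh"],
    "drink_size": ["S", "L", "S", "S", "S"],
    "total": [10.0, 30.0, 20.0, 60.0, 50.0],
})

# Rows come from `index`, columns from `columns`,
# and each cell is the mean of `values` for that row/column pair
table = pd.pivot_table(
    orders,
    index="city",
    columns="drink_size",
    values="total",
    aggfunc="mean",
)
print(table)
```

Combinations that never occur in the data, such as a large drink in Durham here, show up as `NaN`.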


**Hint:** Click [here](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) to look at the documentation for the `pivot_table` function.

In [None]:
q15_ans = ...

In [None]:
grader.check("q15")

## Merging

Time to put everything together.
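
As a warm-up, here is a minimal sketch of `pd.merge` on two toy dataframes. The dataframes `students` and `rooms` are made up for illustration; the sketch assumes an inner merge, which keeps only the keys that appear in both tables, and uses `left_on` and `right_on` because the key columns have different names.

```python
import pandas as pd

# Toy dataframes (hypothetical data)
students = pd.DataFrame({
    "student": ["Ana", "Ben", "Cy"],
    "club": ["chess", "robotics", "chess"],
})
rooms = pd.DataFrame({
    "club_name": ["chess", "drama"],
    "room": [101, 202],
})

# Inner merge: keep only rows whose club appears in both tables
joined = pd.merge(
    students,
    rooms,
    how="inner",
    left_on="club",
    right_on="club_name",
)
print(joined)
```

Ben's row is dropped because `robotics` has no match in `rooms`, and the `drama` room is dropped because no student is in that club.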

**Question 16.** Merge `movies` and `nc_name_count` to find the number of registered baby names for each director using [`pd.merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge). Only include names that appear in both `movies` and `nc_name_count`.

Your first row should look something like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>director</th>
      <th>genre</th>
      <th>movie</th>
      <th>rating</th>
      <th>revenue</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>David</td>
      <td>Action &amp; Adventure</td>
      <td>Deadpool 2</td>
      <td>7</td>
      <td>318344544</td>
      <td>91158</td>
    </tr>
  </tbody>
</table>


In [None]:
merged_movies = ...
merged_movies

In [None]:
grader.check("q16")

**Question 17.** Were there any directors in the original `movies` table that did not get included in the `merged_movies` dataframe? If so, how many are there, and what are their names? Save the names of these directors (if there are any) to a list named `missing_directors`. Make sure the list is sorted. If there are none, leave the list empty.

In [None]:
...
missing_directors = ...
missing_directors

In [None]:
grader.check("q17")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)