In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

# Lab 05: Pandas Continued

Welcome to Lab 05! In this lab we will continue our discussion of [Pandas](https://pandas.pydata.org/), and you will learn about:

* Grouping dataframes

* Merging dataframes

To receive credit for a lab, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

Run the cell below.

In [None]:
import pandas as pd
import numpy as np

# 1. Dataframe Methods

Read in the `baby_names.csv` as a dataframe named `baby_names`.

In [None]:
baby_names = pd.read_csv('data/baby_names.csv', index_col = 0)

**Question 1.** Create a dataframe named `nc` that only contains the names from North Carolina. 
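As a reminder of how row filtering works, here is a minimal sketch on a made-up table. The `toy` dataframe and its `City` and `Temp` columns are hypothetical and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data; these column names are illustrative only.
toy = pd.DataFrame({
    "City": ["Austin", "Boston", "Austin", "Denver"],
    "Temp": [98, 71, 101, 85],
})

# Comparing a column to a value yields a boolean Series (one True/False per row).
is_austin = toy["City"] == "Austin"

# Indexing with that mask keeps only the rows where the mask is True.
austin_rows = toy[is_austin]
print(austin_rows)
```

The same pattern should apply to any column and any comparison, as long as the mask has one entry per row.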


In [None]:
nc = ...
nc.head()

In [None]:
grader.check("q1")

To count the number of instances of each unique value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. 
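Here is a small sketch of `value_counts()` on made-up data. The `toy` dataframe and its `color` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: six rows with three distinct colors.
toy = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# value_counts() returns a Series: the index holds the unique values,
# the entries hold how many rows carry each value (largest count first).
color_counts = toy["color"].value_counts()
print(color_counts)
```

Note that the counts are numbers of rows, so they always add up to the length of the original column (when there are no missing values).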

**Question 2.** Count the number of different names for each `Year` in `NC` (North Carolina) from the `nc` DataFrame created in **Question 1**.

**Note:** We are not computing the number of babies but instead the number of names (rows in the table) for each year.


In [None]:
num_of_names_per_year = ...
num_of_names_per_year

In [None]:
grader.check("q2")

**Question 3.** Count the number of different names for each `Sex` in `NC`.


In [None]:
num_of_names_per_gender = ...
num_of_names_per_gender

In [None]:
grader.check("q3")

# 2. Groupby

Before we jump into using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function in Pandas, let's recap how grouping works in general for tabular data through a guided set of questions based on a small toy dataset of movies and genres. 

**Note:** For a visual of how grouping works, see this [Groupby Animation](http://www.ds100.org/sp18/assets/lectures/lec03/03-groupby_and_pivot.pdf).
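If it helps to see the idea spelled out, grouping can be thought of as "split the rows by key, then aggregate each pile". Below is a rough sketch of that idea in plain Python, without Pandas. The `records` list is made up for illustration.

```python
from collections import defaultdict

# Hypothetical (fruit, price) records standing in for table rows.
records = [("apple", 3), ("pear", 5), ("apple", 7), ("plum", 2), ("pear", 1)]

# Split: collect the values that share the same key into one list per key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply and combine: reduce each group's values to a single number (here, the mean).
group_means = {key: sum(values) / len(values) for key, values in groups.items()}

print(len(groups))   # number of groups = number of distinct keys
print(group_means)
```

The number of groups equals the number of distinct key values, and each group contributes one row to the aggregated result.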

**Problem Setting:** Many movies, both good and bad, came out in the summer of 2018. Below is a dataframe with 5 columns: the name of the movie as a `string`, the genre of the movie as a `string`, the first name of the director of the movie as a `string`, the average rating out of 10 on Rotten Tomatoes as an `integer`, and the total gross revenue made by the movie as an `integer`. The point of the guided questions below is to understand how grouping of data works in general, **not** how grouping works in code.

Below is the `movies` dataframe we are using, imported from the `movies.csv` file.

Run the cell below.

In [None]:
movies = pd.read_csv('data/movies.csv')
movies

If we grouped the `movies` dataframe above by `genre`, how many groups would be in the output and what would be the groups? 

**Question 4.**  Assign `num_groups` to the number of groups created and fill in `genre_list` as a list containing the names of genres as strings that represent the groups.

In [None]:
num_groups = ...
genre_list = ...

In [None]:
grader.check("q4")

**Question 5.** Whenever we group tabular data, it is usually the case that we need to aggregate values from the ungrouped column(s). If we were to group the `movies` dataframe above by `genre`, which column(s) in the `movies` dataframe would it make sense to aggregate if we were interested in finding how well each genre did in the eyes of people? Fill in `agg_cols` with the column name(s) as a list.


In [None]:
agg_cols = ...
agg_cols

In [None]:
grader.check("q5")

Now, let's see `groupby` in action instead of keeping everything abstract. To aggregate data in Pandas, we use the `.groupby()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

The code below will group the `movies` dataframe by `genre` and find the average revenue and rating for each genre.

Run the cell below.

In [None]:
movies.loc[:, ['genre', 'rating', 'revenue']].groupby('genre').mean()

**Question 6.** Let's move back to baby names and specifically, the `nc` dataframe. Find the sum of `Count` for each `Name` in the `nc` table. You should use `df.groupby("col_name")` together with `.sum()`, selecting the `Count` column so that your result is a Pandas Series rather than a dataframe.

**Note:** In this question we are now computing the number of registered babies with a given name.
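Whether a grouped aggregation returns a dataframe or a Series depends on whether you select a single column first. Here is a sketch on made-up data; the `toy` dataframe and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "team": ["a", "b", "a", "b", "c"],
    "year": [2019, 2019, 2020, 2020, 2020],
    "points": [10, 4, 6, 1, 9],
})

# Aggregating the whole grouped object sums every remaining column
# (including ones like `year` where a sum is not meaningful) and gives a DataFrame.
as_frame = toy.groupby("team").sum()

# Selecting one column before aggregating gives a Series indexed by the group key.
as_series = toy.groupby("team")["points"].sum()

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series)
```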


In [None]:
count_for_names = ...
count_for_names

In [None]:
grader.check("q6")

**Question 7.** Find the sum of `Count` for each female name after the year 2000 in North Carolina, sorted in **descending order**. Assign the result to `nc_female_name_count`; it should be a Pandas Series.
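Questions like this one usually chain three steps: filter rows, group and aggregate, then sort. Here is a sketch of that chain on made-up data; the `toy` dataframe and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "item": ["pen", "ink", "pen", "pad", "ink", "pad"],
    "year": [2018, 2019, 2020, 2020, 2021, 2021],
    "sold": [5, 2, 7, 4, 9, 1],
})

# Step 1: keep only rows strictly after 2019 (note that > excludes 2019 itself).
recent = toy[toy["year"] > 2019]

# Step 2: total `sold` per item. Step 3: sort the totals from largest to smallest.
sold_by_item = recent.groupby("item")["sold"].sum().sort_values(ascending=False)
print(sold_by_item)
```

To filter on two conditions at once, the masks can be combined with `&`, with each condition wrapped in its own parentheses.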


In [None]:
...
nc_female_name_count

In [None]:
grader.check("q7")

# 3. Grouping Multiple Columns

Let's move back to the `movies` dataframe. 

Which of the following lines of code 

1. `movies.groupby('revenue')[['genre', 'rating']].mean()`

2. `movies.groupby(['genre', 'rating'])['revenue'].mean()`

3. `pd.pivot_table(movies, index = 'rating', columns = 'genre', values = 'revenue', aggfunc = np.mean)`

4. `pd.pivot_table(movies, index = 'genre', columns = 'rating', values = 'revenue', aggfunc = np.mean)`


will output the following dataframe? 

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>rating</th>
      <th>5</th>
      <th>6</th>
      <th>7</th>
      <th>8</th>
    </tr>
    <tr>
      <th>genre</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Action &amp; Adventure</th>
      <td>208681866.0</td>
      <td>129228350.0</td>
      <td>318344544.0</td>
      <td>6708147.0</td>
    </tr>
    <tr>
      <th>Animation</th>
      <td>374408165.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Comedy</th>
      <td>55383976.0</td>
      <td>30561590.0</td>
      <td>NaN</td>
      <td>111705055.0</td>
    </tr>
    <tr>
      <th>Drama</th>
      <td>NaN</td>
      <td>17146165.5</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Horror</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>68765655.0</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Science Fiction &amp; Fantasy</th>
      <td>NaN</td>
      <td>312674899.0</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>

**Question 8.** Assign your answer (hard-coded) as either 1, 2, 3, or 4 to the variable `q8_answer`. 

**Note:** Recall that the arguments to `pd.pivot_table` are as follows: `data` is the input dataframe, `index` includes the values we use as rows, `columns` are the columns of the pivot table, `values` are the values in the pivot table, and `aggfunc` is the aggregation function that we use to aggregate `values`.


**Hint:** See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) for the `pivot_table` function.
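To see how `index`, `columns`, and `values` shape a pivot table, here is a sketch on made-up data. The `toy` dataframe and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: note there is no ("north", "spring") row.
toy = pd.DataFrame({
    "store": ["north", "south", "south", "south"],
    "season": ["fall", "fall", "fall", "spring"],
    "sales": [10, 30, 50, 40],
})

# Rows come from `index`, column headers from `columns`,
# and each cell is `aggfunc` applied to the matching `values`.
table = pd.pivot_table(toy, index="store", columns="season",
                       values="sales", aggfunc="mean")
print(table)
```

Combinations that never occur in the data (here, the north store in spring) show up as `NaN` in the result.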

In [None]:
q8_answer = ...
q8_answer

In [None]:
grader.check("q8")

# 4. Merging

Time to put everything together. 

**Question 9.** Merge `movies` and `count_for_names` using [`pd.merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge) to find, for each director, the number of registered babies who share the director's first name. Only include names that appear in both `movies` and `count_for_names`.

**Hint:** You might need to convert the `count_for_names` series to a dataframe. To do this, take a look at the `to_frame` method of a series ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)).
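Here is one possible way to combine `to_frame` with `pd.merge`, sketched on made-up data. The `people` dataframe, the `scores` series, and all column names are hypothetical; other approaches (for example, merging on the index with `right_index=True`) should work as well.

```python
import pandas as pd

# Hypothetical toy data.
people = pd.DataFrame({
    "first": ["Ann", "Bo", "Cy"],
    "job": ["chef", "pilot", "nurse"],
})

# A Series indexed by name, similar in shape to a groupby aggregation result.
scores = pd.Series([7, 3], index=pd.Index(["Ann", "Cy"], name="name"), name="score")

# to_frame() turns the Series into a one-column DataFrame;
# reset_index() moves the index into an ordinary column called "name".
scores_df = scores.to_frame().reset_index()

# An inner merge keeps only keys present in both tables.
merged = pd.merge(people, scores_df, how="inner", left_on="first", right_on="name")
print(merged)
```

Because `left_on` and `right_on` name different columns, both key columns appear in the result.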

Your first row should look something like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>director</th>
      <th>genre</th>
      <th>movie</th>
      <th>rating</th>
      <th>revenue</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>David</td>
      <td>Action &amp; Adventure</td>
      <td>Deadpool 2</td>
      <td>7</td>
      <td>318344544</td>
      <td>99158</td>
    </tr>
  </tbody>
</table>


In [None]:
merged_movies = ...
merged_movies

**Question 10.** How many directors in the original `movies` table did not get included in the `merged_movies` dataframe? Assign your answer (hard-coded) as a number to `q10_answer`.
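One way to inspect which rows an inner merge drops is a left merge with `indicator=True`, which adds a `_merge` column describing where each row came from. The sketch below uses made-up tables; `left`, `right`, and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: keys "b" and "d" have no match in `right`.
left = pd.DataFrame({"key": ["a", "b", "c", "d"], "x": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["a", "c"], "y": [10, 30]})

# Inner merge: only keys found in both tables survive.
inner = pd.merge(left, right, on="key", how="inner")

# Left merge with indicator: every left row is kept and labeled.
audit = pd.merge(left, right, on="key", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(left) - len(inner))
print(unmatched)
```

The row-count difference matches the number of unmatched rows here only because each key appears at most once in `right`; with duplicate keys on the right, an inner merge can also add rows.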

In [None]:
movies.shape[0]-merged_movies.shape[0]

In [None]:
q10_answer = ...
q10_answer

In [None]:
grader.check("q10")

<!-- BEGIN QUESTION -->

**Question 11.** In 1-2 sentences explain why some directors in the `movies` dataframe were left out of the `merged_movies` dataframe.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. Submit this .zip file for the assignment through Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)