In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

# Lab 04: Pandas Continued

In this lab, we will continue our discussion of [Pandas](https://pandas.pydata.org/), and you will learn about:

* Grouping dataframes
* Merging dataframes

To receive credit for a lab, answer all questions correctly and submit before the deadline.

**Due Date:** Wednesday, February 24, 2021 at 7:00 p.m.

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

**Note:** In this notebook a custom figure size has been configured. Click [here](https://matplotlib.org/users/customizing.html) to read the documentation about customizing aspects of matplotlib.

Run the cell below.

In [113]:
import pandas as pd
import numpy as np

import otter
grader = otter.Notebook()

Read in the `baby_names.csv` as a dataframe named `baby_names`.

In [114]:
baby_names = pd.read_csv('baby_names.csv', index_col = 0)

**Question 1.1.** Create a dataframe named `nc` that only contains the names from North Carolina. 

<!--
BEGIN QUESTION
name: q1_1
manual: false
-->

In [119]:
nc = ...
nc 

In [None]:
grader.check("q1_1")

To count the number of instances of each unique value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. 
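
As a minimal sketch of `value_counts()`, the example below uses a small made-up table rather than `baby_names.csv`. The `pets` dataframe and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data, unrelated to baby_names.csv
pets = pd.DataFrame({
    "owner": ["Ann", "Bob", "Cal", "Dee", "Eve"],
    "species": ["cat", "dog", "cat", "cat", "dog"],
})

# Count how many rows hold each distinct value of the "species" column.
# The result is a Series indexed by the distinct values, largest count first.
species_counts = pets["species"].value_counts()
print(species_counts)
```

Here `species_counts["cat"]` is 3 and `species_counts["dog"]` is 2, because each row contributes one count to its value.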

**Question 1.2.** Count the number of different names for each Year in `NC` (North Carolina) from the `nc` DataFrame created in **Question 1.1**.

**Note:** We are not computing the number of babies but instead the number of names (rows in the table) for each year.

<!--
BEGIN QUESTION
name: q1_2
manual: false
-->

In [121]:
num_of_names_per_year = ...
num_of_names_per_year

In [None]:
grader.check("q1_2")

**Question 1.3.** Count the number of different names for each gender in `NC`.

<!--
BEGIN QUESTION
name: q1_3
manual: false
-->

In [126]:
num_of_names_per_gender = ...
num_of_names_per_gender

In [None]:
grader.check("q1_3")

## 2. Groupby

Before we jump into using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function in Pandas, let's recap how grouping works in general for tabular data through a guided set of questions based on a small toy dataset of movies and genres. 

**Note:** For a visual explanation of how grouping works, see this [Groupby Animation](http://www.ds100.org/sp18/assets/lectures/lec03/03-groupby_and_pivot.pdf).

**Problem Setting:** In the summer of 2018, many movies came out, some good and some bad. Below is a dataframe with 5 columns: the name of the movie as a `string`, the genre of the movie as a `string`, the first name of the director of the movie as a `string`, the average rating out of 10 on Rotten Tomatoes as an `integer`, and the total gross revenue made by the movie as an `integer`. The point of the guided questions below (Questions 2.1 and 2.2) is to understand how grouping of data works in general, **not** how grouping works in code. We will turn to how grouping works in Pandas after Question 2.2.

Below is the `movies` dataframe we are using, imported from the `movies.csv` file.

Run the cell below.

In [131]:
movies = pd.read_csv('movies.csv')
movies

<!-- BEGIN QUESTION -->

If we grouped the `movies` dataframe above by `genre`, how many groups would be in the output and what would be the groups? 

**Question 2.1.**  Assign `num_groups` to the number of groups created (hard-code) and fill in `genre_list` as a list containing the names of genres as strings that represent the groups.

<!--
BEGIN QUESTION
name: q2_1
manual: true
-->

In [133]:
num_groups = ...
genre_list = ...

In [None]:
grader.check("q2_1")

<!-- END QUESTION -->

**Question 2.2.** Whenever we group tabular data, it is usually the case that we need to aggregate values from the ungrouped column(s). If we were to group the `movies` dataframe above by `genre`, which column(s) in the `movies` dataframe would it make sense to aggregate if we were interested in finding how well each genre did in the eyes of people? Fill in `agg_cols` with the column name(s) as a list.

<!--
BEGIN QUESTION
name: q2_2
manual: false
-->

In [136]:
agg_cols = ...
agg_cols

In [None]:
grader.check("q2_2")

Now, let's see `groupby` in action instead of keeping everything abstract. To group and aggregate data in Pandas, we use the `.groupby()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html).

The code below will group the `movies` dataframe by `genre` and find the average revenue and rating for each genre.

Run the cell below.

In [139]:
movies.loc[:, ['genre', 'rating', 'revenue']].groupby('genre').mean()
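
The same pattern can be sketched on a small made-up table to show what type of object comes back. The `sales` dataframe and its columns are hypothetical and are not the contents of `movies.csv`.

```python
import pandas as pd

# Hypothetical toy data, not the contents of movies.csv
sales = pd.DataFrame({
    "store": ["north", "south", "north", "south", "north"],
    "units": [10, 4, 6, 8, 2],
    "price": [2.0, 3.0, 4.0, 5.0, 6.0],
})

# Aggregating all remaining columns returns a DataFrame indexed by group
mean_by_store = sales.groupby("store").mean()

# Selecting a single column before aggregating returns a Series instead
units_by_store = sales.groupby("store")["units"].sum()

print(type(mean_by_store).__name__)   # DataFrame
print(type(units_by_store).__name__)  # Series
```

In this sketch, the group labels (`"north"` and `"south"`) become the index of the result in both cases.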

**Question 2.3.** Let's move back to baby names, specifically the `nc` dataframe. Find the sum of `Count` for each `Name` in the `nc` table. You should use `df.groupby("col_name").sum()` and your result should be a Pandas Series.

**Note:** In this question we are now computing the number of registered babies with a given name.

<!--
BEGIN QUESTION
name: q2_3
manual: false
-->

In [140]:
count_for_names = ...
count_for_names

In [None]:
grader.check("q2_3")

**Question 2.4.** Find the sum of `Count`, in descending order, for each female name after the year 2000 (`>2000`) in North Carolina. Your result should be a Pandas Series.

<!--
BEGIN QUESTION
name: q2_4
manual: false
-->

In [145]:
...
nc_female_name_count

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

## 3. Grouping Multiple Columns

Let's move back to the `movies` dataframe. 

Which of the following lines of code 

1. `pd.pivot_table(data = movies, index = 'genre', columns = 'rating', values = 'revenue', aggfunc = np.mean)`

2. `movies.groupby(['genre', 'rating'])['revenue'].mean()`

3. `pd.pivot_table(data = movies, index = 'rating', columns = 'genre', values = 'revenue', aggfunc = np.mean)`

4. `movies.groupby('revenue')[['genre', 'rating']].mean()`


will output the following dataframe? 

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>rating</th>
      <th>5</th>
      <th>6</th>
      <th>7</th>
      <th>8</th>
    </tr>
    <tr>
      <th>genre</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Action &amp; Adventure</th>
      <td>208681866.0</td>
      <td>129228350.0</td>
      <td>318344544.0</td>
      <td>6708147.0</td>
    </tr>
    <tr>
      <th>Animation</th>
      <td>374408165.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Comedy</th>
      <td>55383976.0</td>
      <td>30561590.0</td>
      <td>NaN</td>
      <td>111705055.0</td>
    </tr>
    <tr>
      <th>Drama</th>
      <td>NaN</td>
      <td>17146165.5</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Horror</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>68765655.0</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Science Fiction &amp; Fantasy</th>
      <td>NaN</td>
      <td>312674899.0</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>

**Question 3.1.** Assign your answer (hard-coded) as either 1, 2, 3, or 4 to the variable `q3_1_answer`. 

**Note:** Recall that the arguments to `pd.pivot_table` are as follows: `data` is the input dataframe, `index` includes the values we use as rows, `columns` are the columns of the pivot table, `values` are the values in the pivot table, and `aggfunc` is the aggregation function that we use to aggregate `values`.
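
To make the roles of these arguments concrete, here is a minimal sketch on a made-up table. The `orders` dataframe and its column names are hypothetical, and the string `"mean"` is used for `aggfunc` to name the mean aggregation.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the movies dataframe
orders = pd.DataFrame({
    "region": ["east", "east", "west", "west", "east"],
    "product": ["pen", "ink", "pen", "pen", "pen"],
    "amount": [10, 20, 30, 50, 30],
})

table = pd.pivot_table(
    data=orders,
    index="region",     # distinct regions become the row labels
    columns="product",  # distinct products become the column labels
    values="amount",    # the column whose values fill the cells
    aggfunc="mean",     # how to combine rows that share a cell
)
print(table)
```

A (row, column) combination that never occurs in the data, such as `west` with `ink` here, shows up as `NaN` in the pivot table.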

<!--
BEGIN QUESTION
name: q3_1
manual: true
-->

In [150]:
q3_1_answer = ...
q3_1_answer

<!-- END QUESTION -->

## 4. Merging

Time to put everything together. 

**Question 4.1.** Merge `movies` and `count_for_names` to find the number of registered baby names for each director using [`pd.merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge). Only include names that appear in both `movies` and `count_for_names`.

**Hint:** You might need to convert the `count_for_names` series to a dataframe. To do this take a look at the `to_frame` method of a series. 
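
As a rough illustration of the hint, the sketch below converts a Series to a dataframe and merges it with another table, using made-up data. The names `totals`, `people`, and their columns are hypothetical, and this is only one of several ways to line up the merge keys.

```python
import pandas as pd

# Hypothetical toy data: a Series of totals indexed by name
totals = pd.Series({"Ada": 5, "Bo": 7, "Cy": 1}, name="total")
totals.index.name = "name"

people = pd.DataFrame({
    "person": ["Ada", "Cy", "Zed"],
    "city": ["Oslo", "Lima", "Rome"],
})

# to_frame() turns the Series into a one-column DataFrame;
# reset_index() moves the index labels into a regular "name" column
totals_df = totals.to_frame().reset_index()

# how="inner" keeps only the keys that appear in both tables
merged = pd.merge(people, totals_df, how="inner",
                  left_on="person", right_on="name")
print(merged)
```

In this sketch, `"Zed"` has no match in `totals` and `"Bo"` has no match in `people`, so neither appears in `merged`. Both key columns (`person` and `name`) are kept because they have different names.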

**Note**: It is ok if you have two separate columns with names instead of just one column.

Your first row should look something like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>director</th>
      <th>genre</th>
      <th>movie</th>
      <th>rating</th>
      <th>revenue</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>David</td>
      <td>Action &amp; Adventure</td>
      <td>Deadpool 2</td>
      <td>7</td>
      <td>318344544</td>
      <td>99158</td>
    </tr>
  </tbody>
</table>

<!--
BEGIN QUESTION
name: q4_1
manual: false
-->

In [151]:
merged_df = ...
merged_df

<!-- BEGIN QUESTION -->

**Question 4.2.** How many directors in the original `movies` table did not get included in the `merged_df` dataframe? Assign your answer (hard-coded) as a number in `q4_2_answer`.

<!--
BEGIN QUESTION
name: q4_2
manual: true
-->

In [152]:
q4_2_answer = ...
q4_2_answer

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.3.** In 1-2 sentences explain your answer to Question 4.2.

<!--
BEGIN QUESTION
name: q4_3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. Submit this .zip file for the assignment to Gradescope, accessed through Canvas, for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)