# Class 7 Exercises

These exercises will help you practice the skills and concepts that you learned in today's class.

To get participation credit for today's class, make sure that you work on these exercises and then submit a screenshot or PDF of your work to the appropriate assignment page in Canvas.

___

## Datasheets for Datasets
### *The Pudding*'s Film Dialogue Data

[Google Doc for Datasheets for Datasets Group Exercise](https://docs.google.com/document/d/1EMsuxPrX4ChO2HSMy_zKidS-KQluoIfcv0_23W5BKuA/edit?usp=sharing)

The dataset that we're working with in this lesson is taken from Hannah Anderson and Matt Daniels's *Pudding* essay, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/). The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. The authors included character gender information because they wanted to contribute data to a broader conversation about how "white men dominate movie roles."

___

Our goals in this notebook:
1. First, we want to explore and clean the dataset. What are the overall trends and patterns here?
2. Then, we want to ask some specific questions. Which characters speak the most and least in the entire dataset?
3. Finally, we want to ask: Can we identify and plot the top 20 movies with the most male vs female dialogue?

## Import Pandas

To use the Pandas library, we first need to `import` it.

In [2]:
#Your Code Here

## Change Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change [Pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or columns.

In [3]:
pd.options.display.max_rows = 200

## Get Data

In [9]:
film_df = #Your Code Here

This creates a Pandas [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#DataFrame), often abbreviated as *df*, e.g., *film_df*. A DataFrame looks and acts a lot like a spreadsheet, but it has special powers and functions that we will discuss in the next few lessons.
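As a minimal sketch of how a CSV becomes a DataFrame, the example below reads a tiny, made-up CSV from a string instead of a file. The names `csv_text` and `toy_df` and the rows themselves are hypothetical and are not part of the film dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """title,character,gender,words
Film A,Alex,man,1200
Film A,Bea,woman,800
Film B,Cam,woman,450
"""

# pd.read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

# .shape reports (number of rows, number of columns)
print(toy_df.shape)
```

With a real file, the same call would take the file's path as a string instead of the `io.StringIO` object.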

## Overview

Examine the first 9 rows in the DataFrame

In [None]:
#Your Code Here

Examine a random 8 rows in the DataFrame

In [None]:
#Your code here
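For reference, here is a small sketch of the two inspection methods on made-up data: `.head(n)` returns the first `n` rows and `.sample(n)` returns `n` random rows. The DataFrame `toy_df` is hypothetical, and `random_state=0` is only there to make the random draw repeatable.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "character": ["Alex", "Bea", "Cam", "Dee", "Eli"],
    "words": [1200, 800, 450, 300, 150],
})

# First 2 rows, in their original order
first_two = toy_df.head(2)

# 2 rows chosen at random; random_state fixes the draw so it is repeatable
random_two = toy_df.sample(2, random_state=0)

print(first_two)
print(random_two)
```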

Generate information about all the columns in the data (such as the data types for each column)

In [11]:
#Your Code Here

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23047 entries, 0 to 23046
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   title                   23047 non-null  object 
 1   release_year            23047 non-null  int64  
 2   character               23047 non-null  object 
 3   gender                  23047 non-null  object 
 4   words                   23047 non-null  int64  
 5   proportion_of_dialogue  23047 non-null  float64
 6   age                     18262 non-null  float64
 7   gross                   19386 non-null  float64
 8   script_id               23047 non-null  int64  
dtypes: float64(3), int64(3), object(3)
memory usage: 1.6+ MB


Just like Python has different data types, Pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the [`.dtypes` attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).



| **Pandas Data Type** | **Explanation** |
|:--------------------:|:---------------:|
| `object`             | string          |
| `float64`            | float           |
| `int64`              | integer         |
| `datetime64`         | date time       |
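The sketch below shows how these types might appear on a made-up DataFrame (`toy_df` is hypothetical). Note that `.dtypes` is written without parentheses, and that a numeric column containing a missing value is stored as a float.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "words": [1200, 450],   # whole numbers become an integer column
    "age": [34.0, None],    # the missing value is stored as NaN in a float column
})

# One data type per column
print(toy_df.dtypes)
```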

Make a histogram of all the numerical columns in the DataFrame

In [None]:
#Your Code Here

Generate descriptive statistics for all the columns in the data 

In [13]:
#Your Code Here

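|        | title     | release_year | character | gender | words       | proportion_of_dialogue | age       | gross      | script_id   |
|:-------|:----------|:-------------|:----------|:-------|:------------|:-----------------------|:----------|:-----------|:------------|
| count  | 23047     | 23047.0      | 23047     | 23047  | 23047.0     | 23047.0                | 18262.0   | 19386.0    | 23047.0     |
| unique | 1994      |              | 17543     | 3      |             |                        |           |            |             |
| top    | Lone Star |              | Doctor    | man    |             |                        |           |            |             |
| freq   | 40        |              | 37        | 16131  |             |                        |           |            |             |
| mean   |           | 1998.132425  |           |        | 907.902634  | 0.086518               | 42.275052 | 106.735428 | 4194.804486 |
| std    |           | 14.746052    |           |        | 1399.616135 | 0.107746               | 57.912595 | 145.861933 | 2473.037601 |
| min    |           | 1929.0       |           |        | 101.0       | 0.001537               | 3.0       | 0.0        | 280.0       |
| 25%    |           | 1992.0       |           |        | 193.0       | 0.019773               | 30.0      | 22.0       | 2095.0      |
| 50%    |           | 2001.0       |           |        | 396.0       | 0.042423               | 39.0      | 56.0       | 3694.0      |
| 75%    |           | 2009.0       |           |        | 980.0       | 0.104171               | 50.0      | 136.0      | 6224.5      |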
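A summary table of this shape, mixing text statistics (`unique`, `top`, `freq`) with numeric ones (`mean`, `std`, and so on), is what `.describe(include="all")` produces. Here is a sketch on made-up data; `toy_df` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "gender": ["man", "woman", "woman"],
    "words": [1200, 800, 450],
})

# include="all" summarizes text columns as well as numeric columns
summary = toy_df.describe(include="all")
print(summary)
```

Cells that do not apply to a column, such as the mean of a text column, are filled with `NaN`.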


### ❓ What patterns or outliers do you notice?

## Filter/Subset Data

Make a filter that will check whether the age for a character is greater than 100

In [26]:
age_filter = #Your Code Here

Then use this filter to select only characters who are older than 100

In [None]:
film_df[age_filter]
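To illustrate how a filter works, here is a sketch on made-up data (the names `toy_df` and `over_100` are hypothetical). A comparison on a column produces a Series of `True`/`False` values, and placing that Series inside square brackets keeps only the `True` rows.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "character": ["Alex", "Bea", "Cam"],
    "age": [34, 2009, 51],
})

# The comparison is evaluated row by row and yields a Boolean Series
over_100 = toy_df["age"] > 100
print(over_100)

# Indexing with the Boolean Series keeps only rows where it is True
print(toy_df[over_100])
```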

## Drop Rows

To drop the characters who are supposedly older than 100 from the dataset, you will need to find the index numbers for every relevant row.

In [602]:
film_df = film_df.drop(#Your Code Here) 

Check to see whether the data has been dropped

In [None]:
film_df[age_filter]
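One way to collect the index numbers is to filter first and then read the `.index` of the filtered rows. The sketch below does this on made-up data; `toy_df`, `implausible`, and `cleaned_df` are hypothetical names.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "character": ["Alex", "Bea", "Cam"],
    "age": [34, 2009, 51],
})

# Rows with an implausible age
implausible = toy_df[toy_df["age"] > 100]

# .index holds the row labels of the filtered rows; .drop removes those rows
cleaned_df = toy_df.drop(implausible.index)
print(cleaned_df)
```

Note that `.drop` returns a new DataFrame, which is why the result is assigned to a variable.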

## Sort Values

Sort the DataFrame from the character who has the highest `proportion_of_dialogue` to the lowest.  Then examine the first 20 rows with `.head(20)` or `[:20]`.

In [None]:
film_df...#Your Code Here

Sort the DataFrame from the character who has the lowest `proportion_of_dialogue` to the highest. Then examine the first 20 rows with `.head(20)` or `[:20]`.

In [None]:
film_df...#Your Code Here

Sort the DataFrame from the character who speaks the fewest `words` to the character who speaks the most `words`. Then examine the first 20 rows with `.head(20)` or `[:20]`.

In [None]:
film_df...#Your Code Here
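As a reminder of how sorting direction is controlled, here is a sketch on made-up data (`toy_df` and the result names are hypothetical). `.sort_values()` sorts from lowest to highest by default, and `ascending=False` reverses the order.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "character": ["Alex", "Bea", "Cam", "Dee"],
    "words": [450, 1200, 150, 800],
})

# Lowest to highest is the default
fewest_first = toy_df.sort_values(by="words")

# ascending=False sorts highest to lowest
most_first = toy_df.sort_values(by="words", ascending=False)

# Keep only the first 2 rows of the sorted result
print(most_first.head(2))
```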

### ❓ What patterns do you notice here? What surprises you or doesn't surprise you?

## Groupby

Group by film title and then calculate the sum total for every column.

In [20]:
film_df.groupby...#Your Code Here

| title                      | release_year | words | proportion_of_dialogue | age    | gross  | script_id |
|:---------------------------|:-------------|:------|:-----------------------|:-------|:-------|:----------|
| (500) Days of Summer       | 26117        | 18500 | 1.0                    | 378.0  | 481.0  | 19942     |
| 10 Things I Hate About You | 23988        | 19680 | 1.0                    | 307.0  | 780.0  | 18144     |
| 12 Years a Slave           | 56364        | 19628 | 1.0                    | 712.0  | 1680.0 | 42476     |
| 12 and Holding             | 30075        | 15968 | 1.0                    | 513.0  | 0.0    | 22710     |
| 127 Hours                  | 8040         | 5145  | 1.0                    | 114.0  | 80.0   | 6080      |
| ...                        | ...          | ...   | ...                    | ...    | ...    | ...       |
| Zero Effect                | 13986        | 13927 | 1.0                    | 227.0  | 21.0   | 57106     |
| Zerophilia                 | 16040        | 16686 | 1.0                    | 160.0  | 0.0    | 30144     |
| Zodiac                     | 62217        | 14656 | 1.0                    | 1071.0 | 1271.0 | 201221    |
| eXistenZ                   | 17991        | 9447  | 1.0                    | 309.0  | 36.0   | 62757     |
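The output above contains only numeric columns. In recent versions of Pandas, a plain `.sum()` after `.groupby()` may also try to add up text columns, so the sketch below passes `numeric_only=True` to restrict the result to numbers. The data and the names `toy_df` and `totals` are made up for illustration.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "character": ["Alex", "Bea", "Cam"],
    "words": [1200, 800, 450],
    "age": [34.0, 29.0, 51.0],
})

# One row per title; numeric_only=True leaves the text column out of the sums
totals = toy_df.groupby("title").sum(numeric_only=True)
print(totals)
```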


Group by film title, select the `words` column, and then calculate the sum total for that column.

In [22]:
film_df.groupby...#Your Code Here

title
(500) Days of Summer          18500
10 Things I Hate About You    19680
12 Years a Slave              19628
12 and Holding                15968
127 Hours                      5145
                              ...  
Zero Effect                   13927
Zerophilia                    16686
Zodiac                        14656
eXistenZ                       9447
xXx                            8285
Name: words, Length: 1994, dtype: int64

Group by film title AND gender, isolate the column `words`, and then calculate the sum total for that column.

*Note: Remember that to group by multiple columns, you need to pass the column names as a list, inside square brackets `[]`.*

In [23]:
film_df.groupby...#Your Code Here

title                       gender
(500) Days of Summer        man       12762
                            woman      5738
10 Things I Hate About You  man       10688
                            woman      8992
12 Years a Slave            man       16176
                                      ...  
Zodiac                      woman      1421
eXistenZ                    man        5695
                            woman      3752
xXx                         man        7287
                            woman       998
Name: words, Length: 3936, dtype: int64
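Here is a sketch of grouping by two columns on made-up data (`toy_df` and `words_by_title_gender` are hypothetical names). The result is a Series with a two-level index, one level per grouping column.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "title": ["Film A", "Film A", "Film A", "Film B"],
    "gender": ["man", "woman", "woman", "woman"],
    "words": [1200, 800, 300, 450],
})

# The list of column names goes inside the parentheses of .groupby()
words_by_title_gender = toy_df.groupby(["title", "gender"])["words"].sum()
print(words_by_title_gender)

# A single value can be looked up with a (title, gender) pair
print(words_by_title_gender.loc[("Film A", "woman")])
```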

## Filter, Then Groupby

Filter the DataFrame for only characters labeled as `woman`. Then save this filtered DataFrame as `women_film_df`. Be sure to make a `.copy()`.

In [31]:
women_filter = #Your Code Here

In [32]:
women_film_df = film_df[women_filter].copy()

Filter the DataFrame for only characters labeled as `man`. Then save this filtered DataFrame as `men_film_df`. Be sure to make a `.copy()`.

In [33]:
men_filter = #Your Code Here

In [34]:
men_film_df = film_df[men_filter].copy()
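The sketch below, on made-up data, suggests why `.copy()` is useful: the filtered copy is an independent DataFrame, so adding a column to it leaves the original untouched. The names `toy_df`, `women_toy_df`, and `words_in_thousands` are hypothetical.

```python
import pandas as pd

# Made-up data, not the film dataset
toy_df = pd.DataFrame({
    "character": ["Alex", "Bea", "Cam"],
    "gender": ["man", "woman", "woman"],
    "words": [1200, 800, 450],
})

# Filter, then copy, so later changes do not touch toy_df
women_toy_df = toy_df[toy_df["gender"] == "woman"].copy()

# Adding a column to the copy...
women_toy_df["words_in_thousands"] = women_toy_df["words"] / 1000

# ...does not add it to the original
print(women_toy_df.columns.tolist())
print(toy_df.columns.tolist())
```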

Now group `women_film_df` by film title, select the `words` column, and sum the `words` spoken by women.

In [35]:
women_film_df.groupby...#Your Code Here

title
(500) Days of Summer          5738
10 Things I Hate About You    8992
12 Years a Slave              3452
12 and Holding                5324
127 Hours                      809
                              ... 
Zero Effect                   2216
Zerophilia                    4612
Zodiac                        1421
eXistenZ                      3752
xXx                            998
Name: words, Length: 1940, dtype: int64

Assign the resulting Series to a new variable `women_by_film`

In [36]:
women_by_film = women_film_df.groupby...#Your Code Here

With the same workflow as above, make another new variable `men_by_film`

In [37]:
men_by_film = men_film_df.groupby...#Your Code Here

## Sort

Sort `women_by_film` from the film with the most words to the film with the fewest words. Then examine the top 20 values.

In [None]:
women_by_film...#Your Code Here

Assign the top 20 films from this sorted Series to the variable `top20_women`

In [39]:
top20_women = women_by_film...#Your Code Here

With the same workflow, make a new variable `top20_men`

In [40]:
top20_men = men_by_film...#Your Code Here

### ❓ What patterns do you notice here? What surprises you or doesn't surprise you?

## Making Plots

Make a bar chart of `top20_women`. Give the chart a title, and specify a color.

In [None]:
top20_women...#Your Code Here

To save the plot, you can use `ax.figure.savefig()` and the name of the file in quotation marks.

In [None]:
ax = top20_women...#Your Code Here
ax.figure.savefig('top20_women.png')

Look in the file browser on the left and double-click the PNG file. How does it look? Uh oh!

Sometimes parts of a plot will get cut off when you save it. To fix this issue, you can use a function from the Matplotlib library called `plt.tight_layout()`, which will adjust the plot before you save it.

To use this function, you need to `import matplotlib.pyplot as plt`.

In [None]:
import matplotlib.pyplot as plt

ax = top20_women...#Your Code Here
plt.tight_layout()
ax.figure.savefig('top20_women.png')
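For comparison, here is a self-contained sketch of the same plot-then-save pattern on made-up data. The Series `toy_counts`, the title, the color, and the output file name are all hypothetical. The `matplotlib.use("Agg")` line is an assumption that lets the sketch run as a plain script without a display; inside a notebook it can be left out.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up word totals, not the film dataset
toy_counts = pd.Series({"Film A": 2000, "Film B": 450, "Film C": 1300})

# .plot() returns a Matplotlib Axes object, saved here as ax
ax = toy_counts.plot(kind="bar", title="Words per film (toy data)", color="purple")

# Adjust spacing so the axis labels are not cut off in the saved image
plt.tight_layout()

# Save into a temporary folder to avoid cluttering the working directory
out_path = os.path.join(tempfile.mkdtemp(), "toy_words_per_film.png")
ax.figure.savefig(out_path)
print(out_path)
```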