# Class 6 Exercises

___

## Dataset
### *The Pudding*'s Film Dialogue Data

The dataset that we're working with in this lesson is taken from Hannah Anderson and Matt Daniels's *Pudding* essay, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/). The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. The authors included character gender information because they wanted to contribute data to a broader conversation about how "white men dominate movie roles."

___

## Import Pandas

To use the Pandas library, we first need to `import` it.

In [1]:
import pandas as pd

## Change Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change [Pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or columns.

In [2]:
pd.options.display.max_rows = 200
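
The same options interface also controls how many columns are shown. The sketch below is only an illustration: the values 50 and 10 are arbitrary choices, and `pd.option_context` is one way to change a setting temporarily instead of for the whole session.

```python
import pandas as pd

# Show up to 200 rows and 50 columns for the rest of the session
pd.options.display.max_rows = 200
pd.options.display.max_columns = 50

# Temporarily override a setting; it reverts when the `with` block ends
with pd.option_context("display.max_rows", 10):
    rows_inside = pd.get_option("display.max_rows")

rows_after = pd.get_option("display.max_rows")
print(rows_inside, rows_after)
```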

## Get Data

In [8]:
film_df = pd.read_csv('../Exercises/Class6-Exercises/Pudding-Film-Dialogue-Clean.csv', delimiter=",", encoding='utf-8')

This creates a Pandas [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#DataFrame), often abbreviated as *df*, e.g., *film_df*. A DataFrame looks and acts a lot like a spreadsheet, but it has special powers and functions that we will discuss in the next few lessons.
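
To get a feel for what `pd.read_csv()` returns, here is a minimal sketch that reads a tiny stand-in CSV from a string rather than from the real file. The name `sample_df` and the two-row CSV text are made up for illustration (the rows borrow values from this lesson's dataset).

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file (hypothetical, for illustration only)
csv_text = """title,release_year,character,gender,words
Mulan,1998,Mulan,woman,3028
Mulan,1998,Mushu,man,4594
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the header row
print(sample_df.dtypes)         # data type that Pandas inferred for each column
```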

## Overview

To look at *n* randomly selected rows in a DataFrame, we can use a method called `.sample()`.

In [9]:
film_df.sample(10)

Unnamed: 0,title,release_year,character,gender,words,proportion_of_dialogue,age,gross,script_id
19373,Trading Places,1983,Mortimer Duke,man,712,0.078857,75.0,249.0,7423
2846,Silver Bullet,1985,Marty Coslaw,man,872,0.210934,14.0,,1329
15324,Titanic,1997,Lewis Bodine,man,649,0.051887,,1249.0,5136
16298,St. Vincent,2014,Charisse,woman,563,0.074118,,46.0,5472
9071,Mrs Brown,1997,Doctor Jenner,man,398,0.025267,71.0,17.0,2957
1672,Raiders of the Lost Ark,1981,Dr. René Belloq,man,630,0.14341,38.0,9.0,1034
18408,W.,2008,Don Evans,man,229,0.028852,37.0,30.0,6922
1338,Gremlins,1984,Pete Fountaine,man,334,0.050922,13.0,383.0,971
11616,Win Win,2011,Cindy,woman,760,0.057337,34.0,11.0,3726
12875,Jason Goes to Hell: The Final Friday,1993,Joey B.,woman,107,0.040809,31.0,33.0,4238
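
Because `.sample()` is random, each run returns different rows. If you want the same rows every time (for example, to discuss them with a classmate), one option is the `random_state` parameter. This is a sketch on a small made-up DataFrame, `demo_df`, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical name; values borrowed from the Mulan rows)
demo_df = pd.DataFrame({
    "character": ["Mulan", "Mushu", "Shan-Yu", "Chi Fu", "Ling"],
    "words": [3028, 4594, 570, 932, 448],
})

# Using the same random_state makes the "random" draw repeatable
first_draw = demo_df.sample(3, random_state=42)
second_draw = demo_df.sample(3, random_state=42)

print(first_draw.equals(second_draw))
```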


## Pandas Methods

| Pandas method | Explanation                         |
|----------|-------------------------------------|
| `.sum()`      | Sum of values                       |
| `.mean()`     | Mean of values                      |
| `.median()`   | Median of values         |
| `.min()`      | Minimum                             |
| `.max()`      | Maximum                             |
| `.mode()`     | Mode                                |
| `.std()`      | Unbiased standard deviation         |
| `.count()`    | Total number of non-blank values    |
| `.value_counts()` | Frequency of unique values |
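
As a quick sketch of how these methods behave, the example below applies several of them to the ten ages listed for *Mulan* in this dataset, one of which is missing. The variable name `ages` is just for illustration. Note that these methods skip missing values by default.

```python
import pandas as pd

# Ages from the Mulan rows of the dataset; None stands for the missing age
ages = pd.Series([69, 48, None, 55, 61, 69, 43, 35, 37, 43])

print(ages.count())   # number of non-blank values (the missing one is not counted)
print(ages.sum())     # sum of the non-missing values
print(ages.mean())    # mean of the non-missing values
print(ages.median())  # middle value when sorted
print(ages.min(), ages.max())
print(ages.mode())    # can return more than one value if there is a tie
print(ages.value_counts())  # how often each distinct age appears
```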

### ❓  How old (on average) are the characters in this dataset?

In [10]:
film_df['age'].mean()

42.2750520205892

### ❓  How old is the oldest character in the dataset?

In [11]:
film_df['age'].max()

2009.0
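
An age of 2009 is not plausible, so this value is probably an error in the data rather than a real age. One possible way to deal with it, sketched below on a few illustrative values, is to ignore ages above a cutoff before asking for the maximum. The cutoff of 120 and the names `max_plausible_age` and `plausible_ages` are assumptions made for this example.

```python
import pandas as pd

# A few ages in the style of this dataset, including the suspicious 2009.0
ages = pd.Series([75.0, 14.0, 2009.0, 71.0, None])

# Assumption: no character is older than 120, so larger values are treated as errors
max_plausible_age = 120
plausible_ages = ages[ages <= max_plausible_age]

print(ages.max())            # maximum including the suspicious value
print(plausible_ages.max())  # maximum after filtering
```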

### ❓  How young is the youngest character in the dataset?

In [12]:
film_df['age'].min()

3.0

### ❓  How many men vs. women characters are in the dataset?

In [17]:
film_df['gender'].value_counts()

man      16131
woman     6911
?            5
Name: gender, dtype: int64
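
Raw counts can be easier to compare as proportions. As a sketch, `.value_counts()` accepts `normalize=True` to return shares instead of counts. The Series below is rebuilt from the counts shown above, so it stands in for the real `gender` column. The names `gender` and `gender_shares` are only for this example.

```python
import pandas as pd

# Rebuild a gender column from the counts reported above (a stand-in for film_df['gender'])
gender = pd.Series(["man"] * 16131 + ["woman"] * 6911 + ["?"] * 5)

# normalize=True returns each value's share of the total instead of its count
gender_shares = gender.value_counts(normalize=True)

print(gender_shares)
```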

## Examine Subsets

### ❓  Who are all the characters from Monsters, Inc.?

![](https://imgr.search.brave.com/Sc0_L9RRhHrC2B_0i2XRMWpuhGCN8yKvb0hiqnexAAk/fit/877/225/ce/1/aHR0cHM6Ly90c2U0/Lm1tLmJpbmcubmV0/L3RoP2lkPU9JUC5G/MVlGNmRvZzFvNHhi/NFdVNzJYbGhBSGFF/QSZwaWQ9QXBp)

[Roz from Monsters, Inc.](https://disney.fandom.com/wiki/Roz)


Write a conditional statement that will filter the DataFrame to only show rows that have the title "Monsters, Inc."

In [15]:
title_filter = film_df['title'] == 'Monsters, Inc.'

In [16]:
film_df[title_filter]

Unnamed: 0,title,release_year,character,gender,words,proportion_of_dialogue,age,gross,script_id
20730,"Monsters, Inc.",2001,Celia,woman,399,0.037702,43.0,445.0,7991
20731,"Monsters, Inc.",2001,Charlie,man,128,0.012095,61.0,445.0,7991
20732,"Monsters, Inc.",2001,Floor Manager,man,130,0.012284,59.0,445.0,7991
20733,"Monsters, Inc.",2001,Fungus,man,220,0.020788,57.0,445.0,7991
20734,"Monsters, Inc.",2001,George Sanderso,man,148,0.013985,39.0,445.0,7991
20735,"Monsters, Inc.",2001,Henry J. Watern,man,1192,0.112633,73.0,445.0,7991
20736,"Monsters, Inc.",2001,"James P. ""Sulle",man,2625,0.248039,49.0,445.0,7991
20737,"Monsters, Inc.",2001,Mike Wazowski,man,4653,0.439667,53.0,445.0,7991
20738,"Monsters, Inc.",2001,Needleman,man,128,0.012095,35.0,445.0,7991
20739,"Monsters, Inc.",2001,Randall Boggs,man,734,0.069357,44.0,445.0,7991
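
Filters can also be combined. The sketch below uses a handful of rows from this dataset and keeps only *Monsters, Inc.* characters who speak more than 500 words. The names `demo_df`, `words_filter`, and the 500-word threshold are assumptions for illustration. Each condition needs its own variable or parentheses, and `&` means "and".

```python
import pandas as pd

# A few rows from the dataset (two films) to filter on
demo_df = pd.DataFrame({
    "title": ["Monsters, Inc.", "Monsters, Inc.", "Monsters, Inc.", "Mulan", "Mulan"],
    "character": ["Celia", "Mike Wazowski", "Randall Boggs", "Mulan", "Mushu"],
    "gender": ["woman", "man", "man", "woman", "man"],
    "words": [399, 4653, 734, 3028, 4594],
})

# Each filter is a Series of True/False values, one per row
title_filter = demo_df["title"] == "Monsters, Inc."
words_filter = demo_df["words"] > 500

# & keeps only the rows where both conditions are True
talkative_monsters = demo_df[title_filter & words_filter]

print(talkative_monsters)
```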


### ❓  What potential issues do you notice when you look closer at this data? (Hint: Look at Roz)

What do you think about The Pudding's approach to assigning gender in this dataset? What alternatives could we potentially use, if any?

Write a conditional statement that will filter the DataFrame to only show rows that have the title "Mulan."

In [32]:
title_filter = film_df['title'] == "Mulan"

In [33]:
film_df[title_filter]

Unnamed: 0,title,release_year,character,gender,words,proportion_of_dialogue,age,gross,script_id
9093,Mulan,1998,Chi Fu,man,932,0.06465,69.0,224.0,2961
9094,Mulan,1998,Chien-Po,man,168,0.011654,48.0,224.0,2961
9095,Mulan,1998,Fa Li,woman,156,0.010821,,224.0,2961
9096,Mulan,1998,Fa Zhou,man,542,0.037597,55.0,224.0,2961
9097,Mulan,1998,First Ancestor,man,238,0.016509,61.0,224.0,2961
9098,Mulan,1998,General Li,man,316,0.02192,69.0,224.0,2961
9099,Mulan,1998,Ling,man,448,0.031077,43.0,224.0,2961
9100,Mulan,1998,Mulan,woman,3028,0.210044,35.0,224.0,2961
9101,Mulan,1998,Mushu,man,4594,0.318674,37.0,224.0,2961
9102,Mulan,1998,Shan-Yu,man,570,0.039539,43.0,224.0,2961
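
One way to follow up on a subset like this is to combine a filter with `.sum()` to total the words spoken by each gender. The sketch below rebuilds only the ten *Mulan* rows shown above, so the totals cover just those rows. The names `mulan_df`, `women_filter`, and `men_filter` are chosen for this example.

```python
import pandas as pd

# The ten Mulan rows shown above (character, gender, words only)
mulan_df = pd.DataFrame({
    "character": ["Chi Fu", "Chien-Po", "Fa Li", "Fa Zhou", "First Ancestor",
                  "General Li", "Ling", "Mulan", "Mushu", "Shan-Yu"],
    "gender": ["man", "man", "woman", "man", "man",
               "man", "man", "woman", "man", "man"],
    "words": [932, 168, 156, 542, 238, 316, 448, 3028, 4594, 570],
})

# Filter by gender, then add up the words column for each subset
women_filter = mulan_df["gender"] == "woman"
men_filter = mulan_df["gender"] == "man"

women_words = mulan_df[women_filter]["words"].sum()
men_words = mulan_df[men_filter]["words"].sum()

print(women_words, men_words)
```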
