# Pandas Basics — Part 3

_I adapted this notebook from Melanie Walsh's [Pandas Basics — Part 3](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Data-Analysis/Pandas-Basics-Part3.html), which is from her online textbook [Introduction to Cultural Analytics & Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/welcome.html)_

In this lesson, we're going to introduce more fundamentals of [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html), a powerful Python library for working with tabular data like CSV files.

We will review skills learned from the last two lessons and introduce how to:

- Check for duplicate data
- Clean and transform data
- Manipulate string data
- Apply functions
- Reset index
- Bonus! Create an interactive data viz

## Dataset
### *The Pudding*'s Film Dialogue Data

```{epigraph}
Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles.
But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?
```
-- Hannah Anderson and Matt Daniels, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/)


The dataset that we're working with in this lesson is taken from Hannah Anderson and Matt Daniels's *Pudding* essay, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/). The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how "white men dominate movie roles."

Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They claim, in fact, that one of the [most frequently asked questions](https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751) about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don't. To determine character gender, they used actors' IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”

**What do we lose because of how this dataset treats gender?**

* ANSWER HERE

## Import Pandas

In [1]:
import pandas as pd

## Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change [pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or columns.

In [2]:
pd.options.display.max_rows = 100
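
As a hedged aside, the same setting can also be changed with `pd.set_option()`, or changed only temporarily with `pd.option_context()`. The sketch below uses a made-up DataFrame called `demo_df` (not the film dataset) purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in data, not the film dataset
demo_df = pd.DataFrame({'n': range(200)})

# Equivalent to pd.options.display.max_rows = 100
pd.set_option('display.max_rows', 100)

# Inside the "with" block only, None means "no row limit"
with pd.option_context('display.max_rows', None):
    rows_inside = pd.get_option('display.max_rows')
    print(demo_df)  # prints all 200 rows

# Once the block ends, the earlier setting is restored
rows_after = pd.get_option('display.max_rows')
```

Using `pd.option_context()` can be handy when you want to see every row once without changing the setting for the rest of the notebook.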

## Read in CSV File

To read in a CSV file, we will use the function `pd.read_csv()` and pass in the path to our file.

In [3]:
film_df = pd.read_csv('../docs/Pudding-Film-Dialogue-Clean.csv', delimiter=",", encoding='utf-8')

When reading in the CSV file, we also specified the `encoding` and `delimiter`. The `encoding` parameter tells pandas how the characters in the file are stored; UTF-8 is the most common choice. The `delimiter` parameter specifies the character that separates or "delimits" the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for *Comma Separated Values*.) Sometimes, however, the delimiter of a CSV file might be a tab (`\t`) or, more rarely, another character.
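
To make the `delimiter` parameter concrete, here is a minimal sketch that reads tab-separated text. The text in `tsv_text` is invented for illustration and stands in for a file on disk; `io.StringIO` lets `pd.read_csv()` treat that string like a file.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
tsv_text = "title\trelease_year\nMean Girls\t2004\nSeven Samurai\t1954\n"

# delimiter="\t" tells pandas to split columns on tabs instead of commas
tsv_df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")
print(tsv_df)
```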

## Display Data

We can display a DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.

`NaN` is the Pandas value for any missing data. See ["Working with missing data"](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html?highlight=nan) for more information.

In [4]:
film_df

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
0,280,Betty,311,Woman,35.0,The Bridges of Madison County,1995,142.0,0.048639,False
1,280,Carolyn Johnson,873,Woman,,The Bridges of Madison County,1995,142.0,0.136534,False
2,280,Eleanor,138,Woman,,The Bridges of Madison County,1995,142.0,0.021583,False
3,280,Francesca Johns,2251,Woman,46.0,The Bridges of Madison County,1995,142.0,0.352049,False
4,280,Madge,190,Woman,46.0,The Bridges of Madison County,1995,142.0,0.029715,False
...,...,...,...,...,...,...,...,...,...,...
23043,9254,Lumiere,1063,Man,56.0,Beauty and the Beast,1991,452.0,0.104636,False
23044,9254,Maurice,1107,Man,71.0,Beauty and the Beast,1991,452.0,0.108967,False
23045,9254,Monsieur D'Arqu,114,Man,58.0,Beauty and the Beast,1991,452.0,0.011222,False
23046,9254,Mrs. Potts,564,Woman,66.0,Beauty and the Beast,1991,452.0,0.055517,False


There are a few important things to note about the DataFrame displayed here:

* Index
    * The ascending numbers in the very left-hand column of the DataFrame are called the Pandas *Index*. You can select rows based on the Index.
    * By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

* Truncation
    * The DataFrame is truncated, signaled by the ellipses (`...`) in the middle of every column.
    * The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas' default display settings yet again.

* Rows x Columns
    * Pandas reports how many rows and columns are in this dataset at the bottom of the output (23,048 rows x 10 columns).
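
The notes on the Index above can be sketched with a tiny example. The DataFrame `mini_df` is a hypothetical miniature, not the real dataset; it shows selecting a row by its Index label with `.loc` and swapping the Index for a column with `.set_index()`.

```python
import pandas as pd

# Hypothetical miniature of the film data
mini_df = pd.DataFrame({
    'character': ['Betty', 'Madge', 'Yoda'],
    'words': [311, 190, 163],
})

# Select a row by its default numeric Index label
row_one = mini_df.loc[1]

# Use the "character" column as the Index instead
by_character = mini_df.set_index('character')
yoda_words = by_character.loc['Yoda', 'words']
```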

### Display First *n* Rows

To look at the first *n* rows in a DataFrame, we can use a method called `.head()`.

In [5]:
film_df.head(10)

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
0,280,Betty,311,Woman,35.0,The Bridges of Madison County,1995,142.0,0.048639,False
1,280,Carolyn Johnson,873,Woman,,The Bridges of Madison County,1995,142.0,0.136534,False
2,280,Eleanor,138,Woman,,The Bridges of Madison County,1995,142.0,0.021583,False
3,280,Francesca Johns,2251,Woman,46.0,The Bridges of Madison County,1995,142.0,0.352049,False
4,280,Madge,190,Woman,46.0,The Bridges of Madison County,1995,142.0,0.029715,False
5,280,Michael Johnson,723,Man,38.0,The Bridges of Madison County,1995,142.0,0.113075,False
6,280,Robert Kincaid,1908,Man,65.0,The Bridges of Madison County,1995,142.0,0.298405,False
7,623,Bobby Korfin,328,Man,,15 Minutes,2001,37.0,0.036012,True
8,623,Daphne Handlova,409,Woman,28.0,15 Minutes,2001,37.0,0.044906,True
9,623,Deputy Chief Fi,347,Man,,15 Minutes,2001,37.0,0.038098,True


### Display Random Sample

To look at a random sample of rows, we can use the `.sample()` method.

In [6]:
film_df.sample(10)

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
421,726,Sal,919,Woman,40.0,The Beach,2000,64.0,0.085022,True
13864,4575,George,249,Man,66.0,Made in Dagenham,2010,1.0,0.025187,True
20089,7742,Old Guy,409,Man,,The Rover,2014,1.0,0.072313,True
19741,7597,Don,577,Man,46.0,28 Weeks Later,2007,36.0,0.185531,True
10792,3473,Connie Zirpollo,350,Woman,57.0,The Surfer King,2006,,0.021974,True
1320,962,Jenna,295,Woman,,Gothika,2003,85.0,0.020705,True
19153,7325,Officer,139,Woman,36.0,Deliver Us from Evil,2014,32.0,0.020638,True
21668,8431,Gorobei Katayam,400,Man,34.0,Seven Samurai,1954,,0.073842,False
18075,6724,Mr. Parker,2916,Man,26.0,The Way of the Gun,2000,9.0,0.250688,True
20223,7808,Sara Thomas,2727,Woman,28.0,Serendipity,2001,76.0,0.309886,True


## Examine Data

### Shape

To explicitly check how many rows and columns make up a dataset, we can use the `.shape` attribute. (Because it is an attribute rather than a method, there are no parentheses.)

In [7]:
film_df.shape

(23048, 10)

There are 23,048 rows and 10 columns.

### Data Types

Just like Python has different data types, pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the [`.dtypes` attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).



| **Pandas Data Type** | **Explanation** |
|:-------------:|:-------------:|
| `object`      | string    |
| `float64`     | float     |
| `int64`       | integer   |
| `datetime64`  | date time |

In [8]:
film_df.dtypes

script_id                   int64
character                  object
words                       int64
gender                     object
age                       float64
title                      object
release_year                int64
gross                     float64
proportion_of_dialogue    float64
after_2000                   bool
dtype: object

It's important to always check the data types in your DataFrame. For example, sometimes numeric values will accidentally be interpreted as a string object. To perform calculations on this data, you would first need to convert that column from strings to a numeric type, such as an integer or a float.
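
One possible way to do such a conversion is `pd.to_numeric()`. The sketch below is only an illustration on a made-up column (`messy_df` is hypothetical; every column in the film data already has the expected type).

```python
import pandas as pd

# Hypothetical column where numbers were read in as strings
messy_df = pd.DataFrame({'words': ['311', '873', 'unknown']})

# errors='coerce' turns values that cannot be parsed into NaN
messy_df['words'] = pd.to_numeric(messy_df['words'], errors='coerce')

# .sum() skips NaN values by default
total_words = messy_df['words'].sum()
```

Because the unparseable value becomes `NaN`, the column ends up as `float64` rather than `int64`.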

### Columns

We can also check the column names of the DataFrame with [`.columns`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html).

In [9]:
film_df.columns

Index(['script_id', 'character', 'words', 'gender', 'age', 'title',
       'release_year', 'gross', 'proportion_of_dialogue', 'after_2000'],
      dtype='object')

### Summary Statistics

In [10]:
film_df.describe(include='all')

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
count,23048.0,23048,23048.0,23043,18263.0,23048,23048.0,19387.0,23048.0,23048
unique,,17544,,2,,1994,,,,2
top,,Doctor,,Man,,Lone Star,,,,True
freq,,37,,16132,,40,,,,12632
mean,4194.784623,,907.871486,,42.38296,,1998.13307,106.73637,0.086515,
std,2472.985787,,1399.593759,,59.718859,,14.746058,145.85823,0.107745,
min,280.0,,101.0,,3.0,,1929.0,0.0,0.001537,
25%,2095.0,,193.0,,30.0,,1992.0,22.0,0.019771,
50%,3694.0,,396.0,,39.0,,2001.0,56.0,0.042421,
75%,6219.75,,980.0,,50.0,,2009.0,136.0,0.104166,


Do you notice any outliers, anomalies, or potential problems here?

* ANSWER HERE


The maximum value in the "age" column is 2013! That seems like an error.

In [11]:
film_df[film_df['age'] == 2013]

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
11639,3737,Lucas Solomon,190,Man,2013.0,The Wolf of Wall Street,2013,125.0,0.011077,True


Let's drop this row from the dataset by using the `.drop()` method and the Index label of the row.

In [12]:
film_df = film_df.drop(11639)

Now if we look for it again, that row is gone.

In [13]:
film_df[film_df['age'] == 2013]

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000


### Rename Columns

In [14]:
film_df = film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'})
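
Judging from the `.columns` output earlier, this cleaned CSV already seems to use the names `character` and `release_year`, so the cell above likely leaves the DataFrame unchanged: by default, `.rename()` ignores keys that do not match an existing column. The sketch below shows the rename on a hypothetical DataFrame (`raw_df`) that still has the older names.

```python
import pandas as pd

# Hypothetical DataFrame that still has the older column names
raw_df = pd.DataFrame({'imdb_character_name': ['Betty'], 'year': [1995]})

renamed_df = raw_df.rename(
    columns={'imdb_character_name': 'character', 'year': 'release_year'}
)

# Renaming again is harmless: keys that match no column are ignored
renamed_again_df = renamed_df.rename(columns={'year': 'release_year'})
```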

In [15]:
film_df.head()

Unnamed: 0,script_id,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
0,280,Betty,311,Woman,35.0,The Bridges of Madison County,1995,142.0,0.048639,False
1,280,Carolyn Johnson,873,Woman,,The Bridges of Madison County,1995,142.0,0.136534,False
2,280,Eleanor,138,Woman,,The Bridges of Madison County,1995,142.0,0.021583,False
3,280,Francesca Johns,2251,Woman,46.0,The Bridges of Madison County,1995,142.0,0.352049,False
4,280,Madge,190,Woman,46.0,The Bridges of Madison County,1995,142.0,0.029715,False


### Drop Columns

In [16]:
film_df = film_df.drop(columns='script_id')

### Missing Data

**.isna() / .notna()**

Pandas has special ways of dealing with missing data. As you may have already noticed, blank cells in a CSV file show up as `NaN` in a pandas DataFrame.

To filter and count the number of missing/not missing values in a dataset, we can use the special `.isna()` and `.notna()` methods on a DataFrame or Series object. 

The `.isna()` and `.notna()` methods return a True/False value for each row, which we can use to filter the DataFrame for rows that are missing (or that have) information in a given column.
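
For the counting part, one common approach is to chain `.sum()` onto `.isna()` or `.notna()`. This is a minimal sketch on a hypothetical three-row DataFrame (`toy_df`), not on the real data.

```python
import pandas as pd

# Hypothetical miniature with some missing values (None is read as missing)
toy_df = pd.DataFrame({
    'character': ['Tiny Tim', 'Betty', 'Randy'],
    'gender': [None, 'Woman', None],
    'age': [None, 35.0, None],
})

# True counts as 1, so summing the True/False values counts missing cells
missing_per_column = toy_df.isna().sum()

# The same trick counts the rows that do have a value
present_gender = toy_df['gender'].notna().sum()
```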

In [17]:
film_df[film_df['gender'].isna()]

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
7325,Tiny Tim,488,,,The Hebrew Hammer,2003,,0.03372,True
8010,Himself - Walki,312,,90.0,JFK,1991,145.0,0.005303,False
8011,Himself - With,224,,78.0,JFK,1991,145.0,0.003807,False
16902,Thomas Hergenro,216,,,Battle of the Year,2013,9.0,0.019626,True
18812,Randy,167,,,Hatchet III,2013,,0.033024,True


This is important information for the sake of better understanding our dataset. But it's also important because `NaN` values are treated as *floats*, not *strings*. If we tried to manipulate this column as text data, we would get an error. For this reason, we're going to replace or "fill" these `NaN` values with the string "No Gender Data" by using the `.fillna()` method.

In [18]:
film_df['gender'] = film_df['gender'].fillna('No Gender Data')

In [19]:
film_df[film_df['gender'].isna()]

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000


### Check for Duplicates 

#### Duplicates

We can check for duplicate rows by using the `.duplicated()` method and setting the parameter `keep=False`, which marks every copy of a duplicated row as `True`. By contrast, `keep='first'` leaves the first occurrence unmarked and flags only the later copies, and `keep='last'` leaves the last occurrence unmarked and flags only the earlier copies.

In [20]:
film_df.duplicated(keep=False)

0        False
1        False
2        False
3        False
4        False
         ...  
23043    False
23044    False
23045    False
23046    False
23047    False
Length: 23047, dtype: bool

The output above is reporting whether each row in the dataset is a duplicate. We can use the `.duplicated()` method inside a filter to isolate only the rows in the dataframe that are exact duplicates.

In [21]:
film_df[film_df.duplicated(keep=False)]

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
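
Since the film data has no exact duplicates, the effect of the `keep` parameter is easier to see on a small example. The DataFrame `dup_df` below is hypothetical and deliberately repeats one row.

```python
import pandas as pd

# Hypothetical rows in which the "Betty" row appears twice
dup_df = pd.DataFrame({
    'character': ['Betty', 'Madge', 'Betty'],
    'words': [311, 190, 311],
})

# keep=False flags every copy of a duplicated row
all_copies = dup_df.duplicated(keep=False).tolist()

# keep='first' flags only the copies after the first occurrence
later_copies = dup_df.duplicated(keep='first').tolist()

# keep='last' flags only the copies before the last occurrence
earlier_copies = dup_df.duplicated(keep='last').tolist()

# Dropping duplicates with keep='first' retains rows 0 and 1
deduped_df = dup_df.drop_duplicates(keep='first')
```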


We can drop duplicates from the DataFrame with the `.drop_duplicates()` method and choose to keep the first instance of the duplicate or the last instance.

In [22]:
film_df = film_df.drop_duplicates(keep='first')

Now if we check the data for duplicates again, they should be all gone.

In [23]:
film_df[film_df.duplicated(keep=False)]

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000


## Clean and Transform Data

### Pandas `.str` Methods

Pandas has special [pandas string methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods). Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add `.str` to the method chain.

| **Pandas String Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| df['column_name']`.str.lower()`         | makes the string in each row lowercase                                                                                |
| df['column_name']`.str.upper()`         | makes the string in each row uppercase                                                |
| df['column_name']`.str.title()`         | makes the string in each row titlecase                                                |
| df['column_name']`.str.replace('old string', 'new string')`      | replaces `old string` with `new string` for each row |
| df['column_name']`.str.contains('some string')`      | tests whether string in each row contains "some string" |
| df['column_name']`.str.split('delim')`          | returns a list of substrings separated by the given delimiter |
| df['column_name']`.str.join('delim')`         | opposite of `.str.split()`: joins the list of strings in each row into one string, separated by the given delimiter |
                                                            

For example, to transform every character's name in the "character" column to uppercase, we can use `.str.upper()`.

In [24]:
film_df['character'].str.upper()

0                  BETTY
1        CAROLYN JOHNSON
2                ELEANOR
3        FRANCESCA JOHNS
4                  MADGE
              ...       
23043            LUMIERE
23044            MAURICE
23045    MONSIEUR D'ARQU
23046         MRS. POTTS
23047           WARDROBE
Name: character, Length: 23047, dtype: object

To transform every character's name in the "character" column to lowercase, we can use `.str.lower()`.

In [25]:
film_df['character'].str.lower()

0                  betty
1        carolyn johnson
2                eleanor
3        francesca johns
4                  madge
              ...       
23043            lumiere
23044            maurice
23045    monsieur d'arqu
23046         mrs. potts
23047           wardrobe
Name: character, Length: 23047, dtype: object
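
The splitting, joining, and replacing methods from the table work the same way. Here is a small sketch on a hypothetical Series of names (`names`), not on the full dataset.

```python
import pandas as pd

# Hypothetical names for illustration
names = pd.Series(['Carolyn Johnson', 'Mrs. Potts'])

# .str.split() returns a list of substrings for each row
name_parts = names.str.split(' ')

# .str.join() glues each row's list back together with the given separator
rejoined = name_parts.str.join('_')

# regex=False treats the pattern as literal text, so "." is not a wildcard
no_periods = names.str.replace('.', '', regex=False)
```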

We can use `.str.contains()` to search for particular words or phrases in a column, such as "Star Wars."

In [26]:
film_df[film_df['title'].str.contains('Star Wars')]

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
3017,Admiral Ackbar,199,Man,61.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.039096,False
3018,Ben 'Obi-Wan' K,462,Man,69.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.090766,False
3019,C-3Po,881,Man,37.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.173084,False
3020,Darth Vader,381,Man,48.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.074853,False
3021,Han Solo,835,Man,41.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.164047,False
3022,Lando Calrissia,379,Man,46.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.07446,False
3023,Luke Skywalker,915,Man,32.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.179764,False
3024,Princess Leia,359,Woman,27.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.07053,False
3025,The Emperor,516,Man,39.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.101375,False
3026,Yoda,163,Man,19.0,Star Wars: Episode VI - Return of the Jedi,1983,853.0,0.032024,False


The same approach works for any other title, such as "Mean Girls."

In [27]:
film_df[film_df['title'].str.contains('Mean Girls')]

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
13938,Aaron Samuels,426,Man,23.0,Mean Girls,2004,120.0,0.05389,True
13939,Cady Heron,2798,Woman,18.0,Mean Girls,2004,120.0,0.353953,True
13940,Damian,624,Man,26.0,Mean Girls,2004,120.0,0.078937,True
13941,Gretchen Wiener,609,Woman,22.0,Mean Girls,2004,120.0,0.07704,True
13942,Janis Ian,907,Woman,22.0,Mean Girls,2004,120.0,0.114738,True
13943,Karen Smith,301,Woman,19.0,Mean Girls,2004,120.0,0.038077,True
13944,Mr. Duvall,365,Man,43.0,Mean Girls,2004,120.0,0.046173,True
13945,Mrs. George,125,Woman,33.0,Mean Girls,2004,120.0,0.015813,True
13946,Ms. Norbury,720,Woman,34.0,Mean Girls,2004,120.0,0.091082,True
13947,Regina George,1030,Woman,26.0,Mean Girls,2004,120.0,0.130297,True
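
By default, `.str.contains()` is case-sensitive, and it returns a missing value for rows that are themselves missing. The sketch below, on a hypothetical Series of titles, shows two optional parameters (`case` and `na`) that may help in those situations.

```python
import pandas as pd

# Hypothetical titles, including one missing value
titles = pd.Series(['Mean Girls', 'mean streets', None, 'Star Wars'])

# case=False ignores capitalization; na=False treats missing titles as "no match"
mask = titles.str.contains('mean', case=False, na=False)

# Use the True/False mask to keep only the matching titles
matches = titles[mask].tolist()
```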


## Applying Functions

With the `.apply()` method, we can run a function on every single row in a Pandas column or dataframe.

In [28]:
def make_text_title_case(text):
    title_case_text = text.title()
    return title_case_text

In [29]:
make_text_title_case("betty")

'Betty'

In [30]:
film_df['character'].apply(make_text_title_case)

0                  Betty
1        Carolyn Johnson
2                Eleanor
3        Francesca Johns
4                  Madge
              ...       
23043            Lumiere
23044            Maurice
23045    Monsieur D'Arqu
23046         Mrs. Potts
23047           Wardrobe
Name: character, Length: 23047, dtype: object

In [31]:
film_df['character'] = film_df['character'].apply(make_text_title_case)
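
For a transformation this simple, `.apply()` is one of several options that should give the same result. This sketch compares three of them on a hypothetical Series (`chars`); `.apply()` becomes more valuable when the function does something that no built-in `.str` method covers.

```python
import pandas as pd

def make_text_title_case(text):
    return text.title()

# Hypothetical lowercase names for illustration
chars = pd.Series(['betty', 'mrs. potts'])

# 1. A named function passed to .apply()
with_apply = chars.apply(make_text_title_case)

# 2. An anonymous (lambda) function passed to .apply()
with_lambda = chars.apply(lambda text: text.title())

# 3. The built-in pandas string method
with_str = chars.str.title()
```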

In [32]:
film_df.sample(10)

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
18717,Deb,114,Woman,,I Know What You Did Last Summer,1997,136.0,0.020152,False
15491,Reese Wilson,322,Woman,49.0,Urban Legend,1998,70.0,0.043042,False
4011,Reporter,156,Man,35.0,The American President,1995,130.0,0.004431,False
449,Salgado,103,Man,51.0,Being Human,1994,3.0,0.008315,False
15114,Worf,363,Man,42.0,Star Trek: Generations,1994,157.0,0.039603,False
22361,Bruce Wayne,1978,Man,38.0,The Dark Knight Rises,2012,489.0,0.193561,True
17924,Officer Don,157,Man,51.0,Career Opportunities,1991,23.0,0.014636,False
8697,Bob Zmuda,1062,Man,32.0,Man on the Moon,1999,59.0,0.062295,False
10472,Bill Pardy,2656,Man,35.0,Slither,2006,10.0,0.292511,True
4159,Henry V,279,Man,,Anonymous,2011,4.0,0.024973,True


### Filter DataFrame

We can filter the DataFrame for only the characters who are men, or only the characters who are women.

In [33]:
men_film_df = film_df[film_df['gender'] == 'Man']
men_film_df.head()

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
5,Michael Johnson,723,Man,38.0,The Bridges of Madison County,1995,142.0,0.113075,False
6,Robert Kincaid,1908,Man,65.0,The Bridges of Madison County,1995,142.0,0.298405,False
7,Bobby Korfin,328,Man,,15 Minutes,2001,37.0,0.036012,True
9,Deputy Chief Fi,347,Man,,15 Minutes,2001,37.0,0.038098,True
10,Detective Eddie,2020,Man,58.0,15 Minutes,2001,37.0,0.221783,True


In [34]:
women_film_df = film_df[film_df['gender'] == 'Woman']
women_film_df.head()

Unnamed: 0,character,words,gender,age,title,release_year,gross,proportion_of_dialogue,after_2000
0,Betty,311,Woman,35.0,The Bridges of Madison County,1995,142.0,0.048639,False
1,Carolyn Johnson,873,Woman,,The Bridges of Madison County,1995,142.0,0.136534,False
2,Eleanor,138,Woman,,The Bridges of Madison County,1995,142.0,0.021583,False
3,Francesca Johns,2251,Woman,46.0,The Bridges of Madison County,1995,142.0,0.352049,False
4,Madge,190,Woman,46.0,The Bridges of Madison County,1995,142.0,0.029715,False


### Groupby

We can use the `.groupby()` method to group all the men characters in each film and sum up their total dialogue.

By adding a slice (`[:20]`) after sorting, we can identify the top 20 films with the greatest proportion of men speaking.

Line Breaks:
If a line of code gets too long, you can create a line break with a backslash `\`
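
An alternative to the backslash, offered here as a stylistic option rather than a requirement, is to wrap the whole chain in parentheses. The sketch uses a hypothetical miniature DataFrame (`toy_men_df`) in place of `men_film_df`.

```python
import pandas as pd

# Hypothetical miniature of men_film_df
toy_men_df = pd.DataFrame({
    'title': ['Fury', 'Fury', 'JFK'],
    'proportion_of_dialogue': [0.6, 0.4, 0.25],
})

# Parentheses let the chain span several lines without backslashes
top_films = (
    toy_men_df
    .groupby('title')[['proportion_of_dialogue']]
    .sum()
    .sort_values(by='proportion_of_dialogue', ascending=False)
    .head(20)  # selects the same rows as the [:20] slice
)
```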

In [35]:
men_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]

Unnamed: 0_level_0,proportion_of_dialogue
title,Unnamed: 1_level_1
The Men Who Stare at Goats,1.0
Kagemusha,1.0
The Wild Bunch,1.0
Killing Them Softly,1.0
Stalag 17,1.0
There Will Be Blood,1.0
Fury,1.0
The Revenant,1.0
Saving Private Ryan,1.0
Crimson Tide,1.0


We can use the `.groupby()` method to group all the women characters in each film and sum up their total dialogue.

By adding a slice (`[:20]`) after sorting, we can identify the top 20 films with the greatest proportion of women speaking.

In [36]:
women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]

Unnamed: 0_level_0,proportion_of_dialogue
title,Unnamed: 1_level_1
The Descent,1.0
Now and Then,1.0
Precious,0.993541
Martyrs,0.96555
The Hand That Rocks the Cradle,0.93375
Agnes of God,0.922482
Heavenly Creatures,0.919368
The Help,0.916947
3 Women,0.89953
The Watermelon Woman,0.894676


## Reset Index

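After a groupby aggregation such as `.sum()`, the grouping column (here, "title") becomes the Index of the result. We can turn it back into an ordinary column, and get a regular numeric Index, by tacking on `.reset_index()`.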

In [37]:
women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False).reset_index()

Unnamed: 0,title,proportion_of_dialogue
0,The Descent,1.000000
1,Now and Then,1.000000
2,Precious,0.993541
3,Martyrs,0.965550
4,The Hand That Rocks the Cradle,0.933750
...,...,...
1935,The Last Castle,0.011139
1936,The Damned United,0.010909
1937,Thirteen Days,0.010834
1938,Men in Black 3,0.007812
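
To see what `.reset_index()` changes, it can help to compare the result before and after on a small example. The DataFrame `toy_women_df` is a hypothetical miniature of `women_film_df`.

```python
import pandas as pd

# Hypothetical miniature of women_film_df
toy_women_df = pd.DataFrame({
    'title': ['The Help', 'The Help', 'Precious'],
    'proportion_of_dialogue': [0.5, 0.25, 0.75],
})

# After the aggregation, "title" is the Index rather than a column
summed = toy_women_df.groupby('title')[['proportion_of_dialogue']].sum()

# After reset_index(), "title" is an ordinary column and the Index is 0, 1, ...
flat = summed.reset_index()
```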


## Your turn!

Go back through this notebook, changing the code to explore the film corpus as you see fit.