# Curating Film Dialogue


## 1. Import the datasets

Create a new Jupyter notebook called `CuratingFilmDialogue.ipynb` and read in the three datasets from the GitHub repository https://github.com/matthewfdaniels/scripts/.


In [1]:
import pandas as pd

### metadata dataset


In [2]:
metadata = pd.read_csv('meta_data7.csv', encoding = 'latin-1')
metadata.head()

Unnamed: 0,script_id,imdb_id,title,year,gross,lines_data
0,1534,tt1022603,(500) Days of Summer,2009,37.0,7435445256774774443342577775657744434444564456...
1,1512,tt0147800,10 Things I Hate About You,1999,65.0,1777752320274533344457777722433777334443764677...
2,1514,tt0417385,12 and Holding,2005,,5461357777754212454544441367774433446547647753...
3,1517,tt2024544,12 Years a Slave,2013,60.0,4567334777777777777777447777756477777444777777...
4,1520,tt1542344,127 Hours,2010,20.0,453513352345765766777777773340


This is a unique list of IMDB_IDs from the character_list file, with additional metadata such as release year and domestic, inflation-adjusted gross.


### character list dataset


In [3]:
character_list = pd.read_csv('character_list5.csv', encoding = 'latin-1')
character_list.head()

Unnamed: 0,script_id,imdb_character_name,words,gender,age
0,280,betty,311,f,35.0
1,280,carolyn johnson,873,f,
2,280,eleanor,138,f,
3,280,francesca johns,2251,f,46.0
4,280,madge,190,f,46.0


This is the data that powers all of the calculations on polygraph.cool/films. It uses the most accurate script that we can find for a given film. People are understandably finding errors, so we will be updating this file as much as possible.


### character mapping dataset


In [4]:
character_mapping = pd.read_csv('character_mapping.csv', encoding = 'latin-1')
character_mapping.head()

Unnamed: 0,script_id,imdb_id,character_from_script,closest_character_name_from_imdb_match,closest_imdb_character_id
0,1,tt0147800,bianca,bianca stratford,nm0646351
1,1,tt0147800,cameron,cameron james,nm0330687
2,1,tt0147800,chastity,chastity,nm0005517
3,1,tt0147800,joey,joey donner,nm0005080
4,1,tt0147800,kat,kat stratford,nm0005466


## 2. Check for missing data


### metadata dataset


In [5]:
metadata.isna()

Unnamed: 0,script_id,imdb_id,title,year,gross,lines_data
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,True,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
1995,False,False,False,False,False,False
1996,False,False,False,False,False,False
1997,False,False,False,False,False,False
1998,False,False,False,False,True,False


In [6]:
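import numpy as np
metadata_cols = ['script_id', 'imdb_id', 'title', 'year', 'gross', 'lines_data']
# Print each column's values to look for markers of missing data.
# Note: str() turns the whole array into one string, so np.unique() returns
# that single (truncated) string rather than the distinct values.
for col in metadata_cols:
    print(f'{col} : {np.unique(str(metadata[col].values))}\n')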

script_id : ['[1534 1512 1514 ... 8158 3768 6491]']

imdb_id : ["['tt1022603' 'tt0147800' 'tt0417385' ... 'tt0120906' 'tt0421090'\n 'tt0443706']"]

title : ["['(500) Days of Summer' '10 Things I Hate About You' '12 and Holding' ...\n 'Zero Effect' 'Zerophilia' 'Zodiac']"]

year : ['[2009 1999 2005 ... 1998 2005 2007]']

gross : ['[37. 65. nan ...  3. nan 41.]']

lines_data : ["['743544525677477444334257777565774443444456445674543367553452777734237544553444343334444444467441433777777777777776634344344434244343433435535624644435776576434333377775756764434344466346764533566544444777533356445543543343334444535476332345777777777777776'\n '177775232027453334445777772243377733444376467744677740424516733144464355045563543423354653735714457434333434243540000354213722445102431377774150047353236346777770432577633435475477734777720434517632245454363064653552333354735636524467333433433433530001363'\n '54613577777542124545444413677744334465476477533416774357535660534376444724437777421544734354157437

In [7]:
metadata = pd.read_csv('meta_data7.csv', encoding = 'latin-1', na_values=['nan', 'False'])
new_metadata = metadata.dropna()
metadata.shape[0] - new_metadata.shape[0]

338

Values treated as missing in the metadata dataset: `nan`, `False`.

I dropped 338 rows with missing data from the metadata dataset.
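A per-column count can make it clearer where the missing values sit than a full boolean table does. The following is a minimal sketch on a toy stand-in for the metadata table; the rows and values are made up for illustration and are not taken from `meta_data7.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata table (illustrative values only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2009, 1999, 2005, 2013],
    "gross": [37.0, np.nan, np.nan, 60.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows with at least one missing value,
# i.e. the rows that a plain dropna() would remove
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```

Run against the real `metadata` frame, the same two expressions would show which column accounts for the dropped rows.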


### character list dataset


In [8]:
character_list.isna()

Unnamed: 0,script_id,imdb_character_name,words,gender,age
0,False,False,False,False,False
1,False,False,False,False,True
2,False,False,False,False,True
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
23043,False,False,False,False,False
23044,False,False,False,False,False
23045,False,False,False,False,False
23046,False,False,False,False,False


In [9]:
cl_cols = ['script_id', 'imdb_character_name', 'words', 'gender', 'age']
# As above, str() collapses each column into one truncated string before np.unique()
for col in cl_cols:
    print(f'{col} : {np.unique(str(character_list[col].values))}\n')

script_id : ['[ 280  280  280 ... 9254 9254 9254]']

imdb_character_name : ['[\'betty\' \'carolyn johnson\' \'eleanor\' ... "monsieur d\'arqu" \'mrs. potts\'\n \'wardrobe\']']

words : ['[311 873 138 ... 114 564 121]']

gender : ["['f' 'f' 'f' ... 'm' 'f' 'f']"]

age : ['[35. nan nan ... 58. 66. 54.]']



In [10]:
character_list = pd.read_csv('character_list5.csv', encoding = 'latin-1', na_values=['nan', 'False'])
new_character_list = character_list.dropna()
character_list.shape[0] - new_character_list.shape[0]


4785

Values treated as missing in the character list dataset: `nan`, `False`.

I dropped 4785 rows with missing data from the character list dataset.


### character mapping dataset


In [11]:
character_mapping.isna()

Unnamed: 0,script_id,imdb_id,character_from_script,closest_character_name_from_imdb_match,closest_imdb_character_id
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
99385,False,False,False,False,False
99386,False,False,False,False,False
99387,False,False,False,False,False
99388,False,False,False,True,False


In [12]:
cm_cols = ['script_id', 'imdb_id', 'character_from_script', 'closest_character_name_from_imdb_match', 'closest_imdb_character_id']
# As above, str() collapses each column into one truncated string before np.unique()
for col in cm_cols:
    print(f'{col} : {np.unique(str(character_mapping[col].values))}\n')

script_id : ['[   1    1    1 ... 9254 9254 9254]']

imdb_id : ["['tt0147800' 'tt0147800' 'tt0147800' ... 'tt0101414' 'tt0101414'\n 'tt0101414']"]

character_from_script : ["['bianca' 'cameron' 'chastity' ... 'mrs potts' 'townsfolk' 'wardrobe']"]

closest_character_name_from_imdb_match : ["['bianca stratford' 'cameron james' 'chastity' ... 'mrs. potts' nan\n 'wardrobe']"]

closest_imdb_character_id : ["['nm0646351' 'nm0330687' 'nm0005517' ... 'nm0001450' 'nm0621121'\n 'nm0941506']"]



In [13]:
character_mapping = pd.read_csv('character_mapping.csv', encoding = 'latin-1', na_values=['nan', 'nan\n', 'False'])
new_character_mapping = character_mapping.dropna()
character_mapping.shape[0] - new_character_mapping.shape[0]

115

Values treated as missing in the character mapping dataset: `nan`, `False`, `nan\n`.

I dropped 115 rows with missing data from the character mapping dataset.
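A plain `dropna()` removes a row if any column is missing, which also discards rows whose only gap is in a column a later step may not need (for example `age` in the character list). One possible alternative, sketched here on toy rows, is to pass `subset=` so that only the columns an analysis relies on are checked. Which columns belong in the subset is an assumption that depends on the question being asked.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the character list (illustrative only)
toy_characters = pd.DataFrame({
    "script_id": [280, 280, 280],
    "words": [311, 873, 138],
    "gender": ["f", "f", "f"],
    "age": [35.0, np.nan, np.nan],
})

# Drops every row with any missing value, including rows missing only age
dropped_any = toy_characters.dropna()

# Drops a row only if words or gender is missing
dropped_subset = toy_characters.dropna(subset=["words", "gender"])

print(len(dropped_any), len(dropped_subset))
```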


## 3. Analysis questions


### How could we tell if the amount of dialogue was increasing over time in movies? How might this influence the assessment about the breakdown of gender dialogue?


Use the `lines_data` column in the `metadata` dataset to find the amount of dialogue in each movie, then sort by `year` and by the derived line count to see how it changed over time.

`lines_data`:

- We assume that a minute of dialogue is roughly 14 lines.
- Each numeral in the string is the number of MALE lines in half a minute. If we split the string into groups of two and add the two numerals in each group, we get the total number of male lines for roughly one minute of screen time.
- Each row is an array of [male lines out of 14 representing one minute, female lines out of 14 representing one minute].
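The pairing described above can be sketched as a small helper. The function name `male_lines_per_minute` is hypothetical, and the choice to keep a trailing unpaired digit as its own partial minute is an assumption; the input string is the `lines_data` value shown for 127 Hours in the metadata preview.

```python
def male_lines_per_minute(lines_data):
    """Sum consecutive pairs of half-minute digits into per-minute counts."""
    digits = [int(ch) for ch in str(lines_data) if ch.isdigit()]
    # Step through the digits two at a time; a trailing single digit
    # (odd-length string) is kept as its own partial minute
    return [sum(digits[i:i + 2]) for i in range(0, len(digits), 2)]


per_minute = male_lines_per_minute("453513352345765766777777773340")
print(per_minute)
print(sum(per_minute))
```

Under the description above, the sum of all digits counts male lines only, so it is a proxy for the amount of dialogue rather than a direct total across genders.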


Gender dialogue:

- Merge the `character_list` and `metadata` datasets on the `script_id` column.
- `character_list` contains a `gender` column and a `words` column, so we can compare the total number of words per film with the words spoken by characters of each gender within that film.
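A minimal sketch of the gender comparison, assuming the column names shown in the previews. The five character rows are the ones visible for `script_id` 280 in this notebook, not the film's full cast, so the resulting share is illustrative only; `female_share` is a name introduced here.

```python
import pandas as pd

# Five character rows for script_id 280 as shown in this notebook (not the full cast)
toy_characters = pd.DataFrame({
    "script_id": [280, 280, 280, 280, 280],
    "words": [311, 2251, 190, 723, 1908],
    "gender": ["f", "f", "f", "m", "m"],
})
toy_metadata = pd.DataFrame({
    "script_id": [280],
    "title": ["The Bridges of Madison County"],
    "year": [1995],
})

# Total words per film and gender, one column per gender
words_by_gender = (
    toy_characters.groupby(["script_id", "gender"])["words"]
    .sum()
    .unstack(fill_value=0)
)

# Share of words spoken by female characters in each film
words_by_gender["female_share"] = words_by_gender["f"] / (
    words_by_gender["f"] + words_by_gender["m"]
)

# Attach title and year so the share can later be grouped by year
result = toy_metadata.merge(words_by_gender.reset_index(), on="script_id", how="inner")
print(result)
```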


In [14]:
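# Sum every digit in each film's lines_data string.
# Caveat: enumerate() yields positions 0..n-1, while .loc is label-based and
# new_metadata kept its original index labels after dropna(), so the two can disagree.
for idx, val in enumerate(new_metadata['lines_data']):
    my_str = str(val)
    my_sum = 0
    for i in my_str:
        if i.isdigit():
            my_sum += int(i)
    new_metadata.loc[idx, 'total_lines'] = my_sum
new_metadata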

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_metadata.loc[idx, 'total_lines'] = my_sum
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_metadata.loc[idx, 'total_lines'] = my_sum


Unnamed: 0,script_id,imdb_id,title,year,gross,lines_data,total_lines
0,1534.0,tt1022603,(500) Days of Summer,2009.0,37.0,7435445256774774443342577775657744434444564456...,1215.0
1,1512.0,tt0147800,10 Things I Hate About You,1999.0,65.0,1777752320274533344457777722433777334443764677...,1036.0
3,1517.0,tt2024544,12 Years a Slave,2013.0,60.0,4567334777777777777777447777756477777444777777...,146.0
4,1520.0,tt1542344,127 Hours,2010.0,20.0,453513352345765766777777773340,263.0
5,6537.0,tt0450385,1408,2007.0,91.0,37677777777777777776777737566646444336777661,524.0
...,...,...,...,...,...,...,...
1642,,,,,,,450.0
1643,,,,,,,431.0
1647,,,,,,,180.0
1652,,,,,,,1012.0
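Two details in the output above suggest an index mismatch: the rows at the bottom that are empty except for `total_lines`, and the value 146.0 shown for 12 Years a Slave, which is the digit sum of the 127 Hours string. Both are consistent with positions from `enumerate` being used as index labels after `dropna()` left gaps in the index. A minimal sketch of one way to avoid this is to take an explicit copy and assign a whole column, which keeps each value aligned with its own row. The helper name `digit_sum` is hypothetical, and the first two `lines_data` strings are shortened placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame: the first row has a missing gross and will be dropped
toy = pd.DataFrame({
    "title": ["12 and Holding", "12 Years a Slave", "127 Hours"],
    "gross": [np.nan, 60.0, 20.0],
    "lines_data": ["5461", "4567", "453513352345765766777777773340"],
})


def digit_sum(lines_data):
    """Add up every digit character in a lines_data string."""
    return sum(int(ch) for ch in str(lines_data) if ch.isdigit())


# .copy() makes an independent frame, which avoids SettingWithCopyWarning
cleaned = toy.dropna().copy()

# Column assignment aligns on the index, so no row receives another row's sum
cleaned["total_lines"] = cleaned["lines_data"].map(digit_sum)
print(cleaned)
```

Because nothing is written by position, the frame keeps exactly the rows that survived `dropna()` and no extra rows are appended.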


In [15]:
new_metadata.sort_values(by='year', ascending= True)

Unnamed: 0,script_id,imdb_id,title,year,gross,lines_data,total_lines
547,8521.0,tt0021884,Frankenstein,1931.0,298.0,6343446777777777777777777777646447777777665775...,700.0
1824,1483.0,tt0032138,The Wizard of Oz,1939.0,839.0,0220005100004100267760000222343445342744445441...,
1042,3016.0,tt0031725,Ninotchka,1939.0,45.0,7777747757777777777545754333432344343343454343...,308.0
1794,6003.0,tt0041959,The Third Man,1949.0,8.0,7777777777777777777777777777747777443334435767...,
1072,4745.0,tt0047296,On the Waterfront,1954.0,185.0,7777553677777777465467777677744434343343554543...,457.0
...,...,...,...,...,...,...,...
1642,,,,,,,450.0
1643,,,,,,,431.0
1647,,,,,,,180.0
1652,,,,,,,1012.0


In [16]:
new_metadata.sort_values(by=['total_lines'], ascending= True)

Unnamed: 0,script_id,imdb_id,title,year,gross,lines_data,total_lines
1283,,,,,,,2.0
886,4553.0,tt0318155,Looney Tunes: Back in Action,2003.0,30.0,7555765477777753477777763677772556745245766564...,11.0
338,2029.0,tt0112697,Clueless,1995.0,113.0,5203350677316630521012342026300733222641636577...,21.0
177,,,,,,,30.0
1266,,,,,,,31.0
...,...,...,...,...,...,...,...
1994,5517.0,tt3312830,Youth,2015.0,2.0,7777777743477777777777777777777654336667757777...,
1995,3765.0,tt0403702,Youth in Revolt,2009.0,17.0,7766777656545344243247443314443342644634343374...,
1996,3766.0,tt1790885,Zero Dark Thirty,2012.0,104.0,5677677556654467677515744741445336433333000120...,
1997,8158.0,tt0120906,Zero Effect,1998.0,3.0,4777774477777647777777777755677755423677777777...,
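Sorting alone makes a trend hard to read. One option, sketched here with made-up numbers, is to average the line count per release year and fit a straight line to the yearly means. A positive slope would hint at more dialogue over time, though the number of films per year and the script sources would also need checking.

```python
import numpy as np
import pandas as pd

# Made-up films: release year and digit-sum line count (illustrative only)
toy = pd.DataFrame({
    "year": [1990, 1990, 2000, 2000, 2010, 2010],
    "total_lines": [400, 600, 500, 700, 600, 800],
})

# Mean line count per release year
yearly_mean = toy.groupby("year")["total_lines"].mean()

# Least-squares straight line through the yearly means;
# slope is the change in mean line count per year
slope, intercept = np.polyfit(
    yearly_mean.index.to_numpy(dtype=float),
    yearly_mean.to_numpy(dtype=float),
    deg=1,
)
print(yearly_mean)
print(slope)
```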


In [17]:
merged_df = pd.merge(new_character_list, new_metadata, on= 'script_id')
merged_df

Unnamed: 0,script_id,imdb_character_name,words,gender,age,imdb_id,title,year,gross,lines_data,total_lines
0,280,betty,311,f,35.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
1,280,francesca johns,2251,f,46.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
2,280,madge,190,f,46.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
3,280,michael johnson,723,m,38.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
4,280,robert kincaid,1908,m,65.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
...,...,...,...,...,...,...,...,...,...,...,...
15548,9254,lumiere,1063,m,56.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0
15549,9254,maurice,1107,m,71.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0
15550,9254,monsieur d'arqu,114,m,58.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0
15551,9254,mrs. potts,564,f,66.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0


In [18]:
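# Assumes a separate cleaned export, 'cleaned_pudding_data.csv', is in the working directory
film_scripts = pd.read_csv('cleaned_pudding_data.csv')
# Total inflation-adjusted gross per release year
films_year = film_scripts.groupby('year')['gross (inflation-adjusted)'].sum().reset_index()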


In [19]:
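# groupby() returns a GroupBy object and does not modify merged_df;
# without an aggregation the result is discarded, so the table below is unchanged
merged_df.groupby('script_id')
merged_df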

Unnamed: 0,script_id,imdb_character_name,words,gender,age,imdb_id,title,year,gross,lines_data,total_lines
0,280,betty,311,f,35.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
1,280,francesca johns,2251,f,46.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
2,280,madge,190,f,46.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
3,280,michael johnson,723,m,38.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
4,280,robert kincaid,1908,m,65.0,tt0112579,The Bridges of Madison County,1995.0,142.0,4332023434343443203433434334433434343434434344...,706.0
...,...,...,...,...,...,...,...,...,...,...,...
15548,9254,lumiere,1063,m,56.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0
15549,9254,maurice,1107,m,71.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0
15550,9254,monsieur d'arqu,114,m,58.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0
15551,9254,mrs. potts,564,f,66.0,tt0101414,Beauty and the Beast,1991.0,452.0,3245753334377767774433634446467677732244465553...,452.0


### How could we test if there was any relationship between the film’s gross value and the amount of dialogue in the film?

- Examine the relationship between the `gross` column and the amount of dialogue derived from the `lines_data` column in the `metadata` dataset.
- Calculate summary statistics for the different dataframes.
- Use visualizations such as scatterplots.
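As a starting point for the first bullet, a correlation coefficient gives a single-number summary to read alongside a scatterplot. The sketch below uses made-up values and assumes a `total_lines` column derived from `lines_data`. It computes Pearson's r, and a rank-based (Spearman-style) coefficient obtained by correlating the ranks.

```python
import numpy as np
import pandas as pd

# Made-up gross and line-count values (illustrative only)
toy = pd.DataFrame({
    "gross": [10.0, 20.0, 30.0, 40.0, np.nan],
    "total_lines": [300, 500, 400, 800, 600],
})

# Keep only films where both values are present
pairs = toy.dropna(subset=["gross", "total_lines"])

# Pearson correlation: strength of the linear relationship
pearson_r = pairs["gross"].corr(pairs["total_lines"])

# Correlating the ranks gives Spearman's rho, which is less sensitive to outliers
spearman_rho = pairs["gross"].rank().corr(pairs["total_lines"].rank())

print(pearson_r, spearman_rho)
```

A coefficient near zero would not rule out a non-linear relationship, so the scatterplot remains worth drawing.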
