# Analyzing the Gender Gap in Hollywood

# 0.0 Notebook Objective

The objective of this notebook is to scrape the cast roles for IMDb movies from Wikipedia. This is necessary because IMDb only has a summarized version of the cast roles. For instance, instead of:

> "Robert Downey Jr. as Tony Stark / Iron Man: An industrialist, genius inventor, and consummate playboy, he is CEO of Stark Industries and chief weapons manufacturer for the U.S. military."

The IMDb description simply says:

> "Robert Downey Jr. as Tony Stark / Iron Man"

This does not give enough information for the purpose of our gender gap analysis.


NOTE: This notebook took **more than 40 hours** to run!

# 1.0 Library Importing

In [8]:
import pandas as pd # To handle DataFrames efficiently
import numpy as np
import warnings # To suppress warnings to make notebook look cleaner
import missingno
from tqdm import tqdm # This is used to show progress of function application
import wikipedia # This is the Wikipedia API used to scrape the Cast
import requests
from bs4 import BeautifulSoup
import re

import math, time, random, datetime # Used to time our code execution where applicable

# 2.0 Importing DataFrames

In [9]:
# Import data_frame of pre-scraped IMDB movies
start = time.time()
warnings.filterwarnings("ignore")
df_movies = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/archive/IMDb movies.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 1.09 seconds


In [10]:
df_movies.shape # check shape of IMDB dataframe

(85855, 22)

In [11]:
df_movies[df_movies['original_title']== 'Coming to America'] # Check to see if popular movie [Coming to America] is present

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
23803,tt0094898,Il principe cerca moglie,Coming to America,1988,1988-09-30,"Comedy, Romance",116,USA,English,John Landis,...,"Paul Bates, Eddie Murphy, Garcelle Beauvais, F...",An extremely pampered African Prince travels t...,7.0,160374,$ 39000000,$ 128152301,$ 288752301,47.0,177.0,69.0


In [12]:
df_sample = df_movies.copy() # Copy the df_movies dataframe to a new DataSet df_sample

In [13]:
df_sample.head(1) # Inspect head of new dataframe

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0


# 3.0 Scraping Wikipedia to get Cast Roles

In [14]:
# Function to find the Cast section of a movie's Wikipedia page (if it exists)
def find_movie_page(movie_title):
    try:
        return wikipedia.WikipediaPage(title=movie_title).section('Cast') # Text of the Cast section of the movie's Wikipedia page
    except Exception: # Catch lookup errors without swallowing KeyboardInterrupt
        return 'missing' # Returns 'missing' if the page lookup fails

In [15]:
tqdm.pandas() # This initiates the tqdm library with pandas so we can track progress

### 3.1 Split Up DataFrames to avoid losing all progress

In [7]:
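# Half-open slices: each chunk starts where the previous one ends, so no row is skipped.
# The recorded run used start indices 20001, 40001 and 60001, which skipped rows 20000,
# 40000 and 60000; the progress-bar counts below (19999, 19999, 25854) reflect that run.
df_sample_20 = df_sample[0:20000].copy() # .copy() so adding a column does not touch df_sample
df_sample_40 = df_sample[20000:40000].copy()
df_sample_60 = df_sample[40000:60000].copy()
df_sample_80 = df_sample[60000:85855].copy()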
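
Hand-written slice bounds are easy to get off by one, so it can help to generate them. The sketch below is an illustration, not the code that was run for this notebook: `chunk_bounds` is a hypothetical helper that returns half-open `(start, stop)` pairs covering every row exactly once, assuming a fixed chunk size.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return half-open (start, stop) pairs that cover range(n_rows) with no gaps or overlaps."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

# 85855 is the row count reported by df_movies.shape above
bounds = chunk_bounds(85855, 20000)
for start, stop in bounds:
    print(start, stop)  # each pair can be used as df_sample[start:stop]
```

Each chunk could then be scraped and written to its own CSV before moving on, so that an interruption costs at most one chunk of work.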

In [11]:
# Look up the Wikipedia Cast section for each title in the first chunk
df_sample_20['cast'] = df_sample_20['original_title'].progress_apply(find_movie_page)

100%|██████████| 20000/20000 [7:56:19<00:00,  1.43s/it]  


In [12]:
# Look up the Wikipedia Cast section for each title in the second chunk
df_sample_40['cast'] = df_sample_40['original_title'].progress_apply(find_movie_page)

100%|██████████| 19999/19999 [6:42:48<00:00,  1.21s/it]   


In [13]:
# Look up the Wikipedia Cast section for each title in the third chunk
df_sample_60['cast'] = df_sample_60['original_title'].progress_apply(find_movie_page)

100%|██████████| 19999/19999 [6:18:32<00:00,  1.14s/it]  


In [14]:
# Look up the Wikipedia Cast section for each title in the fourth chunk
df_sample_80['cast'] = df_sample_80['original_title'].progress_apply(find_movie_page)

100%|██████████| 25854/25854 [7:39:10<00:00,  1.07s/it]   


### 3.2 Merge Data Frames into One

In [44]:
frames = [df_sample_20,df_sample_40,df_sample_60,df_sample_80]
result = pd.concat(frames)

In [46]:
result = result[['imdb_title_id', 'cast']] # Keep only the pertinent columns

In [47]:
result.head() # Inspect DataFrame head

Unnamed: 0,imdb_title_id,cast
0,tt0000009,"Blanche Bayliss (under the name ""Constance Art..."
1,tt0000574,There is considerable uncertainty over who app...
2,tt0001892,missing
3,tt0002101,
4,tt0002130,Salvatore Papa as Dante Alighieri\nArturo Piro...


In [48]:
# Write the DataFrame you created to a csv called 'movies_cast.csv'
result.to_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/data/movies_cast.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!


# 4.0 Further Scraping of Wikipedia
> We do this because Wikipedia titles some film articles in a set format. For example, if you look up the Cast section under the title [Iron Man], an error occurs and 'missing' is returned. However, if you append the release year and the word "film", as in [Iron Man (2008 film)], you get the Cast section. So I had to run a second scrape to get these values.
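
The retry idea can be sketched as a single lookup that walks through several candidate titles. This is a minimal illustration under stated assumptions, not the code used in this notebook: `candidate_titles` and `first_cast_section` are hypothetical helpers, the `Title (film)` pattern is an extra Wikipedia naming convention that the notebook itself does not try, and `fetch` stands in for any lookup function such as `find_movie_page`.

```python
def candidate_titles(title, year):
    """Article titles to try, from least to most specific (assumed naming patterns)."""
    return [title, f"{title} (film)", f"{title} ({year} film)"]

def first_cast_section(title, year, fetch):
    """Return the first non-empty cast text found among the candidate titles.

    `fetch` is any callable that maps an article title to cast text,
    None, '' or the sentinel 'missing' (for example find_movie_page).
    """
    for candidate in candidate_titles(title, year):
        cast = fetch(candidate)
        if cast and cast != 'missing':
            return cast
    return 'missing'

# Offline stand-in for the real lookup, for illustration only
fake_pages = {'Iron Man (2008 film)': 'Robert Downey Jr. as Tony Stark / Iron Man'}
print(first_cast_section('Iron Man', 2008, lambda t: fake_pages.get(t, 'missing')))
```

Trying every pattern in one pass would avoid a separate re-scrape, at the cost of more requests per title.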

**Re-import the scraped DataFrame above for further scraping**

In [4]:
start = time.time()
warnings.filterwarnings("ignore")
df_cast_wiki = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/data/movies_cast.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 0.25 seconds


In [17]:
# Create DataFrame with only Cast missing or null values. These are the ones we want to re-scrape.
results = df_cast_wiki[(df_cast_wiki['cast'].isnull()) | (df_cast_wiki['cast']=='missing')]
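
The filter above groups two different outcomes together. As far as I understand the `wikipedia` package, `section()` returns `None` when a page has no section with that title and may return an empty string when the section only contains subheadings; both end up as NaN after a round trip through CSV, while 'missing' marks a failed page lookup. The hypothetical helper `classify_cast` below (not part of the original notebook) separates these cases, which may help when deciding what is worth re-scraping.

```python
import math

def classify_cast(value):
    """Label a cast cell as 'page_not_found', 'no_cast_text', or 'found'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 'no_cast_text'      # assumed: a page was found, but no usable Cast text
    if value == 'missing':
        return 'page_not_found'    # sentinel returned by find_movie_page
    if not str(value).strip():
        return 'no_cast_text'      # empty or whitespace-only text
    return 'found'

for cell in ['missing', float('nan'), 'Salvatore Papa as Dante Alighieri']:
    print(classify_cast(cell))
```

Applied to the scraped column, for example with `df_cast_wiki['cast'].map(classify_cast).value_counts()`, this should give a quick breakdown of the three outcomes.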

In [18]:
# Inner merge the movies with the entries that still need a cast, keyed on imdb_title_id
df_movies_update = pd.merge(df_movies, results, on='imdb_title_id', how='inner')

In [19]:
df_movies_update.head() # Inspect the head of the movies

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,cast
0,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0,missing
1,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0,
2,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913,"Biography, Drama",60,USA,English,Sidney Olcott,...,"An account of the life of Jesus Christ, based ...",5.7,484,,,,,13.0,5.0,missing
3,tt0002423,Madame DuBarry,Madame DuBarry,1919,1919-11-26,"Biography, Drama, Romance",85,Germany,German,Ernst Lubitsch,...,"The story of Madame DuBarry, the mistress of L...",6.8,753,,,,,12.0,9.0,
4,tt0002445,Quo Vadis?,Quo Vadis?,1913,1913-03-01,"Drama, History",120,Italy,Italian,Enrico Guazzoni,...,"An epic Italian film ""Quo Vadis"" influenced ma...",6.2,273,ITL 45000,,,,7.0,5.0,


In [21]:
df_movies_update['year'] = df_movies_update['year'].astype(str) # Convert Year to string type

In [22]:
# Update to Wikipedia format, e.g. [Iron Man (2008 film)]
df_movies_update['wiki_title'] = df_movies_update['original_title'] + ' ' + '(' + df_movies_update['year'] + ' ' + 'film' + ')'

In [23]:
df_movies_update.head() # Inspect ['wiki_title'] column to ensure format is correct

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,cast,wiki_title
0,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,5.8,188,,,,,5.0,2.0,missing,Den sorte drøm (1911 film)
1,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,5.2,446,$ 45000,,,,25.0,3.0,,Cleopatra (1912 film)
2,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913,"Biography, Drama",60,USA,English,Sidney Olcott,...,5.7,484,,,,,13.0,5.0,missing,"From the Manger to the Cross; or, Jesus of Naz..."
3,tt0002423,Madame DuBarry,Madame DuBarry,1919,1919-11-26,"Biography, Drama, Romance",85,Germany,German,Ernst Lubitsch,...,6.8,753,,,,,12.0,9.0,,Madame DuBarry (1919 film)
4,tt0002445,Quo Vadis?,Quo Vadis?,1913,1913-03-01,"Drama, History",120,Italy,Italian,Enrico Guazzoni,...,6.2,273,ITL 45000,,,,7.0,5.0,,Quo Vadis? (1913 film)


In [24]:
# Look up the Wikipedia Cast section again, this time using the 'Title (year film)' format
df_movies_update['cast_wiki'] = df_movies_update['wiki_title'].progress_apply(find_movie_page)

100%|██████████| 61986/61986 [11:53:37<00:00,  1.45it/s]    
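
As a sanity check on the total runtime, the elapsed times reported by the five progress bars in this notebook can be added up. The snippet below only does that arithmetic; the durations are copied from the tqdm output, and it ignores any time spent outside the scraping cells.

```python
from datetime import timedelta

# Elapsed times copied from the five tqdm progress bars in this notebook
passes = [
    timedelta(hours=7, minutes=56, seconds=19),   # first chunk
    timedelta(hours=6, minutes=42, seconds=48),   # second chunk
    timedelta(hours=6, minutes=18, seconds=32),   # third chunk
    timedelta(hours=7, minutes=39, seconds=10),   # fourth chunk
    timedelta(hours=11, minutes=53, seconds=37),  # re-scrape with 'Title (year film)'
]

total = sum(passes, timedelta())
total_hours = round(total.total_seconds() / 3600, 2)
print(total_hours)
```

The sum comes to roughly 40.5 hours of scraping across the five passes.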


In [25]:
# Write the DataFrame you created to a csv called 'movies_cast_wiki.csv'
df_movies_update.to_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/data/movies_cast_wiki.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!


In [27]:
df_movies_update.isnull().sum()

imdb_title_id                0
title                        0
original_title               0
year                         0
date_published               0
genre                        0
duration                     0
country                     56
language                   657
director                    80
writer                    1201
production_company        3575
actors                      68
description               1849
avg_vote                     0
votes                        0
budget                   44471
usa_gross_income         51490
worlwide_gross_income    39200
metascore                52774
reviews_from_users        6397
reviews_from_critics      9191
cast                     22599
wiki_title                   0
cast_wiki                  897
dtype: int64

# The END