# Analyzing the Gender Gap in Hollywood

# 0.0 NoteBook Objective

The objective of this notebook is to merge 5 dataframes into a single dataframe for our EDA and modelling. Hang tight, things are going to get merged inside and outside!

# 1.0 Importing Libraries

In [29]:
# The usual suspects to get our notebook looking good
import pandas as pd
import numpy as np
import warnings
import missingno
from tqdm import tqdm
import math, time, random, datetime

In [30]:
tqdm.pandas() # This initiates the tqdm library with pandas so we can track progress

# 2.0 Importing DataSets

## 2.1 Import IMDB Names
This contains the names and birth/death dates of all 297,705 actors and actresses in our dataset.

In [31]:
# Import IMDb names
start = time.time()
warnings.filterwarnings("ignore")
df_names = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/archive/IMDb names.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 2.46 seconds
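
The same start/stop timing pattern recurs in every import cell below. As a minimal sketch, it could be wrapped in a small helper; `timed_read_csv` is a hypothetical name and is not part of the original notebook.

```python
import time
import pandas as pd

def timed_read_csv(path, **kwargs):
    """Read a CSV with pandas and print how long the load took.

    Hypothetical helper for illustration; extra keyword arguments
    are passed straight through to pd.read_csv.
    """
    start = time.perf_counter()
    frame = pd.read_csv(path, **kwargs)
    elapsed = time.perf_counter() - start
    print(f'It took {round(elapsed, 2)} seconds')
    return frame
```

Each import cell would then reduce to a single call such as `df_names = timed_read_csv(path_to_names_csv)`, where `path_to_names_csv` stands in for the file path used above.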


In [32]:
df_names.head(1)

Unnamed: 0,imdb_name_id,name,birth_name,height,bio,birth_details,date_of_birth,place_of_birth,death_details,date_of_death,place_of_death,reason_of_death,spouses_string,spouses,divorces,spouses_with_children,children
0,nm0000001,Fred Astaire,Frederic Austerlitz Jr.,177.0,"Fred Astaire was born in Omaha, Nebraska, to J...","May 10, 1899 in Omaha, Nebraska, USA",1899-05-10,"Omaha, Nebraska, USA","June 22, 1987 in Los Angeles, California, USA ...",1987-06-22,"Los Angeles, California, USA",pneumonia,Robyn Smith (27 June 1980 - 22 June 1987) (hi...,2,0,1,2


In [33]:
df_names.shape

(297705, 17)

## 2.2 Import IMDB Movies
This contains the names of IMDb movies and other information such as budget and gross income (USA and worldwide).

In [34]:
# Import IMDb movies
start = time.time()
warnings.filterwarnings("ignore")
df_movies = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/archive/IMDb movies.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 1.19 seconds


In [35]:
df_movies.columns # Inspect column names

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

## 2.3 Import IMDB Ratings

This contains movie ratings broken down by vote score and by voter demographics.

In [36]:
# Import imdb ratings dataset
start = time.time()
warnings.filterwarnings("ignore")
df_ratings = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/archive/IMDb ratings.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 0.59 seconds


In [37]:
df_ratings.head(1)

Unnamed: 0,imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,...,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
0,tt0000009,5.9,154,5.9,6.0,12,4,10,43,28,...,5.7,13.0,4.5,4.0,5.7,34.0,6.4,51.0,6.0,70.0


## 2.4 Import IMDB Title Principals

This is a key dataset that tells us the category of the cast members.

In [38]:
# Import imdb title principals dataset
start = time.time()
warnings.filterwarnings("ignore")
df_title_principals = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/archive/IMDb title_principals.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 1.02 seconds


In [39]:
df_title_principals.head()

Unnamed: 0,imdb_title_id,ordering,imdb_name_id,category,job,characters
0,tt0000009,1,nm0063086,actress,,"[""Miss Geraldine Holbrook (Miss Jerry)""]"
1,tt0000009,2,nm0183823,actor,,"[""Mr. Hamilton""]"
2,tt0000009,3,nm1309758,actor,,"[""Chauncey Depew - the Director of the New Yor..."
3,tt0000009,4,nm0085156,director,,
4,tt0000574,1,nm0846887,actress,,"[""Kate Kelly""]"


# 3.0 Import Scraped Wikipedia DataSets for Merging

## 3.1 Import Wikipedia Populated Data Frame (Cast): Part 1

This is the first scraped dataset from Wikipedia, containing a role summary for each movie.

In [40]:
# Import Wikipedia DataFrame
start = time.time()
warnings.filterwarnings("ignore")
df_cast_wiki = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/data/movies_cast.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 0.29 seconds


In [41]:
df_cast_wiki.shape

(85852, 2)

## 3.2 Import Wikipedia Populated Data Frame (Cast): Part 2

Because some movies on Wikipedia are listed as *Movie Name (1999 film)*, I created another Wikipedia scrape to pull in the casts of these movies.

In [45]:
# Import Wikipedia DataFrame
start = time.time()
warnings.filterwarnings("ignore")
df_cast_wiki_deep = pd.read_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/data/movies_cast_wiki.csv')
end = time.time()
print(f'It took {round((end-start),2)} seconds')

It took 1.0 seconds


In [46]:
df_cast_wiki_deep.shape # Inspect shape of DataFrame

(61986, 25)

In [48]:
df_cast_wiki_deep = df_cast_wiki_deep[['imdb_title_id','cast_wiki']] # Assign Pertinent Columns to DataFrame

In [49]:
# Rename DataSet column
# Reassign instead of using inplace=True, since the frame is a column slice of the original
df_cast_wiki_deep = df_cast_wiki_deep.rename(columns={'cast_wiki': 'cast'}, errors='raise')

In [51]:
df_cast_wiki_deep.shape # Inspect shape of DataFrame

(7138, 2)

## 3.3 Merge Wikipedia DataSets above into One: Part 3

In [52]:
# Perform an Outer Merge
df_wiki_merge = pd.merge(df_cast_wiki_deep, df_cast_wiki, on=['imdb_title_id','cast'], how='outer') 

In [53]:
df_wiki_merge.head() # Inspect DataFrame Head

Unnamed: 0,imdb_title_id,cast
0,tt0002101,Helen Gardner as Cleopatra - Queen of Egypt\nM...
1,tt0002423,"Pola Negri as Madame DuBarry, The Former Jeann..."
2,tt0002461,Robert Gemp ... King Edward IV\nFrederick Ward...
3,tt0002646,Olaf Fønss as Dr. Friedrich von Kammacher\nIda...
4,tt0003167,Henry B. Walthall .... John Howard Payne\nJos...


In [54]:
df_wiki_merge.shape # Inspect Final Merged Shape

(31004, 2)
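
To make the effect of an outer merge on both columns concrete, here is a small sketch with made-up rows (the IDs and cast strings are invented, not taken from the data). Rows that are identical in both frames collapse into one, while every other row from either frame is kept.

```python
import pandas as pd

# Toy frames with invented values, standing in for the two Wikipedia scrapes
deep = pd.DataFrame({'imdb_title_id': ['tt001', 'tt002'],
                     'cast': ['A as X', 'B as Y']})
shallow = pd.DataFrame({'imdb_title_id': ['tt002', 'tt003'],
                        'cast': ['B as Y', 'C as Z']})

# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(deep, shallow, on=['imdb_title_id', 'cast'],
                  how='outer', indicator=True)
print(merged)
```

Because `cast` is part of the key, a title whose cast text differs between the two scrapes would appear twice in the result.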

## 3.4 Process Merged DataSet

After merging the datasets, we need to process the cast column into one coherent dataset that separates the cluster of text we have into discernible portions.

In [55]:
# Perform an inner merge with the original IMDb movies dataset
df_movies_update = pd.merge(df_movies, df_wiki_merge, on='imdb_title_id', how='inner')

In [56]:
df_movies_update.shape # Inspect shape of dataframe

(31004, 23)

In [57]:
df_movies_update.head(1) # Inspect head of dataframe

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,cast
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0,"Blanche Bayliss (under the name ""Constance Art..."


### 3.4.1 Function to Handle Scraped DataSets That Are Separated by *\n*
This function breaks out our text into 3 discernible pieces:
1. Actor/Actress Name
2. Actor/Actress Character name in the movies
3. Actor/Actress Character Role in the movies

In [58]:
def cast_roles (section):
    
    if section == 'missing': # This line is to handle error thrown in columns where cast == 'missing'
        return 'missing'
    
    try:
        test = section.replace("\'", "").split('\n') # Split up scraped cast data into list of paragraphs
    except:
        return 'missing'
    
    emptydict = [] # This list is to store zipped dictionary 
    
    empty = [] # This is to store the actors names
    for i in range(0,len(test)):
        try:
            empty.append(test[i].split(' as')[0])
        except:
            continue
        
    character = [] # This is to store the characters they played names
    for i in range(0,len(test)):
        start = test[i].find('as') + 3
        end = test[i].find(':', start)
        character.append(test[i][start:end])
        
    role = [] # This is to store the roles
    for i in range(0,len(test)):
        if ':' not in test[i]:
            try:
                role.append(test[i].split(' as ')[1])
            except:
                continue

        else:    
            start = test[i].find(':') + 1
            end = test[i].find('. ', start)
            role.append(test[i][start:end])
        
    dictlist = ['name','character','role'] # This stores the dictionary keys
    zipped = list(zip(empty,character,role)) # Zip all 3 lists above into list of lists
    
    for i in zipped:

        emptydict.append(dict(zip(dictlist,i))) # This converts list of lists to dictionary
        
    return emptydict
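
One behaviour of the function above is worth spelling out. `str.find` returns -1 when there is no `':'` in a line, so the slice `test[i][start:end]` drops the last character; this is consistent with values such as `Walter Hamilto` in the `character` column of the output further down. Searching for `'as'` without surrounding spaces can also match inside a name. A minimal sketch of a per-line splitter that avoids both, using `str.partition`; `split_cast_line` is a hypothetical helper, not the notebook's function.

```python
def split_cast_line(line):
    """Split 'Name as Character: role description' into three parts.

    Hypothetical helper offered as a sketch. When there is no ':',
    the role falls back to the character text, as in cast_roles.
    """
    name, sep, rest = line.partition(' as ')
    if not sep:
        # No ' as ' separator: keep the whole line as the name
        return {'name': line.strip(), 'character': '', 'role': ''}
    character, colon, role = rest.partition(':')
    character = character.strip()
    return {'name': name.strip(),
            'character': character,
            'role': role.strip() if colon else character}

print(split_cast_line('William Courtenay as Walter Hamilton'))
```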

In [59]:
# Build the cast dictionary
df_movies_update['cast_dict'] = df_movies_update['cast'].progress_apply(cast_roles)

df_movies_updated = df_movies_update.copy()

df_movies_updated = df_movies_updated[['imdb_title_id','original_title','cast_dict']]

df_movies_updated = df_movies_updated[(df_movies_updated['cast_dict'].notnull()) & (df_movies_updated['cast_dict']!='missing')]

Idx=df_movies_updated.set_index(['imdb_title_id','original_title']).cast_dict.progress_apply(pd.Series).stack().index
# https://stackoverflow.com/questions/46220941/pandas-unpack-a-column-with-list-of-dict-values-into-multiple-columns

df_movies_updated = pd.DataFrame(df_movies_updated.set_index(['imdb_title_id']).cast_dict.progress_apply(pd.Series).stack().values.tolist(),index=Idx).reset_index().drop(columns='level_2')
# https://stackoverflow.com/questions/46220941/pandas-unpack-a-column-with-list-of-dict-values-into-multiple-columns

100%|██████████| 31004/31004 [00:01<00:00, 22932.22it/s]
100%|██████████| 31004/31004 [00:08<00:00, 3871.92it/s]
100%|██████████| 31004/31004 [00:07<00:00, 3985.82it/s]
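
The `apply(pd.Series).stack()` idiom above builds a wide intermediate frame before reshaping it. On pandas 0.25 or later, a similar long-format result can be sketched with `DataFrame.explode`. The toy data below is invented for illustration, and this is an alternative technique rather than the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_title_id': ['tt001', 'tt002'],
    'original_title': ['First', 'Second'],
    'cast_dict': [
        [{'name': 'A', 'character': 'X', 'role': 'lead'},
         {'name': 'B', 'character': 'Y', 'role': 'friend'}],
        [{'name': 'C', 'character': 'Z', 'role': 'villain'}],
    ],
})

# One row per list element; empty lists become NaN and are dropped
long_df = (toy.explode('cast_dict')
              .dropna(subset=['cast_dict'])
              .reset_index(drop=True))

# Expand each dict into name/character/role columns and join them back
parts = pd.DataFrame(long_df['cast_dict'].tolist())
flat = pd.concat([long_df.drop(columns='cast_dict'), parts], axis=1)
print(flat)
```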


### 3.4.2 Function to Handle Scraped DataSets That Are Not Separated by *\n*
For some reason, the text on Wikipedia was in some cases not separated by a paragraph break, so there is no occurrence of '\n' to split on. This function is a slight modification of the one above to handle that situation: it splits on '.' instead and breaks our text into 3 discernible pieces:
1. Actor/Actress Name
2. Actor/Actress Character name in the movies
3. Actor/Actress Character Role in the movies

In [60]:
def cast_roles_two (section):
    
    if section == 'missing': # This line is to handle error thrown in columns where cast == 'missing'
        return 'missing'
    
    
    try:
        test = section.replace("\n", " ")
        test = test.replace("\'", "").split('.') # Split up scraped cast data into list of sentences
    except:
        return 'missing'
    
    emptydict = [] # This list is to store zipped dictionary 
    
    empty = [] # This is to store the actors names
    for i in range(0,len(test)):
        if ':' not in test[i]:
            continue
        else:
            try:
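                # str.replace is literal, so the old '[^\w\s]' pattern never matched;
                # a later cell strips punctuation from `name` with a regex
                empty.append(test[i].split(' as')[0])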
            except:
                continue

    character = [] # This is to store the characters they played names
    for i in range(0,len(test)):
        if ':' not in test[i]:
            continue
        else:
            start = test[i].find('as') + 3
            end = test[i].find(':', start)
            character.append(test[i][start:end])
        
    role = []
    for i in range(0,len(test)):
        if ':' not in test[i]:
            continue

        else:    
            role.append(test[i].split(':')[1])
#             start = test[i].find(':') + 1
#             end = test[i].find('.', start)
#             role.append(test[i][start:end])
        
    dictlist = ['name','character','role'] # This stores the dictionary keys
    zipped = list(zip(empty,character,role)) # Zip all 3 lists above into list of lists
    
    for i in zipped:

        emptydict.append(dict(zip(dictlist,i))) # This converts list of lists to dictionary
        
    return emptydict

In [61]:
# Build the cast dictionary
df_movies_update['cast_dict'] = df_movies_update['cast'].progress_apply(cast_roles_two)

df_movies_updated_two = df_movies_update.copy()

df_movies_updated_two = df_movies_updated_two[['imdb_title_id','original_title','cast_dict']]

df_movies_updated_two = df_movies_updated_two[(df_movies_updated_two['cast_dict'].notnull()) & (df_movies_updated_two['cast_dict']!='missing')]

Idx=df_movies_updated_two.set_index(['imdb_title_id','original_title']).cast_dict.progress_apply(pd.Series).stack().index
# https://stackoverflow.com/questions/46220941/pandas-unpack-a-column-with-list-of-dict-values-into-multiple-columns

df_movies_updated_two = pd.DataFrame(df_movies_updated_two.set_index(['imdb_title_id']).cast_dict.progress_apply(pd.Series).stack().values.tolist(),index=Idx).reset_index().drop(columns='level_2')
# https://stackoverflow.com/questions/46220941/pandas-unpack-a-column-with-list-of-dict-values-into-multiple-columns

100%|██████████| 31004/31004 [00:00<00:00, 124424.07it/s]
100%|██████████| 31004/31004 [00:07<00:00, 4338.29it/s]
100%|██████████| 31004/31004 [00:07<00:00, 4212.02it/s] 


### 3.4.3 Joining Our Two DataSets Above

In [63]:
# Add the 2 dataframes together
df_movies_updated = pd.concat([df_movies_updated, df_movies_updated_two])

In [64]:
# Remove the punctuation marks created by the second Wikipedia scrape
# Note: this runs after the concat above, so it changes df_movies_updated_two only
df_movies_updated_two['name'] = df_movies_updated_two['name'].str.replace(r'[^\w\s]+', '', regex=True)

In [65]:
df_movies_updated.shape # Inspect shape of the resulting dataframe

(312223, 5)

In [66]:
df_movies_updated.head() # Inspect head of resulting dataframe

Unnamed: 0,imdb_title_id,original_title,name,character,role
0,tt0000009,Miss Jerry,"Blanche Bayliss (under the name ""Constance Art...",Miss Geraldine Holbrook (Miss Jerry,Miss Geraldine Holbrook (Miss Jerry)
1,tt0000009,Miss Jerry,William Courtenay,Walter Hamilto,Walter Hamilton
2,tt0000009,Miss Jerry,Chauncey Depew,Himself (Director of the New York Central Rail...,Himself (Director of the New York Central Rail...
3,tt0000574,The Story of the Kelly Gang,There is considerable uncertainty over who app...,ere is considerable uncertainty over who appea...,Dan Kelly
4,tt0000574,The Story of the Kelly Gang,,,the stunt double for the actress playing Kate ...


# 4.0 Align NameIDs with df_movies_updated

Now we need to link the dataframe created in section 3.0 above to the IMDb name IDs. This allows us to link and merge the tables.

In [67]:
# Create a copy of the names dataframe and assign it pertinent columns ['imdb_name_id','name']
scrape_df = df_names.copy()
scrape_df = scrape_df[['imdb_name_id','name']]
scrape_df.head()

Unnamed: 0,imdb_name_id,name
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall
2,nm0000003,Brigitte Bardot
3,nm0000004,John Belushi
4,nm0000005,Ingmar Bergman


#### We then zip the DataFrame together to create a dictionary where the name ID is the key and the name is the value

In [68]:
imdb_dictionary = dict(zip(scrape_df['imdb_name_id'],scrape_df['name']))
# Zip the 2 columns into a dictionary

#### Function to assign name_id key based on name value

In [69]:
def imdb_to_wiki (value):
    for k,v in imdb_dictionary.items():
        if v == value:
            return k
        else:
            continue

#### Application of our naming Function

In [71]:
df_movies_updated['imdb_name_id'] = df_movies_updated['name'].progress_apply(imdb_to_wiki)
# TQDM apply and track progress of key extraction from Dictionary

100%|██████████| 312223/312223 [46:52<00:00, 111.02it/s] 
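
The loop in `imdb_to_wiki` scans the whole dictionary for every row, which is why the apply above took about 46 minutes. A sketch of a faster alternative, assuming exact string matches are still what is wanted: build the reverse mapping (name to ID) once, then use `Series.map`. The tiny frames below are illustrative, and `nm9999999` is an invented ID.

```python
import pandas as pd

names = pd.DataFrame({'imdb_name_id': ['nm0000001', 'nm0000002', 'nm9999999'],
                      'name': ['Fred Astaire', 'Lauren Bacall', 'Fred Astaire']})
cast = pd.DataFrame({'name': ['Lauren Bacall', 'Fred Astaire', 'Unknown Person']})

# Reverse lookup: name -> first matching ID, mirroring the loop's
# first-match behaviour when several IDs share one name
name_to_id = {}
for name_id, name in zip(names['imdb_name_id'], names['name']):
    name_to_id.setdefault(name, name_id)

# Each lookup is now a single hash access; unmatched names become NaN
cast['imdb_name_id'] = cast['name'].map(name_to_id)
print(cast)
```

Like the original loop, this resolves duplicate names to whichever ID appears first, so two different people who share a name are still not told apart.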


#### Here we inspect to see how many name_ids were properly linked

In [74]:
df_movies_updated.isnull().sum() # Check null value sums

imdb_title_id          0
original_title         0
name                   0
character              0
role                   0
imdb_name_id      110722
dtype: int64

In [76]:
df_movies_updated.dropna(subset=['imdb_name_id'], inplace= True)

# 5.0 Merge Updated Movie Data Frame (with roles) to Principals DataFrame

With the name IDs in place, we merge the result with the IMDb Title Principals DataFrame. This enables us to link to some interesting data features, such as cast ordering.

In [77]:
df_title_principals.shape # Inspect shape of DataFrame

(835513, 6)

In [78]:
df_movies_updated.shape # Inspect shape of DataFrame

(201501, 6)

In [79]:
df = pd.merge(df_title_principals, df_movies_updated, on=["imdb_title_id","imdb_name_id"], how='inner')
# Inner Merge the 2 dataframes

In [80]:
df.head() # Inspect head of DataFrame, here we see how well we scraped and merged datasets

Unnamed: 0,imdb_title_id,ordering,imdb_name_id,category,job,characters,original_title,name,character,role
0,tt0000009,2,nm0183823,actor,,"[""Mr. Hamilton""]",Miss Jerry,William Courtenay,Walter Hamilto,Walter Hamilton
1,tt0000009,3,nm1309758,actor,,"[""Chauncey Depew - the Director of the New Yor...",Miss Jerry,Chauncey Depew,Himself (Director of the New York Central Rail...,Himself (Director of the New York Central Rail...
2,tt0000574,1,nm0846887,actress,,"[""Kate Kelly""]",The Story of the Kelly Gang,Elizabeth Tait,the stunt double for the actress playing Kate ...,Steve Hart
3,tt0002101,1,nm0306947,actress,,"[""Cleopatra - Queen of Egypt""]",Cleopatra,Helen Gardner,Cleopatra - Queen of Egyp,Cleopatra - Queen of Egypt
4,tt0002101,2,nm0801774,actress,,"[""Iras - An Attendant""]",Cleopatra,Pearl Sindelar,Iras - An attendan,Iras - An attendant


In [82]:
df.shape

(89736, 10)

# 6.0 Final Merge to Movie imdb DataFrame
This links our current dataframe to the IMDb movies and names dataframes.

In [99]:
df_final = pd.merge(df_movies, df, on=["imdb_title_id"], how='inner')
# Inner Merge with movie IMDB dataframe

In [100]:
df_final = pd.merge(df_names, df_final, on=["imdb_name_id"], how='inner')
# Inner Merge resulting with name IMDB dataframe

In [101]:
df_final.shape # DataFrame Shape

(89736, 47)

In [102]:
df_final.columns # Inspect DataFrame column titles

Index(['imdb_name_id', 'name_x', 'birth_name', 'height', 'bio',
       'birth_details', 'date_of_birth', 'place_of_birth', 'death_details',
       'date_of_death', 'place_of_death', 'reason_of_death', 'spouses_string',
       'spouses', 'divorces', 'spouses_with_children', 'children',
       'imdb_title_id', 'title', 'original_title_x', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'ordering', 'category',
       'job', 'characters', 'original_title_y', 'name_y', 'character', 'role'],
      dtype='object')
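
The `_x` and `_y` suffixes in the column list come from pandas disambiguating non-key columns that exist in both frames. A small sketch with invented rows shows how the `suffixes` parameter of `pd.merge` could give such columns more descriptive names; the suffix strings and the `nm1` ID are illustrative choices.

```python
import pandas as pd

left = pd.DataFrame({'imdb_name_id': ['nm1'], 'name': ['Fred Astaire']})
right = pd.DataFrame({'imdb_name_id': ['nm1'], 'name': ['Fred Astaire'],
                      'role': ['Guy Holden']})

# Default behaviour: overlapping non-key columns get _x and _y
default = pd.merge(left, right, on='imdb_name_id', how='inner')

# Explicit suffixes make the origin of each column readable
named = pd.merge(left, right, on='imdb_name_id', how='inner',
                 suffixes=('_imdb', '_wiki'))

print(list(default.columns))
print(list(named.columns))
```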

In [103]:
df_final.head().T # Transpose Head of Final DataFrame for inspection

Unnamed: 0,0,1,2,3,4
imdb_name_id,nm0000001,nm0000001,nm0000001,nm0000001,nm0000001
name_x,Fred Astaire,Fred Astaire,Fred Astaire,Fred Astaire,Fred Astaire
birth_name,Frederic Austerlitz Jr.,Frederic Austerlitz Jr.,Frederic Austerlitz Jr.,Frederic Austerlitz Jr.,Frederic Austerlitz Jr.
height,177.0,177.0,177.0,177.0,177.0
bio,"Fred Astaire was born in Omaha, Nebraska, to J...","Fred Astaire was born in Omaha, Nebraska, to J...","Fred Astaire was born in Omaha, Nebraska, to J...","Fred Astaire was born in Omaha, Nebraska, to J...","Fred Astaire was born in Omaha, Nebraska, to J..."
birth_details,"May 10, 1899 in Omaha, Nebraska, USA","May 10, 1899 in Omaha, Nebraska, USA","May 10, 1899 in Omaha, Nebraska, USA","May 10, 1899 in Omaha, Nebraska, USA","May 10, 1899 in Omaha, Nebraska, USA"
date_of_birth,1899-05-10,1899-05-10,1899-05-10,1899-05-10,1899-05-10
place_of_birth,"Omaha, Nebraska, USA","Omaha, Nebraska, USA","Omaha, Nebraska, USA","Omaha, Nebraska, USA","Omaha, Nebraska, USA"
death_details,"June 22, 1987 in Los Angeles, California, USA ...","June 22, 1987 in Los Angeles, California, USA ...","June 22, 1987 in Los Angeles, California, USA ...","June 22, 1987 in Los Angeles, California, USA ...","June 22, 1987 in Los Angeles, California, USA ..."
date_of_death,1987-06-22,1987-06-22,1987-06-22,1987-06-22,1987-06-22


#### Further Cleaning and Processing of Final DataFrame

In [104]:
# Delete rows where date of birth is null
df_final.dropna(subset=['date_of_birth'],inplace=True)

In [105]:
# Select first 4 digits of date_of_birth
df_final['date_of_birth'] = [x[:4] for x in df_final['date_of_birth']]

In [106]:
# convert date of birth and year to numerics, and coerce non-numeric strings to NaN
df_final['year'] = pd.to_numeric(df_final['year'], errors='coerce')
df_final['date_of_birth'] = pd.to_numeric(df_final['date_of_birth'],errors='coerce')

In [107]:
# remove $ and other punctuation from the money columns (regex=True is required on recent pandas)
df_final['budget'] = df_final['budget'].str.replace(r'[^\w\s]', '', regex=True)
df_final['usa_gross_income'] = df_final['usa_gross_income'].str.replace(r'[^\w\s]', '', regex=True)
df_final['worlwide_gross_income'] = df_final['worlwide_gross_income'].str.replace(r'[^\w\s]', '', regex=True)

In [108]:
# convert USD amounts to numerics, and coerce non-numeric strings to NaN
df_final['budget'] = pd.to_numeric(df_final['budget'], errors='coerce')
df_final['usa_gross_income'] = pd.to_numeric(df_final['usa_gross_income'], errors='coerce')
df_final['worlwide_gross_income'] = pd.to_numeric(df_final['worlwide_gross_income'], errors='coerce')
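
The two steps above can be sketched on a toy Series to show what survives the conversion. The example values are invented, and the formats actually present in the real columns are an assumption here. The pattern `[^\w\s]` removes punctuation such as `$` but keeps letters, so any value that still contains letters is coerced to NaN.

```python
import pandas as pd

# Invented example values; the real column's formats are assumed, not verified
budget = pd.Series(['$1000000', 'EUR 500000', None])

# Strip punctuation, then convert; anything non-numeric becomes NaN
cleaned = budget.str.replace(r'[^\w\s]', '', regex=True)
as_number = pd.to_numeric(cleaned, errors='coerce')
print(as_number)
```

If the real columns do contain values prefixed with a currency code, those rows end up as NaN under this approach, so the numeric columns cover only the values that reduce to plain digits.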

In [110]:
df_final['age_at_release'] = df_final['year'] - df_final['date_of_birth'] # Compute age of actor at movie release

In [111]:
df_final.head() #Inspect DataFrame

Unnamed: 0,imdb_name_id,name_x,birth_name,height,bio,birth_details,date_of_birth,place_of_birth,death_details,date_of_death,...,reviews_from_critics,ordering,category,job,characters,original_title_y,name_y,character,role,age_at_release
0,nm0000001,Fred Astaire,Frederic Austerlitz Jr.,177.0,"Fred Astaire was born in Omaha, Nebraska, to J...","May 10, 1899 in Omaha, Nebraska, USA",1899.0,"Omaha, Nebraska, USA","June 22, 1987 in Los Angeles, California, USA ...",1987-06-22,...,42.0,1,actor,,"[""Guy Holden""]",The Gay Divorcee,Fred Astaire,Guy Holde,Guy Holden,35.0
1,nm0000001,Fred Astaire,Frederic Austerlitz Jr.,177.0,"Fred Astaire was born in Omaha, Nebraska, to J...","May 10, 1899 in Omaha, Nebraska, USA",1899.0,"Omaha, Nebraska, USA","June 22, 1987 in Los Angeles, California, USA ...",1987-06-22,...,20.0,2,actor,,"[""Huck Haines""]",Roberta,Fred Astaire,Huc,Huck,36.0
2,nm0000001,Fred Astaire,Frederic Austerlitz Jr.,177.0,"Fred Astaire was born in Omaha, Nebraska, to J...","May 10, 1899 in Omaha, Nebraska, USA",1899.0,"Omaha, Nebraska, USA","June 22, 1987 in Los Angeles, California, USA ...",1987-06-22,...,74.0,1,actor,,"[""Lucky Garnett""]",Swing Time,Fred Astaire,"John ""Lucky"" Garnet","John ""Lucky"" Garnett",37.0
3,nm0000001,Fred Astaire,Frederic Austerlitz Jr.,177.0,"Fred Astaire was born in Omaha, Nebraska, to J...","May 10, 1899 in Omaha, Nebraska, USA",1899.0,"Omaha, Nebraska, USA","June 22, 1987 in Los Angeles, California, USA ...",1987-06-22,...,17.0,1,actor,,"[""Jerry Halliday""]",A Damsel in Distress,Fred Astaire,Jerr,Jerry,38.0
4,nm0000001,Fred Astaire,Frederic Austerlitz Jr.,177.0,"Fred Astaire was born in Omaha, Nebraska, to J...","May 10, 1899 in Omaha, Nebraska, USA",1899.0,"Omaha, Nebraska, USA","June 22, 1987 in Los Angeles, California, USA ...",1987-06-22,...,23.0,1,actor,,"[""Peter P. Peters aka Petrov""]",Shall We Dance,Fred Astaire,"Peter P. ""Petrov"" Peter","Peter P. ""Petrov"" Peters",38.0


#### Delete Duplicates

In [112]:
# remove duplicates
df_final = df_final.drop_duplicates(subset = ['imdb_name_id','imdb_title_id'], keep = 'first')

#### Select The Key Features

In [116]:
df_final.columns

Index(['imdb_name_id', 'name_x', 'birth_name', 'height', 'bio',
       'birth_details', 'date_of_birth', 'place_of_birth', 'death_details',
       'date_of_death', 'place_of_death', 'reason_of_death', 'spouses_string',
       'spouses', 'divorces', 'spouses_with_children', 'children',
       'imdb_title_id', 'title', 'original_title_x', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'ordering', 'category',
       'job', 'characters', 'original_title_y', 'name_y', 'character', 'role',
       'age_at_release'],
      dtype='object')

In [119]:
df_final.head(1).T

Unnamed: 0,0
imdb_name_id,nm0000001
name_x,Fred Astaire
birth_name,Frederic Austerlitz Jr.
height,177.0
bio,"Fred Astaire was born in Omaha, Nebraska, to J..."
birth_details,"May 10, 1899 in Omaha, Nebraska, USA"
date_of_birth,1899.0
place_of_birth,"Omaha, Nebraska, USA"
death_details,"June 22, 1987 in Los Angeles, California, USA ..."
date_of_death,1987-06-22


In [121]:
df_final = df_final[['imdb_title_id', 'original_title_x','imdb_name_id', 
       'name_x', 'height', 'date_of_birth', 'place_of_birth',
       'spouses', 'divorces', 'children', 'year','genre', 'duration', 'country',
       'production_company', 'avg_vote',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'ordering', 'category',
       'role', 'age_at_release']]

#### Export

In [123]:
# Write the DataFrame you created to a csv
df_final.to_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/Project 6/data/for_EDA.csv', index=False)
print('CSV is exported!')

CSV is exported!
