## Eryk and Jagan's Dataset Assembly 

####  Notebook 1 of 2 for Module 1 Project

* Student names:  **Eryk Wdowiak** and **Jagandeep Singh**
* Student pace:  full-time
* Scheduled project review date:  10 July
* Instructor name:  Fangfang Lee
* Blog post URL:  https://www.wdowiak.me/gotta-blog?answer=Yes

This notebook assembles a dataset of movies.  After loading the individual data files,
we store their contents in dictionaries and then join those dictionaries into one
dataframe, where each observation is a unique movie title.

A second notebook analyzes the merged dataset.

###  load Python packages and list datafiles

In [1]:
##  import python packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pickle

In [2]:
##  get a list of our data files
!ls zippedData/

bom.movie_gross.csv.gz	      imdb.title.ratings.csv.gz
imdb.name.basics.csv.gz       rt.movie_info.tsv.gz
imdb.title.akas.csv.gz	      rt.reviews.tsv.gz
imdb.title.basics.csv.gz      tmdb.movies.csv.gz
imdb.title.crew.csv.gz	      tn.movie_budgets.csv.gz
imdb.title.principals.csv.gz  us-bea_gdp-deflator.csv.gz


###  collect the IMDB movie titles and identifiers into dictionaries

In [3]:
##  first load the movie titles and identifiers
imdb_title_basics = pd.read_csv('zippedData/imdb.title.basics.csv.gz',compression='gzip')
#print(imdb_title_basics.shape)

In [4]:
imdb_title_basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [5]:
#imdb_title_basics.loc[(imdb_title_basics['tconst']=='tt9916170')|(imdb_title_basics['tconst']=='tt9915436')|(imdb_title_basics['tconst']=='tt5820812')]
#imdb_title_basics.loc[(imdb_title_basics['primary_title']=='Titanic')]

In [6]:
##  create two dictionaries -- one of titles and one of IDs
title_dict = {}
ttlid_dict = {}

##  loop through to create the keys
for i in range(imdb_title_basics.shape[0]):
    ##  get the values
    tconst = imdb_title_basics['tconst'].loc[i]
    primary_title = imdb_title_basics['primary_title'].loc[i]
    
    ##  set up the dictionaries
    title_dict[primary_title] = { 'primary_title' : primary_title }    
    ttlid_dict[tconst] = primary_title 
    
##  now build the title dictionary
for i in range(imdb_title_basics.shape[0]):
    ##  get the values
    tconst        = imdb_title_basics['tconst'].loc[i]
    primary_title = imdb_title_basics['primary_title'].loc[i]
    start_year    = imdb_title_basics['start_year'].loc[i]
    genres        = imdb_title_basics['genres'].loc[i]

    ##  append tconst to title_dict (the first time through, the key does not exist yet)
    try:
        title_dict[primary_title]['tconst'].append(tconst)
    except KeyError:
        title_dict[primary_title]['tconst'] = [tconst]

    ##  append start_year to title_dict
    try:
        title_dict[primary_title]['start_year'].append(start_year)
    except KeyError:
        title_dict[primary_title]['start_year'] = [start_year]

    ##  append genres to title_dict
    try:
        title_dict[primary_title]['genres'].append(genres)
    except KeyError:
        title_dict[primary_title]['genres'] = [genres]

In [7]:
# display(title_dict['Titanic'])
# display(ttlid_dict['tt2495766'])
# display(ttlid_dict['tt8852130'])

In [8]:
# display(title_dict['The Rehearsal'])
# display(ttlid_dict['tt5820812'])
# display(ttlid_dict['tt9916170'])
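
The two loops above can also be written with a `groupby`.  The following is a minimal sketch on a toy frame, not the notebook's own method: the three rows and the name `toy` are made up for illustration.  It builds the same nested structure, in which a title shared by several movies maps to lists of identifiers, years, and genres.

```python
import pandas as pd

# toy stand-in for imdb_title_basics (hypothetical rows)
toy = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002', 'tt0000003'],
    'primary_title': ['Titanic', 'Titanic', 'Inception'],
    'start_year': [2012, 2018, 2010],
    'genres': ['Adventure', 'Family', 'Action,Sci-Fi'],
})

# identifier -> title
ttlid_dict = dict(zip(toy['tconst'], toy['primary_title']))

# title -> lists of identifiers, years and genres
title_dict = {}
for title, grp in toy.groupby('primary_title', sort=False):
    title_dict[title] = {
        'primary_title': title,
        'tconst': grp['tconst'].tolist(),
        'start_year': grp['start_year'].tolist(),
        'genres': grp['genres'].tolist(),
    }

print(title_dict['Titanic']['tconst'])
```

Note that `groupby` drops rows whose key is missing by default, so titles that are `NaN` would not appear in this version.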

###  collect movie cost and revenue data into dictionaries

In [9]:
##  next load the movie cost and revenue data
tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz',compression='gzip')
#print(tn_movie_budgets.shape)
tn_movie_budgets.head()
#list(tn_movie_budgets.columns)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [10]:
##  define some conversion functions

##  string of dollars to float
def dollars_to_float(inlist):
    return list(map(lambda x: float(x.replace("$","").replace(",","")), inlist))

##  date string to integer year
def date_to_yrint(datestr):
    return list(map(lambda x: int(x.split()[-1]), datestr))
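
As a quick check of what these two helpers do, the sketch below repeats their definitions and applies them to strings copied from the preview of `tn_movie_budgets` above.

```python
def dollars_to_float(inlist):
    # strip the dollar sign and thousands separators, then convert
    return list(map(lambda x: float(x.replace("$", "").replace(",", "")), inlist))

def date_to_yrint(datestr):
    # the year is the last whitespace-separated token of "Mon D, YYYY"
    return list(map(lambda x: int(x.split()[-1]), datestr))

budgets = dollars_to_float(["$425,000,000", "$42,762,350"])
years = date_to_yrint(["Dec 18, 2009", "Jun 7, 2019"])
print(budgets, years)
```

Both helpers assume every entry is a well-formed string; a missing value would raise an `AttributeError`.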

In [11]:
##  convert strings to float
tn_movie_budgets['production_budget'] = dollars_to_float(tn_movie_budgets['production_budget'])
tn_movie_budgets['domestic_gross'] = dollars_to_float(tn_movie_budgets['domestic_gross'])
tn_movie_budgets['worldwide_gross'] = dollars_to_float(tn_movie_budgets['worldwide_gross'])

##  divide by one million
tn_movie_budgets['production_budget'] = tn_movie_budgets['production_budget'] / 1000000
tn_movie_budgets['domestic_gross'] = tn_movie_budgets['domestic_gross'] / 1000000
tn_movie_budgets['worldwide_gross'] = tn_movie_budgets['worldwide_gross'] / 1000000

##  calculate profit
tn_movie_budgets['profit'] = tn_movie_budgets['worldwide_gross'] - tn_movie_budgets['production_budget']

##  convert release year to integer
tn_movie_budgets['release_year'] = date_to_yrint(tn_movie_budgets['release_date'])

##  convert release date to date
tn_movie_budgets['release_date'] = tn_movie_budgets['release_date'].astype('datetime64[ns]') 
tn_movie_budgets['release_month'] = tn_movie_budgets['release_date'].dt.month

In [12]:
#tn_movie_budgets.head()

In [13]:
##  get GDP deflator, so that we can adjust revenue and cost for inflation
gdp_deflator_data = pd.read_csv('zippedData/us-bea_gdp-deflator.csv.gz',
                                compression='gzip',header=4,nrows=2)

gdp_deflator_data = pd.DataFrame(gdp_deflator_data.iloc[1].transpose()[2:])
gdp_deflator_data.rename(columns={0: "year", 1: "gdp_deflator"},inplace=True);

##  make dictionary
deflate_dict = {}
for yr in range(1929,2020):
    deflate_dict[str(yr)] = gdp_deflator_data.loc[str(yr)][0] / 100

##  for 2020, just use 2019 value
deflate_dict['2020'] = deflate_dict['2019']

##  for years up to 1928, just use 1929 value
for yr in range(1900,1929):
    deflate_dict[str(yr)] = deflate_dict['1929']

In [14]:
#deflate_dict['1975']
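
The adjustment itself is simple arithmetic: a nominal amount divided by the release year's deflator (the index divided by 100) gives the amount in base-year dollars.  The sketch below illustrates this, along with the rule of reusing the nearest endpoint for years outside the table.  The index values and the helper name `deflator_for` are hypothetical, not the BEA figures.

```python
# hypothetical index values (base year = 100); not the BEA figures
index_by_year = {1929: 10.0, 2009: 95.0, 2019: 112.0}

def deflator_for(year, table, first=1929, last=2019):
    """Return index/100 for a year, reusing the nearest endpoint outside the range."""
    clamped = min(max(year, first), last)
    return table[clamped] / 100

nominal_budget = 425.0  # millions of dollars
real_budget = nominal_budget / deflator_for(2009, index_by_year)
print(round(real_budget, 2))
```

A deflator below one (a year before the base year) raises the adjusted amount, and a deflator above one lowers it.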

In [15]:
##  adjust for inflation
budget = []
domestic = []
worldwide = []
profit = []

for i in range(tn_movie_budgets.shape[0]):
    ##  get year and its deflator
    year = tn_movie_budgets['release_year'].iloc[i]
    deflator = deflate_dict[str(year)]

    ##  append to the lists
    budget.append(tn_movie_budgets['production_budget'].iloc[i] / deflator)
    domestic.append(tn_movie_budgets['domestic_gross'].iloc[i] / deflator)
    worldwide.append(tn_movie_budgets['worldwide_gross'].iloc[i] / deflator)
    profit.append(tn_movie_budgets['profit'].iloc[i] / deflator)

##  replace them in the dataframe
tn_movie_budgets['production_budget'] = budget
tn_movie_budgets['domestic_gross'] = domestic
tn_movie_budgets['worldwide_gross'] = worldwide
tn_movie_budgets['profit'] = profit

##  take logs
tn_movie_budgets['ln_budget'] = np.log(tn_movie_budgets['production_budget'])
tn_movie_budgets['ln_domestic'] = np.log(tn_movie_budgets['domestic_gross'])
tn_movie_budgets['ln_worldwide'] = np.log(tn_movie_budgets['worldwide_gross'])

del budget, domestic, worldwide, profit
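
The row-by-row loop above can also be expressed as a single column operation.  The sketch below uses made-up amounts and hypothetical deflators.  It also shows one possible way to handle zero grosses, whose logarithm is `-inf`: masking them as `NaN` before taking logs.  This masking is an alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'release_year': [2009, 2019],
    'worldwide_gross': [100.0, 0.0],   # millions, nominal (made-up values)
})
deflate = {2009: 0.95, 2019: 1.12}     # hypothetical deflators (index / 100)

# look up each row's deflator and divide in one step
toy['real_gross'] = toy['worldwide_gross'] / toy['release_year'].map(deflate)

# mask non-positive values as NaN so the log is NaN rather than -inf
toy['ln_gross'] = np.log(toy['real_gross'].where(toy['real_gross'] > 0))
print(toy)
```

Most pandas summary statistics skip `NaN` by default, whereas `-inf` would propagate into means and regressions.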



In [16]:
##  print some information
#tn_movie_budgets.columns
#tn_movie_budgets.info()
tn_movie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,profit,release_year,release_month,ln_budget,ln_domestic,ln_worldwide
0,1,2009-12-18,Avatar,447.349585,800.500637,2922.345669,2474.996083,2009,12,6.10334,6.685237,7.980142
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,418.475713,245.687718,1065.720739,647.245026,2011,5,6.036619,5.504061,6.971407
2,3,2019-06-07,Dark Phoenix,311.540344,38.063421,133.305755,-178.23459,2019,6,5.741529,3.639254,4.892645
3,4,2015-05-01,Avengers: Age of Ultron,315.708051,438.329849,1339.814894,1024.106843,2015,5,5.754818,6.082972,7.200287
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,294.079448,575.339427,1221.516733,927.437285,2017,12,5.68385,6.35496,7.107849


In [17]:
# tn_movie_budgets.loc[(tn_movie_budgets['movie']=='King Kong')]

In [18]:
# tn_movie_budgets.loc[tn_movie_budgets.movie == 'Titanic']

In [19]:
##  create some histograms
#tn_movie_budgets.release_year.hist();
#tn_movie_budgets.profit.hist();

In [20]:
##  now create profit dictionary
profit_dict = {}

for i in range(tn_movie_budgets.shape[0]):
    ##  get the values
    movie_title   = tn_movie_budgets['movie'].loc[i]
    release_year  = tn_movie_budgets['release_year'].loc[i]
    release_month = tn_movie_budgets['release_month'].loc[i]
    release_date  = tn_movie_budgets['release_date'].loc[i]
    profit_dict[movie_title] = {
        'movie_title'   : movie_title,
        'release_year'  : release_year,
        'release_month' : release_month,
        'release_date'  : release_date
    }
    
for i in range(tn_movie_budgets.shape[0]):
    ##  get the values again
    movie_title  = tn_movie_budgets['movie'].loc[i]
    release_year = tn_movie_budgets['release_year'].loc[i]
    release_month = tn_movie_budgets['release_month'].loc[i]
    release_date = tn_movie_budgets['release_date'].loc[i]

    ##  when a title appears more than once, assume the most recent release is the one in IMDB
    if (release_year >= profit_dict[movie_title]['release_year']):
        ##  add them to the dictionary
        profit_dict[movie_title] = {
            'movie_title'   : movie_title,
            'release_year'  : release_year,
            'release_month' : release_month,
            'release_date'  : release_date,
            'production_budget' : tn_movie_budgets['production_budget'].loc[i],
            'domestic_gross' : tn_movie_budgets['domestic_gross'].loc[i],
            'worldwide_gross' : tn_movie_budgets['worldwide_gross'].loc[i],
            'profit' : tn_movie_budgets['profit'].loc[i],
            'ln_budget' : tn_movie_budgets['ln_budget'].loc[i],
            'ln_domestic' : tn_movie_budgets['ln_domestic'].loc[i],
            'ln_worldwide' : tn_movie_budgets['ln_worldwide'].loc[i]
        }

In [21]:
# profit_dict['Titanic']
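
The "keep the most recent release" rule can be sketched more compactly by sorting on year and keeping the last row per title.  The frame below is a toy with made-up profit numbers.  Unlike the dictionary built above, the result has no `movie_title` entry inside each record, because the title becomes the key only.

```python
import pandas as pd

toy = pd.DataFrame({
    'movie': ['King Kong', 'King Kong', 'Avatar'],
    'release_year': [1933, 2005, 2009],
    'profit': [1.0, 2.0, 3.0],   # made-up numbers
})

# a stable sort keeps the original row order among equal years
latest = (toy.sort_values('release_year', kind='mergesort')
             .drop_duplicates(subset='movie', keep='last'))

profit_dict = latest.set_index('movie').to_dict(orient='index')
print(profit_dict['King Kong'])
```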

###  collect the IMDB principals data into a dictionary

In [22]:
imdb_title_principals = pd.read_csv('zippedData/imdb.title.principals.csv.gz',compression='gzip')
# print(imdb_title_principals.shape)
# print('number of unique titles: ' + str(len(list(imdb_title_principals.tconst.unique()))))
# print('unique categories:')
# print(list(imdb_title_principals.category.unique()))
imdb_title_principals.head(10)
#list(imdb_title_principals.columns)

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"
5,tt0323808,2,nm2694680,actor,,"[""Steve Thomson""]"
6,tt0323808,3,nm0574615,actor,,"[""Sir Lachlan Morrison""]"
7,tt0323808,4,nm0502652,actress,,"[""Lady Delia Morrison""]"
8,tt0323808,5,nm0362736,director,,
9,tt0323808,6,nm0811056,producer,producer,


In [23]:
##  let's create a dictionary of principals
principals_dict = {}

##  loop through to create keys and build the dictionary
for i in range(imdb_title_principals.shape[0]):
    ##  get the values
    tconst = imdb_title_principals['tconst'].loc[i]
    nconst = imdb_title_principals['nconst'].loc[i]
    category = imdb_title_principals['category'].loc[i]

    ##  create the keys
    try:
        principals_dict[tconst]['tconst'] = tconst
    except:
        principals_dict[tconst] = {'tconst' : tconst}
    
    ##  append nconst to the appropriate role
    if (category in ['producer','director','actor','actress']):
        try:
            principals_dict[tconst][category].append(nconst)
        except:
            principals_dict[tconst][category] = [nconst]

In [24]:
# principals_dict['tt0323808']
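
The try/except pattern for creating keys on first use can also be handled by `collections.defaultdict`.  This is a minimal sketch using a few rows copied from the preview above; the names `rows`, `wanted`, and `principals` are introduced here for illustration.

```python
from collections import defaultdict

rows = [  # (tconst, nconst, category), taken from the preview above
    ('tt0111414', 'nm0246005', 'actor'),
    ('tt0111414', 'nm0398271', 'director'),
    ('tt0323808', 'nm0059247', 'editor'),
    ('tt0323808', 'nm2694680', 'actor'),
    ('tt0323808', 'nm0574615', 'actor'),
]
wanted = {'producer', 'director', 'actor', 'actress'}

principals = defaultdict(lambda: defaultdict(list))
for tconst, nconst, category in rows:
    entry = principals[tconst]          # creates the title entry on first sight
    if category in wanted:
        entry[category].append(nconst)

# convert to plain dicts so that later lookups of missing keys raise KeyError
principals = {t: dict(roles) for t, roles in principals.items()}
print(principals['tt0323808'])
```

The final conversion matters if later code relies on a `KeyError` for missing roles, since a `defaultdict` would silently create an empty list instead.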

###  collect the IMDB ratings data into a dictionary

In [25]:
##  now let's include the ratings
imdb_title_ratings = pd.read_csv('zippedData/imdb.title.ratings.csv.gz',compression='gzip')
# print(imdb_title_ratings.shape)
# print('number of unique titles: ' + str(len(list(imdb_title_ratings.tconst.unique()))))
imdb_title_ratings.head()
#list(imdb_title_ratings.columns)

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [26]:
##  let's create a dictionary of ratings
ratings_dict = {}

##  loop through to build the dictionary
for i in range(imdb_title_ratings.shape[0]):
    ##  get the values
    tconst = imdb_title_ratings['tconst'].loc[i]
    rating = imdb_title_ratings['averagerating'].loc[i]
    nuvote = imdb_title_ratings['numvotes'].loc[i]

    ##  create the keys
    ratings_dict[tconst] = {
        'tconst' : tconst,
        'rating' : rating,
        'nuvote' : nuvote
    }

In [27]:
# ratings_dict['tt1043726']
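
Looking up each value with `.loc[i]` inside a loop is slow on large frames.  One alternative, sketched here on two rows from the preview above, is to iterate once with `itertuples`; the resulting dictionary has the same shape as `ratings_dict`.

```python
import pandas as pd

toy = pd.DataFrame({
    'tconst': ['tt10356526', 'tt1043726'],
    'averagerating': [8.3, 4.2],
    'numvotes': [31, 50352],
})

# each row arrives as a namedtuple with one attribute per column
ratings_dict = {
    row.tconst: {'tconst': row.tconst,
                 'rating': row.averagerating,
                 'nuvote': row.numvotes}
    for row in toy.itertuples(index=False)
}
print(ratings_dict['tt1043726'])
```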

###  now merge those dictionaries into a dataframe of movies

Movies at the intersection of `profit_dict` and `title_dict` 
will be the observations in our dataframe.

In [28]:
##  now find titles in "profit_dict" that match titles in "title_dict"
profit_keys = profit_dict.keys()
title_keys  = title_dict.keys()

##  titles that are at the intersection
profit_titles = list(set(profit_keys) & set(title_keys))
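
Dictionary key views support set operations directly, so the intersection and the unmatched titles can be inspected without building sets by hand.  The sketch below uses tiny stand-in dictionaries, and the title `'Some Unmatched Title'` is made up.

```python
profit_dict = {'Avatar': {}, 'Dark Phoenix': {}, 'Some Unmatched Title': {}}
title_dict = {'Avatar': {}, 'Dark Phoenix': {}, 'Sunghursh': {}}

# titles present in both sources, sorted for a reproducible order
shared = sorted(profit_dict.keys() & title_dict.keys())

# titles with budget data but no exact IMDB title match
only_in_profit = sorted(profit_dict.keys() - title_dict.keys())

print(shared, only_in_profit)
```

The match is an exact string comparison, so titles that are spelled or punctuated differently in the two sources drop out of the merged dataframe.  Listing the unmatched titles is one way to see how many are lost.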

In [29]:
##  columns for merged dataframe
tconst = []
movie_title = []
genres = []
release_year = []
release_month = []
release_date = []
production_budget = []
domestic_gross = []
worldwide_gross = []
profit = []
ln_budget = []
ln_domestic = []
ln_worldwide = []

##  and the principals 
actresses = []
actors = []
directors = []
producers = []

##  and the ratings
ratings = []
nuvotes = []

##  for each movie, append to those lists
for movie in profit_titles:
    tconst.append(title_dict[movie]['tconst'])
    movie_title.append(profit_dict[movie]['movie_title'])
    ##  we'll pick up genres below
    release_year.append(profit_dict[movie]['release_year'])
    release_month.append(profit_dict[movie]['release_month'])
    release_date.append(profit_dict[movie]['release_date'])
    production_budget.append(profit_dict[movie]['production_budget'])
    domestic_gross.append(profit_dict[movie]['domestic_gross'])
    worldwide_gross.append(profit_dict[movie]['worldwide_gross'])
    profit.append(profit_dict[movie]['profit'])
    ln_budget.append(profit_dict[movie]['ln_budget'])
    ln_domestic.append(profit_dict[movie]['ln_domestic'])
    ln_worldwide.append(profit_dict[movie]['ln_worldwide'])
    
    ##  movie principals 
    mv_actresses = []
    mv_actors = []
    mv_directors = []
    mv_producers = []
    
    ##  movie genres
    mv_genres = []
    
    ##  movie ratings
    mv_ratings = []
    mv_nuvotes = []

    for ttlid in title_dict[movie]['tconst']:
        ##  a title may lack some roles or a rating, so skip the missing keys
        try:
            mv_actresses.append(principals_dict[ttlid]['actress'])
        except KeyError:
            pass

        try:
            mv_actors.append(principals_dict[ttlid]['actor'])
        except KeyError:
            pass

        try:
            mv_directors.append(principals_dict[ttlid]['director'])
        except KeyError:
            pass

        try:
            mv_producers.append(principals_dict[ttlid]['producer'])
        except KeyError:
            pass

        try:
            mv_ratings.append(ratings_dict[ttlid]['rating'])
        except KeyError:
            pass

        try:
            mv_nuvotes.append(ratings_dict[ttlid]['nuvote'])
        except KeyError:
            pass

        ##  join() raises a TypeError if any genre entry is not a string (e.g. NaN)
        try:
            genre_str = ','.join(title_dict[ttlid_dict[ttlid]]['genres'])
            mv_genres.append(genre_str)
        except (KeyError, TypeError):
            pass


    ##  append principals
    actresses.append(mv_actresses)
    actors.append(mv_actors)
    directors.append(mv_directors)
    producers.append(mv_producers)

    ##  make genres one long string, then split it and make list unique
    comma = ','
    mv_genres = comma.join(mv_genres)
    mv_genres = mv_genres.split(comma)
    genres.append(list(set(mv_genres)))
    
    ##  just return rating with most votes (max() raises ValueError if there are none)
    try:
        which_max = mv_nuvotes.index(max(mv_nuvotes))
        ratings.append(mv_ratings[which_max])
        nuvotes.append(mv_nuvotes[which_max])
    except ValueError:
        ratings.append(None)
        nuvotes.append(None)

##  now create a merged dataframe
movie_df = pd.DataFrame({'tconst' : tconst,
                         'movie_title' : movie_title, 
                         'genres' : genres,
                         'release_year' : release_year,
                         'release_month' : release_month,
                         'release_date' : release_date,
                         'production_budget' : production_budget, 
                         'domestic_gross' : domestic_gross,
                         'worldwide_gross' : worldwide_gross,
                         'profit' : profit,
                         'ln_budget' : ln_budget, 
                         'ln_domestic' : ln_domestic,
                         'ln_worldwide' : ln_worldwide,                        
                         'actresses' : actresses, 
                         'actors' : actors, 
                         'directors' : directors, 
                         'producers' : producers, 
                         'ratings' : ratings, 
                         'nuvotes' : nuvotes
                        })

del tconst, movie_title, genres, release_year, release_month, release_date
del production_budget, domestic_gross, worldwide_gross, profit
del ln_budget, ln_domestic, ln_worldwide
del actresses, actors, directors, producers, ratings, nuvotes

In [30]:
#movie_df.info()
# movie_df.shape

In [31]:
# movie_df.head()
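
When several IMDB entries share a title, the merge above keeps the rating that has the most votes.  That rule can be isolated in a small helper, sketched below.  The function name `rating_with_most_votes` is hypothetical and the sample ratings and votes are made up.

```python
def rating_with_most_votes(ratings, votes):
    """Return (rating, votes) for the entry with the most votes, or (None, None)."""
    if not votes:
        return None, None
    # index of the largest vote count (the first one wins on ties)
    best = max(range(len(votes)), key=votes.__getitem__)
    return ratings[best], votes[best]

print(rating_with_most_votes([5.3, 7.1], [249, 12]))
print(rating_with_most_votes([], []))
```

The helper assumes the two lists are aligned, which holds above because both are filled from the same `ratings_dict` entry.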

###  collect personal names from IMDB and create dictionary

In [32]:
##  and let's get their real names!
imdb_name_basics = pd.read_csv('zippedData/imdb.name.basics.csv.gz',compression='gzip')
# print(imdb_name_basics.shape)
# print('number of unique nconst: ' + str(len(list(imdb_name_basics.nconst.unique()))))
imdb_name_basics.head()
#list(imdb_name_basics.columns)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


In [33]:
##  let's create a dictionary of names
names_dict = {}

##  loop through to build the dictionary
for i in range(imdb_name_basics.shape[0]):
    ##  get the values
    nconst = imdb_name_basics['nconst'].loc[i]
    prname = imdb_name_basics['primary_name'].loc[i]

    ##  create the keys
    names_dict[nconst] = prname

###  add the movie studios to the dataframe

In [34]:
bom_movie_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz',compression='gzip')
bom_movie_gross.head()
#list(bom_movie_gross.columns)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [35]:
##  to put profit and studios together, let's merge movie_df with the studio column of bom_movie_gross
merged = pd.merge(movie_df, bom_movie_gross[['title', 'studio']], left_on = 'movie_title', right_on = 'title', how='left')
merged.drop(columns=['title'], inplace = True)
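
One thing to watch with a left merge on titles: if the right-hand table lists a title more than once, the merged frame gains duplicate rows.  Whether `bom_movie_gross` contains such duplicates is not checked here, so the sketch below uses a toy table with a deliberate duplicate to show the effect and one possible guard.

```python
import pandas as pd

movies = pd.DataFrame({'movie_title': ['Inception', 'Toy Story 3']})
studios = pd.DataFrame({
    'title': ['Inception', 'Inception', 'Toy Story 3'],
    'studio': ['WB', 'WB', 'BV'],          # duplicate title on purpose
})

# the duplicate on the right produces an extra row on the left
naive = movies.merge(studios, left_on='movie_title', right_on='title', how='left')

# drop duplicate titles first; validate raises if the right keys are not unique
deduped = studios.drop_duplicates(subset='title')
safe = movies.merge(deduped, left_on='movie_title', right_on='title',
                    how='left', validate='many_to_one').drop(columns='title')

print(len(naive), len(safe))
```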

In [36]:
merged.head()

Unnamed: 0,tconst,movie_title,genres,release_year,release_month,release_date,production_budget,domestic_gross,worldwide_gross,profit,ln_budget,ln_domestic,ln_worldwide,actresses,actors,directors,producers,ratings,nuvotes,studio
0,[tt1709143],The Last Days on Mars,"[Sci-Fi, Adventure, Horror]",2013,12,2013-12-06,10.417179,0.023669,0.256856,-10.160322,2.343456,-3.743605,-1.359239,"[[nm0304801, nm0931404]]","[[nm0000630, nm0000480]]",[[nm1099711]],"[[nm1095938, nm0474138]]",5.5,33137.0,
1,[tt6175394],Posledniy bogatyr,"[Comedy, Action, Adventure]",2017,12,2017-12-31,7.885411,0.0,28.480427,20.595016,2.065014,-inf,3.349217,"[[nm4754338, nm2611557]]","[[nm5683792, nm0492249]]",[[nm2979103]],[],6.6,1853.0,
2,"[tt2100675, tt2118761, tt2407082, tt3671042, t...",Turbulence,"[Documentary, Thriller, Drama, Comedy, Romance...",1997,1,1997-01-10,73.880046,15.49167,15.49167,-58.388375,4.302443,2.740302,2.740302,"[[nm3377137], [nm4840301], [nm0000539, nm06955...","[[nm4753587, nm4754728, nm3860142, nm4754752],...","[[nm1282420], [nm4754642, nm4753670], [nm52876...","[[nm1538517], [nm5285813], [nm4815964]]",5.3,249.0,
3,[tt1742336],Your Sister's Sister,"[Comedy, Drama]",2012,6,2012-06-15,0.12,1.597486,3.090593,2.970593,-2.120264,0.468431,1.128363,"[[nm1289434, nm1679669]]","[[nm0243233, nm1898126]]",[[nm1119645]],[[nm2693744]],6.7,24780.0,IFC
4,[tt4882548],Burn Your Maps,[Adventure],2019,6,2019-06-21,7.120922,0.0,0.0,-7.120922,1.963037,-inf,-inf,"[[nm0267812, nm0000515]]","[[nm5016878, nm0190744]]",[[nm1471001]],"[[nm0813309, nm2078509, nm0004799, nm0456614]]",7.9,240.0,


In [37]:
##  Replace the name identifiers in each principals column with primary names.
##  This function looks up each identifier in names_dict (built from imdb_name_basics).

def get_names(name_ids):
    names = []
    if isinstance(name_ids, list):
        for items in name_ids:
            for nconst in items:
                ##  skip identifiers that have no entry in names_dict
                try:
                    names.append(names_dict[nconst])
                except KeyError:
                    pass
    return names

merged['directors'] = merged['directors'].apply(get_names)
merged['producers'] = merged['producers'].apply(get_names)
merged['actors']    = merged['actors'].apply(get_names)
merged['actresses'] = merged['actresses'].apply(get_names)
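
To illustrate what `get_names` does to a single cell, the sketch below applies an equivalent function to a nested list of identifiers.  The identifiers and names in this `names_dict` are hypothetical.

```python
names_dict = {'nm0000001': 'Person A', 'nm0000002': 'Person B'}  # hypothetical

def get_names(name_ids):
    names = []
    if isinstance(name_ids, list):
        for items in name_ids:          # one inner list per IMDB entry
            for nconst in items:
                if nconst in names_dict:
                    names.append(names_dict[nconst])
    return names

flat = get_names([['nm0000001'], ['nm0000002', 'nm9999999']])
print(flat)
```

The nested lists are flattened into one list of names, unknown identifiers are skipped, and anything that is not a list (such as a missing value) yields an empty list.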

In [38]:
##  correcting the data for Avatar
merged.loc[(merged.movie_title == 'Avatar'),'directors'] = ['James Cameron']
merged.loc[(merged.movie_title == 'Avatar'),'actresses'] = ['Zoe Saldana']
merged.loc[(merged.movie_title == 'Avatar'),'actors'] = ['Sam Worthington']
merged.loc[(merged.movie_title == 'Avatar'),'producers'] = ['Jon Landau']
merged.loc[(merged.movie_title == 'Avatar'),'genres'] = ['Action']

##  correcting the data for Titanic
merged.loc[(merged.movie_title == 'Titanic'),'directors'] = ['James Cameron']
merged.loc[(merged.movie_title == 'Titanic'),'actresses'] = ['Kate Winslet']
merged.loc[(merged.movie_title == 'Titanic'),'actors'] = ['Leonardo DiCaprio']
merged.loc[(merged.movie_title == 'Titanic'),'producers'] = ['James Cameron']

###  save the dataframe for later use

In [39]:
##  use a context manager so that the file is closed after writing
with open("movie-df_2020-07-09b.p","wb") as outfile:
    pickle.dump(merged,outfile)
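
A pickle keeps list-valued columns such as `genres` as real Python lists, which a CSV file would turn into strings.  The sketch below shows a round trip on a one-row toy frame.  The temporary directory and the filename `movie-df_example.p` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({'movie_title': ['Avatar'], 'genres': [['Action']]})

# write to a throwaway location (hypothetical filename)
path = os.path.join(tempfile.mkdtemp(), 'movie-df_example.p')
with open(path, 'wb') as fh:
    pickle.dump(df, fh)

# read it back; the list in the genres column survives intact
restored = pd.read_pickle(path)
print(type(restored.loc[0, 'genres']))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.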

##  Continue on!

This concludes our assembly of the movie dataset.  In our next notebook, we conduct
a statistical analysis of the data.