# Data Imputation and Feature Engineering

This notebook imputes missing data for some key features and performs feature engineering to create new ones.

## Index

- [Imports](#Imports)
- [Utils](#Utils)
- [Data Imputation](#Data-Imputation)
    - [Genres](#Genres)
    - [Start Year](#Start-Year)
    - [Runtime](#Runtime)
- [Feature Engineering](#Feature-Engineering)
    - [Professional Quality](#Professional-Quality)
    - [Number of regions](#Number-of-regions)
- [Final Imputation and Clean-Up](#Final-Imputation-and-Clean-Up)
- [One hot encoding the genres](#One-hot-encoding-the-genres)

## Imports

In [2]:
from tmdbv3api import Movie

In [3]:
from tmdbv3api import TMDb

In [None]:
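import os
# Read the TMDB API key from the environment instead of hard-coding it
# (the variable name TMDB_API_KEY is an assumed convention)
API_KEY = os.environ['TMDB_API_KEY']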

In [4]:
import requests

In [5]:
tmdb = TMDb()
tmdb.api_key = API_KEY

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [7]:
from collections import Counter

In [8]:
import os 
import sys

In [9]:
from tqdm import tqdm

In [10]:
import json

In [11]:
import tqdm.notebook as tq

In [12]:
from pandas import Panel

In [13]:
tqdm.pandas()

In [14]:
from pandarallel import pandarallel

In [15]:
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Utils

In [101]:
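def split_names(x):
    
    """
    Splits a comma-separated string of names into a list for further processing.
    Returns an empty list for missing or empty values.
    """
    
    # Check for missing values before calling .split, which fails on NaN
    if pd.isna(x) or x == '':
        return []
    return x.split(',')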

## Data Imputation

The first step is to impute missing data in our datasets.
The features to be imputed are:
- Genres
- Start Year
- Runtime


### Genres

**Strategy**

Genres provide a strong indicator for the IMDb rating of a film. We can see that certain genres like horror get lower overall scores while genres such as drama or war films get higher scores on average.

To impute the genre, we take the following steps:
1. Find all titles that do not have a genre associated with them.
2. Use the Crew and Principals tables to get the set of people who have worked on those titles.
3. Get a list of the other titles that they are known for.
4. Get the most common genres of those titles and impute them for our title.

The working hypothesis for this process is that cast and crew generally tend to work within similar genres and this provides a good proxy for what genre category the film can be a part of.
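
The steps above can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's implementation: the ids, the three dictionaries and the helper `most_common_genres` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: title -> genres, person -> known titles, title -> people
title_genres = {
    "t1": [],                      # title with a missing genre
    "t2": ["Drama", "War"],
    "t3": ["Drama", "War"],
    "t4": ["Comedy", "Drama"],
}
known_for = {"p1": ["t2", "t3"], "p2": ["t4"]}
worked_on = {"t1": ["p1", "p2"]}


def most_common_genres(title, n=2):
    """Return the n most frequent genres among titles the cast and crew are known for."""
    related = set()
    for person in worked_on.get(title, []):
        related.update(known_for.get(person, []))
    counts = Counter()
    for other in related:
        counts.update(title_genres.get(other, []))
    return [genre for genre, _ in counts.most_common(n)]


imputed = most_common_genres("t1")
print(imputed)
```

Using `dict.get` with an empty default means that people or titles missing from a mapping simply contribute nothing, so no exception handling is needed.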

In [11]:
title_rating = pd.read_csv('processed/title_rating.csv')

In [12]:
title_rating.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000502,movie,Bohemios,Bohemios,0,1970-01-01 00:00:00.000001905,,100.0,,4.5,14
1,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1970-01-01 00:00:00.000001906,,70.0,"Action,Adventure,Biography",6.0,754
2,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1970-01-01 00:00:00.000001907,,90.0,Drama,4.6,17
3,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,0,1970-01-01 00:00:00.000001907,,,Drama,4.5,23
4,tt0000630,movie,Hamlet,Amleto,0,1970-01-01 00:00:00.000001908,,,Drama,3.8,24


In [13]:
title_rating['genres'] = title_rating['genres'].fillna('')

We have 11,666 missing genre values that need to be imputed.

In [15]:
crew = pd.read_csv('processed/title_rating_crew.csv')
principal = pd.read_csv('processed/title_rating_principal.csv')

In [16]:
crew['directors'] = crew['directors'].replace({'\\N': ''})
crew['writers'] = crew['writers'].replace({'\\N': ''})
principal['nconst'] = principal['nconst'].replace({'\\N': ''})

In [17]:
names = pd.read_csv('processed/name_basics.csv')

In [20]:
#Creating some useful data structures for further processing
#Taking an explicit copy avoids pandas' SettingWithCopyWarning when the column is modified below
title_genre = title_rating[['tconst', 'genres']].copy()

In [21]:
title_genre['genres'] = title_genre['genres'].apply(split_names)


In [22]:
## Creating a mapping between title and genres for the imputation function described later
title_genre_dict = title_genre.set_index('tconst').to_dict()['genres']

In [23]:
names['knownForTitles'] = names['knownForTitles'].fillna('')

In [24]:
names = names[['nconst', 'knownForTitles']]

In [25]:
names['knownForTitles'] = names['knownForTitles'].apply(lambda x: split_names(x))


In [28]:
names

Unnamed: 0,nconst,knownForTitles
0,nm0000001,"[tt0050419, tt0031983, tt0072308, tt0053137]"
1,nm0000002,"[tt0037382, tt0038355, tt0071877, tt0117057]"
2,nm0000003,"[tt0057345, tt0056404, tt0054452, tt0049189]"
3,nm0000004,"[tt0080455, tt0078723, tt0072562, tt0077975]"
4,nm0000005,"[tt0050986, tt0083922, tt0060827, tt0069467]"
...,...,...
964582,nm9993616,[tt4844148]
964583,nm9993650,[tt8739208]
964584,nm9993690,[tt7888884]
964585,nm9993691,[tt7888884]


In [27]:
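#Dropping rows without an id (calling dropna on the column alone would leave the dataframe unchanged)
names = names.dropna(subset=['nconst'])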

In [29]:
#Creating a mapping between names and known titles for the imputation function described later
names_title_dict = names.set_index('nconst').to_dict()['knownForTitles']

In [66]:
def impute_genre(title, names_title_dict, 
                 title_genre_dict, 
                 principal,
                 crew):
    """
    Function to impute the genres of a title from the known works of its cast and crew.
    
    params
    
    title: id of the title to be imputed
    names_title_dict: dict which maps the id of a professional to their known works
    title_genre_dict: maps titles to their respective genres
    principal: the Principals dataframe
    crew: the Crew dataframe
    
    returns
    A list of up to two genres (joined into a comma-separated string later)
    """
    
    people = set()
    
    #Getting all the people that have worked on that title
    people.update(principal[principal['tconst'] == title]['nconst'].values)
    crew_rows = crew[crew['tconst'] == title]
    #Guard against titles that have no row in the Crew table
    if not crew_rows.empty:
        people.update(split_names(crew_rows['directors'].values[0]))
        people.update(split_names(crew_rows['writers'].values[0]))
    
    #Getting all the titles that these people are known for
    related = set()
    for p in people:
        related.update(names_title_dict.get(p, []))
    
    #Collecting the genres of the related titles
    genres = []
    for r in related:
        genres.extend(title_genre_dict.get(r, []))
    
    #Returning the 2 most common genres among the related titles
    return [genre for genre, _ in Counter(genres).most_common(2)]
    

In [72]:
#Imputing the values into the title_genre map
for key in tqdm(title_genre_dict.keys(), position=0, leave=True):
    if title_genre_dict[key] == []:
        title_genre_dict[key] = impute_genre(key, names_title_dict, 
                                             title_genre_dict, 
                                             principal,
                                             crew)
        

100%|██████████| 323834/323834 [22:46<00:00, 382.43it/s] 

[A[A                                                  
100%|██████████| 323834/323834 [22:46<00:00, 237.03it/s]






In [81]:
#Saving for later use
with open('processed/title_genre_dict.json', 'w') as f:
    json.dump(title_genre_dict, f)

In [84]:
#Converting the lists into strings compatible with the original dataset
for k in tqdm(list(title_genre_dict.keys()),  position=0, leave=True):
    title_genre_dict[k] = ','.join(title_genre_dict[k])

100%|██████████| 323834/323834 [00:00<00:00, 1790366.87it/s]






In [86]:
t_g_df = pd.DataFrame.from_dict(title_genre_dict, orient='index')

In [88]:
title_genre_new = pd.merge(title_genre, t_g_df, left_on='tconst', right_index=True)

In [89]:
title_genre_new

Unnamed: 0,tconst,genres,0
0,tt0000502,[],"Comedy,Musical"
1,tt0000574,"[Action, Adventure, Biography]","Action,Adventure,Biography"
2,tt0000591,[Drama],Drama
3,tt0000615,[Drama],Drama
4,tt0000630,[Drama],Drama
...,...,...,...
323829,tt9916362,"[Drama, History]","Drama,History"
323830,tt9916428,"[Adventure, History, War]","Adventure,History,War"
323831,tt9916460,[Comedy],Comedy
323832,tt9916538,[Drama],Drama


In [99]:
#Merging our imputed data with our original dataset
title_rating_new = pd.merge(title_rating, title_genre_new, left_on='tconst', right_on = 'tconst')

In [106]:
title_rating_new.drop(columns=['genres_y', 'genres_x'], inplace=True)

In [109]:
title_rating_new[title_rating_new[0] == '']

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,averageRating,numVotes,0
13,tt0001010,movie,Protección de un convoy de víveres en el puent...,Protección de un convoy de víveres en el puent...,0,1970-01-01 00:00:00.000001909,,,4.6,14,
15,tt0001038,movie,Sherlock Holmes VI,Sherlock Holmes VI,0,1970-01-01 00:00:00.000001910,,,3.8,21,
19,tt0001113,movie,Amor gitano,Amor gitano,0,1970-01-01 00:00:00.000001910,,,5.1,16,
277,tt0005040,movie,Butter,Butter,0,1970-01-01 00:00:00.000001916,,,6.7,11,
366,tt0005869,movie,Pasionaria,Pasionaria,0,1970-01-01 00:00:00.000001915,,,4.4,11,
...,...,...,...,...,...,...,...,...,...,...,...
318737,tt9013770,movie,The Witches of Gambaga,The Witches of Gambaga,0,2011-01-01,,55.0,7.0,9,
318887,tt9031770,movie,Ogar: Will of Steel,Ogar: Will of Steel,0,2017-01-01,,82.0,5.2,5,
319534,tt9114062,tvMovie,Family Classics: Scrooge (1951) II,Family Classics: Scrooge (1951) II,0,2018-01-01,,,7.4,11,
322768,tt9723258,movie,Little Wound's Warriors,Little Wound's Warriors,0,2017-01-01,,57.0,6.6,12,


There are 540 titles that have no crew associated with them, so no genres could be imputed for them. For these titles, we impute the most common genres as an intermediate guess.

In [113]:
title_rating_new.rename(columns={0:'genres'}, inplace = True)

In [115]:
#saving intermediate results
title_rating_new.to_csv('processed/title_genre_new.csv', index = False)

In [16]:
title_rating_new = pd.read_csv('processed/title_genre_new.csv')

### Start Year

**Strategy**

The start year corresponds to the release date of a movie.
We can use the TMDB API to impute the missing values.
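
As a rough sketch of this idea, the snippet below fills only the missing rows and falls back to a default date when a lookup fails. It runs offline: `fake_api` is a hypothetical stand-in for the TMDB search, and `fill_missing_dates`, `safe_lookup` and the toy titles are illustrative names, not part of the notebook.

```python
import pandas as pd

FALLBACK_DATE = pd.Timestamp("1960-01-01")  # same default year the notebook falls back to


def fill_missing_dates(df, lookup):
    """Fill missing startYear values by calling lookup(title) only for the missing rows."""
    out = df.copy()
    out["startYear"] = pd.to_datetime(out["startYear"])
    missing = out["startYear"].isna()

    def safe_lookup(title):
        # Catch only the failures we expect from a lookup, not every exception
        try:
            return pd.Timestamp(lookup(title))
        except (KeyError, IndexError, ValueError):
            return FALLBACK_DATE

    out.loc[missing, "startYear"] = out.loc[missing, "primaryTitle"].map(safe_lookup)
    return out


# Hypothetical stand-in for the TMDB search, so the sketch runs without network access
fake_api = {"Film B": "1948-05-04"}
toy = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C"],
    "startYear": ["1905-01-01", None, None],
})
filled = fill_missing_dates(toy, fake_api.__getitem__)
print(filled)
```

Restricting the call to the masked rows avoids evaluating a Python function for every one of the rows that already have a date.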



In [17]:
def get_release_date(name):
    
    """
    Makes an API call to the TMDB API service to return the release date
    of the first search result for the given title
    """
    try:
        #Search for a movie by name
        movie = Movie()
        #Get the release date of the first result
        release_date = pd.to_datetime(movie.search(name)[0]['release_date'])
        print(release_date)
        return release_date
    except Exception:
        #In case the API is not able to find the required movie, we return a value of 1960
        #Most movies that do not have a release date are usually older
        return pd.to_datetime('1960')
    

In [18]:
title_rating_new['startYear'] = title_rating_new.progress_apply(lambda x: get_release_date(x['primaryTitle']) if pd.isna(x['startYear']) else x['startYear'], axis = 1)

1976-05-31 00:00:00
1987-06-01 00:00:00
2012-10-28 00:00:00
2004-05-13 00:00:00
1983-01-01 00:00:00
2014-08-08 00:00:00
2005-11-01 00:00:00
1998-06-05 00:00:00
2020-11-20 00:00:00
2003-03-01 00:00:00
2020-01-17 00:00:00
2019-07-27 00:00:00
2020-02-13 00:00:00
2020-11-06 00:00:00
2020-12-03 00:00:00
2021-08-15 00:00:00
2001-05-12 00:00:00
2012-11-26 00:00:00
1949-12-29 00:00:00
2020-10-12 00:00:00
2021-01-15 00:00:00
2017-07-17 00:00:00
2016-04-14 00:00:00
2013-05-21 00:00:00
100%|██████████| 323834/323834 [00:17<00:00, 18817.21it/s]

In [19]:
title_rating_new.isnull().sum()

tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear              0
endYear           323834
runtimeMinutes     35737
averageRating          0
numVotes               0
genres               540
dtype: int64

In [20]:
title_rating_new.drop(columns=['endYear'], inplace = True)

### Runtime
**Strategy**

We can use the TMDB API to get the runtimes as well.

We search for the movie by title and make a second call to another endpoint, which returns the runtime. If we are unable to find the movie, we return the mean runtime of 93 mins.
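
A small offline sketch of this lookup-with-fallback pattern is shown below. It looks up each distinct title once and also treats an empty answer as unknown. The names `build_runtime_map`, `fetch_runtime` and `fake_runtimes` are hypothetical; `fake_runtimes` stands in for the two TMDB calls.

```python
MEAN_RUNTIME = 93  # rounded mean runtime quoted above


def build_runtime_map(titles, fetch_runtime, default=MEAN_RUNTIME):
    """Map each distinct title to a runtime, using `default` when the lookup fails or is empty."""
    runtimes = {}
    for title in set(titles):
        try:
            value = fetch_runtime(title)
        except LookupError:
            value = None
        # A service can also answer with a missing or zero runtime; treat those as unknown
        runtimes[title] = value if value else default
    return runtimes


# Hypothetical stand-in for the two TMDB calls (search, then movie details)
fake_runtimes = {"Film A": 101, "Film B": None}
result = build_runtime_map(["Film A", "Film B", "Film C", "Film A"], fake_runtimes.__getitem__)
print(result)
```

Iterating over `set(titles)` means a title that appears several times triggers only one lookup.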

In [24]:
no_runtime = title_rating_new[title_rating_new['runtimeMinutes'].isna()]['primaryTitle'].values

In [22]:
#Ignoring all nulls, this returns the mean runtime
title_rating_new['runtimeMinutes'].mean()

92.72670663005863

In [26]:
def get_runtime(name):
    """
    Makes an API call to the TMDB API service to return the runtime
    params:
        name: The primaryTitle of the movie
    returns:
        the movie runtime
    """
    try:
        #Search for the movie by title and take the id of the first result
        movie = Movie()
        res = movie.search(name)[0]
        movie_id = res['id']
        
        #A second call to the movie details endpoint returns the runtime
        url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={API_KEY}"
        
        res = requests.get(url).json()
        return res['runtime']
    except Exception:
        #Returning the mean runtime (ignoring nulls) if the lookup fails
        return 93
        

In [27]:
#Adding to a dictionary which will be joined to the dataframe later
runtime_dict = {}
for item in tqdm(no_runtime, position=0, leave=True):
    runtime_dict[item] = get_runtime(item)

100%|██████████| 35737/35737 [2:22:04<00:00,  4.19it/s]   


In [28]:
with open('runtime_impute_dict.json', 'w') as f:
    json.dump(runtime_dict, f)


In [35]:
#Add imputed values back to our dataset
title_rating_new['runtimeMinutes'] = title_rating_new.progress_apply(lambda x: runtime_dict[x['primaryTitle']] if pd.isna(x['runtimeMinutes']) else x['runtimeMinutes'], axis = 1)





In [36]:
title_rating_new.isnull().sum()

tconst               0
titleType            0
primaryTitle         0
originalTitle        0
isAdult              0
startYear            0
runtimeMinutes    1730
averageRating        0
numVotes             0
genres             540
dtype: int64

We fill the remaining missing values with the mean runtime of 93 mins.

In [37]:
title_rating_new['runtimeMinutes'] = title_rating_new['runtimeMinutes'].fillna(93)

In [38]:
title_rating_new.to_csv('processed/title_rating_new.csv', index = False)

## Feature Engineering

### Professional Quality

A strong intuitive predictor of the quality of a movie comes from the cast and the crew of that movie. A good set of features will encode some notion of the people working on the film into our dataset.

We create the following features to achieve that result. Each one is computed per person and then averaged over the cast (or crew) members of a title.
1. cast_mean: The mean rating of all the titles that a cast member is known for.
2. cast_std: The standard deviation of the ratings of all the titles that a cast member is known for.
3. cast_max: The max rating of all the titles that a cast member is known for. We choose max here to account for a psychological effect on the rater: if a member of the cast or the crew has been in a highly rated film, there may be a higher chance that a user will be biased towards a higher rating.
4. cast_exp: The number of titles that a cast member has worked on. Higher experience may point to better performances and better ratings.
5. crew_mean: The mean rating of all the titles that a crew member is known for.
6. crew_std: The standard deviation of the ratings of all the titles that a crew member is known for.
7. crew_max: The max rating of all the titles that a crew member is known for, chosen for the same reason as cast_max. This effect may be less pronounced for the crew.
8. crew_exp: The number of titles that a crew member has worked on. Higher experience may point to better work and better ratings.

A distinction was made between cast and crew.

Cast includes actors, actresses and people who have appeared as themselves in titles.

The crew includes all other roles, such as directors, writers and composers.
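
The feature definitions above can be illustrated with a vectorised sketch on toy data. Note the assumptions: the table and ids are hypothetical, and each person is assigned to cast or crew by their role on that particular title, which is a simplification of the role handling used in this notebook.

```python
import pandas as pd

CAST_ROLES = {"actor", "actress", "self"}

# Hypothetical toy principals table: one row per (title, person)
principals = pd.DataFrame({
    "tconst": ["t1", "t1", "t2", "t2", "t3"],
    "nconst": ["p1", "p2", "p1", "p3", "p2"],
    "category": ["actor", "director", "actor", "writer", "director"],
    "averageRating": [6.0, 6.0, 8.0, 8.0, 4.0],
})

# Per-person statistics over every title they appear in
person_stats = principals.groupby("nconst")["averageRating"].agg(["mean", "std", "max"])
person_stats["exp"] = principals.groupby("nconst")["tconst"].count()

# Attach each person's statistics to the titles they worked on
rows = principals.join(person_stats, on="nconst")
rows["group"] = rows["category"].apply(lambda c: "cast" if c in CAST_ROLES else "crew")

# Average the per-person statistics within each title and group
features = (rows.groupby(["tconst", "group"])[["mean", "std", "max", "exp"]]
                .mean()
                .unstack("group"))
features.columns = [f"{group}_{metric}" for metric, group in features.columns]
print(features)
```

Titles with no cast (or no crew) end up with NaN in the corresponding columns, which is where a global-mean fallback would be applied.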

In [40]:
title_rating_principal = pd.read_csv('processed/title_rating_principal.csv')

In [43]:
#Getting the mean, std and max rating for each professional
prof_qual_df = title_rating_principal.groupby('nconst').agg({"averageRating": ['mean', 'std', 'max']})

In [44]:
prof_qual_df.to_csv('processed/prof_qual_df.csv')

In [72]:
prof_qual_df

Unnamed: 0_level_0,averageRating,averageRating,averageRating
Unnamed: 0_level_1,mean,std,max
nconst,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
nm0000001,6.869697,0.834445,9.0
nm0000002,6.910000,1.032946,9.2
nm0000003,6.008475,1.271340,8.9
nm0000004,6.778571,0.772786,7.9
nm0000005,6.904464,0.761860,8.4
...,...,...,...
nm9993616,6.500000,,6.5
nm9993650,5.300000,,5.3
nm9993690,6.300000,,6.3
nm9993691,6.300000,,6.3


We compute the global mean of each statistic, to be used as a fallback for titles whose cast or crew is unknown.

In [100]:
prof_qual_df['averageRating']['mean'].mean()

6.259640942673017

In [101]:
prof_qual_df['averageRating']['std'].mean()


0.9623025831076207

In [102]:
prof_qual_df['averageRating']['max'].mean()

6.634021493829783

In [46]:
title_rating_principal.isnull().sum()

tconst                  0
titleType               0
primaryTitle            0
originalTitle           0
isAdult                 0
startYear             273
endYear           2885447
runtimeMinutes     287924
genres              90058
averageRating           0
numVotes                0
ordering                0
nconst                  0
category                0
job                     0
characters              0
dtype: int64

In [58]:
title_rating_principal['category'].unique()

array(['actor', 'director', 'writer', 'cinematographer', 'actress',
       'producer', 'composer', 'production_designer', 'self', 'editor',
       'archive_footage', 'archive_sound'], dtype=object)

In [59]:
role_cat_dict = {'cast': ['actor', 'actress', 'self'],
                 'crew': ['director', 'writer', 'cinematographer',
                          'producer', 'composer', 'production_designer',
                          'editor','archive_footage', 'archive_sound']}

In [54]:
roles_df = title_rating_principal.groupby('nconst').agg({"category": lambda x: list(x)})

In [56]:
roles_df['category'] = roles_df['category'].apply(lambda x: set(x)) 

In [57]:
roles_df

Unnamed: 0_level_0,category
nconst,Unnamed: 1_level_1
nm0000001,"{actor, archive_footage, self}"
nm0000002,"{archive_footage, actress, self}"
nm0000003,"{archive_footage, actress, self}"
nm0000004,"{actor, archive_footage}"
nm0000005,"{writer, producer, actor, director, archive_fo..."
...,...
nm9993616,{actor}
nm9993650,{actor}
nm9993690,{actor}
nm9993691,{actress}


In [60]:
roles_df.to_csv('processed/roles_df.csv')

In [116]:
#Create a data frame which maps each title to the cast and crew who worked on that title
cc_df = title_rating_principal.groupby('tconst').agg({'nconst':lambda x: list(x)})

In [117]:
cc_df

Unnamed: 0_level_0,nconst
tconst,Unnamed: 1_level_1
tt0000502,"[nm0215752, nm0252720, nm0063413, nm0657268, n..."
tt0000574,"[nm0675239, nm0846887, nm0846894, nm1431224, n..."
tt0000591,"[nm0906197, nm0332182, nm1323543, nm1759558, n..."
tt0000615,"[nm3071427, nm0581353, nm0888988, nm0240418, n..."
tt0000630,"[nm0624446, nm0143333, nm0000636, nm0209738]"
...,...
tt9916362,"[nm5813626, nm3766704, nm0107165, nm0266723, n..."
tt9916428,"[nm3611859, nm9445072, nm8594703, nm0422639, n..."
tt9916460,"[nm8796794, nm10538444, nm8691452, nm10538443,..."
tt9916538,"[nm4700236, nm8678236, nm1417182, nm10041459, ..."


In [118]:
def get_prof_metrics(x):
    """
    Generates the mean, std and max rating features for both cast and crew
    
    params:
        x: dataframe row that maps a title to the cast and crew who have worked on that title 
    """
    #Global means, used when no cast or crew information is available
    defaults = {'mean': 6.259640942673017,
                'std': 0.9623025831076207,
                'max': 6.634021493829783}
    stats = {'cast': {'mean': [], 'std': [], 'max': []},
             'crew': {'mean': [], 'std': [], 'max': []}}
    
    #We iterate through every person who has worked on the film and the roles they have historically held
    for person in x['nconst']:
        #Use the prof quality table to get the statistics for this person
        person_stats = prof_qual_df.loc[person]['averageRating']
        for role in roles_df.loc[person]['category']:
            #Each role adds the person's statistics to the matching group (cast or crew)
            for group, group_roles in role_cat_dict.items():
                if role in group_roles:
                    for metric in defaults:
                        stats[group][metric].append(person_stats[metric])
    
    for group in stats:
        for metric, default in defaults.items():
            values = stats[group][metric]
            #Average over the contributors, or impute with the global mean if there are none
            x[f'{group}_{metric}'] = sum(values)/len(values) if values else default
    
    return x

In [120]:
cc_df = cc_df.apply(lambda x: get_prof_metrics(x), axis = 1)











In [129]:
# Creating a dataframe which maps a professional to the number of titles they have worked on 
prof_exp_df = title_rating_principal.groupby('nconst').agg({"tconst":'count'})

In [133]:
prof_exp_df.mean()

tconst    3.002193
dtype: float64

In [147]:
def get_prof_experience(x):
    """
    Generates the experience metric for both cast and crew
    
    params:
        x: dataframe row that maps a title to the cast and crew who have worked on that title 
    """
    cc = x['nconst']
    crew_exp = []
    cast_exp = []
    for person in cc:
        roles = roles_df.loc[person]['category']
        for role in roles:
            #Use the prof experience table to get the title count for each cast or crew member
            if role in role_cat_dict['cast']:
                cast_exp.append(prof_exp_df.loc[person]['tconst'])
            elif role in role_cat_dict['crew']:
                crew_exp.append(prof_exp_df.loc[person]['tconst'])
    if cast_exp:
        x['cast_exp'] = sum(cast_exp)/len(cast_exp)
    else:
        #If we are unable to find information about cast experience, we use the global mean of 3 titles
        x['cast_exp'] = 3
    if crew_exp:
        x['crew_exp'] = sum(crew_exp)/len(crew_exp)
    else:
        #If we are unable to find information about crew experience, we use the global mean of 3 titles
        x['crew_exp'] = 3
    
    return x

In [None]:
cc_df = cc_df.parallel_apply(lambda x: get_prof_experience(x), axis = 1)

In [51]:
cc_df.to_csv('processed/cc_df.csv')

Merging our created features with the master dataset

In [52]:
cc_df = pd.read_csv('processed/cc_df.csv')

In [57]:
title_rating_prof = pd.merge(title_rating_new, cc_df, on='tconst', how='left')

In [58]:
title_rating_prof

Unnamed: 0.1,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,averageRating,numVotes,genres,Unnamed: 0,nconst,cast_mean,cast_std,cast_max,crew_mean,crew_std,crew_max,cast_exp,crew_exp
0,tt0000502,movie,Bohemios,Bohemios,0,1970-01-01 00:00:00.000001905,100.0,4.5,14,"Comedy,Musical",0.0,"['nm0215752', 'nm0252720', 'nm0063413', 'nm065...",4.500000,,4.500000,5.357143,1.016246,6.366667,1.000000,5.000000
1,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1970-01-01 00:00:00.000001906,70.0,6.0,754,"Action,Adventure,Biography",1.0,"['nm0675239', 'nm0846887', 'nm0846894', 'nm143...",6.000000,,6.000000,5.802778,,6.033333,1.000000,1.833333
2,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1970-01-01 00:00:00.000001907,90.0,4.6,17,Drama,2.0,"['nm0906197', 'nm0332182', 'nm1323543', 'nm175...",5.178571,,5.725000,4.800000,0.282843,5.000000,4.000000,2.000000
3,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,0,1970-01-01 00:00:00.000001907,96.0,4.5,23,Drama,3.0,"['nm3071427', 'nm0581353', 'nm0888988', 'nm024...",4.800000,,5.100000,4.766667,,5.133333,1.333333,2.333333
4,tt0000630,movie,Hamlet,Amleto,0,1970-01-01 00:00:00.000001908,130.0,3.8,24,Drama,4.0,"['nm0624446', 'nm0143333', 'nm0000636', 'nm020...",5.337500,1.242302,6.650000,5.801958,1.409807,7.575000,4.500000,163.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323829,tt9916362,movie,Coven,Akelarre,0,2020-01-01,92.0,6.4,4447,"Drama,History",323106.0,"['nm5813626', 'nm3766704', 'nm0107165', 'nm026...",6.148864,,7.575000,6.353927,,7.028571,17.750000,6.428571
323830,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0,2019-01-01,93.0,3.8,14,"Adventure,History,War",323107.0,"['nm3611859', 'nm9445072', 'nm8594703', 'nm042...",4.542778,,5.083333,4.691667,1.100755,6.000000,4.500000,7.500000
323831,tt9916460,tvMovie,Pink Taxi,Pink Taxi,0,2019-01-01,10.0,9.3,17,Comedy,323108.0,"['nm8796794', 'nm10538444', 'nm8691452', 'nm10...",9.300000,,9.300000,9.300000,,9.300000,1.000000,1.000000
323832,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019-01-01,123.0,8.3,6,Drama,323109.0,"['nm4700236', 'nm8678236', 'nm1417182', 'nm100...",7.267708,,8.300000,7.083557,,8.300000,9.250000,15.555556


In [59]:
title_rating_prof.to_csv('processed/title_rating_prof.csv', index = False)

### Number of regions

The number of regions in which a movie was shown is a strong indicator of how popular it is. More popular movies generally tend to be better rated, so capturing the number of regions in which a title was featured can help the model predict its IMDb rating.
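
As a hedged illustration on toy data, the sketch below contrasts two possible ways of counting: the number of alternate-title rows per title, and the number of distinct known regions per title. The table and its values are hypothetical; the `\N` marker for a missing region follows the akas table shown in this notebook.

```python
import pandas as pd

# Hypothetical toy akas table; "\\N" marks a missing region
akas = pd.DataFrame({
    "tconst": ["t1", "t1", "t1", "t2", "t2"],
    "title": ["A", "A (AU)", "A alt", "B", "B (ES)"],
    "region": ["\\N", "AU", "AU", "\\N", "ES"],
})

# Number of alternate-title rows per title
aka_counts = akas.groupby("tconst")["title"].count()

# Number of distinct known regions per title, ignoring the missing-value marker
known = akas[akas["region"] != "\\N"]
region_counts = known.groupby("tconst")["region"].nunique()

print(aka_counts.to_dict())
print(region_counts.to_dict())
```

The two counts differ whenever a title has several alternate titles in the same region or rows without a region, so which one is the better popularity proxy is a modelling choice.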

In [60]:
title_rating_akas = pd.read_csv('processed/title_rating_aka.csv')

In [61]:
title_rating_akas

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000502,movie,Bohemios,Bohemios,0,1970-01-01 00:00:00.000001905,,100.0,,4.5,14,tt0000502,1,Bohemios,\N,\N,original,\N,1
1,tt0000502,movie,Bohemios,Bohemios,0,1970-01-01 00:00:00.000001905,,100.0,,4.5,14,tt0000502,2,Bohemios,ES,\N,imdbDisplay,\N,0
2,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1970-01-01 00:00:00.000001906,,70.0,"Action,Adventure,Biography",6.0,754,tt0000574,10,The Story of the Kelly Gang,AU,\N,imdbDisplay,\N,0
3,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1970-01-01 00:00:00.000001906,,70.0,"Action,Adventure,Biography",6.0,754,tt0000574,1,Kelly bandájának története,HU,\N,imdbDisplay,\N,0
4,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1970-01-01 00:00:00.000001906,,70.0,"Action,Adventure,Biography",6.0,754,tt0000574,2,Ned Kelly and His Gang,AU,\N,imdbDisplay,\N,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2105360,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0,2019-01-01,,,"Adventure,History,War",3.8,14,tt9916428,5,Hong xing zhao yao Zhong guo,CN,\N,\N,\N,0
2105361,tt9916460,tvMovie,Pink Taxi,Pink Taxi,0,2019-01-01,,,Comedy,9.3,17,tt9916460,1,Ροζ Ταξί,GR,\N,imdbDisplay,\N,0
2105362,tt9916460,tvMovie,Pink Taxi,Pink Taxi,0,2019-01-01,,,Comedy,9.3,17,tt9916460,2,Pink Taxi,\N,\N,original,\N,1
2105363,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019-01-01,,123.0,Drama,8.3,6,tt9916538,1,Kuambil Lagi Hatiku,ID,\N,\N,\N,0


In [63]:
#Count the alternate-title (aka) entries of each title as a proxy for the number of regions it was featured in
regions_df = title_rating_akas.groupby('tconst').agg({'title':'count'})


In [66]:
title_rating_prof_reg = pd.merge(title_rating_prof, regions_df, left_on='tconst', right_index=True, how = 'left')

In [68]:
title_rating_prof_reg.rename(columns={'title':'numRegions'}, inplace=True)

In [69]:
title_rating_prof_reg.to_csv('processed/title_rating_prof_reg.csv')

In [82]:
title_rating_prof_reg = pd.read_csv('processed/title_rating_prof_reg.csv')

## Final Imputation and Clean-Up

This section fills in any nulls missed in the previous steps.

In [83]:
title_rating_prof_reg.isnull().sum()

Unnamed: 0             0
tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear              0
runtimeMinutes         0
averageRating          0
numVotes               0
genres               540
Unnamed: 0.1         723
nconst               723
cast_mean            723
cast_std          150713
cast_max             723
crew_mean            723
crew_std          155979
crew_max             723
cast_exp             723
crew_exp             723
numRegions          1084
dtype: int64

In [98]:
# Drop leftover index/merge columns; errors='ignore' skips any that are already gone
title_rating_prof_reg.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'nconst'], errors='ignore', inplace=True)


#### Imputing the remaining nulls

In [85]:
# Impute mean performance with the overall mean
title_rating_prof_reg['cast_mean'] = title_rating_prof_reg['cast_mean'].fillna(6.2)
title_rating_prof_reg['crew_mean'] = title_rating_prof_reg['crew_mean'].fillna(6.2)

In [86]:
# Impute std with 0 (these people have worked on only one title)

title_rating_prof_reg['cast_std'] = title_rating_prof_reg['cast_std'].fillna(0)
title_rating_prof_reg['crew_std'] = title_rating_prof_reg['crew_std'].fillna(0)

In [87]:
# Impute max with the mean of the max

title_rating_prof_reg['cast_max'] = title_rating_prof_reg['cast_max'].fillna(6.6)
title_rating_prof_reg['crew_max'] = title_rating_prof_reg['crew_max'].fillna(6.6)

In [88]:
# Impute experience with a value of 1

title_rating_prof_reg['cast_exp'] = title_rating_prof_reg['cast_exp'].fillna(1)
title_rating_prof_reg['crew_exp'] = title_rating_prof_reg['crew_exp'].fillna(1)

In [89]:
# Impute numRegions with 1 (the movie has to be featured in at least one region)

title_rating_prof_reg['numRegions'] = title_rating_prof_reg['numRegions'].fillna(1)
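
The constants 6.2 and 6.6 used above are typed in by hand. As a hedged alternative, the sketch below derives them from the columns themselves and applies every fill in one `fillna` call with a per-column dictionary. It assumes the constants are meant to be column means; the toy frame `df` and its values are hypothetical, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the notebook's column names.
df = pd.DataFrame({
    "cast_mean": [6.0, np.nan, 7.0],
    "cast_std": [0.5, np.nan, np.nan],
    "cast_max": [7.0, np.nan, 8.0],
    "cast_exp": [3.0, np.nan, 5.0],
    "numRegions": [2.0, np.nan, 4.0],
})

# One fill value per column; the means are computed rather than hard-coded.
fill_values = {
    "cast_mean": df["cast_mean"].mean(),  # mean of the observed means
    "cast_std": 0,                        # single-title people have no spread
    "cast_max": df["cast_max"].mean(),    # mean of the observed maxima
    "cast_exp": 1,
    "numRegions": 1,
}

# fillna with a dict only touches the listed columns.
df = df.fillna(value=fill_values)
print(df)
```

Computing the values this way keeps them in sync if the upstream data changes, at the cost of no longer matching the rounded constants exactly.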


In [99]:
title_rating_prof_reg.isnull().sum()

tconst              0
titleType           0
primaryTitle        0
originalTitle       0
isAdult             0
startYear           0
runtimeMinutes      0
averageRating       0
numVotes            0
genres            540
cast_mean           0
cast_std            0
cast_max            0
crew_mean           0
crew_std            0
crew_max            0
cast_exp            0
crew_exp            0
numRegions          0
dtype: int64

Imputing the missing genres

In [105]:
# Flatten every title's comma-separated genre string into one list of genre names
all_genres = [g for x in title_rating_prof_reg['genres'].dropna() for g in split_names(x)]

In [107]:
c = Counter(all_genres)

In [142]:
c.items()

dict_items([('Comedy', 82117), ('Musical', 7334), ('Action', 33599), ('Adventure', 19393), ('Biography', 10721), ('Drama', 149327), ('Fantasy', 9494), ('Romance', 35598), ('Crime', 27514), ('Thriller', 26179), ('War', 6669), ('Family', 13427), ('History', 9558), ('Sci-Fi', 7555), ('Documentary', 54829), ('Western', 5339), ('Mystery', 11873), ('Horror', 20677), ('Music', 8533), ('Sport', 3902), ('Animation', 5925), ('Film-Noir', 764), ('News', 681), ('Adult', 4518), ('Reality-TV', 127), ('Talk-Show', 22), ('Short', 25), ('Game-Show', 7)])
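
The same per-genre counts can also be obtained with pandas alone. The sketch below is an alternative to the `Counter` approach, shown on a small hypothetical series rather than the real `genres` column.

```python
import pandas as pd

# Hypothetical genre strings, including one missing value.
genres = pd.Series(["Comedy,Musical", "Drama", None, "Drama,Comedy"])

# Split each string into a list, give every genre its own row, then count.
counts = genres.dropna().str.split(",").explode().value_counts()
print(counts)

# The index of the result holds the unique genre names.
unique_genres = set(counts.index)
```

This keeps the result as a Series, which is convenient for sorting or plotting, while `Counter` offers `most_common` directly.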

In [143]:
# Unique genres: the keys of the Counter
genres = set(c)

In [144]:
genres

{'Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western'}

In [109]:
c.most_common(3)

[('Drama', 149327), ('Comedy', 82117), ('Documentary', 54829)]

Imputing the remaining missing genres with the two most common genres, Drama and Comedy

In [111]:
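# Fill only the genres column, so a null in any other column is never overwritten with a genre string
title_rating_prof_reg['genres'] = title_rating_prof_reg['genres'].fillna('Drama,Comedy')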
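
The fill string above is written out by hand from the `most_common` output. A minimal sketch of building it from the counter instead is shown below; the series `genres` and the name `top_two` are hypothetical, and the choice of two genres simply mirrors the notebook.

```python
from collections import Counter

import pandas as pd

# Hypothetical genre strings with one missing value.
genres = pd.Series(["Drama", "Drama,Comedy", None, "Comedy,Horror", "Drama"])

# Count every genre across all non-null rows.
counter = Counter(g for row in genres.dropna() for g in row.split(","))

# Join the two most frequent genre names into one comma-separated string.
top_two = ",".join(name for name, _ in counter.most_common(2))

# Fill only the missing entries with that string.
filled = genres.fillna(top_two)
print(filled)
```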

In [124]:
title_rating_prof_reg.to_csv('processed/title_rating_prof_reg.csv', index = False)

## One hot encoding the genres

In [134]:
title_rating_prof_reg = pd.read_csv('processed/title_rating_prof_reg.csv')

In [123]:
# Preview the genre lists without overwriting the column (the encoder below splits the raw strings itself)
title_rating_prof_reg['genres'].apply(split_names)

0                      [Comedy, Musical]
1         [Action, Adventure, Biography]
2                                [Drama]
3                                [Drama]
4                                [Drama]
                       ...              
323829                  [Drama, History]
323830         [Adventure, History, War]
323831                          [Comedy]
323832                           [Drama]
323833                  [Drama, Romance]
Name: genres, Length: 323834, dtype: object

In [128]:
from sklearn.preprocessing import MultiLabelBinarizer

In [150]:
mlb = MultiLabelBinarizer()
mlb.fit(list(title_rating_prof_reg['genres'].apply(lambda x: x.split(','))))
mlb.classes_

array(['Action', 'Adult', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir',
       'Game-Show', 'History', 'Horror', 'Music', 'Musical', 'Mystery',
       'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport',
       'Talk-Show', 'Thriller', 'War', 'Western'], dtype=object)

In [155]:
# Create a DataFrame with one one-hot encoded column per genre
onehot_genres_df = pd.DataFrame(mlb.transform(list(title_rating_prof_reg['genres'].apply(lambda x: x.split(',')))), columns=mlb.classes_)

In [157]:
title_rating_prof_reg_one = title_rating_prof_reg.join(onehot_genres_df)
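
The fit, transform and join steps above can be sketched on a small frame. The data here is hypothetical. The sketch passes `index=df.index` when building the one-hot frame, which is an addition to the notebook's code: it keeps the join aligned even if the source frame does not have a default integer index.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical titles with comma-separated genre strings.
df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Comedy,Musical", "Drama", "Drama,Comedy"],
})

# Each row becomes a list of labels; fit_transform learns the label set
# and returns a 0/1 matrix with one column per genre.
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df["genres"].str.split(","))

# Reuse the source index so join aligns rows by label, not by position.
onehot = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
result = df.join(onehot)
print(result)
```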

In [159]:
title_rating_prof_reg_one

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,averageRating,numVotes,genres,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,tt0000502,movie,Bohemios,Bohemios,0,1970-01-01 00:00:00.000001905,100.0,4.5,14,"Comedy,Musical",...,0,0,0,0,0,0,0,0,0,0
1,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1970-01-01 00:00:00.000001906,70.0,6.0,754,"Action,Adventure,Biography",...,0,0,0,0,0,0,0,0,0,0
2,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1970-01-01 00:00:00.000001907,90.0,4.6,17,Drama,...,0,0,0,0,0,0,0,0,0,0
3,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,0,1970-01-01 00:00:00.000001907,96.0,4.5,23,Drama,...,0,0,0,0,0,0,0,0,0,0
4,tt0000630,movie,Hamlet,Amleto,0,1970-01-01 00:00:00.000001908,130.0,3.8,24,Drama,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323829,tt9916362,movie,Coven,Akelarre,0,2020-01-01,92.0,6.4,4447,"Drama,History",...,0,0,0,0,0,0,0,0,0,0
323830,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0,2019-01-01,93.0,3.8,14,"Adventure,History,War",...,0,0,0,0,0,0,0,0,1,0
323831,tt9916460,tvMovie,Pink Taxi,Pink Taxi,0,2019-01-01,10.0,9.3,17,Comedy,...,0,0,0,0,0,0,0,0,0,0
323832,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019-01-01,123.0,8.3,6,Drama,...,0,0,0,0,0,0,0,0,0,0


In [160]:
title_rating_prof_reg_one.to_csv('processed/title_rating_prof_reg_one.csv', index = False)