# Data Cleaning Notebook


## TODO: 
    * Work on aggregating the title.akas DF. Each movie has multiple rows so we need to determine how we want to 
      merge this into our main one as each movie will have multiple entries 
    * Work on merging the list of actors into our list. Need to get a list of the actors for a given movie, merge it, 
      then one-hot encode it. 
    * Figure out how to featurize the movie title, maybe NLP vector would be the best but thats somewhat complicated 
      Could do things like length of the title, whether it contains nouns, etc. 

This notebook contains the relevant code for aggregating our CSV's into a singular one that can then be used in our models. 

Contains the One-Hot encoding, and featurization of the columns. 

In [1]:
import os
import pandas as pd 
import numpy as np 

#Descriptions of sets available at https://www.imdb.com/interfaces/

In [2]:
# Loading the Sets: 
sets = {} 

for file in os.listdir('DataSets'):
    file = file.replace('.tsv', '') 
    sets[file] = pd.read_csv(f"DataSets/{file}.tsv", sep='\t')
    if 'tconst' in sets[file].keys():
        sets[file].set_index('tconst',inplace=True)
    elif 'titleId' in sets[file].keys():
        sets[file].set_index('titleId', inplace=True) 
    else:
        print("Cant set index for ", file)
    print(f"Loaded {file}")
print("Loaded all datasets") 

Loaded title.ratings
Loaded title.principals


  interactivity=interactivity, compiler=compiler, result=result)


Loaded title.akas
Cant set index for  name.basics
Loaded name.basics


  interactivity=interactivity, compiler=compiler, result=result)


Loaded title.basics
Loaded title.crew
Loaded all datasets


# What we need from each dataset: 
### title.akas 
    * Language
    * region 
### title.basics
    * titleType -> Used to only keep movies, used in filtering. We should filter before we split, etc. 
    * primaryTitle -> We only care about the most popular title so use this one 
    * isAdult -> Why not include these 
    * startYear -> Year movie released, good to see to potentially capture trends of a timeperiod 
    * runtimeMinutes -> Good to know for how long a movie is 
    * genres -> One hot encoding on the genres it has 
### title.crew 
    * directors -> One hot encoding on top 100 directors 
    * writers -> One hot encoding on top 100 writers 
### title.episode 
    * We dont care about episodes so skip this one 
### title.principals  - Info about cast / crew for titles 
    * nconst -> Useful for determining which actor is who. Can use this with one-hot encoding for each movie to 
                determine top 100 actors and whether or not they were in a movie or not 
### title.ratings 
    * averageRating -> Weight average of all individual ratings, used as our target variable 
    * numVotes -> The number of votes it received - useful to somehow include this in our target, would want to 
                  weight training samples with move votes w/ more importance, can be used with models that allow that 
### name.basics
    * Can potentially include this later on in our featurizations if we need info about the people, currently I think 
      just their unique id from title.principals should be more than enough to capture actors. 

# Data Cleaning: 
    1) First get ids of all entries that are movies (So we are not including things that arent movies) 
    
    2) Then filter the raitings CSV to only include just movies, then move to actually just only keep the first X number of movies with a certain raitings count. 

In [7]:
print("DFs we have loaded: ", sets.keys())

DFs we have loaded:  dict_keys(['title.ratings', 'title.principals', 'title.akas', 'name.basics', 'title.basics', 'title.crew'])


In [11]:
## FILTERING SO WE ONLY HAVE MOVIES: 
df = sets['title.basics']

# Keep only the rows that have a titleType of movie
df = df[df['titleType'] == 'movie']

# The row ids that are just movies 
movie_ids = list(df.index)
#first 10 just to make sure its ids 
print(movie_ids[:10])
print(len(movie_ids))


['tt0000502', 'tt0000574', 'tt0000591', 'tt0000615', 'tt0000630', 'tt0000675', 'tt0000679', 'tt0000739', 'tt0000793', 'tt0000814']
606395


In [12]:
# Shifting over to the title.ratings CSV to get the first X number of movies with the most reviews 
df = sets['title.ratings']

#Filtering the df to only include the ids that were explicitly movies 
df = df.loc[movie_ids]

# Sorting the df by numVotes so its from most number of reviews -> less 
df.sort_values(by='numVotes', inplace=True, ascending=False)
# I dont know why I hardcoded these but checks 100k to 20k in batches of 10k  
for splitter in [100000, 90000, 80000, 70000, 60000, 50000, 40000, 30000, 20000]:
    top = list(df['numVotes'][:splitter])
    print(f"Total number of movies: {splitter} : Lowest raiting count: {top[-1]}")


# We will go with 30k for now to keep it simple :) 
TOTAL_MOVIES_TO_KEEP = 70000 

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  """


Total number of movies: 100000 : Lowest raiting count: 130.0
Total number of movies: 90000 : Lowest raiting count: 166.0
Total number of movies: 80000 : Lowest raiting count: 216.0
Total number of movies: 70000 : Lowest raiting count: 288.0
Total number of movies: 60000 : Lowest raiting count: 395.0
Total number of movies: 50000 : Lowest raiting count: 569.0
Total number of movies: 40000 : Lowest raiting count: 870.0
Total number of movies: 30000 : Lowest raiting count: 1481.0
Total number of movies: 20000 : Lowest raiting count: 2997.0


In [13]:
# The ids that meet the criteria we are utilizing now in our main df. 
movie_ids = list(df['numVotes'][:TOTAL_MOVIES_TO_KEEP].index)
# Filter the DataFrame to only keep those ids :) 
print(df.shape)
df = df.loc[movie_ids]
print(df.shape)

# Our output final df, named final_df for ease. Contains the top 30k movies for the number of votes. 
final_df = df 

(606395, 2)
(70000, 2)


In [14]:
#IGNORING AKA RIGHT NOW AS IT HAS DUPLICATES SO WE NEED TO FIGURE OUT WHETHER TO AGGREGATE OR NOT

# If we ever need to change the columns we are merging into it, do so here :) 
AKA_COLS = ['region', 'language'] 
BASICS_COLS = ['isAdult', 'startYear', 'runtimeMinutes', 'genres', 'primaryTitle']

# Grab the respective dfs, only grab the rows that are our movie ids we have chosen to work with 
df_aka = sets['title.akas'].loc[movie_ids]
df_basic = sets['title.basics'].loc[movie_ids]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  


In [None]:
df_aka

In [15]:
# Filtering the dfs to only be the columns we are actually interested in 
df_aka = df_aka[AKA_COLS]
df_basic = df_basic[BASICS_COLS]

# Merging AKA and Basic  - ignoring right now 
# merged = df_aka.join(df_basic)

# Joining the merged above df into our final df 
# final_df = final_df.join(merged) 

final_df = final_df.join(df_basic) 

final_df

Unnamed: 0_level_0,averageRating,numVotes,isAdult,startYear,runtimeMinutes,genres,primaryTitle
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
tt0111161,9.3,2569002.0,0,1994,142,Drama,The Shawshank Redemption
tt0468569,9.1,2534021.0,0,2008,152,"Action,Crime,Drama",The Dark Knight
tt1375666,8.8,2254701.0,0,2010,148,"Action,Adventure,Sci-Fi",Inception
tt0137523,8.8,2021525.0,0,1999,139,Drama,Fight Club
tt0109830,8.8,1981606.0,0,1994,142,"Drama,Romance",Forrest Gump
...,...,...,...,...,...,...,...
tt0022134,5.7,288.0,0,1931,70,"Drama,Romance",Arizona
tt5449088,5.9,288.0,0,2015,119,"Action,Comedy",Enemies In-Law
tt2671390,6.7,288.0,0,2013,95,"Drama,Romance",Paradzhanov
tt0491321,6.1,288.0,0,1983,91,"Adventure,Comedy,Drama",Bas Belasi


In [16]:
CREW_COLS = ['directors','writers']
# Merging The Crew df into our main Df: Should probably do one hot encoding here before we actually merge it but idk 
crew = sets['title.crew'].loc[movie_ids]
crew = crew[CREW_COLS]

final_df = final_df.join(crew)


In [17]:
final_df

Unnamed: 0_level_0,averageRating,numVotes,isAdult,startYear,runtimeMinutes,genres,primaryTitle,directors,writers
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt0111161,9.3,2569002.0,0,1994,142,Drama,The Shawshank Redemption,nm0001104,"nm0000175,nm0001104"
tt0468569,9.1,2534021.0,0,2008,152,"Action,Crime,Drama",The Dark Knight,nm0634240,"nm0634300,nm0634240,nm0275286,nm0004170"
tt1375666,8.8,2254701.0,0,2010,148,"Action,Adventure,Sci-Fi",Inception,nm0634240,nm0634240
tt0137523,8.8,2021525.0,0,1999,139,Drama,Fight Club,nm0000399,"nm0657333,nm0880243"
tt0109830,8.8,1981606.0,0,1994,142,"Drama,Romance",Forrest Gump,nm0000709,"nm0343165,nm0744839"
...,...,...,...,...,...,...,...,...,...
tt0022134,5.7,288.0,0,1931,70,"Drama,Romance",Arizona,nm0782707,"nm0728307,nm0858501"
tt5449088,5.9,288.0,0,2015,119,"Action,Comedy",Enemies In-Law,nm4060924,"nm7911206,nm2148630"
tt2671390,6.7,288.0,0,2013,95,"Drama,Romance",Paradzhanov,"nm0042848,nm3403938",nm3403938
tt0491321,6.1,288.0,0,1983,91,"Adventure,Comedy,Drama",Bas Belasi,nm0862605,nm0876847


In [18]:
# Grabs the top X number of categories, defaulted to 100. Utilized by our one-hot encoding system to
# determine what the most used number of categories are 
def top_categories(df, col_name, TOTAL_TO_KEEP = 100):
    """
        Provided a dataframe and the column name, returns the top 100 categories for this column 
        @param df: The DataFrame we are searching over 
        @param col_name: The column name we are examining the categories for 
        @param TOTAL_TO_KEEP: Constant specifying the total number of entries to keep. 
        :return A list of the top 100 categories for this column. 
    """
    frequencies = {}
    for entry in list(df[col_name]):
        if entry is np.nan:
            continue 
        # If we have a list of items iterate over each of them 
        if ',' in entry:
            for entry_sub in entry.split(','):
                if entry_sub in frequencies:
                    frequencies[entry_sub] += 1
                else:
                    frequencies[entry_sub] = 1 
        else:
            # Just a singular item so can compare it here directly. 
            if entry in frequencies:
                frequencies[entry] += 1
            else:
                frequencies[entry] = 1 
    
    lof_frequencies = []
    for key in frequencies.keys():\
        lof_frequencies.append( (key, frequencies[key]) )
    lof_frequencies.sort(key = (lambda pair: pair[1]))
        
    return [pair[0] for pair in lof_frequencies[:TOTAL_TO_KEEP]]


def encode_row(row, category):
    """
        Encodes a specific row / entry in our dataframe. If the row is a list, checks to see if the 
        category value exists in it, if it is not a list just checks to see if the row is equal to the category 
    """
    if ',' in row:
        return 1 if category in row.split(',') else 0 
    else:
        return 1 if category == row else 0 

def encode_non_categorical_row(row, lof_categories):
    """
        Encodes a row specifically looking to see if the value is not in our list of categories. If any of the 
        values does not exist in our list of categories return 1, else 0 
    """
    if ',' in row:
        for category in lof_categories:
            if category in row:
                return 1
        return 0 
    else: 
        return 1 if row not in lof_categories else 0 

# Converts a specific column to a one-hot encoding version of it.
def encode_column(df, col_name, TOTAL_TO_KEEP = 100):
    df = df.copy()
    """
        Provided a dataframe and column to encode, mutates a copy of the dataframe to have that a one-hot encoding of that 
        specific column. Will remove that specific column from the dataframe and replace it with TOTAL_TO_KEEP 
        columns for that value plus one more column to handle any categorical variable that was not in the top 
        TOTAL_TO_KEEP categories. 
        
        @param df: The dataframe we are mutating 
        @param col_name: The column we are encoding, this column is removed from the df and replaced with the encodings
        @param TOTAL_TO_KEEP: Number of categories we want to display, defaulted to 100 
        :return A Copy of the dataframe with the encoding 
    """
    categories = top_categories(df, col_name, TOTAL_TO_KEEP=TOTAL_TO_KEEP)
    
    # Our encoded columns for the dataset goes here 
    encoded_cols = {}
    
    lof_column = list(df[col_name])
    
    for category in categories: 
        encoded_column = [encode_row(row, category) for row in lof_column]
        
        encoded_col_name = f"{col_name}_{category}"
        df[encoded_col_name] = encoded_column
    # If any of the entries is not in our lof-categories featurize this under the non_100 category 
    df[f"{col_name}_non_100_category"] = [encode_row(row, categories) for row in lof_column]
    df.drop(columns=[col_name], inplace=True)
    return df 


In [19]:
# The purpose of this is to turn any string columns into integers so they aren't encoded
# TODO: Setting the value to 0 is incomplete. Find the mean first and set it after (this is difficult as it needs a numeric value before mean can be found)
final_df['startYear'] = final_df['startYear'].replace("\\N", np.NaN)
final_df['startYear'] = pd.to_numeric(final_df['startYear'])
final_df['startYear'] = final_df['startYear'].replace("\\N", final_df['startYear'].mean())

final_df['runtimeMinutes'] = final_df['runtimeMinutes'].replace("\\N", np.NaN)
final_df['runtimeMinutes'] = pd.to_numeric(final_df['runtimeMinutes'])
final_df['runtimeMinutes'] = final_df['runtimeMinutes'].replace(np.NaN, final_df['runtimeMinutes'].mean())

In [20]:
# TODO - Dont run one-hot encoding on the title / movie title, find different way to featurize this.
# COMMENTS ON WORK: I wanted to also try to encode regions but I noticed that pretty much most movies on this are available in most regions. They probably wouldn't really be useful features.
final_df

Unnamed: 0_level_0,averageRating,numVotes,isAdult,startYear,runtimeMinutes,genres,primaryTitle,directors,writers
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt0111161,9.3,2569002.0,0,1994.0,142.0,Drama,The Shawshank Redemption,nm0001104,"nm0000175,nm0001104"
tt0468569,9.1,2534021.0,0,2008.0,152.0,"Action,Crime,Drama",The Dark Knight,nm0634240,"nm0634300,nm0634240,nm0275286,nm0004170"
tt1375666,8.8,2254701.0,0,2010.0,148.0,"Action,Adventure,Sci-Fi",Inception,nm0634240,nm0634240
tt0137523,8.8,2021525.0,0,1999.0,139.0,Drama,Fight Club,nm0000399,"nm0657333,nm0880243"
tt0109830,8.8,1981606.0,0,1994.0,142.0,"Drama,Romance",Forrest Gump,nm0000709,"nm0343165,nm0744839"
...,...,...,...,...,...,...,...,...,...
tt0022134,5.7,288.0,0,1931.0,70.0,"Drama,Romance",Arizona,nm0782707,"nm0728307,nm0858501"
tt5449088,5.9,288.0,0,2015.0,119.0,"Action,Comedy",Enemies In-Law,nm4060924,"nm7911206,nm2148630"
tt2671390,6.7,288.0,0,2013.0,95.0,"Drama,Romance",Paradzhanov,"nm0042848,nm3403938",nm3403938
tt0491321,6.1,288.0,0,1983.0,91.0,"Adventure,Comedy,Drama",Bas Belasi,nm0862605,nm0876847


In [21]:
### Encoding of title.principals to get the actors for a movie 
df = sets['title.principals']
# Filtering actors to just be the 
df = df.loc[movie_ids]

## We already encoded directors / writers so drop these from our table 
df = df.loc[(df['category'] != 'director') & (df['category'] != 'writer')]

# Grab the top 100 most common actors 
top_actors = top_categories(df, 'nconst')

df

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  after removing the cwd from sys.path.


Unnamed: 0_level_0,ordering,nconst,category,job,characters
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
tt0111161,10.0,nm0290358,editor,\N,\N
tt0111161,1.0,nm0000209,actor,\N,"[""Andy Dufresne""]"
tt0111161,2.0,nm0000151,actor,\N,"[""Ellis Boyd 'Red' Redding""]"
tt0111161,3.0,nm0348409,actor,\N,"[""Warden Norton""]"
tt0111161,4.0,nm0006669,actor,\N,"[""Heywood""]"
...,...,...,...,...,...
tt0086578,2.0,nm0730181,actor,\N,"[""Moody Mudinsky""]"
tt0086578,3.0,nm0842228,actress,\N,"[""Mascha""]"
tt0086578,4.0,nm0381450,actor,\N,"[""Frank""]"
tt0086578,8.0,nm1569037,composer,\N,\N


In [22]:
# group = df.groupby('nconst')
# top = group.sum().sort_values(by='ordering', ascending=False)
# top_actors = list(top.index)[:100]


In [23]:
df.dropna(inplace=True) 
movie_ids = list(set(movie_ids).intersection(set(df.index)))
final_df = final_df.loc[movie_ids]

In [24]:
# Each key corresponds to a Movie, holds a dictionary of top actors and 1 / 0 designation for if it has that 
# Note, I thought it would be good to include the non_top_100 actors as just the total count they have instead of 1/0
# Can change this if it isnt accurate, but I thought it better captured the random actors. 
top_actor_encodings = {} 
# Pandas .loc is extremely expensive for such large tables, so its better to do this in one pass 
# Only doing the .loc once saves a ton of computation for each movie 


for movie_id in set(df.index):
    top_actor_encodings[movie_id] = {}
    lof_actors = df.loc[movie_id]
    sof_actors = set(lof_actors['nconst'])
    for actor in top_actors:
        if actor in sof_actors:
            top_actor_encodings[movie_id][actor] = 1 
        else:
            top_actor_encodings[movie_id][actor] = 0
    number_of_non_top = len(sof_actors.difference(set(top_actors)))
    top_actor_encodings[movie_id]['non_100'] = number_of_non_top

In [25]:
# Now go through and actually create the one-hot encoding columns 

# Some movies are missing data so we drop those 
movie_ids = set(final_df.index).intersection(set(df.index))
final_df = final_df.loc[movie_ids] 
print(final_df.shape)

# Adding the one-hot encoded columns 
for actor in top_actors + ['non_100']:
    column = []
    for movie_id in final_df.index:
        # If this movie has no data, we can just put zero as they do not have that actor 
        if movie_id not in top_actor_encodings:
            column.append(0)
            continue 
#         if movie_id not in top_actor_encodings:
#             bad_ids.add()
        
        column.append(top_actor_encodings[movie_id][actor])
    col_name = f"actor_{actor}"
    final_df[col_name] = column


            

(69955, 9)


In [26]:
final_df

Unnamed: 0_level_0,averageRating,numVotes,isAdult,startYear,runtimeMinutes,genres,primaryTitle,directors,writers,actor_nm0555550,...,actor_nm2505304,actor_nm7692367,actor_nm4025731,actor_nm0906525,actor_nm0542635,actor_nm0166787,actor_nm3675884,actor_nm0661791,actor_nm1796730,actor_non_100
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
tt3032300,6.4,963.0,0,2013.0,80.0,"Adventure,Animation,Comedy",Oggy and the Cockroaches: The Movie,nm0419829,"nm0706893,nm0419829",0,...,0,0,0,0,0,0,0,0,0,5
tt0066844,6.2,1213.0,0,1971.0,105.0,Western,Blindman,nm0049728,"nm0031026,nm0148437,nm0025816,nm0061437",0,...,0,0,0,0,0,0,0,0,0,7
tt10329084,6.5,437.0,0,2019.0,144.0,Drama,99,nm5794652,nm5794652,0,...,0,0,0,0,0,0,0,0,0,9
tt3328442,5.3,2242.0,0,2015.0,102.0,"Drama,Horror,Mystery",Residue,nm3421685,nm0365666,0,...,0,0,0,0,0,0,0,0,0,8
tt0041841,7.0,10359.0,0,1949.0,100.0,"Action,Drama,Romance",Sands of Iwo Jima,nm0245385,"nm0113689,nm0335455",0,...,0,0,0,0,0,0,0,0,0,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tt0072342,6.4,643.0,0,1975.0,101.0,Drama,Pleasure Party,nm0001031,nm0337541,0,...,0,0,0,0,0,0,0,0,0,9
tt0130350,6.5,1314.0,0,1992.0,174.0,"Action,Drama,Thriller",Vishwatma,nm0706800,nm0706800,0,...,0,0,0,0,0,0,0,0,0,7
tt5639388,4.5,1942.0,0,2016.0,128.0,"Horror,Mystery,Romance",Raaz Reboot,nm0080333,"nm0080333,nm0223475",0,...,0,0,0,0,0,0,0,0,0,8
tt7845306,7.9,404.0,0,2017.0,76.0,Documentary,Where Dreams Go To Die,nm3375370,\N,0,...,0,0,0,0,0,0,0,0,0,6


In [27]:
# Finally, encode the following columns 
for col in ['genres', 'directors', 'writers']:
    final_df = encode_column(final_df, col)
final_df 

Unnamed: 0_level_0,averageRating,numVotes,isAdult,startYear,runtimeMinutes,primaryTitle,actor_nm0555550,actor_nm0245596,actor_nm0068501,actor_nm0001825,...,writers_nm9748540,writers_nm3688509,writers_nm2522077,writers_nm1160330,writers_nm2441891,writers_nm0179041,writers_nm0482974,writers_nm2381441,writers_nm10505382,writers_non_100_category
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
tt3032300,6.4,963.0,0,2013.0,80.0,Oggy and the Cockroaches: The Movie,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt0066844,6.2,1213.0,0,1971.0,105.0,Blindman,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt10329084,6.5,437.0,0,2019.0,144.0,99,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt3328442,5.3,2242.0,0,2015.0,102.0,Residue,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt0041841,7.0,10359.0,0,1949.0,100.0,Sands of Iwo Jima,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tt0072342,6.4,643.0,0,1975.0,101.0,Pleasure Party,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt0130350,6.5,1314.0,0,1992.0,174.0,Vishwatma,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt5639388,4.5,1942.0,0,2016.0,128.0,Raaz Reboot,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt7845306,7.9,404.0,0,2017.0,76.0,Where Dreams Go To Die,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# Incase we want to experiment with this 
## Title featurization

# titles = list(final_df['primaryTitle'])
# titles = [",".join(entry.split(' ')) for entry in titles]
# final_df['primaryTitle'] = titles

AttributeError: 'list' object has no attribute 'split'

In [42]:
# final_df['primaryTitle'] = [",".join(entry) for entry in final_df['primaryTitle']]
# final_df = encode_column(final_df, 'primaryTitle') 

In [43]:
# final_df

Unnamed: 0_level_0,averageRating,numVotes,isAdult,startYear,runtimeMinutes,actor_nm0555550,actor_nm0245596,actor_nm0068501,actor_nm0001825,actor_nm0138287,...,primaryTitle_calidad,primaryTitle_Farhad,primaryTitle_Nikal,primaryTitle_Arakulo,primaryTitle_Virago,primaryTitle_Bottine,primaryTitle_Buyers,primaryTitle_Burns,primaryTitle_Grimsby,primaryTitle_non_100_category
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
tt3032300,6.4,963.0,0,2013.0,80.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt0066844,6.2,1213.0,0,1971.0,105.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt10329084,6.5,437.0,0,2019.0,144.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt3328442,5.3,2242.0,0,2015.0,102.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt0041841,7.0,10359.0,0,1949.0,100.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tt0072342,6.4,643.0,0,1975.0,101.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt0130350,6.5,1314.0,0,1992.0,174.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt5639388,4.5,1942.0,0,2016.0,128.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tt7845306,7.9,404.0,0,2017.0,76.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
# final_df.to_csv('title_encodings') 