# Data preprocessing

The first step is to build the final dataset from the multiple datasets obtained from [IMDb](https://www.imdb.com/interfaces/).
The datasets are:
- title_basics.tsv
- title_principals.tsv
- title_crew.tsv
- title_ratings.tsv
- name_basics.tsv

The final dataset should contain the following attributes:
- titleId, title, type, year, runtimeMinutes, genres, 

Due to the size of the raw datasets combined (5.05 GB), some data reduction will probably have to be done as well.

# MovieLens dataset

In [1]:
# Libraries imports and function declarations
import pandas as pd
import numpy as np

In [2]:
def convert_genres_list(x):
    # Converters receive the raw cell as a string; an empty cell becomes NaN.
    if not x:
        return np.nan
    # If we reach this point, it means we've got a string
    # with at least 1 genre.
    return x.lower().split("|")

df_main = pd.read_csv(
    "datasets/ml-25m/movies.csv", sep=",",
    converters= {
        'genres': convert_genres_list
    })

In [3]:
df_main.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"[adventure, animation, children, comedy, fantasy]"
1,2,Jumanji (1995),"[adventure, children, fantasy]"


In [4]:
df_main.shape

(62423, 3)

In [5]:
df_main.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [6]:
def process_year(movie):
    # Format: movie_name (year)
    # Use the last pair of parentheses in the title.
    # If either parenthesis is missing, or the text between them is not an
    # integer, return -1 as the year.
    try:
        start = movie.rindex('(')
        end = movie.rindex(')')
        return int(movie[start+1:end])
    except ValueError:
        return -1

df_main['year'] = df_main['title'].apply(process_year)

In [7]:
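# Caution: `in` on a Series tests the index labels, not the values, so this
# result does not confirm that every year was parsed.
# A value-based check would be (df_main['year'] == -1).any().
-1 in df_main['year']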

False
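
Because `in` on a pandas Series looks at index labels rather than values, a value-based comparison is needed to confirm that no title fell back to the `-1` sentinel. A minimal sketch of the difference, using a small made-up Series (`years`) in place of `df_main['year']`:

```python
import pandas as pd

# Toy stand-in for df_main['year']; the values are invented for illustration.
years = pd.Series([1995, -1, 2001])

index_check = -1 in years            # looks at the index labels 0, 1, 2
value_check = (years == -1).any()    # looks at the stored values
n_failed = int((years == -1).sum())  # how many titles failed to parse

print(index_check, value_check, n_failed)
```

On the real column, `n_failed` would give the number of titles that need manual inspection or a fallback rule.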

In [8]:
# Delete year from name
def process_name(movie):
    # Format: movie_name (year)
    # Cut the title at the last opening parenthesis.
    # Titles without a parenthesis are returned unchanged.
    try:
        start = movie.rindex('(')
        return movie[:start].strip()
    except ValueError:
        return movie

df_main['title'] = df_main['title'].apply(process_name)
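
The two helpers above cut at the last parenthesis whatever it contains, so a title ending in something like `(a.k.a. ...)` would lose that part even though no year was parsed. As a possible alternative (not the approach used in this notebook), a single regular expression can split name and year only when the title really ends in a four-digit year. `split_title_year` is a hypothetical helper name:

```python
import re

# Matches "<name> (<4-digit year>)" with the year at the very end of the string.
_TITLE_YEAR = re.compile(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(raw):
    """Return (name, year); year is -1 when no trailing '(YYYY)' is present."""
    match = _TITLE_YEAR.match(raw)
    if match is None:
        # Leave the title untouched rather than cutting at an unrelated '('.
        return raw.strip(), -1
    return match.group("name"), int(match.group("year"))

print(split_title_year("Toy Story (1995)"))
print(split_title_year("Some Title (a.k.a. Other Title)"))
```

Applied to the frame, this could look like `df_main['title'].apply(split_title_year)`, with the resulting tuples unpacked into two columns.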

In [9]:
df_main.head(5)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[adventure, animation, children, comedy, fantasy]",1995
1,2,Jumanji,"[adventure, children, fantasy]",1995
2,3,Grumpier Old Men,"[comedy, romance]",1995
3,4,Waiting to Exhale,"[comedy, drama, romance]",1995
4,5,Father of the Bride Part II,[comedy],1995


Now that this dataset is cleaned, we need to enrich it with information from the IMDb dataset.

We will try to add information such as titleType, director, writer and main actors.

---
# IMDb dataset

In [94]:
df_im = pd.read_csv(
    "datasets/IMDb/title_basics.tsv", sep="\t",
    usecols= ['tconst', 'titleType', 'primaryTitle']
    )

In [95]:
df_im.shape

(8699991, 3)

In [96]:
df_im.isnull().sum()

tconst          0
titleType       0
primaryTitle    8
dtype: int64

In [97]:
df_im['titleType'].unique()

array(['short', 'movie', 'tvEpisode', 'tvSeries', 'tvShort', 'tvMovie',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

In [98]:
# We must drop all rows whose type is not short, movie, tvShort or tvMovie.
# (we're unsure whether MovieLens's dataset includes shorts, so we'll keep them just in case).
df_im = df_im.loc[df_im['titleType'].isin([
        'short', 'movie', 'tvShort', 'tvMovie'
    ])]

In [99]:
df_im.shape # ~7 million rows dropped.

(1603640, 3)
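
Since most rows are discarded by this filter, one way to reduce peak memory would be to apply it while reading, chunk by chunk, instead of loading the whole file first. The sketch below assumes the same column names and uses a tiny in-memory table in place of `title_basics.tsv`; `KEEP_TYPES` and `df_small` are illustrative names, and the chunk size is arbitrary:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for title_basics.tsv (rows invented for illustration).
raw = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\n"
    "tt1\tshort\tA\n"
    "tt2\ttvEpisode\tB\n"
    "tt3\tmovie\tC\n"
    "tt4\tvideoGame\tD\n"
    "tt5\ttvMovie\tE\n"
)

KEEP_TYPES = ["short", "movie", "tvShort", "tvMovie"]

# With chunksize set, read_csv returns an iterator of DataFrames.
chunks = pd.read_csv(
    raw, sep="\t",
    usecols=["tconst", "titleType", "primaryTitle"],
    chunksize=2,  # deliberately tiny here; a real run would use a much larger value
)

# Filter each chunk before concatenating, so dropped rows never accumulate.
df_small = pd.concat(
    (chunk[chunk["titleType"].isin(KEEP_TYPES)] for chunk in chunks),
    ignore_index=True,
)
print(df_small)
```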

In [100]:
df_im.isnull().sum()

tconst          0
titleType       0
primaryTitle    0
dtype: int64

In [101]:
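def convert_list(x):
    # Empty cells and IMDb's missing-value marker '\N' both become NaN.
    if not x:
        return np.nan
    if x == '\\N':
        return np.nan
    # Keep only the first id of the comma-separated list.
    return x.split(',')[0]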

df_crew = pd.read_csv(
    "datasets/IMDb/title_crew.tsv", sep="\t",
    converters= {
        'directors': convert_list,
        'writers': convert_list
    })

In [102]:
df_crew.head(3)

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,
1,tt0000002,nm0721526,
2,tt0000003,nm0721526,
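
The same result can probably be reached without a Python-level converter by letting `read_csv` treat `\N` as missing through `na_values` and then taking the first id with vectorised string methods. A minimal sketch on two invented rows in the `title_crew.tsv` layout (`df_crew_alt` is an illustrative name):

```python
import io
import pandas as pd

# Two invented rows in the title_crew.tsv layout; "\N" marks a missing value.
raw = io.StringIO(
    "tconst\tdirectors\twriters\n"
    "tt1\tnm1,nm2\t\\N\n"
    "tt2\t\\N\tnm3\n"
)

# na_values tells the parser to read the literal \N as NaN.
df_crew_alt = pd.read_csv(raw, sep="\t", na_values=["\\N"])

for col in ["directors", "writers"]:
    # Keep only the first listed id, as convert_list does; NaN stays NaN.
    df_crew_alt[col] = df_crew_alt[col].str.split(",").str[0]

print(df_crew_alt)
```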


In [103]:
df_im = pd.merge(df_im, df_crew, how='left', on=['tconst'])

In [104]:
df_im.head(3)

Unnamed: 0,tconst,titleType,primaryTitle,directors,writers
0,tt0000001,short,Carmencita,nm0005690,
1,tt0000002,short,Le clown et ses chiens,nm0721526,
2,tt0000003,short,Pauvre Pierrot,nm0721526,


In [105]:
df_im.shape

(1603640, 5)

In [106]:
df_name = pd.read_csv(
    "datasets/IMDb/name_basics.tsv", sep="\t",
    usecols=[
        'nconst',
        'primaryName'
    ])

In [107]:
df_name.head(3)

Unnamed: 0,nconst,primaryName
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall
2,nm0000003,Brigitte Bardot


In [108]:
df_im = pd.merge(df_im, df_name,
    how='left',
    left_on=['directors'], right_on=['nconst']
    )

# Drop extra column, rename director's column name
df_im = df_im.drop(columns='nconst')
df_im = df_im.rename(columns={'primaryName':'directorName'})

In [110]:
df_im = pd.merge(df_im, df_name,
    how='left',
    left_on=['writers'], right_on=['nconst']
    )

df_im = df_im.drop(columns='nconst')
df_im = df_im.rename(columns={'primaryName':'writerName'})


In [114]:
df_im = df_im.drop(columns=['directors','writers'])
df_im = df_im.rename(columns={
    'directorName':'director',
    'writerName':'writer'
    })
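
The merge, drop and rename sequence above amounts to an id-to-name lookup applied twice. As an alternative sketch, assuming `nconst` is unique in the names table, the lookup can be expressed with `Series.map`. The frames and names below (`df_titles`, `df_people`, `name_by_id`) are invented for illustration:

```python
import pandas as pd

# Invented rows mimicking the frames above.
df_titles = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "directors": ["nm1", "nm2"],
    "writers": [None, "nm1"],
})
df_people = pd.DataFrame({
    "nconst": ["nm1", "nm2"],
    "primaryName": ["Ada Example", "Bo Sample"],
})

# Series indexed by nconst, so .map() can translate ids to names.
name_by_id = df_people.set_index("nconst")["primaryName"]

df_titles["director"] = df_titles["directors"].map(name_by_id)
df_titles["writer"] = df_titles["writers"].map(name_by_id)
df_titles = df_titles.drop(columns=["directors", "writers"])
print(df_titles)
```

Ids that are missing or absent from the lookup come out as NaN, which matches the behaviour of the left merges.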


In [115]:
df_im.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,director,writer
0,tt0000001,short,Carmencita,William K.L. Dickson,
1,tt0000002,short,Le clown et ses chiens,Émile Reynaud,
2,tt0000003,short,Pauvre Pierrot,Émile Reynaud,
3,tt0000004,short,Un bon bock,Émile Reynaud,
4,tt0000005,short,Blacksmith Scene,William K.L. Dickson,
5,tt0000006,short,Chinese Opium Den,William K.L. Dickson,
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,William Heise,
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,William K.L. Dickson,
8,tt0000009,short,Miss Jerry,Alexander Black,Alexander Black
9,tt0000010,short,Leaving the Factory,Louis Lumière,


# Merge of the two datasets

In [120]:
df_main.shape 

(62423, 4)

In [119]:
len(set(df_main['title']).intersection(set(df_im['primaryTitle'])))

40916

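IMDb's dataset shares 40916 distinct titles with the MovieLens dataset. Against its 62423 rows that is roughly 65%, although the figure is approximate: the comparison uses exact title strings only, and a title can appear more than once on either side.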
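
The set intersection counts each distinct title once, whereas a per-row figure would count every MovieLens row whose title has a match. A small sketch of the difference on invented titles (`left` stands in for `df_main['title']`, `right` for `df_im['primaryTitle']`):

```python
import pandas as pd

# Invented titles; "Heat" appears twice on the left to show the difference.
left = pd.Series(["Heat", "Heat", "Jumanji", "Unknown Film"])
right = pd.Series(["Heat", "Jumanji", "Toy Story"])

distinct_matches = len(set(left) & set(right))  # counts each title once
row_matches = int(left.isin(right).sum())       # counts every matching row
row_coverage = float(left.isin(right).mean())   # share of rows with a match

print(distinct_matches, row_matches, row_coverage)
```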

In [123]:
prueba = pd.merge(df_main, df_im,
    how = 'inner',
    left_on=['title'], right_on=['primaryTitle']
    )

In [124]:
prueba.head(10)

Unnamed: 0,movieId,title,genres,year,tconst,titleType,primaryTitle,director,writer
0,1,Toy Story,"[adventure, animation, children, comedy, fantasy]",1995,tt0114709,movie,Toy Story,John Lasseter,John Lasseter
1,1,Toy Story,"[adventure, animation, children, comedy, fantasy]",1995,tt0847116,short,Toy Story,An Barbier,An Barbier
2,2,Jumanji,"[adventure, children, fantasy]",1995,tt0113497,movie,Jumanji,Joe Johnston,Jonathan Hensleigh
3,2,Jumanji,"[adventure, children, fantasy]",1995,tt14981590,short,Jumanji,Director Ali,
4,3,Grumpier Old Men,"[comedy, romance]",1995,tt0113228,movie,Grumpier Old Men,Howard Deutch,Mark Steven Johnson
5,4,Waiting to Exhale,"[comedy, drama, romance]",1995,tt0114885,movie,Waiting to Exhale,Forest Whitaker,Terry McMillan
6,5,Father of the Bride Part II,[comedy],1995,tt0113041,movie,Father of the Bride Part II,Charles Shyer,Albert Hackett
7,6,Heat,"[action, crime, thriller]",1995,tt0068688,movie,Heat,Paul Morrissey,John Hallowell
8,6,Heat,"[action, crime, thriller]",1995,tt0093164,movie,Heat,Jerry Jameson,William Goldman
9,6,Heat,"[action, crime, thriller]",1995,tt0113277,movie,Heat,Michael Mann,Michael Mann
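
The output above shows that matching on title alone is ambiguous: a single MovieLens row can pair with several IMDb entries that share its name. One possible way to narrow this down is to match on title and year together. The sketch assumes the `startYear` column of `title_basics.tsv` has been loaded and converted to a number, which the `read_csv` call in this notebook does not do. The frames are invented miniatures and their year values are illustrative:

```python
import pandas as pd

# Invented miniature versions of the two frames. 'startYear' is an assumption:
# it would have to be added to usecols and parsed as a number beforehand.
ml = pd.DataFrame({"movieId": [6], "title": ["Heat"], "year": [1995]})
imdb = pd.DataFrame({
    "tconst": ["tt0068688", "tt0093164", "tt0113277"],
    "primaryTitle": ["Heat", "Heat", "Heat"],
    "startYear": [1972, 1986, 1995],  # illustrative values
})

# Title only: one MovieLens row fans out to every IMDb entry named "Heat".
by_title = pd.merge(ml, imdb, left_on="title", right_on="primaryTitle")

# Title and year: only the entry from the same year remains.
by_title_year = pd.merge(
    ml, imdb,
    left_on=["title", "year"], right_on=["primaryTitle", "startYear"],
)
print(len(by_title), len(by_title_year))
```

Release years can differ by one between sources, so a strict equality on year may drop some genuine matches; that trade-off would need checking on the real data.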
