# Audible Data Science technical challenge

## Movie Rating Prediction using IMDB Datasets

Use the publicly available IMDB Datasets to build a model that predicts a movie’s average rating.

### 1. Data Gathering

Scrape the dataset links from the IMDb dataset page and store the downloaded files in the `data` folder.

In [None]:
import os
import requests

In [None]:
# !pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup

In [None]:
url = "https://datasets.imdbws.com"
folder = "data"

In [None]:
# Make the data directory if it doesn't exist
os.makedirs(folder, exist_ok=True)

# Send request
response = requests.get(url)
response.raise_for_status()  # Raise exception if invalid response

# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find and download all datasets
for link in soup.find_all('a'):
    href = link.get('href')
    # Skip anchors without an href attribute (link.get returns None for them)
    if href and href.endswith('.tsv.gz'):
        print(f"Downloading {href} ...")

        # Send request
        r = requests.get(href, stream=True)
        r.raise_for_status()  # Raise exception if invalid response

        # Download to file
        with open(os.path.join(folder, os.path.basename(href)), 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

print("Download completed!")

Downloading https://datasets.imdbws.com/name.basics.tsv.gz ...
Downloading https://datasets.imdbws.com/title.akas.tsv.gz ...
Downloading https://datasets.imdbws.com/title.basics.tsv.gz ...
Downloading https://datasets.imdbws.com/title.crew.tsv.gz ...
Downloading https://datasets.imdbws.com/title.episode.tsv.gz ...
Downloading https://datasets.imdbws.com/title.principals.tsv.gz ...
Downloading https://datasets.imdbws.com/title.ratings.tsv.gz ...
Download completed!
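
The loop above passes each `href` straight to `requests.get`, which works here because the page lists absolute URLs (see the output). If the page ever switched to relative links, the download would fail. A minimal sketch of a more defensive link filter follows; the helper name `resolve_dataset_url` and the constant `BASE_URL` are hypothetical and not part of the original notebook.

```python
from urllib.parse import urljoin

# Assumption: same index page as above, written with a trailing slash
BASE_URL = "https://datasets.imdbws.com/"


def resolve_dataset_url(href, base_url=BASE_URL):
    """Return an absolute URL for a dataset link, or None if it is not a .tsv.gz file."""
    if not href or not href.endswith(".tsv.gz"):
        return None
    # urljoin leaves absolute URLs unchanged and resolves relative ones against base_url
    return urljoin(base_url, href)


print(resolve_dataset_url("title.ratings.tsv.gz"))
print(resolve_dataset_url("https://datasets.imdbws.com/name.basics.tsv.gz"))
print(resolve_dataset_url(None))
```

The returned URL could then be passed to `requests.get` in place of the raw `href`.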


### 2. Data Preprocessing

In [None]:
import pandas as pd

import matplotlib.pyplot as plt

#### Load the data

> Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A `\N` is used to denote that a particular field is missing or null for that title/name.

Only a few of the datasets are loaded because of resource limitations.
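
To illustrate the `\N` convention described in the quote, here is a small sketch that parses an in-memory TSV with `na_values='\\N'`. The two rows are made up for illustration and are not real IMDb records.

```python
import io

import pandas as pd

# Two made-up rows in a layout similar to title.basics (not real IMDb records)
sample_tsv = (
    "tconst\tstartYear\truntimeMinutes\tgenres\n"
    "tt9000001\t1999\t\\N\tDrama\n"
    "tt9000002\t\\N\t95\t\\N\n"
)

# Every literal \N is read as a missing value (NaN)
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(sample.isnull().sum())
```

Because a column containing NaN cannot stay an integer column by default, `startYear` and `runtimeMinutes` are read as floats, which matches the `1894.0`-style values in the outputs of this notebook.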

In [None]:
title_basics = pd.read_csv('data/title.basics.tsv.gz', sep='\t', na_values='\\N')
title_basics.head()


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0.0,1894.0,,1.0,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0.0,1892.0,,5.0,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0.0,1892.0,,4.0,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0.0,1892.0,,12.0,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0.0,1893.0,,1.0,"Comedy,Short"


In [None]:
title_ratings = pd.read_csv('data/title.ratings.tsv.gz', sep='\t', na_values='\\N')
title_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1990
1,tt0000002,5.8,265
2,tt0000003,6.5,1851
3,tt0000004,5.5,178
4,tt0000005,6.2,2636


#### Merge datasets

In [None]:
titles = title_basics.merge(title_ratings, on='tconst')
titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0.0,1894.0,,1.0,"Documentary,Short",5.7,1990
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0.0,1892.0,,5.0,"Animation,Short",5.8,265
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0.0,1892.0,,4.0,"Animation,Comedy,Romance",6.5,1851
3,tt0000004,short,Un bon bock,Un bon bock,0.0,1892.0,,12.0,"Animation,Short",5.5,178
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0.0,1893.0,,1.0,"Comedy,Short",6.2,2636
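
Note that `merge` performs an inner join by default, so titles without a row in the ratings table are dropped. The sketch below uses toy frames with made-up ids to show this, and uses the `validate` and `indicator` options of `DataFrame.merge` as optional sanity checks; it is an illustration, not part of the original pipeline.

```python
import pandas as pd

# Toy frames with made-up ids; only the column names mirror the IMDb files
basics = pd.DataFrame({"tconst": ["tt01", "tt02", "tt03"],
                       "primaryTitle": ["A", "B", "C"]})
ratings = pd.DataFrame({"tconst": ["tt01", "tt03"],
                        "averageRating": [5.7, 6.5]})

# Default how="inner": only ids present in both frames survive.
# validate raises if tconst is duplicated on either side.
inner = basics.merge(ratings, on="tconst", validate="one_to_one")

# how="left" with indicator=True shows which titles have no rating
left = basics.merge(ratings, on="tconst", how="left", indicator=True)
unrated = left.loc[left["_merge"] == "left_only", "tconst"].tolist()

print(len(inner), unrated)
```

Since the target `averageRating` is required for training, dropping unrated titles through the inner join is a reasonable default here.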


#### Filter the relevant data

Checking the types of titles in the dataset

In [None]:
# Get counts of each title type
titles['titleType'].value_counts()

tvEpisode       656643
movie           294797
short           149223
tvSeries         88691
tvMovie          50938
video            50546
tvMiniSeries     15304
videoGame        14942
tvSpecial        11312
tvShort           2169
Name: titleType, dtype: int64

Not all titles are movies, so we keep only the rows where `titleType` is `'movie'`.

Assumption: we predict ratings for movies only, so `tvMovie` titles are excluded.

In [None]:
movies = titles[titles['titleType'] == 'movie']
movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
8,tt0000009,movie,Miss Jerry,Miss Jerry,0.0,1894.0,,45.0,Romance,5.3,206
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,,100.0,"Documentary,News,Sport",5.3,480
331,tt0000502,movie,Bohemios,Bohemios,0.0,1905.0,,100.0,,4.1,15
363,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,,70.0,"Action,Adventure,Biography",6.0,840
371,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0.0,1907.0,,90.0,Drama,4.4,20
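
If the assumption above were relaxed to include TV movies, the filter could be written with `isin`. This is a sketch on a toy frame with made-up ids; the list `wanted` is a hypothetical choice, not what the notebook uses.

```python
import pandas as pd

# Toy frame; the titleType values are taken from the counts above
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "tvMovie", "short", "movie"],
})

wanted = ["movie", "tvMovie"]  # assumption: TV movies count as movies

# .copy() makes the result an independent frame, so later column
# assignments do not trigger pandas' SettingWithCopyWarning
subset = toy[toy["titleType"].isin(wanted)].copy()

print(subset["tconst"].tolist())
```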


Checking for the type of features and null values

In [None]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 294797 entries, 8 to 1334560
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          294797 non-null  object 
 1   titleType       294797 non-null  object 
 2   primaryTitle    294797 non-null  object 
 3   originalTitle   294797 non-null  object 
 4   isAdult         294797 non-null  float64
 5   startYear       294761 non-null  float64
 6   endYear         0 non-null       float64
 7   runtimeMinutes  265751 non-null  object 
 8   genres          284666 non-null  object 
 9   averageRating   294797 non-null  float64
 10  numVotes        294797 non-null  int64  
dtypes: float64(4), int64(1), object(6)
memory usage: 27.0+ MB


#### Handling null values

In [None]:
movies.isnull().sum()

tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear             36
endYear           294797
runtimeMinutes     29046
genres             10131
averageRating          0
numVotes               0
dtype: int64

Missing values are present in four features
1. `startYear`
2. `endYear`
3. `runtimeMinutes`
4. `genres`

`startYear`: Only 36 values are missing in this column, so we fill them with the median of the remaining values.



In [None]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered slice
movies = movies.assign(startYear=movies['startYear'].fillna(movies['startYear'].median()))


`endYear`: This column has no non-null values at all. IMDb only populates it for TV series, so it carries no information for movies and we drop it.



In [None]:
movies = movies.drop(columns='endYear')

In [None]:
movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres,averageRating,numVotes
8,tt0000009,movie,Miss Jerry,Miss Jerry,0.0,1894.0,45.0,Romance,5.3,206
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,100.0,"Documentary,News,Sport",5.3,480
331,tt0000502,movie,Bohemios,Bohemios,0.0,1905.0,100.0,,4.1,15
363,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,70.0,"Action,Adventure,Biography",6.0,840
371,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0.0,1907.0,90.0,Drama,4.4,20


`runtimeMinutes`: Fill in the missing values with the median.

The column has `object` dtype, which suggests it contains some non-numeric values. We first convert it to a numeric type, turning unparseable entries into NaN, and then impute the median.

In [None]:
movies['runtimeMinutes'] = pd.to_numeric(movies['runtimeMinutes'], errors='coerce')
movies['runtimeMinutes'] = movies['runtimeMinutes'].fillna(movies['runtimeMinutes'].median())
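
To make the effect of `errors='coerce'` concrete, here is a small sketch on a hypothetical series. The values, including the stray text label, are invented for illustration; the actual non-numeric entries in the dataset were not inspected here.

```python
import pandas as pd

# Hypothetical raw values: numeric strings, a stray text label, and a missing entry
raw = pd.Series(["45", "100", "Reality-TV", None, "90"])

# Non-numeric strings become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Number of values that were present but could not be parsed
n_bad = int(numeric.isna().sum() - raw.isna().sum())

# median() skips NaN by default, so it is computed from 45, 90 and 100
filled = numeric.fillna(numeric.median())

print(n_bad, filled.tolist())
```

Counting the coerced values before imputing is a cheap check that the conversion is not silently discarding a large share of the column.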

`genres`: There are at least two strategies for handling the missing values in this column:
1. Fill them with a placeholder value like 'Unknown'
2. Fill them with the top 3 genres of the director of the movie

We go with the first option because it is simple to implement.

With more time and resources, I would choose the second option:
1. Identify the director of each movie: Load the `title.crew.tsv.gz` dataset, which contains the directors of each title.
2. For each director, identify their top 3 genres: Examine all the movies they have directed and the genres of those movies. We could create a DataFrame where each row corresponds to a director and the columns contain the count of each genre. Then, for each director, we can find the top 3 genres.
3. Impute the missing genres of a movie with the top 3 genres of its director: Once we have the top 3 genres of each director, we can use them to fill in the missing genres of the movies they have directed.
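
The three steps of the second option can be sketched as follows. This is only an illustration on toy data: the ids are made up, `movies_toy` and `crew_toy` are hypothetical stand-ins for the movie table and `title.crew`, and using the first listed director with ties broken alphabetically is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-ins for the movie table and title.crew (ids are made up)
movies_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "genres": ["Drama,Romance", "Drama", "Comedy", None],
})
crew_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "directors": ["nm1", "nm1", "nm2", "nm1"],
})

# 1. One row per (movie, director) pair
pairs = crew_toy.assign(director=crew_toy["directors"].str.split(",")).explode("director")
pairs = pairs[["tconst", "director"]].merge(movies_toy, on="tconst")

# 2. Count genres per director over the movies whose genres are known
known = pairs.dropna(subset=["genres"])
known = known.assign(genre=known["genres"].str.split(",")).explode("genre")
counts = known.groupby(["director", "genre"]).size().reset_index(name="n")
top3 = (counts.sort_values(["director", "n", "genre"], ascending=[True, False, True])
              .groupby("director").head(3)
              .groupby("director")["genre"].agg(",".join))

# 3. Fill missing genres from the first listed director, else fall back to 'Unknown'
first_director = pairs.drop_duplicates("tconst").set_index("tconst")["director"]
fallback = movies_toy["tconst"].map(first_director).map(top3)
movies_toy["genres"] = movies_toy["genres"].fillna(fallback).fillna("Unknown")

print(movies_toy)
```

Movies whose director has no other titles with known genres still end up as 'Unknown', so the placeholder from the first option remains the fallback.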

In [None]:
movies['genres'] = movies['genres'].fillna('Unknown')

In [None]:
movies.isnull().sum()

tconst            0
titleType         0
primaryTitle      0
originalTitle     0
isAdult           0
startYear         0
runtimeMinutes    0
genres            0
averageRating     0
numVotes          0
dtype: int64

#### One-hot encoding

Most machine learning models cannot work with categorical data directly; they require numerical inputs. We therefore need to encode `genres` in a numeric format.

The disadvantage of one-hot encoding is that it increases the dataset's size, because it adds one feature per distinct genre.

In [None]:
# Split the genres string into a list of strings
movies['genres'] = movies['genres'].apply(lambda x: x.split(","))

In [None]:
movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres,averageRating,numVotes
8,tt0000009,movie,Miss Jerry,Miss Jerry,0.0,1894.0,45.0,[Romance],5.3,206
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,100.0,"[Documentary, News, Sport]",5.3,480
331,tt0000502,movie,Bohemios,Bohemios,0.0,1905.0,100.0,[Unknown],4.1,15
363,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,70.0,"[Action, Adventure, Biography]",6.0,840
371,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0.0,1907.0,90.0,[Drama],4.4,20


Use `MultiLabelBinarizer` for one-hot encoding, since each movie can have several genres at once

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# One-hot encoding of genres
one_hot_genres = mlb.fit_transform(movies['genres'])

# Convert this array into a DataFrame
one_hot_genres_df = pd.DataFrame(one_hot_genres, columns=mlb.classes_)

one_hot_genres_df.head()

Unnamed: 0,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,...,News,Reality-TV,Romance,Sci-Fi,Sport,Talk-Show,Thriller,Unknown,War,Western
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
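
On the dataset-size concern raised before the encoding step: most entries of the encoded matrix are zero, so one possible mitigation is to ask `MultiLabelBinarizer` for a sparse result through its `sparse_output` parameter. The sketch below uses a small made-up sample of genre lists rather than the full dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Small made-up sample of already-split genre lists
genre_lists = [["Romance"], ["Documentary", "News", "Sport"], ["Unknown"], ["Drama"]]

mlb_sparse = MultiLabelBinarizer(sparse_output=True)
encoded = mlb_sparse.fit_transform(genre_lists)  # SciPy sparse matrix

print(encoded.shape)              # (rows, distinct genres)
print(encoded.nnz)                # number of stored non-zero entries
print(list(mlb_sparse.classes_))  # genre labels in sorted order
```

Many scikit-learn estimators accept sparse input directly, although the result would need to be converted before being concatenated with a regular pandas DataFrame as done in this notebook.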


In [None]:
# Convert this array into a DataFrame and rename the columns
one_hot_genres_df = pd.DataFrame(one_hot_genres, columns=['genre_'+str(cls) for cls in mlb.classes_])
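
As an alternative sketch, pandas can produce the same kind of prefixed indicator columns directly from the comma-separated strings with `Series.str.get_dummies`, without splitting into lists first. The three values below are taken from the genre strings shown earlier; this is not the approach the notebook uses.

```python
import pandas as pd

# Comma-separated genre strings, as they appear before the split step
genres = pd.Series(["Romance", "Documentary,News,Sport", "Unknown"])

# One indicator column per distinct genre, prefixed like the columns above
dummies = genres.str.get_dummies(sep=",").add_prefix("genre_")

print(dummies.columns.tolist())
```

The result keeps the index of the input series, which sidesteps the index reset needed when building a DataFrame from a NumPy array.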

In [None]:
one_hot_genres_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294797 entries, 0 to 294796
Data columns (total 27 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   genre_Action       294797 non-null  int64
 1   genre_Adult        294797 non-null  int64
 2   genre_Adventure    294797 non-null  int64
 3   genre_Animation    294797 non-null  int64
 4   genre_Biography    294797 non-null  int64
 5   genre_Comedy       294797 non-null  int64
 6   genre_Crime        294797 non-null  int64
 7   genre_Documentary  294797 non-null  int64
 8   genre_Drama        294797 non-null  int64
 9   genre_Family       294797 non-null  int64
 10  genre_Fantasy      294797 non-null  int64
 11  genre_Film-Noir    294797 non-null  int64
 12  genre_History      294797 non-null  int64
 13  genre_Horror       294797 non-null  int64
 14  genre_Music        294797 non-null  int64
 15  genre_Musical      294797 non-null  int64
 16  genre_Mystery      294797 non-null  in

In [None]:
# Reset the index of movies so its rows align with one_hot_genres_df, then concatenate column-wise
movies.reset_index(drop=True, inplace=True)
movies = pd.concat([movies, one_hot_genres_df], axis=1)

movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres,averageRating,numVotes,...,genre_News,genre_Reality-TV,genre_Romance,genre_Sci-Fi,genre_Sport,genre_Talk-Show,genre_Thriller,genre_Unknown,genre_War,genre_Western
0,tt0000009,movie,Miss Jerry,Miss Jerry,0.0,1894.0,45.0,[Romance],5.3,206,...,0,0,1,0,0,0,0,0,0,0
1,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,100.0,"[Documentary, News, Sport]",5.3,480,...,1,0,0,0,1,0,0,0,0,0
2,tt0000502,movie,Bohemios,Bohemios,0.0,1905.0,100.0,[Unknown],4.1,15,...,0,0,0,0,0,0,0,1,0,0
3,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,70.0,"[Action, Adventure, Biography]",6.0,840,...,0,0,0,0,0,0,0,0,0,0
4,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0.0,1907.0,90.0,[Drama],4.4,20,...,0,0,0,0,0,0,0,0,0,0
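
The index reset matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. A minimal sketch with two toy frames shows what would happen without it; the frame names `left` and `right` are hypothetical.

```python
import pandas as pd

# A filtered frame keeps its original row labels (8 and 144, as in the earlier output)
left = pd.DataFrame({"primaryTitle": ["Miss Jerry", "The Corbett-Fitzsimmons Fight"]},
                    index=[8, 144])
# A frame built from a NumPy array or list gets a fresh 0..n-1 index
right = pd.DataFrame({"genre_Romance": [1, 0]})

# Without a reset, concat takes the union of labels and fills the gaps with NaN
misaligned = pd.concat([left, right], axis=1)

# After a reset, both frames share labels 0..n-1 and rows match by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```

Comparing the row count before and after the concatenation is a quick way to catch this kind of misalignment.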


In [None]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294797 entries, 0 to 294796
Data columns (total 37 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   tconst             294797 non-null  object 
 1   titleType          294797 non-null  object 
 2   primaryTitle       294797 non-null  object 
 3   originalTitle      294797 non-null  object 
 4   isAdult            294797 non-null  float64
 5   startYear          294797 non-null  float64
 6   runtimeMinutes     294797 non-null  float64
 7   genres             294797 non-null  object 
 8   averageRating      294797 non-null  float64
 9   numVotes           294797 non-null  int64  
 10  genre_Action       294797 non-null  int64  
 11  genre_Adult        294797 non-null  int64  
 12  genre_Adventure    294797 non-null  int64  
 13  genre_Animation    294797 non-null  int64  
 14  genre_Biography    294797 non-null  int64  
 15  genre_Comedy       294797 non-null  int64  
 16  ge

#### Dimensionality reduction

For now, we only remove the redundant `genres` column, whose information is already captured by the one-hot columns.

With more time and resources, I would apply Principal Component Analysis (PCA) to reduce the number of features, which may improve model performance.
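
As a rough sketch of what the PCA step could look like, the example below reduces a small made-up binary genre matrix to two components with scikit-learn. The matrix `X` and the choice of `n_components=2` are assumptions for illustration; on the real data the number of components would need to be chosen from the explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up binary genre matrix: 6 movies x 4 genre columns
X = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # PCA centers the columns internally

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```

Whether this helps depends on the model: the components are harder to interpret than named genre columns, and tree-based models often cope well with the original indicators.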

In [None]:
movies = movies.drop(columns='genres')

In [None]:
movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,averageRating,numVotes,genre_Action,...,genre_News,genre_Reality-TV,genre_Romance,genre_Sci-Fi,genre_Sport,genre_Talk-Show,genre_Thriller,genre_Unknown,genre_War,genre_Western
0,tt0000009,movie,Miss Jerry,Miss Jerry,0.0,1894.0,45.0,5.3,206,0,...,0,0,1,0,0,0,0,0,0,0
1,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0.0,1897.0,100.0,5.3,480,0,...,1,0,0,0,1,0,0,0,0,0
2,tt0000502,movie,Bohemios,Bohemios,0.0,1905.0,100.0,4.1,15,0,...,0,0,0,0,0,0,0,1,0,0
3,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0.0,1906.0,70.0,6.0,840,1,...,0,0,0,0,0,0,0,0,0,0
4,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0.0,1907.0,90.0,4.4,20,0,...,0,0,0,0,0,0,0,0,0,0
