# Movie Rating Prediction

In [805]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import *

import locale

## Importing the dataset and basic EDA

In [806]:
df = pd.read_csv("IMDb Movies India.csv", encoding=locale.getpreferredencoding())
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


In [807]:
df.shape

(15509, 10)

In [808]:
# Duplicated entries
df.duplicated().sum()

6

In [809]:
# Removing duplicates
df = df.drop_duplicates()
df.shape

(15503, 10)

In [810]:
# Attributes with null values
df.isnull().any()

Name        False
Year         True
Duration     True
Genre        True
Rating       True
Votes        True
Director     True
Actor 1      True
Actor 2      True
Actor 3      True
dtype: bool

In [811]:
# Null values
df.isnull().sum()

Name           0
Year         527
Duration    8264
Genre       1876
Rating      7584
Votes       7583
Director     524
Actor 1     1615
Actor 2     2381
Actor 3     3140
dtype: int64

There are a lot of null values. Filling in this many null entries could mislead our predictions, so we keep them as they are for now.

## Preprocessing

While exploring the data, three attributes caught my attention: Actor 1, Actor 2 and Actor 3. <br>
The same actor may appear under "Actor 1" in one entry and under "Actor 2" in another. <br>
Keeping them split across three attributes makes it hard to analyze actors effectively. <br>
This suggests that the actors should be brought into a single column. <br>

Let's try concatenating the actors into one comma-separated string.

**Approach 1: String Concatenation with '+' operator**

In [812]:
df['Actor 1'] + ", " + df['Actor 2'] + ", " + df['Actor 3']

0                     Manmauji, Birbal, Rajendra Bhatia
1           Rasika Dugal, Vivek Ghamande, Arvind Jangid
2           Sayani Gupta, Plabita Borthakur, Roy Angana
3                  Prateik, Ishita Raj, Siddhant Kapoor
4         Rajat Kapoor, Rituparna Sengupta, Antara Mali
                              ...                      
15504    Naseeruddin Shah, Sumeet Saigal, Suparna Anand
15505         Akshay Kumar, Twinkle Khanna, Aruna Irani
15506                                               NaN
15507                                               NaN
15508               Dharmendra, Jaya Prada, Arjun Sarja
Length: 15503, dtype: object

<br><br>
The above method works, but it yields NaN whenever even one actor column is NaN. <br>
That throws away the actors we do know. We need a strategy that keeps every non-null actor. <br>
So we must check whether each entry is null before concatenating. How about the method below? <br>

**Approach 2: Joining List of non-null Actors with Delimiter**

In [813]:
def concatenate_actors(row):
    actors = [actor for actor in row if pd.notnull(actor)]
    return ', '.join(actors)
df['Actors'] = df[['Actor 1', 'Actor 2','Actor 3']].apply(concatenate_actors, axis=1)
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3,Actors
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia,"Manmauji, Birbal, Rajendra Bhatia"
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid,"Rasika Dugal, Vivek Ghamande, Arvind Jangid"
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana,"Sayani Gupta, Plabita Borthakur, Roy Angana"
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor,"Prateik, Ishita Raj, Siddhant Kapoor"
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali,"Rajat Kapoor, Rituparna Sengupta, Antara Mali"


<br><br>Yes, this method seems to do exactly what we expected. <br>
For example, row 15506 (not visible in the df.head() output above) shows only "Sangeeta Tiwari", which means only one of its actor entries is non-null. <br>
But this cannot be fed into a model as is. The model would treat the whole string as a single entity. <br>
It would not know that the string contains up to three names, and it would fail to match the same person appearing in another row. <br>
So we have to represent each actor as a separate entity. <br>

**Approach 1: One-Hot Encoding**

In [814]:
mySet = set()
for row in df['Actors']:
    for actor in row.split(', '):
        mySet.add(actor)

In [815]:
len(mySet)

10289

<br>So, this shows that there are 10289 unique actors in the dataset.<br>
One-hot encoding would require 10289 separate attributes, one per actor, which is impractical. <br>
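
To make the column growth concrete, here is a minimal sketch of multi-label one-hot encoding on a toy cast list. The actor names "A" to "D" are made up for illustration, and MultiLabelBinarizer is one possible tool for this, not something the notebook itself uses.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy cast lists (hypothetical names, not taken from the dataset)
casts = [["A", "B", "C"], ["B", "D"], ["A"]]

mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform(casts)

# One column per distinct actor: 4 columns here,
# but roughly 10289 columns on the real data
print(list(mlb.classes_))
print(one_hot.shape)
```

Every new actor adds a column, so the width of the matrix grows with the number of distinct names rather than with the number of rows.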

**Approach 2: Label Encoding** <br>
After ruling out one-hot encoding, many would expect LabelEncoder to make the job easier, because it does not create any new columns. <br>
But it has its own problems here. <br>
- LabelEncoder assigns one integer to each of the 10289 actors. These integers imply an ordering between actors that has no real meaning, and a model can misread it as magnitude.
- Also, each row would still yield up to three numbers, one per actor, so the same actor could still land in different columns, which is the issue we started with.

So, Label Encoding is also not a good idea here.
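
A small sketch of what the integer codes look like may help here. It uses three actor names from the rows shown earlier, and the point is only that the codes follow sorted order of the strings, which says nothing about the actors themselves.

```python
from sklearn.preprocessing import LabelEncoder

actors = ["Prateik", "Birbal", "Manmauji"]

encoder_demo = LabelEncoder()
codes = encoder_demo.fit_transform(actors)

# Codes follow the sorted order of the names:
# Birbal -> 0, Manmauji -> 1, Prateik -> 2
print(dict(zip(actors, codes.tolist())))
```

A linear model would treat "Prateik" (2) as twice "Manmauji" (1), which is an artifact of the encoding.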

<br>

**Approach 3: Feature Hashing** <br>
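This approach uses hashing to map each actor name into a fixed number of columns (10 here), which keeps the dimensionality small at the cost of possible collisions between different names.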

In [816]:
hasher = FeatureHasher(n_features=10, input_type='string')
hashed_actors = hasher.transform([x.split(', ') for x in df['Actors']])
actors_df = pd.DataFrame(hashed_actors.toarray())
actors_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
1,0.0,0.0,1.0,0.0,1.0,-1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-1.0,0.0,-1.0
3,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0


In [817]:
actors_df.shape

(15503, 10)
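
To see what the hasher does on a small scale, the sketch below runs the same FeatureHasher settings on three toy rows. The names "Actor A", "Actor B" and "Actor C" are placeholders. It illustrates two properties: the output width is fixed at n_features, and the same set of names always produces the same vector, whatever their order.

```python
from sklearn.feature_extraction import FeatureHasher

demo_hasher = FeatureHasher(n_features=10, input_type="string")

# Placeholder names; rows 0 and 1 hold the same actors in a different order
rows = [["Actor A", "Actor B"], ["Actor B", "Actor A"], ["Actor C"]]
hashed = demo_hasher.transform(rows).toarray()

print(hashed.shape)                       # width is always n_features
print((hashed[0] == hashed[1]).all())     # order of names does not matter
print(hashed[2])                          # a single name sets one bucket to +1 or -1
```

With only 10 buckets and thousands of names, different actors will inevitably share a bucket, so a larger n_features may be worth trying.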

Before proceeding further, the missing values must be filled.

<br>**Handling Missing Values and Preprocessing** <br>

"Year" attribute:

In [818]:
df.isnull().sum()

Name           0
Year         527
Duration    8264
Genre       1876
Rating      7584
Votes       7583
Director     524
Actor 1     1615
Actor 2     2381
Actor 3     3140
Actors         0
dtype: int64

In [819]:
# Retrieving non-null values of year
not_null = df['Year'][df['Year'].notnull()]

# Removing the open and close brackets
remove_brackets = not_null.str.replace('(', '', regex=False).str.replace(')', '', regex=False)

# Converting year: string into int
int_years = remove_brackets.astype(np.int32)

# Computing mean year
mean_year = int(int_years.mean())
mean_year

1987

In [820]:
# Filling mean year in place of null

df['Year'] = df['Year'].str.replace('(', '', regex=False).str.replace(')', '', regex=False)
df['Year'] = df['Year'].fillna(mean_year)
df['Year'] = df['Year'].astype(np.int32)
df['Year']

0        1987
1        2019
2        2021
3        2019
4        2010
         ... 
15504    1988
15505    1999
15506    2005
15507    1988
15508    1998
Name: Year, Length: 15503, dtype: int32
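
As an alternative sketch, the bracket removal, conversion and mean fill can be done in one pass with str.extract and pd.to_numeric. The three sample values below are toy inputs in the same "(YYYY)" shape as the column, and the variable names are only for illustration.

```python
import pandas as pd

# Toy values in the same "(YYYY)" format as the Year column
years = pd.Series(["(2019)", None, "(2010)"])

# Pull out the four digits, then convert; missing entries stay NaN
parsed = pd.to_numeric(years.str.extract(r"(\d{4})", expand=False))

# Fill the gaps with the truncated mean, as done above
filled = parsed.fillna(int(parsed.mean())).astype("int32")
print(filled.tolist())
```

This avoids cleaning the column twice, once for the mean and once for the final values.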

<br>Preprocessing the "Duration" attribute:

In [821]:
df['Duration']

0            NaN
1        109 min
2         90 min
3        110 min
4        105 min
          ...   
15504        NaN
15505    129 min
15506        NaN
15507        NaN
15508    130 min
Name: Duration, Length: 15503, dtype: object

In [822]:
replaced_duration = df['Duration'].str.replace(" min", "")
replaced_duration

0        NaN
1        109
2         90
3        110
4        105
        ... 
15504    NaN
15505    129
15506    NaN
15507    NaN
15508    130
Name: Duration, Length: 15503, dtype: object

In [823]:
mean_duration = int(replaced_duration[replaced_duration.notnull()].astype(np.int32).mean())
mean_duration

128

In [824]:
df['Duration'] = replaced_duration.fillna(mean_duration).astype(np.int32)
df['Duration']

0        128
1        109
2         90
3        110
4        105
        ... 
15504    128
15505    129
15506    128
15507    128
15508    130
Name: Duration, Length: 15503, dtype: int32

<br>Preprocessing the "Genre" attribute

In [825]:
df['Genre'].isnull().sum()

1876

In [826]:
df['Genre']

0                  Drama
1                  Drama
2         Drama, Musical
3        Comedy, Romance
4                  Drama
              ...       
15504             Action
15505      Action, Drama
15506             Action
15507             Action
15508      Action, Drama
Name: Genre, Length: 15503, dtype: object

<br>As we can see, "Genre" needs much the same treatment as "Actors". <br>
An entry can hold multiple values, the number of values varies from entry to entry, and some entries are NaN. <br>
So, let's replace NaN with an empty string and use the FeatureHasher approach here as well.

In [827]:
df["Genre"] = df['Genre'].fillna("")

In [828]:
genre_hasher = FeatureHasher(n_features = 10, input_type = "string")
genre_df = pd.DataFrame(genre_hasher.transform(df['Genre'].str.split(", ")).toarray())
genre_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-1.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [829]:
genre_df.shape

(15503, 10)

<br>Preprocessing the "Rating" attribute

In [830]:
df['Rating']

0        NaN
1        7.0
2        NaN
3        4.4
4        NaN
        ... 
15504    4.6
15505    4.5
15506    NaN
15507    NaN
15508    6.2
Name: Rating, Length: 15503, dtype: float64

In [831]:
mean_rating = df['Rating'][df['Rating'].notnull()].mean().round(1)
mean_rating

5.8

In [832]:
df["Rating"] = df['Rating'].fillna(mean_rating)
df['Rating']

0        5.8
1        7.0
2        5.8
3        4.4
4        5.8
        ... 
15504    4.6
15505    4.5
15506    5.8
15507    5.8
15508    6.2
Name: Rating, Length: 15503, dtype: float64

<br>Preprocessing the "Votes" attribute

In [833]:
df['Votes']

0        NaN
1          8
2        NaN
3         35
4        NaN
        ... 
15504     11
15505    655
15506    NaN
15507    NaN
15508     20
Name: Votes, Length: 15503, dtype: object

In [834]:
notnull_votes = df['Votes'][df['Votes'].notnull()]
notnull_votes

1            8
3           35
5          827
6        1,086
8          326
         ...  
15501      135
15503       44
15504       11
15505      655
15508       20
Name: Votes, Length: 7920, dtype: object

<br>This cannot be converted directly to int32, because the commas would raise errors. <br>

In [835]:
notnull_votes = notnull_votes.str.replace(",","")
notnull_votes

1           8
3          35
5         827
6        1086
8         326
         ... 
15501     135
15503      44
15504      11
15505     655
15508      20
Name: Votes, Length: 7920, dtype: object

<br>Converting this to integer still raises an error because of a dollar sign and the letter "M". <br>
The $ can be replaced with an empty string, while "M" is replaced with a string of six zeros, because Million (M) = 1000000.

In [836]:
notnull_votes = notnull_votes.str.replace('$', "", regex=False).str.replace("M", "000000", regex=False)
notnull_votes

1           8
3          35
5         827
6        1086
8         326
         ... 
15501     135
15503      44
15504      11
15505     655
15508      20
Name: Votes, Length: 7920, dtype: object

<br>The values still cannot be converted to integers because one of them contains a decimal point. <br>
A fractional value makes no sense for an attribute like "Votes", so we keep only the integer part. <br>

In [837]:
notnull_votes = notnull_votes.str.split('.').str[0]
notnull_votes

1           8
3          35
5         827
6        1086
8         326
         ... 
15501     135
15503      44
15504      11
15505     655
15508      20
Name: Votes, Length: 7920, dtype: object

In [838]:
notnull_votes = notnull_votes.astype(np.int32)
notnull_votes

1           8
3          35
5         827
6        1086
8         326
         ... 
15501     135
15503      44
15504      11
15505     655
15508      20
Name: Votes, Length: 7920, dtype: int32

<br>Finally, the votes are converted into integers!
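
One caveat with the string replacements above: a value shaped like "5.16M" becomes "5.16000000" after the "M" substitution and is then cut down to 5 by the split on ".". The sketch below shows a numeric alternative that scales by one million instead. The helper name parse_votes and the sample "$5.16M" are hypothetical, chosen only to show the idea.

```python
import pandas as pd

def parse_votes(raw):
    """Convert a raw vote string to an int, or return None if it is missing."""
    if pd.isnull(raw):
        return None
    # Strip thousands separators and the currency sign
    text = str(raw).replace(",", "").replace("$", "").strip()
    multiplier = 1
    if text.endswith("M"):
        # "M" means millions, so scale numerically instead of appending zeros
        multiplier = 1_000_000
        text = text[:-1]
    return int(round(float(text) * multiplier))

# Hypothetical samples covering each case
samples = ["1,086", "35", "$5.16M", None]
print([parse_votes(s) for s in samples])
```

Applied with df['Votes'].map(parse_votes), this would keep such an entry in the millions; whether that value is a genuine vote count or a data error is a separate question.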

In [839]:
# Computing the mean value
notnull_votes.mean()

1938.2762626262627

In [840]:
# Computing the median value
notnull_votes.median()

55.0

As we can see, the mean is far higher than the median, which points to outliers. <br>
The mean is sensitive to outliers, while the median is not. <br>
So we use the median to fill the missing values.
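
A tiny worked example shows how strongly one extreme value pulls the mean. The numbers below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Five ordinary vote counts plus one extreme value (toy numbers)
votes = np.array([8, 35, 44, 55, 135, 1_000_000])

print(votes.mean())      # dragged into the hundreds of thousands
print(np.median(votes))  # stays among the ordinary values: 49.5
```

The median depends only on the middle of the sorted values, so a single huge entry does not move it.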

In [841]:
df['Votes'] = df["Votes"].replace(["NaN", "nan"], np.nan)  # Here some null were not in proper form
df['Votes']

0        NaN
1          8
2        NaN
3         35
4        NaN
        ... 
15504     11
15505    655
15506    NaN
15507    NaN
15508     20
Name: Votes, Length: 15503, dtype: object

In [842]:
df["Votes"] = df['Votes'].str.replace(",","")

In [843]:
df['Votes'] = df['Votes'].fillna(int(notnull_votes.median())).astype('str')

In [844]:
df["Votes"] = df['Votes'].str.replace("$", "", regex=False).str.replace("M", "000000", regex=False)

In [845]:
df["Votes"] = df["Votes"].str.split(".").str[0]

In [846]:
df["Votes"] = df["Votes"].astype(np.int32)
df["Votes"]

0         55
1          8
2         55
3         35
4         55
        ... 
15504     11
15505    655
15506     55
15507     55
15508     20
Name: Votes, Length: 15503, dtype: int32

<br>Preprocessing the attribute "Director"

In [847]:
df["Director"].isnull().sum()

524

In [848]:
# Filling in an arbitrary director name would be inappropriate; use an empty string instead
df["Director"] = df["Director"].fillna("")

<br>Concatenating Genre Features (10 features) and Actors features (10 features) to the data

In [849]:
preprocessed_df = pd.concat([df, genre_df, actors_df], axis=1).drop(["Actor 1", "Actor 2", "Actor 3", "Actors", "Genre"],axis=1)
preprocessed_df

Unnamed: 0,Name,Year,Duration,Rating,Votes,Director,0,1,2,3,...,0.1,1.1,2.1,3.1,4,5,6,7,8,9
0,,1987.0,128,5.8,55.0,J.S. Randhawa,0.0,0.0,0.0,0.0,...,1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
1,#Gadhvi (He thought he was Gandhi),2019.0,109,7.0,8.0,Gaurav Bakshi,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,-1.0,0.0,0.0,0.0,0.0
2,#Homecoming,2021.0,90,5.8,55.0,Soumyajit Majumdar,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-1.0,0.0,-1.0
3,#Yaaram,2019.0,110,4.4,35.0,Ovais Khan,0.0,0.0,0.0,0.0,...,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
4,...And Once Again,2010.0,105,5.8,55.0,Amol Palekar,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1769,,,,,,,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4723,,,,,,,0.0,0.0,0.0,1.0,...,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
9713,,,,,,,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,-2.0,0.0,0.0,0.0,1.0,0.0,0.0
13069,,,,,,,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
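
The trailing rows full of NaN in the output above come from index alignment: pd.concat with axis=1 matches rows by index label, and df still carries its original labels after drop_duplicates, while genre_df and actors_df are numbered 0 to 15502. The toy sketch below reproduces the effect and shows one possible fix, resetting the index before concatenating. The frame and column names are hypothetical.

```python
import pandas as pd

# Dropping the duplicate "b" leaves labels 0, 1, 3 (label 2 is gone)
left = pd.DataFrame({"name": ["a", "b", "b", "c"]}).drop_duplicates()

# Features built positionally get fresh labels 0, 1, 2
right = pd.DataFrame({"feat": [10, 20, 30]})

# Label-based alignment: "c" (label 3) gets no feature, label 2 gets no name
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Resetting the index first pairs rows by position instead
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned)
```

In the misaligned frame, every row after a dropped label is paired with features computed for a different row, so resetting the index right after drop_duplicates is worth considering.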


## Train-test Split

In [850]:
encoder = LabelEncoder()
preprocessed_df["Name"] = encoder.fit_transform(preprocessed_df["Name"])

In [851]:
# Director has always only one entry
df["Director"].str.split(",").str.len().unique()

array([1], dtype=int64)

In [852]:
preprocessed_df["Director"] = encoder.fit_transform(preprocessed_df["Director"])

In [853]:
preprocessed_df = preprocessed_df.reset_index()[:15503] # Keep the 15503 real rows; drop the extra rows concat appended for the removed duplicates' labels

In [854]:
preprocessed_df

Unnamed: 0,index,Name,Year,Duration,Rating,Votes,Director,0,1,2,...,0.1,1.1,2.1,3,4,5,6,7,8,9
0,0,0,1987.0,128,5.8,55.0,1927,0.0,0.0,0.0,...,1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
1,1,1,2019.0,109,7.0,8.0,1549,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,-1.0,0.0,0.0,0.0,0.0
2,2,2,2021.0,90,5.8,55.0,5124,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-1.0,0.0,-1.0
3,3,3,2019.0,110,4.4,35.0,3320,0.0,0.0,0.0,...,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
4,4,7,2010.0,105,5.8,55.0,386,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15498,15504,13832,1988.0,128,4.6,11.0,2691,,,,...,,,,,,,,,,
15499,15505,13834,1999.0,129,4.5,655.0,2500,,,,...,,,,,,,,,,
15500,15506,13835,2005.0,128,5.8,55.0,2425,,,,...,,,,,,,,,,
15501,15507,13836,1988.0,128,5.8,55.0,0,,,,...,,,,,,,,,,


In [855]:
preprocessed_df.columns = ["index", "Name", "Year", "Duration", "Rating", "Votes", "Director"] + ["Genre"+str(i) for i in range(10)] + ["Actors"+str(i) for i in range(10)]

In [856]:
preprocessed_df = preprocessed_df.dropna()
preprocessed_df = preprocessed_df.drop("index", axis=1)

In [857]:
# Splitting target and other attributes
X = preprocessed_df.drop("Rating", axis=1)
y = preprocessed_df["Rating"]

In [858]:
# Train test splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [859]:
X_train.shape

(12397, 25)

In [860]:
y_train.shape

(12397,)

In [861]:
X_test.shape

(3100, 25)

In [862]:
y_test.shape

(3100,)

## Modeling

In [863]:
X_train

Unnamed: 0,Name,Year,Duration,Votes,Director,Genre0,Genre1,Genre2,Genre3,Genre4,...,Actors0,Actors1,Actors2,Actors3,Actors4,Actors5,Actors6,Actors7,Actors8,Actors9
5353,4755,1955.0,128,55.0,3350,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-1.0,1.0,0.0,0.0,0.0,-1.0,0.0
15322,13694,2008.0,151,4147.0,5177,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,0.0
335,315,1989.0,149,53.0,666,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,-1.0,0.0
8304,7434,2019.0,128,55.0,1748,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,1.0,-1.0,0.0
5400,4796,1990.0,128,55.0,3159,0.0,0.0,-1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5191,4624,1953.0,128,55.0,44,0.0,0.0,-1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13418,11931,2004.0,174,203.0,2485,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
5390,4787,1999.0,128,6.0,4464,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,-1.0,0.0,1.0,1.0,0.0,0.0,0.0
860,767,2021.0,128,55.0,4625,0.0,1.0,0.0,0.0,0.0,...,1.0,-1.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0


In [864]:
# Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

0.9700142643993769

In [865]:
# Repeated random train-test splits (a stand-in for K-Fold)
def cross_valid(times, X, y):
    mse_scores = []
    for i in range(times):
        lin_reg = LinearRegression()
        # A different random_state gives a different split in every iteration
        X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X, y, test_size=0.2, random_state=i)
        lin_reg.fit(X_train_sample, y_train_sample)
        y_pred_sample = lin_reg.predict(X_test_sample)
        mse_scores.append(mean_squared_error(y_test_sample, y_pred_sample))
    return mse_scores

In [866]:
kfold = np.array(cross_valid(10,X,y))
kfold

array([0.98481926, 0.93764461, 0.89812089, 0.99715571, 0.96079152,
       0.98637184, 0.93330949, 0.89952593, 0.98097447, 0.97619919])

In [868]:
print(f"Mean MSE of Linear Regression (10 random splits): {kfold.mean()}")

Mean MSE of Linear Regression (10 random splits): 0.9554912912450515


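<br>As you can notice, the function above changes the random_state for every iteration, so strictly speaking it performs repeated random train-test splits rather than true K-Fold. <br>
This way, the training set and testing set chosen in every iteration are different. <br>
The metric is mean squared error, so lower is better: the mean MSE of about 0.96 across the 10 splits (an RMSE of just under 1 rating point) stays close to the single-split value of 0.97, so the model's error is stable across splits. <br>
Keep in mind that the missing ratings were filled with the mean of 5.8 before training, so this error is measured partly against imputed targets. <br>
Thus, a baseline model for Movie Rating Prediction was built.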
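
For comparison, here is a minimal sketch of true K-Fold evaluation using KFold and cross_val_score, both of which are already imported at the top of the notebook. It runs on synthetic data so that it is self-contained; X_demo and y_demo are stand-ins, and on the real data the notebook's X and y would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; substitute the notebook's X and y)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# 10 folds: every sample lands in the test set exactly once
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MSE is exposed as its negative
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kf, scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse

print(mse_per_fold)
print(mse_per_fold.mean())
```

Unlike repeated random splits, the folds here do not overlap, so each row contributes to the test error exactly once.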