# Project D10: KAGGLE - Movie Ratings

Authors:
- Kevin Kliimask
- Jens Jäger
- Taavi Eistre

In this notebook we are going to analyze the dataset and train regression models.

## Importing the data

We will start off by importing all the necessary packages and the data.
After inspecting the data manually, we found 6 broken rows with misaligned columns, so we skip them when reading the file.
We load only the columns selected in the report.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import math
import ast
%matplotlib inline

data = pd.read_csv('movies_metadata.csv', skiprows=[19730, 19731, 29503, 29504, 35587, 35588],
                   usecols=['original_title', 'original_language', 'genres', 'production_companies',
                            'production_countries', 'runtime', 'revenue', 'release_date', 'vote_average'])
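
As a small illustration of how `skiprows` and `usecols` interact, here is a sketch on a made-up in-memory CSV (the `csv_text` content and the `toy` name are assumptions for the example, not part of the dataset). With a list argument, `skiprows` takes 0-indexed line numbers of the file, where line 0 is the header.

```python
import io
import pandas as pd

# Made-up CSV text: line 0 is the header, line 2 is a broken row.
csv_text = "title,runtime,note\nA,90,x\nBROKEN ROW\nB,100,y\n"

# Skip file line 2 and keep only two of the three columns.
toy = pd.read_csv(io.StringIO(csv_text), skiprows=[2], usecols=['title', 'runtime'])
print(toy)
```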

## Preprocessing the data

First off we'll remove all the duplicate rows (movies).

In [None]:
print(f'The number of movies before removing the duplicates: {len(data)}')
data = data.drop_duplicates(ignore_index=True)

print(f'The number of movies after removing the duplicates: {len(data)}')

data.head()

Next, we convert the **'genres'**, **'production_companies'** and **'production_countries'** columns from their **JSON**-like string format to lists of names with the help of `ast.literal_eval()`,
which parses each string into the corresponding Python object.

In [None]:
data['genres'] = data['genres'].apply(lambda genres_list: [genres['name'] for genres in ast.literal_eval(genres_list)])

data['production_companies'] = data['production_companies'].apply(lambda companies_list: [companies['name'] for companies in ast.literal_eval(companies_list)])

data['production_countries'] = data['production_countries'].apply(lambda countries_list: [countries['name'] for countries in ast.literal_eval(countries_list)])

data.head()
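
The reason for `ast.literal_eval()` rather than a JSON parser can be sketched on a single made-up cell value (the `raw` string below is a hypothetical example written in the style of these columns). Strings that use single quotes are valid Python literals but not valid JSON.

```python
import ast
import json

# Hypothetical cell value: a list of dicts with single-quoted strings.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so parsing fails here.
    parsed_as_json = False

# literal_eval only evaluates Python literals, never arbitrary code.
names = [entry['name'] for entry in ast.literal_eval(raw)]
print(parsed_as_json, names)
```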

Next we will remove all the movies that have:
- No genres
- No production countries
- No production companies
- 0 runtime
- 0 vote_average
- NaN values

In [None]:
print(f'The number of movies before removing all of the mentioned movies above: {len(data)}')

data = data.dropna()
data = data[(data.runtime > 0) & (data.vote_average > 0) & (data.genres.str.len() > 0) & (data.production_companies.str.len() > 0) & (data.production_countries.str.len() > 0)]
data = data.reset_index(drop=True)

print(f'The number of movies after removing all of the mentioned movies above: {len(data)}')

Finding the **total number** and **frequency** of genres, companies and countries.

In [None]:
genres_dict = {}
companies_dict = {}
countries_dict = {}

for index, row in data.iterrows():
    for genre in row['genres']:
        genres_dict[genre] = genres_dict.get(genre, 0) + 1

    for company in row['production_companies']:
        companies_dict[company] = companies_dict.get(company, 0) + 1

    for country in row['production_countries']:
        countries_dict[country] = countries_dict.get(country, 0) + 1

In [None]:
print(f'The number of genres: {len(genres_dict)}')
print(f'The number of production companies: {len(companies_dict)}')
print(f'The number of production countries: {len(countries_dict)}')
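
The same counting could also be done with `collections.Counter`. The sketch below uses a made-up list of genre lists (`genres_per_movie` is a stand-in for `data['genres']`), so the numbers are illustrative only.

```python
from collections import Counter
from itertools import chain

# Made-up stand-in for data['genres']: one list of labels per movie.
genres_per_movie = [['Comedy', 'Drama'], ['Drama'], ['Action', 'Drama']]

# Flatten the lists and count how often each label occurs.
genre_counts = Counter(chain.from_iterable(genres_per_movie))

print(f'The number of genres: {len(genre_counts)}')
print(genre_counts.most_common(1))
```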

After seeing that there are **too many** production companies, we decided not to use them in the model.

## Exploring our data

Now that we have cleaned the data, let's see what interesting things we can find in what remains.

In [1]:
data.head()

Firstly, we'll look at the top 10 highest rated movies.

In [None]:
print(f'The number of movies with a rating of 10: {len(data[data.vote_average == 10])}')
data.sort_values(by=['vote_average'], ascending=False).head(10)[['original_title', 'vote_average']]

Next, we will see the **frequency** histogram of the top 25 original languages of movies.

In [None]:
language_freq = data['original_language'].value_counts()[:25]
language_freq.plot(kind='bar', figsize=(12, 5), rot=45, xlabel='Original language', ylabel='Number of movies', title='Original language frequency histogram')
plt.show()

Next up, runtime **frequency** histogram.

In [None]:
data['runtime'].hist(bins=100, range=[0, 300])
plt.title('Runtime frequency histogram')
plt.xlabel('Runtime')
plt.ylabel('Number of movies')

plt.show()

The vote average (rating) **frequency** histogram.

In [None]:
data['vote_average'].hist(bins=10, range=[0, 10])
plt.title('Vote average frequency histogram')
plt.xlabel('Vote average')
plt.ylabel('Number of movies')

plt.show()

Now, let's see the **relative frequency** histogram of genres.

In [None]:
genres_dict = dict(sorted(genres_dict.items(), key=lambda x: x[1], reverse=True))
total_number = 0
for freq in genres_dict.values():
    total_number += freq

plt.figure(figsize=(12, 5))
plt.bar(range(len(genres_dict)), [freq / total_number for freq in genres_dict.values()], align='center')
plt.xticks(range(len(genres_dict)), list(genres_dict.keys()), rotation=45)
plt.ylabel('Relative frequency of genres')
plt.title('Relative frequency histogram of genres')

plt.show()
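
For reference, the relative frequencies plotted above (each genre's count divided by the total number of genre labels) can also be obtained directly in pandas. This is a sketch on a made-up frame; `toy` and `genre_share` are hypothetical names.

```python
import pandas as pd

# Made-up stand-in for the 'genres' column after parsing.
toy = pd.DataFrame({'genres': [['Comedy', 'Drama'], ['Drama'], ['Action', 'Drama']]})

# explode() gives one row per label; normalize=True divides by the label total.
genre_share = toy['genres'].explode().value_counts(normalize=True)
print(genre_share)
```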

The **relative frequency** histogram of the top 25 production companies.

In [None]:
companies_dict = dict(sorted(companies_dict.items(), key=lambda x: x[1], reverse=True)[:25])
total_number = 0
for freq in companies_dict.values():
    total_number += freq

plt.figure(figsize=(10, 7))
plt.barh(list(companies_dict.keys()), [freq / total_number for freq in companies_dict.values()], align='center')
plt.gca().invert_yaxis()
plt.xlabel('Relative frequency of production companies')
plt.title('Production companies relative frequency histogram')

plt.show()

And finally, the **relative frequency** histogram of the top 25 production countries.

In [None]:
countries_dict = dict(sorted(countries_dict.items(), key=lambda x: x[1], reverse=True))
total_number = 0
for freq in list(countries_dict.values())[:25]:
    total_number += freq

plt.figure(figsize=(10, 7))
plt.barh(list(countries_dict.keys())[:25], [freq / total_number for freq in list(countries_dict.values())[:25]], align='center')
plt.gca().invert_yaxis()
plt.xlabel('Relative frequency of production countries')
plt.title('Production countries relative frequency histogram')

plt.show()

The top 15 mean yearly revenues.

In [None]:
data['release_date'] = data['release_date'].apply(lambda date: str(date)[:4])

yearly_revenues = data.groupby('release_date')['revenue'].mean().sort_values(ascending=False)[:15]
yearly_revenues = yearly_revenues.apply(lambda x: round(x / 1_000_000, 2))

yearly_revenues.plot(kind='bar', figsize=(10, 5), rot=45, title='Top 15 mean yearly revenue', xlabel='Release year', ylabel='Yearly mean revenue (mil)')

plt.show()

## Starting with the model
## Preparing data

We will drop 'original_title', 'production_companies', 'release_date' and 'revenue', as these are not going to be used in the model.

In [None]:
data.head()

In [None]:
data = data.drop(columns=['original_title', 'production_companies', 'release_date', 'revenue'])

data.head()

Then we will use `pd.get_dummies` to one-hot-encode the 'original_language' column.

In [None]:
data = pd.get_dummies(data, columns=['original_language'])
data.head()
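
To make the effect of `pd.get_dummies` concrete, here is a sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). The encoded column is replaced by one indicator column per language, appended after the remaining columns.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column.
toy = pd.DataFrame({'original_language': ['en', 'fr', 'en'],
                    'runtime': [90.0, 100.0, 120.0]})

encoded = pd.get_dummies(toy, columns=['original_language'])
print(list(encoded.columns))
```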

Next, we will use MultiLabelBinarizer() from sklearn to one-hot-encode the 'genres' and 'production_countries' columns. The column names come from each binarizer's `classes_` attribute, so that every label matches its encoded column.

In [None]:
# One binarizer per column; classes_ holds the labels in the order of the output columns
genres_mlb = MultiLabelBinarizer()
genres_list = pd.DataFrame(genres_mlb.fit_transform(data['genres']), columns=genres_mlb.classes_)

countries_mlb = MultiLabelBinarizer()
countries_list = pd.DataFrame(countries_mlb.fit_transform(data['production_countries']), columns=countries_mlb.classes_)

data = pd.concat([data.drop(columns=['genres', 'production_countries']), genres_list, countries_list], axis=1)
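
One detail worth checking when labelling the encoded columns: `MultiLabelBinarizer` stores its labels in sorted order in `classes_`, which need not match a list ordered by frequency or by first appearance. A small sketch on made-up data (`toy_genres` is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up genre lists; 'Drama' appears first but sorts last.
toy_genres = pd.Series([['Drama', 'Comedy'], ['Action'], ['Drama']])

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(toy_genres), columns=mlb.classes_)

print(list(mlb.classes_))
print(encoded)
```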

## Splitting the data

Splitting the data for training and testing. For this we will be using train_test_split from sklearn.

In [None]:
X = data.drop(columns=['vote_average'])
y = data['vote_average']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)

print('Features data shape:')
print(X_train.shape)
print('Values data shape:')
print(y_train.shape)

We will train each model on two different training sets. The first is the full training set; the second is a more balanced one, in which the rating (vote average) ranges are represented in roughly equal numbers, for example as many movies rated below 5 as movies rated 7 or above.

In [None]:
train_new = pd.concat([X_train, y_train], axis=1)

train_0_4 = train_new[train_new.vote_average < 5]
train_5_6 = train_new[(train_new.vote_average >= 5) & (train_new.vote_average < 7)]
train_7_10 = train_new[train_new.vote_average >= 7]

print(f'Length of movies with 0-4: {len(train_0_4)}')
print(f'Length of movies with 5-6: {len(train_5_6)}')
print(f'Length of movies with 7-10: {len(train_7_10)}')

In [None]:
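# Hard-coded sample size, presumably the size of the smallest group printed above
sample_size = 3389
train_5_6 = train_5_6.sample(n=sample_size, random_state=0)
train_7_10 = train_7_10.sample(n=sample_size, random_state=0)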

train_new = pd.concat([train_0_4, train_5_6, train_7_10]).sort_index()

X_train_new = train_new.drop(columns=['vote_average'])
y_train_new = train_new['vote_average']

In [None]:
print(f'Length of original training data: {len(X_train)}')
print(f'Length of more balanced training data: {len(X_train_new)}')
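
The balancing step can be sketched without a hard-coded sample size by downsampling every rating group to the size of the smallest one. The data below is made up, and `groups`, `smallest` and `balanced` are hypothetical names; this is one possible variant, not necessarily how the sample size above was chosen.

```python
import pandas as pd

# Made-up ratings standing in for the training set.
toy = pd.DataFrame({'vote_average': [2.0, 3.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]})

# Same three rating ranges as in the notebook.
groups = [
    toy[toy.vote_average < 5],
    toy[(toy.vote_average >= 5) & (toy.vote_average < 7)],
    toy[toy.vote_average >= 7],
]

# Downsample each group to the size of the smallest group.
smallest = min(len(group) for group in groups)
balanced = pd.concat(
    [group.sample(n=smallest, random_state=0) for group in groups]
).sort_index()

print(smallest, len(balanced))
```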

Next we define how we measure the error of the models.
For this we will be using the root mean squared error (RMSE), where lower is better.

In [None]:
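def MSE(y_target, y_pred):
    # Mean of the squared differences between targets and predictions
    squared_errors = [(target - pred) ** 2 for target, pred in zip(y_target, y_pred)]
    return sum(squared_errors) / len(squared_errors)

def RMSE(y_target, y_pred):
    # Square root of the MSE, expressed in the same units as the ratings
    return math.sqrt(MSE(y_target, y_pred))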
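
The same metric can also be written in vectorized form with NumPy. The function name `rmse_vectorized` is hypothetical and the inputs are made up; it is meant as an equivalent sketch, not a replacement for the definitions above.

```python
import numpy as np

def rmse_vectorized(y_target, y_pred):
    # Convert to float arrays so element-wise arithmetic works for lists too.
    y_target = np.asarray(y_target, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # sqrt(mean((target - prediction)^2))
    return float(np.sqrt(np.mean((y_target - y_pred) ** 2)))

# Errors are 2 and 0, so the MSE is 2 and the RMSE is sqrt(2).
print(rmse_vectorized([3.0, 5.0], [1.0, 5.0]))
```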

## Training the models

We will be trying Linear Regression, Ridge Regression and Lasso Regression for this dataset.

In [None]:
linear_1 = LinearRegression()
linear_2 = LinearRegression()

ridge_1 = Ridge()
ridge_2 = Ridge()

lasso_1 = Lasso()
lasso_2 = Lasso()

Fitting the models.

In [None]:
linear_1 = linear_1.fit(X_train, y_train)
linear_2 = linear_2.fit(X_train_new, y_train_new)

ridge_1 = ridge_1.fit(X_train, y_train)
ridge_2 = ridge_2.fit(X_train_new, y_train_new)

lasso_1 = lasso_1.fit(X_train, y_train)
lasso_2 = lasso_2.fit(X_train_new, y_train_new)

## Testing the models

Now the trained models predict on the test data.

In [None]:
linear_1_pred = linear_1.predict(X_test)
linear_2_pred = linear_2.predict(X_test)

ridge_1_pred = ridge_1.predict(X_test)
ridge_2_pred = ridge_2.predict(X_test)

lasso_1_pred = lasso_1.predict(X_test)
lasso_2_pred = lasso_2.predict(X_test)

Computing the RMSE of each model on the test data.

In [None]:
print(f'RMSE for LR (unbalanced) - {RMSE(y_test.to_numpy(), linear_1_pred)}')
print(f'RMSE for LR (more balanced) - {RMSE(y_test.to_numpy(), linear_2_pred)}\n')

print(f'RMSE for Ridge (unbalanced) - {RMSE(y_test.to_numpy(), ridge_1_pred)}')
print(f'RMSE for Ridge (more balanced) - {RMSE(y_test.to_numpy(), ridge_2_pred)}\n')

print(f'RMSE for Lasso (unbalanced) - {RMSE(y_test.to_numpy(), lasso_1_pred)}')
print(f'RMSE for Lasso (more balanced) - {RMSE(y_test.to_numpy(), lasso_2_pred)}\n')
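
An RMSE value is easier to judge next to a trivial baseline that predicts the mean training rating for every movie. The sketch below uses made-up ratings (`y_train_toy` and `y_test_toy` are hypothetical); with the real data, `y_train` and `y_test` would take their place.

```python
import numpy as np

# Made-up ratings standing in for y_train and y_test.
y_train_toy = np.array([5.0, 6.0, 7.0, 6.0])
y_test_toy = np.array([4.0, 8.0])

# Baseline: predict the mean training rating for every test movie.
baseline_pred = np.full(len(y_test_toy), y_train_toy.mean())
baseline_rmse = float(np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2)))

print(f'RMSE for mean baseline - {baseline_rmse}')
```

A model is only informative to the extent that its RMSE is below this baseline.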

As we can see, Ridge Regression performed the best on our test set with a root mean squared error of about 1.09.