# Movie Rating Prediction

Here, we aim to predict the user rating of a movie from its director and cast. We use the 'vote_average' column of our dataset as the rating. The dataset includes other features, but for this task we use only 'director' and 'cast'.

We try two different models to achieve this: linear regression (LinearRegression) and a random forest (RandomForestRegressor).


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestRegressor

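We start by loading our movie dataset. We prepare the data by filling any missing director and cast values with 'Unknown'. We also transform the 'cast' feature from a single string into a list. Note that split() with no argument splits on whitespace, so a multi-word actor name becomes several tokens rather than one entry.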

In [2]:
# Load the data
movies = pd.read_csv('movie_dataset.csv')

# Fill missing directors and cast with 'Unknown'
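# Assign the result back; calling fillna(inplace=True) on a selected column
# triggers chained-assignment warnings in recent pandas and may not modify movies
movies['director'] = movies['director'].fillna('Unknown')
movies['cast'] = movies['cast'].fillna('Unknown')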

# Convert 'cast' from a string to a list of whitespace-separated tokens
movies['cast'] = movies['cast'].apply(lambda x: x.split())
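
To make the effect of the whitespace split concrete, here is a minimal sketch on a toy frame. The names are placeholders, and the '|' separator in the second half is purely hypothetical: it illustrates what a delimiter-based split would give if the file used one, which the original dataset may not.

```python
import pandas as pd

# Toy data with placeholder names (not taken from movie_dataset.csv)
toy = pd.DataFrame({"cast": ["Jane Doe John Roe", None]})
toy["cast"] = toy["cast"].fillna("Unknown")

# Same transformation as in the notebook: split on whitespace
whitespace_tokens = toy["cast"].apply(lambda x: x.split()).tolist()
print(whitespace_tokens)

# Hypothetical alternative: names separated by '|' keep first and last name together
piped = pd.Series(["Jane Doe|John Roe"])
full_names = piped.apply(lambda x: x.split("|")).tolist()
print(full_names)
```

With the whitespace split, each first name and surname becomes its own feature column after encoding, which may or may not be what is intended.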

We then perform one-hot encoding on the 'director' feature, and use the MultiLabelBinarizer to encode the 'cast' feature since each movie can have multiple actors.

In [3]:
# One-hot encoding for director and cast
director_encoded = pd.get_dummies(movies['director'])
cast_encoder = MultiLabelBinarizer()
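# Reuse the index of movies so that concat aligns rows even if the index is not 0..n-1
cast_encoded = pd.DataFrame(cast_encoder.fit_transform(movies['cast']), columns=cast_encoder.classes_, index=movies.index)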

# Combine encoded features
movies_encoded = pd.concat([director_encoded, cast_encoded], axis=1)
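
The encoding step can be illustrated on a three-row toy frame. This is a sketch rather than the notebook's exact code: the 'director_' and 'cast_' column prefixes are an assumption added here to keep the two groups of columns from colliding by name.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: one director per movie, a list of cast tokens per movie
toy = pd.DataFrame({
    "director": ["A", "B", "A"],
    "cast": [["x", "y"], ["y"], ["z"]],
})

# One column per distinct director
director_encoded = pd.get_dummies(toy["director"], prefix="director")

# One column per distinct cast token; a row can have several 1s
mlb = MultiLabelBinarizer()
cast_encoded = pd.DataFrame(
    mlb.fit_transform(toy["cast"]),
    columns=[f"cast_{name}" for name in mlb.classes_],
    index=toy.index,
)

encoded = pd.concat([director_encoded, cast_encoded], axis=1)
print(encoded)
```

The number of columns equals the number of distinct directors plus the number of distinct cast tokens, so it grows quickly on a real dataset.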

The first model we try is a simple Linear Regression model.

In [4]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(movies_encoded, movies['vote_average'], test_size=0.2, random_state=42)

# Train a regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print the MSE
print("Mean Squared Error:", mse)

Mean Squared Error: 4.9634122165806904e+26


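The Mean Squared Error (MSE) is on the order of 10^26, which is meaningless for ratings that are single-digit numbers. A likely cause is that the one-hot encoding creates a very large number of columns, many of them set for only one or two movies, so the least-squares fit is ill-conditioned and some coefficients become enormous.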
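
One way to put an MSE in context is to compare it with a baseline that always predicts the mean training rating. The sketch below does this on synthetic data with more columns than rows. It also fits a Ridge regression, which is not used in this notebook and is shown only as one common regularized alternative to plain least squares. All variable names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for one-hot features: 200 binary columns, only 40 training rows
rng = np.random.default_rng(0)
X_train_toy = rng.integers(0, 2, size=(40, 200)).astype(float)
X_test_toy = rng.integers(0, 2, size=(10, 200)).astype(float)
y_train_toy = rng.normal(6.0, 1.0, size=40)
y_test_toy = rng.normal(6.0, 1.0, size=10)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_mse = mean_squared_error(y_test_toy, baseline.predict(X_test_toy))

# Ridge adds an L2 penalty that keeps the coefficients bounded
ridge = Ridge(alpha=1.0).fit(X_train_toy, y_train_toy)
ridge_mse = mean_squared_error(y_test_toy, ridge.predict(X_test_toy))

print("Baseline MSE:", baseline_mse)
print("Ridge MSE:", ridge_mse)
```

A model whose MSE is not clearly below the baseline is not learning anything useful from the features.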

To potentially improve our model, we decided to switch to a Random Forest Regressor. Random Forests are known to work well with a large number of features and can model complex interactions. However, when we initially tried running the RandomForestRegressor with the default parameters, the model was taking a very long time to train.

In [5]:
# Train a Random Forest model
# model = RandomForestRegressor(n_estimators=100, random_state=42) # Commenting this line as it took too long
# model.fit(X_train, y_train)

To address this issue, we reduced the number of trees in the forest (n_estimators), limited the maximum depth of the trees (max_depth), and set a minimum number of samples required to be at a leaf node (min_samples_leaf).

This makes the model faster to train, but the smaller, shallower forest may fit the data less closely and so could perform worse.
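
The effect of these parameters on training time can be checked with a small timing experiment. The sketch below uses synthetic data, and the helper timed_fit is a hypothetical name introduced for illustration. Actual timings depend on the machine and on the real feature matrix.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the encoded features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 500))
y_toy = rng.normal(6.0, 1.0, size=300)

def timed_fit(**params):
    """Fit a forest with the given parameters and return it with the elapsed seconds."""
    model = RandomForestRegressor(random_state=42, **params)
    start = time.perf_counter()
    model.fit(X_toy, y_toy)
    return model, time.perf_counter() - start

full_model, full_seconds = timed_fit(n_estimators=100)
small_model, small_seconds = timed_fit(n_estimators=50, max_depth=10, min_samples_leaf=4)

print(f"100 unrestricted trees: {full_seconds:.2f} s")
print(f"50 trees, max_depth=10, min_samples_leaf=4: {small_seconds:.2f} s")
```

RandomForestRegressor also accepts n_jobs=-1 to build trees in parallel on all available cores, which reduces wall-clock time without changing the fitted model's structure.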

In [12]:
# Train a Random Forest model with modified parameters to speed up the training
model = RandomForestRegressor(n_estimators=50, max_depth=10, min_samples_leaf=4, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print the MSE
print("Mean Squared Error:", mse)
print()

# Create a table to display the predictions and actual values
table = pd.DataFrame({'Movie': movies.loc[y_test.index, 'title'],
                      'Prediction': y_pred.round(1),
                      'Actual Rating': y_test})

# Print the table
print(table)

Mean Squared Error: 1.3292070000896272

                                       Movie  Prediction  Actual Rating
596                                    I Spy         6.1            5.2
3372                            Split Second         6.1            5.7
2702                                  Gossip         6.2            5.5
2473                Vicky Cristina Barcelona         6.2            6.7
8     Harry Potter and the Half-Blood Prince         6.2            7.4
...                                      ...         ...            ...
2801                             The Funeral         6.2            7.3
198                                 R.I.P.D.         6.2            5.4
2423                            Summer Catch         6.1            4.8
2298                               Sex Drive         6.2            6.0
402                              The Rundown         6.2            6.4

[961 rows x 3 columns]
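
A quick check on output like this is to compare how much the predictions vary with how much the actual ratings vary. The sketch below uses only the ten rows visible in the printed table, so it is illustrative rather than a summary of all 961 test rows.

```python
import numpy as np

# The ten rows visible in the printed table above
predicted = np.array([6.1, 6.1, 6.2, 6.2, 6.2, 6.2, 6.2, 6.1, 6.2, 6.2])
actual = np.array([5.2, 5.7, 5.5, 6.7, 7.4, 7.3, 5.4, 4.8, 6.0, 6.4])

# Spread of predictions versus spread of true ratings
pred_std = predicted.std()
actual_std = actual.std()

# Error of the model on these rows, in rating points
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

print("Std of predictions:", round(pred_std, 3))
print("Std of actual ratings:", round(actual_std, 3))
print("RMSE on these rows:", round(rmse, 3))
```

When the predictions vary far less than the targets, the model is behaving much like a constant predictor.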


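In conclusion, we started with a simple linear regression model, whose test MSE was about 4.96e26, and then moved to a random forest regressor, which brought the test MSE down to about 1.33 (an RMSE of roughly 1.15 rating points). The predictions shown in the table all fall between 6.1 and 6.2, however, which suggests that the forest is predicting close to the average rating rather than distinguishing movies by director and cast.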