# Movie Revenue Prediction



**1. Summary**

Here, we build a machine learning model to predict movie revenue from attributes such as director and budget. We compare a linear regression model with a Random Forest model.

**2. Data Loading and Preprocessing**

In [38]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Load the data
movies = pd.read_csv('movie_dataset.csv')

# Fill missing directors with 'Unknown' (assign back rather than using inplace on a column)
movies['director'] = movies['director'].fillna('Unknown')

print(movies)

      index     budget                                    genres  \
0         0  237000000  Action Adventure Fantasy Science Fiction   
1         1  300000000                  Adventure Fantasy Action   
2         2  245000000                    Action Adventure Crime   
3         3  250000000               Action Crime Drama Thriller   
4         4  260000000          Action Adventure Science Fiction   
...     ...        ...                                       ...   
4798   4798     220000                     Action Crime Thriller   
4799   4799       9000                            Comedy Romance   
4800   4800          0             Comedy Drama Romance TV Movie   
4801   4801          0                                       NaN   
4802   4802          0                               Documentary   

                                               homepage      id  \
0                           http://www.avatarmovie.com/   19995   
1          http://disney.go.com/disneypictures/pi

Here, we load the data and handle missing values in the 'director' feature by replacing them with 'Unknown'.
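
Before imputing, it can help to see how much is actually missing. The printed frame above also shows several rows with a `budget` of 0, which may be a placeholder for an unknown budget rather than a true value; that is an assumption worth checking against the data source. A minimal sketch on a small made-up frame (the rows and the `toy` name are illustrative and not taken from `movie_dataset.csv`):

```python
import pandas as pd

# Hypothetical toy frame that mimics two columns used in this notebook
toy = pd.DataFrame({
    'director': ['James Cameron', None, 'Gore Verbinski', None],
    'budget': [237000000, 0, 300000000, 9000],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Count zero budgets, which may stand in for "unknown" (assumption)
zero_budgets = int((toy['budget'] == 0).sum())
print("Rows with budget == 0:", zero_budgets)

# Assign the result back instead of calling fillna(inplace=True) on a column
toy['director'] = toy['director'].fillna('Unknown')
print(toy)
```

The same checks can be run on `movies` directly to decide whether filling with 'Unknown' or the median is appropriate.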

**3. Model Building: Linear Regression (Director Only)**

In [39]:
# One-hot encoding for director only
movies_encoded = pd.get_dummies(movies['director'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(movies_encoded, movies['revenue'], test_size=0.2, random_state=42)

# Train a regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)

Mean Squared Error: 1.2390624272818453e+44


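In this part, we built a simple linear regression model using only the 'director' feature, one-hot encoded to handle its categorical values. The model performed very poorly: an MSE of about 1.24e+44 corresponds to a typical error of about 1.1e+22 (its square root), far larger than any plausible revenue. This suggests that the 'director' feature alone is not enough to predict movie revenue, and that an unregularized linear regression on one sparse column per director is likely numerically unstable.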
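
An MSE is hard to judge without a reference point. One common check is to compare it against a baseline that always predicts the training mean, and to report the square root (RMSE) so the error is in the same units as the target. The sketch below uses synthetic data as a stand-in for revenue; `X_demo`, `y_demo` and the distribution parameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data (not the real movie dataset)
X_demo = rng.normal(size=(200, 1))
y_demo = rng.lognormal(mean=17, sigma=1.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: ignore the features and always predict the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
baseline_rmse = np.sqrt(baseline_mse)

print("Baseline MSE:", baseline_mse)
print("Baseline RMSE:", baseline_rmse)
```

A model whose test MSE is above this baseline is doing worse than ignoring the features entirely.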

**4. Model Building: Linear Regression (Director + Budget)**

In [40]:
# Load the data
movies = pd.read_csv('movie_dataset.csv')

# Fill missing directors with 'Unknown' and missing budgets with the median
movies['director'] = movies['director'].fillna('Unknown')
movies['budget'] = movies['budget'].fillna(movies['budget'].median())

# Scale the revenue using StandardScaler
scaler_revenue = StandardScaler()
movies['revenue'] = scaler_revenue.fit_transform(movies['revenue'].values.reshape(-1, 1))

# One-hot encoding for director
movies_encoded = pd.get_dummies(movies['director'])

# Combine encoded and budget features
movies_final = pd.concat([movies['budget'], movies_encoded], axis=1)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(movies_final, movies['revenue'],
                                                    test_size=0.2, random_state=42)

# Train a regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print the MSE
print("Mean Squared Error:", mse)

Mean Squared Error: 667029824.1574824


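Here, we added the 'budget' feature alongside the 'director' feature. The 'budget' values were also missing in some rows, so we replaced them with the median. We also standardized 'revenue' with StandardScaler, which centers it to zero mean and rescales it to unit variance. Note that this does not remove the skew in 'revenue'; a log transform is the more usual choice for that.

However, this model still did not perform well. With revenue standardized to unit variance, always predicting the mean would give an MSE of roughly 1, so an MSE of about 6.67e+08 is far worse than that trivial baseline.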
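
Standardizing a variable shifts and rescales it but leaves the shape of its distribution, including its skew, unchanged. A log transform such as `np.log1p` is a common way to reduce right skew. The following sketch illustrates the difference on synthetic, right-skewed data; `revenue_demo` and the lognormal parameters are assumptions and are not drawn from the movie dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for revenue (assumption)
revenue_demo = pd.Series(rng.lognormal(mean=17, sigma=1.5, size=1000))

# Standardize: subtract the mean, divide by the standard deviation
standardized = pd.Series(
    StandardScaler()
    .fit_transform(revenue_demo.to_numpy().reshape(-1, 1))
    .ravel()
)

# Log transform: compresses large values, reducing right skew
logged = np.log1p(revenue_demo)

skew_raw = revenue_demo.skew()
skew_std = standardized.skew()
skew_log = logged.skew()

print("Skew (raw):         ", skew_raw)
print("Skew (standardized):", skew_std)
print("Skew (log1p):       ", skew_log)
```

If a log-transformed target is used for training, predictions can be mapped back to the original scale with `np.expm1`.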

**5. Model Building: Random Forest**

In [44]:
# Load the data
movies = pd.read_csv('movie_dataset.csv')

# Fill missing directors with 'Unknown' and missing budgets with the median
movies['director'] = movies['director'].fillna('Unknown')
movies['budget'] = movies['budget'].fillna(movies['budget'].median())

# Scale the revenue using StandardScaler
scaler = StandardScaler()
movies['revenue_scaled'] = scaler.fit_transform(movies['revenue'].values.reshape(-1, 1))

# One-hot encoding for director
movies_encoded = pd.get_dummies(movies['director'])

# Combine encoded and budget features
movies_final = pd.concat([movies['budget'], movies_encoded], axis=1)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(movies_final, movies['revenue_scaled'], test_size=0.2, random_state=42)

# Train a random forest regression model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred_scaled = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_scaled)

# Inverse transform the scaled predictions
y_pred = scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))

# Create a table to display the predictions and actual values
table = pd.DataFrame({'Movie': movies.loc[y_test.index, 'title'],
                      'Prediction': y_pred.flatten().astype(int),
                      'Actual Revenue': movies.loc[y_test.index, 'revenue'].astype(int)})

# Print the table
print(table)
print()

# Print the MSE
print("Mean Squared Error:", mse)

                                       Movie  Prediction  Actual Revenue
596                                    I Spy   212838966        33561137
3372                            Split Second           0               5
2702                                  Gossip     1290542         5108820
2473                Vicky Cristina Barcelona    29312036        96408652
8     Harry Potter and the Half-Blood Prince   818748805       933959197
...                                      ...         ...             ...
2801                             The Funeral     3648639         1227324
198                                 R.I.P.D.   251096709        61648500
2423                            Summer Catch    27197206        19693891
2298                               Sex Drive    10657704        18755936
402                              The Rundown   152897409        80916492

[961 rows x 3 columns]

Mean Squared Error: 0.3879626643869817

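Finally, we switched to a Random Forest model, a robust ensemble learning method, while keeping the 'director' and 'budget' features. This improved performance: the test MSE dropped to about 0.39. Both this MSE and the one in section 4 are computed on standardized revenue, so they are directly comparable; the MSE in section 3 is computed on raw revenue and is not. The prediction table still shows large errors for individual movies (for example, I Spy and R.I.P.D.), so the fit is better but far from exact.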
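
Because the Random Forest above is trained on `revenue_scaled`, its MSE is expressed in standardized units. For a target standardized with `StandardScaler`, the error in original units can be recovered as RMSE_original = sqrt(MSE_scaled) × `scaler.scale_[0]`, since the scaler divides by the standard deviation stored in `scale_`. The sketch below checks this identity on synthetic values; `y_true`, `y_hat` and `demo_scaler` are hypothetical names, not variables from the cells above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic "actual" values and noisy "predictions" (assumptions)
y_true = rng.lognormal(mean=17, sigma=1.5, size=500)
y_hat = y_true * rng.uniform(0.5, 1.5, size=500)

# Standardize both with a scaler fitted on the actual values
demo_scaler = StandardScaler().fit(y_true.reshape(-1, 1))
y_true_scaled = demo_scaler.transform(y_true.reshape(-1, 1)).ravel()
y_hat_scaled = demo_scaler.transform(y_hat.reshape(-1, 1)).ravel()

# MSE in standardized units, then converted back to original units
mse_scaled = mean_squared_error(y_true_scaled, y_hat_scaled)
rmse_from_scaled = np.sqrt(mse_scaled) * demo_scaler.scale_[0]

# RMSE computed directly in original units, for comparison
rmse_direct = np.sqrt(mean_squared_error(y_true, y_hat))

print("RMSE from scaled MSE:", rmse_from_scaled)
print("RMSE computed directly:", rmse_direct)
```

Applied to the notebook, the same conversion would use the `scaler` fitted on `movies['revenue']` together with the reported `mse`.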

In conclusion, the Random Forest model performed significantly better than the Linear Regression model at predicting movie revenue in our case. Adding the 'budget' feature and standardizing the revenue target accompanied this improvement, although standardizing the target (which is not the same as scaling the input features) mainly changes the units in which the MSE is reported.
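
A comparison between models is easiest to trust when both are fitted on the same split and scored with the same metric in the same units. The sketch below shows one way to set that up; the data are synthetic, the linear budget-to-revenue relationship is made up for illustration, and names such as `budget_sim` and `results` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic budget and revenue with a made-up relationship (assumption)
budget_sim = rng.uniform(1e6, 2e8, size=400)
revenue_sim = 2.5 * budget_sim + rng.normal(0, 5e7, size=400)
X_sim = budget_sim.reshape(-1, 1)

# One shared split for every model
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sim, revenue_sim, test_size=0.2, random_state=42)

candidates = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(
        n_estimators=100, random_state=42),
}

# Fit each model and score it with the same metrics on the same test set
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    pred = estimator.predict(X_te)
    results[name] = {
        'rmse': float(np.sqrt(mean_squared_error(y_te, pred))),
        'r2': float(r2_score(y_te, pred)),
    }
    print(name, results[name])
```

Which model wins on this toy data says nothing about the movie dataset; the point is only the evaluation setup.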