CSE 351 FINAL PROJECT
PHRIREYAANTH POOBALARAJ
Project #1: Movie Revenue Prediction (2-3 People)

The film industry is booming and revenues are growing. Many factors affect a film's revenue. In this project, you will explore which features help predict it.

Datasets:

The “movie.zip” file contains the datasets to be used for this project and a file describing the various columns in the data. You must split the dataset yourself into training, testing, and cross-validation data (when required). Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

EDA (10 points):

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations.

Specific tasks may include but are not limited to:

● Clean the dataset, remove the outliers, before any data analysis. Explain what you did.

● Some of the columns contain lists and dictionaries. Extract information you need and reformat them.

● Count the number of movies released by day of week, month, and year. Are there any patterns that you observe?

● Which shifts in movie genre trends can you observe from the dataset?

● What are the strongest and weakest features correlated with movie revenue?

● You can also use some external datasets to integrate into your revenue prediction analysis to make it better.

Modeling and Question Answering (10 points):

Extract the features you think are necessary in predicting the movie revenue.

Build three models, train them on the training set, and predict the revenue on the test set (after dropping the revenue column in the test set). Explain how each model works (briefly introduce the machine learning algorithms behind them). Evaluate the performance of each model based on the original outcome in the test set. If your predictions are not so accurate, what do you think is the reason? Report your accuracy using metrics such as Residual Standard Error (RSE). Split the data further to include a cross validation set. Did this improve your model’s performance on the test set?

Project Report (10 points):

You are required to document your project, which can be included in the notebook itself. Don't forget to include the team members' contribution information in the documentation. Include visualizations to prove your point. You should prepare a PowerPoint presentation, which can help you during the demo.

Demo (5 points):

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.

Submission:

Submit the following on Blackboard:

Code in pdf and ipynb format
Project Presentation in Powerpoint format

Dataset Upload

In [6]:
from google.colab import files
uploaded = files.upload()


Saving tmdb_5000_credits.csv to tmdb_5000_credits (1).csv
Saving tmdb_5000_movies.csv to tmdb_5000_movies (1).csv


This cell uses the `google.colab` module to manually upload local files into the Colab environment.


In [7]:
import pandas as pd

# Load the movies dataset
movies_df = pd.read_csv('tmdb_5000_movies (1).csv')

# Load the credits dataset
credits_df = pd.read_csv('tmdb_5000_credits (1).csv')

# Preview the datasets
print("Movies DataFrame:")
print(movies_df.head())

print("\nCredits DataFrame:")
print(credits_df.head())


Movies DataFrame:
      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2 

- Imports the `pandas` library for data manipulation.
- Loads two CSV files into separate pandas DataFrames:
  - `movies_df`: Contains metadata about movies (e.g., title, budget, genres, revenue).
  - `credits_df`: Contains casting and crew information.
- Displays the first few rows of each dataset using `head()` to verify successful loading.

Data Cleaning

In [8]:
import pandas as pd
import numpy as np

# original count
original_movie_count = movies_df.shape[0]

cleaned_movies_df = movies_df.query(
    '(budget <= 175_000_000) and (revenue <= 700_000_000) and '
    '(vote_count <= 8000) and (3.5 <= vote_average <= 8.3) and '
    '(popularity <= 150) and (runtime >= 60) and (runtime <= 200)'
).dropna(subset=['runtime']).copy()

# Count removed movies
removed_movie_count = original_movie_count - cleaned_movies_df.shape[0]

# valid movie IDs
valid_movie_ids = cleaned_movies_df['id'].unique()
cleaned_credits_df = credits_df[credits_df['movie_id'].isin(valid_movie_ids)].copy()

# Reset indices
cleaned_movies_df.reset_index(drop=True, inplace=True)
cleaned_credits_df.reset_index(drop=True, inplace=True)

# Final
print("Original number of movies:", original_movie_count)
print("Movies remaining after composite filter:", len(cleaned_movies_df))
print("Movies removed:", removed_movie_count)
print("Credits remaining:", len(cleaned_credits_df))


Original number of movies: 4803
Movies remaining after composite filter: 4504
Movies removed: 299
Credits remaining: 4504


This cell removes outliers from the TMDB movies and credits datasets. It records the original movie count, then keeps only rows that satisfy every one of these conditions: budget at most $175 million, revenue at most $700 million, vote count at most 8,000, average rating between 3.5 and 8.3, popularity at most 150, and runtime between 60 and 200 minutes. Rows with a missing runtime are dropped as well (the runtime comparisons already exclude them, so the explicit dropna acts as a safeguard).

The credits dataset is then filtered to the remaining movie IDs so that both datasets stay in sync, and both indices are reset. The summary shows that 299 of the 4,803 movies were removed, leaving 4,504 movies with 4,504 matching credits rows. Removing these extreme values limits the influence of a small number of atypical films on the models trained later.
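The composite filter can be illustrated on a toy table. This is a minimal sketch with made-up rows (not TMDB data) and only the budget and runtime conditions. It also shows that a comparison inside query evaluates to False for a missing runtime, so that row is excluded even without dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "budget": [50_000_000, 300_000_000, 10_000_000, 20_000_000],
    "runtime": [120.0, 140.0, np.nan, 45.0],
})

# Same pattern as the cleaning cell: every condition must hold for a row to stay.
kept = toy.query("(budget <= 175000000) and (runtime >= 60) and (runtime <= 200)")

# Row 2 exceeds the budget cap, row 3 has no runtime, row 4 is too short.
print("kept ids:", kept["id"].tolist())
print("removed:", len(toy) - len(kept))
```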


In [9]:
import numpy as np

# Extract revenue
revenue_array = cleaned_movies_df['revenue'].to_numpy()
revenue_array = revenue_array[~np.isnan(revenue_array)]  # remove NaNs

#  statistics
mean_revenue = np.mean(revenue_array)
median_revenue = np.median(revenue_array)
std_revenue = np.std(revenue_array, ddof=1)  # sample std dev to match pandas

# Print
print(f"Mean Revenue: ${mean_revenue:,.2f}")
print(f"Median Revenue: ${median_revenue:,.2f}")
print(f"Standard Deviation of Revenue: ${std_revenue:,.2f}")


Mean Revenue: $66,495,724.18
Median Revenue: $19,475,081.50
Standard Deviation of Revenue: $107,108,987.41


The cell converts the revenue column to a NumPy array and removes any NaN values so that missing entries cannot distort the statistics.

It then computes the mean, median, and standard deviation with NumPy. The standard deviation uses ddof=1 (the sample standard deviation), which matches the pandas default. The results are printed as U.S. dollar amounts with two decimal places.

The mean ($66.5 million) is more than three times the median ($19.5 million), and the standard deviation ($107.1 million) exceeds the mean, so revenue is strongly right-skewed: most films earn relatively little and a few earn a great deal. This quick summary guides later decisions about how to treat the data.
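The effect of ddof can be checked directly. The sketch below uses four made-up revenue values (an assumption for illustration, not values from the dataset) and compares the population and sample standard deviations with the pandas default.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values in dollars, chosen only for illustration.
revenue = np.array([1.0e6, 5.0e6, 2.0e7, 4.0e8])

population_std = np.std(revenue)         # ddof=0: divides by n
sample_std = np.std(revenue, ddof=1)     # ddof=1: divides by n - 1
pandas_std = pd.Series(revenue).std()    # pandas uses ddof=1 by default

print(f"mean={revenue.mean():,.0f}  median={np.median(revenue):,.0f}")
print(f"population std={population_std:,.0f}  sample std={sample_std:,.0f}")
print("matches pandas:", np.isclose(sample_std, pandas_std))
```

As in the real data, one very large value pulls the mean well above the median.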



Feature Engineering

In [21]:
import pandas as pd
import numpy as np
import ast
from collections import Counter

df = pd.merge(cleaned_movies_df, cleaned_credits_df, left_on='id', right_on='movie_id')

# Parse genres
def parse_column_list(x, key='name'):
    try:
        return [d.get(key) for d in ast.literal_eval(x) if isinstance(d, dict)]
    except:
        return []

df['genre_list'] = df['genres'].map(lambda x: parse_column_list(x))
all_genres = [g for genres in df['genre_list'] for g in genres]
genre_counts = pd.Series(all_genres).value_counts()
common_genres = set(genre_counts[genre_counts >= 50].index)

for genre in common_genres:
    df[f'genre_{genre}'] = df['genre_list'].map(lambda genres: int(genre in genres))

df['genre_Other'] = df['genre_list'].map(lambda genres: int(any(g not in common_genres for g in genres)))

#  Budget/Popularity
df['log_budget'] = np.log1p(df['budget'])
df['log_popularity'] = np.log1p(df['popularity'])

# Budget
budget_bins = [0, 2e7, 1e8, np.inf]
budget_labels = ['low', 'mid', 'high']
df['budget_bucket'] = pd.cut(df['budget'], bins=budget_bins, labels=budget_labels)
df = pd.concat([df, pd.get_dummies(df['budget_bucket'], prefix='budget')], axis=1)

#  Runtime
runtime_bins = [0, 90, 130, np.inf]
runtime_labels = ['short', 'medium', 'long']
df['runtime_bucket'] = pd.cut(df['runtime'], bins=runtime_bins, labels=runtime_labels)
df = pd.concat([df, pd.get_dummies(df['runtime_bucket'], prefix='runtime')], axis=1)

#  Date Feats
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year

df['release_month'] = df['release_date'].dt.month
df['release_day_of_week'] = df['release_date'].dt.dayofweek.fillna(4).astype(int)

df['is_summer_release'] = df['release_month'].isin([5,6,7,8]).astype(int)
df['is_holiday_release'] = df['release_month'].isin([11,12]).astype(int)
df['is_spring_break'] = df['release_month'].isin([3,4]).astype(int)

df = pd.concat([df, pd.get_dummies(df['release_month'], prefix='month')], axis=1)

def five_year_bin(year):
    if pd.isna(year): return 'Unknown'
    if year < 1980: return 'Before 1980'
    base = int(year // 5 * 5)
    return f'{base}-{base+4}'

df['release_5yr'] = df['release_year'].map(five_year_bin)
df = pd.concat([df, pd.get_dummies(df['release_5yr'], prefix='period')], axis=1)

df['production_company_list'] = df['production_companies'].map(parse_column_list)

major_studios = [
    'Warner Bros', 'Warner Bros.', 'Universal Pictures', 'Walt Disney', 'Disney',
    'Columbia Pictures', 'Paramount', 'Paramount Pictures', '20th Century Fox',
    'New Line Cinema', 'Sony Pictures', 'MGM', 'Lionsgate', 'DreamWorks']

for studio in major_studios:
    col_name = f'studio_{studio.replace(" ", "_").replace(".", "").lower()}'
    df[col_name] = df['production_company_list'].map(lambda lst: int(any(studio in x for x in lst)))

df['is_major_studio'] = df['production_company_list'].map(
    lambda lst: int(any(any(studio in c for studio in major_studios) for c in lst))
)

#   average revenue
studio_revenue = {}
for _, row in df.iterrows():
    for company in row['production_company_list']:
        studio_revenue.setdefault(company, []).append(row['revenue'])

studio_avg = {k: np.mean(v) for k, v in studio_revenue.items() if v}
df['studio_avg_revenue'] = df['production_company_list'].map(
    lambda lst: max([studio_avg.get(name, 0) for name in lst]) if lst else 0
)

#  Franchise
def extract_collection(x):
    try:
        return ast.literal_eval(x)['name'] if pd.notna(x) and x != 'null' else None
    except:
        return None

if 'belongs_to_collection' in df.columns:
    df['is_franchise'] = df['belongs_to_collection'].map(lambda x: int(pd.notna(x) and x != 'null'))
    df['collection_name'] = df['belongs_to_collection'].map(extract_collection)
    collection_avg = df.groupby('collection_name')['revenue'].mean().to_dict()
    df['franchise_avg_revenue'] = df['collection_name'].map(lambda x: collection_avg.get(x, 0) if x else 0)
else:
    df['is_franchise'] = 0
    df['franchise_avg_revenue'] = 0

#  Director & Cast
def extract_director(crew):
    try:
        return next((m['name'] for m in ast.literal_eval(crew) if m['job'] == 'Director'), None)
    except:
        return None

df['director'] = df['crew'].map(extract_director)
df['vote_weighted'] = df['vote_average'] * np.log1p(df['vote_count'])

high_impact = df[(df['budget'] > 5e7) & (df['vote_weighted'] > df['vote_weighted'].quantile(0.75))]
top_directors = set(high_impact['director'].value_counts()[lambda x: x >= 5].index)
df['is_famous_director'] = df['director'].map(lambda x: int(x in top_directors))

director_avg = df.groupby('director')['revenue'].mean().to_dict()
df['director_avg_revenue'] = df['director'].map(lambda x: director_avg.get(x, 0) if x else 0)

def extract_cast_names(cast):
    try:
        return [d['name'] for d in ast.literal_eval(cast)]
    except:
        return []

df['cast_names'] = df['cast'].map(extract_cast_names)

actor_counter = Counter()
for names in high_impact['cast'].map(extract_cast_names):
    actor_counter.update(names)
top_actors = {name for name, count in actor_counter.items() if count >= 5}

df['has_famous_actor'] = df['cast_names'].map(lambda names: int(any(n in top_actors for n in names)))
df['famous_actor_count'] = df['cast_names'].map(lambda names: sum(n in top_actors for n in names))

#  Interaction terms
df['lang_non_en'] = (df['original_language'] != 'en').astype(int)
df['budget_x_popularity'] = df['log_budget'] * df['popularity']
df['budget_x_runtime'] = df['log_budget'] * df['runtime']
df['budget_x_vote_weighted'] = df['log_budget'] * df['vote_weighted']
df['franchise_x_budget'] = df['is_franchise'] * df['log_budget']
df['famous_director_x_budget'] = df['is_famous_director'] * df['log_budget']
df['famous_actor_x_budget'] = df['has_famous_actor'] * df['log_budget']
df['holiday_family_film'] = df['is_holiday_release'] * df.get('genre_Family', 0)
df['summer_action_film'] = df['is_summer_release'] * df.get('genre_Action', 0)
df['franchise_famous_actor'] = df['is_franchise'] * df['famous_actor_count']
df['franchise_famous_director'] = df['is_franchise'] * df['is_famous_director']
df['major_studio_budget'] = df['is_major_studio'] * df['log_budget']
df['weekend_release'] = (df['release_day_of_week'] >= 4).astype(int)
df['weekend_summer_release'] = df['weekend_release'] * df['is_summer_release']

#  Polynomial terms
df['log_budget_squared'] = df['log_budget'] ** 2
df['popularity_squared'] = df['popularity'] ** 2
df['vote_weighted_squared'] = df['vote_weighted'] ** 2
df['runtime_squared'] = df['runtime'] ** 2

df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
df['budget_x_runtime'] = df['log_budget'] * df['runtime']
df['runtime_squared'] = df['runtime'] ** 2

unused_cols = [
    'genre_list', 'cast_names', 'runtime_bucket', 'release_month', 'season',
    'belongs_to_collection', 'crew', 'cast', 'genres', 'homepage',
    'original_language', 'overview', 'production_companies',
    'production_countries', 'spoken_languages', 'tagline', 'title', 'keywords',
    'release_5yr', 'collection_name', 'production_company_list',
    'vote_average', 'vote_count', 'has_homepage', 'num_cast',
    'num_production_companies', 'popularity_per_year', 'budget_per_year']

df.drop(columns=[col for col in unused_cols if col in df.columns], inplace=True)

First, the movies and credits tables are merged on id and movie_id, combining the movie metadata with the cast and crew details. Genre names are parsed from the genres column, and a binary indicator column is created for each genre that appears at least 50 times, plus a genre_Other column that flags films with any less common genre.
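The parsing and indicator steps can be sketched on a few hand-written strings in the same format as the genres column. The helper name parse_names and the toy rows are hypothetical; unlike the notebook's helper, this version catches only the parse errors it expects.

```python
import ast
import pandas as pd

def parse_names(cell):
    """Return the 'name' values from a stringified list of dicts, or [] on failure."""
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [d["name"] for d in items if isinstance(d, dict) and "name" in d]

toy = pd.DataFrame({"genres": [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[{"id": 18, ',  # malformed on purpose
]})
toy["genre_list"] = toy["genres"].map(parse_names)

# One 0/1 indicator column per genre of interest.
for genre in ["Action", "Drama"]:
    toy[f"genre_{genre}"] = toy["genre_list"].map(lambda names, g=genre: int(g in names))

print(toy[["genre_list", "genre_Action", "genre_Drama"]])
```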

Several numerical transformations follow. Budget and popularity are log-transformed with log1p (producing log_budget and log_popularity) to reduce skewness. Budgets are bucketed into low, mid, and high, and runtimes into short, medium, and long; both sets of buckets are then one-hot encoded.
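A small sketch of the log transform and bucketing, using made-up budgets. One detail worth knowing: with its default settings pd.cut builds intervals that are open on the left, so a budget of exactly 0 falls outside the first bin and gets no bucket. Whether zero budgets remain after cleaning depends on the data (the cleaning filter sets no lower bound on budget); include_lowest=True is shown as one possible way to handle them, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets in dollars.
budgets = pd.Series([0, 10_000_000, 50_000_000, 200_000_000])

# log1p(x) = log(1 + x), so a zero budget maps to 0 instead of -inf.
log_budget = np.log1p(budgets)

bins = [0, 2e7, 1e8, np.inf]
labels = ["low", "mid", "high"]

# Default intervals: (0, 2e7], (2e7, 1e8], (1e8, inf]. A value of exactly 0 becomes NaN.
default_bucket = pd.cut(budgets, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in "low".
inclusive_bucket = pd.cut(budgets, bins=bins, labels=labels, include_lowest=True)

# One-hot encode the buckets, as the notebook does with get_dummies.
dummies = pd.get_dummies(inclusive_bucket, prefix="budget")

print(pd.DataFrame({"budget": budgets, "default": default_bucket, "inclusive": inclusive_bucket}))
```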

The release_date column is parsed to obtain the year, month, and day of the week (a missing day of the week defaults to 4, i.e. Friday), along with seasonal indicators such as is_summer_release, is_holiday_release, and is_spring_break. Months and 5-year periods are also one-hot encoded.
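The date features can be sketched on three sample strings (two valid dates and one deliberately invalid entry). The five_year_bin function is copied from the notebook; the sample dates are assumptions for illustration.

```python
import pandas as pd

dates = pd.Series(["2009-12-10", "2015-05-15", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
release = pd.to_datetime(dates, errors="coerce")

month = release.dt.month
# dayofweek: Monday=0 ... Sunday=6; missing values default to 4 (Friday), as in the notebook.
day_of_week = release.dt.dayofweek.fillna(4).astype(int)

is_summer = month.isin([5, 6, 7, 8]).astype(int)
is_weekend = (day_of_week >= 4).astype(int)   # Friday, Saturday, Sunday

def five_year_bin(year):
    if pd.isna(year):
        return "Unknown"
    if year < 1980:
        return "Before 1980"
    base = int(year // 5 * 5)
    return f"{base}-{base + 4}"

period = release.dt.year.map(five_year_bin)

print(pd.DataFrame({"dow": day_of_week, "summer": is_summer,
                    "weekend": is_weekend, "period": period}))
```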

The production_companies column is used to flag whether a movie was produced by a major studio (is_major_studio, plus one indicator per studio name). The average revenue of each production company is also computed, and each film receives the highest average among its companies as studio_avg_revenue. If a movie belongs to a franchise (from belongs_to_collection), that is recorded in is_franchise, and the franchise's average revenue is computed in the same way; when the column is absent, both franchise features are set to 0.

The crew column is parsed to extract each film's director. Directors with at least five high-impact films (budget above $50 million and a vote_weighted score in the top quartile) are marked as famous, which yields is_famous_director; director_avg_revenue stores each director's mean revenue. Actors who appear in at least five high-impact films are identified in the same way, which yields has_famous_actor and famous_actor_count.
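One caveat applies to the average-revenue features: they are computed on the full table before the train/test split, so each value includes the film's own revenue and the revenue of films that later land in the test set. This can leak the target into the features. A minimal sketch of a train-only variant follows; the toy data, the positional split, and the fallback to the global training mean for unseen directors are assumptions for illustration (the notebook itself falls back to 0).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "C", "A"],
    "revenue": [100.0, 300.0, 50.0, 150.0, 80.0, 500.0],
})

# Hypothetical split: the first four rows are training data, the last two are test data.
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the per-director averages from training rows only.
director_avg = train.groupby("director")["revenue"].mean()
global_avg = train["revenue"].mean()

# Apply the learned mapping; directors unseen in training get the global training mean.
train["director_avg_revenue"] = train["director"].map(director_avg)
test["director_avg_revenue"] = test["director"].map(director_avg).fillna(global_avg)

print(test)
```

Training rows still see their own revenue in this sketch; computing the averages out-of-fold would be one way to reduce that as well.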

Interaction terms are added to capture relationships between variables (e.g., log_budget * vote_weighted or is_summer_release * genre_Action). Squared terms for log budget, popularity, runtime, and the vote-weighted score are included to model non-linear effects.

Finally, any missing runtime values are replaced with the column mean, the runtime-dependent features are recomputed, and columns that are no longer needed (e.g., free-text fields, raw list columns, and original categorical fields) are dropped.

This transformation turns the raw movie data into an organized feature set suitable for regression modeling.

Feature Selection/Definition

In [23]:
from sklearn.model_selection import train_test_split

# Target
y = df['revenue']

# exists
if 'is_franchise' not in df.columns:
    df['is_franchise'] = 0

# Define
feature_groups = {
    "base": [
        'log_budget', 'popularity', 'runtime',
        'is_famous_director', 'has_famous_actor', 'vote_weighted', 'is_franchise',
        'director_avg_revenue', 'studio_avg_revenue', 'franchise_avg_revenue',
        'famous_actor_count', 'is_major_studio'
    ],
    "seasonality": [
        'release_day_of_week', 'is_summer_release',
        'is_holiday_release', 'is_spring_break', 'weekend_release'
    ],
    "month": [col for col in df.columns if col.startswith('month_')],
    "studio": [col for col in df.columns if col.startswith('studio_') and col != 'studio_avg_revenue'],
    "interactions": [
        'budget_x_popularity', 'budget_x_runtime', 'budget_x_vote_weighted',
        'franchise_x_budget', 'famous_director_x_budget', 'famous_actor_x_budget',
        'holiday_family_film', 'summer_action_film', 'franchise_famous_actor',
        'franchise_famous_director', 'major_studio_budget', 'weekend_summer_release'
    ],
    "polynomials": [
        'log_budget_squared', 'popularity_squared',
        'vote_weighted_squared', 'runtime_squared'
    ],
    "runtime": [col for col in df.columns if col.startswith('runtime_')],
    "language": ['lang_non_en'] if 'lang_non_en' in df.columns else [],
    "genres": [col for col in df.columns if col.startswith('genre_')],
    "season_flags": [col for col in df.columns if col.startswith('season_')],
    "budget_bucket": [col for col in df.columns if col.startswith('budget_')],
    "period": [col for col in df.columns if col.startswith('period_')]
}

# Flatten
feature_cols = [feat for group in feature_groups.values() for feat in group]

# Subset data
X = df[feature_cols]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


This cell sets revenue as the target y and organizes the features into logical categories in a dictionary called feature_groups. The groups cover core features, seasonal indicators, one-hot encoded months, studio indicators, interaction terms, polynomial terms, runtime buckets, a language indicator, genres, budget buckets, and release periods. Several groups are built by filtering column names on a shared prefix such as month_ or genre_. All selected features are then flattened into a single list called feature_cols.

With the complete feature matrix X and the target y, the dataset is split into training and testing subsets using train_test_split, setting aside 20% of the data for testing. A fixed random seed (random_state=42) is used for reproducibility. This split is a common step before training machine learning models, ensuring performance is evaluated on unseen data.
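Prefix matching needs some care here: budget_ and runtime_ are also prefixes of interaction and polynomial names such as budget_x_popularity and runtime_squared, so those columns can be selected by two groups and appear twice in the flattened list. The sketch below reproduces the effect on a shortened, hypothetical column list and shows one order-preserving way to remove the duplicates.

```python
# Hypothetical, shortened column list for illustration.
columns = ["budget_low", "budget_mid", "budget_high",
           "budget_x_popularity", "runtime_short", "runtime_squared"]

feature_groups = {
    "interactions": ["budget_x_popularity"],
    "polynomials": ["runtime_squared"],
    "runtime": [c for c in columns if c.startswith("runtime_")],
    "budget_bucket": [c for c in columns if c.startswith("budget_")],
}

# Flatten exactly as the notebook does.
feature_cols = [f for group in feature_groups.values() for f in group]

# Names selected by more than one group.
duplicates = sorted({f for f in feature_cols if feature_cols.count(f) > 1})

# dict.fromkeys keeps the first occurrence of each name and preserves order.
unique_cols = list(dict.fromkeys(feature_cols))

print("duplicates:", duplicates)
print("unique:", unique_cols)
```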


In [26]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

# Target variable
y = df['revenue']

# exists
if 'is_franchise' not in df.columns:
    df['is_franchise'] = 0

# Drop
df = df.drop(columns=['budget_bucket', 'runtime_bucket'], errors='ignore')

#  Feature
feature_groups = {
    "base": [
        'log_budget', 'popularity', 'runtime',
        'is_famous_director', 'has_famous_actor', 'vote_weighted', 'is_franchise',
        'director_avg_revenue', 'studio_avg_revenue', 'franchise_avg_revenue',
        'famous_actor_count', 'is_major_studio'
    ],
    "seasonality": [
        'release_day_of_week', 'is_summer_release',
        'is_holiday_release', 'is_spring_break', 'weekend_release'
    ],
    "month": [col for col in df.columns if col.startswith('month_')],
    "studio": [col for col in df.columns if col.startswith('studio_') and col != 'studio_avg_revenue'],
    "interactions": [
        'budget_x_popularity', 'budget_x_runtime', 'budget_x_vote_weighted',
        'franchise_x_budget', 'famous_director_x_budget', 'famous_actor_x_budget',
        'holiday_family_film', 'summer_action_film', 'franchise_famous_actor',
        'franchise_famous_director', 'major_studio_budget', 'weekend_summer_release'
    ],
    "polynomials": [
        'log_budget_squared', 'popularity_squared',
        'vote_weighted_squared', 'runtime_squared'
    ],
    "runtime": [col for col in df.columns if col.startswith('runtime_')],
    "language": ['lang_non_en'] if 'lang_non_en' in df.columns else [],
    "genres": [col for col in df.columns if col.startswith('genre_')],
    "season_flags": [col for col in df.columns if col.startswith('season_')],
    "budget_bucket": [col for col in df.columns if col.startswith('budget_')],
    "period": [col for col in df.columns if col.startswith('period_')]
}

# Combine
feature_cols = [feat for group in feature_groups.values() for feat in group]

# Subset X matrix
X = df[feature_cols].copy()

non_numeric_cols = X.select_dtypes(include=['object', 'category']).columns
if len(non_numeric_cols) > 0:
    print("Dropping non-numeric columns:", list(non_numeric_cols))
    X.drop(columns=non_numeric_cols, inplace=True)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Get coefficients
coeffs = pd.Series(lr_model.coef_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (Linear Coefficients):")
print(coeffs)


Feature Importances (Linear Coefficients):
budget_high             9.282254e+07
budget_mid              2.699072e+07
budget_low              2.161976e+07
period_1990-1994        2.069754e+07
studio_sony_pictures    1.381626e+07
                            ...     
period_2015-2019       -1.277751e+07
has_famous_actor       -1.321004e+07
genre_Western          -1.665816e+07
is_major_studio        -3.194385e+07
is_famous_director     -9.457685e+07
Length: 97, dtype: float64


This cell finalizes feature selection, handles data cleanup, and fits a linear regression model to predict movie revenue. First, it defines revenue as the target variable y and ensures the is_franchise column exists to avoid missing feature issues. Then, it drops two raw categorical columns—budget_bucket and runtime_bucket—if they exist, since those features are already encoded numerically elsewhere.

Next, the code defines feature_groups, which organizes features into logical categories such as base features, seasonality indicators, interaction terms, polynomial transformations, studio identifiers, and genre encodings. Columns with specific prefixes (e.g., month_, genre_, studio_) are dynamically selected. These groups are then flattened into a single list of features called feature_cols.

Using this feature list, the matrix X is created by subsetting the main DataFrame. To prevent training errors, any columns with non-numeric data types are identified and removed. A warning is printed if any such columns are dropped.

Afterward, the dataset is split into training and testing sets using train_test_split, reserving 20% of the data for evaluation, with a fixed random seed for reproducibility. A linear regression model is then trained on the training data, and its learned coefficients are extracted, sorted from most positive to most negative, and displayed. Each coefficient gives the direction and size of a feature's effect on predicted revenue in that feature's own units, so coefficients of features on different scales (for example, a 0/1 flag versus a squared runtime) are not directly comparable as importances.
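The dependence of coefficients on feature scale can be seen in a small NumPy-only sketch. The data are synthetic and noise-free, and ols_coefficients is a hypothetical helper, not part of the notebook. Two features contribute equally to the target, yet their raw coefficients differ by a factor of 1000; after standardizing the features the coefficients agree.

```python
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])             # small-scale feature
x2 = 1000.0 * np.array([1.0, 0.0, 3.0, 2.0, 5.0, 4.0])    # same spread, 1000x the scale
y = 2.0 * x1 + 0.002 * x2                                  # equal contribution per standard deviation

def ols_coefficients(features, target):
    """Least-squares slopes; an intercept is fitted but not returned."""
    design = np.column_stack([np.ones(len(target)), features])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[1:]

X_raw = np.column_stack([x1, x2])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # zero mean, unit variance

raw_coefs = ols_coefficients(X_raw, y)
std_coefs = ols_coefficients(X_std, y)

print("raw coefficients:         ", raw_coefs)
print("standardized coefficients:", std_coefs)
```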

Linear Regression

In [30]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score
)

#   Training
model = LinearRegression()
model.fit(X_train, y_train)

#   Prediction
y_pred = model.predict(X_test)
residuals = y_test - y_pred

#   Metrics
metrics = {
    "MAE": mean_absolute_error(y_test, y_pred),
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    "R²": r2_score(y_test, y_pred),
    "RSE": np.sqrt(np.sum(residuals ** 2) / (len(y_test) - 2)),
    "Mean Residual": residuals.mean()
}

#  Output
print("\nLinear Regression Evaluation:")
for key, value in metrics.items():
    if key == "R²":
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value:.2f}")

#  Bias
direction = "Underpredicting" if metrics["Mean Residual"] > 0 else "Overpredicting"
print(f"Mean Residual: {metrics['Mean Residual']:.2f} — {direction} on average")



Linear Regression Evaluation:
MAE: 36179445.72
RMSE: 56690486.65
R²: 0.7686
RSE: 56753511.11
Mean Residual: 2139281.73
Mean Residual: 2139281.73 — Underpredicting on average


This cell trains a Linear Regression model and evaluates it on the test set. After the imports, a LinearRegression instance is fit on the training data (X_train, y_train). Predictions are then made on the test set, and residuals are computed as the actual revenue minus the predicted revenue (y_test - y_pred).

Several metrics are calculated to assess accuracy: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), the coefficient of determination (R²), Residual Standard Error (RSE), and the mean residual. The RSE here divides the residual sum of squares by n - 2, where n is the number of test rows.

The results are stored in a dictionary and printed. The sign of the mean residual shows whether the model tends to underpredict (positive) or overpredict (negative) revenue on average, a directional bias that MAE and RMSE cannot reveal because they ignore the sign of the errors. On this test set the model reaches R² = 0.7686 with an RMSE of about $56.7 million, and it underpredicts by about $2.1 million on average.
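For reference, the textbook RSE for a model with p predictors divides the residual sum of squares by n - p - 1; the n - 2 used in the notebook corresponds to p = 1. The sketch below implements both RMSE and RSE on made-up numbers. The function names and values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(RSS / n)."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def residual_standard_error(y_true, y_pred, n_predictors):
    """Residual standard error: sqrt(RSS / (n - p - 1))."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    dof = len(residuals) - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return float(np.sqrt(np.sum(residuals ** 2) / dof))

# Made-up values: one prediction is off by 2, the rest are exact.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 7.0]

mean_residual = float(np.mean(np.array(y_true) - np.array(y_pred)))

print("RMSE:", rmse(y_true, y_pred))
print("RSE (p=1):", residual_standard_error(y_true, y_pred, n_predictors=1))
print("Mean residual:", mean_residual, "(negative means overpredicting)")
```

With about 900 test rows, dividing by n - 2 instead of n changes the result only slightly, which is why the reported RSE and RMSE are nearly identical.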


K-Nearest Neighbors

In [31]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
k = 5
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_train_scaled, y_train)

# Predict
y_pred_knn = knn.predict(X_test_scaled)
residuals_knn = y_test - y_pred_knn

#  Metrics
metrics_knn = {
    "MAE": mean_absolute_error(y_test, y_pred_knn),
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred_knn)),
    "R²": r2_score(y_test, y_pred_knn),
    "RSE": np.sqrt(np.sum(residuals_knn ** 2) / (len(y_test) - 2)),
    "Mean Residual": residuals_knn.mean()
}

#  Output
print("\nKNN Regression Evaluation:")
for metric, value in metrics_knn.items():
    if metric == "R²":
        print(f"{metric}: {value:.4f}")
    else:
        print(f"{metric}: {value:.2f}")

bias_direction = "Underpredicting" if metrics_knn["Mean Residual"] > 0 else "Overpredicting"
print(f"Mean Residual: {metrics_knn['Mean Residual']:.2f} — {bias_direction} on average")


KNN Regression Evaluation:
MAE: 41413723.35
RMSE: 73888717.60
R²: 0.6070
RSE: 73970861.83
Mean Residual: 11513964.25
Mean Residual: 11513964.25 — Underpredicting on average


This cell evaluates a K-Nearest Neighbors (KNN) regression model on the movie revenue prediction task. First, it applies feature standardization using StandardScaler to ensure that all input variables contribute equally to distance calculations, which is critical for distance-based models like KNN. The scaler is fit on the training data and then used to transform both training and test feature sets.

Next, a KNeighborsRegressor is initialized with k = 5 neighbors and trained on the scaled training data. Predictions are made on the standardized test set, and residuals are calculated by subtracting the predicted revenues from the actual values.

Evaluation metrics are then computed to assess the model's performance. These include:

- Mean Absolute Error (MAE): the average magnitude of prediction errors.
- Root Mean Squared Error (RMSE): penalizes larger errors more heavily.
- R² Score: indicates the proportion of variance in revenue explained by the model.
- Residual Standard Error (RSE): a measure of prediction dispersion.
- Mean Residual: used to diagnose average prediction bias.

The output lists these metrics, and the code concludes by reporting whether the model underpredicts or overpredicts revenue on average, based on the sign of the mean residual. On this split KNN performs worse than linear regression (R² of 0.6070 versus 0.7686) and underpredicts by about $11.5 million on average.
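To make the mechanics concrete, here is a minimal NumPy sketch of KNN regression with standardization. It is an illustration under simplifying assumptions (brute-force Euclidean distances, unweighted mean of the neighbors) and is not the scikit-learn implementation; knn_predict and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict by averaging the targets of the k nearest standardized training points."""
    # Scaling statistics come from the training data only, as with StandardScaler.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    train_scaled = (X_train - mean) / std
    query_scaled = (X_query - mean) / std

    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = query_scaled[:, None, :] - train_scaled[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)

    # Indices of the k closest training points for each query row.
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Toy data: the second feature has a much larger scale than the first.
X_train = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 400.0], [4.0, 420.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])
X_query = np.array([[2.0, 105.0]])

prediction = knn_predict(X_train, y_train, X_query, k=2)
print(prediction)
```

Without the scaling step, the second feature would dominate every distance simply because its values are larger.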

Random Forest Regression

In [32]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd

# Train full
rf_full = RandomForestRegressor(
    n_estimators=200,
    max_depth=25,
    min_samples_split=5,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf_full.fit(X_train, y_train)

# top 20
importances = pd.Series(rf_full.feature_importances_, index=X_train.columns)
top_features = importances.nlargest(20).index.tolist()

# Restrict datasets
X_train_top = X_train[top_features]
X_test_top = X_test[top_features]

#Retrain model
rf_top = RandomForestRegressor(
    n_estimators=200,
    max_depth=25,
    min_samples_split=5,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf_top.fit(X_train_top, y_train)

# Predict
y_pred = rf_top.predict(X_test_top)
residuals = y_test - y_pred

metrics_rf = {
    "MAE": mean_absolute_error(y_test, y_pred),
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    "R²": r2_score(y_test, y_pred),
    "RSE": np.sqrt(np.sum(residuals ** 2) / (len(y_test) - 2)),
    "Mean Residual": residuals.mean()
}

# Output report
print("\nRandom Forest Evaluation on Top 20 Features:")
for metric, value in metrics_rf.items():
    if metric == "Mean Residual":
        continue  # printed once below, together with the bias note
    if metric == "R²":
        print(f"{metric}: {value:.4f}")
    else:
        print(f"{metric}: {value:.2f}")

bias_note = "Underpredicting" if metrics_rf["Mean Residual"] > 0 else "Overpredicting"
print(f"Mean Residual: {metrics_rf['Mean Residual']:.2f} — {bias_note} on average")


Random Forest Evaluation on Top 20 Features:
MAE: 30755676.05
RMSE: 57862392.42
R²: 0.7590
RSE: 57926719.72
Mean Residual: 2402329.43 — Underpredicting on average


This cell uses a Random Forest regression model to predict movie revenue and to identify the features that most influence its predictions. We start by training a RandomForestRegressor on all features, with 200 trees and a maximum depth of 25. The parameters min_samples_split=5, min_samples_leaf=1, and max_features='sqrt' are chosen to balance model flexibility against overfitting.

Once the model is trained on all features from the training set, we rank the features by importance and keep the 20 highest. We then restrict both the training and test sets to these 20 features, which gives a smaller feature set that is easier to interpret.
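The selection step can be sketched without scikit-learn as a simple ranking over importance scores. The notebook itself does this with pandas (nlargest on a Series of feature_importances_); the helper name top_k_features, the feature names, and the importance values below are hypothetical and chosen only for illustration.

```python
def top_k_features(importances, k):
    """Return the k feature names with the largest importance, highest first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def restrict_rows(rows, feature_names):
    """Keep only the selected features in each row (a dict of feature -> value)."""
    return [{name: row[name] for name in feature_names} for row in rows]

# Hypothetical importance scores, not values from the notebook
importances = {
    "budget": 0.41,
    "popularity": 0.27,
    "runtime": 0.08,
    "release_year": 0.05,
    "genre_Drama": 0.02,
}

selected = top_k_features(importances, k=3)

# Hypothetical single-row dataset, restricted to the selected features
rows = [{"budget": 1.0, "popularity": 2.0, "runtime": 3.0,
         "release_year": 4.0, "genre_Drama": 0.0}]
reduced = restrict_rows(rows, selected)
print(selected, reduced)
```

The same list of selected names is applied to both the training and test sets, so the two stay aligned column for column.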

We then train a new Random Forest model with just these top 20 features and make predictions on the test data. The performance is evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R² score, Residual Standard Error (RSE), and Mean Residual.

Once the metrics are printed, we check the sign of the mean residual to detect prediction bias. Residuals are computed as actual minus predicted, so the positive mean residual here (about 2.4 million) indicates that the model underpredicts revenue on average. On the test set, the reduced model reaches an R² of 0.7590.


Model Cross-Validation

In [33]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import pandas as pd

# Top 20 features by importance (model_top_20 is assumed to be a Random Forest fitted earlier in the notebook)
importances = pd.Series(model_top_20.feature_importances_, index=X_train_top.columns)
top_20 = importances.nlargest(20).index.tolist()

# Set up cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['r2', 'neg_mean_squared_error', 'neg_mean_absolute_error']
cv_results = {}

print("\nCross-validation across models (5-Fold):")


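def run_cv(model_name, model, X_input):
    """Run 5-fold CV for each metric in `scoring` and store the results in cv_results."""
    results = {}
    for metric in scoring:
        # Uses the full target vector y and the shared KFold splitter kf
        scores = cross_val_score(model, X_input, y, cv=kf, scoring=metric)
        if metric == 'neg_mean_squared_error':
            # scikit-learn reports negated errors, so flip the sign before the square root
            results['rmse'] = np.sqrt(-scores)
            results['avg_rmse'] = np.sqrt(-scores.mean())  # RMSE of the mean fold MSE
        elif metric == 'neg_mean_absolute_error':
            results['mae'] = -scores
            results['avg_mae'] = -scores.mean()
        else:
            results['r2'] = scores
            results['avg_r2'] = scores.mean()
    cv_results[model_name] = results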

# Linear
run_cv("Linear Regression", LinearRegression(), X)

# KNN
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
run_cv("KNN", knn_pipe, X)

# Random Forest
X_top = X[top_20]
rf_model = RandomForestRegressor(
    n_estimators=200, max_depth=25, min_samples_split=5,
    min_samples_leaf=1, max_features='sqrt', random_state=42, n_jobs=-1
)
run_cv("Random Forest", rf_model, X_top)

# Fit a model, predict on the test set, and return test metrics (RSE uses n - num_features - 1 degrees of freedom)
def evaluate_model(name, model, X_tr, X_te, y_tr, y_te, num_features):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    residuals = y_te - y_pred
    return {
        'name': name,
        'mae': mean_absolute_error(y_te, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_te, y_pred)),
        'r2': r2_score(y_te, y_pred),
        'rse': np.sqrt(np.sum(residuals ** 2) / (len(y_te) - num_features - 1)),
        'mean_residual': residuals.mean(),
        'bias': "Underpredicting" if residuals.mean() > 0 else "Overpredicting"
    }

test_results = []

# Linear Regression
test_results.append(
    evaluate_model("Linear Regression (all features)", LinearRegression(), X_train, X_test, y_train, y_test, X_train.shape[1])
)

# KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
test_results.append(
    evaluate_model("KNN Regression (all features)", KNeighborsRegressor(n_neighbors=5), X_train_scaled, X_test_scaled, y_train, y_test, X_train.shape[1])
)

# Random Forest
test_results.append(
    evaluate_model("Random Forest (top 20 features)", rf_model, X_train[top_20], X_test[top_20], y_train, y_test, len(top_20))
)

# Test Set Evaluations
for res in test_results:
    print(f"\n{res['name']} Evaluation:")
    print(f"MAE: {res['mae']:.2f}")
    print(f"RMSE: {res['rmse']:.2f}")
    print(f"R²: {res['r2']:.4f}")
    print(f"RSE: {res['rse']:.2f}")
    print(f"Mean Residual: {res['mean_residual']:.2f} — {res['bias']} on average")

# Compare the average CV R² with the test R² for each model
print("\nCross-Validation vs. Test R² Comparison:")
for model_name, res in zip(cv_results.keys(), test_results):
    cv_r2 = cv_results[model_name]['avg_r2']
    test_r2 = res['r2']
    print(f"{model_name} - CV R²: {cv_r2:.4f}, Test R²: {test_r2:.4f}")
    if cv_r2 > test_r2:
        print("  → Test performance was worse than CV estimate")
    elif cv_r2 < test_r2:
        print("  → Test performance was better than CV estimate")
    else:
        print("  → Test performance matched CV estimate")



Cross-validation across models (5-Fold):

Linear Regression (all features) Evaluation:
MAE: 36179445.72
RMSE: 56690486.65
R²: 0.7686
RSE: 60050248.37
Mean Residual: 2139281.73 — Underpredicting on average

KNN Regression (all features) Evaluation:
MAE: 41413723.35
RMSE: 73888717.60
R²: 0.6070
RSE: 78267732.48
Mean Residual: 11513964.25 — Underpredicting on average

Random Forest (top 20 features) Evaluation:
MAE: 31242194.72
RMSE: 58414129.46
R²: 0.7544
RSE: 59107006.96
Mean Residual: 2507882.33 — Underpredicting on average

Cross-Validation vs. Test R² Comparison:
Linear Regression - CV R²: 0.7482, Test R²: 0.7686
  → Test performance was better than CV estimate
KNN - CV R²: 0.6102, Test R²: 0.6070
  → Test performance was worse than CV estimate
Random Forest - CV R²: 0.7542, Test R²: 0.7544
  → Test performance was better than CV estimate


This code cell runs k-fold cross-validation for the Linear Regression, KNN, and Random Forest models. We first extract the top 20 features by importance from a pre-trained 100-tree Random Forest, since repeating the tuning process is very time-consuming. We use KFold with cross_val_score to split the data into 5 folds, so that each fold serves once as the validation set, and the run_cv() function computes R², RMSE, and MAE for each model. We then compute the RSE and mean residual on a separate test set with the evaluate_model() function to check whether each model systematically over- or underpredicts revenue. Lastly, we compare the average CV R² with the test R² as a check for overfitting: a test R² well below the CV estimate would be a warning sign. Here the two agree closely for all three models (0.7482 vs. 0.7686 for Linear Regression, 0.6102 vs. 0.6070 for KNN, and 0.7542 vs. 0.7544 for Random Forest).
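To make the fold mechanics concrete, here is a simplified pure-Python sketch of shuffled k-fold splitting. It is not scikit-learn's implementation: KFold assigns contiguous blocks of the shuffled indices, while this sketch uses a strided assignment for brevity. The function names and the per-fold MSE values are hypothetical. The second helper shows that averaging the per-fold RMSE values and taking the square root of the mean MSE (which is how run_cv() computes avg_rmse) are not the same number.

```python
import math
import random

def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided assignment of shuffled indices to folds (a simplification)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def aggregate_rmse(fold_mse):
    """Return (mean of per-fold RMSE, RMSE of the mean fold MSE)."""
    per_fold_rmse = [math.sqrt(m) for m in fold_mse]
    mean_of_rmse = sum(per_fold_rmse) / len(per_fold_rmse)
    rmse_of_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))
    return mean_of_rmse, rmse_of_mean_mse

splits = list(k_fold_indices(n_samples=10, n_splits=5, seed=42))
for train, validation in splits:
    print(len(train), len(validation))

# Hypothetical per-fold MSE values, for illustration only
mean_of_rmse, rmse_of_mean_mse = aggregate_rmse([4.0, 16.0])
print(mean_of_rmse, rmse_of_mean_mse)
```

Because the square root is concave, the RMSE of the mean MSE is never smaller than the mean of the per-fold RMSE values, so the two should not be compared directly across reports.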


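In conclusion, we built a machine learning pipeline to predict movie revenue from features in the TMDB dataset. We trained Linear Regression, K-Nearest Neighbors, and Random Forest models and evaluated them with cross-validation and a held-out test set. Random Forest had the lowest test MAE (about 31.2 million) and the highest average CV R² (0.7542), while Linear Regression had the highest test R² (0.7686) and the smallest mean residual; KNN trailed both on every metric. All three models had a positive mean residual, meaning they underpredict revenue on average.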
