# Results of Project and Discussion

## 1) Project Recap

Before discussing the results, we briefly recap the project and the stages we worked through over the last four weeks.

### 1.1) Research Question

**Original Research Question:** Assemble the next blockbuster film.

**Modified Research Question:** Can we predict whether or not a movie will be a blockbuster, based on its pre-release characteristics?

**Imagined Scenario:**
A client comes to us with a movie proposition. It is up to us to tell them, based on this movie's features, whether it will be a blockbuster. We base our classification on gross income and IMDb rating.

|                        | **Low Gross Income**        | **High Gross Income**                                   |
|------------------------|-----------------------------|----------------------------------------------------------|
| **Low IMDB Rating**    | Flop                        | Critically–Disliked Blockbuster                          |
| **High IMDB Rating**   | Hidden Gem                  | ***Critically–Acclaimed Blockbuster***                   |
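
The quadrant table can be expressed as a small labelling function. This is a minimal sketch rather than the exact implementation: the function name and the rating cutoff of 6.5 are hypothetical, and the default gross threshold of 1 million dollars mirrors the 1M threshold used in the Results section.

```python
def label_movie(gross, rating, gross_threshold=1_000_000, rating_threshold=6.5):
    """Map gross income and IMDb rating to one of the four quadrants.

    rating_threshold=6.5 is an illustrative assumption, not a value from the project.
    """
    high_gross = gross > gross_threshold
    high_rating = rating >= rating_threshold
    if high_gross and high_rating:
        return "Critically-Acclaimed Blockbuster"
    if high_gross:
        return "Critically-Disliked Blockbuster"
    if high_rating:
        return "Hidden Gem"
    return "Flop"


# Made-up examples, one per quadrant.
for gross, rating in [(5_000_000, 8.0), (5_000_000, 4.0), (200_000, 8.0), (200_000, 4.0)]:
    print(gross, rating, label_movie(gross, rating))
```

Passing `gross_threshold=100_000_000` gives the stricter 100M definition of a blockbuster.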


### 1.2) Data Acquisition

Our data was obtained by web scraping the IMDb website. We initially used two documents: one containing the title and information for 11.4 million movies, and a second with the rating and number of votes for 1.5 million movies. These documents were merged, and only the 1.5 million movies with ratings were kept. Additional cleaning removed titles that weren’t movies and kept only movies released after 2020. This was part of the pre-scraping process, which left us with data on 54 thousand movies.

Finally, we scraped the IMDb website using those 54k movie titles and ended up with a dataset of 54,095 movies and 16 columns. These included runtime, genres, storyline, themes, director, actor, gross income, and average IMDb rating, all used as features for our model.

We dropped rows with null entries in the gross income column, as well as the columns: tconst, titleType, Unnamed, originalTitle, and endYear.
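
The merge-and-filter steps can be sketched with pandas as follows. The tiny tables and their values are made up for illustration, and the column names `startYear`, `primaryTitle`, `averageRating`, `numVotes`, and `grossWorldwide` are assumptions about the schema; only `tconst`, `titleType`, `originalTitle`, and `endYear` are named in the text above.

```python
import pandas as pd

# Tiny stand-ins for the two source tables (all values are made up).
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6"],
    "titleType": ["movie", "tvSeries", "movie", "movie", "movie", "movie"],
    "primaryTitle": ["A", "B", "C", "D", "E", "F"],
    "originalTitle": ["A", "B", "C", "D", "E", "F"],
    "startYear": [2021, 2022, 2019, 2023, 2022, 2024],
    "endYear": [None] * 6,
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt6"],
    "averageRating": [7.1, 6.0, 8.2, 5.4, 6.8],
    "numVotes": [1200, 300, 50000, 800, 150],
})

# An inner merge keeps only the titles that have a rating.
merged = titles.merge(ratings, on="tconst", how="inner")

# Keep movies only, released after 2020.
is_recent_movie = (merged["titleType"] == "movie") & (merged["startYear"] > 2020)
candidates = merged[is_recent_movie].copy()

# Stand-in for the scraping step: attach a gross income per remaining title.
scraped = candidates.assign(grossWorldwide=[2_500_000.0, float("nan"), 400_000.0])

# Drop rows without gross income, then drop the unused columns.
clean = (
    scraped.dropna(subset=["grossWorldwide"])
    .drop(columns=["tconst", "titleType", "originalTitle", "endYear"], errors="ignore")
)
print(clean)
```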

---
### 1.3) Exploratory Data Analysis

Our goal during the Exploratory Data Analysis (EDA) was to understand the data's structure, patterns, and relationships between features.
We started by building frequency distribution graphs for each feature, then explored deeper relationships between variables.
The correlation matrix for numerical values showed the strongest relationship between number of votes and gross income, with a correlation of 0.63. This makes sense: movies with more votes tend to be more widely seen and thus make more money.

All other variables showed weak correlation with rating, suggesting that rating is subjective and influenced by external factors like marketing and viewer demographics.

We also manually inspected non-numeric features, exploring how actor and director names relate to both gross income and rating, and whether specific actor pairs influenced a movie’s success.

This manual inspection helped us spot some inconsistencies, which in turn guided further preprocessing steps, aiding in variable selection and model design.

EDA Steps:
- Checked dataset shape, size, dimensions, and types
- Summary statistics: distributions, central tendencies
- Visualizations: histograms, box plots, scatter plots
- Correlation matrix and heatmap for variable relationships

---
### 1.4) Preprocessing

During preprocessing, we made many decisions regarding dropping columns, encoding, and feature creation. We dropped the number of votes column because using it would leak information: votes accumulate after release, so a model trained on them would rely on data that is unavailable when predicting from pre-release characteristics.

We used one-hot encoding to encode genres into 25 categories, as genres often co-occur and show some degree of covariance. We also created new features from natural language columns like storyline, themes, and primary title. For columns that weren’t categorical or numeric, we applied dimensionality reduction, topic modeling, and sentiment analysis.
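
Because a movie can belong to several genres at once, the encoding produces one indicator column per genre with possibly several ones per row. A minimal sketch, assuming genres are stored as comma-separated strings (the storage format and the sample rows are assumptions):

```python
import pandas as pd

# Made-up rows; the comma-separated format is an assumption.
movies = pd.DataFrame({
    "primaryTitle": ["A", "B", "C"],
    "genres": ["Action,Comedy", "Drama", "Action,Drama,Thriller"],
})

# One 0/1 column per genre; a row can have several ones.
genre_flags = movies["genres"].str.get_dummies(sep=",")

# Replace the raw genres column with the indicator columns.
encoded = pd.concat([movies.drop(columns="genres"), genre_flags], axis=1)
print(encoded)
```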

*Topic and Sentiment Analysis*

We performed topic analysis on text-heavy columns that aren’t easily encoded, such as themes, primary title, and storyline. These fields had almost as many unique values as there were movies, so one-hot encoding wasn't feasible. However, they contain valuable information, which is why we opted for topic modeling and sentiment analysis instead.

*Actor and Director*

Preprocessing actor and director data followed a similar approach. The actor column, originally a list of strings, was split into multiple rows so that each actor had their own row. Since our model can only interpret numerical values, we had to convert actor and director names accordingly.
We did this by averaging the gross worldwide revenue for each actor, which turned names into numerical values usable by the model. The same was done for directors, who were already listed in individual rows.

For actors or directors in the test set who didn’t appear in the training set, we created a default value using a custom function. This function took all actors who appeared in only one movie and averaged their gross income and rating to assign a fallback value.
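
The actor encoding and its fallback can be sketched as below. This is an illustration under assumptions, not the exact function we used: the column names, the sample values, the helper `encode_cast`, and the choice to average the per-actor values across a movie's cast are all hypothetical. Only gross income is shown; rating would be handled the same way.

```python
import pandas as pd

# Made-up training data; "actors" holds a list of names per movie.
train = pd.DataFrame({
    "title": ["A", "B", "C"],
    "actors": [["Ann", "Bob"], ["Ann", "Cy"], ["Dee"]],
    "grossWorldwide": [10.0, 30.0, 2.0],
})

# One row per (movie, actor) pair.
exploded = train.explode("actors")

# Average gross and number of films per actor, computed on training data only.
stats = exploded.groupby("actors")["grossWorldwide"].agg(["mean", "count"])
actor_avg = stats["mean"]

# Fallback for unseen actors: the average over actors with exactly one film.
fallback = stats.loc[stats["count"] == 1, "mean"].mean()


def encode_cast(actors):
    """Average the per-actor values of a cast, using the fallback for unknown names."""
    values = [actor_avg.get(name, fallback) for name in actors]
    return sum(values) / len(values)


print(encode_cast(["Ann", "Zed"]))  # "Zed" is not in the training data
```

Computing the averages on the training set only keeps test-set revenue out of the features.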

---
### 1.5) LightGBM Model

We chose LightGBM because it supports classification, is accurate, handles mixed data types, and is efficient.

Light Gradient Boosting Machine (LightGBM) is an ensemble learning method that builds decision trees sequentially, where each new tree learns from the errors of the previous ones. Instead of level-wise tree growth, LightGBM uses a leaf-wise approach: at each step it splits the leaf that yields the largest loss reduction, growing the tree deeper along that branch, which often results in better accuracy. The final model is the combination of all the individual trees, making it a powerful and efficient method for structured data.

A big advantage of LightGBM is that it handles mixed data types well. It accepts integer-coded categorical features and searches for optimal splits specifically for categorical data. However, it does have some drawbacks. LightGBM can be prone to overfitting, which is why parameters like `max_depth` are used to limit tree depth. Additionally, it is less interpretable than simpler models, though tools like SHAP values can help explain its predictions.

<hr style="height:2px;border-width:0;color:red;background-color:red">

## 2) Results

### 2.1) Regression Model

To train our LightGBM regression model, we started by declaring the categorical features, in this case `"theme_sentiment_label"` and `"theme_topic_label"`, and passing them through the `categorical_feature` parameter. We set `objective` to regression, `metric` to RMSE, and `boosting_type` to gradient boosting decision trees, with a learning rate of 0.1, `num_leaves` of 100, and `max_depth` left unlimited (`-1`).

We prepared three versions of our target variable: one for regression (`grossWorldwide`, log-transformed) and two for classification (`movieType1M` and `movieType100M`). We dropped the target columns from the feature set and wrapped both train and test data into LightGBM dataset objects, specifying the categorical features. The model was then trained for 700 boosting rounds, and we monitored performance every 50 rounds.
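
The setup described above roughly corresponds to the following sketch. The parameter values and column names come from the description; the function name `train_regressor`, the argument names, and the use of the `log_evaluation` callback for the 50-round monitoring are assumptions about how the steps could be wired together.

```python
# Parameters as described in the text.
PARAMS = {
    "objective": "regression",
    "metric": "rmse",
    "boosting_type": "gbdt",
    "learning_rate": 0.1,
    "num_leaves": 100,
    "max_depth": -1,  # -1 means no depth limit
}
CATEGORICAL = ["theme_sentiment_label", "theme_topic_label"]
TARGETS = ["grossWorldwide", "movieType1M", "movieType100M"]


def train_regressor(train_df, test_df, target="grossWorldwide"):
    """Train the regression model on pandas DataFrames (hypothetical helper).

    Assumes the target column is already log-transformed and the categorical
    columns are integer-coded or have pandas 'category' dtype.
    """
    # Imported lazily so the configuration above can be inspected
    # even where LightGBM is not installed.
    import lightgbm as lgb

    # Remove all three target columns from the features.
    X_train, y_train = train_df.drop(columns=TARGETS), train_df[target]
    X_test, y_test = test_df.drop(columns=TARGETS), test_df[target]

    train_set = lgb.Dataset(X_train, label=y_train, categorical_feature=CATEGORICAL)
    test_set = lgb.Dataset(
        X_test, label=y_test, categorical_feature=CATEGORICAL, reference=train_set
    )

    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=700,
        valid_sets=[train_set, test_set],
        callbacks=[lgb.log_evaluation(period=50)],  # print metrics every 50 rounds
    )
```

Calling `train_regressor(train_df, test_df).predict(X_test)` would then return predictions on the log scale.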

After training, we used the model to predict log-transformed gross income and computed the RMSE. The final RMSE was **1.9378**, with predictions ranging from **11.2 to 20.6**, compared to an actual range of **10.3 to 20.7**. Although these numbers are in the right range, the RMSE is still quite high, indicating that the model isn’t very precise in predicting actual gross revenue. This suggests that even though we processed and engineered features carefully, the lack of key predictors like budget or detailed marketing data limits the model’s ability to make accurate regression predictions.
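
To make the size of this error more concrete, the sketch below shows how RMSE is computed and what an RMSE of 1.9378 means once converted back from the log scale. It assumes the transform was a natural logarithm, which is consistent with the reported range (values of 10.3 to 20.7 correspond to roughly 30 thousand to roughly 1 billion dollars), but this is an assumption and not stated above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Reported RMSE on the log-transformed target.
reported_rmse = 1.9378

# Under a natural-log transform, an additive error of r in log space
# is a multiplicative error of exp(r) in dollars.
typical_factor = math.exp(reported_rmse)
print(f"typical multiplicative error: about {typical_factor:.1f}x")
```

Under this assumption, a typical prediction is off by a factor of about seven in either direction, which explains why the model is not precise enough to predict actual revenue.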


---
### 2.2) Classification Model

The classification model proved far more effective than the regression model. Especially with the 1M threshold, the model achieves very high recall for the two key categories we care about: critically acclaimed and critically disliked blockbusters (blockbusters = movies with gross worldwide box office above 1 million dollars). One major reason for this is that movies which make more than 1 million dollars usually have high budgets as well. High budgets translate into directors and actors who are famous and successful.

Having movies with known actors and directors increases the chances of them being seen in the training data. So, the model learns which actors and directors (our two key variables) tend to be in blockbusters and predicts that movies with certain (optimal) combinations of those actors and directors will become blockbusters (with 80–84% recall, which is very impressive in our opinion). This is consistent with the SHAP values, which show that actors’ and directors’ statistics (the average worldwide gross of the movies they appear in, and their average ratings) matter.

Of course, the other features are also used by the model. We checked if removing them would improve our results (it worsened them), and seeing that the predictions are especially good for movies where we plausibly have information on the actors and directors in the training data confirms that actors and directors are the most informative features we have.

For the other two categories, flop and hidden gem, the model performed significantly worse with recalls of 17–20%. This is due to the lack of knowledge about the actors that star in them, as they usually weren't seen by the model previously. Hence, the model assigns them mean values, which inflates the predictions for them, leading to many false positives.

When we switched to the 100M threshold for the blockbuster category, the recall levels dropped significantly to around 17% for the two categories we care about. This is likely because even actors and directors who are relatively famous and have been in blockbusters previously still often make less than 100 million dollars at the box office. In fact, it is likely that information gained from the other features helped the model achieve these seemingly low recalls.

Overall, it's also important to note that the importance of these two columns (actor and director previous average gross worldwide) doesn’t necessarily imply a causal relationship. It could be the case that movies with high budgets hire such actors and directors, but it is because of their high budgets that they actually achieve financial success. Hence, in this way, actor and director average gross worldwide box office is an indirect informer of budget, the key missing feature in our dataset.

<hr style="height:2px;border-width:0;color:red;background-color:red">

## 3) Analysis

Our results show a clear difference in performance between the regression and classification models. The regression model, though carefully tuned and trained using LightGBM, reached an RMSE of 1.94 when predicting log-transformed gross income. While the predictions were within the right numerical range (11.2–20.6 vs. actual 10.3–20.7), the error remained high. This likely reflects the absence of key variables such as budget, which strongly influences a movie’s financial success but is missing from our dataset. Despite our efforts to engineer useful features including categorical labels, sentiment and topic analysis, and numerical representations of actor and director data, the model wasn’t able to capture enough variance to make precise predictions of gross revenue.

In contrast, the classification model proved far more effective, particularly when using the 1M gross income threshold. For the two categories we focused on, critically acclaimed and disliked blockbusters, the model reached a recall of 80–84%, which we consider very strong. This performance can be traced back to the importance of actor and director information. Movies that pass the 1M threshold tend to have higher budgets, which in turn attract well-known, high-performing directors and actors. These individuals are more likely to appear in the training data, allowing the model to learn patterns associated with blockbuster success. SHAP values confirmed that the average gross and ratings of actors and directors were among the most influential features. While other features were also used, removing them did not improve results, confirming their supporting role in model performance.

However, the classification model struggled with the categories flop and hidden gem, where recall dropped to 17–20%. These categories often feature lesser-known actors who were not present in the training set, forcing the model to rely on default average values. This inflated predictions and led to a high number of false positives. When we increased the threshold to 100M to define blockbusters, recall for our target categories dropped sharply to around 17%. This is likely because even successful actors and directors often don’t cross that revenue level, making such extreme cases harder to detect. The model likely relied on secondary features to make these limited predictions.

Overall, while actor and director statistics were the strongest predictors, we recognize they do not imply a causal relationship. It’s likely that high-budget films hire actors and directors with strong track records, but it’s the budget, not necessarily the people alone, that drives financial success. In that sense, actor and director average gross may serve as proxies for budget, which remains the key missing variable in our dataset.

<hr style="height:2px;border-width:0;color:red;background-color:red">

## 4) Discussion

### 4.1) Regression Model

To predict a movie's gross worldwide revenue, we used a LightGBM regressor and evaluated different hyperparameter tuning strategies. The initial model tuned via `RandomizedSearchCV` yielded a Root Mean Squared Error (RMSE) of **1.9521**. This value was improved to **1.8788** after manual hyperparameter tuning based on informed ranges.

| Model                  | RMSE   | Tuning Method              |
| ---------------------- | ------ | -------------------------- |
| LightGBM (Randomized)  | 1.9521 | RandomizedSearchCV         |
| LightGBM (Manual Grid) | 1.8788 | Manual Hyperparameter Grid |
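
A manual hyperparameter grid can be organised as in the sketch below. The search ranges are hypothetical, since the actual grid is not listed here, and `evaluate` is a placeholder for a function that trains a LightGBM model with the given parameters and returns its validation RMSE.

```python
from itertools import product

# Hypothetical search ranges; the actual grid used is not given in the text.
GRID = {
    "learning_rate": [0.05, 0.1],
    "num_leaves": [31, 100],
    "max_depth": [-1, 10],
}


def grid_search(evaluate, grid=GRID):
    """Try every combination in the grid and return (best_params, best_rmse).

    `evaluate` takes a parameter dict and returns a validation RMSE
    (lower is better).
    """
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With three parameters of two values each, this evaluates eight combinations, so the cost grows quickly as ranges are widened.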

In the "Predicted vs Actual Revenue" plot (Figure 1), we observed a consistent underestimation for very high-grossing movies (above \$10^8), suggesting the model struggles with outliers. However, the model performed well for mid-range and lower-grossing movies. This aligns with the idea that revenue distribution is heavily skewed, with a few blockbusters and many average performers.

---
### 4.2) Classification Model

For the classification task, the goal was to categorize movies into four groups:

* 0: Critically Acclaimed Blockbuster
* 1: Critically Disliked Blockbuster
* 2: Flop
* 3: Hidden Gem

Using a neural network classifier with softmax output, we evaluated prediction probabilities and confusion matrices.

The average predicted probabilities (for true classes) were:

| Class                            | Avg. Confidence |
| -------------------------------- | --------------- |
| Critically Acclaimed Blockbuster | 0.2246          |
| Critically Disliked Blockbuster  | 0.1308          |
| Flop                             | 0.9888          |
| Hidden Gem                       | 0.9971          |

As shown in the final confusion matrix (Figure 2), the model performed extremely well for "flop" and "hidden gem" classes, with **1769** and **2000** correct predictions, respectively. However, it misclassified **116 out of 150 critically acclaimed blockbusters** as hidden gems and **34 out of 39 critically disliked blockbusters** as flops.

| True Class \ Predicted      | 0 (Acclaimed Blockbuster) | 1 (Disliked Blockbuster) | 2 (Flop) | 3 (Hidden Gem) |
| --------------------------- | ------------------------- | ------------------------ | -------- | -------------- |
| 0 (Acclaimed Blockbuster)   | 34                        | 0                        | 0        | 116            |
| 1 (Disliked Blockbuster)    | 0                         | 5                        | 34       | 0              |
| 2 (Flop)                    | 0                         | 5                        | 1769     | 0              |
| 3 (Hidden Gem)              | 21                        | 2                        | 0        | 2000           |

This imbalance shows that the model has a strong bias toward classes with higher frequency and more consistent features (e.g., flops and hidden gems) but struggles to confidently separate blockbusters due to overlapping characteristics with other classes.
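
Per-class recall and precision make this imbalance explicit. The sketch below derives both from the confusion matrix above; the counts are copied from the table, while the function and variable names are only illustrative.

```python
LABELS = ["Acclaimed Blockbuster", "Disliked Blockbuster", "Flop", "Hidden Gem"]

# Rows are true classes, columns are predicted classes (from the table above).
CONFUSION = [
    [34, 0, 0, 116],
    [0, 5, 34, 0],
    [0, 5, 1769, 0],
    [21, 2, 0, 2000],
]


def per_class_metrics(matrix):
    """Return a list of (recall, precision) tuples, one per class."""
    metrics = []
    for i in range(len(matrix)):
        true_positive = matrix[i][i]
        actual_total = sum(matrix[i])                    # row sum
        predicted_total = sum(row[i] for row in matrix)  # column sum
        recall = true_positive / actual_total if actual_total else 0.0
        precision = true_positive / predicted_total if predicted_total else 0.0
        metrics.append((recall, precision))
    return metrics


for label, (recall, precision) in zip(LABELS, per_class_metrics(CONFUSION)):
    print(f"{label:<22} recall={recall:.3f} precision={precision:.3f}")
```

Recall for the two blockbuster classes works out to 34/150 and 5/39, while flops and hidden gems are recovered almost completely.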

---
### 4.3) Limitations

Although our classification model performed better than we expected, some aspects of the process could still be improved for better results and documentation. The biggest limitation of our models is that they do not take the budget of the movies into account. Budget could have important effects, such as decreasing the overall income of the movie, and could therefore change how movies are classified. We did not pursue this due to lack of time, but it would be an interesting direction for future research.

Another area for improvement is that we did not have enough time to compare a sufficient number of models. With more time, models other than LightGBM could be compared and contrasted to find the one that best captures the key aspects of our data and makes the best predictions possible.

---
### 4.4) Assemble the next blockbuster

When trying to answer our original research question and find the characteristics that make a film a blockbuster, we were surprised by our results. As mentioned before, the most influential features were the directors and actors involved in the movie. So, according to our data, creators who want to produce a successful blockbuster should hire high-ranking actors and directors.

<hr style="height:2px;border-width:0;color:red;background-color:red">

## 5) Conclusion

The regression and classification models provided complementary insights. The regression model reliably predicted gross revenue for most films, though it underperformed for extreme cases. The classification model, on the other hand, was excellent at detecting low and moderate performers (flops and hidden gems) but lacked precision in identifying blockbusters.

These limitations point to the inherent challenges in movie prediction tasks: blockbusters often combine qualitative elements like marketing, release timing, and cultural trends, which are hard to capture in data.

Nonetheless, the models serve well for risk reduction, helping identify which films are unlikely to succeed and highlighting traits associated with consistent performers. For future work, adding more industry-specific features or ensemble models may help capture outliers more effectively.