
![movie_popcorn.jpg](attachment:movie_popcorn.jpg)

## Introduction
In today’s data-driven entertainment industry, accurately predicting a movie’s box office performance is crucial for production houses, distributors, and investors. Traditional prediction models mostly rely on static features such as cast, director, genre, budget, and historical box office trends. While these models offer baseline insights, they fall short in capturing real-time dynamics like audience excitement, social media buzz, and word-of-mouth sentiment — all of which significantly influence a movie’s financial success.

Our project aims to bridge this gap by integrating sentiment and emotional signals derived from social media platforms such as Reddit, YouTube, and IMDB, alongside structured movie data. This enriched approach goes beyond just revenue forecasting — it empowers stakeholders with timely insights into public anticipation, emotional resonance, and real-world audience perception prior to and during a movie's release cycle.

## Problem Statement
Traditional box office prediction models primarily rely on structured features such as cast, genre, budget, and past revenue performance. While useful, these models overlook a critical factor that increasingly influences a movie's commercial success: real-time public sentiment and emotional engagement across digital platforms. In today’s highly connected world, the success or failure of a film is significantly shaped by how audiences perceive and discuss it before and during its release. Public hype, emotional buzz, and online discourse play a vital role in shaping movie attendance and streaming behavior — yet current models fail to incorporate this dynamic, real-time social feedback loop.

## Approach
We propose that integrating sentiment polarity and emotional tone from public discussions will improve forecasting accuracy, helping stakeholders make smarter marketing, release, and investment decisions — ultimately minimizing financial risk and maximizing return.

Our project involves two types of data sources. The first is static or raw data, which we extract from IMDB. The second is real-time data from platforms like YouTube and Reddit, where we gather up-to-date public opinions about movies and generate sentiment scores. We use both types of data to train our models.
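
As a minimal sketch of how the two sources could be combined, the snippet below joins static movie records with per-movie sentiment scores on a shared identifier. The field names (`movie_id`, `budget`, `sentiment_mean`) and all sample values are hypothetical stand-ins, not the project's actual schema. Movies without social data keep a `None` score instead of being dropped.

```python
from typing import Optional

# Hypothetical static records, standing in for the IMDB metadata.
static_rows = [
    {"movie_id": "m1", "title": "Movie A", "budget": 50_000_000},
    {"movie_id": "m2", "title": "Movie B", "budget": 12_000_000},
]

# Hypothetical per-movie sentiment scores derived from Reddit/YouTube comments.
sentiment_by_movie = {"m1": 0.42}


def merge_features(static_rows, sentiment_by_movie):
    """Left-join sentiment scores onto static rows by movie_id."""
    merged = []
    for row in static_rows:
        # Movies with no sentiment data get None rather than being dropped.
        score: Optional[float] = sentiment_by_movie.get(row["movie_id"])
        merged.append({**row, "sentiment_mean": score})
    return merged


features = merge_features(static_rows, sentiment_by_movie)
for row in features:
    print(row)
```

Keeping unmatched movies in the result makes the missing-data case explicit, so a later step can decide whether to impute or exclude them.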

We trained three models using only structured metadata features: XGBoost, CatBoost, and LightGBM.

### Sentiment & Emotion Signal Integration
We extracted public sentiment and emotional reactions using pretrained machine learning models. Because of computational constraints, we did not fetch data for all movies directly from the Reddit and YouTube APIs; instead, we used these pretrained models to generate sentiment and emotion scores.

## Proposal Changes
Originally, the project included Twitter as a major sentiment source. However, due to Twitter's API restrictions and limited access to non-privileged developer data, we had to remove Twitter from the pipeline.

To compensate and strengthen our model, we introduced emotion analysis — going beyond sentiment polarity (positive/negative) to capture the type and depth of emotion expressed by the audience. This enhancement helps differentiate between movies with similar sentiment but different emotional impact.
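
To illustrate why emotion labels can add information beyond polarity, here is a small sketch with two hypothetical movies. Both have the same share of positive comments but a different emotional mix. The labels, titles, and values are invented for illustration; the actual classifier outputs and emotion taxonomy used in the project may differ.

```python
from collections import Counter

# Hypothetical per-comment outputs of a pretrained classifier:
# (sentiment polarity, emotion label). Values are illustrative only.
comments = {
    "Movie A": [("positive", "joy"), ("positive", "joy"), ("negative", "anger")],
    "Movie B": [("positive", "surprise"), ("positive", "fear"), ("negative", "sadness")],
}


def positive_share(labels):
    """Fraction of comments with positive polarity."""
    return sum(1 for polarity, _ in labels if polarity == "positive") / len(labels)


def emotion_shares(labels):
    """Fraction of comments carrying each emotion label."""
    counts = Counter(emotion for _, emotion in labels)
    return {emotion: n / len(labels) for emotion, n in counts.items()}


for title, labels in comments.items():
    print(title, round(positive_share(labels), 2), emotion_shares(labels))
```

In this toy data the polarity feature cannot tell the two movies apart, while the per-emotion shares can, which is the distinction the emotion features are meant to capture.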

## Team Contribution

This section outlines the individual contributions of each team member toward the successful execution of the project.

| **Task**                           | **Team Member**             |
|------------------------------------|-----------------------------|
| IMDB Data Collection               | Nitish Kumar                |
| Reddit Data Collection             | Leonardo Ferreira           |
| Reddit Data Analysis               | Leonardo Ferreira           |
| YouTube Data Collection            | Aryan Shetty                |
| YouTube Data Analysis              | Aryan Shetty                |
| Data Cleaning & Preprocessing      | Sunil Kuruba                |
| Exploratory Data Analysis (EDA)    | Sunil Kuruba & Nitish Kumar |
| Machine Learning Model Development | Niharika Belavadi Shekar    |
| Model Evaluation and Verification  | Niharika Belavadi Shekar    |
| Documentation                      | Sunil Kuruba                |


## Expected Deliverables (Milestones)

| **Phase**   | **Timeline**        | **Deliverables**                                              |
|-------------|---------------------|---------------------------------------------------------------|
| Phase 1     | March (Week 1–3)    | Data collection, cleaning, standardization                   |
| Phase 2     | April (Week 1–2)    | EDA, sentiment analysis, feature engineering, model setup    |
| Phase 3     | April (Week 3–4)    | Final model training, evaluation, and report writing         |

## Reflection

We encountered some practical challenges with data extraction using the Twitter API, primarily due to restricted access to historical data. Additionally, we had to apply filters to both YouTube and Reddit datasets to align the extracted data with the release dates of the corresponding movies.
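
The release-date alignment can be sketched as a simple window filter. This is an illustrative example only: the window sizes (`DAYS_BEFORE`, `DAYS_AFTER`), the record fields, and the sample dates are assumptions, since the exact filter rules are not stated here.

```python
from datetime import date, timedelta

# Hypothetical window sizes; the actual values used in the project are not stated.
DAYS_BEFORE = 30
DAYS_AFTER = 14


def in_release_window(posted: date, release: date,
                      days_before: int = DAYS_BEFORE,
                      days_after: int = DAYS_AFTER) -> bool:
    """True if a comment was posted within the window around the release date."""
    start = release - timedelta(days=days_before)
    end = release + timedelta(days=days_after)
    return start <= posted <= end  # both ends inclusive


release = date(2024, 3, 15)  # illustrative release date
comments = [
    {"text": "Can't wait for this", "posted": date(2024, 3, 1)},
    {"text": "Rewatched it years later", "posted": date(2026, 1, 5)},
]

# Keep only comments posted close to the release.
kept = [c for c in comments if in_release_window(c["posted"], release)]
print([c["text"] for c in kept])
```

A filter like this keeps discussion from the release period and drops comments written long afterward, which would otherwise leak post-release information into the features.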

The project is progressing well, and we estimate that around 80% of the work is complete. Our hypotheses are well motivated, and the preliminary results look promising. At this stage, we are not facing any major roadblocks.

Overall, the teamwork has been exceptional. Collaborative coding, effective brainstorming sessions, detailed visualizations, and smooth communication have all contributed to significant progress.