IMDb Movie Analysis: From Data to Dashboard
View the Interactive Tableau Dashboard Here

Project Overview
This end-to-end data analysis project explores the IMDb Movie Dataset to uncover insights into what makes a film successful. The project begins with data cleaning and preprocessing in Python, moves to exploratory data analysis (EDA), includes the development of a predictive model for box office revenue, and culminates in a fully interactive business intelligence dashboard built in Tableau. A key challenge addressed was the transformation of multi-value Genre data into a format suitable for robust analysis in Tableau, a common real-world data preparation task.

Key Features
End-to-End Workflow: Demonstrates a full data analysis pipeline from raw data to final presentation.
Advanced Data Prep: Solves the multi-value field problem by structuring data in Python for optimal use in Tableau using a relational model.
Predictive Modeling: Implements multiple regression models in Scikit-learn to predict movie revenue based on key features.
Interactive BI Dashboard: Consolidates all findings into a user-friendly Tableau dashboard for dynamic exploration.

Tech Stack
Data Manipulation & Analysis: Python, Pandas, NumPy
Data Visualization (Python): Matplotlib, Seaborn
Machine Learning: Scikit-learn
BI & Dashboarding: Tableau Public
Development Environment: JupyterLab

Project Workflow
1. Data Cleaning and Preparation (Python)
The initial dataset was processed to ensure data quality and usability.
Standardized Column Names: Converted column names to lowercase, removed parentheses, and replaced spaces with underscores (e.g., Revenue (Millions) -> revenue_millions).
Handled Missing Values: Imputed missing revenue_millions and metascore values with their respective medians, which are less sensitive to outliers than the mean.
Prepared Data for Tableau: To handle the multi-value Genre column (e.g., "Action,Adventure,Sci-Fi"), the data was split into two files: a main movie table and a genre lookup table with one row per movie-genre pair, linked by rank. This relational approach is more robust than pivoting the genre values inside Tableau.

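import pandas as pd

df = pd.read_csv('IMDB-Movie-Data.csv')
# Standardize column names first, e.g. 'Revenue (Millions)' -> 'revenue_millions'
df.columns = df.columns.str.lower().str.replace(r'[()]', '', regex=True).str.replace(' ', '_')

# Main table: one row per movie, without the multi-value genre column
movies_main = df.drop('genre', axis=1)

# Lookup table: one row per (rank, genre) pair
genres_lookup = df[['rank', 'genre']].copy()
genres_lookup['genre'] = genres_lookup['genre'].str.split(',')
genres_lookup = genres_lookup.explode('genre')
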
movies_main.to_csv('movies_main.csv', index=False)
genres_lookup.to_csv('genres_lookup.csv', index=False)

2. Exploratory Data Analysis (Python & Tableau)
EDA was performed to uncover initial patterns and relationships.
A correlation heatmap showed that votes has the strongest positive correlation with revenue_millions.
Bar charts revealed that Drama is the most frequently produced genre, while Adventure and Sci-Fi earn the highest average revenue.
Analysis of top directors showed that a small group of individuals consistently generates high-revenue films.

3. Predictive Modeling (Python)
A machine learning model was developed to predict box office revenue.
Objective: Predict revenue_millions.
Features: year, runtime_minutes, rating, votes, metascore.
Models: Trained and compared Linear Regression, Random Forest, and Gradient Boosting models.
Result: The Gradient Boosting Regressor performed best of the three, which suggests it captured non-linear relationships that Linear Regression could not. votes was identified as the most important predictive feature.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# df is the cleaned DataFrame from step 1 (standardized column names, missing values imputed)
features = ['year', 'runtime_minutes', 'rating', 'votes', 'metascore']
X = df[features]
y = df['revenue_millions']

# Hold out 20% of the movies for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbr.fit(X_train, y_train)

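The model comparison described in step 3 could look roughly like the sketch below. It is not the notebook's actual code: it runs on a synthetic stand-in for the cleaned data (the demo frame, its value ranges, and the made-up revenue formula are all assumptions for illustration), and the choice of R² and mean absolute error as metrics is also an assumption, since the metrics used in the project are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned movie data (NOT the real dataset).
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "year": rng.integers(2006, 2017, n),
    "runtime_minutes": rng.integers(80, 180, n),
    "rating": rng.uniform(3.0, 9.0, n).round(1),
    "votes": rng.integers(1_000, 1_000_000, n),
    "metascore": rng.uniform(20, 100, n).round(0),
})
# Made-up revenue with a non-linear dependence on votes, plus noise.
demo["revenue_millions"] = 0.5 * np.sqrt(demo["votes"]) + rng.normal(0, 20, n)

# Blank out some metascore values, then impute with the column median
# (the same strategy step 1 describes for the real data).
missing_idx = demo.sample(frac=0.1, random_state=0).index
demo.loc[missing_idx, "metascore"] = np.nan
demo["metascore"] = demo["metascore"].fillna(demo["metascore"].median())

features = ["year", "runtime_minutes", "rating", "votes", "metascore"]
X_train, X_test, y_train, y_test = train_test_split(
    demo[features], demo["revenue_millions"], test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }

results_df = pd.DataFrame(results).T.sort_values("r2", ascending=False)
print(results_df)

# Feature importances of the fitted gradient boosting model, highest first.
importances = pd.Series(
    models["gradient_boosting"].feature_importances_, index=features
).sort_values(ascending=False)
print(importances)
```

Scoring all three models on the same held-out split keeps the comparison fair. With the real data, replace demo with the cleaned DataFrame from step 1.
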
Key Insights
Audience Engagement Drives Revenue: The number of votes is a stronger predictor of revenue than the critic score (metascore) or the audience score (rating) alone. This suggests that creating "buzz" matters, although the relationship is correlational rather than proven cause and effect.
Genre Defines Profitability: While dramas are the most common, Adventure, Action, and Sci-Fi films bring in the most revenue on average in this dataset.
There is a Formula for Success: A combination of high audience engagement (votes), decent critical acclaim, and belonging to a high-revenue genre provides a strong indication of a movie's financial success.
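The genre insight relies on the two-table layout from step 1. As a minimal sketch of how the lookup table can be joined back to the main table to compare genres, the example below uses a tiny invented sample (the titles and revenue figures are illustrative, not taken from the dataset) and assumes rank is the join key, as in the export code.

```python
import pandas as pd

# Tiny made-up sample in the cleaned layout (values are illustrative only).
raw = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genre": ["Action,Adventure,Sci-Fi", "Drama", "Adventure,Drama", "Drama,Romance"],
    "revenue_millions": [300.0, 40.0, 120.0, 20.0],
})

# Same split as in step 1: a main table and a one-row-per-genre lookup table.
movies_main = raw.drop(columns="genre")
genres_lookup = (
    raw[["rank", "genre"]]
    .assign(genre=raw["genre"].str.split(","))
    .explode("genre")
)

# Join each (rank, genre) pair back to its movie, then aggregate per genre.
joined = genres_lookup.merge(movies_main, on="rank", how="left")
genre_stats = (
    joined.groupby("genre")["revenue_millions"]
    .agg(movie_count="count", mean_revenue="mean")
    .sort_values("mean_revenue", ascending=False)
)
print(genre_stats)
```

A movie with several genres is counted once per genre here, so per-genre totals should not be summed to get an overall figure.
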
How to Run This Project
Prerequisites
Python 3.8+
Tableau Public or Tableau Desktop
Setup & Installation
Clone the repository:
git clone https://github.com/your-username/IMDb-Data-Analysis-Python-Tableau.git
cd IMDb-Data-Analysis-Python-Tableau
Create and activate a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows, use venv\Scripts\activate
Install the required Python libraries:
pip install -r requirements.txt
Execution
Run the Analysis: Launch JupyterLab and run the notebooks in the /notebooks directory.
jupyter-lab
View the Dashboard: Open the .twbx file in the /tableau directory with Tableau. The dashboard connects to the .csv files located in the /data directory.
Project Directory Structure
IMDb-Data-Analysis-Python-Tableau/
├── data/
│ ├── IMDB-Movie-Data.csv # Raw data
│ ├── movies_main.csv # Cleaned data for Tableau
│ └── genres_lookup.csv # Genre lookup table for Tableau
│
├── notebooks/
│ ├── 01_Data_Cleaning_and_Preparation.ipynb
│ └── 02_EDA_and_Modeling.ipynb
│
├── tableau/
│ ├── IMDb_Dashboard.twbx # Tableau Workbook
│ └── dashboard_preview.png # Dashboard screenshot
│
├── .gitignore
├── README.md # You are here
└── requirements.txt # Python libraries