
k-sahi/IMDB-Data-Analysis


IMDb Movie Analysis: From Data to Dashboard

View the Interactive Tableau Dashboard Here

Project Overview

This end-to-end data analysis project explores the IMDb Movie Dataset to uncover insights into what makes a film successful. The project begins with data cleaning and preprocessing in Python, moves to exploratory data analysis (EDA), includes the development of a predictive model for box office revenue, and culminates in a fully interactive business intelligence dashboard built in Tableau. A key challenge addressed was the transformation of multi-value Genre data into a format suitable for robust analysis in Tableau, a common real-world data preparation task.

Key Features

- End-to-End Workflow: Demonstrates a full data analysis pipeline from raw data to final presentation.
- Advanced Data Prep: Solves the multi-value field problem by structuring data in Python for optimal use in Tableau using a relational model.
- Predictive Modeling: Implements multiple regression models in Scikit-learn to predict movie revenue based on key features.
- Interactive BI Dashboard: Consolidates all findings into a user-friendly Tableau dashboard for dynamic exploration.

Tech Stack

- Data Manipulation & Analysis: Python, Pandas, NumPy
- Data Visualization (Python): Matplotlib, Seaborn
- Machine Learning: Scikit-learn
- BI & Dashboarding: Tableau Public
- Development Environment: JupyterLab

Project Workflow

1. Data Cleaning and Preparation (Python)

The initial dataset was processed to ensure data quality and usability.

- Standardized Column Names: Converted column names to lowercase and replaced spaces with underscores (e.g., Revenue (Millions) -> revenue_millions).
- Handled Missing Values: Imputed missing revenue_millions and metascore values with their respective medians to avoid skewed analysis.
- Prepared Data for Tableau: To handle the multi-value Genre column (e.g., "Action,Adventure,Sci-Fi"), the data was split into two separate, optimized files. This relational approach is more robust than using Tableau's internal pivoting for calculated fields.
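The first two cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the notebook's actual code: the helper name `standardize_columns` is hypothetical, and dropping parentheses is an assumption inferred from the example Revenue (Millions) -> revenue_millions.

```python
import numpy as np
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names, drop parentheses, and use underscores for spaces."""
    cleaned = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[()]", "", regex=True)  # assumption: parentheses are removed
        .str.replace(r"\s+", "_", regex=True)
    )
    return df.rename(columns=dict(zip(df.columns, cleaned)))


# Made-up rows that only mimic the dataset's column names.
raw = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "Runtime (Minutes)": [121, 124, 117, 108],
        "Revenue (Millions)": [100.0, np.nan, 300.0, 20.0],
        "Metascore": [76.0, 65.0, np.nan, 59.0],
    }
)

clean = standardize_columns(raw)

# Fill gaps with each column's median (computed from the non-missing values).
for col in ["revenue_millions", "metascore"]:
    clean[col] = clean[col].fillna(clean[col].median())

print(clean)
```

The median is less sensitive than the mean to a handful of very high-grossing films, which is presumably why it was preferred for imputation here.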

```python
# Key step: Preparing data for Tableau's relational model
import pandas as pd

df = pd.read_csv('IMDB-Movie-Data.csv')

# ... (initial cleaning) ...

# 1. Create the main movie data file
movies_main = df.drop('genre', axis=1)

# 2. Create the genre lookup table: one row per (rank, genre) pair
genres_lookup = df[['rank', 'genre']].copy()
genres_lookup['genre'] = genres_lookup['genre'].str.split(',')
genres_lookup = genres_lookup.explode('genre')

# 3. Save both files for use in Tableau
movies_main.to_csv('movies_main.csv', index=False)
genres_lookup.to_csv('genres_lookup.csv', index=False)
```

2. Exploratory Data Analysis (Python & Tableau)

EDA was performed to uncover initial patterns and relationships.

- A correlation heatmap showed that votes has the strongest positive correlation with revenue_millions.
- Bar charts revealed that Drama is the most frequently produced genre, while Adventure and Sci-Fi have the highest average revenue.
- Analysis of top directors confirmed that a small group of individuals consistently generates high-revenue films.

3. Predictive Modeling (Python)

A machine learning model was developed to predict box office revenue.

- Objective: Predict revenue_millions.
- Features: year, runtime_minutes, rating, votes, metascore.
- Models: Trained and compared Linear Regression, Random Forest, and Gradient Boosting models.
- Result: The Gradient Boosting Regressor performed best of the three, capturing non-linear relationships in the data that a linear model cannot. votes was identified as the most important predictive feature.
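The comparison step described above might look something like the sketch below. It is an illustration only: the data is synthetic so the block runs without the CSV, the target formula is made up, and R^2 on a held-out split is an assumed metric (the README does not say which one was used).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs without the real dataset.
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame(
    {
        "year": rng.integers(2006, 2017, size=n),
        "runtime_minutes": rng.integers(80, 180, size=n),
        "rating": rng.uniform(4.0, 9.0, size=n),
        "votes": rng.integers(1_000, 1_000_000, size=n),
        "metascore": rng.uniform(20.0, 100.0, size=n),
    }
)
# Made-up target: revenue grows non-linearly with votes, plus noise.
demo["revenue_millions"] = 0.5 * np.sqrt(demo["votes"]) + rng.normal(0, 20, size=n)

features = ["year", "runtime_minutes", "rating", "votes", "metascore"]
X_train, X_test, y_train, y_test = train_test_split(
    demo[features], demo["revenue_millions"], test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
}

# Fit each model on the same split and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")

# Impurity-based importances from the fitted gradient boosting model.
importances = pd.Series(
    models["Gradient Boosting"].feature_importances_, index=features
).sort_values(ascending=False)
print(importances)
```

Using one shared train/test split keeps the scores comparable across models. On the real data, cross-validation would give a steadier ranking than a single split.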

```python
# Example: Training the final model
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# df is the cleaned DataFrame from the data preparation step
features = ['year', 'runtime_minutes', 'rating', 'votes', 'metascore']
X = df[features]
y = df['revenue_millions']

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbr.fit(X_train, y_train)
```

Key Insights

- Audience Engagement Drives Revenue: The number of votes is a more powerful predictor of revenue than critic (Metascore) or audience (Rating) scores alone. This suggests that creating "buzz" is paramount.
- Genre Defines Profitability: While dramas are common, big-budget genres like Adventure, Action, and Sci-Fi are the financial engines of the film industry.
- There is a Formula for Success: A combination of high audience engagement (votes), decent critical acclaim, and belonging to a high-revenue genre provides a strong indication of a movie's financial success.

How to Run This Project

Prerequisites

- Python 3.8+
- Tableau Public or Tableau Desktop

Setup & Installation

Clone the repository:

```bash
git clone https://github.com/your-username/IMDb-Data-Analysis-Python-Tableau.git
cd IMDb-Data-Analysis-Python-Tableau
```

Create and activate a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
```

Install the required Python libraries:

```bash
pip install -r requirements.txt
```

Execution

Run the Analysis: Launch JupyterLab and run the notebooks in the /notebooks directory.

```bash
jupyter-lab
```

View the Dashboard: Open the .twbx file in the /tableau directory with Tableau. The dashboard connects to the .csv files located in the /data directory.
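Before opening the workbook, it may help to confirm that the two exported files relate as intended. The following is a minimal sketch, assuming rank is the key shared by both files (as in the genre lookup step). The inline rows are made up; in practice the two frames would come from `pd.read_csv` on the files in the /data directory.

```python
import pandas as pd

# Made-up stand-ins for movies_main.csv and genres_lookup.csv.
movies_main = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "title": ["Film A", "Film B", "Film C"],
        "revenue_millions": [300.0, 100.0, 50.0],
    }
)
genres_lookup = pd.DataFrame(
    {
        "rank": [1, 1, 2, 2, 3],
        "genre": ["Action", "Sci-Fi", "Action", "Drama", "Drama"],
    }
)

# Every genre row should point at exactly one movie (many-to-one on rank).
joined = genres_lookup.merge(
    movies_main, on="rank", how="left", validate="many_to_one"
)
assert joined["title"].notna().all(), "genre rows without a matching movie"

# A movie with several genres contributes its revenue to each of them.
genre_summary = (
    joined.groupby("genre")["revenue_millions"]
    .agg(movie_count="count", mean_revenue="mean")
    .sort_values("mean_revenue", ascending=False)
)
print(genre_summary)
```

Because a multi-genre film appears once per genre, the per-genre counts add up to more than the number of movies. The same caveat applies to genre totals shown in the dashboard.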
Project Directory Structure

```
IMDb-Data-Analysis-Python-Tableau/
├── data/
│   ├── IMDB-Movie-Data.csv       # Raw data
│   ├── movies_main.csv           # Cleaned data for Tableau
│   └── genres_lookup.csv         # Genre lookup table for Tableau
│
├── notebooks/
│   ├── 01_Data_Cleaning_and_Preparation.ipynb
│   └── 02_EDA_and_Modeling.ipynb
│
├── tableau/
│   ├── IMDb_Dashboard.twbx       # Tableau Workbook
│   └── dashboard_preview.png     # Dashboard screenshot
│
├── .gitignore
├── README.md                     # You are here
└── requirements.txt              # Python libraries
```

About

Exploratory analysis of IMDb movie data using Python (Pandas, NumPy, Seaborn) for cleaning and feature engineering, plus predictive modeling and interactive Tableau dashboards to visualize key trends.
