# 1. Project Overview & Business Understanding



The head of a new movie studio requires data-driven recommendations to guide initial film production choices, specifically aiming to maximize worldwide box office success.



Key Questions to be Answered:

1. Which film genres yield the highest average worldwide revenue?
2. How does audience reception (IMDB rating) correlate with financial success?
3. Is there an optimal film runtime that maximizes gross earnings?
4. How consistent are audience ratings within the highest-grossing film genres?
5. Are the observed differences in average genre revenue statistically significant, or merely due to random chance?

# 2. Data Understanding and Acquisition

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sqlite3

In [9]:
# Establish connection to the IMDB database
conn = sqlite3.connect('im.db')
# Load the Box Office Mojo (BOM) gross revenue data
df = pd.read_csv('bom.movie_gross.csv')

In [10]:
# Display all tables in the database
query1 = """
SELECT *
FROM sqlite_master
WHERE type = 'table';
"""
print(pd.read_sql_query(query1, conn))

Empty DataFrame
Columns: [type, name, tbl_name, rootpage, sql]
Index: []


In [11]:
print("\nBox Office Mojo Data Preview:")
# Preview BOM data structure
print(df.head())


Box Office Mojo Data Preview:
                                         title studio  domestic_gross  \
0                                  Toy Story 3     BV     415000000.0   
1                   Alice in Wonderland (2010)     BV     334200000.0   
2  Harry Potter and the Deathly Hallows Part 1     WB     296000000.0   
3                                    Inception     WB     292600000.0   
4                          Shrek Forever After   P/DW     238700000.0   

  foreign_gross  year  
0     652000000  2010  
1     691300000  2010  
2     664300000  2010  
3     535700000  2010  
4     513900000  2010  


# 3. Data Preparation and Cleaning

This section integrates and cleans the data from the IMDB database and the Box Office Mojo CSV file. The key step is joining movie metadata (runtime, genre, rating) with financial gross data.

In [None]:
# Write the BOM DataFrame into im.db as a 'Revenue' table,
# replacing the table if it already exists
df.to_sql("Revenue", conn, if_exists="replace", index=False)

In [None]:
# Data merging: preliminary SQL join of IMDB metadata, ratings, and BOM revenue.
# Note: df1 is reassigned by the fuller query in the next cell.
query2 = """
SELECT
    mb.movie_id, r.title, mb.original_title, r.year,
    mb.runtime_minutes, mb.genres, r.studio,
    r.domestic_gross, r.foreign_gross,
    mr.averagerating, mr.numvotes
FROM movie_basics AS mb
JOIN Revenue AS r
    ON mb.primary_title = r.title
JOIN movie_ratings AS mr
    ON mb.movie_id = mr.movie_id;
"""
df1 = pd.read_sql_query(query2, conn)
df1.head()

In [None]:
# Data merging: SQL query to join all necessary tables.
# foreign_gross is stored as text, so commas are stripped before casting.
# Total_revenues is NULL whenever either gross figure is missing.
query_all = """
SELECT
    t1.movie_id,
    t1.primary_title,
    t1.start_year,
    t1.runtime_minutes,
    t1.genres,
    t2.averagerating,
    t2.numvotes,
    t3.domestic_gross,
    REPLACE(t3.foreign_gross, ',', '') AS foreign_gross,
    (t3.domestic_gross + CAST(REPLACE(t3.foreign_gross, ',', '') AS REAL)) AS Total_revenues
FROM movie_basics t1
JOIN movie_ratings t2
    ON t1.movie_id = t2.movie_id
JOIN Revenue t3
    ON t1.primary_title = t3.title;
"""
# Execute the query and create the final working DataFrame
df1 = pd.read_sql_query(query_all, conn)
df1.head()  # Preview the first 5 rows of the merged DataFrame
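
The revenue arithmetic in `query_all` can be checked on toy data. The following is a minimal sketch, assuming an in-memory database and made-up rows: the `Revenue` table and its column names follow the notebook, while the titles and values are purely illustrative. It shows how `REPLACE` and `CAST` turn a comma-formatted string into a number, and how a missing `foreign_gross` makes the total NULL.

```python
import sqlite3

import pandas as pd

# In-memory database with made-up rows; values are illustrative only.
demo_conn = sqlite3.connect(":memory:")
toy = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C"],
        "domestic_gross": [100.0, 200.0, 300.0],
        # Stored as text: one plain, one with a thousands separator, one missing.
        "foreign_gross": ["50", "1,000", None],
    }
)
toy.to_sql("Revenue", demo_conn, index=False)

demo_query = """
SELECT
    title,
    domestic_gross + CAST(REPLACE(foreign_gross, ',', '') AS REAL) AS Total_revenues
FROM Revenue
ORDER BY title;
"""
demo_totals = pd.read_sql_query(demo_query, demo_conn)
demo_conn.close()

# Film A and Film B get numeric totals; Film C comes back as NaN (SQL NULL).
print(demo_totals)
```

Rows like Film C are the ones later removed by dropping missing `Total_revenues`, so films with only a domestic figure do not reach the analysis under this approach.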

In [None]:
df1.describe()

In [None]:
# Final cleaning and feature engineering
# Convert total revenues to millions (M USD) for readability
df1['Total_revenues_mil'] = df1['Total_revenues'] / 1_000_000

In [None]:
# Handle missing values in critical columns:
# drop rows where genres or total revenues are missing.
df1 = df1.dropna(subset=['genres', 'Total_revenues'])
# Confirm that no missing revenue values remain (expected: 0)
df1['Total_revenues_mil'].isnull().sum()

In [None]:
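# Impute missing runtime with the median for robust analysis.
# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment, which is deprecated and may not update df1.
median_runtime = df1['runtime_minutes'].median()
df1['runtime_minutes'] = df1['runtime_minutes'].fillna(median_runtime)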

In [None]:
# Keep only films with positive total revenue
df1 = df1[df1['Total_revenues_mil'] > 0]
print(f"Final Clean Rows for Analysis: {len(df1)}")
df1.info()
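
The cleaning steps above can be traced on a small made-up frame to see which rows survive each stage. This is a minimal sketch: the column names follow the notebook, but the rows are illustrative and not drawn from the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the notebook, values are illustrative.
toy = pd.DataFrame(
    {
        "genres": ["Action", None, "Drama", "Comedy", "Horror"],
        "runtime_minutes": [120.0, 95.0, np.nan, 100.0, 90.0],
        "Total_revenues": [5_000_000.0, 2_000_000.0, 1_000_000.0, np.nan, 0.0],
    }
)

# Convert to millions of USD.
toy["Total_revenues_mil"] = toy["Total_revenues"] / 1_000_000

# Drop rows missing a genre or a total revenue (removes rows 1 and 3).
toy = toy.dropna(subset=["genres", "Total_revenues"])

# Impute missing runtimes with the median of the remaining rows.
median_runtime = toy["runtime_minutes"].median()
toy["runtime_minutes"] = toy["runtime_minutes"].fillna(median_runtime)

# Keep only films with positive revenue (removes the zero-revenue row).
toy = toy[toy["Total_revenues_mil"] > 0].reset_index(drop=True)
print(toy)
```

Note that the median is computed after the rows with missing genres or revenue are dropped, so it reflects only the films that remain at that point, which matches the order of the cells above.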