# Movie Studio Analysis: Understanding Box Office Success
## Business Understanding
The company is planning to launch a new movie studio, but lacks experience in movie production. The goal of this analysis is to explore current trends in the film industry and provide actionable insights that can guide the studio's strategy. This analysis focuses on identifying factors that contribute to box office success, including genre, budget, release timing, and the impact of key personnel such as directors, actors, and writers.
    

## Data Understanding
The analysis is based on multiple datasets related to movie budgets, revenues, genres, release dates, and key personnel (directors, actors, and writers).

These datasets include:
   - **Box Office Mojo (BOM) Movie Gross**: Contains information on domestic and foreign box office revenues.
   - **The Numbers (TN) Movie Budgets**: Contains production budgets along with domestic and worldwide gross revenues.
   - **TMDB Movies**: Includes information on movie popularity and ratings.
   - **IMDb Database**: Provides detailed information on directors, actors, writers, and other key personnel involved in the movies.

## Data Preparation
In this section, we will load, clean, and merge the datasets to prepare them for analysis.

In [2]:
# Library imports
import pandas as pd
import sqlite3
from zipfile import ZipFile
import scipy.stats as stats
import helper

In [3]:
# Load the datasets
tmdb_movies = pd.read_csv("zippedData/tmdb.movies.csv.gz")
tn_movies = pd.read_csv("zippedData/tn.movie_budgets.csv.gz")

In [4]:
# Loading IMDb SQLite database

# Unzip the SQLite db file if not already done
import os
if not os.path.exists("zippedData/im.db"):
    with ZipFile("zippedData/im.db.zip", 'r') as zObject:
        zObject.extractall("zippedData/")

# Creating the connection
conn = sqlite3.connect("zippedData/im.db")

# Loading data for directors, actors, and writers filtering for US movies in English

# Queries
query_directors = """
SELECT mb.*, mr.averagerating, mr.numvotes, p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
AND pr.category = 'director'
AND ma.language = 'en';
"""

query_actors = """
SELECT mb.*, mr.averagerating, mr.numvotes, p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
AND pr.category = 'actor'
AND ma.language = 'en';
"""

query_writers = """
SELECT mb.*, mr.averagerating, mr.numvotes, p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
AND pr.category = 'writer'
AND ma.language = 'en';
"""

query_basics = """
SELECT *
FROM movie_basics
"""

query_ratings = """
SELECT *
FROM movie_ratings
"""

# Execute queries and assign to dataframes
directors_merged = pd.read_sql_query(query_directors, conn)
actors_merged = pd.read_sql_query(query_actors, conn)
writers_merged = pd.read_sql_query(query_writers, conn)
movie_basics = pd.read_sql_query(query_basics, conn)
movie_ratings = pd.read_sql_query(query_ratings, conn)

# Close the connection
conn.close()
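
The three personnel queries above differ only in the `pr.category` value, so they could plausibly be generated from a single parameterized template. The following is a minimal sketch under that assumption, not the notebook's actual code: `load_personnel` is a hypothetical helper name, the selected columns are reduced for brevity, and the tiny in-memory tables exist only so the example runs on its own.

```python
import sqlite3
import pandas as pd

# One template for all roles; the category is bound as a query parameter.
PERSONNEL_QUERY = """
SELECT mb.movie_id, mb.primary_title, p.primary_name
FROM movie_basics AS mb
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
  AND ma.language = 'en'
  AND pr.category = ?;
"""

def load_personnel(conn, category):
    """Hypothetical helper: run the shared query for one principals category."""
    return pd.read_sql_query(PERSONNEL_QUERY, conn, params=(category,))

# Toy in-memory database (illustrative rows, not real IMDb data).
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
CREATE TABLE principals (movie_id TEXT, person_id TEXT, category TEXT);
CREATE TABLE persons (person_id TEXT, primary_name TEXT);
CREATE TABLE movie_akas (movie_id TEXT, region TEXT, language TEXT);
INSERT INTO movie_basics VALUES ('m1', 'Demo Movie');
INSERT INTO principals VALUES ('m1', 'p1', 'director'), ('m1', 'p2', 'actor');
INSERT INTO persons VALUES ('p1', 'Director One'), ('p2', 'Actor One');
INSERT INTO movie_akas VALUES ('m1', 'US', 'en');
""")

demo_directors = load_personnel(demo, "director")
demo_actors = load_personnel(demo, "actor")
demo.close()
```

Binding the category with `?` keeps the SQL text identical across roles, which makes it harder for the three queries to drift apart.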

## Data Cleaning
Here I'll go through the process of cleaning the data by handling missing values, removing duplicate information, recasting data types, and engineering new features.

In [5]:
# Cleaning IMDb data

directors_cleaned = helper.clean_imdb_data(directors_merged)
actors_cleaned = helper.clean_imdb_data(actors_merged)
writers_cleaned = helper.clean_imdb_data(writers_merged)

# Extracting the primary genre from the 'genres' column
movie_basics['primary_genre'] = movie_basics['genres'].apply(lambda x: x.split(',')[0] if pd.notnull(x) else None)
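
The primary-genre extraction above takes the first entry of a comma-separated string. As a small illustration on made-up values, the same result can be obtained with pandas string accessors, which handle missing values without an explicit null check. The sample series here is an assumption for demonstration only.

```python
import pandas as pd

# Illustrative genre strings, including one missing value.
genres = pd.Series(["Action,Adventure,Sci-Fi", "Drama", None])

# Row-wise version, mirroring the lambda used in the notebook.
primary_lambda = genres.apply(lambda x: x.split(',')[0] if pd.notnull(x) else None)

# Vectorized version: split each string, then take the first element.
primary_vectorized = genres.str.split(',').str[0]
```

Both versions leave the missing entry missing, so either can feed the later `dropna` step.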

In [6]:
# Cleaning TMDB data
tmdb_cleaned = helper.clean_tmdb_data(tmdb_movies)

In [7]:
# Cleaning 'The Numbers' movie budgets data

tn_movie_budgets_cleaned = helper.clean_tn_movie_budgets(tn_movies)
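
The body of `helper.clean_tn_movie_budgets` is not shown in this notebook. The sketch below only illustrates the kind of recasting such a step might perform, assuming the raw file stores money as strings like `"$165,000,000"` and dates as strings like `"Mar 26, 2010"`. Both formats and the `clean_money` name are assumptions, not confirmed details of the helper.

```python
import pandas as pd

# Assumed raw layout: currency and dates stored as text.
raw = pd.DataFrame({
    "release_date": ["Mar 26, 2010", "May 7, 2010"],
    "movie": ["How to Train Your Dragon", "Iron Man 2"],
    "production_budget": ["$165,000,000", "$170,000,000"],
})

def clean_money(col):
    """Hypothetical helper: strip '$' and ',' and cast to float."""
    return col.str.replace(r"[$,]", "", regex=True).astype(float)

cleaned = raw.assign(
    release_date=pd.to_datetime(raw["release_date"], format="%b %d, %Y"),
    production_budget=clean_money(raw["production_budget"]),
)
```

After a step like this, `production_budget` is numeric and `release_date` is a datetime, which matches the dtypes reported by `merged_data.info()` later on.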

In [8]:
# Merge TMDB and TN Movie Budgets data
merged_data = pd.merge(tmdb_cleaned, tn_movie_budgets_cleaned, left_on='title', right_on='movie')

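# Dropping duplicates based on 'title' and 'release_date'
# Note: both inputs have a 'release_date' column, so the merge above renames them to
# 'release_date_x' and 'release_date_y'. This check therefore takes the else branch,
# and the de-duplication is done on 'release_date_x' in a later cell.
if 'release_date' in merged_data.columns:
    merged_data = merged_data.drop_duplicates(subset=['title', 'release_date'])
    print("Dropped duplicates based on 'title' and 'release_date'.")
else:
    print("Error: 'release_date' column is missing in the merged data.")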
    
# Merging IMDb data
# Ensuring correct data types
movie_basics['primary_title'] = movie_basics['primary_title'].astype(str)
merged_data['title'] = merged_data['title'].astype(str)

# Merge the primary genre from IMDb into the merged_data DataFrame
merged_data = pd.merge(merged_data, movie_basics[['primary_title', 'primary_genre']], left_on='title', right_on='primary_title', how='left')

# Drop the extra 'primary_title' column after merging
merged_data = merged_data.drop(columns=['primary_title'])

Error: 'release_date' column is missing in the merged data.
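
The message above is a consequence of how `pd.merge` handles column names that appear in both inputs and are not join keys: it appends suffixes, `_x` and `_y` by default. A minimal sketch with toy frames (the values are illustrative) shows the effect and one possible way to get more descriptive names through the `suffixes` parameter.

```python
import pandas as pd

# Two toy frames that share a non-key column named 'release_date'.
left = pd.DataFrame({
    "title": ["Inception"],
    "release_date": ["2010-07-16"],
    "popularity": [27.92],
})
right = pd.DataFrame({
    "movie": ["Inception"],
    "release_date": ["2010-07-16"],
    "production_budget": [160000000.0],
})

# Default behaviour: the shared column is split into *_x and *_y.
merged = pd.merge(left, right, left_on="title", right_on="movie")
date_columns = [c for c in merged.columns if c.startswith("release_date")]

# Optional: explicit suffixes make the origin of each column clearer.
merged_named = pd.merge(left, right, left_on="title", right_on="movie",
                        suffixes=("_tmdb", "_tn"))
named_columns = [c for c in merged_named.columns if c.startswith("release_date")]
```

With the default suffixes, a check for a plain `release_date` column fails, which is why later cells refer to `release_date_x`.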


In [9]:
# Adding in ROI columns
# Note: ROI is computed here as gross / budget, so a value of 1.0 means break-even
merged_data['roi_domestic'] = merged_data['domestic_gross'] / merged_data['production_budget']
merged_data['roi_worldwide'] = merged_data['worldwide_gross'] / merged_data['production_budget']
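
As a worked example of the ratio defined above, the figures for *How to Train Your Dragon* from the dataset preview can be plugged in directly. The net-return variant, `(gross - budget) / budget`, is shown only for comparison and is not what the notebook computes.

```python
# Figures taken from the merged dataset preview.
budget = 165_000_000.0
domestic_gross = 217_581_232.0

# Ratio used in this notebook: gross divided by budget.
gross_to_budget = domestic_gross / budget

# Conventional net ROI, for comparison only: profit divided by budget.
net_roi = (domestic_gross - budget) / budget

print(f"gross / budget: {gross_to_budget:.6f}")
print(f"net ROI:        {net_roi:.6f}")
```

The two measures differ by exactly 1, so rankings and hypothesis tests based on one carry over to the other.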

In [10]:
# Extract the month from 'release_date_x' and create a new 'month' column
merged_data['month'] = merged_data['release_date_x'].dt.month

In [11]:
# Drop duplicates based on 'title' and 'release_date_x'
merged_data = merged_data.drop_duplicates(subset=['title', 'release_date_x'])

In [12]:
# Final merged dataset preview
merged_data.head()

|  | Unnamed: 0 | genre_ids | id_x | original_language | original_title | popularity | release_date_x | title | vote_average | vote_count | id_y | release_date_y | movie | production_budget | domestic_gross | worldwide_gross | primary_genre | roi_domestic | roi_worldwide | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | [14, 12, 16, 10751] | 10191 | en | How to Train Your Dragon | 28.734 | 2010-03-26 | How to Train Your Dragon | 7.7 | 7610 | 30 | 2010-03-26 | How to Train Your Dragon | 165000000.0 | 217581232.0 | 494870992.0 | Action | 1.318674 | 2.999218 | 3 |
| 1 | 2 | [12, 28, 878] | 10138 | en | Iron Man 2 | 28.515 | 2010-05-07 | Iron Man 2 | 6.8 | 12368 | 15 | 2010-05-07 | Iron Man 2 | 170000000.0 | 312433331.0 | 621156389.0 | Action | 1.837843 | 3.653861 | 5 |
| 2 | 3 | [16, 35, 10751] | 862 | en | Toy Story | 28.005 | 1995-11-22 | Toy Story | 7.9 | 10174 | 37 | 1995-11-22 | Toy Story | 30000000.0 | 191796233.0 | 364545516.0 |  | 6.393208 | 12.151517 | 11 |
| 3 | 4 | [28, 878, 12] | 27205 | en | Inception | 27.92 | 2010-07-16 | Inception | 8.3 | 22186 | 38 | 2010-07-16 | Inception | 160000000.0 | 292576195.0 | 835524642.0 | Action | 1.828601 | 5.222029 | 7 |
| 4 | 5 | [12, 14, 10751] | 32657 | en | Percy Jackson & the Olympians: The Lightning T... | 26.691 | 2010-02-11 | Percy Jackson & the Olympians: The Lightning T... | 6.1 | 4229 | 17 | 2010-02-12 | Percy Jackson & the Olympians: The Lightning T... | 95000000.0 | 88768303.0 | 223050874.0 | Adventure | 0.934403 | 2.347904 | 2 |


In [13]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2124 entries, 0 to 4420
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Unnamed: 0         2124 non-null   int64         
 1   genre_ids          2124 non-null   object        
 2   id_x               2124 non-null   int64         
 3   original_language  2124 non-null   object        
 4   original_title     2124 non-null   object        
 5   popularity         2124 non-null   float64       
 6   release_date_x     2124 non-null   datetime64[ns]
 7   title              2124 non-null   object        
 8   vote_average       2124 non-null   float64       
 9   vote_count         2124 non-null   int64         
 10  id_y               2124 non-null   int64         
 11  release_date_y     2124 non-null   datetime64[ns]
 12  movie              2124 non-null   object        
 13  production_budget  2124 non-null   float64       
 14  domestic_gros

In [14]:
# Looks like many of the genre values are missing. TMDB provides an API we can use to fill these in.

merged_data = helper.update_missing_genres(merged_data)

Updating missing genres: 100%|██████████| 151/151 [00:06<00:00, 23.09it/s]


In [15]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2124 entries, 0 to 4420
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Unnamed: 0         2124 non-null   int64         
 1   genre_ids          2124 non-null   object        
 2   id_x               2124 non-null   int64         
 3   original_language  2124 non-null   object        
 4   original_title     2124 non-null   object        
 5   popularity         2124 non-null   float64       
 6   release_date_x     2124 non-null   datetime64[ns]
 7   title              2124 non-null   object        
 8   vote_average       2124 non-null   float64       
 9   vote_count         2124 non-null   int64         
 10  id_y               2124 non-null   int64         
 11  release_date_y     2124 non-null   datetime64[ns]
 12  movie              2124 non-null   object        
 13  production_budget  2124 non-null   float64       
 14  domestic_gros

In [16]:
# We were able to get genres for all but 10 of the movies, so I will drop those that are missing
# .copy() avoids a SettingWithCopyWarning when new columns are assigned in the next cell
merged_data = merged_data.dropna(subset=['primary_genre']).copy()

In [17]:
# I would also like to take a look at movies that belong to a franchise to see if they perform differently than those that are 'one-offs'

# Initialize the 'franchise' and 'collection' columns
merged_data['franchise'] = False
merged_data['collection'] = None

# Update the franchise info for all movies
merged_data = helper.update_franchise_info(merged_data)

Updating franchise info: 100%|██████████| 2114/2114 [01:14<00:00, 28.19it/s]
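
The bodies of `helper.update_missing_genres` and `helper.update_franchise_info` are not shown here. The sketch below only illustrates how a movie-details response could be turned into the `primary_genre`, `franchise`, and `collection` values, assuming the payload carries a `genres` list of `{"id", "name"}` objects and a `belongs_to_collection` object that is null for stand-alone films. The function names and the hand-written sample payloads are hypothetical, and no network request is made.

```python
def extract_primary_genre(details):
    """Hypothetical helper: first genre name in the payload, or None."""
    genres = details.get("genres") or []
    return genres[0]["name"] if genres else None

def extract_collection(details):
    """Hypothetical helper: collection name, or None for stand-alone films."""
    collection = details.get("belongs_to_collection")
    return collection["name"] if collection else None

# Hand-written payloads in the assumed response shape (not real API output).
toy_story = {
    "id": 862,
    "genres": [{"id": 16, "name": "Animation"}],
    "belongs_to_collection": {"name": "Toy Story Collection"},
}
inception = {
    "id": 27205,
    "genres": [{"id": 28, "name": "Action"}],
    "belongs_to_collection": None,
}

toy_story_row = {
    "primary_genre": extract_primary_genre(toy_story),
    "collection": extract_collection(toy_story),
    "franchise": extract_collection(toy_story) is not None,
}
inception_row = {
    "primary_genre": extract_primary_genre(inception),
    "collection": extract_collection(inception),
    "franchise": extract_collection(inception) is not None,
}
```

Keeping the parsing separate from the HTTP call makes it possible to test the mapping without an API key.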


In [18]:
# Drop the redundant columns
columns_to_drop = ['Unnamed: 0', 'genre_ids', 'original_title', 'id_y', 'release_date_y', 'movie']
merged_data_cleaned = merged_data.drop(columns=columns_to_drop)

# Rename columns as specified
merged_data_cleaned = merged_data_cleaned.rename(columns={'id_x': 'tmdb_id', 'release_date_x': 'release_date', 'month': 'release_month'})

# Preview the updated DataFrame
merged_data_cleaned.head()


|  | tmdb_id | original_language | popularity | release_date | title | vote_average | vote_count | production_budget | domestic_gross | worldwide_gross | primary_genre | roi_domestic | roi_worldwide | release_month | franchise | collection |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10191 | en | 28.734 | 2010-03-26 | How to Train Your Dragon | 7.7 | 7610 | 165000000.0 | 217581232.0 | 494870992.0 | Action | 1.318674 | 2.999218 | 3 | True | How to Train Your Dragon Collection |
| 1 | 10138 | en | 28.515 | 2010-05-07 | Iron Man 2 | 6.8 | 12368 | 170000000.0 | 312433331.0 | 621156389.0 | Action | 1.837843 | 3.653861 | 5 | True | Iron Man Collection |
| 2 | 862 | en | 28.005 | 1995-11-22 | Toy Story | 7.9 | 10174 | 30000000.0 | 191796233.0 | 364545516.0 | Animation | 6.393208 | 12.151517 | 11 | True | Toy Story Collection |
| 3 | 27205 | en | 27.92 | 2010-07-16 | Inception | 8.3 | 22186 | 160000000.0 | 292576195.0 | 835524642.0 | Action | 1.828601 | 5.222029 | 7 | False |  |
| 4 | 32657 | en | 26.691 | 2010-02-11 | Percy Jackson & the Olympians: The Lightning T... | 6.1 | 4229 | 95000000.0 | 88768303.0 | 223050874.0 | Adventure | 0.934403 | 2.347904 | 2 | True | Percy Jackson Collection |


## Hypothesis testing

In [19]:
# Does genre significantly impact ROI?

# H0 - Genre does not impact ROI
# H1 - Genre does impact ROI

# Domestic ROI

# Drop missing values for the relevant columns
genre_roi_data = merged_data_cleaned.dropna(subset=['primary_genre', 'roi_domestic', 'roi_worldwide'])

# Group by primary genre and calculate the mean ROI for each genre
genre_roi = genre_roi_data.groupby('primary_genre')['roi_domestic'].mean()

# Perform ANOVA to test if the means are significantly different
anova_results = stats.f_oneway(*[genre_roi_data[genre_roi_data['primary_genre'] == genre]['roi_domestic']
                                 for genre in genre_roi.index])

# Output the ANOVA results
print("ANOVA Results for Genre Impact on Domestic ROI:")
print(f"F-statistic: {anova_results.statistic}, p-value: {anova_results.pvalue}")

# Interpretation
if anova_results.pvalue < 0.05:
    print("Reject the null hypothesis: There is a significant difference in domestic ROI across genres.")
else:
    print("Fail to reject the null hypothesis: No significant difference in domestic ROI across genres.")

# Worldwide ROI

# Group by primary genre and calculate the mean ROI for each genre
genre_roi = genre_roi_data.groupby('primary_genre')['roi_worldwide'].mean()

# Perform ANOVA to test if the means are significantly different
anova_results = stats.f_oneway(*[genre_roi_data[genre_roi_data['primary_genre'] == genre]['roi_worldwide']
                                 for genre in genre_roi.index])

# Output the ANOVA results
print("ANOVA Results for Genre Impact on Worldwide ROI:")
print(f"F-statistic: {anova_results.statistic}, p-value: {anova_results.pvalue}")

# Interpretation
if anova_results.pvalue < 0.05:
    print("Reject the null hypothesis: There is a significant difference in worldwide ROI across genres.")
else:
    print("Fail to reject the null hypothesis: No significant difference in worldwide ROI across genres.")

ANOVA Results for Genre Impact on Domestic ROI:
F-statistic: 3.6299154011564334, p-value: 5.016404509540143e-08
Reject the null hypothesis: There is a significant difference in domestic ROI across genres.
ANOVA Results for Genre Impact on Worldwide ROI:
F-statistic: 4.249328207795131, p-value: 3.88690241691565e-10
Reject the null hypothesis: There is a significant difference in worldwide ROI across genres.
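
The per-genre samples passed to `stats.f_oneway` can also be collected directly from a `groupby`. The sketch below does this on synthetic data (the genre labels and ROI values are made up, not drawn from the dataset). It also runs `stats.kruskal`, a rank-based alternative that the notebook does not use, on the assumption that a check which does not rely on normality may be informative when ROI is heavily skewed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: two genres with the same mean ROI, one clearly higher.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "primary_genre": np.repeat(["Action", "Drama", "Horror"], 50),
    "roi_worldwide": np.concatenate([
        rng.normal(2.0, 0.5, 50),
        rng.normal(2.0, 0.5, 50),
        rng.normal(4.0, 0.5, 50),
    ]),
})

# One array of ROI values per genre.
groups = [g.to_numpy() for _, g in demo.groupby("primary_genre")["roi_worldwide"]]

# One-way ANOVA, as in the notebook.
anova = stats.f_oneway(*groups)

# Kruskal-Wallis H-test: a swapped-in, rank-based alternative (not in the notebook).
kruskal = stats.kruskal(*groups)

print(f"ANOVA:   F={anova.statistic:.2f}, p={anova.pvalue:.3g}")
print(f"Kruskal: H={kruskal.statistic:.2f}, p={kruskal.pvalue:.3g}")
```

If both tests agree on real data, the conclusion is less likely to hinge on the ANOVA assumptions.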


In [20]:
# Does being in a franchise impact ROI?

# H0 - There is no difference in the average ROI between franchise and non-franchise movies.
# H1 - There is a difference in the average ROI between franchise and non-franchise movies.

# Split the data into two groups: franchise and non-franchise for worldwide ROI
ww_franchise_movies = merged_data_cleaned[merged_data_cleaned['franchise'] == True]['roi_worldwide']
ww_non_franchise_movies = merged_data_cleaned[merged_data_cleaned['franchise'] == False]['roi_worldwide']

# Perform a t-test for worldwide ROI
ww_t_stat, ww_p_value = stats.ttest_ind(ww_franchise_movies.dropna(), ww_non_franchise_movies.dropna(), equal_var=False)

print("T-Test Results for Franchise Impact on Worldwide ROI:")
print(f"T-statistic: {ww_t_stat}, p-value: {ww_p_value}")

if ww_p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in worldwide ROI for movies in a franchise.")
else:
    print("Fail to reject the null hypothesis: There is NOT a significant difference in worldwide ROI for movies in a franchise.")
    
# Split the data into two groups: franchise and non-franchise for domestic ROI
d_franchise_movies = merged_data_cleaned[merged_data_cleaned['franchise'] == True]['roi_domestic']
d_non_franchise_movies = merged_data_cleaned[merged_data_cleaned['franchise'] == False]['roi_domestic']

# Perform a t-test for domestic ROI
d_t_stat, d_p_value = stats.ttest_ind(d_franchise_movies.dropna(), d_non_franchise_movies.dropna(), equal_var=False)

print("T-Test Results for Franchise Impact on Domestic ROI:")
print(f"T-statistic: {d_t_stat}, p-value: {d_p_value}")

if d_p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in domestic ROI for movies in a franchise.")
else:
    print("Fail to reject the null hypothesis: There is NOT a significant difference in domestic ROI for movies in a franchise.")

T-Test Results for Franchise Impact on Worldwide ROI:
T-statistic: 3.797323557634855, p-value: 0.0001689329659322775
Reject the null hypothesis: There is a significant difference in worldwide ROI for movies in a franchise.
T-Test Results for Franchise Impact on Domestic ROI:
T-statistic: 2.9288433848764943, p-value: 0.0035936269884244403
Reject the null hypothesis: There is a significant difference in domestic ROI for movies in a franchise.
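
Passing `equal_var=False` to `stats.ttest_ind` selects Welch's t-test, which does not assume the two groups share a variance. The sketch below recomputes the Welch statistic by hand on synthetic samples (the group sizes and distributions are made up) to show what the reported T-statistic measures.

```python
import numpy as np
from scipy import stats

# Synthetic ROI samples for the two groups (illustrative only).
rng = np.random.default_rng(1)
franchise = rng.normal(5.0, 3.0, 80)
one_off = rng.normal(3.5, 1.5, 200)

# Welch's t-test via SciPy, as in the notebook.
t_scipy, p_scipy = stats.ttest_ind(franchise, one_off, equal_var=False)

# Same statistic by hand: difference in means over the unpooled standard error.
standard_error = np.sqrt(
    franchise.var(ddof=1) / franchise.size + one_off.var(ddof=1) / one_off.size
)
t_manual = (franchise.mean() - one_off.mean()) / standard_error

print(f"SciPy t: {t_scipy:.4f}, manual t: {t_manual:.4f}, p: {p_scipy:.3g}")
```

Because the statistic is the first group's mean minus the second's, the positive T-statistics reported above indicate a higher mean ROI in the franchise group.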


In [30]:
# Does the time of year the movie is released in matter?

# H0 - There is no significant difference in ROI based on which month the movie is released
# H1 - There is a significant difference in ROI based on which month the movie is released

# Group the data by release month for worldwide ROI
ww_monthly_roi = merged_data_cleaned.groupby('release_month')['roi_worldwide'].apply(list)

# Perform the ANOVA test
ww_anova_result = stats.f_oneway(*[roi for roi in ww_monthly_roi])

# Output the ANOVA results
print("ANOVA Results for Release Month Impact on Worldwide ROI:")
print(f"F-statistic: {ww_anova_result.statistic}, p-value: {ww_anova_result.pvalue}")

# Interpretation
if ww_anova_result.pvalue < 0.05:
    print("Reject the null hypothesis: There is a significant difference in worldwide ROI across release month.")
else:
    print("Fail to reject the null hypothesis: No significant difference in worldwide ROI across release month.")

# Group the data by release month for domestic ROI
d_monthly_roi = merged_data_cleaned.groupby('release_month')['roi_domestic'].apply(list)

# Perform the ANOVA test
d_anova_result = stats.f_oneway(*[roi for roi in d_monthly_roi])

# Output the ANOVA results
print("ANOVA Results for Release Month Impact on Domestic ROI:")
print(f"F-statistic: {d_anova_result.statistic}, p-value: {d_anova_result.pvalue}")

# Interpretation
if d_anova_result.pvalue < 0.05:
    print("Reject the null hypothesis: There is a significant difference in domestic ROI across release month.")
else:
    print("Fail to reject the null hypothesis: No significant difference in domestic ROI across release month.")


ANOVA Results for Release Month Impact on Worldwide ROI:
F-statistic: 1.4252451648435245, p-value: 0.15451297079742496
Fail to reject the null hypothesis: No significant difference in worldwide ROI across release month.
ANOVA Results for Release Month Impact on Domestic ROI:
F-statistic: 1.1625733712012765, p-value: 0.3081616903528722
Fail to reject the null hypothesis: No significant difference in domestic ROI across release month.


In [35]:
# Does production budget impact ROI or Gross?

# H0a - Production budget does not have a significant impact on ROI
# H1a - Production budget has a significant impact on ROI
# H0b - Production budget does not have a significant impact on gross
# H1b - Production budget has a significant impact on gross

# Linear regression for domestic ROI
slope_dom_roi, intercept_dom_roi, r_value_dom_roi, p_value_dom_roi, std_err_dom_roi = stats.linregress(
    merged_data_cleaned['production_budget'], merged_data_cleaned['roi_domestic'])

print(f"Domestic ROI Regression Results:")
print(f"Slope: {slope_dom_roi}")
print(f"Intercept: {intercept_dom_roi}")
print(f"R-squared: {r_value_dom_roi**2}")
print(f"P-value: {p_value_dom_roi}\n")

# Linear regression for domestic gross
slope_dom_gross, intercept_dom_gross, r_value_dom_gross, p_value_dom_gross, std_err_dom_gross = stats.linregress(
    merged_data_cleaned['production_budget'], merged_data_cleaned['domestic_gross'])

print(f"Domestic Gross Regression Results:")
print(f"Slope: {slope_dom_gross}")
print(f"Intercept: {intercept_dom_gross}")
print(f"R-squared: {r_value_dom_gross**2}")
print(f"P-value: {p_value_dom_gross}\n")

# Linear regression for worldwide ROI
slope_world_roi, intercept_world_roi, r_value_world_roi, p_value_world_roi, std_err_world_roi = stats.linregress(
    merged_data_cleaned['production_budget'], merged_data_cleaned['roi_worldwide'])

print(f"Worldwide ROI Regression Results:")
print(f"Slope: {slope_world_roi}")
print(f"Intercept: {intercept_world_roi}")
print(f"R-squared: {r_value_world_roi**2}")
print(f"P-value: {p_value_world_roi}\n")

# Linear regression for worldwide gross
slope_world_gross, intercept_world_gross, r_value_world_gross, p_value_world_gross, std_err_world_gross = stats.linregress(
    merged_data_cleaned['production_budget'], merged_data_cleaned['worldwide_gross'])

print(f"Worldwide Gross Regression Results:")
print(f"Slope: {slope_world_gross}")
print(f"Intercept: {intercept_world_gross}")
print(f"R-squared: {r_value_world_gross**2}")
print(f"P-value: {p_value_world_gross}\n")


Domestic ROI Regression Results:
Slope: -9.622155887079312e-09
Intercept: 2.3242254344926367
R-squared: 0.005510687712702002
P-value: 0.0006358440802301604

Domestic Gross Regression Results:
Slope: 1.169715934925598
Intercept: 4353189.260108948
R-squared: 0.5370973593379069
P-value: 0.0

Worldwide ROI Regression Results:
Slope: -7.86202850726377e-09
Intercept: 3.9838429163431135
R-squared: 0.001266393693980641
P-value: 0.10189147507521298

Worldwide Gross Regression Results:
Slope: 3.467976689697516
Intercept: -12830191.121621355
R-squared: 0.6386219655522728
P-value: 0.0
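
The ROI slopes above look tiny because they are expressed per dollar of budget. Rescaling to units of $100 million turns the domestic ROI slope of about -9.62e-09 into roughly -0.96 per $100M, while the R-squared of about 0.0055 is unaffected. The sketch below demonstrates this scaling property on synthetic data. The budget range and noise level are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic budgets and ROI with a weak negative relationship (illustrative only).
rng = np.random.default_rng(2)
budget = rng.uniform(1e6, 3e8, 500)
roi = 2.3 - 1e-8 * budget + rng.normal(0, 1.0, 500)

# Regression with budget in dollars, as in the notebook.
per_dollar = stats.linregress(budget, roi)

# Same regression with budget expressed in units of $100 million.
per_100m = stats.linregress(budget / 1e8, roi)

print(f"Slope per dollar: {per_dollar.slope:.3e}")
print(f"Slope per $100M:  {per_100m.slope:.3f}")
print(f"R-squared (both): {per_dollar.rvalue**2:.4f}")
```

Rescaling changes only the units of the slope. The R-squared and p-value stay the same, so the strength of the relationship should be judged from those rather than from the raw slope.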

