Film Success Analysis

1. Project Aim

The goal of this project is to identify the key factors that make a movie financially successful so that studios, producers, and investors can make smarter decisions when funding new films.

2. Key Questions to Explore

→ Which movie genres consistently produce the highest return on investment (ROI)?

→ Does a higher IMDb rating lead to better financial performance?

→ When is the best time of year to release a movie for maximum profitability?

→ Do movies with bigger production budgets achieve higher returns?

→ Which studios consistently produce the most financially successful movies?

→ How does competition from other films released at the same time affect a movie’s ROI?

→ How does collaboration between studios impact a film’s financial success?

→ Are some production companies more successful than others, and what types of movies or actors do they usually work with?

3. Approach To Use

→ **Data Collection & Loading** – Understand the structure of the data.

→ **Data Cleaning** – Handle missing values, remove duplicates, and fix inconsistencies.

→ **Data Analysis** – Identify trends and patterns in genres, ratings, budget size, and release dates.

→ **Visualization** – Create charts and graphs to present key insights.

→ **Model Building** – Use Linear Regression to build predictive models for ROI.

→ **Conclusion** – Summarize findings and give movie producers and investors recommendations on which factors are linked to higher ROI.
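As an illustration of the Model Building step, here is a minimal sketch of an ordinary least squares fit of ROI on budget and rating. It is not the project's final model: the choice of features, the scaling of budget to millions, and the variable names are assumptions, and the five rows are the ones that `head()` displays for this dataset.

```python
import numpy as np

# Five rows of the dataset; budget expressed in millions of dollars (assumed scaling)
budget_musd = np.array([45.0, 91.0, 28.0, 215.0, 45.0])
rating = np.array([1.9, 7.3, 6.5, 7.0, 6.2])
roi = np.array([-0.998362, 1.064409, 1.218164, 6.669092, -0.521228])

# Design matrix: intercept column, budget, rating
X = np.column_stack([np.ones_like(rating), budget_musd, rating])

# Least squares solution of X @ coef ~ roi
coef, *_ = np.linalg.lstsq(X, roi, rcond=None)

predicted = X @ coef
residuals = roi - predicted
print("intercept, budget, rating coefficients:", coef)
```

With only five rows the coefficients mean little; the point is the shape of the workflow, which stays the same on the full dataframe.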



Importing all the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#operating system module
import os 

%matplotlib inline

In [3]:
#list the files in the working directory
!ls


Ans_Qsns.ipynb
CONTRIBUTING.md
LICENSE.md
README.md
Untitled.ipynb
cleaned_movie_budgets_df.csv
index.ipynb
movie_data_erd.jpeg
student.ipynb
zippedData


Read in the CSV file

In [5]:
#read the cleaned movie budgets file into a dataframe
merged_movie_budgets = pd.read_csv("cleaned_movie_budgets_df.csv")
merged_movie_budgets.head()

Unnamed: 0,movie_id,movie_name,release_year_movies,runtime_minutes,genres,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross
0,tt0249516,foodfight!,1970,91,"action,animation,comedy",1.9,8248,2012-12-31,45000000,0,73706
1,tt0359950,the secret life of walter mitty,1970,114,"adventure,comedy,drama",7.3,275300,2013-12-25,91000000,58236838,187861183
2,tt0365907,a walk among the tombstones,1970,114,"action,crime,drama",6.5,105116,2014-09-19,28000000,26017685,62108587
3,tt0369610,jurassic world,1970,124,"action,adventure,sci-fi",7.0,539338,2015-06-12,215000000,652270625,1648854864
4,tt0376136,the rum diary,1970,119,"comedy,drama",6.2,94787,2011-10-28,45000000,13109815,21544732


In [6]:
merged_movie_budgets.tail()

Unnamed: 0,movie_id,movie_name,release_year_movies,runtime_minutes,genres,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross
1421,tt8043306,teefa in trouble,1970,155,"action,comedy,crime",7.4,2724,2018-07-20,1500000,0,98806
1422,tt8155288,happy death day 2u,1970,100,"drama,horror,mystery",6.3,27462,2019-02-13,9000000,28051045,64179495
1423,tt8580348,perfectos desconocidos,1970,97,comedy,6.7,702,2017-12-31,4000000,0,31166312
1424,tt8632862,fahrenheit 11/9,1970,128,documentary,6.7,11628,2018-09-21,5000000,6352306,6653715
1425,tt9024106,unplanned,1970,106,"biography,drama",6.3,5945,2019-03-29,6000000,18107621,18107621


In [7]:
#summary statistics of the numerical columns
merged_movie_budgets.describe() 

Unnamed: 0,release_year_movies,runtime_minutes,averagerating,numvotes,production_budget,domestic_gross,worldwide_gross
count,1426.0,1426.0,1426.0,1426.0,1426.0,1426.0,1426.0
mean,1970.0,110.068022,6.408906,116754.5,44542700.0,55947860.0,139524800.0
std,0.0,16.78417,0.954934,163668.1,55351530.0,82839850.0,228936000.0
min,1970.0,87.0,1.6,624.0,9000.0,0.0,528.0
25%,1970.0,97.0,5.9,16438.0,9425000.0,4202701.0,10101820.0
50%,1970.0,107.0,6.5,59613.5,24500000.0,27632400.0,50783140.0
75%,1970.0,119.0,7.1,142556.8,55000000.0,67778920.0,156750000.0
max,1970.0,192.0,8.8,1841066.0,410600000.0,700059600.0,2048134000.0
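The summary shows `release_year_movies` equal to 1970 in every row (standard deviation 0), while the `release_date` values in the rows shown range from 2011 to 2019, so the year column carries no usable information. A minimal sketch of detecting the constant column and re-deriving the year from `release_date`, using three rows from the outputs above; the column name `release_year_fixed` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "movie_name": ["foodfight!", "jurassic world", "unplanned"],
    "release_year_movies": [1970, 1970, 1970],
    "release_date": ["2012-12-31", "2015-06-12", "2019-03-29"],
})

# A column with a single distinct value cannot explain any variation in ROI
year_is_constant = sample["release_year_movies"].nunique() == 1

# Re-derive the year from the release date instead (hypothetical column name)
sample["release_year_fixed"] = pd.to_datetime(sample["release_date"]).dt.year

print(year_is_constant)
print(sample["release_year_fixed"].tolist())
```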


In [8]:
#overview of the data
merged_movie_budgets.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1426 entries, 0 to 1425
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   movie_id             1426 non-null   object 
 1   movie_name           1426 non-null   object 
 2   release_year_movies  1426 non-null   int64  
 3   runtime_minutes      1426 non-null   int64  
 4   genres               1426 non-null   object 
 5   averagerating        1426 non-null   float64
 6   numvotes             1426 non-null   int64  
 7   release_date         1426 non-null   object 
 8   production_budget    1426 non-null   int64  
 9   domestic_gross       1426 non-null   int64  
 10  worldwide_gross      1426 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 122.7+ KB


In [12]:
#alias the dataframe under a shorter name (plain assignment does not copy; use .copy() for an independent dataframe)
movie_budgets = merged_movie_budgets
movie_budgets

Unnamed: 0,movie_id,movie_name,release_year_movies,runtime_minutes,genres,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross
0,tt0249516,foodfight!,1970,91,"action,animation,comedy",1.9,8248,2012-12-31,45000000,0,73706
1,tt0359950,the secret life of walter mitty,1970,114,"adventure,comedy,drama",7.3,275300,2013-12-25,91000000,58236838,187861183
2,tt0365907,a walk among the tombstones,1970,114,"action,crime,drama",6.5,105116,2014-09-19,28000000,26017685,62108587
3,tt0369610,jurassic world,1970,124,"action,adventure,sci-fi",7.0,539338,2015-06-12,215000000,652270625,1648854864
4,tt0376136,the rum diary,1970,119,"comedy,drama",6.2,94787,2011-10-28,45000000,13109815,21544732
...,...,...,...,...,...,...,...,...,...,...,...
1421,tt8043306,teefa in trouble,1970,155,"action,comedy,crime",7.4,2724,2018-07-20,1500000,0,98806
1422,tt8155288,happy death day 2u,1970,100,"drama,horror,mystery",6.3,27462,2019-02-13,9000000,28051045,64179495
1423,tt8580348,perfectos desconocidos,1970,97,comedy,6.7,702,2017-12-31,4000000,0,31166312
1424,tt8632862,fahrenheit 11/9,1970,128,documentary,6.7,11628,2018-09-21,5000000,6352306,6653715
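Plain assignment binds a second name to the same DataFrame, so columns added through `movie_budgets` also appear in `merged_movie_budgets`. A small sketch of the difference on a toy frame; the names `original`, `alias` and `independent` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"production_budget": [45000000, 91000000]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# Adding a column through the alias changes the original as well
alias["ROI"] = [-0.998362, 1.064409]

print(alias is original)
print("ROI" in original.columns)
print("ROI" in independent.columns)
```

If the raw data should stay untouched for later comparison, `.copy()` is the safer choice.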


In [17]:
#convert release_date to datetime
movie_budgets['release_date'] = pd.to_datetime(merged_movie_budgets['release_date'], errors='coerce')

#check if it worked
movie_budgets['release_date'].dtype

dtype('<M8[ns]')
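Because `errors='coerce'` silently turns unparseable strings into `NaT`, the dtype check alone does not show whether any dates were lost. A minimal sketch of counting them on a toy Series; the value `"not a date"` is made up to trigger the coercion.

```python
import pandas as pd

raw = pd.Series(["2012-12-31", "2013-12-25", "not a date"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# Count how many values could not be parsed
n_failed = int(parsed.isna().sum())
print(f"unparseable dates: {n_failed}")
```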

In [25]:
#identify numerical/categorical columns
#categorical cols: names of the object-dtype columns as a list
categorical_cols = merged_movie_budgets.select_dtypes(include=["object"]).columns.tolist()

#numerical cols
numerical_cols = ['release_year_movies','runtime_minutes','averagerating','numvotes','production_budget','domestic_gross','worldwide_gross']

print(f'categorical_cols:{categorical_cols}')
print(f'numerical_cols:{numerical_cols}')

categorical_cols:['movie_id', 'movie_name', 'genres']
numerical_cols:['release_year_movies', 'runtime_minutes', 'averagerating', 'numvotes', 'production_budget', 'domestic_gross', 'worldwide_gross']
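The numerical columns are typed by hand above, which can drift out of date when columns are added. A possible alternative, sketched on a toy frame with two rows from the dataset, is to let `select_dtypes` pick them; the name `non_numeric_cols` is illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "movie_name": ["the rum diary", "jurassic world"],
    "genres": ["comedy,drama", "action,adventure,sci-fi"],
    "averagerating": [6.2, 7.0],
    "numvotes": [94787, 539338],
})

# Integer and float columns are selected together with include="number"
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (text here; datetimes would also land in this list)
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"numerical_cols:{numerical_cols}")
print(f"non_numeric_cols:{non_numeric_cols}")
```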


Answering the Questions

1. Which movie genres consistently produce the highest return on investment (ROI)?

In [28]:
#Create Return On Investment Column

movie_budgets['ROI'] = (movie_budgets['worldwide_gross'] - movie_budgets['production_budget'])/movie_budgets['production_budget']

#check if it worked
movie_budgets.head()

Unnamed: 0,movie_id,movie_name,release_year_movies,runtime_minutes,genres,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross,ROI
0,tt0249516,foodfight!,1970,91,"action,animation,comedy",1.9,8248,2012-12-31,45000000,0,73706,-0.998362
1,tt0359950,the secret life of walter mitty,1970,114,"adventure,comedy,drama",7.3,275300,2013-12-25,91000000,58236838,187861183,1.064409
2,tt0365907,a walk among the tombstones,1970,114,"action,crime,drama",6.5,105116,2014-09-19,28000000,26017685,62108587,1.218164
3,tt0369610,jurassic world,1970,124,"action,adventure,sci-fi",7.0,539338,2015-06-12,215000000,652270625,1648854864,6.669092
4,tt0376136,the rum diary,1970,119,"comedy,drama",6.2,94787,2011-10-28,45000000,13109815,21544732,-0.521228
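As a quick check of the formula, ROI = (worldwide_gross - production_budget) / production_budget, the values in the table can be reproduced by hand. The helper name `roi` below is hypothetical and is not part of the notebook.

```python
def roi(worldwide_gross: float, production_budget: float) -> float:
    """Net return per dollar of budget: 1.0 means profit equal to the budget."""
    return (worldwide_gross - production_budget) / production_budget

# Figures taken from the head() output above
jurassic_world = roi(1648854864, 215000000)
foodfight = roi(73706, 45000000)

print(round(jurassic_world, 6))
print(round(foodfight, 6))
```

An ROI of -1 would mean the whole budget was lost, and 0 marks break-even on production budget alone.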


In [31]:
# calculate the average ROI for each genre combination
ROI_Genre = movie_budgets.groupby('genres')['ROI'].mean().sort_values(ascending=False)

#top 10 Genre by ROI
print(ROI_Genre.head(10))

genres
crime,drama,family           33.529686
horror,mystery,thriller      16.982064
adventure,horror             11.931420
biography,drama,fantasy      11.679440
drama,horror,thriller         9.579818
horror                        9.500400
adventure,drama,fantasy       8.630940
drama,sci-fi,thriller         8.285763
action,comedy,documentary     7.584290
adventure,horror,mystery      7.364813
Name: ROI, dtype: float64


**Insights**

→ The genre combination crime,drama,family has the highest average ROI, at 33.53.

→ Horror appears in five of the top ten combinations, which makes it the dominant high-ROI genre, especially when combined with mystery, thriller or drama.

→ Action appears only once in the top ten, in action,comedy,documentary (ROI of 7.58), which suggests action movies do not generate returns as high as drama or horror combinations.
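The grouping above treats each comma-separated combination as its own genre, so some averages may rest on very few movies. One way to check this, sketched here on the five rows from the `head()` output, is to split the combinations into single genres and report the number of movies next to each mean; the column name `genre` and the variable `per_genre` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "genres": ["action,animation,comedy", "adventure,comedy,drama",
               "action,crime,drama", "action,adventure,sci-fi", "comedy,drama"],
    "ROI": [-0.998362, 1.064409, 1.218164, 6.669092, -0.521228],
})

per_genre = (
    sample.assign(genre=sample["genres"].str.split(","))  # list of genres per movie
          .explode("genre")                               # one row per (movie, genre)
          .groupby("genre")["ROI"]
          .agg(["mean", "count"])                         # average ROI and group size
          .sort_values("mean", ascending=False)
)
print(per_genre)
```

A movie with three genres counts once in each of them, so the rows of this table overlap and should not be summed.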

2. Does a higher IMDb rating lead to better financial performance?

In [33]:
# Correlation calculation

# correlation of ratings vs ROI
correlation_ROI = movie_budgets['averagerating'].corr(movie_budgets['ROI'])

#correlation of ratings vs Worldwide gross
correlation_gross = movie_budgets['averagerating'].corr(movie_budgets['worldwide_gross'])

print(f'Correlation between Average Rating and ROI: {correlation_ROI:.2f}')
print(f'Correlation between Average Rating and Worldwide Gross: {correlation_gross:.2f}')

Correlation between Average Rating and ROI: 0.16
Correlation between Average Rating and Worldwide Gross: 0.27


**Insights**

Correlation near 1.0 = strong positive relationship

Correlation near 0 = no clear relationship

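→ Both correlations are positive but weak: 0.16 with ROI and 0.27 with worldwide gross. Better-rated movies tend to earn slightly more, but ratings alone cannot predict ROI, and a correlation does not show that higher ratings cause better returns.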
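The `.corr()` calls above compute Pearson correlation, which is sensitive to a few extreme ROI values. A rank-based (Spearman) correlation is a common complement; the sketch below computes it as the Pearson correlation of the ranks, on the five rows from the `head()` output, so the numbers are illustrative and say nothing about the full dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "averagerating": [1.9, 7.3, 6.5, 7.0, 6.2],
    "ROI": [-0.998362, 1.064409, 1.218164, 6.669092, -0.521228],
})

# Pearson: linear relationship between the raw values
pearson = sample["averagerating"].corr(sample["ROI"])

# Spearman: Pearson correlation of the ranks, robust to outliers
spearman = sample["averagerating"].rank().corr(sample["ROI"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```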

3. When is the best time of year to release a movie for maximum profitability?

In [42]:
# get the specific release month
movie_budgets['release_month'] = movie_budgets['release_date'].dt.month

# get the day of the month of each release
movie_budgets['release_day'] = movie_budgets['release_date'].dt.day

# Check if it worked
movie_budgets.head()

Unnamed: 0,movie_id,movie_name,release_year_movies,runtime_minutes,genres,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross,ROI,release_month,release_day
0,tt0249516,foodfight!,1970,91,"action,animation,comedy",1.9,8248,2012-12-31,45000000,0,73706,-0.998362,12,31
1,tt0359950,the secret life of walter mitty,1970,114,"adventure,comedy,drama",7.3,275300,2013-12-25,91000000,58236838,187861183,1.064409,12,25
2,tt0365907,a walk among the tombstones,1970,114,"action,crime,drama",6.5,105116,2014-09-19,28000000,26017685,62108587,1.218164,9,19
3,tt0369610,jurassic world,1970,124,"action,adventure,sci-fi",7.0,539338,2015-06-12,215000000,652270625,1648854864,6.669092,6,12
4,tt0376136,the rum diary,1970,119,"comedy,drama",6.2,94787,2011-10-28,45000000,13109815,21544732,-0.521228,10,28


In [47]:
#calculate the average ROI for each month
Monthly_ROI = movie_budgets.groupby('release_month')['ROI'].mean()

#months sorted from highest to lowest average ROI
sorted_Monthly_ROI = Monthly_ROI.sort_values(ascending=False)

#calculate the average ROI for each day
Daily_ROI = movie_budgets.groupby('release_day')['ROI'].mean()

#days with the highest average ROI
sorted_Daily_ROI  = Daily_ROI.sort_values(ascending=False)

#check results
print(f'Months with the highest average ROI:{sorted_Monthly_ROI.head()}')
print(f'days with the highest average ROI:{sorted_Daily_ROI.head()}')

Months with the highest average ROI:release_month
7     3.126229
10    2.967778
11    2.886975
6     2.790249
8     2.588299
Name: ROI, dtype: float64
days with the highest average ROI:release_day
28    4.013378
1     3.970736
15    3.615441
3     3.430825
11    3.010295
Name: ROI, dtype: float64


**Insights**

**Months insights**
    
→ July has the highest average ROI (3.13), meaning movies released in this month tend to perform best financially on average.

→ October and November also show strong performance, with average ROIs of 2.97 and 2.89.

→ Mid-summer (July) and late autumn (October and November) are the most profitable times to release movies; June (2.79) and August (2.59) complete the top five.

**Days insights**

→ The 28th of the month has the highest average ROI (4.01).

→ The 1st follows with an ROI of 3.97, and the 15th (mid-month) comes third with 3.62.

→ Films released at the end or start of the month tend to achieve the best average returns.
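Monthly and daily means can be pulled up by a handful of very high ROI movies, so it may help to show the median and the number of releases next to each mean. A minimal sketch on the five rows from the `head()` output; the variable name `monthly` is illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2012-12-31", "2013-12-25", "2014-09-19", "2015-06-12", "2011-10-28"]
    ),
    "ROI": [-0.998362, 1.064409, 1.218164, 6.669092, -0.521228],
})
sample["release_month"] = sample["release_date"].dt.month

# Mean, median and group size per release month
monthly = sample.groupby("release_month")["ROI"].agg(["mean", "median", "count"])
print(monthly)
```

A month whose mean is far above its median, or whose count is small, deserves a closer look before it is recommended as a release window.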



4. Do movies with bigger production budgets achieve higher returns?

In [49]:
#Correlation calculation

# correlation of production_budget vs ROI
Budget_ROI_corr = movie_budgets['production_budget'].corr(movie_budgets['ROI'])

#correlation of production_budget vs Worldwide gross
Budget_gross_corr = movie_budgets['production_budget'].corr(movie_budgets['worldwide_gross'])

print(f'Correlation between production_budget and ROI: {Budget_ROI_corr:.2f}')
print(f'Correlation between production_budget and Worldwide Gross: {Budget_gross_corr:.2f}')

Correlation between production_budget and ROI: -0.05
Correlation between production_budget and Worldwide Gross: 0.78


**Insights**

Correlation

1 → Perfect positive correlation

0 → No correlation

–1 → Perfect negative correlation


→ Production Budget vs. ROI = -0.05: very weak (almost no relationship), meaning a bigger budget does not guarantee higher returns.

→ Production Budget vs. Worldwide Gross = 0.78: strong positive correlation.

→ Movies with bigger budgets tend to earn more in total worldwide gross, but that does not always translate into higher profitability.
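A single correlation coefficient can hide differences between budget ranges. One way to look closer is to bucket movies by budget and compare the typical ROI per bucket. The sketch below uses `pd.qcut` with two buckets on the five rows from the `head()` output; the bucket labels and the choice of two quantiles are assumptions, and the toy result says nothing about the full dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "production_budget": [45000000, 91000000, 28000000, 215000000, 45000000],
    "ROI": [-0.998362, 1.064409, 1.218164, 6.669092, -0.521228],
})

# Split at the median budget into two equally ranked buckets
sample["budget_bucket"] = pd.qcut(
    sample["production_budget"], q=2, labels=["lower half", "upper half"]
)

# Median ROI and number of movies per bucket
by_bucket = (
    sample.groupby("budget_bucket", observed=True)["ROI"]
          .agg(["median", "count"])
)
print(by_bucket)
```

On the full dataframe, a larger `q` (for example quartiles) would give a finer picture of how ROI changes with budget.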