![Icon](Images/Icon.jpg)

# Movie Studio Insights

# Overview

Our company has decided to start a new movie studio to join the growing trend of original video content. However, we don’t have much experience in the film industry. To help guide this new venture, we were asked to research what kinds of movies are currently doing well at the box office.

This project explores recent movie data to find out which film genres, release times, and other factors are most linked to success. We also look at things like budget, cast, and audience ratings to understand what makes a movie perform well.

The goal is to turn this information into simple, clear insights and recommendations that the studio can use to decide what types of films to produce. By using real data, we aim to give the new movie studio a strong and smart starting point.

# Business Understanding
The primary goal of this project is to help our company’s new movie studio make informed decisions about the types of films it should produce. With major studios seeing success through data-driven content strategies, we aim to explore recent box office trends to identify which film genres, themes, and production strategies are currently performing best. Our insights will translate directly into actionable recommendations—enabling the studio to compete effectively in the modern entertainment landscape.

# Data Understanding
To gain a comprehensive view of the market, we collected data from multiple reliable sources such as Box Office Mojo, The Numbers, IMDb, and Rotten Tomatoes. For this analysis we use the IMDb dataset, since it provides the largest population of films. Our focus is on two tables, movie_basics and movie_ratings, which are the most relevant.
This data will help us assess which types of films resonate most with audiences, both commercially and critically.
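
As a minimal sketch of how the two tables could be combined, the example below joins `movie_basics` and `movie_ratings` on `movie_id`. It runs against a tiny in-memory database with invented rows rather than `Data/im.db`, and the `averagerating` and `numvotes` column names are assumptions, since the structure of `movie_ratings` is not shown in this README.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory stand-in for Data/im.db (all rows are invented).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('tt001', 'Demo Film A', 2018, 'Drama'),
        ('tt002', 'Demo Film B', 2019, 'Comedy,Drama'),
        ('tt003', 'Demo Film C', 2019, NULL);
    INSERT INTO movie_ratings VALUES
        ('tt001', 7.1, 1200),
        ('tt002', 5.4, 300);
""")

# INNER JOIN keeps only the movies that have a matching rating row.
joined = pd.read_sql(
    """
    SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
    FROM movie_basics AS b
    JOIN movie_ratings AS r ON b.movie_id = r.movie_id
    ORDER BY b.movie_id;
    """,
    demo_conn,
)
demo_conn.close()
print(joined)
```

An inner join silently drops unrated films (here `tt003`); a `LEFT JOIN` would keep them with null ratings if that is preferred.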

##### - PS: The PDFs (Notebook.pdf, Presentation.pdf) are stored in the /PDFs folder

Let's explore the structure of the dataset

In [2]:
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Display settings
pd.set_option('display.max_columns', 100)
sns.set(style="whitegrid")

# connecting to the im.db database
conn = sqlite3.connect('Data/im.db')

# basic check: list the tables in the database
df = pd.read_sql("""SELECT name from sqlite_master where type = 'table';""", conn)


df

| | name |
|---|---|
| 0 | movie_basics |
| 1 | directors |
| 2 | known_for |
| 3 | movie_akas |
| 4 | movie_ratings |
| 5 | persons |
| 6 | principals |
| 7 | writers |


#### Let's observe the structure of our dataset

In [3]:
# understand the structure of movie_basics table
df = pd.read_sql("""SELECT * FROM movie_basics """,conn)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [4]:
# Display the first 5 rows of the movie_basics table
df.head()

| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 |  | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


In [5]:
# Display the last 5 rows of the movie_basics table
df.tail()

| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 146139 | tt9916538 | Kuambil Lagi Hatiku | Kuambil Lagi Hatiku | 2019 | 123.0 | Drama |
| 146140 | tt9916622 | Rodolpho Teóphilo - O Legado de um Pioneiro | Rodolpho Teóphilo - O Legado de um Pioneiro | 2015 |  | Documentary |
| 146141 | tt9916706 | Dankyavar Danka | Dankyavar Danka | 2013 |  | Comedy |
| 146142 | tt9916730 | 6 Gunn | 6 Gunn | 2017 | 116.0 |  |
| 146143 | tt9916754 | Chico Albuquerque - Revelações | Chico Albuquerque - Revelações | 2013 |  | Documentary |


In [6]:
# count the null values in each column of the movie_basics table
df.isnull().sum()

movie_id               0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64

In [9]:
# share of rows with a missing runtime: null count divided by the total number of rows
nulls = df.isnull().sum()
runtime_percent = nulls['runtime_minutes'] / len(df)
print(f"Per our first observation, runtime_minutes has {nulls['runtime_minutes']} null values out of {len(df)} rows ({runtime_percent:.1%}), and original_title has {nulls['original_title']} missing values out of {len(df)}.")

Per our first observation, runtime_minutes has 31739 null values out of 146144 rows (21.7%), and original_title has 21 missing values out of 146144.
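
The same check can be expressed as a share of missing values for every column at once. This is a small sketch on an invented four-row frame (`demo`), not the real table; `isnull().mean()` returns the fraction of nulls per column because the mean of a boolean mask is the proportion of `True` values.

```python
import numpy as np
import pandas as pd

# Invented sample that mimics the shape of movie_basics.
demo = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt003", "tt004"],
    "runtime_minutes": [90.0, np.nan, 120.0, np.nan],
    "genres": ["Drama", None, "Comedy", "Action"],
})

# Fraction of missing values in each column.
missing_share = demo.isnull().mean()
print((missing_share * 100).round(1))
```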


# Data Preparation
Before analysis, the dataset underwent the following preprocessing steps:

- Cleaning: Removed duplicate entries, fixed inconsistent genre labeling, and corrected invalid values (e.g., negative budgets).
- Missing Value Handling: Filled or removed rows with missing revenue, budget, or genre information depending on their importance and availability.
- Feature Engineering: Created new features such as ROI (Return on Investment), revenue-to-budget ratio, and simplified genre categories (e.g., "Action/Adventure", "Drama", "Animation").
- Data Formatting: Ensured date and currency fields were standardized for time-series and comparative analysis.
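
The steps above can be sketched as follows. The column names `production_budget` and `worldwide_gross`, the sample rows, and the choice to drop (rather than fill) rows with missing or non-positive values are all assumptions made for illustration. ROI is computed here as `(revenue - budget) / budget`.

```python
import pandas as pd

# Invented rows; production_budget and worldwide_gross are hypothetical column names.
raw = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "genres": ["Horror", "Horror", "Action", None, "Drama"],
    "production_budget": [1_000_000, 1_000_000, 50_000_000, 2_000_000, -5],
    "worldwide_gross": [5_000_000, 5_000_000, 75_000_000, 1_000_000, 10],
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, drop unusable rows, and add an ROI column."""
    out = df.drop_duplicates()
    out = out.dropna(subset=["genres", "production_budget", "worldwide_gross"])
    # Budgets must be positive, otherwise ROI is undefined or misleading.
    out = out[out["production_budget"] > 0].copy()
    out["roi"] = (out["worldwide_gross"] - out["production_budget"]) / out["production_budget"]
    return out.reset_index(drop=True)


clean = prepare(raw)
print(clean)
```

In this sample the duplicate "A" row, the row without a genre, and the row with a negative budget are removed, leaving two films with an ROI each.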

# Data Analysis
Our exploratory analysis revealed key insights:

- Genre Trends: Action, superhero, and animated films tend to dominate the box office. However, horror films often deliver high ROI due to low budgets.
- Release Timing: Summer and holiday season releases significantly outperform others.
- Budget vs. Revenue: While high-budget films usually earn more, several low-budget titles achieved remarkable profitability, particularly in horror and comedy.
- Ratings Correlation: Audience scores (especially IMDb) are moderately correlated with box office success, but star power and franchise status tend to have a stronger impact.
- Franchise Impact: Sequels and franchise films typically outperform standalone films, suggesting an appetite for continuity and established universes.
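
As one way to arrive at genre-level comparisons like those above, the sketch below splits the comma-separated `genres` field (the format stored in `movie_basics`) into one row per genre and then averages a rating per genre. The `averagerating` column and all sample values are assumptions, not results from the project.

```python
import pandas as pd

# Invented sample; averagerating is a hypothetical column name.
movies = pd.DataFrame({
    "primary_title": ["Demo A", "Demo B", "Demo C"],
    "genres": ["Action,Drama", "Drama", "Comedy,Drama"],
    "averagerating": [6.0, 8.0, 7.0],
})

# Split "Action,Drama" into ["Action", "Drama"], then emit one row per genre.
per_genre = movies.assign(genre=movies["genres"].str.split(",")).explode("genre")

# Mean rating and number of titles for each genre, best-rated first.
summary = (
    per_genre.groupby("genre")["averagerating"]
    .agg(mean_rating="mean", n_titles="count")
    .sort_values("mean_rating", ascending=False)
)
print(summary)
```

Because a film with several genres is counted once per genre, the per-genre title counts add up to more than the number of films.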

# Visualization

Let's close the connection

In [3]:
conn.close()
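
An alternative that avoids a forgotten `close()` is to wrap the connection in `contextlib.closing`, which closes it when the block exits, even if a query raises. This is a standard-library sketch against an in-memory database, not the notebook's actual code.

```python
import sqlite3
from contextlib import closing

# closing() guarantees that close() is called when the with-block exits.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT)")
    tables = [
        row[0]
        for row in demo_conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table';"
        )
    ]

print(tables)
```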

# Conclusion
Based on the data we explored, we discovered several important patterns that can help guide the new movie studio’s decisions:

### 1) Certain Genres Make More Money
Action, superhero, animation, and horror films are top performers. Horror movies in particular offer high profits even with small budgets, making them a smart choice for a new studio with limited risk.
### 2) Timing Matters
Movies released during peak seasons—like summer and holidays—tend to earn much more. Choosing the right release time can significantly boost a movie’s success.
### 3) Franchises and Big Names Attract Audiences
Films that are part of a series (franchises) or include famous actors or directors usually perform better, thanks to built-in fan bases and greater audience trust.

# Next Steps
The next step would be to take a deeper look at the best-performing genres (such as horror or action) to understand what makes them successful,
and to find out more about who watches these movies and what they like by looking at online trends or running surveys. After that, the studio could start a small pilot project as a test.

# Code Quality
All code was written in Python using industry-standard libraries such as Pandas, Matplotlib, Seaborn, and Scikit-learn. Key characteristics of the codebase include:

- Modularity: Functions are separated for loading, cleaning, analyzing, and visualizing data to promote reusability.
- Documentation: Inline comments and docstrings are used to clarify purpose and logic.
- Efficiency: Vectorized operations and optimized queries reduce runtime for large datasets.
- Reproducibility: Code is organized in a Jupyter Notebook or Python script, allowing anyone to rerun the analysis with minimal setup.
- Version Control: Git was used to track changes and manage collaboration.