# Business Understanding

## Problem Statement
- Our company wants to launch a movie studio but has no experience in film production. We need to identify what types of movies are currently profitable to avoid financial losses from creating unpopular films.

## Business Objectives
**Objectives**

- Identify top 5 genres with highest profit margins.

- Determine optimal release month for each genre.

- Analyze if low-budget films (<$10M) can achieve high ROI.

- Recommend 3 proven directors for hire.

## Project Goals
- Analyze movie datasets (Box Office Mojo, IMDB, etc.) to find patterns in successful films. Focus on genres, release seasons, budgets, and ratings. Create simple, clear recommendations for the studio head using basic data analysis.

## Success Criteria
- A data-backed list of film types that are likely to be successful for the company.

- Recommendations that are easy to understand and apply to a starting movie company.

- Analysis uses only the provided datasets.

# Data Understanding

## Data Sources Overview

The following publicly available movie industry datasets will be explored. Their collective relevance to the studio head's problem lies in their ability to answer core questions about profitable film types:

 - **Box Office Mojo** (CSV): Contains domestic and foreign gross earnings, studio, and release year (2010-2018), indicating how recent films performed at the box office.

 - **IMDB** (SQLite Database): Provides genre classifications, director information, and titles, essential for categorizing what types of films exist.

 - **Rotten Tomatoes** (TSV): Includes critic reviews and ratings plus film metadata, useful for understanding quality perception of successful films.

 - **TheMovieDB** (CSV): Offers popularity metrics and user vote data, supplementing the financial analysis with a measure of audience interest.

 - **The Numbers** (CSV): Features production budgets and worldwide gross figures (1915-2020), enabling profitability calculations – the core success metric.

These datasets collectively address the studio head's need to identify profitable film characteristics by covering financial performance, genre trends, release timing, and creative talent.

## Data Loading and Initial Exploration

### The Numbers Dataset

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [10]:
# Load the dataset
df_budget = pd.read_csv('data/tn.movie_budgets.csv')

df_budget.head(3) 

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18, 2009 | Avatar | $425,000,000 | $760,507,625 | $2,776,345,279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $410,600,000 | $241,063,875 | $1,045,663,875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | $350,000,000 | $42,762,350 | $149,762,350 |


In [13]:
print(df_budget.dtypes)

id                    int64
release_date         object
movie                object
production_budget    object
domestic_gross       object
worldwide_gross      object
dtype: object


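Important columns:
- `production_budget`: Film creation costs
- `domestic_gross`: US earnings
- `worldwide_gross`: Global earnings
- `release_date`: Theatrical release date

The three financial columns directly measure profitability; `release_date` supports the timing analysis.

- Financial values are stored as strings (`object`) with currency symbols and thousands separators, for example `$110,000,000`, so they require cleaning before any calculation.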

In [16]:
# release_date is still a string here, so min()/max() compare alphabetically
print(f"Oldest release: {df_budget['release_date'].min()}")
print(f"Newest release: {df_budget['release_date'].max()}")

Oldest release: Apr 1, 1975
Newest release: Sep 9, 2016


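- `release_date` is still a string at this point, so `min()` and `max()` compare alphabetically: `Apr 1, 1975` and `Sep 9, 2016` are the first and last values in alphabetical order, not the oldest and newest releases. The sample rows already include a 2019 release (Dark Phoenix), so the true date range has to be checked after converting the column to datetime.
- Release dates use the format `Mon D, YYYY` (for example `Dec 18, 2009`)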
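A minimal sketch of the difference between the alphabetical and the chronological comparison. It uses four dates that appear in the sample rows and output above rather than the full file, so the result illustrates the technique and is not the dataset's real date range:

```python
import pandas as pd

# Toy sample of dates seen above; the real column is df_budget['release_date']
dates = pd.Series(["Dec 18, 2009", "Apr 1, 1975", "Jun 7, 2019", "Sep 9, 2016"])

# On strings, min()/max() are alphabetical: "Apr..." sorts first, "Sep..." last
alphabetical_min, alphabetical_max = dates.min(), dates.max()

# Parsing to datetime first gives a chronological comparison
parsed = pd.to_datetime(dates, format="%b %d, %Y")
oldest, newest = parsed.min(), parsed.max()

print("Alphabetical:", alphabetical_min, "|", alphabetical_max)
print("Chronological:", oldest.date(), "|", newest.date())
```

Running the same two `min()`/`max()` calls on the parsed full column would give the actual coverage of the dataset.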

In [19]:
df_budget.isna().sum()

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

- No missing values

**Immediate Limitations:**
- No genre information, so this dataset alone cannot yet answer "what types" of films are profitable
- No director/crew details
- Older films may not reflect current market dynamics

### Box Office Mojo Dataset

In [24]:
# Load the dataset
df_gross = pd.read_csv('data/bom.movie_gross.csv')
df_gross.head(3)

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |


In [26]:
df_gross.dtypes

title              object
studio             object
domestic_gross    float64
foreign_gross      object
year                int64
dtype: object

Important columns:
- `title`: Film title
- `domestic_gross`: US box office earnings
- `foreign_gross`: International earnings (stored as a string)
- `year`: Release year
- `studio`: Releasing studio
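Since `foreign_gross` is read as `object`, it would need converting before any arithmetic. A hedged sketch of one way to do that: the comma-formatted value below is a hypothetical example of what could force the column to string, not something confirmed from the output above.

```python
import pandas as pd

# Hypothetical sample; the comma-formatted value is an assumption for illustration
sample = pd.DataFrame({"foreign_gross": ["652000000", "1,019,400,000", None]})

sample["foreign_gross_num"] = pd.to_numeric(
    sample["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",  # unparseable values become NaN instead of raising
)
print(sample)
```

With `errors="coerce"`, it is worth comparing the null count before and after conversion, so that values silently turned into `NaN` do not go unnoticed.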

In [29]:
df_gross.shape

(3387, 5)

In [31]:
print(f"Oldest release: {df_gross['year'].min()}")
print(f"Newest release: {df_gross['year'].max()}")

Oldest release: 2010
Newest release: 2018


- Focuses exclusively on recent films (2010-2018), which is highly relevant for our new studio

In [34]:
df_gross.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

In [36]:
df_gross.shape

(3387, 5)

- 0.83% of `domestic_gross` is missing: 28 films lack US earnings data
- 39.86% of `foreign_gross` is missing: 1,350 films lack international data
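The percentages above can be computed directly instead of by hand. A small sketch on a toy frame standing in for `df_gross` (the toy values are illustrative only), followed by the same arithmetic on the counts reported above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_gross; values are illustrative only
toy = pd.DataFrame({
    "domestic_gross": [415000000.0, np.nan, 296000000.0, 1000.0],
    "foreign_gross": ["652000000", None, None, "5000"],
})

# isna().mean() is the fraction of missing rows per column
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)

# The same arithmetic applied to the reported counts (28 and 1350 of 3387 rows)
domestic_pct = round(28 / 3387 * 100, 2)
foreign_pct = round(1350 / 3387 * 100, 2)
print(domestic_pct, foreign_pct)
```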

**Note:** 
- The 2010-2018 coverage is good for the company as it aligns with current market conditions
- Missing gross values will require data cleaning
- Contains studio information

### IMDB Dataset

In [41]:
import sqlite3
# Connect to database
conn = sqlite3.connect('data/im.db')

In [43]:
# Check available tables
pd.read_sql("""SELECT name 
               FROM sqlite_master 
               WHERE type='table';""", conn)

| | name |
|---|---|
| 0 | movie_basics |
| 1 | directors |
| 2 | known_for |
| 3 | movie_akas |
| 4 | movie_ratings |
| 5 | persons |
| 6 | principals |
| 7 | writers |


- `movie_basics`: Movie metadata 
- `movie_ratings`: Audience ratings
- `directors/writers`: Creative talent information

In [46]:
# Load movie basics table
pd.read_sql("""SELECT * 
               FROM movie_basics 
               LIMIT 5;""", conn)


| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0100275 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


- `primary_title`: Official film title, merging key

- `start_year`: Release year 

- `runtime_minutes`: Film duration

- Sample entries show genre format as comma-separated strings ("Action,Crime,Drama")
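Because each film carries several genres in one comma-separated string, a per-genre analysis needs one row per film-genre pair. A minimal sketch of that reshaping, using three rows copied from the sample above:

```python
import pandas as pd

# Rows copied from the movie_basics sample above
basics = pd.DataFrame({
    "movie_id": ["tt0063540", "tt0066787", "tt0069049"],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
})

# Split the string into a list, then give each genre its own row
genre_rows = (
    basics.assign(genre=basics["genres"].str.split(","))
          .explode("genre")
          .drop(columns="genres")
)
genre_counts = genre_rows["genre"].value_counts()
print(genre_counts)
```

One consequence of this design is that a film with three genres is counted three times, so its revenue would appear under each of its genres in any later aggregation.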

In [49]:
# Check genre nulls
null_genres = pd.read_sql("""SELECT COUNT(*) 
                           FROM movie_basics 
                           WHERE genres IS NULL;""", conn)

movie_count = pd.read_sql(""" SELECT COUNT(*) 
                              FROM movie_basics""", conn)

print(f"Total movies: {movie_count.iloc[0,0]}")
print(f"Movies missing genres: {null_genres.iloc[0,0]}")

Total movies: 146144
Movies missing genres: 5408


- Contains 146,144 movies
- 5,408 movies (3.70%) have null genre values

### Rotten Tomatoes Dataset

In [53]:
# Load Rotten Tomatoes data
df_rotten = pd.read_csv('data/rt.reviews.tsv', sep='\t',encoding= 'ISO-8859-1')
df_rotten.head(3)

| | id | review | rating | fresh | critic | top_critic | publisher | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | A distinctly gallows take on contemporary fina... | 3/5 | fresh | PJ Nabarro | 0 | Patrick Nabarro | November 10, 2018 |
| 1 | 3 | It's an allegory in search of a meaning that n... | NaN | rotten | Annalee Newitz | 0 | io9.com | May 23, 2018 |
| 2 | 3 | ... life lived in a bubble in financial dealin... | NaN | fresh | Sean Axmaker | 0 | Stream on Demand | January 4, 2018 |


In [55]:
df_rotten.dtypes

id             int64
review        object
rating        object
fresh         object
critic        object
top_critic     int64
publisher     object
date          object
dtype: object

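Key columns:
- `id`: movie identifier (several reviews can share the same `id`)
- `review`: the critic's comments on the movie
- `rating`: the critic's score as a string, for example `3/5`
- `fresh`: the critic's verdict for that review, `fresh` or `rotten` (the 60% threshold applies to a film's aggregate score, not to individual reviews)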

In [58]:
df_rotten.shape

(54432, 8)

In [60]:
# Check nulls
df_rotten.isna().sum()

id                0
review         5563
rating        13517
fresh             0
critic         2722
top_critic        0
publisher       309
date              0
dtype: int64

In [62]:
df_rotten.shape

(54432, 8)

- 24.83% of `rating` values (13,517) are missing
- 10.22% of `review` values (5,563) are missing
- 5.00% of `critic` values (2,722) are missing

### Rotten Tomatoes Movie Info (additional dataset)

In [66]:
# Load the additional Rotten Tomatoes file
df_rotten_info = pd.read_csv('data/rt.movie_info.tsv', sep='\t')
df_rotten_info.head(3)

| | id | synopsis | rating | genre | director | writer | theater_date | dvd_date | currency | box_office | runtime | studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | This gritty, fast-paced, and innovative police... | R | Action and Adventure\|Classics\|Drama | William Friedkin | Ernest Tidyman | Oct 9, 1971 | Sep 25, 2001 | NaN | NaN | 104 minutes | NaN |
| 1 | 3 | New York City, not-too-distant-future: Eric Pa... | R | Drama\|Science Fiction and Fantasy | David Cronenberg | David Cronenberg\|Don DeLillo | Aug 17, 2012 | Jan 1, 2013 | $ | 600000.0 | 108 minutes | Entertainment One |
| 2 | 5 | Illeana Douglas delivers a superb performance ... | R | Drama\|Musical and Performing Arts | Allison Anders | Allison Anders | Sep 13, 1996 | Apr 18, 2000 | NaN | NaN | 116 minutes | NaN |


In [68]:
df_rotten_info.columns

Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

- `genre`: Categorical film types

- `director`: Filmmaker information 

- `studio` : Production company

- `runtime`: Film duration in minutes

- `box_office`: Domestic earnings

In [71]:
df_rotten_info.shape

(1560, 12)

- Contains 1,560 entries with detailed film metadata
- 12 columns including runtime, genre, director, and studio information

In [74]:
df_rotten_info.isna().sum()

id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64

Key Columns with Missing Values:
- `synopsis`: 62 missing (4.0%) - plot descriptions
- `director`: 199 missing (12.8%) - filmmaker information
- `writer`: 449 missing (28.8%) - screenwriter credits
- `theater_date`: 359 missing (23.0%) - theatrical release dates
- `dvd_date`: 359 missing (23.0%) - home media release dates
- `currency`: 1,220 missing (78.2%) - money unit (USD, etc.)
- `box_office`: 1,220 missing (78.2%) - domestic earnings
- `runtime`: 30 missing (1.9%) - film duration
- `studio`: 1,066 missing (68.3%) - production companies

### TheMovieDB Dataset

In [78]:
# Load TheMovieDB data
df_tmdb = pd.read_csv('data/tmdb.movies.csv')
df_tmdb.head(3)

| | Unnamed: 0 | genre_ids | id | original_language | original_title | popularity | release_date | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [12, 14, 10751] | 12444 | en | Harry Potter and the Deathly Hallows: Part 1 | 33.533 | 2010-11-19 | Harry Potter and the Deathly Hallows: Part 1 | 7.7 | 10788 |
| 1 | 1 | [14, 12, 16, 10751] | 10191 | en | How to Train Your Dragon | 28.734 | 2010-03-26 | How to Train Your Dragon | 7.7 | 7610 |
| 2 | 2 | [12, 28, 878] | 10138 | en | Iron Man 2 | 28.515 | 2010-05-07 | Iron Man 2 | 6.8 | 12368 |


Key columns:
- `popularity`: TMDB's proprietary score
- `vote_average`: User rating (0-10 scale)
- `vote_count`: Number of user ratings

In [81]:
df_tmdb.shape

(26517, 10)

- Covers 26,000+ films

In [84]:

print(f"Popularity range: {df_tmdb['popularity'].min():.2f} to {df_tmdb['popularity'].max():.2f}")
print(f"Vote average range: {df_tmdb['vote_average'].min():.1f} to {df_tmdb['vote_average'].max():.1f}")

Popularity range: 0.60 to 80.77
Vote average range: 0.0 to 10.0


- Popularity scores range widely (0.60 to 80.77)


Limitations:

- Subjective metrics don't directly measure financial success

- The "popularity" algorithm isn't publicly defined

- Interpretation challenge: the Rotten Tomatoes and TheMovieDB datasets provide valuable audience-perception data, but because these metrics are subjective, high ratings do not guarantee box office success

# Data Preparation

## Dataset Selection

We are using:
- **The Numbers (CSV)**: Contains production budgets, domestic and worldwide gross earnings, and release dates (with month and day) for over 5,000 films. This allows us to calculate profit and ROI, and to extract release months for seasonal analysis.
- **IMDB (SQLite)**: Provides genre classifications and director information, enabling us to categorize films by type and identify successful talent.
  
We are excluding:
- **Box Office Mojo**: While it contains domestic gross and studio information, its financial data is redundant with The Numbers (which also includes worldwide gross) and it lacks detailed release dates (only year). This would add no unique value for our objectives.
- **Rotten Tomatoes** and **TheMovieDB**: Their focus on ratings and popularity does not directly address profitability, and they have significant data gaps.
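To make the profitability metrics concrete, here is a hedged sketch of how profit, ROI, and release month could be derived from The Numbers columns. The definitions used (profit as worldwide gross minus production budget, ROI as profit divided by budget) are assumptions about how this analysis will define them; the figures are the first three sample rows shown earlier, already converted to integers.

```python
import pandas as pd

# Figures copied from the first three rows of The Numbers sample
films = pd.DataFrame({
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides", "Dark Phoenix"],
    "release_date": pd.to_datetime(["2009-12-18", "2011-05-20", "2019-06-07"]),
    "production_budget": [425_000_000, 410_600_000, 350_000_000],
    "worldwide_gross": [2_776_345_279, 1_045_663_875, 149_762_350],
})

# Assumed definitions: profit = gross - budget, ROI = profit / budget
films["profit"] = films["worldwide_gross"] - films["production_budget"]
films["roi"] = films["profit"] / films["production_budget"]

# Month number (1-12) for the seasonal analysis
films["release_month"] = films["release_date"].dt.month

print(films[["movie", "profit", "roi", "release_month"]])
```

Production budgets typically exclude marketing and distribution costs, so profit computed this way is likely to overstate what a studio actually earns.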

## Cleaning

We'll clean each dataset individually.

### Cleaning The Numbers Dataset

In [89]:
print('Original shape:', df_budget.shape)
df_budget.head(3)

Original shape: (5782, 6)


| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18, 2009 | Avatar | $425,000,000 | $760,507,625 | $2,776,345,279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $410,600,000 | $241,063,875 | $1,045,663,875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | $350,000,000 | $42,762,350 | $149,762,350 |


In [91]:
print(df_budget.dtypes)

id                    int64
release_date         object
movie                object
production_budget    object
domestic_gross       object
worldwide_gross      object
dtype: object


In [93]:
# 1. Clean currency columns ($110,000,000 → 110000000)
# A loop avoids repeating the same code for each column
currency_cols = ['production_budget', 'domestic_gross', 'worldwide_gross']
for col in currency_cols:
    df_budget[col] = (
        df_budget[col]
        .str.replace('$', '', regex=False)  
        .str.replace(',', '', regex=False)   
        .astype('int64')                         
    )

The values were stored as strings such as `$110,000,000`, so I removed the `$` and `,` characters and converted the columns to integers to enable calculations.

In [95]:
# 2. Fix release dates (convert to datetime)
df_budget['release_date'] = pd.to_datetime(df_budget['release_date'], format='%b %d, %Y')

The dates were strings (for example "Dec 18, 2009"), so I converted them to datetime. This allows month extraction for the timing analysis: which month to release movies.

In [97]:
# 3. Filter valid financial records (.copy() avoids SettingWithCopyWarning on later column assignments)
valid = (df_budget['production_budget'] > 1000) & (df_budget['worldwide_gross'] > 0) & (df_budget['release_date'] >= '2005-01-01')
df_budget = df_budget[valid].copy()

The raw data included films with `$0` gross and pre-2005 releases, so I kept only rows with a production budget above `$1,000`, a worldwide gross above **zero**, and a release date on or after January 1, 2005.

This focuses the analysis on relevant, good-quality data for today's market.

In [99]:
# 4. Add release year for merging
df_budget['release_year'] = df_budget['release_date'].dt.year

In [101]:

# 5. Clean movie titles: convert to lowercase, then remove punctuation
df_budget['movie_clean'] = df_budget['movie'].str.lower().str.replace(r'[^\w\s]', '', regex=True)

Titles had punctuation, as in "Avengers: Endgame" or "Avengers, The", so I converted them to lowercase and removed punctuation to improve matching when merging with IMDB.
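A hedged sketch of how the cleaned title could serve as a merge key together with the release year. The toy frames stand in for `df_budget` and the IMDB `movie_basics` table; the IMDB genre values and the mismatched year are hypothetical, and `clean_title` is an illustrative helper name, not part of the notebook.

```python
import pandas as pd

def clean_title(titles: pd.Series) -> pd.Series:
    """Lowercase and strip punctuation, mirroring the movie_clean step above."""
    return titles.str.lower().str.replace(r"[^\w\s]", "", regex=True)

# Toy stand-ins; IMDB column names follow the movie_basics sample
budgets = pd.DataFrame({
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
    "release_year": [2009, 2011],
})
imdb = pd.DataFrame({
    "primary_title": ["Pirates of the Caribbean: On Stranger Tides", "Avatar"],
    "start_year": [2011, 2011],  # hypothetical: a different film that shares a title
    "genres": ["Action,Adventure,Fantasy", "Horror"],  # hypothetical values
})

budgets["movie_clean"] = clean_title(budgets["movie"])
imdb["title_clean"] = clean_title(imdb["primary_title"])

# Matching on title AND year keeps same-titled films from different years apart
merged = budgets.merge(
    imdb,
    left_on=["movie_clean", "release_year"],
    right_on=["title_clean", "start_year"],
    how="inner",
)
print(merged[["movie", "release_year", "genres"]])
```

An inner join silently drops films that fail to match, so comparing row counts before and after the merge shows how much data the key loses.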

In [105]:
print("New shape:", df_budget.shape)
print("Data types:\n", df_budget.dtypes)
df_budget.head(3)

New shape: (3070, 8)
Data types:
 id                            int64
release_date         datetime64[ns]
movie                        object
production_budget             int64
domestic_gross                int64
worldwide_gross               int64
release_year                  int32
movie_clean                  object
dtype: object


| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross | release_year | movie_clean |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2009-12-18 | Avatar | 425000000 | 760507625 | 2776345279 | 2009 | avatar |
| 1 | 2 | 2011-05-20 | Pirates of the Caribbean: On Stranger Tides | 410600000 | 241063875 | 1045663875 | 2011 | pirates of the caribbean on stranger tides |
| 2 | 3 | 2019-06-07 | Dark Phoenix | 350000000 | 42762350 | 149762350 | 2019 | dark phoenix |


The dataset is now clean with:

- Numeric budgets/gross for the financial calculations later.
- Standardized dates, so release months can be extracted.
- Relevant films only: released from 2005 onward, with valid values in the financial columns.

Original: 5,782 films → Cleaned: 3,070 films

**Note:**
The Numbers dataset contains zero null values across all columns. This was confirmed during the initial exploration.

### Cleaning IMDB Dataset

## Merging or Joining Data (if applicable)

## Feature Engineering

E.g., extracting main genre, budget buckets, etc.
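As a rough sketch of the two features named above: budget buckets via `pd.cut` and a "main genre" taken as the first listed genre. Only the $10M low-budget cutoff comes from the business objectives; the other bucket edges and all labels are hypothetical, and treating the first listed genre as the main one is an assumption, since the listed order may not reflect importance.

```python
import pandas as pd

budgets = pd.Series(
    [500_000, 9_999_999, 10_000_000, 75_000_000, 425_000_000],
    name="production_budget",
)

# Hypothetical bucket edges; only the $10M cutoff comes from the objectives
bins = [0, 10_000_000, 50_000_000, 100_000_000, float("inf")]
labels = ["low (<$10M)", "mid ($10M-$50M)", "high ($50M-$100M)", "blockbuster ($100M+)"]

# right=False makes each bin [lower, upper), so exactly $10M falls in "mid"
bucket = pd.cut(budgets, bins=bins, labels=labels, right=False)
print(bucket.tolist())

# Assumed convention: the first listed genre stands in for the main genre
genres = pd.Series(["Action,Crime,Drama", "Drama"])
main_genre = genres.str.split(",").str[0]
print(main_genre.tolist())
```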

# Exploratory Data Analysis (EDA)