# Movie Studio Analysis: Understanding Box Office Success
## Business Understanding
The company is planning to launch a new movie studio, but lacks experience in movie production. The goal of this analysis is to explore current trends in the film industry and provide actionable insights that can guide the studio's strategy. This analysis focuses on identifying factors that contribute to box office success, including genre, budget, release timing, and the impact of key personnel such as directors, actors, and writers.
    

## Data Understanding
The analysis is based on multiple datasets related to movie budgets, revenues, genres, release dates, and key personnel (directors, actors, and writers).

These datasets include:
   - **Box Office Mojo (BOM) Movie Gross**: Contains information on domestic and foreign box office revenues.
   - **The Numbers (TN) Movie Budgets**: Contains production budgets along with domestic and worldwide gross revenues.
   - **Rotten Tomatoes (RT) Movie Info**: Provides metadata about movies, including genres and release dates.
   - **TMDB Movies**: Includes information on movie popularity and ratings.
   - **IMDb Database**: Provides detailed information on directors, actors, writers, and other key personnel involved in the movies.
    

## Data Preparation
In this section, we will load, clean, and merge the datasets to prepare them for analysis.

In [16]:
# Library imports
import pandas as pd
import sqlite3
from zipfile import ZipFile
import numpy as np
import scipy.stats as stats

In [4]:
# Load the datasets
rt_movie_info = pd.read_csv("zippedData/rt.movie_info.tsv.gz", sep="\t")
tmdb_movies = pd.read_csv("zippedData/tmdb.movies.csv.gz")
tn_movies = pd.read_csv("zippedData/tn.movie_budgets.csv.gz")

In [6]:
# Loading IMDb SQLite database

# Unzip the SQLite database file only if it has not been extracted yet
import os

if not os.path.exists("zippedData/im.db"):
    with ZipFile("zippedData/im.db.zip", 'r') as zObject:
        zObject.extractall("zippedData/")

# Creating the connection
conn = sqlite3.connect("zippedData/im.db")

# Loading data for directors, actors, and writers filtering for US movies in English

# Queries
query_directors = """
SELECT mb.*, mr.averagerating, mr.numvotes, p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
AND pr.category = 'director'
AND ma.language = 'en';
"""

query_actors = """
SELECT mb.*, mr.averagerating, mr.numvotes, p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
AND pr.category = 'actor'
AND ma.language = 'en';
"""

query_writers = """
SELECT mb.*, mr.averagerating, mr.numvotes, p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
AND pr.category = 'writer'
AND ma.language = 'en';
"""

# Execute queries and assign to dataframes
directors_merged = pd.read_sql_query(query_directors, conn)
actors_merged = pd.read_sql_query(query_actors, conn)
writers_merged = pd.read_sql_query(query_writers, conn)

# Close the connection
conn.close()
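
The three queries above differ only in the `pr.category` value, so they could be generated from a single parameterized template. The following is a minimal sketch rather than the notebook's actual method: `PEOPLE_QUERY` and `load_people_by_category` are hypothetical names, the table and column names are copied from the queries above, and `DISTINCT` is added on the assumption that a movie can have several US/English rows in `movie_akas`, which would otherwise repeat each movie-person pair. It runs against a tiny in-memory database with made-up rows.

```python
import sqlite3

import pandas as pd

# Hypothetical template: same joins as the queries above, with the
# principal category bound as a parameter instead of hard-coded.
PEOPLE_QUERY = """
SELECT DISTINCT mb.*, mr.averagerating, mr.numvotes,
       p.primary_name, p.birth_year, p.death_year, p.primary_profession
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
JOIN principals AS pr ON mb.movie_id = pr.movie_id
JOIN persons AS p ON pr.person_id = p.person_id
JOIN movie_akas AS ma ON mb.movie_id = ma.movie_id
WHERE ma.region = 'US'
  AND ma.language = 'en'
  AND pr.category = ?;
"""


def load_people_by_category(conn, category):
    """Return one row per (movie, person) for the given principal category."""
    return pd.read_sql_query(PEOPLE_QUERY, conn, params=(category,))


# Tiny in-memory database to exercise the helper (toy data, not real IMDb rows).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
CREATE TABLE principals (movie_id TEXT, person_id TEXT, category TEXT);
CREATE TABLE persons (person_id TEXT, primary_name TEXT, birth_year INTEGER,
                      death_year INTEGER, primary_profession TEXT);
CREATE TABLE movie_akas (movie_id TEXT, region TEXT, language TEXT);
INSERT INTO movie_basics VALUES ('m1', 'Toy Movie');
INSERT INTO movie_ratings VALUES ('m1', 7.5, 1200);
INSERT INTO principals VALUES ('m1', 'p1', 'director'), ('m1', 'p2', 'actor');
INSERT INTO persons VALUES ('p1', 'Dana Director', 1970, NULL, 'director'),
                           ('p2', 'Alex Actor', 1985, NULL, 'actor');
-- Two US/en alternate titles for one movie would duplicate rows without DISTINCT.
INSERT INTO movie_akas VALUES ('m1', 'US', 'en'), ('m1', 'US', 'en');
""")

demo_directors = load_people_by_category(demo_conn, "director")
demo_actors = load_people_by_category(demo_conn, "actor")
demo_conn.close()
```

Binding the category with `?` keeps the SQL text identical across roles, so a change to the joins or filters only has to be made once.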

## Data Cleaning
Here I clean the data by handling missing values, removing duplicate records, recasting data types, and engineering features.

In [8]:
# Cleaning IMDb data
def clean_imdb_data(df):
    # Drop rows missing a rating, vote count, or person name;
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    df = df.dropna(subset=['averagerating', 'numvotes', 'primary_name']).copy()
    
    # Convert numvotes to integers
    df['numvotes'] = df['numvotes'].astype(int)
    
    # Keep only movies with at least 1,000 votes
    df = df[df['numvotes'] >= 1000]
    
    return df

directors_cleaned = clean_imdb_data(directors_merged)
actors_cleaned = clean_imdb_data(actors_merged)
writers_cleaned = clean_imdb_data(writers_merged)
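
To see what these cleaning steps do, it can help to run them on a small hand-made frame. The sketch below is an illustration under assumptions, not the notebook's function: `clean_people_frame` is a hypothetical variant that exposes the vote threshold as a `min_votes` parameter and additionally drops repeated movie-person pairs, assuming the join against `movie_akas` can return the same pair more than once.

```python
import numpy as np
import pandas as pd


def clean_people_frame(df, min_votes=1000):
    """Hypothetical variant of the IMDb cleaning step with a tunable threshold."""
    # Drop rows missing a rating, vote count, or person name
    cleaned = df.dropna(subset=["averagerating", "numvotes", "primary_name"]).copy()
    # Recast vote counts to integers
    cleaned["numvotes"] = cleaned["numvotes"].astype(int)
    # Keep only movies with at least `min_votes` votes
    cleaned = cleaned[cleaned["numvotes"] >= min_votes]
    # Assumption: one row per movie-person pair is wanted
    return cleaned.drop_duplicates(subset=["movie_id", "primary_name"])


# Toy rows (made up): a duplicated pair, a low-vote movie,
# a missing name, and a missing vote count.
toy_people = pd.DataFrame({
    "movie_id": ["m1", "m1", "m2", "m3", "m4"],
    "primary_name": ["A", "A", "B", None, "C"],
    "averagerating": [7.0, 7.0, 6.0, 5.0, 8.0],
    "numvotes": [1500.0, 1500.0, 999.0, 2000.0, np.nan],
})

strict = clean_people_frame(toy_people)                 # default threshold
lenient = clean_people_frame(toy_people, min_votes=500)  # lower threshold
```

With the default threshold only the `m1` row survives; lowering `min_votes` to 500 also keeps `m2`.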

In [9]:
# Cleaning TMDB data
def clean_tmdb_data(df):
    # Drop rows missing popularity, vote count, or release date;
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    df = df.dropna(subset=['popularity', 'vote_count', 'release_date']).copy()
    
    # Convert release_date to datetime
    df['release_date'] = pd.to_datetime(df['release_date'])
    
    # Convert vote_count to integer
    df['vote_count'] = df['vote_count'].astype(int)
    
    # Remove duplicate rows that share a title and release date
    df = df.drop_duplicates(subset=['title', 'release_date'])
    
    return df

tmdb_cleaned = clean_tmdb_data(tmdb_movies)
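
Once `release_date` is a datetime column, release timing can be broken out into features. The following is a minimal sketch on made-up rows; the column names `release_year` and `release_month` are assumptions introduced here for illustration, not columns present in the TMDB file.

```python
import pandas as pd

# Toy rows (made up); the last date is deliberately unparseable.
toy_tmdb = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": ["2018-07-04", "2019-12-20", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
toy_tmdb["release_date"] = pd.to_datetime(toy_tmdb["release_date"], errors="coerce")
toy_tmdb = toy_tmdb.dropna(subset=["release_date"]).copy()

# Hypothetical timing features derived from the datetime column
toy_tmdb["release_year"] = toy_tmdb["release_date"].dt.year
toy_tmdb["release_month"] = toy_tmdb["release_date"].dt.month
```

Grouping by `release_month` would then allow seasonal comparisons once revenue data is joined in.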

In [11]:
# Cleaning the Rotten Tomatoes data
def clean_rt_movie_info(df):
    # Handle missing values
    df = df.dropna(subset=['genre', 'director', 'theater_date']).copy()
    
    # Convert theater_date and dvd_date to datetime (unparseable values become NaT);
    # plain column assignment lets each column take the datetime dtype
    df['theater_date'] = pd.to_datetime(df['theater_date'], errors='coerce')
    df['dvd_date'] = pd.to_datetime(df['dvd_date'], errors='coerce')
    
    # Standardize genre names
    df['genre'] = df['genre'].str.strip().str.title()
    
    return df

rt_movie_info_cleaned = clean_rt_movie_info(rt_movie_info)
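
For genre-level analysis, a movie tagged with several genres usually needs one row per genre. The sketch below assumes the `genre` field stores multiple genres in one string separated by `|`; that delimiter is an assumption to verify against the actual file, and the rows are made up.

```python
import pandas as pd

# Toy rows (made up), assuming "|" separates multiple genres in one field
toy_rt = pd.DataFrame({
    "id": [1, 2],
    "genre": ["Action and Adventure|Drama", " comedy "],
})

# Split the genre string into a list, then expand to one row per genre
genres_long = toy_rt.assign(genre=toy_rt["genre"].str.split("|")).explode("genre")

# Standardize each individual genre name after splitting
genres_long["genre"] = genres_long["genre"].str.strip().str.title()

# Count how many movies carry each genre
genre_counts = genres_long["genre"].value_counts()
```

Standardizing after the split ensures whitespace around each individual genre is trimmed, not just at the ends of the combined string.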

In [12]:
# Cleaning 'The Numbers' movie budgets data

# Our selected data already has currency in the proper format but just in case I add more data later this will come in handy
def clean_currency(x):
    if isinstance(x, str):
        return float(x.replace('$', '').replace(',', ''))
    return x

def clean_tn_movie_budgets(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    
    # Apply the currency cleaning function to the budget and revenue columns
    df['production_budget'] = df['production_budget'].apply(clean_currency)
    df['domestic_gross'] = df['domestic_gross'].apply(clean_currency)
    df['worldwide_gross'] = df['worldwide_gross'].apply(clean_currency)
    
    # Handle missing values
    df = df.dropna(subset=['production_budget', 'domestic_gross', 'worldwide_gross'])
    
    return df

tn_movie_budgets_cleaned = clean_tn_movie_budgets(tn_movies)
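
With the currency columns numeric, profitability features follow directly. This is a minimal sketch on made-up figures: `parse_currency` mirrors the currency helper above, while `worldwide_profit` and `roi` are hypothetical column names, with ROI defined here as profit divided by production budget.

```python
import pandas as pd


def parse_currency(value):
    """Strip '$' and ',' from currency strings; pass other values through."""
    if isinstance(value, str):
        return float(value.replace("$", "").replace(",", ""))
    return value


# Toy rows (made-up figures) in a "$1,234" string format
toy_budgets = pd.DataFrame({
    "movie": ["A", "B"],
    "production_budget": ["$100,000,000", "$20,000,000"],
    "domestic_gross": ["$150,000,000", "$5,000,000"],
    "worldwide_gross": ["$400,000,000", "$10,000,000"],
})

# Convert each money column from its own values
for col in ["production_budget", "domestic_gross", "worldwide_gross"]:
    toy_budgets[col] = toy_budgets[col].apply(parse_currency)

# Hypothetical engineered features: profit and return on investment
toy_budgets["worldwide_profit"] = (
    toy_budgets["worldwide_gross"] - toy_budgets["production_budget"]
)
toy_budgets["roi"] = toy_budgets["worldwide_profit"] / toy_budgets["production_budget"]
```

ROI normalizes profit by budget, which makes small and large productions comparable in later hypothesis tests.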

## Hypothesis Testing