<h1 style="font-weight: bold">Movie Industry Margin Analysis by Budget Class</h1>

<figure>
    <img src="Movie Financing.png"
         alt="Movie Financing"
         width="400"
         height="200">
</figure>

<h1 style="font-weight: normal">Overview</h1>

For Microsoft's new movie studio, this project compares historical domestic and international box office results against stated production budgets to find the 'sweet spot' budget range where each budget dollar achieves the highest return.

Within that sweet spot, the analysis also identifies which movie genre pairings (action, animation, mystery, etc.) are most likely to achieve those high returns.

Microsoft can use this information to determine how much initial investment is ideal when funding individual movies as well as which genres would be most beneficial for each budget class.

<h1 style="font-weight: normal">Business Problem</h1>

<figure>
    <img src="Movie Set.jpg"
         alt="Movie Set"
         width="400"
         height="200">
</figure>

Microsoft is looking to make a move into the original video content space and is establishing a new studio. Being new to the movie space, Microsoft is looking to understand which films are performing the best at the box office.

Analysis of historical financial trends at the box office will allow Microsoft to gain insight as to how much budget to allocate to the first few films it makes in order to maximize profitability and profit margin.

<h1 style="font-weight: normal">Data Understanding</h1>

The project team has gathered historical box office information from Box Office Mojo as well as movie genre, credited movie industry professionals, and ratings information from the online IMDB database.

Our primary focus will be the financial information from Box Office Mojo; we will layer in additional detail from the online IMDB database as needed.

In [1]:
import pandas as pd
df_budgets = pd.read_csv("Raw Data/tn.movie_budgets.csv.gz")

<h2 style="font-weight: 100">Box Office Data</h2>

The box office data from Box Office Mojo includes release dates as far back as 1915 and as recent as Dec 2020. All figures are reported in USD, and 'worldwide_gross' is inclusive of 'domestic_gross' (fn 1).

In [2]:
df_budgets.head(2)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"


<h1 style="font-weight: normal">Data Preparation</h1>

In [3]:
import datetime
import string
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'

<h2 style="font-weight: 100">IMDB SQL Data</h2>

PLACEHOLDER FOR VINNIE

<h2 style="font-weight: 100">Data Cleaning</h2>

In order to make the box office data both easier to work with and easier to merge with other data sets, we will perform the following data cleaning exercises:

- Budget and box office data will have the dollar signs and commas removed 
- Budget and box office data will be formatted as a number
- Release date will be formatted as a datetime object
- Movie titles will be stripped of punctuation to make for friendlier merges later

In [4]:
### Removing dollar signs and commas, and reformatting as a number
### (regex=False so that '$' is treated as a literal character, not the regex end-of-string anchor)
df_budgets['worldwide_gross'] = df_budgets['worldwide_gross'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
df_budgets['domestic_gross'] = df_budgets['domestic_gross'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
df_budgets['production_budget'] = df_budgets['production_budget'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

### Reformatting release_date as a datetime object
df_budgets['release_date'] = pd.to_datetime(df_budgets['release_date'])

### Removing punctuation from movie titles
### (translated element-wise, so a title that contains '|' cannot misalign the rows)
transtab = str.maketrans(dict.fromkeys(punct, ''))
df_budgets['movie'] = df_budgets['movie'].str.translate(transtab)
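As a quick check of the cleaning steps, the sketch below applies them to a small toy frame. The titles and dollar amounts are invented for illustration, and `currency_to_float` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

# Same punctuation set the notebook defines ('|' is not included)
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'

# Toy rows in the raw format of the box office file (values are made up)
raw = pd.DataFrame({
    "movie": ["Example Film: Part 2", "Someone's Story"],
    "production_budget": ["$1,500,000", "$20,000,000"],
})

def currency_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' as literal characters, then convert to float."""
    return (series.str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(float))

cleaned = raw.copy()
cleaned["production_budget"] = currency_to_float(raw["production_budget"])

# Remove punctuation from each title individually
transtab = str.maketrans(dict.fromkeys(punct, ""))
cleaned["movie"] = raw["movie"].str.translate(transtab)

print(cleaned)
```

Passing `regex=False` keeps the behavior the same across pandas versions, since `$` would otherwise be read as a regular expression anchor in versions where `str.replace` defaults to regex matching.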

<h2 style="font-weight: 100">Feature Engineering</h2>

Additionally, we added certain columns to create fields that can be merged on later, as well as financial metric columns for further analysis. In short:

- We created a 'year' column from the 'release_date' datetime object
- We created a 'movie_and_year' column, which combines 'movie' and 'year'
- We created a 'profit' column, which is calculated as 'worldwide_gross' - 'production_budget'
- We created a 'profit_margin' column, which is calculated as ('profit' / 'worldwide_gross') * 100, so that it is expressed as a percentage

In [5]:
### Creation of 'year' and 'movie_and_year' columns
df_budgets['year'] = df_budgets['release_date'].dt.strftime('%Y')
df_budgets['movie_and_year'] = df_budgets['movie'] + " - " + df_budgets['year']

### Creation of 'profit' and 'profit_margin' columns
df_budgets['profit'] = df_budgets['worldwide_gross'] - df_budgets['production_budget']
df_budgets['profit_margin'] = (df_budgets['profit'] / df_budgets['worldwide_gross']) * 100

### Converting 'year' back into a numeric value
df_budgets['year'] = df_budgets['year'].map(int)
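The two financial metrics can be checked by hand. The sketch below uses the Avatar figures from the sample rows shown earlier; the second film is a hypothetical flop added only to show how the margin behaves when a film loses money.

```python
# Avatar, using the budget and gross from the sample rows above
production_budget = 425_000_000
worldwide_gross = 2_776_345_279

profit = worldwide_gross - production_budget
profit_margin = profit / worldwide_gross * 100

print(f"profit: ${profit:,}")                  # profit: $2,351,345,279
print(f"profit_margin: {profit_margin:.2f}%")  # profit_margin: 84.69%

# Hypothetical flop (invented numbers): $10m budget, $1m worldwide gross
flop_budget = 10_000_000
flop_gross = 1_000_000
flop_margin = (flop_gross - flop_budget) / flop_gross * 100

print(f"flop_margin: {flop_margin:.0f}%")      # flop_margin: -900%
```

Because the margin is measured against 'worldwide_gross', it cannot exceed 100% for any film with a positive budget, but it can fall far below zero when the gross is small relative to the budget.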

<h2 style="font-weight: 100">Data Filtering</h2>

While considering the business needs of Microsoft, the project team analyzed the data set and found certain subsections of data which would not be relevant for purposes of margin analysis. These subsets of data include:

- Filter 1: Movies that had a very small or nonexistent box office release *(less than 50k worldwide_gross)*
- Filter 2: Movies that were released in the distant past *(released before year 2000)*
- Filter 3: Movies with budgets that would likely not be considered business-practical by Microsoft *(budget of less than 500k or more than 100m)*
- Filter 4: Movies with margins that were statistically anomalous *(margin outside the range of -400% to +400%)*

See below for further analysis of these exclusions and additional rationale.

In [6]:
### Filter 1
df_budgets = df_budgets[(df_budgets['worldwide_gross'] > 50000)]

### Filter 2
df_budgets = df_budgets[df_budgets['year'] >= 2000]

### Filter 3
df_budgets = df_budgets[(df_budgets['production_budget'] > 500000) & 
                        (df_budgets['production_budget'] < 100000000)]

### Filter 4
df_budgets = df_budgets[(df_budgets['profit_margin'] >= -400) & (df_budgets['profit_margin'] <= 400)]
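To see how much each exclusion removes, the filters can be written as named boolean masks and applied in order while counting the surviving rows. This is a minimal sketch on a toy frame: the rows are invented, the `filters` dictionary is a hypothetical structure, and the thresholds follow the written filter descriptions, with boundary values assumed to be kept.

```python
import pandas as pd

# Toy data: films A-D each trip one filter, film E passes all four
toy = pd.DataFrame({
    "movie": list("ABCDE"),
    "year": [2010, 1995, 2015, 2012, 2018],
    "production_budget": [1e6, 2e6, 2e8, 1e6, 1e7],
    "worldwide_gross": [4e4, 5e6, 5e8, 1e5, 3e7],
})
toy["profit_margin"] = (
    (toy["worldwide_gross"] - toy["production_budget"])
    / toy["worldwide_gross"] * 100
)

# Each entry returns a mask of the rows to KEEP
filters = {
    "Filter 1": lambda d: d["worldwide_gross"] >= 50_000,
    "Filter 2": lambda d: d["year"] >= 2000,
    "Filter 3": lambda d: d["production_budget"].between(500_000, 100_000_000),
    "Filter 4": lambda d: d["profit_margin"].between(-400, 400),
}

remaining = toy
counts = {"start": len(remaining)}
for name, keep in filters.items():
    remaining = remaining[keep(remaining)]
    counts[name] = len(remaining)

print(counts)
# {'start': 5, 'Filter 1': 4, 'Filter 2': 3, 'Filter 3': 2, 'Filter 4': 1}
```

Tracking the count after each step makes it easy to report how many movies each exclusion drops, which supports the rationale for the thresholds.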