![example](images/director_shot.jpeg)

# Microsoft Movie Studio Market Analysis & Recommendations

**Author:** Kristin Cooper
***

## Overview

This project analyzes modern movie performance to aid decision making as Microsoft enters the movie production market. Based on data from [The Movie Database](https://www.themoviedb.org/), [IMDB](https://www.imdb.com/), and [The Numbers](https://www.the-numbers.com/), it draws conclusions about target production budget, genre, and release month.


## Business Problem

Microsoft is seeking to enter the movie production market, as they've seen competitors such as Netflix, Disney, Hulu, and Amazon succeed in original video content creation. However, there are many decisions to be made before embarking on a creative venture to ensure that Microsoft's investment in entering the market yields positive returns.

This analysis considers what type of movie Microsoft should create as their foray into movie production:
* Are some genres more successful than others?
* Do longer or shorter movies perform better? Does runtime correlate with production budget?
* How much should Microsoft invest in the production budget?
* When should Microsoft release their film to maximize performance?

## Data Understanding & Preparation

Source data includes:

* [The Numbers](https://www.the-numbers.com/) - Production budget, domestic and worldwide revenue
* [IMDB](https://www.imdb.com/) - Runtime, genres
* [The Movie Database](https://www.themoviedb.org/) - Popularity

All data was filtered to focus on modern movie trends, defined as movies released in 2010 or later.

### Import standard packages

In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels

%matplotlib inline

### Import & clean data from The Numbers
This data is primarily used to understand budgets and revenues for movies. 

SUMMARY: 2,194 movies released since 2010 with their release month, budget and revenue information
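The cleaning steps for this dataset (deriving release year and month from a date string, converting dollar strings to integers, and keeping releases from 2010 onward) can be illustrated on a toy frame. This is a minimal sketch, not the notebook's own code: the two rows are made up, and it assumes dates formatted like `Dec 18, 2015` and amounts like `$306,000,000`, as in the source file.

```python
import pandas as pd

# Toy rows in the same string format as the source file (values are made up).
raw = pd.DataFrame({
    "release_date": ["Dec 18, 2015", "Apr 3, 2009"],
    "production_budget": ["$306,000,000", "$1,400"],
})

# Parse the date once, then derive year and month from the parsed value.
dates = pd.to_datetime(raw["release_date"], format="%b %d, %Y")
raw["release_year"] = dates.dt.year
raw["release_month"] = dates.dt.month

# Strip "$" and "," and convert to 64-bit integers.
raw["production_budget"] = (
    raw["production_budget"].str.replace(r"[$,]", "", regex=True).astype("int64")
)

# Keep only movies released in 2010 or later.
modern = raw.loc[raw["release_year"] >= 2010]
print(modern)
```

Parsing the date with `pd.to_datetime` avoids slicing the string by position, so the month comes out as an integer without a separate lookup list.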

In [54]:
df_moviebudgets = pd.read_csv('data/tn.movie_budgets.csv')

# dropped ID column because there were duplicate values
df_moviebudgets.drop(columns='id', inplace=True) 

# created a column for release year to simplify/group
df_moviebudgets['release_year'] = df_moviebudgets['release_date'].apply(lambda x: x[-4:]).astype(int)

# created a column for release month to simplify/group
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_ids = [1,2,3,4,5,6,7,8,9,10,11,12]
df_moviebudgets['release_month'] = df_moviebudgets['release_date'].apply(lambda x: x[:3]).replace(months, month_ids)

# created a function to clean up the columns containing dollar amounts, then executed
def clean_money_columns(df, column):
    # strip the '$' sign and thousands separators, then convert to 64-bit integers
    df[column] = df[column].str.replace(r'[$,]', '', regex=True).astype('int64')

clean_money_columns(df=df_moviebudgets, column='production_budget')
clean_money_columns(df=df_moviebudgets, column='domestic_gross')
clean_money_columns(df=df_moviebudgets, column='worldwide_gross')

# focused just on recent movies released since 2010
df_moviebudgets = df_moviebudgets.loc[df_moviebudgets['release_year']>=2010]

print('---DATAFRAME INFO---')
display(df_moviebudgets.info())
print('\n')
print('\n')
print('---STATISTICAL MEASURES OF BUDGET & REVENUE IN SAMPLE DATA---')
display(df_moviebudgets.describe().round().drop(columns='release_year'))
print('\n')
print('\n')
print('---DATAFRAME SORTED BY WORLDWIDE GROSS DESC---')
display(df_moviebudgets.sort_values('worldwide_gross', ascending=False))

---DATAFRAME INFO---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2194 entries, 1 to 5780
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       2194 non-null   object
 1   movie              2194 non-null   object
 2   production_budget  2194 non-null   int64 
 3   domestic_gross     2194 non-null   int64 
 4   worldwide_gross    2194 non-null   int64 
 5   release_year       2194 non-null   int64 
 6   release_month      2194 non-null   int64 
dtypes: int64(5), object(2)
memory usage: 137.1+ KB


None





---STATISTICAL MEASURES OF BUDGET & REVENUE IN SAMPLE DATA---


| | production_budget | domestic_gross | worldwide_gross | release_month |
|---|---|---|---|---|
| count | 2194.0 | 2194.0 | 2194.0 | 2194.0 |
| mean | 36533472.0 | 44112031.0 | 111893400.0 | 7.0 |
| std | 51544152.0 | 79797353.0 | 215220200.0 | 4.0 |
| min | 1400.0 | 0.0 | 0.0 | 1.0 |
| 25% | 4500000.0 | 93772.0 | 1023780.0 | 4.0 |
| 50% | 16900000.0 | 12790900.0 | 27521350.0 | 7.0 |
| 75% | 42000000.0 | 53324702.0 | 113270200.0 | 10.0 |
| max | 410600000.0 | 936662225.0 | 2053311000.0 | 12.0 |






---DATAFRAME SORTED BY WORLDWIDE GROSS DESC---


| | release_date | movie | production_budget | domestic_gross | worldwide_gross | release_year | release_month |
|---|---|---|---|---|---|---|---|
| 5 | Dec 18, 2015 | Star Wars Ep. VII: The Force Awakens | 306000000 | 936662225 | 2053311220 | 2015 | 12 |
| 6 | Apr 27, 2018 | Avengers: Infinity War | 300000000 | 678815482 | 2048134200 | 2018 | 4 |
| 33 | Jun 12, 2015 | Jurassic World | 215000000 | 652270625 | 1648854864 | 2015 | 6 |
| 66 | Apr 3, 2015 | Furious 7 | 190000000 | 353007020 | 1518722794 | 2015 | 4 |
| 26 | May 4, 2012 | The Avengers | 225000000 | 623279547 | 1517935897 | 2012 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1542 | Sep 13, 2019 | The Goldfinch | 40000000 | 0 | 0 | 2019 | 9 |
| 5037 | Apr 23, 2019 | Living Dark: The Story of Ted the Caver | 1750000 | 0 | 0 | 2019 | 4 |
| 5033 | Oct 20, 2015 | Beginner’s Guide to Sex | 1800000 | 0 | 0 | 2015 | 10 |
| 5032 | Mar 11, 2014 | Against the Wild | 1800000 | 0 | 0 | 2014 | 3 |


### Import & clean data from IMDB
This data is primarily used to incorporate runtime and genre categorizations into the analysis.

SUMMARY: 146,018 movies started between 2010-2020, their runtime, and their genre tag(s)

In [22]:
df_imdbbasics = pd.read_csv('data/imdb.title.basics.csv')

# drop unnecessary columns
df_imdbbasics.drop(columns='tconst', inplace=True)

# dealt with missing data
df_imdbbasics['genres'] = df_imdbbasics['genres'].fillna('n/a') 
df_imdbbasics['original_title'] = df_imdbbasics['original_title'].fillna(df_imdbbasics['primary_title'])
# missing runtimes are stored as 0, so 0 means "unknown" rather than a real runtime
df_imdbbasics['runtime_minutes'] = df_imdbbasics['runtime_minutes'].fillna(0).astype(int)

# created a column comprised of genres in list format
df_imdbbasics['genres_list'] = df_imdbbasics['genres'].apply(lambda x: x.split(','))

# removed rows where start year is later than 2020, assuming these may not have been released, 
# especially due to COVID production delays
df_imdbbasics = df_imdbbasics.loc[df_imdbbasics['start_year']<2021]
    
print('---DATAFRAME INFO---')
display(df_imdbbasics.info())
print('\n')
print('\n')
print('---DATAFRAME SORTED BY START YEAR DESC---')
display(df_imdbbasics.sort_values('start_year', ascending=False))

---DATAFRAME INFO---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 146018 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   primary_title    146018 non-null  object
 1   original_title   146018 non-null  object
 2   start_year       146018 non-null  int64 
 3   runtime_minutes  146018 non-null  int64 
 4   genres           146018 non-null  object
 5   genres_list      146018 non-null  object
dtypes: int64(2), object(4)
memory usage: 7.8+ MB


None





---DATAFRAME SORTED BY START YEAR DESC---


| | primary_title | original_title | start_year | runtime_minutes | genres | genres_list |
|---|---|---|---|---|---|---|
| 4151 | Ravens Fell | Ravens Fell | 2020 | 0 | Horror | [Horror] |
| 4022 | Rocket | Rocket | 2020 | 0 | Adventure | [Adventure] |
| 4023 | Winter Soldier: Retrieval | Winter Soldier: Retrieval | 2020 | 0 | Action | [Action] |
| 124798 | The beauty of my knees | The beauty of my knees | 2020 | 0 | Drama | [Drama] |
| 4026 | Uniting a Nation | Uniting a Nation | 2020 | 0 | Documentary | [Documentary] |
| ... | ... | ... | ... | ... | ... | ... |
| 14641 | Vittorio racconta Gassman: Una vita da mattatore | Vittorio racconta Gassman: Una vita da mattatore | 2010 | 88 | Documentary | [Documentary] |
| 14640 | Toumast | Toumast | 2010 | 89 | Documentary | [Documentary] |
| 14638 | The World Within | The World Within | 2010 | 70 | Documentary | [Documentary] |
| 14637 | The Sound of Mumbai: A Musical | The Sound of Mumbai: A Musical | 2010 | 63 | Documentary,Musical | [Documentary, Musical] |


### Import & clean data from The Movie Database
This data is primarily used to vet assumptions around language and incorporate popularity and vote data in the analysis.

SUMMARY: 26,291 movies released between 2010-2020 with their original language, popularity as measured by TMDB interactions, and votes.

In [31]:
df_tmdbmovies = pd.read_csv('data/tmdb.movies.csv')

# focused just on recent movies released since 2010
df_tmdbmovies['release_year'] = df_tmdbmovies['release_date'].apply(lambda x: x[:4]).astype(int)
df_tmdbmovies = df_tmdbmovies.loc[df_tmdbmovies['release_year']>=2010]

# removed genre since I have this data from IMDB and release date since I have this data from The Numbers
df_tmdbmovies.drop(columns=['Unnamed: 0','genre_ids', 'id', 'release_date', 'release_year'], inplace=True)

print('---DATAFRAME INFO---')
display(df_tmdbmovies.info())
print('\n')
print('\n')
print('---STATISTICAL MEASURES OF POPULARITY & VOTES IN SAMPLE DATA---')
display(df_tmdbmovies.describe().round())
print('\n')
print('\n')
print('---DATAFRAME SORTED BY POPULARITY GROSS DESC---')
display(df_tmdbmovies.sort_values('popularity', ascending=False))

---DATAFRAME INFO---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26291 entries, 0 to 26516
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   original_language  26291 non-null  object 
 1   original_title     26291 non-null  object 
 2   popularity         26291 non-null  float64
 3   title              26291 non-null  object 
 4   vote_average       26291 non-null  float64
 5   vote_count         26291 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 1.4+ MB


None





---STATISTICAL MEASURES OF POPULARITY & VOTES IN SAMPLE DATA---


| | popularity | vote_average | vote_count |
|---|---|---|---|
| count | 26291.0 | 26291.0 | 26291.0 |
| mean | 3.0 | 6.0 | 187.0 |
| std | 4.0 | 2.0 | 939.0 |
| min | 1.0 | 0.0 | 1.0 |
| 25% | 1.0 | 5.0 | 2.0 |
| 50% | 1.0 | 6.0 | 5.0 |
| 75% | 4.0 | 7.0 | 27.0 |
| max | 81.0 | 10.0 | 22186.0 |






---DATAFRAME SORTED BY POPULARITY GROSS DESC---


| | original_language | original_title | popularity | title | vote_average | vote_count |
|---|---|---|---|---|---|---|
| 23811 | en | Avengers: Infinity War | 80.773 | Avengers: Infinity War | 8.3 | 13948 |
| 11019 | en | John Wick | 78.123 | John Wick | 7.2 | 10081 |
| 23812 | en | Spider-Man: Into the Spider-Verse | 60.534 | Spider-Man: Into the Spider-Verse | 8.4 | 4048 |
| 11020 | en | The Hobbit: The Battle of the Five Armies | 53.783 | The Hobbit: The Battle of the Five Armies | 7.3 | 8392 |
| 5179 | en | The Avengers | 50.289 | The Avengers | 7.6 | 19673 |
| ... | ... | ... | ... | ... | ... | ... |
| 13873 | en | Platonic Solid | 0.600 | Platonic Solid | 5.0 | 1 |
| 13874 | en | The Scanners Way: Creating the Special Effects... | 0.600 | The Scanners Way: Creating the Special Effects... | 5.0 | 1 |
| 13875 | en | L Word Mississippi: Hate the Sin | 0.600 | L Word Mississippi: Hate the Sin | 5.0 | 1 |
| 13876 | en | Send | 0.600 | Send | 5.0 | 1 |


### Join the 3 datasets to create the base for most analyses

In [59]:
# inner-join budgets to TMDB and then to IMDB on the movie title
joined_df = (
    df_moviebudgets
    .join(df_tmdbmovies.set_index('title'), on='movie', how='inner')
    .join(df_imdbbasics.set_index('primary_title'), on='movie', how='inner', lsuffix='_imdb')
)
joined_df.drop(columns=['release_date', 'original_title_imdb', 'original_title', 'start_year', 'genres'], inplace=True)
# a title can match several TMDB/IMDB rows, so keep one row per movie, year and budget
joined_df.drop_duplicates(subset=['movie', 'release_year', 'production_budget'], inplace=True)

print('---DATAFRAME INFO---')
display(joined_df.info())
print('\n')
print('\n')
print('---DATAFRAME SORTED BY WORLDWIDE GROSS DESC---')
display(joined_df.sort_values('worldwide_gross', ascending=False))

---DATAFRAME INFO---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1648 entries, 1 to 5772
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   movie              1648 non-null   object 
 1   production_budget  1648 non-null   int64  
 2   domestic_gross     1648 non-null   int64  
 3   worldwide_gross    1648 non-null   int64  
 4   release_year       1648 non-null   int64  
 5   release_month      1648 non-null   int64  
 6   original_language  1648 non-null   object 
 7   popularity         1648 non-null   float64
 8   vote_average       1648 non-null   float64
 9   vote_count         1648 non-null   int64  
 10  runtime_minutes    1648 non-null   int64  
 11  genres_list        1648 non-null   object 
dtypes: float64(2), int64(7), object(3)
memory usage: 167.4+ KB


None





---DATAFRAME SORTED BY WORLDWIDE GROSS DESC---


| | movie | production_budget | domestic_gross | worldwide_gross | release_year | release_month | original_language | popularity | vote_average | vote_count | runtime_minutes | genres_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Avengers: Infinity War | 300000000 | 678815482 | 2048134200 | 2018 | 4 | en | 80.773 | 8.3 | 13948 | 149 | [Action, Adventure, Sci-Fi] |
| 33 | Jurassic World | 215000000 | 652270625 | 1648854864 | 2015 | 6 | en | 20.709 | 6.6 | 14056 | 124 | [Action, Adventure, Sci-Fi] |
| 66 | Furious 7 | 190000000 | 353007020 | 1518722794 | 2015 | 4 | en | 20.396 | 7.3 | 6538 | 137 | [Action, Crime, Thriller] |
| 26 | The Avengers | 225000000 | 623279547 | 1517935897 | 2012 | 5 | en | 50.289 | 7.6 | 19673 | 143 | [Action, Adventure, Sci-Fi] |
| 3 | Avengers: Age of Ultron | 330600000 | 459005868 | 1403013963 | 2015 | 5 | en | 44.383 | 7.3 | 13457 | 141 | [Action, Adventure, Sci-Fi] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5395 | Walter | 700000 | 0 | 0 | 2015 | 3 | en | 3.277 | 5.5 | 31 | 0 | [Thriller] |
| 4997 | The Curse of Downers Grove | 2000000 | 0 | 0 | 2015 | 8 | en | 3.674 | 4.7 | 42 | 89 | [Drama, Horror, Mystery] |
| 5401 | After | 650000 | 0 | 0 | 2012 | 12 | en | 7.712 | 5.7 | 86 | 99 | [Drama, Mystery] |
| 5404 | Treachery | 625000 | 0 | 0 | 2013 | 12 | en | 1.874 | 4.8 | 5 | 67 | [Drama, Thriller] |
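
Because the join uses inner joins on the movie title alone, budget rows without a matching title are dropped (1,648 of the 2,194 budget rows remain), and a title that appears more than once in another source produces duplicate rows. The sketch below shows one way to audit both effects before joining. It is an illustration on made-up frames, not part of the notebook's pipeline; the names `budgets`, `tmdb` and `audit` are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the budget and TMDB tables.
budgets = pd.DataFrame({"movie": ["A", "B", "C"],
                        "production_budget": [10, 20, 30]})
tmdb = pd.DataFrame({"title": ["A", "B", "B"],
                     "popularity": [1.5, 2.0, 2.5]})

# A left merge with indicator=True keeps every budget row and records
# whether a match was found on the other side.
audit = budgets.merge(tmdb, left_on="movie", right_on="title",
                      how="left", indicator=True)

# Titles that an inner join would silently drop.
unmatched = audit.loc[audit["_merge"] == "left_only", "movie"].tolist()

# Titles that matched more than one row and would be duplicated.
dup_titles = audit.loc[audit.duplicated("movie", keep=False), "movie"].unique().tolist()

print("unmatched:", unmatched)
print("duplicated:", dup_titles)
```

Reviewing the unmatched and duplicated titles helps judge whether the joined sample is still representative, for example whether unrelated movies that share a title were merged.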


## Data Modeling

### Are some genres more successful than others?

The top 5 most successful genres as measured by average revenue are: 
1. Animation
1. Adventure
1. Sci-Fi
1. Action
1. Fantasy

However, the 3 genres with the highest return on investment, measured as worldwide revenue divided by production budget, are:
1. Mystery
1. Horror
1. Thriller
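
The two rankings can differ because return on investment divides revenue by budget, so a cheap movie with modest revenue can outrank an expensive blockbuster. The following minimal sketch shows the mechanics on three made-up movies; the column names mirror `joined_df`, but the numbers are invented for illustration only.

```python
import pandas as pd

# Three made-up movies; a movie may carry several genre tags.
movies = pd.DataFrame({
    "movie": ["M1", "M2", "M3"],
    "production_budget": [10, 100, 5],
    "worldwide_gross": [50, 200, 5],
    "genres_list": [["Horror", "Mystery"], ["Action"], ["Horror"]],
})

# Return on investment = worldwide revenue / production budget.
movies["return_on_investment"] = (
    movies["worldwide_gross"] / movies["production_budget"]
)

# explode() repeats a movie once per genre tag so that it counts
# toward the aggregate of every genre it belongs to.
by_genre = (
    movies.explode("genres_list")
    .groupby("genres_list")[["worldwide_gross", "return_on_investment"]]
    .mean()
)

top_revenue = by_genre["worldwide_gross"].idxmax()
top_roi = by_genre["return_on_investment"].idxmax()
print(by_genre)
print(top_revenue, top_roi)
```

In this toy data the genre with the highest average revenue is not the genre with the highest average return on investment, which is the same pattern as in the two lists above.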

In [172]:
# Duplicated movies with multiple genre tags, so that the movie is accounted for in genre aggregates
# Evaluated mean and medians; found no significant difference

exploded_df = joined_df.explode('genres_list').sort_values('worldwide_gross')
exploded_df = exploded_df.loc[exploded_df['genres_list']!='n/a']
exploded_df['return_on_investment'] = exploded_df['worldwide_gross']/exploded_df['production_budget']
exploded_df = exploded_df.loc[(exploded_df['worldwide_gross']>0) & (exploded_df['production_budget']>0)]

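# numeric_only=True skips text columns such as 'movie', which recent pandas versions no longer drop automatically
genres_df_median = exploded_df.groupby('genres_list').median(numeric_only=True).drop(columns=['release_year', 'release_month']).reset_index()
genres_df_mean = exploded_df.groupby('genres_list').mean(numeric_only=True).drop(columns=['release_year', 'release_month']).reset_index().round(decimals=2)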

exploded_df.describe()

| | production_budget | domestic_gross | worldwide_gross | release_year | release_month | popularity | vote_average | vote_count | runtime_minutes | return_on_investment |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3791.0 | 3791.0 | 3791.0 | 3791.0 | 3791.0 | 3791.0 | 3791.0 | 3791.0 | 3791.0 | 3791.0 |
| mean | 47441100.0 | 58984520.0 | 150826100.0 | 2013.799261 | 6.814297 | 12.714958 | 6.283118 | 2109.838037 | 106.988657 | 3.925018 |
| std | 57233920.0 | 87807370.0 | 243132800.0 | 2.495602 | 3.411422 | 8.093703 | 0.905888 | 2890.904184 | 19.490019 | 13.01349 |
| min | 9000.0 | 0.0 | 26.0 | 2010.0 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | 2.6e-05 |
| 25% | 10000000.0 | 4210454.0 | 10898290.0 | 2012.0 | 4.0 | 8.0 | 5.8 | 311.0 | 95.0 | 0.852637 |
| 50% | 25000000.0 | 28848690.0 | 56445530.0 | 2014.0 | 7.0 | 10.993 | 6.3 | 1008.0 | 105.0 | 2.182908 |
| 75% | 60000000.0 | 71595360.0 | 170293900.0 | 2016.0 | 10.0 | 15.762 | 6.9 | 2684.0 | 117.0 | 4.068014 |
| max | 410600000.0 | 700059600.0 | 2048134000.0 | 2019.0 | 12.0 | 80.773 | 10.0 | 22186.0 | 180.0 | 416.56474 |


In [177]:
color_sequence = px.colors.sequential.dense[1:]

fig1 = px.bar(genres_df_mean.sort_values('worldwide_gross', ascending=False), 
              x='genres_list', y='worldwide_gross', color='genres_list',
              color_discrete_sequence=color_sequence, 
              title='Average Revenue Per Movie by Genre',
              labels={'worldwide_gross': 'Average Worldwide Revenue per Movie ($M)',
                      'genres_list': 'Genre'}
             )
fig1.update_layout(plot_bgcolor='#f2f2f2', height=625)
fig1.show()

fig2 = px.bar(genres_df_mean.sort_values('worldwide_gross', ascending=False), 
              x='genres_list', y='return_on_investment', color='genres_list',
              color_discrete_sequence=color_sequence, 
              title='Average Return on Investment by Genre',
              labels={'return_on_investment': 'Average Return on Investment per Movie',
                      'genres_list': 'Genre'}
            )
fig2.update_layout(plot_bgcolor='#f2f2f2', height=625)
fig2.show()

fig3 = px.bar(genres_df_median.sort_values('worldwide_gross', ascending=False), 
              x='genres_list', y='return_on_investment', color='genres_list',
              color_discrete_sequence=color_sequence, 
              title='Median Return on Investment by Genre',
              labels={'return_on_investment': 'Median Return on Investment per Movie',
                      'genres_list': 'Genre'}
            )
fig3.update_layout(plot_bgcolor='#f2f2f2', height=625)
fig3.show()

fig4 = px.box(exploded_df, x='genres_list', y='return_on_investment', range_y=[0,20], 
              title='Return On Investment by Genre',
              color='genres_list', color_discrete_sequence=color_sequence, 
              labels={'return_on_investment': 'Return on Investment',
                      'genres_list': 'Genre'}
             )
fig4.update_layout(plot_bgcolor='#f2f2f2', height=800)
fig4.show()

### Do longer or shorter movies perform better? Does runtime correlate with production budget?
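
One way to approach this question is to correlate `runtime_minutes` with `production_budget` after excluding rows where the runtime is 0, since the IMDB cleaning step stores missing runtimes as 0. The sketch below runs on made-up numbers and only demonstrates the mechanics; it says nothing about what the real data shows.

```python
import pandas as pd

# Made-up rows; a runtime of 0 stands for "unknown", as in the IMDB cleaning step.
toy = pd.DataFrame({
    "runtime_minutes": [0, 90, 100, 120, 150],
    "production_budget": [1, 10, 20, 40, 70],
})

# Exclude placeholder runtimes so they do not distort the relationship.
valid = toy.loc[toy["runtime_minutes"] > 0]

# Pearson correlation between runtime and budget (Series.corr default).
r = valid["runtime_minutes"].corr(valid["production_budget"])
print(f"rows used: {len(valid)}, Pearson r: {r:.3f}")
```

On the real data, `method="spearman"` may be a safer choice, because budgets and revenues are heavily skewed.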

### How much should Microsoft invest in the production budget?

### When should Microsoft release their film to maximize performance?
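
One possible starting point is to group movies by `release_month` and compare typical worldwide revenue per month. The sketch below uses made-up numbers and is an assumption about how the question could be approached, not a result; the median is shown alongside the mean because a single blockbuster can dominate a month's average.

```python
import pandas as pd

# Made-up rows using the same column names as joined_df.
toy = pd.DataFrame({
    "release_month": [1, 1, 6, 6, 12],
    "worldwide_gross": [10, 30, 100, 300, 50],
})

# Count, median and mean worldwide revenue per release month.
by_month = (
    toy.groupby("release_month")["worldwide_gross"]
    .agg(["count", "median", "mean"])
)

# Month with the highest median revenue in the toy data.
best_month = by_month["median"].idxmax()
print(by_month)
print("best month by median:", best_month)
```

Keeping the `count` column in view helps avoid over-reading months that contain only a handful of releases.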

## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***