![example](images/director_shot.jpeg)

# Project Title

**Authors:** Dietrich Nigh, Annie Zheng, Paul Schulken
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

With the return of theatrical movie releases, in addition to the increase in streaming video content, Microsoft has expressed interest in creating its own movie studio. To maximize its chances of success, the company has requested an analysis of the best-performing movies at the box office. This project used several data sets from sources such as [IMDB](https://www.imdb.com/) and [The Numbers](https://www.the-numbers.com/) to explore the factors that tend to make a movie commercially and critically successful. The relationships between box office results and genre, award votes, and release dates were analyzed to determine which combinations were most likely to produce popular and profitable movies. The analysis found that action-adventure and science fiction movies were big hits with both critics and audiences. Additionally, animated family movies were highly profitable relative to their production costs, and the best time to release a movie in any genre is early summer or around Thanksgiving and Christmas.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
Creating a movie studio is a big undertaking that requires considerable forethought. Like any business, the studio must make money to remain operational, so the analysis focused on box office results from the beginning.
* How did you pick the data analysis question(s) that you did?
The highest grossing movies were analyzed to find what they had in common.
* Why are these questions important from a business perspective?
By determining the factors most correlated with box office success, Microsoft's new studio can use the results to produce movies with confidence that they'll be profitable and well received.
***

  

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
The data used in this analysis was taken from [IMDB](https://www.imdb.com/), [The Numbers](https://www.the-numbers.com/), and [The Movie DB](https://www.themoviedb.org/), websites that track multiple metrics related to movies and allow users to review and discuss these movies.
* What do the data represent? Who is in the sample and what variables are included?
The data files provide release dates, genre information, awards vote numbers, and figures for production budgets and worldwide gross for thousands of movies.
* What is the target variable?
The target variable is profit, calculated by subtracting the production budget from the worldwide gross. It provided the foundation for the rest of the analysis.
* What are the properties of the variables you intend to use?
The data used in the analysis is primarily numerical. By representing profit, award vote counts, and release years and months as numbers, statistical analysis and conditional filtering could be performed. The genres were categorical data.
***


In [2]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

%matplotlib inline

In [3]:
# Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [4]:
# Load data set 'tmdb.movies.csv.gz' with data obtained from TMDB
movies = pd.read_csv('zippedData//tmdb.movies.csv.gz')

# Load dataset 'tn.movie_budgets.csv.gz' with data obtained from The-Numbers
movie_budgets = pd.read_csv('zippedData//tn.movie_budgets.csv.gz')

# Connect and read in SQLite3 database
conn = sqlite3.connect('zippedData/im.db')

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
Because several tables were merged together, it was necessary to remove columns that contained duplicate or unnecessary data. In addition, the genres were listed as numerical values corresponding to The Movie DB's genre codes; these were translated to genre names and broken out into separate columns. Lastly, columns for Profit and Profit:Gross Ratio were calculated to be used in the visualizations.
* How did you address missing values or outliers?
To keep the analysis focused on current movie metrics, only movies from the most recent decade were analyzed. To remove smaller, less successful movies, the bottom 25% of movies (based on award vote count) were removed. Similarly, the bottom 25% of movies based on domestic gross were removed.
* Why are these choices appropriate given the data and the business problem?
These changes were made to filter out the least popular movies in terms of critical reception and financial success. Identifying trends in the movies that remain in the data set allowed us to base our recommendations on the most well received movies.
***
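
The Profit and Profit:Gross Ratio columns mentioned above are not computed in the cells shown here, so the following is only a minimal sketch of how they might be derived. It assumes the budget and gross columns have already been converted to integers, and the column names `profit` and `profit_gross_ratio` are hypothetical, as are the toy values.

```python
import pandas as pd

# Toy rows standing in for the cleaned budget data (values are illustrative only)
sample = pd.DataFrame({
    'movie': ['Film A', 'Film B', 'Film C'],
    'production_budget': [100_000_000, 20_000_000, 5_000_000],
    'worldwide_gross': [350_000_000, 15_000_000, 0],
})

# Profit is worldwide gross minus production budget
sample['profit'] = sample['worldwide_gross'] - sample['production_budget']

# Share of the gross kept as profit; a zero gross is masked to NaN
# so the division does not produce infinite values
nonzero_gross = sample['worldwide_gross'].where(sample['worldwide_gross'] != 0)
sample['profit_gross_ratio'] = sample['profit'] / nonzero_gross

print(sample)
```

Masking a zero gross before dividing is one possible design choice; dropping those rows beforehand would work equally well.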

In [5]:
# Merge the 'movie_basics' and 'movie_rating' tables together with a left join to create 'basics_and_ratings' 
# dataframe
basics_and_ratings = pd.read_sql("""
SELECT *
FROM movie_basics
LEFT JOIN movie_ratings
    ON movie_basics.movie_id = movie_ratings.movie_id
    """, conn )

In [6]:
# Merge movie_budgets and movies with an inner join to create a 'masterdf' dataframe
masterdf = movie_budgets.merge(movies, how='inner', left_on='movie', right_on='title', \
                               suffixes=('_budgets', '_movies'))

In [7]:
# Merge the 'basics_and_ratings' and 'masterdf' dataframes with an inner join to create a 'new_masterdf' dataframe
new_masterdf = masterdf.merge(basics_and_ratings, how='inner', left_on='movie', right_on='primary_title', \
                              suffixes=('_master','_database')).drop_duplicates(subset='movie')
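
Joining on title alone can pair rows that share a name but describe different films, such as a remake and its original. The sketch below is not what this notebook does; it illustrates one possible safeguard, adding the release year as a second join key. The toy tables, the `year` column, and the titles are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the budget and TMDB tables (titles and values are illustrative)
budgets = pd.DataFrame({
    'movie': ['Remake', 'Solo Film'],
    'release_date': ['Oct 18, 2013', 'May 1, 2015'],
    'production_budget': [30_000_000, 50_000_000],
})
tmdb = pd.DataFrame({
    'title': ['Remake', 'Remake', 'Solo Film'],
    'release_date': ['1976-11-03', '2013-10-18', '2015-05-01'],
    'vote_count': [1766, 900, 1200],
})

# Derive a release year on each side so it can act as a second join key
budgets['year'] = pd.to_datetime(budgets['release_date'], format='%b %d, %Y').dt.year
tmdb['year'] = pd.to_datetime(tmdb['release_date']).dt.year

# Matching on title and year keeps the 1976 film from pairing with the 2013 budget row
merged = budgets.merge(tmdb, how='inner', left_on=['movie', 'year'],
                       right_on=['title', 'year'], suffixes=('_budgets', '_movies'))

print(merged[['movie', 'year', 'vote_count']])
```

A stricter key will also drop legitimate matches whose dates differ slightly between sources, so whether it helps depends on the data.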

In [8]:
# Drop columns with repeated information or information not relevant to analysis from the datasets
new_masterdf.drop(['id_movies', 'Unnamed: 0', 'original_title_database', 'movie_id', 'primary_title', 'title', \
                   'id_budgets', 'start_year', 'original_title_master'], axis=1, inplace=True)

In [9]:
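# Show the rows above the bottom 25% of vote_count (more than 79 votes)
# Note: the result is not assigned, so new_masterdf itself is left unfiltered
new_masterdf[new_masterdf['vote_count'] > 79]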

Unnamed: 0,release_date_budgets,movie,production_budget,domestic_gross,worldwide_gross,genre_ids,original_language,popularity,release_date_movies,vote_average,vote_count,runtime_minutes,genres,averagerating,numvotes
0,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279","[28, 12, 14, 878]",en,26.52600,2009-12-18,7.40000,18676,93.00000,Horror,6.10000,43.00000
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875","[12, 28, 14]",en,30.57900,2011-05-20,6.40000,8571,136.00000,"Action,Adventure,Fantasy",6.60000,447624.00000
2,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963","[28, 12, 878]",en,44.38300,2015-05-01,7.30000,13457,141.00000,"Action,Adventure,Sci-Fi",7.30000,665594.00000
3,"Apr 27, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200","[12, 28, 14]",en,80.77300,2018-04-27,8.30000,13948,149.00000,"Action,Adventure,Sci-Fi",8.50000,670926.00000
4,"Nov 17, 2017",Justice League,"$300,000,000","$229,024,295","$655,945,209","[28, 12, 14, 878]",en,34.95300,2017-11-17,6.20000,7510,120.00000,"Action,Adventure,Fantasy",6.50000,329135.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4491,"May 18, 2012",Indie Game: The Movie,"$100,000",$0,$0,[99],en,6.20200,2012-05-18,7.80000,259,103.00000,"Documentary,Drama",7.70000,19538.00000
4514,"Jul 25, 2014",Happy Christmas,"$70,000","$30,312","$30,312","[35, 18]",en,5.76700,2014-06-26,5.10000,95,109.00000,,,
4516,"Dec 31, 2011",Absentia,"$70,000",$0,"$8,555","[9648, 27, 53]",en,10.35700,2011-03-03,5.90000,175,87.00000,"Drama,Horror,Mystery",5.80000,15507.00000
4522,"Nov 12, 2010",Tiny Furniture,"$50,000","$391,674","$424,149","[10749, 35, 18]",en,6.69500,2010-11-12,5.90000,82,98.00000,"Comedy,Drama,Romance",6.20000,13397.00000


In [10]:
# Define function to remove '$' and ',' from a column of the dataset
def remove_dollarsigncommas(data, column):
    # regex=False treats ',' and '$' as literal characters rather than regex patterns
    data[column] = data[column].str.replace(',', '', regex=False)
    data[column] = data[column].str.replace('$', '', regex=False)
    print('all done')

# Remove '$' and ',' from relevant columns
remove_dollarsigncommas(new_masterdf, 'production_budget')
remove_dollarsigncommas(new_masterdf, 'worldwide_gross')
remove_dollarsigncommas(new_masterdf, 'domestic_gross')

all done
all done
all done


In [11]:
# Cast the relevant columns as integers for data manipulation
money_columns = ['production_budget', 'domestic_gross', 'worldwide_gross']
new_masterdf[money_columns] = new_masterdf[money_columns].astype('int64')

In [12]:
# Create 'release_month' column for data manipulation
new_masterdf['release_month'] = new_masterdf['release_date_movies'].map(lambda x: x[5:7])
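
Slicing characters 5 through 7 works as long as every date string follows the `YYYY-MM-DD` layout. As an alternative sketch, the month can be read from a parsed datetime, which yields an integer and also makes the month name available. The toy frame and the `release_month_name` column are assumptions for illustration.

```python
import pandas as pd

# Toy dates in the same YYYY-MM-DD string layout as 'release_date_movies'
dates = pd.DataFrame({'release_date_movies': ['2009-12-18', '2011-05-20', '2018-04-27']})

# Parse the strings once, then read the month off the datetime accessor
parsed = pd.to_datetime(dates['release_date_movies'])
dates['release_month'] = parsed.dt.month            # integer month, 1 through 12
dates['release_month_name'] = parsed.dt.month_name()  # full month name

print(dates)
```

Integer months sort numerically without zero padding, which can be convenient when ordering the x-axis of a chart.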

In [13]:
# TMDB movie genre IDs and respective genre names
# Source: https://www.themoviedb.org/talk/5daf6eb0ae36680011d7e6ee

tmdb_movie_genreIDs = {'genres':[{'id':28,'name':'Action'},
    {'id':12,'name':'Adventure'},
    {'id':16,'name':'Animation'},
    {'id':35,'name':'Comedy'},
    {'id':80,'name':'Crime'},
    {'id':99,'name':'Documentary'},
    {'id':18,'name':'Drama'},
    {'id':10751,'name':'Family'},
    {'id':14,'name':'Fantasy'},
    {'id':36,'name':'History'},
    {'id':27,'name':'Horror'},
    {'id':10402,'name':'Music'},
    {'id':9648,'name':'Mystery'},
    {'id':10749,'name':'Romance'},
    {'id':878,'name':'Science Fiction'},
    {'id':10770,'name':'TV Movie'},
    {'id':53,'name':'Thriller'},
    {'id':10752,'name':'War'},
    {'id':37,'name':'Western'}]}

In [14]:
# Replace each TMDB genre ID in 'genre_ids' with its genre name, using the lookup defined above
for genre in tmdb_movie_genreIDs['genres']:
    new_masterdf['genre_ids'] = new_masterdf['genre_ids'].str.replace(
        str(genre['id']), genre['name'], regex=False)

In [15]:
# Convert each 'genre_ids' string into a list of genre names
# Splitting on ', ' avoids leaving a leading space on every name after the first
new_masterdf['genre_ids'] = new_masterdf['genre_ids'].map(lambda x: x.strip('[]').split(', '))
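
String replacement works here because no two-digit TMDB ID happens to appear inside a longer one. A more defensive approach, sketched below under assumptions, parses each string into a real list of integers and looks up whole IDs in a dictionary. The helper `ids_to_names`, the `genre_names` dictionary (a subset of the TMDB lookup above), and the toy data are hypothetical.

```python
import ast
import pandas as pd

# Subset of the TMDB ID-to-name lookup, keyed by integer ID
genre_names = {28: 'Action', 12: 'Adventure', 14: 'Fantasy',
               878: 'Science Fiction', 99: 'Documentary'}

# Toy column in the same string form as 'genre_ids'
toy = pd.DataFrame({'genre_ids': ['[28, 12, 14, 878]', '[99]', '[]']})

def ids_to_names(raw):
    """Parse a string such as '[28, 12]' into a list of genre names."""
    ids = ast.literal_eval(raw)
    # Unknown IDs are kept in string form rather than silently dropped
    return [genre_names.get(genre_id, str(genre_id)) for genre_id in ids]

toy['genre_names'] = toy['genre_ids'].map(ids_to_names)

print(toy['genre_names'].tolist())
```

Because whole IDs are matched, this version cannot rewrite part of a longer ID and produces no stray whitespace in the names.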

In [16]:
# Reset the index
new_masterdf.reset_index(inplace=True)

In [17]:
# Merge the genre breakdown data with the master dataset using a left join to create the 'final_df' dataframe
final_df = new_masterdf.join(pd.DataFrame(new_masterdf.genre_ids.values.tolist()).add_prefix('genre_'), how='left', \
                             lsuffix='_votes')

# Drop the leftover 'index' column created by reset_index
final_df.drop('index', axis=1, inplace=True)

# Find the amount of missing data from each column
final_df.isna().sum()/len(final_df)

# Drop all 'genre_' breakdown columns missing more than 80% of data
final_df.drop(['genre_6','genre_5','genre_4','genre_3' ], axis=1, inplace=True)

# Drop the 'genres' column 
final_df.drop('genres', axis=1, inplace=True)

In [18]:
#Find the counts of unique values of the original_language
final_df['original_language'].value_counts()

# Drop the original_language column, as the majority of movies are in English
final_df.drop('original_language', axis=1, inplace=True)

In [19]:
# Get the descriptive statistics of the domestic_gross column
final_df['domestic_gross'].describe()

# Filter dataset to movies whose domestic gross is above the 25th percentile (first quartile)
final_df = final_df[final_df['domestic_gross'] > 1065429]
final_df

Unnamed: 0,release_date_budgets,movie,production_budget,domestic_gross,worldwide_gross,genre_ids,popularity,release_date_movies,vote_average,vote_count,runtime_minutes,averagerating,numvotes,release_month,genre_0,genre_1,genre_2
0,"Dec 18, 2009",Avatar,425000000,760507625,2776345279,"[Action, Adventure, Fantasy, Science Fiction]",26.52600,2009-12-18,7.40000,18676,93.00000,6.10000,43.00000,12,Action,Adventure,Fantasy
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,"[Adventure, Action, Fantasy]",30.57900,2011-05-20,6.40000,8571,136.00000,6.60000,447624.00000,05,Adventure,Action,Fantasy
2,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963,"[Action, Adventure, Science Fiction]",44.38300,2015-05-01,7.30000,13457,141.00000,7.30000,665594.00000,05,Action,Adventure,Science Fiction
3,"Apr 27, 2018",Avengers: Infinity War,300000000,678815482,2048134200,"[Adventure, Action, Fantasy]",80.77300,2018-04-27,8.30000,13948,149.00000,8.50000,670926.00000,04,Adventure,Action,Fantasy
4,"Nov 17, 2017",Justice League,300000000,229024295,655945209,"[Action, Adventure, Fantasy, Science Fiction]",34.95300,2017-11-17,6.20000,7510,120.00000,6.50000,329135.00000,11,Action,Adventure,Fantasy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,"Jun 19, 2015",The Overnight,200000,1109808,1165996,"[Mystery, Comedy]",6.57600,2015-06-19,6.00000,200,88.00000,7.50000,24.00000,06,Mystery,Comedy,
1745,"Jul 22, 2011",Another Earth,175000,1321194,2102779,"[Drama, Science Fiction]",10.03000,2011-07-22,6.70000,853,92.00000,7.00000,85839.00000,07,Drama,Science Fiction,
1752,"Jun 15, 2012",Your Sister's Sister,120000,1597486,3090593,"[Drama, Comedy]",7.11500,2012-06-14,6.60000,192,90.00000,6.70000,24780.00000,06,Drama,Comedy,
1753,"Jul 10, 2015",The Gallows,100000,22764410,41656474,"[Horror, Thriller]",9.16600,2015-07-10,4.80000,591,81.00000,4.20000,17763.00000,07,Horror,Thriller,
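
The cutoffs used in the cells above (79 votes and 1,065,429 in domestic gross) are hard-coded from earlier `describe()` output. As a sketch, the same kind of threshold could instead be computed with `Series.quantile`, so it stays correct if the upstream data changes. The toy frame and the names `cutoff` and `kept` are assumptions for illustration.

```python
import pandas as pd

# Toy domestic gross figures (illustrative values only)
toy = pd.DataFrame({
    'movie': list('ABCDEFGH'),
    'domestic_gross': [0, 500_000, 2_000_000, 8_000_000,
                       30_000_000, 90_000_000, 200_000_000, 700_000_000],
})

# First quartile of domestic gross, using pandas' default linear interpolation
cutoff = toy['domestic_gross'].quantile(0.25)

# Keep only the movies strictly above the first quartile
kept = toy[toy['domestic_gross'] > cutoff]

print(cutoff)
print(kept)
```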


In [20]:
# Sort by TMDB release date to inspect the range of release dates in the dataset
final_df.sort_values(by='release_date_movies')

Unnamed: 0,release_date_budgets,movie,production_budget,domestic_gross,worldwide_gross,genre_ids,popularity,release_date_movies,vote_average,vote_count,runtime_minutes,averagerating,numvotes,release_month,genre_0,genre_1,genre_2
1511,"Nov 21, 1946",The Best Years of Our Lives,2100000,23600000,23600000,"[Drama, History, Romance]",9.64700,1946-12-25,7.80000,243,,,,12,Drama,History,Romance
675,"Oct 18, 2013",Carrie,30000000,35266619,82409520,"[Horror, Thriller]",9.46700,1976-11-03,7.10000,1766,100.00000,5.90000,125424.00000,11,Horror,Thriller,
778,"Apr 18, 1986",Legend,25000000,15502112,23506237,"[Adventure, Fantasy]",10.54200,1986-04-18,6.20000,509,134.00000,,,04,Adventure,Fantasy,
1269,"Feb 12, 1988",Action Jackson,7000000,20257000,20257000,"[Action, Adventure, Comedy, Crime, Drama]",6.74400,1988-02-12,5.20000,81,144.00000,3.30000,2862.00000,02,Action,Adventure,Comedy
928,"Jun 3, 1988",Big,18000000,114968774,151668774,"[Fantasy, Drama, Comedy, Romance, Family]",15.03100,1988-06-03,7.00000,1813,99.00000,8.50000,6.00000,06,Fantasy,Drama,Comedy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
875,"Dec 25, 2018",On the Basis of Sex,20000000,24622687,38073377,"[Drama, History]",32.62400,2018-12-25,7.40000,225,120.00000,6.90000,12083.00000,12,Drama,History,
1221,"Dec 25, 2018",Destroyer,9000000,1533324,3681096,"[Thriller, Crime, Drama, Action]",17.81500,2018-12-25,5.90000,176,121.00000,6.20000,13683.00000,12,Thriller,Crime,Drama
491,"Dec 25, 2018",Holmes & Watson,42000000,30568743,41926605,"[Mystery, Adventure, Comedy, Crime]",19.33100,2018-12-25,4.10000,217,90.00000,3.80000,17661.00000,12,Mystery,Adventure,Comedy
557,"Jan 11, 2019",The Upside,37500000,108235497,119024536,"[Comedy, Drama]",28.13800,2019-01-11,7.30000,274,126.00000,6.70000,24169.00000,01,Comedy,Drama,


## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
Charts were created to provide a visual reference for the master data set. Charting the highest-grossing movies against their genre or release month allows for quick identification of the most frequently occurring variables in successful movies.
* How did you iterate on your initial approach to make it better?
Statistical measures like the mean and the first quartile (bottom 25%) were applied to find average values across the dataset and remove the worst-performing movies.
* Why are these choices appropriate given the data and the business problem?
We don't want to consider movies that performed poorly when making recommendations on how to produce profitable movies.
***
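
The modeling cell below is still a placeholder, so the following is only a sketch of the kind of aggregation the charts described above could be built from: median profit grouped by release month and by primary genre. The toy values and the names `profit`, `by_month`, and `by_genre` are assumptions; `release_month` and `genre_0` mirror columns created earlier in this notebook.

```python
import pandas as pd

# Toy rows with the columns the aggregation needs (values are illustrative only)
toy = pd.DataFrame({
    'release_month': ['06', '06', '12', '12', '02'],
    'genre_0': ['Action', 'Animation', 'Action', 'Drama', 'Drama'],
    'production_budget': [200, 80, 150, 20, 10],
    'worldwide_gross': [700, 400, 500, 60, 5],
})
toy['profit'] = toy['worldwide_gross'] - toy['production_budget']

# Median profit per release month, highest first
by_month = toy.groupby('release_month')['profit'].median().sort_values(ascending=False)

# Median profit and number of movies per primary genre, highest median first
by_genre = (toy.groupby('genre_0')['profit']
               .agg(['median', 'count'])
               .sort_values('median', ascending=False))

print(by_month)
print(by_genre)
```

The median is used here because a handful of blockbusters can dominate a mean. Either result could then be passed to a bar chart, for example with `by_month.plot.bar()`.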

In [20]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
Action-Adventure films released in the summer tend to make the most money. Animated family films tend to make a lot of money relative to their budget. Movies released around the winter holidays also tend to be commercially and critically successful.
* What are some reasons why your analysis might not fully solve the business problem?
Many of the highest-grossing movies are part of existing franchises or intellectual property. Audiences enjoy seeing movies based on stories and characters that they're already familiar with. The cost of acquiring pre-existing intellectual property was not factored into this analysis.
* What else could you do in the future to improve this project?
Attempt 
***