# Recommending Films for Box Office Success!

![image](https://vip-go.premiumbeat.com/wp-content/uploads/2022/02/vr_2.jpg)

*Image by DOP Eben Bolter on the LED volume stage at Rebellion Film Studios in Oxford, UK.*

## Overview

Our company sees competitors creating original video content and wants to join the trend. It is launching a new film studio, even though it has no prior experience in the film industry.

As the company's data scientist, my goal is to explore which types of films currently perform best at the box office and to turn those findings into actionable insights that the new studio can use to decide what types of films to create.

## Challenge

We will present three recommendations to our company stakeholders, each of which directly affects business revenue. The goal is to give the new film studio data-driven direction, backed by evidence that the recommendations will benefit the company.

## Datasets

In the folder `zippedData`, we have datasets from:

- [Box Office Mojo](https://www.boxofficemojo.com/)
- [IMDB](https://www.imdb.com/)
- [The Numbers](https://www.the-numbers.com/)

## Solution

This project uses statistical analysis, including formulating three hypotheses about what contributes to a film's success. From these we derive three business recommendations, test the hypotheses against box-office data, and provide statistics as evidence to support the recommendations and the direction of the new film studio.
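
To make the idea of testing a hypothesis concrete, here is a minimal sketch of a two-sample comparison with `scipy.stats`. The two groups, their sizes, and the revenue figures are synthetic and purely illustrative; they are not drawn from the project's datasets, and Welch's t-test is only one test that could be used here.

```python
import numpy as np
from scipy import stats

# Synthetic revenues (in millions); NOT taken from the project data.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=120, scale=30, size=200)  # hypothetical "genre A" films
group_b = rng.normal(loc=100, scale=30, size=200)  # hypothetical "genre B" films

# H0: both groups have the same mean revenue. H1: the means differ.
# equal_var=False requests Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

A small p-value is evidence against the null hypothesis; it supports a recommendation but does not prove it.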

## Results

# Loading Essentials

## Loading Tools

Import our data science tools.

In [1]:
import itertools
import numpy as np
import pandas as pd 
from numbers import Number
import sqlite3
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
pd.set_option('display.float_format', '{:,.2f}'.format)

## Loading Datasets

### IMDb, SQL Database

IMDb aims to collect data on all films and make it available to everyday people.

In [2]:
zip_path = 'zippedData/im.db.zip'
extract_path = 'zippedData/'

with zipfile.ZipFile(zip_path,'r') as zip_ref:
    zip_ref.extractall(extract_path)

db_path = os.path.join(extract_path, 'im.db')

conn = sqlite3.connect(db_path)
pd.read_sql("""
    SELECT *
    FROM sqlite_master
    WHERE type = 'table';
""",conn)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."
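
The `sql` column above is truncated, so the full column list of each table is not visible. One way to inspect a table's columns is SQLite's `PRAGMA table_info`. The sketch below runs against a small in-memory stand-in rather than `im.db`; the table definition and the name `demo_conn` are assumptions made for illustration.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory stand-in for im.db; the two columns here are illustrative only.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)"
    )
    # PRAGMA table_info returns one row per column of the named table.
    columns = pd.read_sql("PRAGMA table_info(movie_ratings);", demo_conn)

print(columns[["name", "type"]])
```

Running the same `PRAGMA` query against `conn` should list the real columns of any table shown above.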


In [3]:
df_imdb = pd.read_sql("""
    SELECT *
    FROM movie_basics AS mb
    INNER JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id;
""",conn)
df_imdb['title'] = df_imdb['primary_title']
df_imdb

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,movie_id.1,averagerating,numvotes,title
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama",tt0063540,7.00,77,Sunghursh
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama",tt0066787,7.20,43,One Day Before the Rainy Season
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama,tt0069049,6.90,4517,The Other Side of the Wind
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",tt0069204,6.10,13,Sabse Bada Sukh
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy",tt0100275,6.50,119,The Wandering Soap Opera
...,...,...,...,...,...,...,...,...,...,...
73851,tt9913084,Diabolik sono io,Diabolik sono io,2019,75.00,Documentary,tt9913084,6.20,6,Diabolik sono io
73852,tt9914286,Sokagin Çocuklari,Sokagin Çocuklari,2019,98.00,"Drama,Family",tt9914286,8.70,136,Sokagin Çocuklari
73853,tt9914642,Albatross,Albatross,2017,,Documentary,tt9914642,8.50,8,Albatross
73854,tt9914942,La vida sense la Sara Amat,La vida sense la Sara Amat,2019,,,tt9914942,6.60,5,La vida sense la Sara Amat


### TheNumbers, CSV

TheNumbers' goal is to collect the most accurate details of every film's budget and revenue.

In [4]:
df_tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
df_tn_movie_budgets['title'] = df_tn_movie_budgets['movie']
df_tn_movie_budgets

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,title
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279",Avatar
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",Pirates of the Caribbean: On Stranger Tides
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",Dark Phoenix
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",Avengers: Age of Ultron
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",Star Wars Ep. VIII: The Last Jedi
...,...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0,Red 11
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495",Following
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338",Return to the Land of Wonders
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0,A Plague So Pleasant


### Merge Data For Analysis

Reasons for Combined Data
- It was necessary to `merge` our data from `IMdb` and `TheNumbers` because IMDb is a popular platform that houses film details, including genres, ratings, and votes, while TheNumbers has reliable information on film budgets and revenues worldwide.
- By combining these two datasets, we couple `films`, `genres`, `ratings`, and `votes` with `financial data`, which lets us investigate more deeply in our analysis.

In [5]:
df_box_office = pd.merge(df_imdb, df_tn_movie_budgets, on='title', how='inner')
df_box_office

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,movie_id.1,averagerating,numvotes,title,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,tt0249516,Foodfight!,Foodfight!,2012,91.00,"Action,Animation,Comedy",tt0249516,1.90,8248,Foodfight!,26,"Dec 31, 2012",Foodfight!,"$45,000,000",$0,"$73,706"
1,tt0326592,The Overnight,The Overnight,2010,88.00,,tt0326592,7.50,24,The Overnight,21,"Jun 19, 2015",The Overnight,"$200,000","$1,109,808","$1,165,996"
2,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",tt0337692,6.10,37886,On the Road,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302"
3,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.00,"Adventure,Comedy,Drama",tt0359950,7.30,275300,The Secret Life of Walter Mitty,37,"Dec 25, 2013",The Secret Life of Walter Mitty,"$91,000,000","$58,236,838","$187,861,183"
4,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.00,"Action,Crime,Drama",tt0365907,6.50,105116,A Walk Among the Tombstones,67,"Sep 19, 2014",A Walk Among the Tombstones,"$28,000,000","$26,017,685","$62,108,587"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2870,tt9746500,Earth,Erde,2019,115.00,Documentary,tt9746500,7.30,49,Earth,36,"Apr 22, 2009",Earth,"$47,000,000","$32,011,576","$116,773,317"
2871,tt9851050,Sisters,Sisters,2019,,"Action,Drama",tt9851050,4.70,14,Sisters,57,"Dec 18, 2015",Sisters,"$30,000,000","$87,044,645","$106,030,660"
2872,tt9861522,Ali,Ali,2019,110.00,Drama,tt9861522,7.70,79,Ali,45,"Dec 25, 2001",Ali,"$109,000,000","$58,183,966","$87,683,966"
2873,tt9899880,Columbus,Columbus,2018,85.00,Comedy,tt9899880,5.80,5,Columbus,93,"Aug 4, 2017",Columbus,"$700,000","$1,017,107","$1,110,511"
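
Joining on `title` alone can pair unrelated films that share a name: in the output above, the 2019 IMDb entry "Ali" is matched with a film released in 2001. A possible refinement, sketched here on toy data, is to join on the title and a release year together. The frames `imdb` and `budgets`, the film names, and the `release_year` column are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_imdb and df_tn_movie_budgets (values are illustrative).
imdb = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "start_year": [2019, 2017],
})
budgets = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release_date": ["Dec 25, 2001", "Aug 4, 2017"],
})

# Derive a release year so both frames share a second join key.
budgets["release_year"] = pd.to_datetime(
    budgets["release_date"], format="%b %d, %Y"
).dt.year

# Only rows whose title AND year agree survive the inner join.
matched = imdb.merge(
    budgets,
    left_on=["title", "start_year"],
    right_on=["title", "release_year"],
    how="inner",
)
print(matched)
```

Exact year matching is a trade-off: it removes same-name mismatches, but it can also drop genuine matches when the IMDb `start_year` and the release date differ by a year.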


# Exploratory Data Analysis

## Step 1. Understanding The Data
- Dataframe `shape`
- `head` and `tail`
- `info`
- `describe`

In [6]:
df_box_office.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,movie_id.1,averagerating,numvotes,title,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,tt0249516,Foodfight!,Foodfight!,2012,91.0,"Action,Animation,Comedy",tt0249516,1.9,8248,Foodfight!,26,"Dec 31, 2012",Foodfight!,"$45,000,000",$0,"$73,706"
1,tt0326592,The Overnight,The Overnight,2010,88.0,,tt0326592,7.5,24,The Overnight,21,"Jun 19, 2015",The Overnight,"$200,000","$1,109,808","$1,165,996"
2,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",tt0337692,6.1,37886,On the Road,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302"
3,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",tt0359950,7.3,275300,The Secret Life of Walter Mitty,37,"Dec 25, 2013",The Secret Life of Walter Mitty,"$91,000,000","$58,236,838","$187,861,183"
4,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",tt0365907,6.5,105116,A Walk Among the Tombstones,67,"Sep 19, 2014",A Walk Among the Tombstones,"$28,000,000","$26,017,685","$62,108,587"


In [7]:
df_box_office.shape

(2875, 16)

In [8]:
df_box_office.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2875 entries, 0 to 2874
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   movie_id           2875 non-null   object 
 1   primary_title      2875 non-null   object 
 2   original_title     2875 non-null   object 
 3   start_year         2875 non-null   int64  
 4   runtime_minutes    2757 non-null   float64
 5   genres             2867 non-null   object 
 6   movie_id           2875 non-null   object 
 7   averagerating      2875 non-null   float64
 8   numvotes           2875 non-null   int64  
 9   title              2875 non-null   object 
 10  id                 2875 non-null   int64  
 11  release_date       2875 non-null   object 
 12  movie              2875 non-null   object 
 13  production_budget  2875 non-null   object 
 14  domestic_gross     2875 non-null   object 
 15  worldwide_gross    2875 non-null   object 
dtypes: float64(2), int64(3),

In [9]:
df_box_office.describe()

Unnamed: 0,start_year,runtime_minutes,averagerating,numvotes,id
count,2875.0,2757.0,2875.0,2875.0,2875.0
mean,2013.92,102.95,6.25,66280.38,50.94
std,2.55,20.79,1.19,134307.71,28.7
min,2010.0,3.0,1.6,5.0,1.0
25%,2012.0,90.0,5.6,141.0,27.0
50%,2014.0,101.0,6.4,7951.0,51.0
75%,2016.0,113.0,7.1,75081.0,76.0
max,2019.0,280.0,9.3,1841066.0,100.0
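
The `info` output shows missing values in `runtime_minutes` and `genres`. Two further quick checks that fit this step are counting missing values per column and listing repeated titles. The sketch below uses a toy frame named `toy` with made-up values; the same calls should work on `df_box_office`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_box_office; values are illustrative.
toy = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B", "Film C"],
    "runtime_minutes": [91.0, np.nan, 124.0, 114.0],
    "genres": ["Drama", "Comedy", None, "Action"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Every row whose title occurs more than once (keep=False marks all copies).
duplicate_titles = toy[toy.duplicated(subset="title", keep=False)]

print(missing_per_column)
print(duplicate_titles)
```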


## Step 2. Data Preparation
- Dropping irrelevant columns and rows
- Identifying duplicated columns
- Renaming columns
- Feature creation

Let's clean up our dataset by `dropping` irrelevant `columns` and `renaming` the remaining ones to represent the data better.

In [17]:
# Dropped columns: 'movie_id', 'original_title', 'start_year',
# 'runtime_minutes', 'title', 'id', 'movie'.
# .copy() gives an independent DataFrame, so later column assignments
# do not act on a slice of df_box_office (avoids SettingWithCopyWarning).
df = df_box_office[[
    'primary_title',  'genres', 'averagerating', 'numvotes', 
    'release_date', 'production_budget', 'worldwide_gross' ]].copy()

df = df.rename(columns={ 'primary_title':'film', 'averagerating':'rating', 'numvotes':'votes', 
                    'release_date':'release', 'production_budget':'budget',
                    'worldwide_gross':'revenue' })

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2875 entries, 0 to 2874
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   film     2875 non-null   object 
 1   genres   2867 non-null   object 
 2   rating   2875 non-null   float64
 3   votes    2875 non-null   int64  
 4   release  2875 non-null   object 
 5   budget   2875 non-null   object 
 6   revenue  2875 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 157.4+ KB


There is some important cleaning we need to do:
- Replace `NaN` values in `genres` with the string `'unknown'`. This way we keep every row even when the genre is missing.
- Convert `release` to pandas `datetime` so that we can run date-based operations on it for deeper investigation.
- Convert `budget` and `revenue` to `int` so that we can do arithmetic on them for better analysis.

In [11]:
# Assign the result back; an inplace fillna on a selected column may not update df.
df['genres'] = df['genres'].fillna('unknown')
df.genres

0       Action,Animation,Comedy
1                       unknown
2       Adventure,Drama,Romance
3        Adventure,Comedy,Drama
4            Action,Crime,Drama
                 ...           
2870                Documentary
2871               Action,Drama
2872                      Drama
2873                     Comedy
2874                Documentary
Name: genres, Length: 2875, dtype: object

In [12]:
# Parse strings such as "Dec 31, 2012" into datetime64 values.
df['release'] = pd.to_datetime(df['release'], format='%b %d, %Y')
df.release

0      2012-12-31
1      2015-06-19
2      2013-03-22
3      2013-12-25
4      2014-09-19
          ...    
2870   2009-04-22
2871   2015-12-18
2872   2001-12-25
2873   2017-08-04
2874   2010-11-12
Name: release, Length: 2875, dtype: datetime64[ns]

In [13]:
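# Strip the currency formatting, then cast to int64 explicitly: plain `int` can map
# to a 32-bit integer on some platforms, and the raw data contains worldwide grosses
# above 2**31 - 1 (over $2.7 billion).
df['budget'] = df['budget'].str.strip('$').str.replace(',', '').astype('int64')
df['revenue'] = df['revenue'].str.strip('$').str.replace(',', '').astype('int64')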
df

Unnamed: 0,film,genres,rating,votes,release,budget,revenue
0,Foodfight!,"Action,Animation,Comedy",1.90,8248,2012-12-31,45000000,73706
1,The Overnight,unknown,7.50,24,2015-06-19,200000,1165996
2,On the Road,"Adventure,Drama,Romance",6.10,37886,2013-03-22,25000000,9313302
3,The Secret Life of Walter Mitty,"Adventure,Comedy,Drama",7.30,275300,2013-12-25,91000000,187861183
4,A Walk Among the Tombstones,"Action,Crime,Drama",6.50,105116,2014-09-19,28000000,62108587
...,...,...,...,...,...,...,...
2870,Earth,Documentary,7.30,49,2009-04-22,47000000,116773317
2871,Sisters,"Action,Drama",4.70,14,2015-12-18,30000000,106030660
2872,Ali,Drama,7.70,79,2001-12-25,109000000,87683966
2873,Columbus,Comedy,5.80,5,2017-08-04,700000,1110511
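
With `budget` and `revenue` stored as integers, derived features become simple arithmetic. The sketch below shows two features that could be created at this point, using two rows from the table above. The column names `profit` and `roi` are assumptions for illustration, not necessarily the features used later in the analysis.

```python
import pandas as pd

# Two rows copied from the cleaned table above.
sample = pd.DataFrame({
    "film": ["Foodfight!", "The Overnight"],
    "budget": [45_000_000, 200_000],
    "revenue": [73_706, 1_165_996],
})

# Profit: worldwide revenue minus production budget.
sample["profit"] = sample["revenue"] - sample["budget"]

# Return on investment: profit as a fraction of the budget.
sample["roi"] = sample["profit"] / sample["budget"]

print(sample)
```

A ratio such as `roi` puts low-budget and high-budget films on a comparable scale, which absolute profit does not.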


