# Recommending Films for Box Office Success!

![image](https://vip-go.premiumbeat.com/wp-content/uploads/2022/02/vr_2.jpg)

*Image by DOP Eben Bolter on the LED volume stage at Rebellion Film Studios in Oxford, UK.*

## Overview

Our company sees competitors creating original video content and now wants to join the market trend. We create a new film studio despite the lack of knowledge in everything related to movies.

As the data scientist of the company, my goal is to explore what types of films are currently performing the best at the box office and with my findings – create actionable insights that the company's new studio can use to decide what type of films to create.

## Challenge

With the task at hand, we will be presenting to our company stakeholders three different recommendations that will directly affect our business revenue. The goal is to provide data-driven direction for our new film studio using evidence that our recommendations will result in truly benefitting the company.

## Datasets

In the folder `zippedData`, we have datasets from:

- [Box Office Mojo](https://www.boxofficemojo.com/)
- [IMDB](https://www.imdb.com/)
- [The Numbers](https://www.the-numbers.com/)

## Solution

This project uses statistical analysis, including formulating three hypotheses of what contributes to a film's success. We infer three different business recommendations and use box-office data to prove our hypotheses and provide statistics as evidence to support our recommendations for the company and the direction for our new film studio.

## Results

# Data Science Analysis

## Loading Tools

Import our datascience tools.

In [14]:
import itertools
import numpy as np
import pandas as pd 
from numbers import Number
import sqlite3
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)
pd.set_option('display.float_format','{:,2f}'.format)

## Loading Datasets

### IMdb, SQL Database

IMdb aims to collect any and all data of all films to showcase everyday people.

In [2]:
zip_path = 'zippedData/im.db.zip'
extract_path = 'zippedData/'

with zipfile.ZipFile(zip_path,'r') as zip_ref:
    zip_ref.extractall(extract_path)

db_path = os.path.join(extract_path, 'im.db')

conn = sqlite3.connect(db_path)
pd.read_sql("""
    SELECT *
    FROM sqlite_master
    WHERE type = 'table';
""",conn)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."


In [3]:
df_imdb = pd.read_sql("""
    SELECT *
    FROM movie_basics AS mb
    INNER JOIN movie_ratings AS mr ON mb.movie_id =mr.movie_id;
""",conn)
df_imdb['title'] = df_imdb['primary_title']
df_imdb

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,movie_id.1,averagerating,numvotes,title
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",tt0063540,7.0,77,Sunghursh
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",tt0066787,7.2,43,One Day Before the Rainy Season
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,tt0069049,6.9,4517,The Other Side of the Wind
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",tt0069204,6.1,13,Sabse Bada Sukh
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",tt0100275,6.5,119,The Wandering Soap Opera
...,...,...,...,...,...,...,...,...,...,...
73851,tt9913084,Diabolik sono io,Diabolik sono io,2019,75.0,Documentary,tt9913084,6.2,6,Diabolik sono io
73852,tt9914286,Sokagin Çocuklari,Sokagin Çocuklari,2019,98.0,"Drama,Family",tt9914286,8.7,136,Sokagin Çocuklari
73853,tt9914642,Albatross,Albatross,2017,,Documentary,tt9914642,8.5,8,Albatross
73854,tt9914942,La vida sense la Sara Amat,La vida sense la Sara Amat,2019,,,tt9914942,6.6,5,La vida sense la Sara Amat


### TheNumbers, CSV

TheNumbers goal is to collect the most accurate details of any film's budgets and revenues.

In [4]:
df_tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
df_tn_movie_budgets['title'] = df_tn_movie_budgets['movie']
df_tn_movie_budgets

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,title
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279",Avatar
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",Pirates of the Caribbean: On Stranger Tides
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",Dark Phoenix
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",Avengers: Age of Ultron
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",Star Wars Ep. VIII: The Last Jedi
...,...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0,Red 11
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495",Following
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338",Return to the Land of Wonders
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0,A Plague So Pleasant


### Box Office Mojo, CSV

Box Office Mojo's goal is to collect ONLY the details of films that reach box-office success.

In [5]:
df_bom_movie_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
df_bom_movie_gross

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


### Merge Data For Analysis

Reasons for Combined Data
- It was neccessary we `merge` our data from `IMdb` and `TheNumbers` because IMdb is a popular platform that houses all details of films, including genres, ratings, and votes and TheNumbers has reliable information regarding film budgets and revenues across the world.
- By combining these two datasets, we couple our data of `films`, `genres`, `ratings`, and `votes` with `financial data` to help us investigate deeper in our analysis.

In [6]:
df = pd.merge(df_imdb, df_tn_movie_budgets, on='title', how='inner')
df = df[[
    # 'movie_id', 'original_title', 'start_year', 
    # 'runtime_minutes', 'movie_id', 'title', 'id', 'movie'
    'primary_title',  'genres', 'averagerating', 'numvotes', 
    'release_date', 'production_budget', 'worldwide_gross']]


df.rename(columns={'primary_title':'film', 'averagerating':'rating', 'numvotes':'votes', 
                              'release_date':'release', 'production_budget':'budget',
                              'worldwide_gross':'revenue'}, inplace=True)

df

Unnamed: 0,film,genres,rating,votes,release,budget,revenue
0,Foodfight!,"Action,Animation,Comedy",1.9,8248,"Dec 31, 2012","$45,000,000","$73,706"
1,The Overnight,,7.5,24,"Jun 19, 2015","$200,000","$1,165,996"
2,On the Road,"Adventure,Drama,Romance",6.1,37886,"Mar 22, 2013","$25,000,000","$9,313,302"
3,The Secret Life of Walter Mitty,"Adventure,Comedy,Drama",7.3,275300,"Dec 25, 2013","$91,000,000","$187,861,183"
4,A Walk Among the Tombstones,"Action,Crime,Drama",6.5,105116,"Sep 19, 2014","$28,000,000","$62,108,587"
...,...,...,...,...,...,...,...
2870,Earth,Documentary,7.3,49,"Apr 22, 2009","$47,000,000","$116,773,317"
2871,Sisters,"Action,Drama",4.7,14,"Dec 18, 2015","$30,000,000","$106,030,660"
2872,Ali,Drama,7.7,79,"Dec 25, 2001","$109,000,000","$87,683,966"
2873,Columbus,Comedy,5.8,5,"Aug 4, 2017","$700,000","$1,110,511"


## Exploratory Data Analysis

### Step 1. Data Understanding
- Dataframe `shape`
- `head` and `tail`
- `info`
- `describe`

In [9]:
df.head()

Unnamed: 0,film,genres,rating,votes,release,budget,revenue
0,Foodfight!,"Action,Animation,Comedy",1.9,8248,"Dec 31, 2012","$45,000,000","$73,706"
1,The Overnight,,7.5,24,"Jun 19, 2015","$200,000","$1,165,996"
2,On the Road,"Adventure,Drama,Romance",6.1,37886,"Mar 22, 2013","$25,000,000","$9,313,302"
3,The Secret Life of Walter Mitty,"Adventure,Comedy,Drama",7.3,275300,"Dec 25, 2013","$91,000,000","$187,861,183"
4,A Walk Among the Tombstones,"Action,Crime,Drama",6.5,105116,"Sep 19, 2014","$28,000,000","$62,108,587"


In [11]:
df.shape

(2875, 7)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2875 entries, 0 to 2874
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   film     2875 non-null   object 
 1   genres   2867 non-null   object 
 2   rating   2875 non-null   float64
 3   votes    2875 non-null   int64  
 4   release  2875 non-null   object 
 5   budget   2875 non-null   object 
 6   revenue  2875 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 157.4+ KB


In [16]:
df.describe()

ValueError: Invalid format specifier

ValueError: Invalid format specifier