# Recommending Films for Box Office Success!

![image](https://vip-go.premiumbeat.com/wp-content/uploads/2022/02/vr_2.jpg)

*Image by DOP Eben Bolter on the LED volume stage at Rebellion Film Studios in Oxford, UK.*

## Overview

Our company sees competitors creating original video content and wants to join the trend. It is launching a new film studio, although it has no prior experience in the film industry.

As the company's data scientist, my goal is to explore which types of films currently perform best at the box office and to turn those findings into actionable insights that the new studio can use to decide what kinds of films to make.

## Challenge

We will present three recommendations to our company stakeholders, each of which directly affects business revenue. The goal is to give the new film studio data-driven direction, backed by evidence that the recommendations will benefit the company.

## Datasets

In the folder `zippedData`, we have datasets from:

- [Box Office Mojo](https://www.boxofficemojo.com/)
- [IMDb](https://www.imdb.com/)
- [The Numbers](https://www.the-numbers.com/)

## Solution

This project uses statistical analysis built around three hypotheses about what contributes to a film's success. We test each hypothesis against box-office data and use the resulting statistics as evidence for three business recommendations that set the direction for our new film studio.

## Results

# Loading Essentials

## Loading Tools

Import our data science tools.

In [None]:
import itertools
import numpy as np
import pandas as pd 
from numbers import Number
import sqlite3
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
pd.set_option('display.float_format', '{:,.2f}'.format)

## Loading Datasets

### IMDb, SQL Database

IMDb aims to collect data on all films and make it available to the general public.

In [None]:
zip_path = 'zippedData/im.db.zip'
extract_path = 'zippedData/'

with zipfile.ZipFile(zip_path,'r') as zip_ref:
    zip_ref.extractall(extract_path)

db_path = os.path.join(extract_path, 'im.db')

conn = sqlite3.connect(db_path)
pd.read_sql("""
    SELECT *
    FROM sqlite_master
    WHERE type = 'table';
""",conn)

In [None]:
df_imdb = pd.read_sql("""
    SELECT *
    FROM movie_basics AS mb
    INNER JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id;
""",conn)
df_imdb['title'] = df_imdb['primary_title']
df_imdb

### The Numbers, CSV

The Numbers aims to collect accurate budget and revenue figures for films.

In [None]:
df_tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
df_tn_movie_budgets['title'] = df_tn_movie_budgets['movie']
df_tn_movie_budgets

### Merge Data For Analysis

Reasons for combining the data:
- We `merge` the `IMDb` and `The Numbers` datasets because IMDb holds film details such as genres, ratings, and votes, while The Numbers has budget and revenue figures from around the world.
- Combining the two couples our `films`, `genres`, `ratings`, and `votes` data with `financial data`, which supports a deeper analysis.

In [None]:
df_box_office = pd.merge(df_imdb, df_tn_movie_budgets, on='title', how='inner')
df_box_office
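Because the merge key is the title alone, two different films that share a title would both match the same budget row. The sketch below uses made-up rows (the `Film A`/`Film B`/`Film C` names and all values are hypothetical, not taken from the datasets) to show one way to spot such repeated matches after an inner merge.

```python
import pandas as pd

# Hypothetical stand-ins for the IMDb and The Numbers tables.
imdb = pd.DataFrame({
    'title': ['Film A', 'Film A', 'Film B'],
    'start_year': [2015, 2016, 2009],
})
budgets = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'production_budget': [100.0, 200.0, 300.0],
})

# Inner merge keeps only titles present in both tables.
merged = pd.merge(imdb, budgets, on='title', how='inner')

# Titles that occur more than once after the merge: a single budget row
# was paired with several IMDb rows that share the same title.
dup_titles = merged.loc[merged.duplicated('title', keep=False), 'title'].unique()
print(len(merged), list(dup_titles))
```

If a check like this flags many titles in the real data, adding a second key such as the release year could be worth considering.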

# Exploratory Data Analysis

## Step 1. Understanding the Data
- Dataframe `shape`
- `head` and `tail`
- `info`
- `describe`

In [None]:
df_box_office.head()

In [None]:
df_box_office.shape

In [None]:
df_box_office.info()

In [None]:
df_box_office.describe()

## Step 2. Data Preparation
- Dropping irrelevant columns and rows
- Identifying duplicated columns
- Renaming columns
- Feature creation

Let's clean up our dataset by `dropping` irrelevant `columns` and `renaming` the remaining ones so they represent the data better.

In [None]:
# Keep only the columns needed for the analysis. .copy() gives an independent
# DataFrame, so later assignments do not act on a slice of df_box_office.
df = df_box_office[[
    # 'movie_id', 'original_title', 'start_year', 
    # 'runtime_minutes', 'movie_id', 'title', 'id', 'movie'
    'primary_title',  'genres', 'averagerating', 'numvotes', 
    'release_date', 'production_budget', 'worldwide_gross' ]].copy()

# Rename the columns to shorter, more descriptive names.
df = df.rename(columns={ 'primary_title':'film', 'averagerating':'rating', 'numvotes':'votes', 
                         'release_date':'release', 'production_budget':'budget',
                         'worldwide_gross':'revenue' })

df.info()

There is some important cleaning we need to do:
- Replace `NaN` values in `genres` with the string `'unknown'`. This way we keep every row even when the genre is missing.
- Convert `release` to pandas `datetime` so that we can run date-based operations on it for deeper investigation.
- Convert `budget` and `revenue` from currency strings into numeric (`float`) values so that we can do arithmetic on them.
- Weight `rating` by `votes`, because a rating based on only a few votes can misrepresent how a film was actually received.

In [None]:
df['genres'] = df['genres'].fillna('unknown')
df.genres

In [None]:
# Parse strings such as 'Dec 18, 2009' into datetime values.
df['release'] = pd.to_datetime(df['release'], format='%b %d, %Y')
df.release
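As a rough illustration of the date-based operations this conversion enables, the sketch below parses two made-up release strings in the same `'%b %d, %Y'` format and derives the year, the month, and a date comparison. The sample dates and the cutoff are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical release strings in the same format as the `release` column.
release = pd.Series(['Dec 18, 2009', 'May 20, 2011'])
parsed = pd.to_datetime(release, format='%b %d, %Y')

# Once parsed, the .dt accessor exposes date parts for grouping or filtering.
years = parsed.dt.year
months = parsed.dt.month

# Datetime values can also be compared directly against a cutoff date.
after_2010 = parsed > pd.Timestamp('2010-01-01')
print(years.tolist(), months.tolist(), after_2010.tolist())
```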

In [None]:
# Strip the dollar sign and thousands separators, then convert to float.
df['budget'] = df['budget'].str.strip('$').str.replace(',', '').astype(float)
df['revenue'] = df['revenue'].str.strip('$').str.replace(',', '').astype(float)
df.info()

In [None]:
df.head()

We create a new feature, `profit`, by subtracting each film's `budget` from its `revenue`. The reasons for this new feature are:
- The feature `profit` will immediately show us whether a movie resulted in a loss or profit.
- We can compare actual `revenue` earned between films for better analysis.
- Comparing `profit` against `budget` helps show whether `budget` plays a role in `revenue` and the film's success.

In [None]:
df['profit'] = df['revenue'] - df['budget']
df.tail()
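A minimal sketch of how the sign of `profit` separates gains from losses, using two made-up films. The `made_money` column is a hypothetical helper for this example and is not part of the analysis dataframe.

```python
import pandas as pd

# Hypothetical budget and revenue figures for two films.
toy = pd.DataFrame({
    'film': ['Film A', 'Film B'],
    'budget': [100.0, 50.0],
    'revenue': [250.0, 20.0],
})

# Profit is revenue minus budget; a negative value means a loss.
toy['profit'] = toy['revenue'] - toy['budget']
toy['made_money'] = toy['profit'] > 0
print(toy)
```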

The `rating` and `votes` columns are closely linked: a rating based on only a few votes is less reliable than one based on many. This makes it hard to compare films with very few votes against films that received far more.

For that reason, we create a new feature, `weighted_rating`, which adjusts each film's rating by the number of votes it received. Ratings backed by few votes are pulled toward the mean rating `C` of all films, while ratings backed by many votes (relative to the threshold `m`) stay close to their original value.

In [None]:
C = df['rating'].mean()  # mean rating across all films
m = df['votes'].quantile(0.6)  # vote threshold: the 60th percentile of vote counts
def weighted_rating(x, m=m, C=C):
    v = x['votes']
    R = x['rating']
    # Blend the film's own rating R with the overall mean C, weighted by vote count.
    return (v / (v + m) * R) + (m / (v + m) * C)

df['weighted_rating'] = df.apply(weighted_rating, axis=1)
df[['film', 'rating', 'votes', 'weighted_rating']]
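To see the effect of the weighting, the sketch below applies the same formula to two made-up films with an identical raw rating but very different vote counts. The values of `C` and `m` here are assumed for illustration and are not the ones computed from the data. The formula is written with column arithmetic rather than `apply`, which gives the same result.

```python
import pandas as pd

# Two hypothetical films with the same raw rating but different vote counts.
toy = pd.DataFrame({
    'film': ['few_votes', 'many_votes'],
    'rating': [9.0, 9.0],
    'votes': [10, 100_000],
})

C = 6.0    # assumed overall mean rating (illustrative only)
m = 1_000  # assumed vote threshold (illustrative only)

v, R = toy['votes'], toy['rating']
# Same blend as weighted_rating: the film's own rating weighted by v,
# the overall mean weighted by m.
toy['weighted_rating'] = v / (v + m) * R + m / (v + m) * C
print(toy)
```

The film with only 10 votes ends up close to the assumed mean of 6.0, while the film with 100,000 votes keeps almost all of its 9.0 rating.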

We write the weighted values back into the `rating` column so that we don't carry a redundant feature, and we drop `votes` because the rating is now weighted by it.

In [None]:
df['rating'] = df['weighted_rating']
df = df[['film', 'genres', 'rating', 'release', 'budget', 'revenue', 'profit', # 'votes'
   ]]
df