For part 4 of the project, you will be using your MySQL database from part 3 to answer meaningful questions for your stakeholder. *They want you to use your hypothesis testing and statistics knowledge to answer 3 questions about what makes a successful movie.*

Questions to Answer
The stakeholder's first question is: does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?

- They want you to perform a statistical test to get a mathematically-supported answer.
- They want you to report if you found a significant difference between ratings.
  - If so, what was the p-value of your analysis?
  - And which rating earns the most revenue?
- They want you to prepare a visualization that supports your finding.

It is then up to you to think of 2 additional hypotheses to test that your stakeholder may want to know.

Some example hypotheses you could test:

- Do movies that are over 2.5 hours long earn more revenue than movies that are 1.5 hours long (or less)?
- Do movies released in 2020 earn less revenue than movies released in 2018?
- How do the years compare for movie ratings?
- Do some movie genres earn more revenue than others?
- Are some genres higher rated than others?
- etc.

Specifications
Your Data
A critical first step for this assignment will be to retrieve additional movie data to add to your SQL database.
You will want to use the TMDB API again and extract data for additional years.
You may want to review the optional lesson from Week 1 on "Using Glob to Load Many Files" to load and combine all of your API results for each year.
However, trying to extract the TMDB data for all movies from 2000-2022 could take >24 hours!
To address this issue, you should EITHER:
Define a smaller (but logical) period of time to use for your analyses (e.g., the last 10 years, or 2010-2019 for a pre-pandemic window).
OR coordinate with cohort-mates and divide the API calls so that you can all download the data for a smaller number of years and then share your downloaded JSON data.
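
Combining the per-year API results can be sketched with `glob`, as follows. This is only a minimal illustration: the file-name pattern `tmdb_api_results_<year>.json`, the temporary folder, and the two toy records are all hypothetical stand-ins for whatever your own download step produced.

```python
import glob
import json
import os
import tempfile

import pandas as pd

# Hypothetical setup: write one small JSON file per year into a temp folder
# so the example runs on its own. Real files would come from the TMDB API.
data_dir = tempfile.mkdtemp()
toy_results = {
    2018: [{"imdb_id": "tt0000001", "revenue": 100.0}],
    2019: [{"imdb_id": "tt0000002", "revenue": 250.0}],
}
for year, rows in toy_results.items():
    path = os.path.join(data_dir, f"tmdb_api_results_{year}.json")
    with open(path, "w") as f:
        json.dump(rows, f)

# Find every yearly file that matches the (assumed) naming pattern
files = sorted(glob.glob(os.path.join(data_dir, "tmdb_api_results_*.json")))

# Load each file into a DataFrame, then stack them into one table
frames = []
for path in files:
    with open(path) as f:
        frames.append(pd.DataFrame(json.load(f)))
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

The same pattern should work for JSON files shared by cohort-mates, provided everyone saves their results with the same structure.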


Deliverables
You should use the same project repository you have been using for Parts 1-3 (for your portfolio).
Create a new notebook in your project repository just for the hypothesis testing (like "Part 4 - Hypothesis Testing.ipynb")
Make sure the results and visualization for all 3 hypotheses are in your notebook.
Please submit the link to your GitHub repository for this assignment.

# Imports

In [6]:
import pymysql
import json
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
from sqlalchemy.types import *
from sqlalchemy_utils import database_exists, create_database
from urllib.parse import quote_plus
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import statsmodels.stats.multicomp as mc

# MySQL Connection

In [7]:
with open('/Users/jasontracey/.secret/mysql.json') as f: #change the path to match YOUR path!!
    login = json.load(f)
login.keys()

dict_keys(['username', 'password'])

In [8]:
# Create mySQL connection
username = "root"
password = quote_plus(login['password'])
db_name = "Movie Profitability"
connection = f"mysql+pymysql://{username}:{password}@localhost/{db_name}"

In [9]:
# create engine
engine = create_engine(connection)

In [10]:
# create the database if it doesn't already exist
if not database_exists(connection):
    create_database(connection)
    print('Database created.')
else:
    print("The database already exists.")

The database already exists.


# Check Tables in Database

In [13]:
q = """SHOW TABLES;
"""

pd.read_sql_query(q, engine)

Unnamed: 0,Tables_in_movie profitability
0,genres
1,title_basics
2,title_genres
3,title_ratings
4,tmdb_data


# Create Function

In [14]:
# create helper function to check and remove outliers
# argument 'dictionary' is a dictionary with the groups as keys
# and series of data as values
def check_and_remove_outliers(dictionary):
    
    # iterate over keys (groups) in dictionary
    for key in dictionary.keys():
        
        # check original number of observations
        original_obs = len(dictionary[key])
        
        # check number of outliers
        is_outlier = np.abs(stats.zscore(dictionary[key])) > 3
        number_of_outliers = np.sum(is_outlier)
        
        # remove outliers
        dictionary[key] = dictionary[key][(np.abs(stats.zscore(dictionary[key])) <= 3)]
        
        # print summary
        print(f"Outliers ({number_of_outliers}) removed from group {key};\n",
        f"Number of current observations {len(dictionary[key])} should be {original_obs - number_of_outliers}.")
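
The z-score rule used by the helper can be illustrated on its own. The sketch below uses synthetic numbers (not the movie data) and a hypothetical series named `values`; it flags any observation more than 3 standard deviations from the mean of its group and drops it.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: 50 ordinary values plus one extreme value (an assumption
# made purely for illustration)
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(100, 10, size=50), 1000.0))

# Absolute z-score of every observation within this single group
z = np.abs(stats.zscore(values))

# Keep only observations within 3 standard deviations of the mean
kept = values[z <= 3]

print(f"Removed {len(values) - len(kept)} outlier(s); {len(kept)} observations remain.")
```

Because the z-scores are computed within each group, the same revenue could be an outlier in one rating group and not in another.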

# 2. Does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?

## Assumptions

- Even though the question asks about G, PG, PG-13, and R ratings, I will include all movie ratings.

- I will not include movies with a revenue of 0.

- The alpha value is 0.05.
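
The imports at the top of the notebook (`scipy.stats` and `pairwise_tukeyhsd`) suggest a one-way ANOVA followed by Tukey's HSD for comparing revenue across ratings. The sketch below shows that workflow on synthetic data only; the `toy` DataFrame and its numbers are invented for illustration, and whether ANOVA is appropriate for the real data depends on checking its assumptions (a nonparametric test such as `stats.kruskal` may be the better choice if they fail).

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic revenue for three rating groups (illustrative values only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "certification": np.repeat(["G", "PG", "R"], 30),
    "revenue": np.concatenate([
        rng.normal(50, 10, 30),
        rng.normal(80, 10, 30),
        rng.normal(55, 10, 30),
    ]),
})

# Build a dictionary with the groups as keys and series of data as values
groups = {name: grp["revenue"] for name, grp in toy.groupby("certification")}

# One-way ANOVA: the null hypothesis is that all group means are equal
alpha = 0.05
f_stat, p_value = stats.f_oneway(*groups.values())
reject_null = p_value < alpha
print(f"p-value: {p_value:.4g}; reject null hypothesis: {reject_null}")

# Post-hoc pairwise comparisons to see which groups differ
tukey = pairwise_tukeyhsd(toy["revenue"], toy["certification"], alpha=alpha)
print(tukey.summary())
```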

In [21]:
q = """SELECT revenue, certification
FROM tmdb_data
WHERE certification IS NOT NULL
AND revenue <> 0;
"""

# save results to dataframe
df = pd.read_sql(q, engine)

df.head()

Unnamed: 0,revenue,certification
0,76019000.0,PG-13
1,5271670.0,PG-13
2,14204600.0,PG
3,5227350.0,R
4,14904.0,R


In [22]:
# check the distinct values in the 'certification' column
df['certification'].value_counts()

R         2517
PG-13     1820
PG         694
NR         344
G          131
NC-17       20
PG-13        1
Name: certification, dtype: int64
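
`PG-13` appears twice in the counts above, which usually means one entry differs by invisible characters. Assuming the cause is stray leading or trailing whitespace (the raw value is not shown here, so this is a guess), a cleanup could look like the following sketch on a toy DataFrame:

```python
import pandas as pd

# Toy data reproducing the suspected problem: one 'PG-13' has a trailing space
toy = pd.DataFrame({"certification": ["R", "PG-13", "PG-13 ", "PG", "G"]})

# Strip leading/trailing whitespace so both spellings collapse into one label
toy["certification"] = toy["certification"].str.strip()

counts = toy["certification"].value_counts()
print(counts)
```

Applying the same `.str.strip()` call to `df['certification']` and re-running `value_counts()` would confirm whether whitespace was the cause.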