# Stakeholder Questions
1) Does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
2) Do movies that are over 2.5 hours long earn more revenue than movies that are 1.5 hours long (or less)?
3) Are some genres higher rated than others?


- For each question, the stakeholders would like me to:
    - perform a statistical test to get a mathematically supported answer.
    - report whether there is a significant difference between the groups.
    - report the p-value of the analysis.
    - identify which group comes out on top (for question 1, which rating earns the most revenue).
    - prepare a visualization that supports the finding.



In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
price_fmt = StrMethodFormatter("${x:,.0f}")
import seaborn as sns
import json

import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy.types import *
from sqlalchemy_utils import create_database, database_exists
from sqlalchemy import create_engine

from scipy import stats
## Post Hoc
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# SQL

In [2]:
# Obtain the MySQL login username and password
with open('/Users/patelmedzy/.secret/mysql.json') as f:
    login = json.load(f)
# Display the MySQL login keys
login.keys()

dict_keys(['username', 'password'])

In [3]:
# Create a connection string using credentials following this format:
# connection = "dialect+driver://username:password@host:port/database"
database_name = "Movies"
connection_str = f"mysql+pymysql://{login['username']}:{login['password']}@localhost/{database_name}"
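
If the password contains characters such as `@` or `/`, the connection string above can be parsed incorrectly. A minimal sketch of one way to guard against that, assuming hypothetical credentials (`example_login` is made up for illustration and is not the real login file), is to percent-encode the password with `urllib.parse.quote_plus` before building the URL:

```python
from urllib.parse import quote_plus

# Hypothetical credentials for illustration only (not the real login file)
example_login = {"username": "root", "password": "p@ss/word"}
database_name = "Movies"

# Percent-encode the password so '@' and '/' are not read as URL delimiters
safe_password = quote_plus(example_login["password"])
connection_str = (
    f"mysql+pymysql://{example_login['username']}:{safe_password}"
    f"@localhost/{database_name}"
)
print(connection_str)
```

The encoded string can then be passed to `create_engine` in the same way as the unencoded one.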

In [4]:
# Create an instance of the sqlalchemy Engine Class using create_engine
engine = create_engine(connection_str)

# Questions 

## Function to interpret p-value

In [5]:
# Create a function to evaluate the p-value of a hypothesis test
def interpret_pvalue(pvalue, ho_desc, ha_desc, alpha=.05):
    # pvalue: p-value from the test; ho_desc / ha_desc: the hypotheses in plain language
    ho = 'No statistically significant difference exists. The null hypothesis was not rejected.'
    ha = 'A statistically significant difference exists. The null hypothesis is rejected and the alternative is supported that..'
    if pvalue < alpha:
        print(f'The p-value for the test was {pvalue}')
        print(f'It was < the alpha value of {alpha}, so')
        print(ha)
        print(ha_desc)
    else:
        print(f'The p-value for the test was {pvalue}')
        print(f'It was >= the alpha value of {alpha}, so')
        print(ho)
        print(ho_desc)

## Does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?


### Stating Hypothesis
- $H_0$ (Null Hypothesis): The MPAA rating of a movie does not affect how much revenue the movie generates.
- $H_A$ (Alternate Hypothesis): The MPAA rating of a movie does affect how much revenue the movie generates.

In [6]:
# Import Data
# Use an SQL query to create a dataframe
q = """
SELECT certification, revenue
FROM tmdb_data
WHERE revenue > 0 AND certification IS NOT NULL
;"""
df = pd.read_sql(q, engine)

In [7]:
df.head()

Unnamed: 0,certification,revenue
0,PG-13,76019000.0
1,PG-13,5271670.0
2,PG,14204600.0
3,R,14904.0
4,G,224835000.0


In [8]:
# Display the unique values and their counts for this column
df['certification'].value_counts()

R        171
PG-13    123
PG        31
G         15
NR        12
Name: certification, dtype: int64

## Selecting the correct test to perform
- Target Datatype: 
    - Numeric (revenue)
- Number of Samples: 
    - more than 2 groups (5 certification groups)
- Test to perform:
    - parametric: ANOVA and/or Tukey
    - non-parametric: Kruskal-Wallis and/or Tukey

### Assumptions
- No significant outliers
- Normality 
- Equal Variance
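
The equal-variance assumption listed above is commonly checked with Levene's test. The following is a minimal sketch on synthetic data (the two samples are assumptions for illustration and are not drawn from `tmdb_data`); with the real data, the revenue groups would be passed in instead.

```python
import numpy as np
from scipy import stats

# Synthetic revenue-like samples (assumed data, not from tmdb_data)
rng = np.random.default_rng(42)
narrow = rng.normal(loc=100, scale=5, size=200)
wide = rng.normal(loc=100, scale=50, size=200)

# Levene's test: H0 is that all groups have equal variance
stat, p = stats.levene(narrow, wide)
equal_variance = p >= 0.05
print(f"Levene p-value: {p:.3g}, equal variance assumed: {equal_variance}")
```

A p-value below 0.05 suggests the variances differ, which is one more reason to prefer a non-parametric test.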


#### Testing Assumptions: No Significant Outliers

In [9]:
# Split revenue into one group per certification
groups = {}
for certification in df['certification'].unique():
    cert_df = df.loc[df['certification'] == certification, 'revenue']
    groups[certification] = cert_df

# Loop through the groups to count the outliers (|z-score| > 3),
# display the result and then remove them
for certification, revenue in groups.items():
    # Calculate the number of outliers
    outliers = np.abs(stats.zscore(revenue)) > 3
    # Display the number of outliers in the group
    print(f"{outliers.sum()} outliers were removed from the {certification} group.")
    # Remove the outliers
    groups[certification] = revenue.loc[~outliers]

2 outliers were removed from the PG-13 group.
1 outliers were removed from the PG group.
4 outliers were removed from the R group.
1 outliers were removed from the G group.
0 outliers were removed from the NR group.


- With the outliers removed, the assumption of no significant outliers is met.
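
To make the z-score rule concrete, here is a small sketch on toy numbers (the values are assumed for illustration only). It flags any value more than 3 standard deviations from its group mean, which is the same rule used in the loop above.

```python
import numpy as np
from scipy import stats

# Toy data: 19 typical values and one extreme value (assumed, for illustration)
revenue = np.array([10.0] * 19 + [1000.0])

# Flag values more than 3 standard deviations from the group mean
z = stats.zscore(revenue)
outliers = np.abs(z) > 3
cleaned = revenue[~outliers]
print(f"{outliers.sum()} outlier(s) flagged; {len(cleaned)} values kept")
```

One caveat: `stats.zscore` uses the population standard deviation by default, and under that definition a single value in a group of n observations can reach at most |z| = sqrt(n - 1). A group of 10 or fewer values can therefore never have a point flagged by the `> 3` rule, so a count of zero outliers in a very small group is not strong evidence that none exist.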

#### Testing Assumptions: Normality

In [10]:
# Loop through groups to obtain group count and p-value for Normality test
results = {}
for certification, revenue in groups.items():
    stat, p = stats.normaltest(revenue)
    results[certification] = {'n':len(revenue), 'p':p}
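
As a point of reference for reading these p-values, the sketch below runs the same `stats.normaltest` call on a synthetic, strongly right-skewed sample (assumed data, loosely mimicking how revenue tends to be distributed). A very small p-value means the hypothesis of normality is rejected.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample, loosely mimicking revenue (assumed data)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# D'Agostino-Pearson test: H0 is that the sample comes from a normal distribution
stat, p = stats.normaltest(skewed)
print(f"n = {len(skewed)}, p = {p:.3g}, reject normality: {p < 0.05}")
```

SciPy's documentation notes that the kurtosis component of this test is intended for samples of roughly 20 or more observations, so results for small groups should be read with some caution.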



In [11]:
results = pd.DataFrame(results)
results.head()

Unnamed: 0,PG-13,PG,R,G,NR
n,121.0,30.0,167.0,14.0,12.0
p,1.342146e-09,7.7e-05,3.987101e-15,0.328928,0.24408


In [12]:
# transposing the results dataframe
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
results = results.T
results.head()

Unnamed: 0,n,p
PG-13,121.0,1.342146e-09
PG,30.0,7.731518e-05
R,167.0,3.987101e-15
G,14.0,0.3289276
NR,12.0,0.2440801


In [13]:
results['sig?'] = results['p'] < .05

In [14]:
results

Unnamed: 0,n,p,sig?
PG-13,121.0,1.342146e-09,True
PG,30.0,7.731518e-05,True
R,167.0,3.987101e-15,True
G,14.0,0.3289276,False
NR,12.0,0.2440801,False


- The p-values for the PG-13, PG and R groups are less than 0.05, so revenue in those groups is not normally distributed. The p-values for G and NR are greater than 0.05, so normality cannot be rejected for them.
- Not every group is large enough to ignore the assumption of normality (G has 14 movies and NR has 12). Hence, the assumption of normality is not met.
- I will perform the non-parametric equivalent of the test: Kruskal-Wallis and/or Tukey

#### Non-Parametric test: Kruskal-Wallis and/or Tukey

In [None]:
result = stats.kruskal(*groups.values())  # Kruskal-Wallis on the outlier-free groups
result
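
Kruskal-Wallis only says whether at least one group differs, not which one. Tukey's HSD assumes normally distributed data, so a rank-based follow-up may be a better fit here. The sketch below is one possible approach and not part of the original analysis: it swaps in pairwise Mann-Whitney U tests with a Bonferroni correction, run on synthetic groups (the group names `A`, `B`, `C` and their values are assumptions for illustration).

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Synthetic groups (assumed data): "C" is shifted well above "A" and "B"
rng = np.random.default_rng(1)
groups = {
    "A": rng.exponential(scale=1.0, size=100),
    "B": rng.exponential(scale=1.0, size=100),
    "C": rng.exponential(scale=1.0, size=100) + 10.0,
}

# Omnibus test: H0 is that all groups come from the same distribution
h_stat, p_kw = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis p = {p_kw:.3g}")

# Pairwise follow-up with a Bonferroni correction for the number of comparisons
pairs = list(combinations(groups, 2))
adjusted = {}
for a, b in pairs:
    _, p_pair = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    adjusted[(a, b)] = min(p_pair * len(pairs), 1.0)
    print(f"{a} vs {b}: adjusted p = {adjusted[(a, b)]:.3g}")
```

With the real data, the dictionary of outlier-free revenue Series keyed by certification could be passed through the same two steps. Bonferroni is conservative, so with five certifications (ten pairs) some real differences may go undetected.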