# Module 2 Code Challenge

Welcome to your Mod 2 Code Challenge. You will be tested on your understanding of concepts and your ability to solve problems that have been covered in class and in the curriculum.

Use any libraries you want to solve the problems in the code challenge.

_Read the instructions carefully_. You will be asked both to write code and respond to a few short answer questions.

**Note on the short answer questions**: For the short answer questions, please use your own words. The expectation is that you have not copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily assessed on grammatical correctness or sentence structure, you should do your best to communicate clearly.

The sections of the code challenge are:
- Statistical Distributions
- Statistical Tests
- Bayesian Statistics
- Linear Regression and Extensions

In [None]:
# import the necessary libraries
import itertools
import numpy as np
import pandas as pd 
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pickle

from sklearn.metrics import mean_squared_error, roc_curve, roc_auc_score, accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
# __SOLUTION__ 
# import the necessary libraries
import itertools
import numpy as np
import pandas as pd 
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pickle

from sklearn.metrics import mean_squared_error, roc_curve, roc_auc_score, accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.formula.api import ols

---
## Part 1: Statistical Distributions [Suggested time: 25 minutes]
---

### a. Normal Distributions

Say we have check totals for all checks ever written at a TexMex restaurant. 

The distribution for this population of check totals happens to be normally distributed with a population mean of $\mu = 20$ and population standard deviation of $\sigma = 2$. 

1.a.1) Write a function to compute the z-score for a single check of amount `check_amt`.

In [None]:
def z_score(check_amt):
    """
    check_amt = the amount for which we want to compute the z-score
    """
    pass

In [None]:
# __SOLUTION__ 
def z_score(check_amt):
    """Compute the z-score using the formula (X - mu) / sigma, with mu = 20 and sigma = 2."""
    return (check_amt - 20) / 2

1.a.2) I go to the TexMex restaurant and get a check for 24 dollars. 

Use your function to compute the check's z-score, and interpret the result using the empirical rule. 

In [None]:
# your code here 

In [None]:
# your answer here

In [None]:
# __SOLUTION__
print("My check has a z-score of {}.".format(z_score(24)))

In [None]:
# __SOLUTION__
# The z-score of your check tells you how many standard deviations your check amount is away from the mean
# check amount. In this case, your check is two standard deviations away from the mean.

# According to the empirical rule, about 95% of check amounts fall within two standard deviations of the mean.
# Your check amount is just within this region.
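
The empirical rule quoted above can be checked numerically. The following is a minimal sketch, assuming only `scipy.stats` and a standard normal distribution; the variable names are illustrative and not part of the challenge.

```python
from scipy import stats

# Probability mass of a standard normal within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    within[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} sd: {within[k]:.4f}")

# Share of checks expected to fall at or below a z-score of 2 (the 24 dollar check)
share_below = stats.norm.cdf(2)
print(f"share of checks at or below z = 2: {share_below:.4f}")
```

The two-standard-deviation band covers slightly more than 95% of the distribution, which is why the rule is usually stated as "about 95%".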

1.a.3) Using $\alpha = 0.05$, is my 25 dollar check significantly **greater** than the mean? How do you know this?  

Hint: Here's a link to a [z-table](https://www.math.arizona.edu/~rsims/ma464/standardnormaltable.pdf). 

In [None]:
# your code here 

In [None]:
# your answer here 

In [None]:
# __SOLUTION__
print("My check has a z-score of {}.".format(z_score(25)))
print("The critical threshold z is {}.".format(round(stats.norm.ppf(0.95),2)))

In [None]:
# __SOLUTION__
# For alpha = 0.05, the critical threshold z for an upper-tailed z-test is about 1.64 (found using the linked z-table or scipy.stats).
# We obtain a z-score of 2.5, which is greater than the critical threshold of 1.64. 
# Thus, my 25 dollar check is significantly greater than the mean at alpha = 0.05. 
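
The same decision can be reached with a p-value instead of a critical value. This is a sketch under the population parameters given earlier ($\mu = 20$, $\sigma = 2$); it uses `stats.norm.sf` for the upper-tail area and is one possible way to do it, not the required one.

```python
from scipy import stats

mu, sigma, alpha = 20, 2, 0.05   # population parameters and significance level from the prompt
check = 25

z = (check - mu) / sigma             # z-score of the 25 dollar check
z_crit = stats.norm.ppf(1 - alpha)   # upper-tail critical value, about 1.645
p_value = stats.norm.sf(z)           # P(Z > z), the upper-tail area

reject = p_value < alpha
print(f"z = {z}, critical z = {z_crit:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Comparing `z` with `z_crit` and comparing `p_value` with `alpha` always lead to the same conclusion for a one-tailed z-test.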

### b. Confidence Intervals and the Central Limit Theorem

1.b.1) Determine the 95% confidence interval around the mean check total for this population. Interpret your result. 

In [None]:
# your code here 

In [None]:
# __SOLUTION__ 
# 95% confidence interval has z-score of 1.96 (read where p = 0.975)
mean = 20
std = 2
conf = (mean - 1.96*std, mean + 1.96*std)
print("The 95% confidence interval is ", conf)

In [None]:
# your written answer here

In [None]:
# __SOLUTION__
"""
A confidence interval is an interval that contains the true population mean with a certain probability,
i.e. for a 95% confidence interval, there is a 95% chance that the interval contains the true population mean.

Frequentist interpretation of the confidence interval:  
Were this procedure to be repeated on 100 samples, approximately 95 of the calculated confidence intervals 
(which would be different for each sample) would be expected to contain the true population mean.

INCORRECT: a confidence interval contains 95% of all values.
"""

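The frequentist interpretation in the solution above can be illustrated with a small simulation. This is a sketch only: the sample size `n = 30` and the number of repetitions are hypothetical choices, and the interval uses the standard error $\sigma / \sqrt{n}$ because it is built around a sample mean with known $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 20, 2          # population parameters from Part 1
n, n_trials = 30, 10_000   # hypothetical sample size and number of repeated samples

# Draw n_trials independent samples of size n and compute each sample mean
sample_means = rng.normal(mu, sigma, size=(n_trials, n)).mean(axis=1)

# 95% interval for the mean with known sigma: sample mean +/- 1.96 * sigma / sqrt(n)
half_width = 1.96 * sigma / np.sqrt(n)
covered = (sample_means - half_width <= mu) & (mu <= sample_means + half_width)

coverage = covered.mean()
print(f"Share of intervals containing mu: {coverage:.3f}")
```
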
1.b.2) Imagine that we didn't know how the population of check totals was distributed. How would **sampling** and the **Central Limit Theorem** allow us to **make inferences about the population mean**, i.e. estimate the population parameters $\mu$ and $\sigma$?

In [None]:
# Your written answer here

In [None]:
# __SOLUTION__
"""
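Solution: We can take repeated random samples from the population and compute the mean of each sample.
The Central Limit Theorem says that, for large enough samples, these sample means are approximately normally
distributed around the population mean, with a standard deviation (the standard error) of sigma / sqrt(n),
regardless of the shape of the population distribution. The average of the sample means therefore estimates mu,
and the sample standard deviation estimates sigma.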
"""

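A quick simulation makes the Central Limit Theorem concrete. The sketch below assumes a deliberately skewed exponential population (a hypothetical stand-in, not the restaurant data) and compares the spread of the sample means with $\sigma / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_samples = 50, 5_000   # hypothetical sample size and number of samples

# A skewed, non-normal population: exponential with mean 20 (and standard deviation 20)
pop_mean, pop_sd = 20.0, 20.0
samples = rng.exponential(scale=pop_mean, size=(n_samples, n))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
sd_of_means = sample_means.std(ddof=1)
print(f"mean of sample means: {mean_of_means:.2f} (population mean {pop_mean})")
print(f"sd of sample means:   {sd_of_means:.2f} (sigma / sqrt(n) = {pop_sd / np.sqrt(n):.2f})")
```
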
---
## Part 2: Statistical Testing [Suggested time: 15 minutes]
---

The TexMex restaurant recently introduced Queso to its menu.

We have random samples of 1000 "No Queso" order check totals and 1000 "Queso" order check totals for orders made by different customers.

In the cell below, we load the sample data for you into the arrays `no_queso` and `queso` for the "no queso" and "queso" order check totals. Then, we create histograms of the distribution of the check amounts for the "no queso" and "queso" samples. 

In [None]:
# Load the sample data 
no_queso = pickle.load(open("data/no_queso.pkl", "rb"))
queso = pickle.load(open("data/queso.pkl", "rb"))

# Plot histograms

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.set_title('Sample of Non-Queso Check Totals')
ax1.set_xlabel('Amount')
ax1.set_ylabel('Frequency')
ax1.hist(no_queso, bins=20)

ax2.set_title('Sample of Queso Check Totals')
ax2.set_xlabel('Amount')
ax2.set_ylabel('Frequency')
ax2.hist(queso, bins=20)
plt.show()

In [None]:
# __SOLUTION__ 
# Load the sample data 
no_queso = pickle.load(open("data/no_queso.pkl", "rb"))
queso = pickle.load(open("data/queso.pkl", "rb"))

# Plot histograms

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.set_title('Sample of Non-Queso Check Totals')
ax1.set_xlabel('Amount')
ax1.set_ylabel('Frequency')
ax1.hist(no_queso, bins=20)

ax2.set_title('Sample of Queso Check Totals')
ax2.set_xlabel('Amount')
ax2.set_ylabel('Frequency')
ax2.hist(queso, bins=20)
plt.show()

### a. Hypotheses and Errors

The restaurant owners want to know if customers who order Queso spend **more or less** than customers who do not order Queso.

2.a.1) Set up the null $H_{0}$ and alternative hypotheses $H_{A}$ for this test.

In [None]:
# Your written answer here

In [None]:
# __SOLUTION__

"""
Null hypothesis: Customers who order queso spend the same as those who do not order queso. 

Alternative hypothesis: Customers who order queso do not spend the same as those who do not order queso. 
"""

2.a.2) What does it mean to make `Type I` and `Type II` errors in this specific context?

In [None]:
# your answer here

In [None]:
# __SOLUTION__
"""
Type I: (Rejecting the null hypothesis given it's true): Saying queso customers' total check amounts are different 
than non-queso customers' total check amounts when they are the same.

Type II: (Failing to reject the null hypothesis given it's false): Saying queso customers' total check amounts are 
the same as non-queso customers' total check amounts when they are different.
"""

# Give partial credit to students who describe what type I and type II errors are. 
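
The link between the significance level and the Type I error rate can be seen in a simulation. This is a minimal sketch with synthetic data: both groups are drawn from the same hypothetical normal distribution, so every rejection is a Type I error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, n_trials = 0.05, 100, 2_000   # hypothetical group size and number of repeated experiments

# Both groups come from the same distribution, so the null hypothesis is true in every trial
a = rng.normal(20, 2, size=(n_trials, n))
b = rng.normal(20, 2, size=(n_trials, n))

# One two-sided, equal-variance t-test per row
p_values = stats.ttest_ind(a, b, axis=1).pvalue

type_1_rate = (p_values < alpha).mean()
print(f"Share of trials that wrongly reject H0: {type_1_rate:.3f}")
```

The observed rate should land close to `alpha`, since the significance level is the probability of a Type I error when the null hypothesis holds.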

### b. Sample Testing

2.b.1) Run a statistical test on the two samples. Use a significance level of $\alpha = 0.05$. You can assume the two samples have equal variance. Can you reject the null hypothesis? 

_Hint: Use `scipy.stats`._

In [None]:
# your code here 

In [None]:
# your answer here

In [None]:
# __SOLUTION__ 

# Run a two-tailed t-test
print(stats.ttest_ind(no_queso, queso))

# Students may compute the critical t-statistics for the rejection region
# (pooled two-sample t-test: df = n1 + n2 - 2 = 1998)
critical_t = (stats.t.ppf(0.025, df=1998), stats.t.ppf(0.975, df=1998))
print(critical_t)

In [None]:
# __SOLUTION__
# We have enough evidence to reject the null hypothesis at a significance level of alpha = 0.05. We obtain a two-sided
# p-value much smaller than 0.05. Alternatively, our t-statistic falls in the rejection region beyond the critical t-statistics.
# Both answers (p-value or critical t-statistic) are valid. 
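
Both decision rules can be written out side by side. The sketch below uses synthetic stand-ins for the two samples (hypothetical means of 20 and 21, not the pickled restaurant data), so the printed numbers say nothing about the real result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic stand-ins for the two samples; NOT the restaurant data
group_a = rng.normal(20, 2, size=1000)
group_b = rng.normal(21, 2, size=1000)

alpha = 0.05
result = stats.ttest_ind(group_a, group_b, equal_var=True)  # two-sided by default

# Pooled-variance t-test: degrees of freedom are n1 + n2 - 2
dof = len(group_a) + len(group_b) - 2
t_crit = stats.t.ppf(1 - alpha / 2, df=dof)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
print(f"reject H0 (p-value rule): {result.pvalue < alpha}")
print(f"reject H0 (critical-value rule): {abs(result.statistic) > t_crit}")
```

Because `ttest_ind` already reports a two-sided p-value, it is compared with `alpha` directly, while the critical value is taken at `1 - alpha / 2`.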

---
## Part 3: Bayesian Statistics [Suggested time: 15 minutes]
---
### a. Bayes' Theorem

Thomas wants to get a new puppy 🐕 🐶 🐩 


<img src="https://media.giphy.com/media/rD8R00QOKwfxC/giphy.gif" />

He can choose to get his new puppy either from the pet store or the pound. The probability of him going to the pet store is $0.2$. 

He can choose to get either a large, medium or small puppy.

If he goes to the pet store, the probability of him getting a small puppy is $0.6$. The probability of him getting a medium puppy is $0.3$, and the probability of him getting a large puppy is $0.1$.

If he goes to the pound, the probability of him getting a small puppy is $0.1$. The probability of him getting a medium puppy is $0.35$, and the probability of him getting a large puppy is $0.55$.

3.a.1) What is the probability of Thomas getting a small puppy? 

3.a.2) Given that he got a large puppy, what is the probability that Thomas went to the pet store?

3.a.3) Given that Thomas got a small puppy, is it more likely that he went to the pet store or to the pound?

3.a.4) For question 3.a.2, what are the prior, posterior and likelihood?

In [None]:
ans1 = None
ans2 = None
ans3 = "answer here"
ans4_prior = "answer here"
ans4_posterior = "answer here"
ans4_likelihood = "answer here"

In [None]:
# __SOLUTION__ 
ans1 = 0.2
ans2 = 0.02/0.46
ans3 = "Pet Store" # pet store! (0.12 vs 0.08)
ans4_prior = "P(Store)"
ans4_posterior = "P(Store | Large)"
ans4_likelihood = "P(Large | Store)"

"""
Question 1:

P(Small) = P(Small|Pet Store)*P(Pet Store) + P(Small|Pound)*P(Pound) = 0.6*0.2 + 0.1*0.8 = 0.2

Question 2:

P(Pet Store|Large)  = P(Large|Pet Store)*P(Pet Store) / P(Large) 
                    = 0.1*0.2 / (0.1*0.2 + 0.55*0.8)
                    = 0.02 / 0.46 = 0.04348
                    
Question 3:

P(Pet Store|Small) = 0.6
P(Pound|Small) = 0.4

More likely he went to the pet store.

Question 4:
P(Pet Store|Large) = P(Large|Pet Store)*P(Pet Store) / P(Large) 

Prior: P(Store)
Posterior: P(Store | Large)
Likelihood: P(Large | Store)
"""

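The calculations in the solution above can be reproduced in a few lines. This is a sketch using only the probabilities stated in the problem; the dictionary and function names are illustrative.

```python
# Priors: where Thomas goes
p_store, p_pound = 0.2, 0.8

# Likelihoods: P(size | location)
p_size_given_store = {"small": 0.6, "medium": 0.3, "large": 0.1}
p_size_given_pound = {"small": 0.1, "medium": 0.35, "large": 0.55}

def posterior_store(size):
    """P(Pet Store | size) via Bayes' theorem."""
    joint_store = p_size_given_store[size] * p_store
    joint_pound = p_size_given_pound[size] * p_pound
    return joint_store / (joint_store + joint_pound)

# Law of total probability for P(Small)
p_small = p_size_given_store["small"] * p_store + p_size_given_pound["small"] * p_pound

print(f"P(Small) = {p_small:.2f}")
print(f"P(Pet Store | Large) = {posterior_store('large'):.5f}")
print(f"P(Pet Store | Small) = {posterior_store('small'):.2f}")
```
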
---
## Part 4: Linear Regression [Suggested time: 10 minutes]
---

In this section, you'll be using the Advertising data, and you'll be creating linear models that are more complicated than a simple linear regression. The relevant modules have already been imported at the beginning of this notebook. We'll load and prepare the dataset for you below.

In [None]:
data = pd.read_csv('data/advertising.csv').drop('Unnamed: 0',axis=1)
data.describe()

In [None]:
# __SOLUTION__
data = pd.read_csv('data/advertising.csv').drop('Unnamed: 0',axis=1)
data.describe()

In [None]:
X = data.drop('sales', axis=1)
y = data['sales']

In [None]:
# __SOLUTION__
X = data.drop('sales', axis=1)
y = data['sales']

In [None]:
# split the data into training and testing set. Do not change the random state please!
X_train , X_test, y_train, y_test = train_test_split(X, y,random_state=2019)

In [None]:
# __SOLUTION__
# split the data into training and testing set. Do not change the random state please!
X_train , X_test, y_train, y_test = train_test_split(X, y,random_state=2019)

### a. Multiple Linear Regression

In the linear regression section of the curriculum, you've analyzed how TV, Radio and Newspaper spending individually affected the Sales figures. Here, we'll use all three together in a multiple linear regression model!

4.a.1) Create a Correlation Matrix for `X`.

In [None]:
# your code here 

In [None]:
# __SOLUTION__
X.corr()

4.a.2) Based on this correlation matrix only, would you recommend using `TV`, `radio` and `newspaper` in the same multiple linear regression model?

In [None]:
# Your written answer here

In [None]:
# __SOLUTION__
# The highest correlation can be observed between radio and newspaper. 
# Since the correlation is only 0.35 (much smaller than what would be considered a high correlation ~>0.7), there
# are no multicollinearity issues for the three variables, and from a multicollinearity perspective, 
# they can be used together in the same model
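
The rule of thumb described above can be automated by scanning a correlation matrix for pairs above a cutoff. The sketch below uses hypothetical synthetic features (`a`, `b`, `c`) rather than the advertising data, and the 0.7 threshold is an assumption, not a hard rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
a = rng.normal(size=200)
# Hypothetical features: "b" is nearly a copy of "a", "c" is independent
features = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.7  # rule-of-thumb cutoff; an assumption, not a hard rule
corr = features.corr().abs()

# Keep each pair once by looking only above the diagonal
high_pairs = [
    (corr.index[i], corr.columns[j])
    for i in range(len(corr))
    for j in range(i + 1, len(corr))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

An empty list would suggest no pairwise multicollinearity concerns at that cutoff; pairwise correlations alone cannot rule out dependence among three or more variables.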

4.a.3) Use StatsModels' `ols`-function to create a multiple linear regression model with `TV`, `radio` and `newspaper` as independent variables and sales as the dependent variable. Use the **training set only** to create the model.

Required output: the model summary of this multiple regression model.

In [None]:
# your code here 

In [None]:
# __SOLUTION__
train_data = pd.concat([X_train, y_train], axis=1)  # the OLS formula refers to columns of this DataFrame
formula = 'sales ~ TV + radio + newspaper'
model = ols(formula=formula, data=train_data).fit()
model.summary()

4.a.4) Do we have any statistically significant coefficients? If the answer is yes, list them below.

In [None]:
# Your written answer here

In [None]:
# __SOLUTION__
# Since the p-value is very small for TV and radio, they are statistically significant.
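
Significant coefficients can also be pulled out of a fitted model programmatically rather than read off the summary table. This is a sketch on a synthetic stand-in for the advertising data (the coefficients used to generate `sales` are made up), using the `pvalues` attribute of a fitted StatsModels result.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(5)
n = 200
# Synthetic stand-in for the advertising data; the coefficients below are made up
toy = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "radio": rng.uniform(0, 50, n),
    "newspaper": rng.uniform(0, 100, n),
})
toy["sales"] = 3 + 0.05 * toy["TV"] + 0.2 * toy["radio"] + rng.normal(0, 1.5, n)

toy_model = ols("sales ~ TV + radio + newspaper", data=toy).fit()

# pvalues is a pandas Series indexed by term name (including "Intercept")
alpha = 0.05
significant = toy_model.pvalues[toy_model.pvalues < alpha].index.tolist()
print(significant)
```

The same filter applied to the model fitted on the training set would list the terms whose p-values fall below the chosen `alpha`.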