A quick intro to Google Colab, which we'll use to run Jupyter notebooks in class.

In [None]:
print("Hello world!")

See detailed introductions at: 
https://colab.research.google.com/

Some of the most useful features for our class: 
- linear algebra
    - Recommended reference: https://cheatsheets.quantecon.org/
- data management
    - pandas 
- plotting 
    - Recommended libraries: Seaborn, matplotlib
- statistics packages 
    - scipy.stats: test statistics, reference distributions
    - statsmodels: classical statistics, e.g. regression, ANOVA
    - scikit-learn: statistical learning software, for clustering, classification, regression, etc 

# Exercise: Exponential Hypothesis Testing
A firm produces engine parts that wear out over time. 
The lifetime of an engine part can be modeled by an exponential distribution with rate parameter $\lambda$, so the average lifetime of the part is $\mu = 1 / \lambda$. The probability density of lifetime $x$ is: $f(x; \lambda) = \begin{cases} \lambda e^{- \lambda x} \qquad & x \geq 0 \\ 0 & \text{ else} \end{cases}$. 
The firm has developed a new model, has conducted tests to find the lifetimes of the new parts, and wants you to determine if the new part is a significant improvement. 

### Part 1:
First, find the maximum likelihood estimator (MLE) for the parameter $\lambda$. Then use it to compute the estimate $\hat{\lambda}$ for the lifetimes in each of the files `old_part.csv` and `new_part.csv`. 
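
For reference, the derivation can be sketched as follows, assuming $n$ independent observations $x_1, \dots, x_n \geq 0$. The symbols $n$, $\ell$, and $\bar{x}$ are notation introduced here for illustration.

$$
\ell(\lambda) = \log \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = n \log \lambda - \lambda \sum_{i=1}^{n} x_i
$$

$$
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \quad \Longrightarrow \quad \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}
$$

The second derivative, $-n / \lambda^2$, is negative for every $\lambda > 0$, so this stationary point is a maximum.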

In [25]:
import numpy as np
import pandas as pd

# Simulate lifetimes for both parts; `scale` is the mean lifetime, 1 / lambda.
np.random.seed(1234)
X_old_part = np.random.exponential(scale=3, size=100)
X_new_part = np.random.exponential(scale=3.5, size=100)
# Save each sample to a CSV file with a single `lifetime` column.
pd.DataFrame(X_old_part,columns=['lifetime']).to_csv('old_part.csv',index=False)
pd.DataFrame(X_new_part,columns=['lifetime']).to_csv('new_part.csv',index=False)


# Alternative: load the data files from the course repository instead.
# path_base = 'https://raw.githubusercontent.com/maxoboe/6419_recitations/main/R1/'
# X_old_part = pd.read_csv(path_base + 'old_part.csv').values
# X_new_part = pd.read_csv(path_base + 'new_part.csv').values

# MLE of the exponential rate: one over the sample mean.
lambda_old = 1 / np.mean(X_old_part)
lambda_new = 1 / np.mean(X_new_part)
print("Lambda for old part: {}, lambda for new part: {}".format(lambda_old, lambda_new))

Lambda for old part: 0.3321527061334148, lambda for new part: 0.25159196418944174
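
As a sanity check on the closed-form estimator, the log-likelihood can also be maximized numerically. The sketch below is illustrative only: the helper names `neg_log_likelihood` and `numerical_mle`, the search bounds, and the synthetic sample are assumptions made here, not part of the exercise.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(rate, x):
    """Negative exponential log-likelihood: -(n * log(rate) - rate * sum(x))."""
    x = np.asarray(x, dtype=float)
    return -(x.size * np.log(rate) - rate * x.sum())

def numerical_mle(x, bounds=(1e-6, 100.0)):
    """Maximize the log-likelihood over the rate within the given bounds."""
    result = minimize_scalar(neg_log_likelihood, bounds=bounds, args=(x,), method="bounded")
    return result.x

# Hypothetical sample, generated only for this check.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=3.0, size=100)

closed_form = 1 / np.mean(sample)
numeric = numerical_mle(sample)
print("Closed form: {}, numerical: {}".format(closed_form, numeric))
```

The two values should agree up to the optimizer's tolerance, which supports using $1 / \bar{x}$ directly.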


### Part 2: 
Write down a null hypothesis and alternate hypothesis for the question of whether the new part lasts longer than the old part.

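Since $\lambda = 1 / \mu$, a longer average lifetime corresponds to a smaller rate.

$H_0$: $\mu_{new} = \mu_{old}$ (equivalently, $\lambda_{new} = \lambda_{old}$)

$H_A$: $\mu_{new} > \mu_{old}$ (equivalently, $\lambda_{new} < \lambda_{old}$)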

### Part 3: 
Use a likelihood ratio test to evaluate the specified null hypothesis. Find the value of the test statistic and the appropriate degrees of freedom for the corresponding $\chi^2$ distribution.

A helper function is provided that computes the likelihood of a sample (the product of the likelihoods of its individual observations) given a parameter guess. 

General likelihood ratio statistic: $\displaystyle -2 \log\left(\frac{\sup_{\lambda \in \Theta_0} \mathcal{L}(X | \lambda)}{\sup_{\lambda \in \Theta} \mathcal{L}(X | \lambda)} \right)$ 

For our two samples: $\Lambda = \displaystyle -2 \log\left(\frac{\sup_{\lambda \in \Theta_0} \mathcal{L}(X_{old} | \lambda) \mathcal{L}(X_{new} | \lambda)}{\sup_{\lambda_1 \in \Theta} \mathcal{L}(X_{old} | \lambda_1)\sup_{\lambda_2 \in \Theta} \mathcal{L}(X_{new} | \lambda_2)} \right)$ 

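What we found before: $\hat{\lambda}_{old} = \arg\max_{\lambda \in \Theta} \mathcal{L}(X_{old} | \lambda) = 1 / \bar{X}_{old}$, and likewise for $\hat{\lambda}_{new}$.

MLE under $H_0$: both samples share a single rate, $\lambda_{new} = \lambda_{old} = \lambda$, so the estimate is one over the mean of the pooled sample.

MLE overall (the denominator): $\hat{\lambda}_{old}$ and $\hat{\lambda}_{new}$ from Part 1.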

In [37]:
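def likelihood(X, param):
    """Likelihood of the sample X under an exponential distribution with rate `param`."""
    def indiv_likelihood(x):
        # Density of a single observation; zero outside the support x >= 0.
        if x < 0: return 0
        return param * np.exp(-param * x)
    return np.prod([indiv_likelihood(x) for x in X])

# Under H_0 both samples share one rate: the MLE of the pooled sample.
numerator_lambda = 1 / np.mean(np.hstack([X_old_part,X_new_part]))
numerator = likelihood(X_old_part, numerator_lambda) * likelihood(X_new_part, numerator_lambda)
# Unrestricted case: each sample uses its own MLE from Part 1.
denomenator = likelihood(X_old_part, lambda_old) * likelihood(X_new_part, lambda_new)

print("The raw likelihood ratio is {}".format(numerator / denomenator))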

The raw likelihood ratio is 0.14617379707091524
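
Multiplying many small densities can underflow to zero for larger samples, so it is often safer to work with log-likelihoods. For the exponential model the statistic simplifies to $\Lambda = -2 \left[ (n_{old} + n_{new}) \log \hat{\lambda}_0 - n_{old} \log \hat{\lambda}_{old} - n_{new} \log \hat{\lambda}_{new} \right]$, where $\hat{\lambda}_0$ is the pooled estimate. The sketch below is one possible way to compute it; the function names `log_likelihood` and `lrt_statistic` and the synthetic samples are assumptions introduced here.

```python
import numpy as np

def log_likelihood(x, rate):
    """Exponential log-likelihood of sample x: n * log(rate) - rate * sum(x)."""
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        return -np.inf  # the density is zero for negative lifetimes
    return x.size * np.log(rate) - rate * x.sum()

def lrt_statistic(x_old, x_new):
    """-2 log likelihood ratio for H_0: both samples share one rate."""
    x_old = np.asarray(x_old, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    rate_old = 1 / x_old.mean()
    rate_new = 1 / x_new.mean()
    rate_pooled = 1 / np.concatenate([x_old, x_new]).mean()
    log_num = log_likelihood(x_old, rate_pooled) + log_likelihood(x_new, rate_pooled)
    log_den = log_likelihood(x_old, rate_old) + log_likelihood(x_new, rate_new)
    return -2 * (log_num - log_den)

# Hypothetical samples, generated only to exercise the function.
rng = np.random.default_rng(0)
a = rng.exponential(scale=3.0, size=100)
b = rng.exponential(scale=3.5, size=100)
stat = lrt_statistic(a, b)
print("LRT statistic from log-likelihoods: {}".format(stat))
```

Applied to `X_old_part` and `X_new_part`, this should match $-2 \log$ of the raw ratio printed above, up to floating-point error.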


### Part 4:
At a significance level of $\alpha = 0.05$, do you reject the null hypothesis? 


Hint: use the package `scipy.stats` and refer to [this link](https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-tests) for reference. 

In [39]:
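from scipy.stats import chi2 
# Likelihood ratio test statistic: -2 log of the raw likelihood ratio from Part 3.
test_value = -2 * np.log(numerator / denomenator)
# The unrestricted model has 2 free rates and the null has 1, so dof = 2 - 1 = 1.
dof = 1 
# p-value: upper-tail probability of the chi-squared reference distribution.
p_value = 1 - chi2.cdf(test_value, dof)
print("Test value is {}, which has p value of {}".format(test_value,p_value))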

Test value is 3.8459179486611847, which has p value of 0.04986721783570636
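
The p value of about 0.0499 is just below $\alpha = 0.05$, so the null hypothesis is rejected, though only by a narrow margin. An equivalent way to reach the decision is to compare the statistic with the critical value of the $\chi^2_1$ distribution. The sketch below hard-codes the statistic from the output above; the variable names `critical_value` and `reject` are introduced here for illustration.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1
test_value = 3.8459179486611847  # statistic reported in the output above

# Reject H_0 when the statistic exceeds the (1 - alpha) quantile of chi-squared(dof).
critical_value = chi2.ppf(1 - alpha, dof)
reject = test_value > critical_value
print("Critical value: {}, reject H_0: {}".format(critical_value, reject))
```

Note that the $\chi^2$ reference distribution corresponds to a two-sided comparison of the rates, while the alternative in Part 2 is one-sided.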


### Part 5: 
Now consider some alternate tests. For each, find the desired test statistic and discuss the result. 

Hint: use the package `scipy.stats` and refer to [this link](https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-tests) for reference. 

1. Assuming that durations are normally distributed, evaluate a test for the hypothesis that the two distributions have the same mean. 
2. Without making any distributional assumptions, test the hypothesis that the two distributions have the same mean.
3. Test the hypothesis that the two distributions are the same.

In [33]:
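import scipy.stats as stats
# 1. Two-sample t-test (assumes normality and equal variances).
test1 = stats.ttest_ind(X_old_part, X_new_part)
# 2. Wilcoxon rank-sum test (nonparametric comparison of location).
test2 = stats.ranksums(X_old_part,X_new_part)
# 3. Two-sample Kolmogorov-Smirnov test (compares the full distributions).
test3 = stats.ks_2samp(X_old_part,X_new_part)
print("Test 1: {}".format(test1))
print("Test 2: {}".format(test2))
print("Test 3: {}".format(test3))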

Test 1: Ttest_indResult(statistic=-2.0670154517829284, pvalue=0.04003346528145235)
Test 2: RanksumsResult(statistic=-1.5271180544538152, pvalue=0.1267316582174145)
Test 3: KstestResult(statistic=0.17, pvalue=0.11119526053829192)
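
At $\alpha = 0.05$, only the t-test (p value of about 0.040) rejects the null hypothesis; the rank-sum test (about 0.127) and the Kolmogorov-Smirnov test (about 0.111) do not. All three calls above are two-sided by default, whereas the alternative in Part 2 is one-sided. The sketch below shows one way to run a one-sided t-test and a Welch variant that drops the equal-variance assumption. It assumes SciPy 1.6 or later for the `alternative` argument, and it regenerates the simulated data from Part 1 so that it runs on its own.

```python
import numpy as np
import scipy.stats as stats

# Regenerate the simulated samples from Part 1 (same seed and scales).
np.random.seed(1234)
X_old_part = np.random.exponential(scale=3, size=100)
X_new_part = np.random.exponential(scale=3.5, size=100)

# Default two-sided test, as in the cell above.
two_sided = stats.ttest_ind(X_old_part, X_new_part)
# alternative="less": the mean of the first sample is less than that of the second.
one_sided = stats.ttest_ind(X_old_part, X_new_part, alternative="less")
# Welch's t-test: same one-sided alternative without assuming equal variances.
welch = stats.ttest_ind(X_old_part, X_new_part, equal_var=False, alternative="less")

print("Two-sided p value: {}".format(two_sided.pvalue))
print("One-sided p value: {}".format(one_sided.pvalue))
print("One-sided Welch p value: {}".format(welch.pvalue))
```

When the observed difference points in the direction of the alternative, the one-sided p value is half the two-sided one, so the choice of alternative should be fixed before looking at the data.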
