# Inference and Hypothesis Testing

**OBJECTIVES**

- Review confidence intervals
- Review standard error of the mean
- Introduce Hypothesis Testing
 - Hypothesis test with one sample
 - Difference in two samples
 - Difference in multiple samples

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

#### Quiz Review

Using the `titanic` data, determine which features seem to discriminate well between passengers who survived and those who did not.

In [None]:
titanic = sns.load_dataset('titanic')

### Standardization

Suppose we want to compare scores drawn from two distributions that live on different scales.  
- An English Class has test scores normally distributed with mean 95 and standard deviation 5.

- A Mathematics Class has test scores normally distributed with mean 80 and standard deviation 7.

In [None]:
#math class
math_class = stats.norm(loc = 80, scale = 7)

In [None]:
#histogram
plt.hist(math_class.rvs(100))

In [None]:
#english scores
english_class = stats.norm(loc = 95, scale = 5)

In [None]:
#make a dataframe
tests_df = pd.DataFrame({'math': math_class.rvs(1000), 'english': english_class.rvs(1000)})
tests_df.head()

In [None]:
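#plot the histograms together (alpha keeps the overlap visible)
plt.hist(tests_df['math'], alpha = 0.5, label = 'math')
plt.hist(tests_df['english'], alpha = 0.5, label = 'english')
plt.legend()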

In [None]:
#problem: Student A scored 82 in math; Student B scored 97 in English.
#How many standard deviations from its class mean is each score?
#Who did better?
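
One way to answer the question is to convert each score to a z-score, $z = (x - \mu) / \sigma$, using the class means and standard deviations stated earlier. The helper name `z_score` below is just for illustration.

```python
# z-score: how many standard deviations a score sits from its class mean
def z_score(score, mean, std):
    return (score - mean) / std

z_math = z_score(82, mean=80, std=7)      # Student A, math class
z_english = z_score(97, mean=95, std=5)   # Student B, English class

print(f"math z = {z_math:.2f}, english z = {z_english:.2f}")
```

The student with the larger z-score did better relative to their own class, even though both raw scores sit two points above the mean.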

#### `Standardizer`

The work of standardizing our data is extremely important for many models.  To get a feel for an important library, your task is to build a `Standardizer` class that has two methods:

```python
.fit()
.transform()
```

When the `.fit` method is called, you will learn the mean and standard deviation of the data.  Upon learning these, assign them to the attributes `.mean_` and `.scale_`.  Then, use the `.transform` method to actually transform the data.  Demonstrate its use with the `tests_df`.  Note, you will need to call the `.fit` method prior to the `.transform`.  As a bonus, try adding an error message that warns the user when `transform` is called before `fit`.

In [None]:
class Standardizer:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None
        
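    def fit(self, X):
        #learn the mean and standard deviation of X; store them in mean_ and scale_
        pass
    
    def transform(self, X):
        #return X standardized with the learned mean_ and scale_
        pass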
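
If you get stuck, here is one possible sketch, not the only valid solution. The class name `StandardizerSketch` and the synthetic `demo` frame are assumptions made for illustration, and the choice of the population standard deviation (`ddof=0`) mirrors what scikit-learn's `StandardScaler` does; the sample version (`ddof=1`) would also be defensible.

```python
import numpy as np
import pandas as pd

class StandardizerSketch:
    """Hypothetical solution sketch for the Standardizer exercise."""
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        # learn the column-wise mean and (population) standard deviation
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0, ddof=0)
        return self

    def transform(self, X):
        # bonus: refuse to transform before the statistics have been learned
        if self.mean_ is None or self.scale_ is None:
            raise RuntimeError("Call .fit() before .transform().")
        return (X - self.mean_) / self.scale_

# synthetic stand-in for tests_df (seeded so the result is reproducible)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'math': rng.normal(80, 7, 1000),
                     'english': rng.normal(95, 5, 1000)})

scaled = StandardizerSketch().fit(demo).transform(demo)
print(scaled.describe().loc[['mean', 'std']])
```

After the transform, each column should have a mean of approximately 0 and a standard deviation of approximately 1, so scores from the two classes can be compared directly.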

#### Differences between groups

In [None]:
#read in the polls data
polls = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/polls.csv')

In [None]:
#take a peek
polls.head()

### Confidence intervals

$$\bar{x} \pm t_{1 - \alpha / 2} \times \frac{s}{\sqrt{n}}$$

- $\alpha$: significance level -- we determine this
- $t$: t-score -- we look this up
- $\bar{x}$: sample mean -- we get this from the data
- $s$: sample standard deviation -- we get this from the data. **NOTE**: This is different from a population standard deviation.
- $n$: sample size -- we get this from the data

In [None]:
#examine the first question data
q1 = polls['p1']

In [None]:
#determine degrees of freedom
#i.e. length - 1
dof = len(q1) - 1

In [None]:
#look up test statistic
#we need our alpha and dof
#where do we bound 97.5% of our data
t_stat = stats.t.ppf(1 - 0.05/2, dof)

In [None]:
#compute sample standard deviation
s = np.std(q1, ddof = 1)

In [None]:
#sample size
n = len(q1)

In [None]:
#compute upper limit
upper = q1.mean() + t_stat*s/np.sqrt(n)

In [None]:
#compute the lower bound
lower = q1.mean() - t_stat*s/np.sqrt(n)

In [None]:
#print it
(lower, upper)

In [None]:
#use scipy
#arguments: (1 - alpha, dof, mean, sem)
stats.t.interval(.95, n - 1, np.mean(q1), stats.sem(q1))

In [None]:
#plot it
#take 5000 samples of size 20 from poll 1, find each mean, kde of the results
sample_means = [q1.sample(20).mean() for _ in range(5000)]
sns.displot(sample_means, kind = 'kde')

### Problem

- Find the 95% confidence interval for the second poll
- Compare the two intervals, is there much overlap?  What does this mean?

### Confidence interval for Difference in Means

In [None]:
#statsmodels imports
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

In [None]:
#wrap each poll in a DescrStatsW object, then compare their means
#NOTE: q2 is the second poll from the problem above (assumed to be polls['p2'])
dq1 = DescrStatsW(q1)
dq2 = DescrStatsW(q2)
c = CompareMeans(dq1, dq2)

In [None]:
#95% confidence interval (alpha = 0.05) for the difference between the two poll means
c.tconfint_diff(.05)

In [None]:
#so what?

### Jobs Data

The data below is a sample of job postings from New York City.  We want to investigate the lower and upper bound columns.

In [None]:
#read in the data
jobs = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/jobs.csv')

In [None]:
#salary from
jobs.head()

### Margin of Error

Now the question is reversed: how large a sample do we need for a confidence interval to achieve a given margin of error $E$?

$$E = z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

**PROBLEM**

What is the minimum sample size necessary to estimate the upper salary range with 95% confidence within \$3000?

- need $z$-score: 1.96
- E: 3000
- $\sigma$: `np.std(jobs['salary_to'])`
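
Solving the margin of error formula for $n$ gives $n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$, rounded up to the next whole number. The sketch below wraps that in a helper; the name `min_sample_size` and the value `sigma_demo = 30_000` are placeholders for illustration only, so substitute the standard deviation computed from `jobs['salary_to']`.

```python
import math
from scipy import stats

def min_sample_size(sigma, error, confidence=0.95):
    # z-score that leaves alpha/2 in each tail (about 1.96 for 95%)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    # solve E = z * sigma / sqrt(n) for n, then round up
    return math.ceil((z * sigma / error) ** 2)

sigma_demo = 30_000   # placeholder value, NOT taken from the jobs data

n_3000 = min_sample_size(sigma_demo, error=3000)
n_500 = min_sample_size(sigma_demo, error=500)
print(n_3000, n_500)
```

Because $n$ grows with $1 / E^2$, shrinking the margin of error by a factor of 6 requires roughly 36 times as many observations.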

In [None]:
#do the computation


In [None]:
#repeat for $500


### Testing Significance

Now that we've tackled confidence intervals, let's wrap up with a test for significance.  With a hypothesis test, the first step is declaring a null and an alternative hypothesis.  Typically, the null hypothesis is an assumption of no difference.

$$H_0: \text{Null Hypothesis}$$
$$H_a: \text{Alternative Hypothesis}$$

For example, the data below come from a reading intervention and an assessment given afterwards.  Our null and alternative hypotheses will be:

$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 \neq \mu_2$$

In [None]:
#read in the data
reading = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/DRP.csv')
reading.head()

In [None]:
#distributions of groups
sns.displot(x = 'drp', hue = 'group', data = reading, kind='kde')

For our hypothesis test, we need two things:

- Null and alternative hypothesis

$$H_0: \mu_t = \mu_c $$
$$H_a: \mu_t \neq \mu_c $$
- Significance Level

 - $\alpha = 0.05$

Just like before, we will set a tolerance for rejecting the null hypothesis.

In [None]:
#split the groups
treatment = reading.loc[reading['g'] == 0]['drp']
control = reading.loc[reading['g'] == 1]['drp']

In [None]:
#run the test
stats.ttest_ind(treatment, control)

In [None]:
#alpha at 0.05

Suppose we want to test whether the intervention made scores higher:

$$H_0: \mu_t = \mu_c$$
$$H_a: \mu_t > \mu_c$$

In [None]:
#alpha at 0.05

In [None]:
t_score, p = stats.ttest_ind(treatment, control)

In [None]:
#one-sided p-value: halving is valid only when t_score > 0 (treatment mean above control)
p/2
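
As an alternative to halving the two-sided p-value by hand, `stats.ttest_ind` accepts an `alternative` argument (assuming SciPy 1.6 or later). The sketch below uses synthetic, seeded data rather than the reading scores; the names `treat_demo` and `ctrl_demo` and the distribution parameters are made up for illustration.

```python
import numpy as np
from scipy import stats

# synthetic stand-ins for the treatment and control groups (not the DRP data)
rng = np.random.default_rng(42)
treat_demo = rng.normal(52, 11, 21)
ctrl_demo = rng.normal(42, 17, 23)

# two-sided test, as in the cells above
t_two, p_two = stats.ttest_ind(treat_demo, ctrl_demo)

# one-sided test: H_a is that the first group's mean is greater
t_one, p_one = stats.ttest_ind(treat_demo, ctrl_demo, alternative='greater')

# halving the two-sided p-value matches only when t has the sign H_a predicts
p_half = p_two / 2 if t_two > 0 else 1 - p_two / 2
print(p_one, p_half)
```

Letting SciPy handle the direction avoids the easy mistake of reporting `p/2` when the observed difference points the wrong way.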

**PROBLEMS**

1. Given the `mileage` dataset, test the claim on the car's sticker that the average mpg for city driving is 30 mpg.

2. If we increase our food intake, we generally gain weight.  In one study, researchers fed 16 non-obese adults, aged 25-36, 1000 excess calories a day.  According to theory, 3500 extra calories translate into a weight gain of 1 pound, so over eight weeks (56,000 excess calories) we expect each subject to gain 16 pounds.  The `wtgain` dataset contains each subject's weight before and after the eight-week period.

  - Create a new column to represent the weight change of each subject.
  - Find the mean and standard deviation for the change.
  - Determine the 95% confidence interval for weight change and interpret in complete sentences.
  - Test the null hypothesis that the mean weight gain is 16 lbs.  What do you conclude?
  
3. Insurance adjusters are concerned about the high estimates they are receiving from Jocko's Garage.  To see if the estimates are unreasonably high, each of 10 damaged cars was taken to Jocko's and to another garage, and the estimates were recorded in the `jocko.csv` file.  

  - Create a new column that represents the difference in prices from the two garages. Find the mean and standard deviation of the difference.
  - Test the null hypothesis that there is no difference between the estimates at the 0.05 significance level.
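
Problems 2 and 3 both involve paired measurements, where the test is run on the per-subject differences. A minimal sketch of that idea, using synthetic seeded numbers rather than the `wtgain` or `jocko.csv` data (the names `first` and `second` are placeholders):

```python
import numpy as np
from scipy import stats

# synthetic paired measurements for 10 subjects (illustration only)
rng = np.random.default_rng(1)
first = rng.normal(1500, 400, 10)
second = first - rng.normal(100, 80, 10)

diff = first - second

# paired t-test is a one-sample t-test on the differences against 0
t_paired, p_paired = stats.ttest_rel(first, second)
t_single, p_single = stats.ttest_1samp(diff, popmean=0)

# same statistic by hand: mean difference divided by its standard error
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
print(t_paired, t_single, t_manual)
```

To test against a value other than zero, such as a mean gain of 16 lbs, change `popmean` in `stats.ttest_1samp`.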