# Inferential Statistics (frequentist)

## Concepts covered in this lesson

1. Estimation and Estimators
2. Confidence intervals (quantifying sampling error)
3. Hypothesis testing

## Estimation and Estimators

Think of the following study:
- Research question: What's the average weight of people in the Los Angeles (LA) metro area?
- Sampling technique: Ask every third account on Instagram who posts mainly in the LA metro area.

Now, let's answer the following questions:
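1. What is estimation? Using a statistic computed from a sample to obtain information about a population parameter.
2. What is an estimator? A rule or statistical method for computing an estimate of a parameter from observable data.
3. What is estimator bias? The long-run (expected) difference between the value of an estimator and the population parameter it estimates.
4. What is sampling error? The difference between a sample statistic and the population parameter that arises because we observe only a sample rather than the whole population; it occurs even when the sampling is truly random.
5. What is the difference between standard error and standard deviation? Standard deviation describes the spread of individual observations; standard error is the standard deviation of an estimator's sampling distribution (for a mean, s / sqrt(n)), so it becomes smaller as sample size increases.
6. What is sampling bias? Systematically selecting groups that are not representative of the full population (here, only Instagram accounts that post mainly in the LA metro area).
7. What is measurement error? Error introduced in the data collection process, such as self-reported weights that differ from true weights.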
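
The definitions of standard error and estimator bias above can be checked with a small simulation. This is only a sketch: the population mean and standard deviation (`POP_MEAN`, `POP_SD`) and the helper `simulate_sample_means` are made up for illustration and are not part of the weight study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical population: weights (kg) with a known mean and standard deviation.
POP_MEAN, POP_SD = 80.0, 15.0


def simulate_sample_means(n: int, n_repeats: int = 5000) -> np.ndarray:
    """Draw n_repeats random samples of size n and return each sample's mean."""
    samples = rng.normal(POP_MEAN, POP_SD, size=(n_repeats, n))
    return samples.mean(axis=1)


for n in (10, 100, 1000):
    means = simulate_sample_means(n)
    empirical_se = means.std(ddof=1)      # spread of the estimator across samples
    theoretical_se = POP_SD / np.sqrt(n)  # sigma / sqrt(n)
    print(f"n={n:5d}  empirical SE={empirical_se:.3f}  theoretical SE={theoretical_se:.3f}")

# Estimator bias: dividing by n (ddof=0) underestimates the population variance
# on average, while dividing by n - 1 (ddof=1) does not.
small_samples = rng.normal(POP_MEAN, POP_SD, size=(20000, 5))
biased_var = small_samples.var(axis=1, ddof=0).mean()
unbiased_var = small_samples.var(axis=1, ddof=1).mean()
print(f"true variance={POP_SD**2:.1f}  ddof=0 average={biased_var:.1f}  ddof=1 average={unbiased_var:.1f}")
```

The standard error shrinks as `n` grows, while the standard deviation of the population itself stays fixed at `POP_SD`.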

## Confidence intervals (quantifying sampling error)

Let's go back to the weight study above. Say that we will begin collecting our data.

1. How do we know when to stop?
2. How do we quantify the significance of the data we have collected so far?
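
One possible answer to the first question is to stop once the margin of error of the confidence interval is small enough. The sketch below assumes the population standard deviation is roughly known and uses the normal approximation; the function name `required_sample_size` and the numbers passed to it are hypothetical.

```python
import math

import scipy.stats


def required_sample_size(sigma: float, margin: float, ci_level: float = 0.95) -> int:
    """Smallest n for which z * sigma / sqrt(n) <= margin."""
    z = scipy.stats.norm.ppf(1 - (1 - ci_level) / 2)  # two-sided critical value
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical numbers: weights with a standard deviation of about 15 kg,
# and we want the sample mean to be within 2 kg of the population mean.
print(required_sample_size(sigma=15.0, margin=2.0))
```

Because the margin of error scales with 1 / sqrt(n), halving the margin requires roughly four times as many observations.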

### Calculating CIs using Python

Study: `as_datasets/ExamScores.csv` (exam scores of a class over time)

Write a function that computes confidence intervals for a mean given a `pd.Series` of data, using the following signature.
```python
def get_confidence_interval(dataset: pd.Series, ci_level: float) -> Tuple[float, float]:
```
Then, use your function to get the confidence interval for each column in `ExamScores.csv`.

In [5]:
import numpy as np
import scipy.stats
import pandas as pd
from typing import Tuple


def get_confidence_interval(dataset: pd.Series, ci_level: float = 0.95, force_t: bool = False) -> Tuple[float, float]:
    """
    Returns the confidence interval for the mean of the given data series, based on
      the z-distribution if the number of samples > 30 and the t-distribution if
      the number of samples is less than or equal to 30.

    :param dataset: a single series of data to get the confidence interval for the mean.
    :param ci_level: confidence level of the interval, e.g. 0.95 for a 95% CI
    :param force_t: True to use the t-distribution regardless of sample size
    :return: (lower bound, upper bound) of the confidence interval
    """
    n = len(dataset)
    mean = dataset.mean()
    stdev = dataset.std()  # pandas defaults to ddof=1, i.e. the sample standard deviation
    stderr = stdev / np.sqrt(n)  # standard error of the mean
    if n > 30 and not force_t:
        return scipy.stats.norm.interval(ci_level, mean, stderr)
    else:
        dof = n - 1  # degrees of freedom for the t-distribution
        return scipy.stats.t.interval(ci_level, dof, mean, stderr)

In [6]:
df_exam = pd.read_csv('../as_datasets/ExamScores.csv')

df_exam_cis = df_exam.apply(get_confidence_interval, axis=0)  # axis=0 for columns
print(df_exam_cis)
print(df_exam_cis.mean())

       Exam1      Exam2      Exam3      Exam4
0  80.124504  75.427229  67.310128  74.266876
1  85.275496  83.372771  79.369872  78.733124
Exam1    82.70
Exam2    79.40
Exam3    73.34
Exam4    76.50
dtype: float64


## Hypothesis testing

Continuing with the exam scores, **how do we know whether the class _did better_ on the second exam than on the first exam?**

In other words, what is the **significance** of our test statistic?  

How do we determine that this is **statistically significant**?

When would **statistical significance** not be important **practically**?
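
One case where statistical significance matters little in practice is a very large sample with a tiny effect. The sketch below uses simulated (hypothetical) scores, not `ExamScores.csv`, and reports Cohen's d as one common effect-size measure alongside the p-value.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=1)

# Hypothetical scores for two very large groups whose true means differ by
# only 0.1 points on a scale with a standard deviation of 10.
n = 1_000_000
group_a = rng.normal(80.0, 10.0, size=n)
group_b = rng.normal(80.1, 10.0, size=n)

result = scipy.stats.ttest_ind(group_a, group_b)

# Cohen's d: difference in means divided by the pooled standard deviation.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p-value   = {result.pvalue:.2e}")
print(f"Cohen's d = {cohens_d:.3f}")
```

With this many observations the p-value is very small even though the effect size is negligible, so the difference is statistically significant but unlikely to matter in practice.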

### Choosing statistical tests
![statistical test table](testing_table.PNG)

### Errors in hypothesis testing
![confusion matrix with Type 1/2 errors](confusion_matrix.PNG)

### Mean-based testing

#### 1-sample t-test

File: `as_datasets/ExamScores.csv`

Research question: Is the class's mean score on Exam 2 different from the expected score of 86?

In [14]:
# H0 is mu == 86
# HA is mu != 86

print(f"The mean of Exam2 is: {df_exam['Exam2'].mean()}")
print(f"The 95% confidence interval for Exam2 is: {get_confidence_interval(force_t=True, dataset=df_exam['Exam2'])}")
scipy.stats.ttest_1samp(df_exam['Exam2'], 86.0)

The mean of Exam2 is: 79.4
The 95% confidence interval for Exam2 is: (75.32666910888537, 83.47333089111464)


Ttest_1sampResult(statistic=-3.256105851002791, pvalue=0.0020525657751595604)

#### 2-sample unpaired t-test

File: `http://data-analytics.zybooks.com/Memory.csv`

Research question: Does this memory enhancement drug actually reduce the number of memory-related errors?

In [20]:
df_memory = pd.read_csv('http://data-analytics.zybooks.com/Memory.csv')
# df_memory.head()
print(f"The means are: {df_memory.mean()}")
print(f"The 95% confidence interval for nodrug is: {get_confidence_interval(force_t=True, dataset=df_memory['nodrug'])}")
print(f"The 95% confidence interval for drug is: {get_confidence_interval(force_t=True, dataset=df_memory['drug'])}")
scipy.stats.ttest_ind(df_memory['nodrug'], df_memory['drug'])

The means are: nodrug    27.8
drug      15.4
dtype: float64
The 95% confidence interval for nodrug is: (18.07418565183631, 37.52581434816369)
The 95% confidence interval for drug is: (12.987032359742942, 17.812967640257057)


Ttest_indResult(statistic=2.7992880505646385, pvalue=0.011854795066226269)
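
Two details of the call above are worth a second look. The research question is directional ("reduce"), while `ttest_ind` is two-sided by default, and the confidence interval for `nodrug` is much wider than the one for `drug`, which suggests unequal variances. A hedged sketch of a one-sided Welch's t-test follows; the arrays are hypothetical stand-ins, not the values in `Memory.csv`.

```python
import numpy as np
import scipy.stats

# Hypothetical error counts, NOT the values in Memory.csv.
nodrug = np.array([35, 12, 48, 20, 31, 9, 44, 27, 18, 30])
drug = np.array([15, 17, 12, 19, 14, 16, 13, 18, 15, 15])

# Welch's t-test (equal_var=False) does not assume equal variances.
# alternative='greater' tests H_A: mean(nodrug) > mean(drug).
# The `alternative` keyword requires SciPy 1.6 or later.
two_sided = scipy.stats.ttest_ind(nodrug, drug, equal_var=False)
one_sided = scipy.stats.ttest_ind(nodrug, drug, equal_var=False, alternative='greater')

print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one.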

#### 2-sample paired t-test

File: `as_datasets/ExamScores.csv`

Research question: Did the class improve on the second exam?

In [22]:
print(f"The means are: {df_exam.mean()}")
print(f"The 95% confidence interval for Exam1 is: {get_confidence_interval(force_t=True, dataset=df_exam['Exam1'])}")
print(f"The 95% confidence interval for Exam2 is: {get_confidence_interval(force_t=True, dataset=df_exam['Exam2'])}")
scipy.stats.ttest_rel(df_exam['Exam1'], df_exam['Exam2'])

The means are: Exam1    82.70
Exam2    79.40
Exam3    73.34
Exam4    76.50
dtype: float64
The 95% confidence interval for Exam1 is: (80.05931208777747, 85.34068791222253)
The 95% confidence interval for Exam2 is: (75.32666910888537, 83.47333089111464)


Ttest_relResult(statistic=1.417925258248465, pvalue=0.16254101610053864)
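
In the output above the statistic is positive because the mean of Exam1 is higher than the mean of Exam2, so the data do not point toward improvement. The sketch below, on simulated (hypothetical) paired scores, illustrates that a paired t-test is equivalent to a 1-sample t-test on the per-student differences, and how a directional alternative could be requested.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=2)

# Hypothetical paired scores for 50 students (NOT ExamScores.csv).
exam1 = rng.normal(82, 9, size=50)
exam2 = exam1 + rng.normal(-3, 12, size=50)

paired = scipy.stats.ttest_rel(exam1, exam2)
on_differences = scipy.stats.ttest_1samp(exam1 - exam2, 0.0)

print(paired.statistic, on_differences.statistic)
print(paired.pvalue, on_differences.pvalue)

# "Did the class improve?" is directional: H_A is mean(exam2) > mean(exam1),
# i.e. mean(exam1 - exam2) < 0. Requires SciPy 1.6+ for `alternative`.
improved = scipy.stats.ttest_rel(exam1, exam2, alternative='less')
print(improved.pvalue)
```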

#### One-way ANOVA

File: `as_datasets/ExamScores.csv`

Research question: Do the exam scores truly have different means?

In [23]:
scipy.stats.f_oneway(df_exam['Exam1'], df_exam['Exam2'], df_exam['Exam3'], df_exam['Exam4'])

F_onewayResult(statistic=3.8569608879310637, pvalue=0.010348669251964107)

#### Linear statistical modeling with OLS

File: `as_datasets/ExamScores.csv`

Same research question as above: Do the exam scores truly have different means?

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api

#### Multiple comparison with Tukey's HSD

File: `as_datasets/ExamScores.csv`

Research question: Which exam(s) did the class struggle with?
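
A minimal sketch of a pairwise comparison, assuming SciPy 1.8 or later (which provides `scipy.stats.tukey_hsd`). The scores are simulated and hypothetical, not the contents of `ExamScores.csv`.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=4)

# Hypothetical scores for four exams (NOT ExamScores.csv).
exams = {
    'Exam1': rng.normal(83, 9, size=50),
    'Exam2': rng.normal(79, 14, size=50),
    'Exam3': rng.normal(73, 20, size=50),
    'Exam4': rng.normal(76, 8, size=50),
}
names = list(exams)

# tukey_hsd compares every pair of groups while controlling the
# family-wise error rate. Requires SciPy 1.8 or later.
result = scipy.stats.tukey_hsd(*exams.values())

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        diff = result.statistic[i, j]  # mean(group i) - mean(group j)
        p = result.pvalue[i, j]
        print(f"{names[i]} vs {names[j]}: mean diff = {diff:6.2f}, adjusted p = {p:.4f}")
```

Pairs with a small adjusted p-value are the ones whose means differ; the sign of the mean difference shows which exam of the pair had the lower scores.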

### Proportion-based testing

#### 1-sample z-test on a proportion

File: `as_datasets/ExamScores.csv`

Research question: Does sufficient evidence exist that the proportion of scores over 80 on Exam 1 differs from a hypothesized proportion?
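
A sketch of how such a test could be computed by hand with the normal approximation. The helper `one_sample_proportion_ztest`, the counts, and the null proportion `p0 = 0.5` are all hypothetical; the lesson does not state the hypothesized proportion.

```python
from typing import Tuple

import numpy as np
import scipy.stats


def one_sample_proportion_ztest(successes: int, n: int, p0: float) -> Tuple[float, float]:
    """Two-sided z-test of H0: p == p0 using the normal approximation."""
    p_hat = successes / n
    se = np.sqrt(p0 * (1 - p0) / n)            # standard error under H0
    z = (p_hat - p0) / se
    p_value = 2 * scipy.stats.norm.sf(abs(z))  # two-sided tail probability
    return z, p_value


# Hypothetical example: 32 of 50 scores are over 80, tested against p0 = 0.5.
z, p_value = one_sample_proportion_ztest(successes=32, n=50, p0=0.5)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With real data, `successes` could be obtained as `(df_exam['Exam1'] > 80).sum()` and `n` as `len(df_exam)`.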

#### 2-sample z-test on a proportion

10,000 individuals are divided evenly into two groups. The treatment group is given a vaccine and the control group is given a placebo. 95 of the 5,000 individuals in the treatment group developed the disease. 125 of the 5,000 individuals in the control group developed the disease. A research team wants to determine whether the vaccine is effective in decreasing the incidence of the disease. Does sufficient evidence exist to conclude that the proportion developing the disease among individuals given the vaccine is less than that among individuals given a placebo?
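
One way to work this problem is a pooled two-proportion z-test computed by hand, sketched below with the counts from the problem statement. The variable names are chosen for illustration, and the one-sided alternative follows the wording "less than".

```python
import numpy as np
import scipy.stats

# Counts from the problem statement.
x_treat, n_treat = 95, 5000
x_ctrl, n_ctrl = 125, 5000

p_treat = x_treat / n_treat
p_ctrl = x_ctrl / n_ctrl

# Under H0 the two proportions are equal, so pool them to estimate the SE.
p_pool = (x_treat + x_ctrl) / (n_treat + n_ctrl)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))

z = (p_treat - p_ctrl) / se
p_value = scipy.stats.norm.cdf(z)  # one-sided: H_A is p_treat < p_ctrl

print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
```

Compare the resulting p-value against the chosen significance level (for example 0.05) to decide whether to reject the null hypothesis of equal proportions.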