# Tools and Methods of Data Analysis
## Session 8 - Part 1

Niels Hoppe <<niels.hoppe.extern@srh.de>>

In [1]:
import math
import pandas as pd
from scipy import stats

### Recap: The Idea of Hypothesis Testing

Does the data support a **hypothesis** or not?

Based on the test result, a hypothesis is **rejected** or **not rejected**.

Tests give **no absolute certainty**, but only a **probability of error**.

### Recap: The Idea of Hypothesis Testing (cont.)

1. What is the question (hypothesis)?
2. Which test is applicable?
3. How to perform the test?
4. How to interpret the result?

### 1. What is the question (hypothesis)?

A hypothesis is a claim (assumption) about a **population parameter**, e.g.,

Assumption about the **population mean** $\mu$:

"The average income in a country has fallen in the past 20 years."

Assumption about the **population proportion** $\pi$:

"The proportion of voters for a particular party has increased since the last election."

### 1. What is the question (hypothesis)? (cont.)

There are three kinds of hypotheses we learn to test for:

* the parameter has **increased** with respect to a reference value
* the parameter has **decreased** with respect to a reference value
* the parameter has **changed** (increased or decreased) with respect to a reference value

There are always two hypotheses:

* The **null-hypothesis** ($H_0$) expresses the **absence** of the assumed effect.
* The **alternative hypothesis** ($H_1$) expresses the **presence** of the assumed effect.
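In SciPy, the three kinds of alternative hypothesis correspond to the `alternative` argument of the test functions. The following is a minimal sketch, assuming a made-up sample and reference value (both hypothetical and chosen only for illustration):

```python
from scipy import stats

# Hypothetical measurements and reference value, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5]
reference = 5.0

# Map each kind of hypothesis to SciPy's `alternative` argument.
alternatives = {
    "increased": "greater",   # H1: mu > reference
    "decreased": "less",      # H1: mu < reference
    "changed": "two-sided",   # H1: mu != reference
}

pvals = {}
for kind, alt in alternatives.items():
    result = stats.ttest_1samp(sample, popmean=reference, alternative=alt)
    pvals[kind] = result.pvalue
    print(f"{kind:>9}: alternative={alt!r}, p-value={result.pvalue:.4f}")
```

The two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of the two.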

### 2. Which test is applicable?

Many statistical tests exist, but we will focus on two of them:

* Student's **t-test** for hypotheses about the mean $\mu$
* **Z-test** for hypotheses about the proportion $\pi$

### 3. How to perform the test?

The general procedure is:

1. Calculate a test statistic (a summary statistic) from the data
2. Calculate a p-value from the test statistic

The calculation of the test statistic is specific to the respective test.
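The two steps can be sketched for a one-sample t-test. The sample and the reference value `mu_0` below are hypothetical; the result is cross-checked against `scipy.stats.ttest_1samp`:

```python
import math
import statistics
from scipy import stats

# Hypothetical sample and reference value, for illustration only.
sample = [12.1, 11.6, 12.8, 12.4, 11.9, 12.7, 12.3]
mu_0 = 12.0

# Step 1: test statistic (here: the one-sample t statistic).
n = len(sample)
std_error = statistics.stdev(sample) / math.sqrt(n)  # stdev() is the sample std. dev.
t_stat = (statistics.mean(sample) - mu_0) / std_error

# Step 2: p-value from the t-distribution with n - 1 degrees of freedom.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Cross-check against SciPy's built-in test.
check = stats.ttest_1samp(sample, popmean=mu_0)
print(t_stat, p_two_sided, check.pvalue)
```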

### Student's t-test for means (two samples)

Calculating the test statistic for Student's t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \cdot \sqrt{\frac{2}{n}}}$$

where:

$$s_p = \sqrt{\frac{s^2_1 + s^2_2}{2}}$$

Assumptions:

* Sample sizes are equal, i.e., $n_1 = n_2$
* Variances are equal, i.e., $\sigma_1 = \sigma_2$

Calculate the p-value based on the t-distribution with $2n - 2$ degrees of freedom.

### Student's t-test for means (two samples)

Calculating the test statistic for Student's t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where:

$$s_p = \sqrt{\frac{(n_1 - 1) \cdot s^2_1 + (n_2 - 1) \cdot s^2_2}{n_1 + n_2 - 2}}$$

Assumptions:

* Sample sizes are unequal, i.e., $n_1 \ne n_2$
* Variances are similar, i.e., $\frac{1}{2} \lt \frac{\sigma_1}{\sigma_2} \lt 2$

Calculate the p-value based on the t-distribution with $n_1 + n_2 - 2$ degrees of freedom.
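As a sketch, the pooled-variance statistic can be computed by hand and compared with `scipy.stats.ttest_ind`. The helper `pooled_t` and the sample values are hypothetical. The last lines check that the formula reduces to the equal-size version when $n_1 = n_2 = n$:

```python
import math
import statistics
from scipy import stats

def pooled_t(a, b):
    """Student's t statistic and degrees of freedom with pooled variance."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)  # sample variances
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    t = (statistics.mean(a) - statistics.mean(b)) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical samples, for illustration only.
a = [20.1, 19.4, 21.3, 20.8, 19.9]
b = [18.7, 19.8, 18.9, 19.2, 20.3, 18.5, 19.0]

t_stat, df = pooled_t(a, b)
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check: equal_var=True selects the pooled-variance test in SciPy.
check = stats.ttest_ind(a, b, equal_var=True)
print(t_stat, p_two_sided, check.pvalue)

# Equal sample sizes: the general formula matches the simpler one.
c = b[:5]
n = len(a)
s_p_equal = math.sqrt((statistics.variance(a) + statistics.variance(c)) / 2)
t_equal = (statistics.mean(a) - statistics.mean(c)) / (s_p_equal * math.sqrt(2 / n))
print(t_equal, pooled_t(a, c)[0])
```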

### Student's t-test for means (two independent samples)

Calculating the test statistic for Student's t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{\Delta}}}$$

where:

$$s_{\bar{\Delta}} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$$

Assumptions:

* Sample sizes are unequal, i.e., $n_1 \ne n_2$
* Variances are unequal, i.e., $\sigma_1 \gt 2 \sigma_2$ or $\sigma_2 \gt 2 \sigma_1$

Calculate the p-value based on the t-distribution, using the [Welch-Satterthwaite equation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation) to calculate the degrees of freedom.
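A minimal sketch of this variant (Welch's t-test), including the Welch-Satterthwaite degrees of freedom. The helper `welch_t` and the data are hypothetical; the result is compared with `scipy.stats.ttest_ind(..., equal_var=False)`:

```python
import math
import statistics
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1 = statistics.variance(a) / n1  # s_1^2 / n_1
    v2 = statistics.variance(b) / n2  # s_2^2 / n_2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples with clearly different spread, for illustration only.
a = [10.2, 9.8, 10.1, 10.4, 9.9]
b = [12.5, 8.1, 15.3, 9.7, 13.8, 7.2, 11.4, 14.9]

t_stat, df = welch_t(a, b)
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check: equal_var=False selects Welch's test in SciPy.
check = stats.ttest_ind(a, b, equal_var=False)
print(t_stat, df, p_two_sided, check.pvalue)
```

The resulting degrees of freedom are usually not an integer, which the t-distribution handles without problems.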

### Example: Stopping Distance

Suppose $\mu_x$ and $\mu_y$ are the true mean stopping distances at 50 mph for cars of a certain type equipped with two different types of braking systems.

Do both systems have the same mean stopping distance? Use significance level $\alpha = 0.01$.

In [2]:
x = pd.Series([114, 114, 118, 118, 112, 111])
y = pd.Series([133, 120, 124, 120, 116, 121, 110])

* $H_0: \mu_x = \mu_y$
* $H_1: \mu_x \neq \mu_y$

### Student's t-test for means (two independent samples)

In [3]:
stat, pval = stats.ttest_ind(x, y)
pval

0.07663322278868127
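By default, `stats.ttest_ind` assumes equal variances (`equal_var=True`), so the p-value above comes from the pooled-variance test. In this data the sample standard deviations differ by more than a factor of two, so by the rule of thumb from the previous slides Welch's variant may be the closer match. The sketch below runs both variants and applies the decision rule with $\alpha = 0.01$; plain lists are used instead of `pd.Series` only to keep it self-contained:

```python
import statistics
from scipy import stats

x = [114, 114, 118, 118, 112, 111]
y = [133, 120, 124, 120, 116, 121, 110]
alpha = 0.01

# Ratio of the sample standard deviations (rule of thumb: compare with 2).
ratio = statistics.stdev(y) / statistics.stdev(x)
print(f"s_y / s_x = {ratio:.2f}")

pooled = stats.ttest_ind(x, y)                  # default: equal_var=True
welch = stats.ttest_ind(x, y, equal_var=False)  # Welch's t-test

for name, res in [("pooled", pooled), ("Welch", welch)]:
    decision = "reject H0" if res.pvalue < alpha else "do not reject H0"
    print(f"{name}: p-value = {res.pvalue:.4f} -> {decision}")
```

Both variants lead to the same decision here: $H_0$ is not rejected at $\alpha = 0.01$.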

### Student's t-test for means (two related samples)

Calculating the test statistic for Student's t-test:

$$t = \frac{\bar{x}_D - \mu_0}{s_D / \sqrt{n}}$$

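where $\bar{x}_D$ and $s_D$ are the mean and standard deviation of the pairwise differences, $\mu_0$ is the hypothesized mean difference (typically 0), and $n$ is the number of pairs.

Calculate the p-value based on the t-distribution with $n - 1$ degrees of freedom.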

### Example: Test Scores

Test scores for 20 students before and after a training program:

In [4]:
before = pd.Series([38, 53, 61, 27, 54, 55, 44, 45, 44, 41, 45, 40, 42, 51, 60, 49, 45, 41, 42, 74])
after = pd.Series([61, 55, 56, 65, 53, 46, 66, 50, 60, 51, 71, 55, 53, 55, 44, 48, 38, 47, 57, 55])

Does the training improve the students' scores on average? Use $\alpha = 5\%$.

* $H_0: \mu_\mathrm{after} \leq \mu_\mathrm{before}$
* $H_1: \mu_\mathrm{after} > \mu_\mathrm{before}$

### Student's t-test for means (two related samples)

In [5]:
stat, pval = stats.ttest_rel(after, before, alternative='greater')
pval

0.025676121095572903

In [6]:
stat, pval = stats.ttest_1samp(after - before, popmean=0, alternative='greater')
pval

0.025676121095572903
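Both calls give the same p-value because the paired test is a one-sample test on the pairwise differences. As a sketch, the statistic can also be computed by hand from the formula for related samples, with hypothesized mean difference 0 (plain lists are used here instead of `pd.Series`):

```python
import math
import statistics
from scipy import stats

before = [38, 53, 61, 27, 54, 55, 44, 45, 44, 41, 45, 40, 42, 51, 60, 49, 45, 41, 42, 74]
after = [61, 55, 56, 65, 53, 46, 66, 50, 60, 51, 71, 55, 53, 55, 44, 48, 38, 47, 57, 55]

# Pairwise differences (after - before).
diff = [a - b for a, b in zip(after, before)]
n = len(diff)

# t = (mean of differences - 0) / (std. dev. of differences / sqrt(n))
t_stat = (statistics.mean(diff) - 0) / (statistics.stdev(diff) / math.sqrt(n))

# One-sided p-value (H1: mean difference > 0), n - 1 degrees of freedom.
p_greater = stats.t.sf(t_stat, df=n - 1)
print(t_stat, p_greater)
```

Since the p-value is below $\alpha = 5\%$, $H_0$ is rejected.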

### Excursion: Confidence Interval for Difference of Means

Confidence interval:

$$ \bar{x}_1 - \bar{x}_2 \pm Z \cdot s $$

where $Z$ is the critical value (taken from the t-distribution for small samples) and $s$ is the standard error of the difference:

$$ s = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$

In [7]:
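def std_err(a, b):
    """Standard error of the difference of two independent sample means."""
    # Series.std() uses ddof=1 by default, i.e., the sample standard deviation
    return math.sqrt(a.std()**2 / a.size + b.std()**2 / b.size)

def critical_value_t(alpha, df):
    """Two-sided critical value of the t-distribution."""
    return stats.t.ppf(1 - alpha / 2, df)

def ttest_ind_confint(a, b, alpha: float = .05):
    """Confidence interval for the difference of means (unpooled standard error)."""
    x = a.mean() - b.mean() # center of CI
    n = (a.size + b.size) / 2 # mean sample size
    s = std_err(a, b) # standard error of the mean difference
    # Rough choice of degrees of freedom (mean sample size - 1);
    # the Welch-Satterthwaite equation is the usual alternative
    z = critical_value_t(alpha, df=n-1)
    half_length = s * z
    return (x - half_length, x + half_length)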

ci = ttest_ind_confint(x, y, alpha=0.05)
ci

(-13.402367156297982, 1.2595100134408437)
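The function above takes the degrees of freedom as the mean sample size minus one. A possible alternative, sketched here as an assumption rather than as the method used above, derives them from the Welch-Satterthwaite equation. The helper name `welch_confint` is hypothetical:

```python
import math
import statistics
from scipy import stats

def welch_confint(a, b, alpha=0.05):
    """CI for the difference of means with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1 = statistics.variance(a) / n1  # s_1^2 / n_1
    v2 = statistics.variance(b) / n2  # s_2^2 / n_2
    center = statistics.mean(a) - statistics.mean(b)
    std_error = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    half_length = stats.t.ppf(1 - alpha / 2, df) * std_error
    return center - half_length, center + half_length, df

x = [114, 114, 118, 118, 112, 111]
y = [133, 120, 124, 120, 116, 121, 110]

low, high, df = welch_confint(x, y, alpha=0.05)
print(low, high, df)
```

The degrees of freedom obtained this way always lie between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$.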

### Z-test for proportions (two samples)

Calculating the test statistic for the z-test for proportions:

$$z = \frac{p_1 - p_2}{\sqrt{p \cdot (1 - p) \cdot \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where:

$$p = \frac{p_1 \cdot n_1 + p_2 \cdot n_2}{n_1 + n_2}$$

Calculate the p-value based on the normal distribution.

### Example: Production Plants

Defective items produced on two different production lines:

* A: 12 defectives in a batch of 200, i.e., $12/200 = 0.06 = 6\%$
* B: 24 defectives in a batch of 300, i.e., $24/300 = 0.08 = 8\%$

Does B produce more defectives? Use $\alpha = 5\%$.

* $H_0: \pi_B \leq \pi_A$
* $H_1: \pi_B > \pi_A$

### Z-test for proportions (two samples)

In [8]:
from statsmodels.stats.proportion import proportions_ztest

stat, pval = proportions_ztest(count=[24, 12], nobs=[300, 200],
                               alternative='larger')
pval

0.1983361310972807
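The same p-value can be reproduced by hand with the pooled proportion in the standard error. This sketch uses `statistics.NormalDist` from the Python standard library in place of `scipy.stats.norm`; that substitution is only for self-containment:

```python
import math
from statistics import NormalDist

count_b, n_b = 24, 300  # line B
count_a, n_a = 12, 200  # line A

p_b = count_b / n_b
p_a = count_a / n_a

# Pooled proportion under H0 (both lines share the same defect rate).
p = (count_b + count_a) / (n_b + n_a)

# z = (p_B - p_A) / sqrt(p * (1 - p) * (1/n_B + 1/n_A))
z = (p_b - p_a) / math.sqrt(p * (1 - p) * (1 / n_b + 1 / n_a))

# One-sided p-value for H1: pi_B > pi_A (upper tail of the standard normal).
p_larger = 1 - NormalDist().cdf(z)
print(z, p_larger)
```

Since the p-value is above $\alpha = 5\%$, $H_0$ is not rejected.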

### Excursion: Confidence Interval for Difference of Proportions

Confidence interval:

$$ \hat{p_1} - \hat{p_2} \pm Z \cdot s $$

where $Z$ is the critical value and $s$ is the standard error:

$$ s = \sqrt{\frac{\hat{p_1} (1-\hat{p_1})}{n_1} + \frac{\hat{p_2} (1-\hat{p_2})}{n_2}}$$

In [9]:
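def proportions_std_err(count, nobs):
    """Unpooled standard error of the difference of two sample proportions."""
    p1 = count[0] / nobs[0]
    p2 = count[1] / nobs[1]
    return math.sqrt(p1 * (1 - p1) / nobs[0] + p2 * (1 - p2) / nobs[1])

def critical_value_norm(alpha):
    """Two-sided critical value of the standard normal distribution."""
    return stats.norm.ppf(1 - alpha / 2)

def proportions_ztest_confint(count, nobs, alpha: float = .05):
    """Confidence interval for the difference of proportions p1 - p2."""
    p1 = count[0] / nobs[0]
    p2 = count[1] / nobs[1]
    p = p1 - p2 # center of CI
    s = proportions_std_err(count, nobs) # standard error of the difference
    z = critical_value_norm(alpha)
    half_length = s * z
    return (p - half_length, p + half_length)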

ci = proportions_ztest_confint(count=[24, 12], nobs=[300, 200], alpha=0.05)
ci

(-0.02500810243477687, 0.06500810243477688)

### Recap: Hypothesis Testing

1. Answer the following questions, then select appropriate test:

* One or two samples?
* Mean or proportion?
* If mean: independent or related samples?
* Test for increase, decrease or any change?

2. If one sample, find reference value.

3. Execute test to find p-value and decide:

* reject $H_0$ when $pval < \alpha$
* do not reject $H_0$ when $pval \geq \alpha$
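The decision rule can be written as a small helper. The function name `decide` is hypothetical; the p-values are the rounded results of the three examples in this session:

```python
def decide(pval: float, alpha: float) -> str:
    """Apply the decision rule: reject H0 only if the p-value is below alpha."""
    return "reject H0" if pval < alpha else "do not reject H0"

# (rounded p-value, significance level) of the examples in this session
examples = {
    "stopping distance": (0.0766, 0.01),
    "test scores": (0.0257, 0.05),
    "production lines": (0.1983, 0.05),
}

decisions = {name: decide(pval, alpha) for name, (pval, alpha) in examples.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```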