# Tools and Methods of Data Analysis
## Session 7

Niels Hoppe <niels.hoppe.extern@srh.de>

In [75]:
import math
import pandas as pd
from scipy import stats

### Example: Stopping Distance

Suppose $\mu_x$ and $\mu_y$ are the true mean stopping distances at 50 mph for cars of a certain type equipped with two different braking systems.

Do both systems have the same mean stopping distance? Use a significance level of $\alpha = 0.01$.

In [76]:
x = pd.Series([114, 114, 118, 118, 112, 111])
y = pd.Series([133, 120, 124, 120, 116, 121, 110])

* $H_0: \mu_x = \mu_y$
* $H_1: \mu_x \neq \mu_y$

### T-Test with Two Independent Samples

In [77]:
stat, pval = stats.ttest_ind(x, y)
pval

0.07663322278868127
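By default, `stats.ttest_ind` pools the two sample variances, i.e. it assumes equal population variances. As a minimal sketch, the cell below repeats the test with `equal_var=False` (Welch's t-test) and applies the decision rule at $\alpha = 0.01$ to both variants. The variable names `pooled`, `welch` and `decisions` are chosen for illustration only.

```python
import pandas as pd
from scipy import stats

x = pd.Series([114, 114, 118, 118, 112, 111])
y = pd.Series([133, 120, 124, 120, 116, 121, 110])
alpha = 0.01

# Pooled-variance test (the default used above) and Welch's variant,
# which does not assume equal population variances.
pooled = stats.ttest_ind(x, y)
welch = stats.ttest_ind(x, y, equal_var=False)

# Decision rule: reject H0 only when the p-value is below alpha.
decisions = {}
for name, res in [("pooled", pooled), ("welch", welch)]:
    decisions[name] = "reject H0" if res.pvalue < alpha else "do not reject H0"
    print(f"{name}: p-value = {res.pvalue:.4f} -> {decisions[name]}")
```

With the p-value of about 0.077 shown above, $H_0$ is not rejected at the 1% level.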

### Example: Test Scores

Test scores for 20 students before and after a training program:

In [78]:
before = pd.Series([38, 53, 61, 27, 54, 55, 44, 45, 44, 41, 45, 40, 42, 51, 60, 49, 45, 41, 42, 74])
after = pd.Series([61, 55, 56, 65, 53, 46, 66, 50, 60, 51, 71, 55, 53, 55, 44, 48, 38, 47, 57, 55])

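Does the training improve the students' scores on average? Use $\alpha = 5\%$. Here $\mu_0$ denotes the mean score before and $\mu_1$ the mean score after the training.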

* $H_0: \mu_1 \leq \mu_0$
* $H_1: \mu_1 > \mu_0$

### (Paired) T-Test with Two Related Samples

In [79]:
stat, pval = stats.ttest_rel(after, before, alternative='greater')
pval

0.025676121095572903
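A paired t-test is equivalent to a one-sample t-test on the per-student differences, tested against a mean difference of 0. The following sketch checks this with `stats.ttest_1samp`; the names `diff`, `p_rel` and `p_one` are illustrative.

```python
import math
import pandas as pd
from scipy import stats

before = pd.Series([38, 53, 61, 27, 54, 55, 44, 45, 44, 41,
                    45, 40, 42, 51, 60, 49, 45, 41, 42, 74])
after = pd.Series([61, 55, 56, 65, 53, 46, 66, 50, 60, 51,
                   71, 55, 53, 55, 44, 48, 38, 47, 57, 55])

# Per-student improvement (positive values mean a higher score afterwards).
diff = after - before

# Paired test on the two related samples ...
p_rel = stats.ttest_rel(after, before, alternative='greater').pvalue
# ... and one-sample test of the differences against 0.
p_one = stats.ttest_1samp(diff, 0, alternative='greater').pvalue

print(p_rel, p_one)
print("same p-value:", math.isclose(p_rel, p_one))
```

Since the p-value of about 0.026 is below $\alpha = 0.05$, $H_0$ is rejected.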

### Excursion: Confidence Interval for Difference of Means

Confidence interval:

$$ \bar{x}_1 - \bar{x}_2 \pm Z \cdot s $$

where $Z$ is the critical value (taken from the t-distribution in the code below) and $s$ is the standard error of the difference:

$$ s = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$

In [80]:
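def std_err(a, b):
    # Standard error of the difference of two independent sample means.
    # pandas' std() is the sample standard deviation (ddof=1).
    return math.sqrt(a.std()**2 / a.size + b.std()**2 / b.size)

def critical_value_t(alpha, df):
    # Two-sided critical value of the t-distribution.
    return stats.t.ppf(1 - alpha / 2, df)

def ttest_ind_confint(a, b, alpha: float = .05):
    diff = a.mean() - b.mean() # center of CI
    n = (a.size + b.size) / 2 # mean sample size
    s = std_err(a, b) # standard error of the mean difference
    # Simplification: the degrees of freedom are approximated
    # from the mean sample size.
    z = critical_value_t(alpha, df=n-1)
    half_length = s * z
    return (diff - half_length, diff + half_length)

ci = ttest_ind_confint(x, y, alpha=0.05)
ci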

(-13.402367156297982, 1.2595100134408437)
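The helper above derives the degrees of freedom from the mean sample size. A common alternative for the unpooled standard error is the Welch-Satterthwaite approximation. The sketch below is one possible implementation under that assumption; `welch_df` and `welch_confint` are hypothetical helper names, not part of SciPy.

```python
import math
import pandas as pd
from scipy import stats

x = pd.Series([114, 114, 118, 118, 112, 111])
y = pd.Series([133, 120, 124, 120, 116, 121, 110])

def welch_df(a, b):
    # Welch-Satterthwaite approximation of the degrees of freedom.
    va = a.var() / a.size  # squared standard error of mean(a)
    vb = b.var() / b.size  # squared standard error of mean(b)
    return (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

def welch_confint(a, b, alpha=0.05):
    diff = a.mean() - b.mean()                           # center of the CI
    se = math.sqrt(a.var() / a.size + b.var() / b.size)  # unpooled standard error
    t_crit = stats.t.ppf(1 - alpha / 2, welch_df(a, b))  # two-sided critical value
    return (diff - t_crit * se, diff + t_crit * se)

df = welch_df(x, y)
lower, upper = welch_confint(x, y, alpha=0.05)
print(df, (lower, upper))
```

The approximated degrees of freedom always lie between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$, so the resulting interval can be narrower or wider than the one above depending on the data.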

### Example: Production Plants

Defective items produced on two different production lines:

* A: 12 defectives in a batch of 200
$$12/200 = 0.06 = 6\%$$
* B: 24 defectives in a batch of 300
$$24/300 = 0.08 = 8\%$$

Does B produce more defectives? Use $\alpha = 5\%$.

* $H_0: \pi_B \leq \pi_A$
* $H_1: \pi_B > \pi_A$

### Proportions Test with Two Samples

In [81]:
from statsmodels.stats.proportion import proportions_ztest

stat, pval = proportions_ztest(count=[24, 12], nobs=[300, 200],
                               alternative='larger')
pval

0.1983361310972807
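To see where this p-value comes from, the test statistic can be recomputed by hand. The sketch assumes that `proportions_ztest` uses the pooled proportion under $H_0$ for the standard error (its default behaviour, to the best of my knowledge); all variable names are illustrative.

```python
import math
from scipy import stats

count_b, nobs_b = 24, 300  # line B
count_a, nobs_a = 12, 200  # line A

p_b = count_b / nobs_b
p_a = count_a / nobs_a

# Pooled proportion: under H0 both lines share the same defect rate.
p_pool = (count_a + count_b) / (nobs_a + nobs_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs_a + 1 / nobs_b))

# z statistic and one-sided ("larger") p-value from the upper tail.
z = (p_b - p_a) / se_pool
pval_manual = stats.norm.sf(z)
print(z, pval_manual)
```

The result should agree with the p-value of about 0.198 reported above, so $H_0$ is not rejected at $\alpha = 5\%$.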

### Excursion: Confidence Interval for Difference of Proportions

Confidence interval:

$$ \hat{p}_1 - \hat{p}_2 \pm Z \cdot s $$

where $Z$ is the critical value and $s$ is the standard error:

$$ s = \sqrt{\frac{\hat{p}_1 (1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1-\hat{p}_2)}{n_2}}$$

In [82]:
def proportions_std_err(count, nobs):
    p1 = count[0] / nobs[0]
    p2 = count[1] / nobs[1]
    return math.sqrt(p1 * (1 - p1) / nobs[0] + p2 * (1 - p2) / nobs[1])

def critical_value_norm(alpha):
    return stats.norm.ppf(1 - alpha / 2)

def proportions_ztest_confint(count, nobs, alpha: float = .05):
    p1 = count[0] / nobs[0]
    p2 = count[1] / nobs[1]
    p = p1 - p2
    s = proportions_std_err(count, nobs)
    z = critical_value_norm(alpha)
    half_length = s * z
    return (p - half_length, p + half_length)

ci = proportions_ztest_confint(count=[24, 12], nobs=[300, 200], alpha=0.05)
ci

(-0.02500810243477687, 0.06500810243477688)

### Recap: Hypothesis Testing

1. Answer the following questions, then select the appropriate test:

* One or two samples?
* Mean or proportion?
* If mean: independent or related populations?
* Test for an increase, a decrease, or any change?

2. If there is one sample, find the reference value.

3. Run the test to find the p-value and decide:

* reject $H_0$ when $pval < \alpha$
* do not reject $H_0$ when $pval \geq \alpha$
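The decision rule in step 3 can be sketched as a small helper and applied to the three p-values obtained in this session. The function `decide` and the dictionary `results` are hypothetical names introduced for illustration; the p-values and significance levels are copied from the examples above.

```python
def decide(pval: float, alpha: float) -> str:
    # Reject H0 only when the p-value is strictly below alpha.
    return "reject H0" if pval < alpha else "do not reject H0"

# (p-value, alpha) pairs taken from the examples of this session.
results = {
    "stopping distance": (0.07663322278868127, 0.01),
    "test scores": (0.025676121095572903, 0.05),
    "production lines": (0.1983361310972807, 0.05),
}

decisions = {name: decide(p, a) for name, (p, a) in results.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```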