*Constants*

In [1]:
import random
import math
import scipy.stats as stats

SIM_COUNT = 10000

# Statistical Tests
This notebook is a summary of the most common statistical tests used in data science. The goal is to provide a quick reference for the most common tests and when to use them. Each test has a brief explanation detailing its assumptions, null hypothesis, and when to use it. There will also be a code implementation of each test using Python.

## Table of Contents
1. [Z-Test](#z-test)
2. [Formulas](#formulas)

## Z-Test
A z-test is a statistical test used to determine whether a sample mean differs significantly from a population mean, or a sample proportion from a population proportion. It assumes normality: the population from which the sample is drawn should be normally distributed or, when working with proportions, the sample size should be large enough for the sample proportion to be approximately normally distributed. The population variance should also be known or well estimated. The z-score measures how many standard errors the observed sample value lies from the hypothesized population value. For a two-tailed test at the $0.05$ significance level, the critical values are $\pm 1.96$: if the z-score is greater than $1.96$ or less than $-1.96$, the null hypothesis can be rejected. The formulas below cover the proportion case, where $\hat{p}$ is the sample proportion, $p_0$ is the hypothesized proportion, and $n$ is the sample size.

$$z = \frac{\hat{p} - p_0}{\text{Standard Error}}$$

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$
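
The $\pm 1.96$ critical value quoted above is the 97.5th percentile of the standard normal distribution. As a minimal sketch, it can be recovered with `scipy.stats.norm.ppf`; the helper name `critical_z` is hypothetical and not part of this notebook.

```python
import scipy.stats as stats

def critical_z(alpha=0.05, two_tailed=True):
    """Return the positive critical z-value for significance level alpha.

    Hypothetical helper for illustration only.
    """
    # A two-tailed test splits alpha evenly between the two tails.
    tail = alpha / 2 if two_tailed else alpha
    return stats.norm.ppf(1 - tail)

print(round(critical_z(0.05), 2))                    # 1.96
print(round(critical_z(0.05, two_tailed=False), 3))  # 1.645
```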

### Hypotheses
$H_0 : p = p_0$

The null hypothesis is the assumption that there is no significant difference between the sample mean/proportion and the population mean/proportion. Typically, it states that the true mean/proportion is equal to the hypothesized value ($p = p_0$).

$H_A : p \neq p_0$

The alternative hypothesis states that the sample mean/proportion is different from the population mean/proportion. One-tailed tests can also be used to determine if the sample mean/proportion is greater or less than the population mean/proportion.
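
The choice between a two-tailed and a one-tailed alternative only changes how the z-score is converted to a p-value. The sketch below illustrates this under the assumption that the z-score has already been computed; the function name `z_p_value` and its `alternative` labels are hypothetical.

```python
import scipy.stats as stats

def z_p_value(z, alternative="two-sided"):
    """Convert a z-score to a p-value for the chosen alternative hypothesis.

    Hypothetical helper for illustration only.
    """
    if alternative == "two-sided":   # H_A: p != p_0
        return 2 * stats.norm.sf(abs(z))
    if alternative == "greater":     # H_A: p > p_0
        return stats.norm.sf(z)
    if alternative == "less":        # H_A: p < p_0
        return stats.norm.cdf(z)
    raise ValueError(f"unknown alternative: {alternative}")

print(z_p_value(1.96))             # close to 0.05
print(z_p_value(1.96, "greater"))  # close to 0.025
```
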
### Use
- For large sample sizes ($n > 30$).
- When the variance/standard deviation of the population is known (or can be estimated well).
- For proportion testing when comparing an observed proportion to a hypothesized proportion.
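
For the mean case with a known population standard deviation, the same idea applies with the standard error $\sigma / \sqrt{n}$. The following is a minimal sketch, not part of the original notebook; the function name `ztest_mean` and the example numbers are made up for illustration.

```python
import math
import scipy.stats as stats

def ztest_mean(sample_mean, pop_mean, pop_std, n):
    """One-sample, two-tailed z-test for a mean with known population std.

    Hypothetical helper for illustration only.
    """
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    return z, p

# Made-up numbers: sample mean 102 against a hypothesized mean of 100.
z, p = ztest_mean(sample_mean=102.0, pop_mean=100.0, pop_std=15.0, n=225)
print(z, p)  # z is 2.0 for these inputs
```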

In [7]:
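def ztest_proportion(obs, exp, n):
    """Two-tailed z-test comparing an observed proportion to an expected one."""
    # Note: the standard error here uses the observed proportion (obs),
    # not the hypothesized proportion p_0 (exp) that appears in the formula.
    z = (obs - exp) / math.sqrt(obs * (1 - obs) / n)
    # Two-tailed p-value from the standard normal distribution
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p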

ztest_proportion(0.4, 1/3, 1000)

(4.303314829119355, 1.6826148341975156e-05)
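
The written formula computes the standard error from the hypothesized proportion $p_0$, while the cell above computes it from the observed proportion. As a hedged sketch, a variant that follows the written formula could look like this; the name `ztest_proportion_null_se` is hypothetical.

```python
import math
import scipy.stats as stats

def ztest_proportion_null_se(obs, exp, n):
    """z-test for a proportion with the standard error computed from p_0.

    Hypothetical variant for illustration only.
    """
    # Standard error under H_0: sqrt(p_0 * (1 - p_0) / n)
    standard_error = math.sqrt(exp * (1 - exp) / n)
    z = (obs - exp) / standard_error
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    return z, p

z, p = ztest_proportion_null_se(0.4, 1/3, 1000)
print(z, p)  # z is about 4.47 for these inputs
```

Both versions reject the null hypothesis for this input; they differ only in which proportion is plugged into the standard error.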

## Formulas
- **Mean**: $$\mu = \frac{\sum_{i=1}^{n} x_i}{n}$$
- **Variance**: $$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
- **Standard Deviation**: $$\sigma = \sqrt{\sigma^2}$$
- **Covariance**: $$Cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x) * (y_i - \mu_y)}{n}$$
- **Correlation**: $$Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_x * \sigma_y}$$
- **Least Squares Regression Line**: $$y = mx + b$$
- **Least Squares Regression Slope**: $$m = \frac{\sum_{i=1}^{n} (x_i - \mu_x) * (y_i - \mu_y)}{\sum_{i=1}^{n} (x_i - \mu_x)^2}$$
- **Least Squares Regression Intercept**: $$b = \mu_y - m * \mu_x$$
- **Least Squares Regression Line Prediction**: $$\hat{y} = m * x + b$$
- **Least Squares Regression Line Residual**: $$e = y - \hat{y}$$
- **Least Squares Regression Line Residual Sum of Squares**: $$RSS = \sum_{i=1}^{n} e_i^2$$
- **Least Squares Regression Line Total Sum of Squares**: $$TSS = \sum_{i=1}^{n} (y_i - \mu_y)^2$$
- **Least Squares Regression Line Explained Sum of Squares**: $$ESS = \sum_{i=1}^{n} (\hat{y_i} - \mu_y)^2$$
- **Least Squares Regression Line R-Squared**: $$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
- **Least Squares Regression Line Residual Standard Error**: $$s = \sqrt{\frac{RSS}{n-2}}$$
- **Least Squares Regression Line Standard Error of the Slope**: $$SE = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}}$$
- **Least Squares Regression Line Confidence Interval**: $$CI = m \pm t_{\alpha/2,\, n-2} * SE$$
- **Least Squares Regression Line Prediction Interval**: $$PI = \hat{y} \pm t_{\alpha/2,\, n-2} * s * \sqrt{1 + \frac{1}{n} + \frac{(x - \mu_x)^2}{\sum_{i=1}^{n} (x_i - \mu_x)^2}}$$
- **Least Squares Regression Line Hypothesis Test**: $$t = \frac{m - 0}{SE}$$
- **Least Squares Regression Line Hypothesis Test P-Value**: $$P = 2 * (1 - \text{CDF}_{t_{n-2}}(|t|))$$ where $\text{CDF}_{t_{n-2}}$ is the cumulative distribution function of the t-distribution with $n-2$ degrees of freedom
- **Least Squares Regression Line Hypothesis Test F-Statistic**: $$F = \frac{ESS/k}{RSS/(n-k-1)}$$
- **Least Squares Regression Line Hypothesis Test F-Statistic P-Value**: $$P = 1 - \text{CDF}_{F_{k,\, n-k-1}}(F)$$ where $k$ is the number of predictors ($k = 1$ for a single regression line) and $\text{CDF}_{F_{k,\, n-k-1}}$ is the cumulative distribution function of the F-distribution with $k$ and $n-k-1$ degrees of freedom
- **Least Squares Regression Line Hypothesis Test Chi-Squared Statistic**: $$\chi^2 = \frac{n * R^2}{1 - R^2}$$
- **Least Squares Regression Line Hypothesis Test Chi-Squared Statistic P-Value**: $$P = 1 - \text{CDF}_{\chi^2_1}(\chi^2)$$ where $\text{CDF}_{\chi^2_1}$ is the cumulative distribution function of the chi-squared distribution with $1$ degree of freedom (a large-sample approximation for a single predictor)
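- **Least Squares Regression Line Hypothesis Test Z-Statistic** (large-sample approximation, with $SE$ the standard error of the slope): $$Z = \frac{m - 0}{SE}$$
- **Least Squares Regression Line Hypothesis Test Z-Statistic P-Value**: $$P = 2 * (1 - \Phi(|Z|))$$ where $\Phi$ is the standard normal cumulative distribution function
- **Least Squares Regression Line Hypothesis Test Z-Statistic Confidence Interval**: $$CI = m \pm z_{\alpha/2} * SE$$
- **Least Squares Regression Line Hypothesis Test Z-Statistic Prediction Interval** (approximate, for large $n$): $$PI = \hat{y} \pm z_{\alpha/2} * \sqrt{\frac{RSS}{n-2}}$$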
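
The regression formulas above can be sketched in plain Python as follows. This is a minimal illustration, not part of the original notebook: the function name `simple_linear_regression`, the returned dictionary keys, and the sample data are all made up. The standard error of the slope is taken to be the residual standard error divided by $\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}$, which is the usual formula for a single predictor.

```python
import math
import scipy.stats as stats

def simple_linear_regression(xs, ys):
    """Fit y = m*x + b by least squares and test H_0: m = 0.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    m = sxy / sxx             # slope
    b = mean_y - m * mean_x   # intercept

    preds = [m * x + b for x in xs]
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, preds))
    tss = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - rss / tss

    # Residual standard error, then standard error of the slope
    resid_se = math.sqrt(rss / (n - 2))
    slope_se = resid_se / math.sqrt(sxx)

    # Two-tailed t-test of the slope with n - 2 degrees of freedom
    t = m / slope_se
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)

    return {"slope": m, "intercept": b, "r_squared": r_squared,
            "slope_se": slope_se, "t": t, "p_value": p_value}

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_linear_regression(xs, ys)
print(fit)
```

The slope, intercept, slope standard error, and p-value can be cross-checked against `scipy.stats.linregress(xs, ys)`, which reports the same quantities.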
