# Essential Statistics

## Descriptive Statistics

Summarizing data is important to understanding it at scale, and descriptive statistics help us to do so.

----
<i>"Statistics is a broad set of algorithms for transforming numerical data into a small set of interpretable values that describe the world."</i>
- Modern Statistics, Mike X. Cohen
---

#### Mean, Median, and Mode

The mean of a series of numbers is the average, the sum divided by the count.

<h4>$\bar{x} ~~=~~ \frac{ \sum_{i=1}^n x_i }{n} ~~= ~~\frac{1}{n} \sum_{i=1}^n x_i ~~=~~ n^{-1} \sum_{i=1}^n x_i$</h4>

- $\bar{x}$ is the mean of the series of numbers $x$
- $\sum_{i=1}^n$ indicates a summation of all $x_i$ in $x$
- $n$ is the number of observations
- $i$ is the index of a particular data point of $x$
- $x_i$ is a data point of $x$ with index $i$
- $1/n$ is a multiplication equivalent to dividing by $n$
- $n^{-1}$ is equal to $1/n$

The median is the data point in the middle of the list when you sort the series of numbers $x$ by value (or the average of the two middle points when $n$ is even). It can be preferable to the mean as a measure of central tendency, as it is far less sensitive to outliers.

The mode is the value observed most frequently in the series of numbers, and can be tied amongst two or more values. With an unskewed Gaussian/normal distribution, the median and mode happen to be equal to the mean.
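As a quick illustration of these three measures, here is a minimal sketch using Python's standard `statistics` module. The data values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical data, chosen only for illustration.
x = [2, 3, 3, 5, 7, 10]

mean_x = statistics.mean(x)        # sum(x) / len(x)
median_x = statistics.median(x)    # average of the two middle values when n is even
modes_x = statistics.multimode(x)  # list of every value tied for most frequent

# Swap the largest value for an extreme outlier: the mean shifts substantially,
# while the median is unchanged for this particular dataset.
x_outlier = x[:-1] + [1000]
mean_outlier = statistics.mean(x_outlier)
median_outlier = statistics.median(x_outlier)

print(mean_x, median_x, modes_x)
print(mean_outlier, median_outlier)
```

`multimode` is used rather than `mode` because it returns all tied values, which matches the note that the mode can be shared by two or more values.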

#### Percentile and Quartile

You can think of the concept of percentiles as a generalization of the concept of the median, as the median represents the value at the $50^{th}$ percentile. Other special cases of the percentile are the values $25\%$ and $75\%$ of the way down the list of sorted values, and together with the median, these are called quartiles, because they divide the data into four sections with an equal number of observations. The range between the $25^{th}$ and $75^{th}$ percentiles is called the interquartile range (IQR).

The percentiles at multiples of $10\%$ are called deciles, and quantile is the general term for any such cut point in the sorted data.
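A minimal sketch of quartiles, IQR, and deciles using `statistics.quantiles` from the standard library follows. The data are hypothetical. Note that different tools use slightly different interpolation conventions, so other software may report slightly different cut points for the same data.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [15, 3, 8, 21, 1, 12, 9, 30, 5, 18, 7]

# quantiles() sorts the data and returns the n-1 cut points that split it
# into n groups; n=4 gives the three quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# n=10 gives the nine deciles.
deciles = statistics.quantiles(data, n=10)

print(q1, q2, q3, iqr)
print(deciles)
```

The second quartile `q2` coincides with `statistics.median(data)`, which is the sense in which percentiles generalize the median.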

#### Variance

Variance is a measure of how much spread there is around the expected value, and is calculated as the average of squared differences from the expected value. The differences are squared because we don't want negative differences to offset the positive ones; squaring lets them combine to give us an idea of average distance regardless of direction. A nuance is that when dealing with a sample rather than a population, we divide by $n-1$ instead of $n$.

<i>Population Variance:</i>

<h4>$\sigma^2 = \frac{\sum(x - \mu)^2}{n}$</h4>

<i>Sample Variance:</i>

<h4>$s^2 = \frac{\sum(x - \bar{x})^2}{n-1}$</h4>

- $\sigma^2$ is the population variance
- $s^2$ is the sample variance
- $x$ is the value of an instance in the dataset
- $\mu$ is the population average value
- $\bar{x}$ is the sample average value
- $n$ is the number of observations

#### Standard Deviation

The squared values of variance make for an interpretability issue - the units of variance are not in the same unit of measurement of the data. This is why we commonly speak in terms of standard deviation - the square root of variance - which brings us back to the same units as the data.

<i>Population Standard Deviation:</i>

<h4>$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum(x - \mu)^2}{n}}$</h4>

<i>Sample Standard Deviation:</i>

<h4>$s = \sqrt{s^2} = \sqrt{\frac{\sum(x - \bar{x})^2}{n-1}}$</h4>

- $\sigma$ is the population standard deviation
- $s$ is the sample standard deviation
- $x$ is the value of an instance in the dataset
- $\mu$ is the population average value
- $\bar{x}$ is the sample average value
- $n$ is the number of observations
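To make the $n$ versus $n-1$ distinction concrete, here is a small sketch with the standard `statistics` module. The data are hypothetical, and the by-hand calculation is included only to show that it matches the population formula.

```python
import statistics

# Hypothetical data, chosen only for illustration.
x = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(x)    # divides the sum of squares by n
sample_var = statistics.variance(x)  # divides the sum of squares by n - 1
pop_sd = statistics.pstdev(x)        # square root of the population variance
sample_sd = statistics.stdev(x)      # square root of the sample variance

# The same population variance computed directly from the formula.
n = len(x)
mean_x = sum(x) / n
sum_of_squares = sum((xi - mean_x) ** 2 for xi in x)
pop_var_by_hand = sum_of_squares / n

print(pop_var, sample_var, pop_sd, sample_sd)
```

The sample versions are always slightly larger than the population versions for the same data, and the gap shrinks as $n$ grows.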

#### Covariance and Correlation

You are likely familiar with the concept of correlation - the degree to which the behavior of one variable explains another. To understand correlation, it helps to understand covariance, the unscaled version of correlation. While correlation ranges from $-1$ to $+1$, covariance is unbounded and its magnitude depends on the units of the two variables; correlation is just scaled covariance.

Covariance is the sum of the product of differences from the mean for two variables, scaled by the number of observations $n$, or $n-1$ if sampling. Two coinciding positive differences make for a large contribution to covariance, as do two coinciding negative differences, through multiplication.

<i>Population Covariance:</i>

<h4>$cov(x,y) = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{n}$</h4>

<i>Sample Covariance:</i>

<h4>$cov(x,y) = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{n-1}$</h4>

- $x$ is a vector (series) of values
- $y$ is a vector of values
- $x_i$ is an individual data point of $x$
- $\bar{x}$ is the average value of $x$
- $y_i$ is an individual data point of $y$
- $\bar{y}$ is the average value of $y$
- $n$ is the number of observations

Back to correlation - correlation is covariance with the numerator scaled by the product of standard deviations for the two variables, rather than $n$ or $n-1$.

<h4>$r = \frac{ \sum (x_i - \bar{x}) (y_i - \bar{y}) }{ \sqrt{\sum(x_i - \bar{x})^2} \sqrt{\sum(y_i - \bar{y})^2} }$</h4>

- $x$ is a vector of values
- $y$ is a vector of values
- $x_i$ is an individual data point of $x$
- $\bar{x}$ is the average value of $x$
- $y_i$ is an individual data point of $y$
- $\bar{y}$ is the average value of $y$
- $n$ is the number of observations
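The formulas above can be sketched directly in Python. The function names and data here are hypothetical, written only to mirror the equations. The last few lines rescale $x$ to show that covariance changes with units while correlation does not.

```python
import math

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson r: the same numerator, scaled by the root sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
r_xy = correlation(x, y)

# Multiply x by 10 (e.g. a change of units): covariance scales, r does not.
x_scaled = [10 * a for a in x]
cov_scaled = sample_covariance(x_scaled, y)
r_scaled = correlation(x_scaled, y)

print(cov_xy, cov_scaled)
print(r_xy, r_scaled)
```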

#### Z-Scores and Standardization

Z-Scores indicate how many standard deviations away a data point is from the mean of the dataset.

<h4>$z = \frac{x_i - \mu}{\sigma}$</h4>

- $z$ is the distance in standard deviations of $x_i$ from the mean of $x$, $\mu$
- $x_i$ is an individual data point of $x$
- $\mu$ is the average value of $x$
- $\sigma$ is the standard deviation of $x$

Z-scores can be helpful for standardization, a form of scaling in which data points are expressed in terms of the number of standard deviations away from the mean.

Another way to scale data is with min/max scaling, which follows the following formula:

<h4>$scaled ~x_i = \frac{x_i - min(x)}{max(x) - min(x)}$</h4>

- $scaled ~x_i$ is the proportion of range from the minimum to maximum of the vector $x$
- $x_i$ is an individual instance of $x$
- $min(x)$ is the minimum value of $x$
- $max(x)$ is the maximum value of $x$
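Both scaling approaches are short enough to sketch in a few lines. The data are hypothetical, and the population standard deviation is used here to match the $\sigma$ in the z-score formula.

```python
import statistics

# Hypothetical data, chosen only for illustration.
x = [2, 4, 4, 4, 5, 5, 7, 9]

# Standardization: express each point as standard deviations from the mean.
mu = statistics.mean(x)
sigma = statistics.pstdev(x)
z_scores = [(xi - mu) / sigma for xi in x]

# Min/max scaling: express each point as a proportion of the range.
lowest, highest = min(x), max(x)
min_max_scaled = [(xi - lowest) / (highest - lowest) for xi in x]

print(z_scores)
print(min_max_scaled)
```

Standardized values have mean 0 and can fall outside $[-1, 1]$, while min/max scaled values always fall between 0 and 1.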

## Statistical Hypothesis Testing

#### Inferential Statistics

Inferential statistics refers to algorithms that are applied to one or more sample datasets in order to test whether the descriptive statistics are likely to generalize to another dataset.

Probability distributions provide the mathematical framework for describing the likelihood of events in a random process or experiment. Methods like maximum likelihood estimation use analytical functions to find parameters that best fit observed data to a theoretical distribution.

----
<i>It would not be controversial to claim that inferential statistics is basically just applied probability.</i>
- Modern Statistics, Mike X. Cohen
----

#### Sampling and Sample Size

Sample size is the number of observations in a dataset. It is often denoted by N, but also by n. We can generalize about the population from a sample only if the sample is random, representative, and sufficiently large. An appropriate sample size depends on a number of factors, including effect size, variability in the sample and population, how closely the sample characteristics match the population, and how the samples were collected.

#### The Gaussian (Normal) and t-Distribution

Going forward, we will assume either a Gaussian/normal distribution, reflecting an assumption of known population parameters, or the t-distribution, reflecting smaller sample sizes. The Gaussian distribution represents many natural phenomena, and many other distributions (including the t-distribution) converge to it when mixed with others, or when sample size is high.

The parameters of the Gaussian distribution are mean $\mu$ and variance $\sigma^2$, and the parameter of the t-distribution is the degrees of freedom, symbolized by $\nu$ (pronounced nu), which for one-sample and paired tests, is set to $n-1$ (where n is the number of observations). We'll talk more about the t-distribution and degrees of freedom when we get to t-tests.

#### p-Values

If testing the effect of an experiment, we expect to see that a sample mean from the experiment is unlikely to come from the assumed probability distribution, meaning that the difference in treatment/environment/etc. has had a significant effect.

A p-value (probability value) is the probability of observing an effect at least as extreme as the one in your data, assuming there is no true effect and only chance is at work. A significance level $\alpha$, reflecting the maximum threshold for the p-value, is commonly selected to be 0.05 or 0.01. This reflects the chance of making a 'type 1 error', a.k.a. a false positive, where a test incorrectly indicates that a condition is present when it is not. It is when the p-value is lower than the significance level that we consider a result to be significant, i.e., not likely due to random chance.

#### Null and Alternative Hypotheses

The null hypothesis $H_0$ is that there is no difference between the sampled data and the population. The drug had no effect, the sale provided no uplift, etc. When the p-value is less than the significance level $\alpha$, the null hypothesis $H_0$ is rejected in favor of the alternative hypothesis $H_A$ (or $H_1$). Rejecting the null hypothesis does not prove that the alternative hypothesis is true, but suggests there is sufficient evidence that the null hypothesis is unlikely to be true.

#### Critical Values

A critical value is defined in the context of the population distribution and a probability, and is used as a threshold for interpreting the result of a statistical test. The values in the population beyond the critical value are called the critical region or region of rejection.

A one-tailed test has a single critical value, on the left or the right of the distribution; if the calculated statistic is no more extreme than the critical value, we fail to reject the null hypothesis. A two-tailed test has two critical values, one on each side of the distribution, which is often assumed to be symmetrical. When using a two-tailed test, the significance level $\alpha$ is split between the two tails, so each critical value is calculated using $\alpha / 2$.
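For the standard normal distribution, critical values can be computed rather than looked up. The sketch below uses `statistics.NormalDist` from the standard library, with $\alpha = 0.05$ as an assumed significance level.

```python
from statistics import NormalDist

alpha = 0.05  # assumed significance level
std_normal = NormalDist(mu=0.0, sigma=1.0)

# One-tailed (right tail): all of alpha sits in a single tail.
z_one_tailed = std_normal.inv_cdf(1 - alpha)

# Two-tailed: alpha is split evenly, so each tail holds alpha / 2.
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)

print(round(z_one_tailed, 3), round(z_two_tailed, 3))
```

The two-tailed critical value is larger because each tail holds only half of $\alpha$, which pushes the threshold further from the mean.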

<img src="img/critical_region.png" style="height: 400px; width:auto;">


### The Z-Test

A z-test quantifies the probability of observing a value as extreme as, or more extreme than, a given value when drawing from a normal distribution. Notable p-z combinations include:
- 68.3% of data is between -1 and +1 standard deviations from the mean
- 95.5% of data is between -2 and +2 standard deviations from the mean
- 99.7% of data is between -3 and +3 standard deviations from the mean
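These coverage figures can be checked against the cumulative distribution function of the standard normal. This is a small sketch using `statistics.NormalDist`, and the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass between -k and +k standard deviations, for k = 1, 2, 3.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within +/-{k} sd: {p:.4f}")
```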

<img src="img/norm_dist.png" style="height: 350px; width:auto;">

A z-test for a population mean investigates the significance of the difference between an assumed population mean $\mu_0$ and a sample mean $\bar{x}$. For this test, we assume that the population variance $\sigma^2$ is known. If the variance is not known, then the t-test for a population mean should be used instead.

#### The Z-Statistic

The z-statistic, like the t-statistic, is equal to the difference between the sample mean and the hypothesized mean, divided by the standard error. The standard error of the mean is a descriptive statistic that you can compute from a sample, and estimates the precision with which that sample mean estimates the population mean. It is defined as the population standard deviation divided by the square root of the sample size.

$SE = \sigma / \sqrt{n}$

- $SE$ is the standard error of the mean
- $\sigma$ is the standard deviation
- $n$ is the number of observations

For a sample, we would use $s$, the sample standard deviation, instead of $\sigma$.

Back to the z-statistic: from a population with assumed mean $\mu_0$ and known variance $\sigma^2$, a random sample of size $n$ is taken and the sample mean $\bar{x}$ calculated. The test statistic,

<h4>$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$</h4>

- $z$ is the calculated test statistic
- $\bar{x}$ is the average value of variable $x$
- $\mu_0$ is the null hypothesis mean
- $\sigma$ is the standard deviation
- $n$ is the number of observations

may be compared with the standard normal distribution using either a one-tailed or two-tailed test (with critical region of size $\alpha$).

#### Example: Z-Test for Population Mean

A cosmetics filling process fills tubs of powder with $4$ grams on average with a standard deviation of $1$ gram. A sample of $9$ tubs is weighed, and the average weight is $4.6$ grams. What can be said about the filling process, at a significance level of $0.05$?

- $\bar{x} = 4.6$
- $\mu_0 = 4.0$
- $\sigma = 1.0$
- $n = 9$

<h4>$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{4.6-4.0}{1 / \sqrt{9}} = 1.8$</h4>

For a two-tailed test at $\alpha = 0.05$, the critical value is $z_{0.025} = 1.96$. The z-statistic is not as extreme as the critical value; our acceptance region for the null hypothesis is $-1.96$ to $1.96$, and so we fail to reject the null hypothesis.

However, if we are only concerned about over-filling and not under-filling, it becomes a one-tailed test instead of a two-tailed test, in which the acceptance region is now $z \lt 1.645$. Since $1.8 \gt 1.645$, we can reject the null hypothesis and suspect that the tubs are being over-filled.
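The same example can be reproduced in a few lines. This sketch uses the values from the problem statement and `statistics.NormalDist` to turn the z-statistic into one-tailed and two-tailed p-values.

```python
import math
from statistics import NormalDist

# Values from the cosmetics filling example.
x_bar, mu_0, sigma, n = 4.6, 4.0, 1.0, 9

standard_error = sigma / math.sqrt(n)
z = (x_bar - mu_0) / standard_error

std_normal = NormalDist()
p_one_tailed = 1 - std_normal.cdf(z)             # over-filling only
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))  # over- or under-filling

print(round(z, 2), round(p_one_tailed, 4), round(p_two_tailed, 4))
```

The one-tailed p-value falls below $0.05$ while the two-tailed p-value does not, which matches the two conclusions reached with critical values.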

### The T-Test

Like the z-test, which suits larger samples, a t-test can be used to investigate the significance of the difference between an assumed population mean $\mu_0$ and a sample mean $\bar{x}$, but it applies when the population variance is unknown. A random sample of size $n$ is taken and the sample mean $\bar{x}$ calculated as well as the sample standard deviation $s$. The test statistic,

<h4>$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$</h4>

- $t$ is the calculated test statistic
- $\bar{x}$ is the average value of variable $x$
- $\mu_0$ is the null hypothesis mean
- $s$ is the sample standard deviation
- $n$ is the number of observations

may be compared with the t-distribution (a.k.a. Student's t-distribution, as the originator used the pseudonym 'Student'), with $n-1$ degrees of freedom, and may be one-tailed or two-tailed. The t-distribution approaches the normal distribution as the sample size increases, but for smaller samples it has a lower peak and heavier tails.

<img src="img/t_dist.png" style="height: 350px; width:auto;">

The purpose of a t-test is to determine whether the mean of a sample is different from a specified null hypothesis $H_0$ value. There are three scenarios in which you would use a t-test:

1. One-Sample T-Test: you have one data sample, and the objective is to determine if the sample mean significantly deviates from a predetermined $H_0$ value.

2. Paired Samples T-Test: you have one group of individuals that were measured twice; for example, before and after a treatment.

3. Independent Samples T-Test: you have two separate groups of individuals and want to determine whether the means of the two groups differ.

#### Degrees of Freedom

If I tell you that the average of three numbers is 4, and that two of those numbers are 2 and 3, then you know that the third number is 7. In other words, the statistic has $n-1=2$ numbers that can vary before the last number is fixed. We say that there are 2 degrees of freedom.

Using $n-1$ as the degree of freedom for one parameter (variable) generalizes to using $n-k$ in the multivariate case, where $n$ is the number of observations and $k$ is the number of parameters.

The degrees of freedom associated with a t-test is $n-1$ or $n-2$ depending on whether there is one group or two. The degrees of freedom associated with a correlation analysis is $n-2$, and the degrees of freedom associated with a regression is $n-k$, where $k$ is the number of estimated parameters.

#### T-Test for a Population Mean

#### Example: T-Test for a Population Mean

A sample of $9$ plastic nuts yielded an average diameter of $3.1cm$ and estimated standard deviation of $1.0cm$. The population mean is assumed to be $4.0cm$. What does this say about the mean diameter of plastic nuts being produced?

- $\bar{x} = 3.1$
- $\mu_0 = 4.0$
- $s = 1.0$
- $n = 9$

<h4>$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{3.1 - 4.0}{1.0 / \sqrt{9}} = -2.7$</h4>

Our computed t value is $-2.7$ and the critical value is $t_{8; 0.025} = \pm 2.3$. The acceptance region is $-2.3$ to $2.3$, so we reject the null hypothesis in favor of the alternative hypothesis that the population mean differs from the assumed value of $4.0cm$.

#### Example 2: T-Test for a Population Mean

A teacher wants to know if the average exam score of her students is significantly different from the national average of $75$ points. There are $15$ students, and the grades are as follows:

$[80, 85, 90, 70, 75, 72, 88, 77, 82, 65, 79, 81, 74, 86, 68]$

$H_0: \mu = 75$

$H_A: \mu \neq 75$

Mean $\bar{X} = 78.1$

$s = 7.5$

The test statistic (computed from the unrounded mean and standard deviation) is:

$t_{14} = \frac{78.1-75}{7.5/\sqrt{15}} = 1.624$

$t_{14} = 1.624, p \approx 0.127$

Because the p-value is larger than $0.05$, we cannot reject $H_0$.
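The exam-score example can be reproduced with the standard library. Python's `statistics` module has no t-distribution, so this sketch compares the statistic against a critical value instead of computing a p-value; the constant `T_CRIT` is the usual two-tailed table value for 14 degrees of freedom at $\alpha = 0.05$ and is an assumption not stated in the text above.

```python
import math
import statistics

scores = [80, 85, 90, 70, 75, 72, 88, 77, 82, 65, 79, 81, 74, 86, 68]
mu_0 = 75  # national average under the null hypothesis

n = len(scores)
x_bar = statistics.mean(scores)
s = statistics.stdev(scores)  # sample standard deviation (n - 1)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1

# Assumed table lookup: two-tailed critical t for df = 14, alpha = 0.05.
T_CRIT = 2.145
reject_h0 = abs(t_stat) > T_CRIT

print(round(x_bar, 1), round(s, 1), round(t_stat, 3), reject_h0)
```

Using the unrounded mean and standard deviation is what yields $1.624$; plugging in the rounded $78.1$ and $7.5$ gives a slightly smaller value.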

#### T-Test for Two Population Means - Method of Paired Comparisons

This test is to investigate the significance of the difference between two population means, $\mu_1$ and $\mu_2$, making no assumption about the population variances. The observations for the two samples must be obtained in pairs. Apart from population differences, the observations in each pair should be carried out under identical or near-identical conditions.

The differences $d_i$ are formed for each pair of observations. If there are $n$ such pairs of observations, we can calculate the variance of the differences by:

<h4>$s^2 = \sum_{i=1}^n \frac{ (d_i - \bar{d})^2 }{n-1}$</h4>

- $s^2$ is the variance of the differences
- $\sum_{i=1}^n$ is a summation over the differences for each individual
- $d_i$ is the difference (such as before vs. after) for the individual indexed by $i$
- $\bar{d}$ is the average difference for all individuals

The test statistic becomes:

<h4>$t = \frac{ \bar{x}_1 - \bar{x}_2 }{s / \sqrt{n}} = \frac{\bar{d}-0}{s/\sqrt{n}}$</h4> 

- $t$ is the calculated test statistic
- $\bar{x}_1$ is the mean of the first group of observations
- $\bar{x}_2$ is the mean of the second group of observations
- $s$ is the standard deviation, the square root of the variance calculation above
- $n$ is the number of observations
- $\bar{d}$ is the mean difference between the two groups of observations

which follows the Student's t-distribution with $n-1$ degrees of freedom. The test may be one-tailed or two-tailed.

#### Example: Method of Paired Comparison

To compare the efficacy of two treatments for a respiratory condition, $10$ patients are randomly selected and the treatments are administered using a spray. Time is taken to ensure the treatments don't interact. We do not expect one particular treatment to be superior to the other so it is a two-tailed test.

data...





- $\bar{d} = \bar{x}_1 - \bar{x}_2 = -0.1$
- $s = 2.9$
- $n = 10$
- $\nu = 9$

<h4>$t = \frac{ \bar{x}_1 - \bar{x}_2 }{s / \sqrt{n}} = \frac{\bar{d}-0}{s/\sqrt{n}} = \frac{-0.1}{2.9 / \sqrt{10}} = -0.11$</h4>

The critical t-statistic is $t_{9; 0.025} = 2.26$. We do not reject the null hypothesis of no difference between means.

However, if the objective was to see whether one treatment is superior to another, a one-tailed t-test may be used...

#### Example 2: Method of Paired Comparisons

The effect of a treatment for milk production in dairy cows is tested. The milk yields were measured before and after.

Before = [27, 45, 38, 20, 22, 50, 40, 33, 18]

After = [31, 54, 43, 28, 21, 49, 41, 34, 20]

Diffs = [4, 9, 5, 8, -1, -1, 1, 1, 2]

- $\bar{x}_1 = 32.56$ (mean before)
- $\bar{x}_2 = 35.67$ (mean after)
- $\bar{d} = \bar{x}_2 - \bar{x}_1 = 3.11$
- $n = 9$

$s = \sqrt{ \frac{\sum_i (d_i - \bar{d})^2}{n-1} } = 3.655$

$t = \frac{\bar{d} - 0}{s / \sqrt{n}} = \frac{3.11 - 0}{3.655 / \sqrt{9}} = 2.553$

The two-tailed critical value for $n-1=8$ d.f. at $\alpha = 0.05$ is $t_{8; 0.025} = 2.306$.

The calculated value $t=2.553$ is more extreme than the critical value $2.306$, so the null hypothesis is rejected at the $\alpha=0.05$ level of significance; the positive mean difference suggests that the treatment increases milk yield.
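The milk-yield example can be checked with a short sketch that follows the paired-comparison formulas. The critical value is the one quoted in the example; everything else is computed from the data.

```python
import math
import statistics

before = [27, 45, 38, 20, 22, 50, 40, 33, 18]
after = [31, 54, 43, 28, 21, 49, 41, 34, 20]

# One difference per cow: after minus before.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
d_bar = statistics.mean(diffs)
s_d = statistics.stdev(diffs)  # sample standard deviation of the differences
t_stat = (d_bar - 0) / (s_d / math.sqrt(n))

T_CRIT = 2.306  # two-tailed critical value for 8 d.f., as quoted in the example
reject_h0 = abs(t_stat) > T_CRIT

print(diffs)
print(round(d_bar, 2), round(s_d, 3), round(t_stat, 3), reject_h0)
```

Working on the differences turns the paired test into a one-sample t-test against a hypothesized mean difference of zero.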

#### Independent Samples T-Test

The independent samples t-test evaluates whether the means of two groups significantly differ. The sample sizes may differ between the two groups. The formula that separates the sample sizes and variances is called Welch's t-test.

<h3>$t = \frac{ \bar{x} - \bar{y} }{ \sqrt{ \frac{s_x^2}{n_x} + \frac{s_y^2}{n_y} } }$</h3>

- $t$ is the calculated t-statistic
- $\bar{x}$ is the average of variable $x$
- $\bar{y}$ is the average of variable $y$
- $s_x$ is the sample standard deviation for $x$
- $s_y$ is the sample standard deviation for $y$
- $n_x$ is the number of observations in variable $x$
- $n_y$ is the number of observations in variable $y$

If the variances are roughly equal, the degrees of freedom are $n_x + n_y - 2$. If unequal, the formula is a little complicated:

<h3>$\text{d.f.} = \frac{ (s_x^2~/~n_x + s_y^2~/~n_y)^2 }{ \frac{s_x^4}{n_x^2(n_x-1)} + \frac{s_y^4}{n_y^2(n_y-1)} }$</h3>

#### Example: Independent Samples T-Test

$H_0: \mu_Q - \mu_N = 0$

$H_A: \mu_Q - \mu_N \neq 0$

Imagine that comprehension test scores range $0$ to $100$. $X_N$ and $X_Q$ indicate the comprehension scores in the noisy and quiet conditions.

$X_N = [60, 52, 90, 20, 33, 95, 18, 47, 78, 65]$

$X_Q = [65, 60, 84, 23, 37, 95, 17, 53, 88, 66]$

$\text{Diffs } (\Delta = X_Q - X_N) = [5, 8, -6, 3, 4, 0, -1, 6, 10, 1]$

The differences are taken element-wise, which treats the two lists as paired measurements, so the calculation below uses the paired-comparison statistic on the differences rather than Welch's formula.

- $\bar{x}_N = 55.80$
- $\bar{x}_Q = 58.80$
- $\bar{d} = \bar{x}_Q - \bar{x}_N = 3.00$
- $n = 10$

$s = \sqrt{ \frac{\sum_i (d_i - \bar{d})^2}{n-1} } = 4.690$

$t = \frac{\bar{d}-0}{s/\sqrt{n}} = \frac{3.00}{4.690 / \sqrt{10}} = 2.023$

The result is $t_9 = 2.023$, $p \approx 0.074$. The test is not statistically significant at $\alpha = 0.05$.

#### Example 2: Independent Samples T-Test

Two financial organizations are about to merge, and are considering the level of service duplication. Two sales teams responsible for essentially identical products are compared by selecting samples from each and reviewing their respective profit contribution levels per employee over a period of two weeks. These are found to be $3,166$ and $2,240.4$, with estimated variance of $6,328.27$ and $221,661.3$ respectively. How do the two teams compare?

- $n_1 = 4$
- $n_2 = 9$
- $\bar{x}_1 = 3,166$
- $\bar{x}_2 = 2,240.4$
- $s_1^2 = 6,328.67$
- $s_2^2 = 221,663.3$

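$t = \frac{ \bar{x}_1 - \bar{x}_2 }{ \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 } } = 5.72$ ($\nu = 9$ when rounded)

Critical value $t_{9;0.025} = 2.26$.

Since $5.72 \gt 2.26$, we reject $H_0$: the mean profit contributions of the two teams differ significantly.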
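Welch's statistic and its degrees of freedom can be computed from summary statistics alone. The function `welch_t` below is a hypothetical helper written for this illustration, not a library API. It is applied to the bulleted values from the example, using the standard Welch-Satterthwaite approximation for the degrees of freedom.

```python
import math

def welch_t(mean_1, var_1, n_1, mean_2, var_2, n_2):
    """Return (t, df) for Welch's t-test from group means, variances, and sizes."""
    a = var_1 / n_1  # squared standard error contributed by group 1
    b = var_2 / n_2  # squared standard error contributed by group 2
    t = (mean_1 - mean_2) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (n_1 - 1) + b ** 2 / (n_2 - 1))
    return t, df

# Bulleted values from the sales-team example.
t_stat, df = welch_t(3166.0, 6328.67, 4, 2240.4, 221663.3, 9)

print(round(t_stat, 2), round(df, 2))
```

The degrees of freedom come out just under $9$, well below the pooled value of $n_1 + n_2 - 2 = 11$, because the two variances are very unequal.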

### Chi-Square Test

The chi-square test is a one-sided test used for categorical features. Assume that for some categorical characteristic, the number of individuals in each of k categories has been counted. We want to determine if the numbers in the categories are significantly different from hypothetical numbers. 

Unless otherwise specified, the null hypothesis assumes all proportions are equal, and the alternative hypothesis is that there is at least one difference. The chi-square test statistic compares the observed frequency $y_i$ in each category with the expected frequency $E(y_i)$ under the null hypothesis; in a test of independence, the expected frequencies are derived from the marginal probabilities on the assumption that characteristics are independent.

<h4>$\chi^2 = \sum_i \frac{[y_i - E(y_i)]^2}{E(y_i)}$</h4>

- $\sum_i$ is a summation over all $k$ categories
- $y_i$ is the observed count in category $i$
- $E(y_i)$ is the expected count in category $i$

The chi-square statistic has $k-1$ degrees of freedom, where $k$ is the number of categories. The expected number of observations in each category should be at least $5$.

#### Example: Chi-Square Test for Goodness of Fit

The expected proportions of white, brown, and pied rabbits in a population are $0.36$, $0.48$, and $0.16$ respectively. In a sample of $400$, there were $140$ white, $240$ brown, and $20$ pied. Are the proportions in the sample different from expected?

<h4>$\chi^2 = \frac{(140-144)^2}{144} + \frac{(240-192)^2}{192} + \frac{(20-64)^2}{64} = 42.361$</h4>

The critical value $\chi^2_{2; 0.05}$ is $5.991$. Since the calculated $\chi^2$ is greater than the critical value, it can be concluded that the sample is different from the population with a $0.05$ level of significance.

#### Example 2: Chi-Square Test for Goodness of Fit

A die is thrown 120 times. The observed number of occurrences of $i$ is denoted $O_i, i=1, \ldots, 6$. The data is as follows:
    
- $O_1=25, O_2=17, O_3=15, O_4=23, O_5=24, O_6=16$

- $E_1=20, E_2=20, E_3=20, E_4=20, E_5=20, E_6=20$

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(25-20)^2 + (17-20)^2 + (15-20)^2 + (23-20)^2 + (24-20)^2 + (16-20)^2}{20} = \frac{100}{20} = 5.0$

The critical value is $\chi^2_{5;0.05} = 11.1$

The calculated value is less than the critical value, hence, there is no indication that the die is unfair.
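Both goodness-of-fit examples reduce to the same one-line sum. The function name `chi_square_gof` is a hypothetical helper for this sketch; the observed counts, expected proportions, and critical values are those given in the two examples.

```python
def chi_square_gof(observed, expected):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Rabbit example: expected counts are the stated proportions times the sample size.
rabbits_observed = [140, 240, 20]
rabbits_expected = [p * 400 for p in (0.36, 0.48, 0.16)]
chi2_rabbits = chi_square_gof(rabbits_observed, rabbits_expected)

# Die example: a fair die gives 120 / 6 = 20 expected throws per face.
die_observed = [25, 17, 15, 23, 24, 16]
die_expected = [120 / 6] * 6
chi2_die = chi_square_gof(die_observed, die_expected)

print(round(chi2_rabbits, 3), round(chi2_die, 3))
```

Note that each squared difference is divided by its own expected count inside the sum; the die example only factors out a common denominator because every expected count is $20$.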

#### Confidence Intervals

A point estimate is the simplest approach to estimating a population parameter, and a confidence interval is a point estimate $\pm$ the margin of error. For a given sample, the greater the confidence level, the wider the interval. To be $95\%$ confident means that the method used to construct the interval will provide intervals that contain the population mean $95\%$ of the time across repeated samples. It does not mean that there is a $95\%$ chance that the mean lies within any one computed interval, since the population mean is fixed and a given interval either contains it or does not.

Involved in the calculation of a confidence interval is a critical statistic, such as a z or t-statistic, representative of the confidence level. A z-statistic is assumed to be drawn from a normal distribution, and a t-statistic is assumed to be drawn from a t-distribution.

<i>Confidence Interval with z-Statistic:</i>

<h4>$CI = \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$</h4>

<i>Confidence Interval with t-Statistic:</i>

<h4>$CI = \bar{x} \pm t_{n-1;\,\alpha/2} \frac{s}{\sqrt{n}}$</h4>

- $\bar{x}$ is the sample average
- $z_{\alpha/2}$ is the two-tailed critical z-statistic for the given $\alpha$ (significance) level, e.g. $1.96$ for $\alpha = 0.05$
- $\alpha$ is the significance level ($0.05$ or $0.01$ is common)
- $s$ is the sample standard deviation
- $t_{n-1;\,\alpha/2}$ is the two-tailed critical t-statistic with degrees of freedom equal to $n-1$
- $n$ is the number of observations

A confidence interval is not the same as a p-value. The p-value can be small while the confidence interval is large, and vice versa. It is also not the same thing as standard deviation. Standard deviation is a measure of variability within one sample, while a confidence interval provides a range of plausible values for the population parameter, expressing the uncertainty in our estimate of the true population mean. Increasing the sample size does not necessarily decrease standard deviation, but it does tend to narrow a confidence interval.

Assumptions of confidence intervals include:
- Independence: random sampling; one observation does not affect another
- Normality: that the data are normally distributed in the population
- Known Standard Deviation: the sample standard deviation is a good approximation of the population standard deviation
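As a worked illustration, here is a z-based $95\%$ interval for the cosmetics filling example from the z-test section, where the standard deviation is treated as known. The sketch assumes a two-sided interval, so the critical value is taken at $\alpha/2$.

```python
import math
from statistics import NormalDist

# Values from the cosmetics filling example (standard deviation treated as known).
x_bar, sigma, n = 4.6, 1.0, 9

confidence = 0.95
alpha = 1 - confidence

# Two-sided interval: the critical value leaves alpha / 2 in each tail.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
margin_of_error = z_crit * sigma / math.sqrt(n)

ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error

print(round(ci_lower, 3), round(ci_upper, 3))
```

The interval contains the hypothesized mean of $4.0$, which agrees with the two-tailed z-test failing to reject the null hypothesis at the same $\alpha$.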

## ANOVA

ANOVA quantifies the total amount of variability in a dataset, and determines how much is attributable to the independent (X) variables and how much is due to noise or unmeasured factors, before computing the ratio of explained to unexplained variability. It is used to determine the effects of categorical independent variables on a numerical dependent (y) variable. If the independent variable is naturally numerical rather than categorical, you can discretize it into a relatively small number of bins, or consider using regression instead.

ANOVA creates a table of factors and levels. Factors are the independent variables in the study, and levels are the distinct categories or groups within each factor. Assumptions of ANOVA include:
- Independence: observations are sampled randomly; one observation does not affect another
- Normality: the population data is approximately normally distributed
- Homogeneity of Variance: comparable levels of variance across all levels of the independent variables
- Absence of Multicollinearity: absence of redundancy (correlation) in the independent variables
- No Outliers

#### One-Way ANOVA

An ANOVA with one factor and two levels is essentially equivalent to a t-test. Both tests are used to compare means between groups. In a t-test, you directly compare the means of two groups, while in ANOVA, you are assessing whether there is a significant difference in means among groups.

$H_0: \mu_1 = \mu_2 = \ldots$

$H_A: \mu_i \neq \mu_j$ for at least one pair of levels $i \neq j$

ANOVA relies on the sum of squares, which is very similar to variance. The only difference is that variance divides by $n-1$ (equivalent to multiplying by $\frac{1}{n-1}$).

$$SS = \sum_{i=1}^n (x_i - \bar{x})^2$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$

#### Partitioning Sum of Squares

The sum of squares, i.e. the unnormalized variance, is partitioned into three quantities:
- $SS_{Total}$ compares every individual within every group to the global mean of the data $\bar{x}$, computing the total variability.
- $SS_{Between}$ compares the mean within each level to the global mean.
- $SS_{Within}$ is also called the sum of squared errors, comparing each individual to the mean of its own group.
    
<img src="img/SS_3.png" style="height: 300px; width:auto;">

- $x_{ij}$ is the observation indexed by individual $i$ in level $j$
- $\bar{x}$ is the global mean of the variable $x$
- $\bar{x}_j$ is the mean for the level indexed by $j$
- $n_j$ is the number of individuals within the level indexed by $j$

We then compute the mean of squares between, and the mean of squares within:
    
<h4>$\text{MS}_{Between} = \frac{\text{SS}_{Between}}{\text{df}_{Between}}$</h4>

<h4>$\text{MS}_{Within} = \frac{\text{SS}_{Within}}{\text{df}_{Within}}$</h4>

#### The F-Statistic

The ratio of variabilities (normalized to control for sample size) is called an F-statistic, and large F-values provide evidence against the null hypothesis $H_0$. The critical F-value for significance depends on the degrees of freedom.

<img src="img/anova_one_way.png" style="height: 175px; width:auto;">

The F-statistic is calculated as the ratio of two variances, and under the null hypothesis it follows the F-distribution. The F-distribution has two parameters, which are a degrees of freedom measure for the numerator and a degrees of freedom measure for the denominator. As with the normal distribution and t-distribution, critical values from the F-distribution are used to determine whether the observed F-statistic is significant.

<h4>$F = \frac{MS_{Between}}{MS_{Within}}$</h4>

The p-value and F-ratio don't actually tell you which groups are different; they only tell you there is a difference somewhere. So it's necessary to follow up with data visualization and t-tests to determine exactly which groups and levels differ. A p-value of less than the significance level indicates that the mean of at least one level is statistically significantly different from the mean of at least one other level.

We can also evaluate the ANOVA by calculating an $R^2$ value, as:

<h4>$R^2 = \frac{SS_{Between}}{SS_{Total}}$</h4>

And we can also calculate an adjusted $R^2$ (to control for number of parameters), as:

<h4>$\text{Adjusted } R^2 = 1 - \frac{ (1-R^2)(n-1) }{ n - k - 1 }$</h4>

- $n$ is the number of observations
- $k$ is the number of parameters (independent variables)
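The one-way partition, F-statistic, and $R^2$ can be sketched from scratch on a tiny dataset. The groups below are hypothetical, and the degrees of freedom use the standard one-way choices (number of levels minus one for between, observations minus number of levels for within), which are assumed here to match the figures above.

```python
# Hypothetical data: one factor with three levels, three observations each.
groups = {
    "level_1": [1, 2, 3],
    "level_2": [2, 3, 4],
    "level_3": [5, 6, 7],
}

all_values = [v for g in groups.values() for v in g]
n_total = len(all_values)
n_levels = len(groups)
grand_mean = sum(all_values) / n_total

# Total: every observation against the global mean.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# Between: each level mean against the global mean, weighted by level size.
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)

# Within: every observation against the mean of its own level.
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)

df_between = n_levels - 1
df_within = n_total - n_levels
ms_between = ss_between / df_between
ms_within = ss_within / df_within

f_stat = ms_between / ms_within
r_squared = ss_between / ss_total

print(ss_total, ss_between, ss_within)
print(f_stat, r_squared)
```

The between and within sums of squares add up to the total, which is what makes the ratio of explained to unexplained variability meaningful.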

#### Two-Way ANOVA

With two-way ANOVA, we are looking at the potential for interaction. If a medication has a different effect depending on whether you are young or old, that is an interaction.

The total variation is expressed as the sum of variation across individuals within each group, plus the variation across different levels within each factor plus the variation of the interaction between the factors.

<img src="img/SS_5.png" style="height: 450px; width:auto;">

- $x_{ijk}$ is the $i^{th}$ observation within the $j^{th}$ level of factor A and the $k^{th}$ level of factor B
- $\bar{x}_j$ is the average among the $j^{th}$ level of factor A
- $\bar{x}$ is the global average of variable $x$
- $\bar{x}_k$ is the average among the $k^{th}$ level of factor B

To further describe the terms on the left:

- $SS_{Total}$ is the same concept of the sum of squares, and is really just the total variance of the dataset. 

- $SS_{Between}$ has two terms now because we have two factors in our design. An ANOVA with three factors would have three $SS_{Between}$ terms, and so on. For each $SS_{Between}$ term, we're ignoring one of the factors and computing the marginal variance. The multiplication by $bn$ in the factor $A$ term is not by big $N$, the number of observations in the sample, but rather by $n$, the number of observations in each cell, multiplied by $b$, the number of levels in factor $B$. The reverse is true for the factor $B$ term, which is multiplied by $an$. So $bn$ and $an$ represent the number of observations pooled across the other factor.

- $SS_{A \times B}$ is the interaction between A and B. For each cell, we take the cell mean, subtract the marginal mean of the corresponding level of factor A and the marginal mean of the corresponding level of factor B, and then add the global mean of the entire dataset.

- $SS_{Within}$ is sometimes called the sum of squared errors. We're subtracting individuals from their mean, and then we compute the variance, and the key difference is that we have the mean of each cell instead of the entire dataset.
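
The partition described above can be checked numerically. The following is a sketch under stated assumptions: a balanced design (the same number of observations $n$ in every cell), the standard textbook formulas for each term, and hypothetical data. The function name `two_way_ss` and the nested-list layout `cells[j][k]` are illustrative choices.

```python
from statistics import mean

def two_way_ss(cells):
    """Sum-of-squares terms for a balanced two-way design.

    cells[j][k] holds the observations for level j of factor A
    and level k of factor B.
    """
    a = len(cells)          # number of levels in factor A
    b = len(cells[0])       # number of levels in factor B
    n = len(cells[0][0])    # observations per cell (balanced design assumed)

    all_obs = [x for row in cells for cell in row for x in cell]
    grand = mean(all_obs)
    mean_a = [mean([x for cell in cells[j] for x in cell]) for j in range(a)]
    mean_b = [mean([x for j in range(a) for x in cells[j][k]]) for k in range(b)]
    cell_mean = [[mean(cells[j][k]) for k in range(b)] for j in range(a)]

    ss_total = sum((x - grand) ** 2 for x in all_obs)
    ss_a = b * n * sum((m - grand) ** 2 for m in mean_a)
    ss_b = a * n * sum((m - grand) ** 2 for m in mean_b)
    ss_ab = n * sum(
        (cell_mean[j][k] - mean_a[j] - mean_b[k] + grand) ** 2
        for j in range(a) for k in range(b)
    )
    ss_within = sum(
        (x - cell_mean[j][k]) ** 2
        for j in range(a) for k in range(b) for x in cells[j][k]
    )
    return {"total": ss_total, "A": ss_a, "B": ss_b, "AxB": ss_ab, "within": ss_within}

# Hypothetical 2 x 2 design with three observations per cell
cells = [
    [[1, 2, 3], [4, 5, 6]],
    [[2, 3, 4], [9, 10, 11]],
]
ss = two_way_ss(cells)
print(ss)
```

For a balanced design, the four component terms should add up to the total, which is a useful sanity check when implementing the formulas by hand.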

The two-way ANOVA table is as follows:

<img src="img/anova_two_way.png" style="height: 250px; width:auto;">

If $p \lt 0.05$ for a factor, at least one level of that factor is significantly different from at least one other level. Determining which levels differ requires data visualization and follow-up t-tests.

## Linear Regression

Linear regression is like correlation, but extendable to multiple independent variables, whereas correlation is only defined for two variables. Linear regression takes a set of inputs $X$ and a set of outputs $y$, and models a linear relationship by which the input maximally explains the output. For a single variable, the relationship is written as:

<p>$y = \beta_0 + \beta_1 x + \epsilon$</p>
    <ul>
        <li>$y$ is a vector of output from the regression function</li>
        <li>$\beta_0$ is an estimated intercept value</li>
        <li>$\beta_1$ is the estimated coefficient for the $x$ variable</li>
        <li>$x$ is a vector of values</li>
        <li>$\epsilon$ is random error that cannot be explained by the model</li>
    </ul>
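
As an illustration of the single-variable model, the sketch below estimates $\beta_0$ and $\beta_1$ by ordinary least squares. The estimation method is an assumption (the text above does not name one), and the data and the function name `fit_simple_regression` are hypothetical.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Return (beta_0, beta_1) minimizing the sum of squared errors."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_1 = s_xy / s_xx                # slope
    beta_0 = y_bar - beta_1 * x_bar     # intercept
    return beta_0, beta_1

# Hypothetical data that roughly follow y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta_0, beta_1 = fit_simple_regression(x, y)
predictions = [beta_0 + beta_1 * xi for xi in x]
print(f"intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
```

The difference between each observed $y$ and its prediction is the estimate of the error term $\epsilon$ for that observation.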

#### Multiple Regression

When multiple independent variables are being considered, we call it multiple regression. Each independent variable receives a coefficient (with a confidence interval) that describes the variable's contribution to the model, and predictions are created as a weighted sum of feature inputs.

<p>$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$</p>
    <ul>
        <li>$y$ is a vector of output from the regression function</li>
        <li>$\beta_0$ is an estimated intercept value</li>
        <li>$\beta_p$ is the estimated coefficient of the independent variable $x_p$</li>
        <li>$x$ is a matrix of observations containing multiple features</li>
        <li>$x_p$ is the independent variable indexed by $p$</li>
        <li>$\epsilon$ is random error that cannot be explained by the model</li>
    </ul>

One advantage of regression models is that they are highly interpretable. The features and weights can be interpreted as such:
- Numerical Feature: increasing the numerical feature by one unit changes the estimated outcome by the value of its weight.
- Binary Feature: changing the feature from the reference category to the other category changes the estimated outcome by the feature's weight.
- Categorical Feature: the interpretation of each category is the same as the interpretation for binary features.
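- Intercept $\beta_0$: the intercept is the predicted outcome of an instance where all numerical features are zero and all categorical features are at their reference category. If the features have been mean-centered, this corresponds to an instance where all features are at their mean value.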

Another advantage of regression is that it provides standard errors of the coefficients. Assumptions apply, such as a lack of correlated independent ($X$) features; however, these assumptions are often violated with minimal consequence.

#### Evaluating Regression

A regression model can be evaluated as a single statistical object, such as by an F-test or adjusted $R^2$. You can also evaluate individual regressors (independent variables) using t-values.

##### Sum of Squared Terms

<img src="img/SS_3_reg.png" style="height: 300px; width:auto;">

- $SS_{Total}$ is the total variation in the dataset around its mean (when divided by d.f., it's equal to variance)
- $SS_{Model}$ is the total variation of the predicted data around the mean
- $SS_{\epsilon}$ is the total variation of the predicted data relative to the observed data

The three sum-of-squares terms are related to each other as in ANOVA, in that $SS_{Total}$ is partitioned into the sum of two sources of variability, explained and unexplained.

$SS_{Total} = SS_{Model} + SS_{\epsilon}$

Two methods for evaluating model fit, the F-ratio (i.e., F-statistic) and $R^2$, are based on comparing these SS terms.
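
The partition of $SS_{Total}$ can be verified numerically. This sketch assumes a least-squares fit with an intercept (the condition under which the partition holds exactly) and uses hypothetical data; $R^2$ is computed here as $SS_{Model} / SS_{Total}$, by analogy with the ANOVA definition.

```python
from statistics import mean

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares fit of y = beta_0 + beta_1 * x
x_bar, y_bar = mean(x), mean(y)
beta_1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
beta_0 = y_bar - beta_1 * x_bar
y_hat = [beta_0 + beta_1 * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)                # observed vs. mean
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)            # predicted vs. mean
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # observed vs. predicted

r2 = ss_model / ss_total
print(f"SS_total = {ss_total:.3f}, SS_model + SS_error = {ss_model + ss_error:.3f}")
print(f"R^2 = {r2:.4f}")
```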

#### F-Statistic for Regression

In the case of regression, the F-ratio is defined as:
    
<h4>$F_{(k-1,N-k)} = \frac{ SS_{Model}~/~(k-1) }{ SS_{\epsilon}~/~(N-k) }$</h4>

- $k$ is the number of parameters in the regression model, including the intercept
- $k-1$ is the degrees of freedom for the numerator of the F-statistic
- $N-k$ is the degrees of freedom for the denominator of the F-statistic

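The better the model fits the data, the smaller the $SS_{\epsilon}$ term, which increases the F-ratio. As $k$ gets larger with the sums of squares held fixed, the numerator shrinks while the denominator increases, decreasing the F-value. Additional parameters must therefore explain enough extra variance to justify their inclusion.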
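
To make the effect of $k$ concrete, the sketch below evaluates the F-ratio for the same hypothetical sums of squares under two different parameter counts. Holding the sums of squares fixed is a simplification for illustration only, since in practice adding regressors also changes $SS_{Model}$ and $SS_{\epsilon}$. The function name `regression_f` is an illustrative choice.

```python
def regression_f(ss_model, ss_error, n_obs, k):
    """F-ratio for a regression with k parameters (including the intercept)."""
    df_model = k - 1
    df_error = n_obs - k
    return (ss_model / df_model) / (ss_error / df_error)

# Hypothetical sums of squares, held fixed while k changes
f_small_k = regression_f(ss_model=39.601, ss_error=0.107, n_obs=5, k=2)
f_large_k = regression_f(ss_model=39.601, ss_error=0.107, n_obs=5, k=4)
print(f"k=2: F = {f_small_k:.1f}; k=4: F = {f_large_k:.1f}")
```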

#### Evaluating Individual Regressors

If the model F-ratio is not statistically significant, it is inappropriate to evaluate individual regressors (you should not interpret a $p \lt 0.05$ regressor in a model that is a non-significant fit to the data). Individual regressors are evaluated using a t-value, where the null hypothesis is that the coefficient is not different from 0, i.e., $\beta_j = 0$. A t-statistic is the ratio of the estimated effect to its standard error.

<h3>$t_{N-k} = \frac{\beta_j}{SE(\beta_j)} = \frac{\beta_j}{s_{\beta_j} ~/~ \sqrt{N}}$</h3>

- $t_{N-k}$ is a t-statistic with $N-k$ degrees of freedom
- $\beta_j$ is the coefficient of the regressor indexed by $j$
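- $s_{\beta_j}$ is the sample standard deviation for the coefficient of the regressor indexed by $j$
- $N$ is the number of observations, and $k$ is the number of parameters in the model, including the intercept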
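
As a hedged illustration of $t = \beta_j / SE(\beta_j)$, the sketch below computes the t-statistic for the slope of a one-regressor model. It assumes the standard closed-form standard error for simple regression, $SE(\beta_1) = \sqrt{ \frac{SS_{\epsilon} / (N-k)}{\sum (x_i - \bar{x})^2} }$, which is not spelled out in the text above. The data are hypothetical.

```python
import math
from statistics import mean

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n_obs, k = len(x), 2  # k counts the intercept and the slope

# Least-squares estimates
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta_0 = y_bar - beta_1 * x_bar

# Residual variance, estimated with N - k degrees of freedom
ss_error = sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))
residual_variance = ss_error / (n_obs - k)

# Standard error of the slope (assumed closed form for simple regression)
se_beta_1 = math.sqrt(residual_variance / s_xx)
t_value = beta_1 / se_beta_1
print(f"t({n_obs - k}) = {t_value:.2f}")
```

With a single regressor, the square of this t-value equals the model F-ratio, which offers a quick consistency check between the two tests.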

#### Polynomial Regression

Polynomial regression is used to model curves. In a polynomial regression, the columns in the design matrix are the x-axis values raised to increasing powers, i.e., the first column is $x^0$ (all ones), the second is $x^1$, the third is $x^2$, and so on. It is still considered a linear regression model, because the model is linear in the $\beta$ coefficients: it comprises scalar multiplication and addition, and the coefficients are estimated using linear methods.

$y = \beta_0 x^0 + \beta_1 x^1 + \ldots + \beta_k x^k$

- $y$ is the dependent variable
- $\beta_k$ is the coefficient for the regressor indexed by $k$
- $x$ is an independent variable raised to an increasing power for each term in the equation
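
The design matrix described above can be sketched in a few lines. The function name `polynomial_design_matrix` is hypothetical, and the matrix is represented as a plain list of rows to keep the example dependency-free.

```python
def polynomial_design_matrix(x, order):
    """Return rows [x^0, x^1, ..., x^order] for each value in x."""
    return [[xi ** p for p in range(order + 1)] for xi in x]

# Hypothetical x-axis values, second-order polynomial
design = polynomial_design_matrix([1, 2, 3], order=2)
for row in design:
    print(row)
```

Each column is then treated as an ordinary regressor, which is why the usual linear estimation methods still apply.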

Models with higher orders will tend to fit the data better, but have a higher risk of overfitting. One solution is to compare them using the Bayes Information Criterion.

$BIC_k = n ~ln (SS_{\epsilon}) + k ~ln (n)$
- $BIC_k$ is the Bayes Information Criterion for a model with $k$ regressors (independent variables)
- $n$ is the number of observations
- $ln$ is the natural log, meaning a logarithm of base $e$, where $e$ is Euler's constant
- $SS_{\epsilon}$ is the sum of squared errors

The lower the BIC, the more theoretically optimal the model.
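
A small sketch of the comparison follows, using the formula exactly as written above. The sums of squared errors are hypothetical values standing in for two fitted models of different order.

```python
import math

def bic(ss_error, n, k):
    """BIC as defined above: n * ln(SS_error) + k * ln(n)."""
    return n * math.log(ss_error) + k * math.log(n)

# Hypothetical results: the higher-order model fits slightly better
bic_order_2 = bic(ss_error=12.0, n=50, k=2)
bic_order_5 = bic(ss_error=11.5, n=50, k=5)

preferred = "order 2" if bic_order_2 < bic_order_5 else "order 5"
print(f"BIC(order 2) = {bic_order_2:.2f}, BIC(order 5) = {bic_order_5:.2f}")
print(f"Preferred model: {preferred}")
```

In this example the small reduction in error does not offset the penalty for the three extra regressors, so the simpler model has the lower BIC.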

#### Logistic Regression

Logistic regression is an extension of linear regression toward classification, and models the probabilities for classification problems. The model inherently solves binary classification problems; however, extensions that handle multiple classes exist.

Linear regression does not work well for classification, because it does not output probabilities, but rather fits a line or hyperplane that minimizes the distance between itself and the points. A linear model extrapolates, giving values lower than $0$ and greater than $1$, which cannot be treated as probabilities.

Logistic regression uses the logistic function to squeeze the output of a linear equation to between $0$ and $1$.

<h5>$logistic(x) = \frac{1}{1+e^{-x}}$</h5>

- $logistic(x)$ is the input $x$ transformed into the output of the function
- $x$ is an instance of input
- $e$ is Euler's constant, an irrational number starting with $2.71828...$
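
The logistic function can be sketched as follows. The two-branch form is an implementation choice, not part of the definition: it is algebraically identical to $\frac{1}{1+e^{-x}}$ but avoids a floating-point overflow in `math.exp` for large negative inputs.

```python
import math

def logistic(x):
    """Squeeze any real number into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form for negative x that avoids computing exp of a large number
    z = math.exp(x)
    return z / (1.0 + z)

for value in (-5, 0, 5):
    print(value, round(logistic(value), 4))
```

An input of zero maps to exactly one half, and the outputs for $x$ and $-x$ always sum to one.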

(See *Euler's Constant $e$* in the Appendix.)

In logistic regression, the $x$ in the logistic function is replaced by the linear regression model:

$P(y^{(i)} = 1) = \frac{1}{ 1 + exp(-(\beta_0 + \beta_1 x_1^{(i)} + \ldots + \beta_p x_p^{(i)}) ) }$

- $P(y^{(i)} = 1)$ is the probability that an instance of $y$ equals the target class
- $exp()$ is equivalent to saying Euler's constant $e$ exponentiated to the expression in brackets

(See *The exp() Function* in the Appendix.)

The interpretation of weights in logistic regression differs from the interpretability of weights in linear regression, because the weighted sum is transformed into a probability. We can reformulate the equation for the interpretation so that only the linear term is on the right side of the formula.

$ln \left( \frac{P(y^{(i)}=1)}{P(y^{(i)}=0)} \right) = \beta_0 + \beta_1 x_1^{(i)} + \ldots + \beta_p x_p^{(i)}$

- $P(y^{(i)} = 1)$ is the probability that an instance of $y$ equals the target class
- $P(y^{(i)} = 0)$ is the probability that an instance of $y$ equals the reference class
- $ln$ is the base-e logarithm operation, where $e$ is Euler's constant

We call the term in the brackets the odds (the probability of one outcome divided by the probability of the other), and wrapped in the logarithm, the log-odds. A change in $x_j$ by one unit changes the log-odds by the value of the corresponding weight, $\beta_j$. Another, perhaps more intuitive way to interpret the weight $\beta_j$ is that a change in $x_j$ by one unit multiplies the odds by a factor of $exp(\beta_j)$.

The components of the model can be interpreted as such:

- Numerical Feature: if you increase the value of feature $x_j$ by one unit, the estimated odds change by a factor of $exp(\beta_j)$.

- Binary Categorical Feature: changing the feature $x_j$ from the reference category to the other category changes the estimated odds by a factor of $exp(\beta_j)$.

- Categorical Features: can be one-hot encoded (**link/appendix**) so that the binary categorical feature interpretation applies to each class.

- Intercept $\beta_0$: when all numerical features are zero and the categorical features are at the 'reference category' the estimated odds are $exp(\beta_0)$.
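
The odds interpretation can be checked numerically. In this sketch the coefficients and feature values are hypothetical, and the helper names `predict_probability` and `odds` are illustrative. It increases the first feature by one unit and compares the ratio of the odds before and after to $exp(\beta_1)$.

```python
import math

def predict_probability(x, beta_0, betas):
    """P(y = 1) from the logistic regression equation."""
    linear = beta_0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-linear))

def odds(p):
    """Probability of the target class divided by that of the reference class."""
    return p / (1.0 - p)

# Hypothetical fitted coefficients
beta_0, betas = -1.0, [0.8, -0.5]

p_before = predict_probability([2.0, 1.0], beta_0, betas)
p_after = predict_probability([3.0, 1.0], beta_0, betas)  # first feature + 1

odds_ratio = odds(p_after) / odds(p_before)
print(f"odds ratio = {odds_ratio:.4f}, exp(beta_1) = {math.exp(betas[0]):.4f}")
```

The two printed values agree regardless of the starting feature values, which is what makes the odds interpretation convenient.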

## Classification Metrics

#### Error Types

In a trial or experiment, an effect is either present or absent in each response, and there are four possible outcomes:

1. A true positive, a.k.a. a hit
2. A false positive, a.k.a. a type I error or false alarm
3. A true negative, a.k.a. a correct rejection
4. A false negative, a.k.a. a type II error, or miss

A confusion matrix represents these quantities visually.

<img src="img/error_types.png" style="height: 300px; width:auto;">

<br>

<img src="img/conf_matrix2.png" style="height: 80px; width:auto;">

#### Accuracy

There are a number of metrics that can be calculated using the values from the confusion matrix, including accuracy, which is calculated as follows:

$Accuracy = \frac{ TP + TN }{ TP + FP + TN + FN }$

The classification error rate is the complement of classification accuracy, i.e., $1 - Accuracy$:

$\text{Error Rate} = \frac{ FP + FN }{ TP + FP + TN + FN }$

However, when used on a dataset with imbalanced classes, such as when 90% of observations belong to the same class, an unskilled model can achieve 90% accuracy just by blindly picking the majority class each time. To evaluate such a model meaningfully, you may need to use one of the other classification metrics, and in some scenarios, you will find one or more of them to be more important than accuracy.
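
The imbalanced-class scenario can be illustrated with a short sketch. The confusion-matrix counts are hypothetical: 100 observations of which 10 are positive, scored against a model that always predicts the negative class.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def error_rate(tp, fp, tn, fn):
    return (fp + fn) / (tp + fp + tn + fn)

# Hypothetical counts for a model that always predicts "negative"
tp, fp, tn, fn = 0, 0, 90, 10

acc = accuracy(tp, fp, tn, fn)
err = error_rate(tp, fp, tn, fn)
print(f"accuracy = {acc:.2f}, error rate = {err:.2f}")
print(f"positive cases found: {tp} of {tp + fn}")
```

The accuracy looks respectable even though the model fails to identify a single positive case.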

#### Precision

Precision is the ratio of true positives to predicted positives. It is most used when there is a high cost for having false positives. Junk-mail classifiers should have a high degree of precision, so that they do not misclassify important emails as junk.

$Precision = \frac{TP}{TP+FP}$

#### Sensitivity, a.k.a. Recall

Sensitivity is the ratio of true positives to all actual positives. It is important when the priority is identifying positive outcomes and the cost of a false positive is low. When predicting whether a patient has cancer, it is important that sensitivity be high so that we capture as many positive cases as possible.

$Sensitivity = \frac{TP}{TP+FN}$

#### Specificity (a.k.a. TNR, True Negative Rate)

Specificity is the ratio of true negatives to all actual negatives. It is of interest when you want negative cases to be identified correctly and there is a high cost to a false positive. An example would be an auditor reviewing financial transactions, where a positive result would trigger a one-year investigation, but missing a case would cost very little.

$Specificity = \frac{TN}{TN+FP}$

**link to appendix for more**

#### F1-Measure

There are also measures that blend multiple classification metrics into one, such as the F-Measure, a.k.a. the F-Score or F1-Score:

$\text{F-Measure} = \frac{ 2 ~\times ~Precision ~\times ~Recall }{ Precision ~+ ~Recall }$
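
The metrics in this section can be collected into a few helper functions. The confusion-matrix counts below are hypothetical and the function names are illustrative.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Also known as recall."""
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 10, 45, 5

metrics = {
    "precision": precision(tp, fp),
    "sensitivity": sensitivity(tp, fn),
    "specificity": specificity(tn, fp),
    "f_measure": f_measure(tp, fp, fn),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that these functions raise a `ZeroDivisionError` when a denominator is zero (for example, precision when the model predicts no positives), so a production implementation would need to decide how to handle that case.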

## Cross-Validation

Models are sensitive to the data they are trained on; a model that is not sensitive enough can be considered to under-fit the data. If the model is too sensitive, or the data is unrepresentative of the population, then the model can over-fit, producing high training accuracy but failing to generalize well to new data. If you are comparing several models, such as a series of regression models fit with different combinations or transformations of features, then it is helpful to employ a validation strategy which holds back some of the data for a post-training evaluation of the model on this unseen validation or test set.

Holding back data in this way is called holdout validation. With a little automation, we can repeat the process several times: in k-fold cross-validation, the data is shuffled and split into $k$ equally sized folds, and each fold serves once as the holdout set while the model is trained on the remaining folds. Averaging the results across folds provides an even better idea of how well a model will generalize.
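
One common formulation of k-fold cross-validation partitions the shuffled data into $k$ folds and uses each fold exactly once as the holdout set. The sketch below generates the index splits only; the function name `k_fold_indices` and the fixed `seed` are illustrative assumptions, and fitting and scoring a model on each split is left out.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(f"train on {len(train_idx)} rows, test on rows {sorted(test_idx)}")
```

Every observation appears in exactly one holdout set, so the averaged score uses all of the data for evaluation without ever testing a model on rows it was trained on.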

## Appendix

#### Logarithms

$10^3 = 1000$, and we say that the base-10 logarithm (a.k.a. $log$) of 1000 is 3. The logarithm is the exponent to which the base must be raised in order to produce the given number. Different bases may be used, such as base-2, base-10, and base-e (Euler's constant, the irrational number starting with $2.71828...$). A base-e logarithm, i.e., the natural logarithm, is often used in mathematics, and is written as $ln$.

- $log_{10} ~1000 = 3$
- $log_{2} ~8 = 3$
- $log_e ~8 = log_{2.71828...} ~8 = ln ~8 = 2.07944$

Logarithms with a base greater than 1 are monotonically increasing, meaning that they preserve the order of the input values. Applying a logarithm to a set of positive data in which each number is greater than the previous will result in another set of data in which each number is greater than the previous, just with a different shape and scale.
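
The examples and the order-preserving property can be reproduced with Python's `math` module. The data values are hypothetical.

```python
import math

base10 = math.log10(1000)   # exponent that turns 10 into 1000
base2 = math.log2(8)        # exponent that turns 2 into 8
natural = math.log(8)       # math.log defaults to base e

print(f"log10(1000) = {base10:.5f}")
print(f"log2(8)     = {base2:.5f}")
print(f"ln(8)       = {natural:.5f}")

# Order is preserved: increasing inputs give increasing outputs
data = [1, 10, 100, 1000]
logged = [math.log(v) for v in data]
is_order_preserved = logged == sorted(logged)
print(f"order preserved: {is_order_preserved}")
```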

#### Euler's Constant $e$

$e$, like $\pi$, is an irrational number, meaning the digits after the decimal neither end nor repeat. In Excel, it is rounded to 2.71828182845905.

It is also similar to $\pi$ in that it has profound importance in mathematics. Euler's formula, named after Leonhard Euler, establishes the relationship between trigonometric functions and complex numbers.

$e^{ix} = cos ~x + i ~sin ~x$

A special case of this formula known as Euler's identity links five fundamental mathematical concepts together ($e$, $\pi$, the imaginary unit $i$, the number $1$, and the number $0$).

$e^{i \pi} + 1 = 0$

In the realm of calculus, $e$ has an interesting property, which is that the function $e^x$ is equal to its own <a href="https://en.wikipedia.org/wiki/Derivative">derivative</a>.

Some additional resources on the subject are as follows:

- https://www.3blue1brown.com/lessons/eulers-number
- https://www.3blue1brown.com/lessons/eulers-formula-dynamically

#### The exp() Function

The expression in the brackets of an $exp()$ function becomes an exponent to Euler's constant, $e$. For example, $e^{-3}$ is equal to $exp(-3)$.

### PDF of the Normal Distribution

<h2>$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 }$</h2>

$x \in (-\infty, \infty)$
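
The density can be evaluated directly with the standard library. In this sketch the function name `normal_pdf` and the default parameters (a standard normal with $\mu = 0$ and $\sigma = 1$) are illustrative choices. Note the negative sign in the exponent, which is what makes the density decay as $x$ moves away from the mean.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

peak = normal_pdf(0.0)
one_sd = normal_pdf(1.0)
print(f"f(0) = {peak:.5f}, f(1) = {one_sd:.5f}")
```

The density is highest at the mean and symmetric around it, so `normal_pdf(1.0)` and `normal_pdf(-1.0)` return the same value.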

### PDF of the t-Distribution

<h2>$f(x) = \frac{ \Gamma((n+1)/2) }{ \sqrt{n \pi} \Gamma(n/2) } (1 + x^2/n)^{-(n+1)/2}$</h2>

$x \in (-\infty, \infty)$

- where $\Gamma$ is the <a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a>
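
The formula can be evaluated with `math.gamma` from the standard library. In this sketch $n$ is taken to be the degrees of freedom, and the function name `t_pdf` is an illustrative choice. Because `math.gamma` overflows for large arguments, this direct form is only suitable for moderate degrees of freedom.

```python
import math

def t_pdf(x, n):
    """Probability density of the t-distribution with n degrees of freedom."""
    coefficient = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return coefficient * (1 + x * x / n) ** (-(n + 1) / 2)

# With one degree of freedom the peak equals 1 / pi
peak_df1 = t_pdf(0.0, n=1)
# With more degrees of freedom the peak approaches that of the standard normal
peak_df30 = t_pdf(0.0, n=30)
print(f"f(0 | n=1) = {peak_df1:.5f}, f(0 | n=30) = {peak_df30:.5f}")
```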

### PDF of the F-Distribution

An F-distributed variable is defined as the ratio of two independent chi-squared variables, each divided by its degrees of freedom:
    
<h2>$F = \frac{X / \text{df}_X}{Y / \text{df}_Y}$</h2>