## Descriptive Statistics

Summarizing data is important to understanding it at scale, and descriptive statistics help us to do so.

#### Mean, Median, and Mode

The mean of a series of numbers is the average, the sum divided by the count.

<h4>$\bar{x} ~~= \frac{ \sum_{i=1}^n x_i }{n} ~~= ~~\frac{1}{n} \sum_{i=1}^n x_i$</h4>

- $\bar{x}$ is the mean of the series of numbers $x$
- $n$ is the number of observations
- $i$ is the index of a particular data point of $x$
- $x_i$ is a data point of $x$ with index $i$

The median is the data point in the middle of the list when you sort the series of numbers $x$ by value; with an even number of observations, it is the average of the two middle values. It can be preferable to the mean as a measure of central tendency, as it is far less sensitive to outliers.

The mode is the value observed most frequently in the series of numbers, and can be tied amongst two or more values.

For a Gaussian/normal distribution, which is symmetric with a single peak, the median and mode are equal to the mean.
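As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module on a small made-up sample (the values are an assumption chosen for illustration, including one deliberate outlier).

```python
import statistics

# Hypothetical sample with one large outlier (100)
data = [2, 3, 3, 5, 7, 100]

mean = statistics.mean(data)        # sum divided by count
median = statistics.median(data)    # middle of the sorted values (average of the two middle values here)
modes = statistics.multimode(data)  # list of the most frequent value(s), so ties are preserved

print(mean, median, modes)
```

The outlier pulls the mean well above most of the data, while the median stays near the bulk of the values.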

#### Percentile and Quartile

You can think of the concept of percentiles as a generalization of the concept of the median, as the median represents the value at the $50^{th}$ percentile. Other special cases of the percentile are the values $25\%$ and $75\%$ of the way down the list of sorted values, and together with the median, these are called quartiles, because they divide the data into four sections with a roughly equal number of observations.
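A short sketch of quartiles using `statistics.quantiles` from the standard library follows. The data are made up, and the choice of `method="inclusive"` is an assumption: several interpolation conventions exist, and they give slightly different cut points on small samples.

```python
import statistics

# Hypothetical sorted values
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 requests the three cut points that split the data into four groups.
# method="inclusive" treats the data as the full population; the default
# ("exclusive") can return slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
```

The second quartile coincides with the median, which is the sense in which percentiles generalize it.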

#### Probability Distributions

Probability is the soul of statistics, which is all about quantifying uncertainty. It helps us to quantify randomness, and by calculating probabilities, we can determine whether observed phenomena are statistically significant or likely due to chance. A probability is a number between $0$ and $1$, and the probabilities of all possible outcomes together sum to $1$; the assignment of probabilities to outcomes is called a probability distribution.

There are many types of probability distributions, and the math behind them gets pretty hairy. We will not delve into the various ways that these distributions are calculated, but a couple of articles on the subject can be found here:

- <a href="https://github.com/pw598/Articles/blob/main/Probability%20Distributions%20I%20-%20Discrete%20Distributions.ipynb">Probability Distributions I - Discrete Distributions</a>
- <a href="https://github.com/pw598/Articles/blob/main/Probability%20Distributions%20II%20-%20Continuous%20Distributions.ipynb">Probability Distributions II - Continuous Distributions</a>

What's important to know is that probability distributions, and the random variables which represent them, are mathematical functions. A function transforms an input $x$ into some output, for example, $f(x)=x^2$. A probability distribution is a parametric function: for discrete outcomes it is known by its probability mass function (PMF), whose outputs are probabilities, and for continuous outcomes by its probability density function (PDF), whose outputs are densities, with probabilities given by areas under the curve. By parametric, we mean that the function is defined partly in terms of variable parameters - like mean and variance for the Gaussian/normal distribution.
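To make the idea of a parametric function concrete, here is a minimal sketch of the Gaussian density written out by hand and checked against `statistics.NormalDist`. The function name `gaussian_pdf` is hypothetical, and it takes the standard deviation `sigma` (the square root of the variance) as its second parameter.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Evaluate the density at the mean of a standard normal (mu=0, sigma=1)
density_at_mean = gaussian_pdf(0.0, mu=0.0, sigma=1.0)

# Cross-check against the standard library implementation
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(density_at_mean, reference)
```

Changing `mu` shifts the curve and changing `sigma` widens or narrows it, but the function itself stays the same.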

Several important concepts are agnostic to the type of distribution being used.

#### Expected Value

The expected value of a probability distribution is a measure of its center: it is the mean of the distribution itself, as opposed to the mean of a particular sample. It is the average outcome you would expect when repeating an experiment many times, and is calculated by summing the products of each possible outcome and its probability.

<h4>$E(X) = \sum_i x_i \cdot P(X=x_i)$</h4>

- $E(X)$ is the expected value of $X$
- $\sum_i$ represents a sum over all indices $i$ of $X$
- $i$ is the index of the value of $X$ being considered
- $x_i$ is the value of $X$ for a given index
- $\cdot$ is the multiplication operation
- $P(X=x_i)$ is the probability that $X$ equals $x_i$

For a continuous distribution, we can use the above as a discrete approximation, but otherwise need to involve the concept of integrals from calculus (which we'll avoid).
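As a worked instance of the discrete formula, the sketch below computes the expected value of a fair six-sided die. The die is an assumed example, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

# A valid probability distribution sums to 1
total_probability = sum(probs)

# E(X) = sum of x_i * P(X = x_i)
expected = sum(x * p for x, p in zip(outcomes, probs))

print(total_probability, expected, float(expected))
```

Note that the expected value need not be an outcome that can actually occur: no single roll shows 3.5.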

#### Variance

Variance is a measure of how much spread there is around the expected value, and is calculated as the average of squared differences from the expected value. We square the differences so that negative differences do not offset the positive ones; instead, they all contribute to a measure of spread regardless of direction. A nuance is that when estimating from a sample rather than a full population, we divide by $n-1$ instead of $n$, which corrects for the tendency of a sample to understate the spread around the true mean.

<i>Population Variance:</i>

<h4>$\sigma^2 = \frac{\sum(x - \mu)^2}{n}$</h4>

<i>Sample Variance:</i>

<h4>$s^2 = \frac{\sum(x - \bar{x})^2}{n-1}$</h4>

- $\sigma^2$ is the population variance
- $s^2$ is the sample variance
- $x$ is the value of an instance in the dataset
- $\mu$ is the population average value
- $\bar{x}$ is the sample average value
- $n$ is the number of observations
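The two variance formulas can be sketched directly, with the standard library's `statistics.pvariance` and `statistics.variance` as a cross-check. The data values are made up for illustration.

```python
import statistics

# Hypothetical data with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_diffs = sum((x - mean) ** 2 for x in data)

pop_var = squared_diffs / n           # divide by n: data is the whole population
sample_var = squared_diffs / (n - 1)  # divide by n-1: data is a sample

# Cross-check with the standard library
lib_pop_var = statistics.pvariance(data)
lib_sample_var = statistics.variance(data)

print(pop_var, sample_var)
```

The sample variance is always slightly larger than the population variance computed on the same numbers, and the gap shrinks as $n$ grows.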

#### Standard Deviation

The squared values of variance make for an interpretability issue - variance is expressed in squared units rather than the units of the data. This is why we commonly speak in terms of standard deviation - the square root of variance - which brings us back to the same units as the data.

<i>Population Standard Deviation:</i>

<h4>$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum(x - \mu)^2}{n}}$</h4>

<i>Sample Standard Deviation:</i>

<h4>$s = \sqrt{s^2} = \sqrt{\frac{\sum(x - \bar{x})^2}{n-1}}$</h4>

- $\sigma$ is the population standard deviation
- $s$ is the sample standard deviation
- $x$ is the value of an instance in the dataset
- $\mu$ is the population average value
- $\bar{x}$ is the sample average value
- $n$ is the number of observations
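A brief sketch of both standard deviations, using `statistics.pstdev` and `statistics.stdev` on made-up data:

```python
import math
import statistics

# Hypothetical data with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = statistics.pstdev(data)    # square root of the population variance
sample_sd = statistics.stdev(data)  # square root of the sample variance (n-1 denominator)

# Same result from the definition: the square root of the variance
sample_sd_from_var = math.sqrt(statistics.variance(data))

print(pop_sd, sample_sd)
```

If the data were measured in meters, both results would also be in meters, whereas the variances would be in square meters.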

#### Covariance and Correlation

You are likely familiar with the concept of correlation - the strength and direction of the linear relationship between two variables. To understand correlation, it helps to understand covariance, the unscaled version of correlation. While correlation ranges from $-1$ to $+1$, covariance is unbounded and its magnitude depends on the units of the two variables; correlation is just covariance rescaled to a fixed range.

Covariance is the sum of the product of differences from the mean for two variables, scaled by the number of observations $n$, or $n-1$ if sampling. Two coinciding positive differences make for a positive contribution to covariance, as do two coinciding negative differences, through multiplication; differences of opposite sign make for a negative contribution.

<i>Population Covariance:</i>

<h4>$cov(x,y) = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{n}$</h4>

<i>Sample Covariance:</i>

<h4>$cov(x,y) = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{n-1}$</h4>

- $x$ is a vector (series) of values
- $y$ is a vector of values
- $x_i$ is an individual data point of $x$
- $\bar{x}$ is the average value of $x$
- $y_i$ is an individual data point of $y$
- $\bar{y}$ is the average value of $y$
- $n$ is the number of observations

Back to correlation - correlation is covariance divided by the product of the standard deviations of the two variables. The $n$ (or $n-1$) terms cancel, leaving:

<h4>$r = \frac{ \sum (x_i - \bar{x}) (y_i - \bar{y}) }{ \sqrt{\sum(x_i - \bar{x})^2} \sqrt{\sum(y_i - \bar{y})^2} }$</h4>

- $x$ is a vector of values
- $y$ is a vector of values
- $x_i$ is an individual data point of $x$
- $\bar{x}$ is the average value of $x$
- $y_i$ is an individual data point of $y$
- $\bar{y}$ is the average value of $y$
- $n$ is the number of observations
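The formulas above can be sketched in plain Python. The function names `covariance` and `correlation` and the data are hypothetical; the final lines rescale $x$ to show that covariance depends on units while correlation does not.

```python
import math

def covariance(x, y, sample=True):
    """Sum of products of differences from the means, divided by n-1 (sample) or n (population)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (n - 1 if sample else n)

def correlation(x, y):
    """Pearson correlation: the same numerator, scaled by the spread of x and of y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
                   * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return numerator / denominator

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sample_cov = covariance(x, y)
pop_cov = covariance(x, y, sample=False)
r = correlation(x, y)

# Changing the units of x (multiplying by 100) changes covariance but not correlation
x_rescaled = [100 * a for a in x]
rescaled_cov = covariance(x_rescaled, y)
rescaled_r = correlation(x_rescaled, y)

print(sample_cov, pop_cov, round(r, 4))
```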

#### Z-Scores and Standardization

Z-Scores indicate how many standard deviations away a data point is from the mean of the dataset.

<h4>$z = \frac{x_i - \mu}{\sigma}$</h4>

- $z$ is the distance in standard deviations of $x_i$ from the mean of $x$, $\mu$
- $x_i$ is an individual data point of $x$
- $\mu$ is the average value of $x$
- $\sigma$ is the standard deviation of $x$

Z-scores can be helpful for standardization, a form of scaling in which data points are expressed in terms of the number of standard deviations away from the mean.

Another way to scale data is min/max scaling, which maps each value to its position within the range of the data, from $0$ at the minimum to $1$ at the maximum:

<h4>$scaled ~x_i = \frac{x_i - min(x)}{max(x) - min(x)}$</h4>

- $scaled ~x_i$ is the proportion of range from the minimum to maximum of the vector $x$
- $x_i$ is an individual instance of $x$
- $min(x)$ is the minimum value of $x$
- $max(x)$ is the maximum value of $x$
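Both scaling methods can be sketched in a few lines. The data are made up, and the population standard deviation is used for the z-scores as an assumption (the sample version would be used when the data are a sample).

```python
import statistics

# Hypothetical data with mean 5 and population standard deviation 2
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

# Standardization: distance from the mean in standard deviations
z_scores = [(x - mu) / sigma for x in data]

# Min/max scaling: position within the range, from 0 to 1
lo, hi = min(data), max(data)
min_max_scaled = [(x - lo) / (hi - lo) for x in data]

print(z_scores)
print([round(v, 3) for v in min_max_scaled])
```

Standardized values have mean 0 and standard deviation 1 but no fixed bounds, while min/max scaled values are bounded by 0 and 1.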

## Statistical Hypothesis Testing

Probability distributions provide a mathematical framework for describing the likelihood of events in a random process or experiment. Methods like maximum likelihood estimation use analytical functions to find parameters that best fit observed data to a theoretical distribution.

Going forward, we will assume the Gaussian/normal distribution, or at times, the t-distribution. The parameters of the Gaussian distribution are mean and variance, and the parameter of the t-distribution is degrees of freedom, set to $n-1$ (where $n$ is the number of observations).

When testing the effect of an experiment, we ask whether a sample mean from the experiment is unlikely to have come from the assumed population probability distribution, which would suggest that the difference in stimulus/environment/etc. has had a <i>significant</i> effect.

A p-value (probability value) measures the probability of obtaining a sample mean at least as extreme as the one observed, if the sample really were drawn from a theoretical distribution, as determined by the type of distribution and the variable parameters which define that distribution. A significance level $\alpha$ reflecting the maximum threshold for the p-value is commonly selected to be $0.05$ or $0.01$. This reflects .... It is when the p-value is lower than the significance level $\alpha$ that we consider a result not likely to be due to random chance.

The null hypothesis $H_0$ is that there is no difference between the sampled data and the population. The drug had no effect, the sale provided no uplift, etc. When the p-value is less than the significance level $\alpha$, the null hypothesis $H_0$ is rejected in favor of the alternative hypothesis $H_A$ (or $H_1$). Rejecting the null hypothesis does not prove that the alternative hypothesis is true, but indicates that the observed data would be unlikely if the null hypothesis were true.

#### The Z-Test

The z-test quantifies the probability of observing a value as or more extreme than a given value when drawing from a normal distribution, by expressing that value as a z-score. Notable combinations of z-score and probability include:
- 68.3% of data is between -1 and +1 standard deviations from the mean
- 95.5% of data is between -2 and +2 standard deviations from the mean
- 99.7% of data is between -3 and +3 standard deviations from the mean
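The percentages above can be reproduced with `statistics.NormalDist`, and the same cumulative distribution function yields a p-value for a z-score. This is a sketch under assumptions: the helper names `coverage` and `two_tailed_p` are hypothetical, as are the population mean, population standard deviation, and sample figures, and the population standard deviation is treated as known.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def two_tailed_p(z):
    """Probability of a z-score at least as extreme as z, in either direction."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))

# Hypothetical test: population mean 100 and sd 15; a sample of 36 has mean 106
mu0, sigma, n, sample_mean = 100.0, 15.0, 36, 106.0
z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # standard error is sigma / sqrt(n)
p_value = two_tailed_p(z)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
print(round(z, 2), round(p_value, 4))
```

With $\alpha = 0.05$, a p-value of this size would lead to rejecting the null hypothesis; with $\alpha = 0.01$ it would not.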

#### Critical Values

A critical value is defined in the context of the population distribution and a probability, and is used as a threshold for interpreting the result of a statistical test. The values in the population beyond the critical value are called the critical region or region of rejection.

A one-tailed test has a single critical value, on the left or the right of the distribution; if the calculated statistic is no more extreme than the critical value, we fail to reject the null hypothesis. A two-tailed test has two critical values, one on each side of the distribution, which is often assumed to be symmetrical. When using a two-tailed test, the significance level $\alpha$ is split across the two tails, so each critical value is calculated using $\alpha/2$.
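As a sketch of how critical values follow from $\alpha$, the snippet below inverts the standard normal cumulative distribution function with `NormalDist.inv_cdf`. The choice of $\alpha = 0.05$ and the helper name `reject_two_tailed` are assumptions for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05

# One-tailed (right): all of alpha sits in one tail
right_tail_critical = std_normal.inv_cdf(1.0 - alpha)

# Two-tailed: alpha is split in half, giving symmetric critical values
two_tail_critical = std_normal.inv_cdf(1.0 - alpha / 2.0)

def reject_two_tailed(z, critical=two_tail_critical):
    """Reject the null hypothesis only if z falls in either rejection region."""
    return abs(z) > critical

print(round(right_tail_critical, 3), round(two_tail_critical, 3))
print(reject_two_tailed(2.4), reject_two_tailed(1.5))
```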

#### Confidence Intervals

A point estimate is the simplest approach to estimating a population parameter, and a confidence interval is a point estimate $\pm$ the margin of error. For the same data, the greater the confidence level, the wider the interval. To be 95% confident means that the method used to construct the interval will provide intervals that contain the population mean 95% of the time under repeated sampling. It does not mean that any one computed interval has a 95% chance of containing the mean: the population mean is fixed, and a given interval either contains it or does not.

The calculation of a confidence interval involves a critical statistic, such as a z-score or t-statistic, representative of the confidence level. A z-statistic is assumed to be drawn from a normal distribution, and a t-statistic is assumed to be drawn from a t-distribution. The t-distribution is similar to the normal distribution, and approaches it as the number of observations becomes greater, but for smaller samples, it has a lower peak and heavier tails. For 30 or more observations, it is common to assume the normal distribution.

<i>Confidence Interval with z-Statistic:</i>

<h4>$CI = \bar{x} \pm z_{\alpha} \frac{s}{\sqrt{n}}$</h4>

<i>Confidence Interval with t-Statistic:</i>

<h4>$CI = \bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}$</h4>

- $\bar{x}$ is the sample average
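- $z_{\alpha}$ is a critical z-statistic for the given $\alpha$ (significance) level; for a two-sided interval, it is the value that leaves $\alpha/2$ in each tail (about $1.96$ for $\alpha = 0.05$)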
- $\alpha$ is the significance level ($0.05$ or $0.01$ is common)
- $s$ is the sample standard deviation
- $t_{n-1}$ is a critical t-statistic with degrees of freedom equal to $n-1$
- $n$ is the number of observations
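A minimal sketch of the z-based interval follows, using only the standard library. The function name `z_confidence_interval` and the summary figures are hypothetical. The t-based version is not shown because the standard library has no t-distribution quantile function; a package such as SciPy would be needed for that.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Two-sided interval: sample mean plus or minus critical z times the standard error."""
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # alpha split across both tails
    margin = z_crit * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical summary statistics: mean 50, standard deviation 10, 100 observations
low, high = z_confidence_interval(50.0, 10.0, 100)
low99, high99 = z_confidence_interval(50.0, 10.0, 100, confidence=0.99)

print(round(low, 2), round(high, 2))
print(round(low99, 2), round(high99, 2))
```

Raising the confidence level from 95% to 99% on the same data widens the interval.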

## Linear Regression

Linear regression takes a set of inputs $X$ and a set of outputs $y$, and models the linear relationship by which the inputs best explain the output, typically by minimizing the sum of squared errors. For a single input variable, the relationship is written as:

<p>$$y = \beta_0 + \beta_1 x + \epsilon$$</p>
    <ul>
        <li>$y$ is the numeric response for the instance in question</li>
        <li>$\beta_0$ is an estimated intercept value</li>
        <li>$\beta_1$ is the estimated coefficient for the $x$ variable</li>
        <li>$\epsilon$ is random error that cannot be explained by the model</li>
    </ul>
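As a sketch of how the intercept and coefficient might be estimated for a single variable, the snippet below uses the closed-form ordinary least squares solution. That estimation method is an assumption here, and the function name and data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for one input: returns (beta_0, beta_1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    beta_1 = s_xy / s_xx                 # slope: co-variation of x and y over variation of x
    beta_0 = mean_y - beta_1 * mean_x    # intercept: the line passes through the point of means
    return beta_0, beta_1

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

beta_0, beta_1 = fit_simple_linear_regression(x, y)

# Residuals are the part of y the fitted line does not explain (the epsilon term)
residuals = [b - (beta_0 + beta_1 * a) for a, b in zip(x, y)]

print(round(beta_0, 3), round(beta_1, 3))
```

The slope is the same numerator as covariance divided by the same numerator as the variance of $x$, which ties regression back to the descriptive statistics above.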

#### Multiple Regression

When multiple independent variables are being considered, we call it multiple regression. Each independent variable receives a coefficient (with a confidence interval) that describes the variable's contribution to the model, and predictions are created as a weighted sum of feature inputs.

<p>$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$$</p>
    <ul>
        <li>$y$ is the numeric response for the instance in question</li>
        <li>$\beta_0$ is an estimated intercept value</li>
        <li>$\beta_p$ is the estimated coefficient for the $p^{th}$ variable</li>
        <li>$\epsilon$ is random error that cannot be explained by the model</li>
    </ul>

One advantage of regression models is that they are highly interpretable. The features and weights can be interpreted as such:
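- Numerical Feature: increasing the numerical feature by one unit changes the estimated outcome by the value of its weight, holding all other features fixed.
- Binary Feature: changing the feature from the reference category to the other category changes the estimated outcome by the feature's weight, holding all other features fixed.
- Categorical Feature: each non-reference category gets its own weight, and the interpretation of each category is the same as the interpretation for binary features.
- Intercept $\beta_0$: the intercept is the predicted outcome of an instance where all numerical features are zero and all categorical features are at their reference category. If the features have been standardized, this corresponds to an instance where all features are at their mean value.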
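The interpretation rules above can be checked on a toy model. Everything in this sketch is hypothetical: the feature names, the intercept, and the weights are made up rather than fitted to data.

```python
def predict(features, intercept, weights):
    """Prediction as the intercept plus a weighted sum of feature values."""
    return intercept + sum(weights[name] * value for name, value in features.items())

# Hypothetical fitted model
intercept = 50.0
weights = {"size_m2": 2.0, "has_garage": 15.0}

base = {"size_m2": 80.0, "has_garage": 0}
one_unit_larger = {"size_m2": 81.0, "has_garage": 0}  # numerical feature +1
with_garage = {"size_m2": 80.0, "has_garage": 1}      # binary feature switched on
all_zero = {"size_m2": 0.0, "has_garage": 0}

base_pred = predict(base, intercept, weights)
numeric_effect = predict(one_unit_larger, intercept, weights) - base_pred
binary_effect = predict(with_garage, intercept, weights) - base_pred
zero_pred = predict(all_zero, intercept, weights)  # equals the intercept

print(base_pred, numeric_effect, binary_effect, zero_pred)
```

Each difference equals the corresponding weight because the other feature is held fixed while one is changed.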

Another advantage of regression is that it provides standard errors of the coefficients, from which confidence intervals and significance tests for each weight can be derived.

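Assumptions apply, such as a lack of strongly correlated independent ($X$) features. In practice, mild violations often have limited effect on predictions, though strongly correlated features inflate the standard errors of the coefficients and make individual weights harder to interpret.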