# Statistics

<h3>Random Variables</h3>

<h4>Expected Value</h4>

The expectation (average value, mean) of a continuous random variable $X$ is given by the integral of each value $x$ weighted by its PDF $f_X(x)$.

$\mu = E[X] = \int_{-\infty}^{\infty} x f_X(x) ~dx$


<h4>Variance</h4>

$Var(X) = E[(X-E[X])^2] = E[X^2] - (E[X])^2$


<h4>Standard Deviation</h4>

$\sigma = \sqrt{Var(X)}$


<h4>Covariance</h4>

$Cov(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X] E[Y]$


<h4>Correlation</h4>

$Corr(X,Y) = \frac{ Cov(X,Y) }{ \sqrt{Var(X) Var(Y)} }$
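The definitions above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical data set and treating the arithmetic mean of the observations as the expectation (population statistics); the helper names are illustrative only.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean, standing in for the expectation E[.]."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance via E[XY] - E[X] E[Y]."""
    products = [x * y for x, y in zip(xs, ys)]
    return mean(products) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: ys is exactly twice xs, so the relationship is perfectly linear.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_xy = covariance(xs, ys)
corr_xy = correlation(xs, ys)
print(cov_xy, corr_xy)
```

Note that `covariance(xs, xs)` is just the variance of `xs`, which is why the same helper serves both roles. The covariance depends on the units of the data, while the correlation is unitless.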

<h3>Law of Large Numbers</h3>

If you sample a random variable independently a large number of times, the measured average value should converge to the random variable's true expectation.

<h3>Central Limit Theorem</h3>

If you repeatedly draw independent samples of size $n$ from a random variable with finite variance, the distribution of the sample mean will approach a normal distribution as $n$ grows, regardless of the initial distribution of the random variable.
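Both statements can be illustrated with a small simulation. This is a sketch under stated assumptions: the exponential distribution with rate $1$ (mean $1$, variance $1$) is an arbitrary choice of a skewed distribution, and the sample size, number of trials, and seed are illustrative.

```python
import random
import statistics

def sample_means(n, trials, seed=0):
    """Draw `trials` independent samples of size n and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]

means = sample_means(n=50, trials=2000)

# Law of Large Numbers: the sample means cluster around the true mean of 1.
print(statistics.fmean(means))

# Central Limit Theorem: their spread is close to sigma / sqrt(n) = 1 / sqrt(50).
print(statistics.stdev(means), 1 / 50 ** 0.5)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying exponential distribution is strongly skewed.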

<h3>Type I and II Errors</h3>

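A type I error is a false positive (FP); a type II error is a false negative (FN).

$\alpha$ is the probability of a type I error and $\beta$ is the probability of a type II error. $1-\alpha$ is the confidence level, $1-\beta$ is the power.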

<h2>MLE and MAP</h2>

In MLE, the goal is to estimate the most likely parameters given a likelihood function:

$\theta_{MLE} = \arg\max_\theta ~L(\theta)$, where $L(\theta) = f_n(x_1 \ldots x_n | \theta)$

If the values of X are assumed to be IID, then the likelihood function becomes the following:

$L(\theta) = \prod_{i=1}^n f(x_i|\theta)$

The natural logarithm is then taken prior to calculating the maximum, changing the operation from a product to a sum.

$\log L(\theta) = \sum_{i=1}^n \log f(x_i | \theta)$

MAP additionally assumes a prior distribution $g(\theta)$ over the parameters:

$\theta_{MAP} = \arg\max_\theta ~g(\theta) f(x_1 \ldots x_n | \theta)$

...

# Problems

<h4>Question 1</h4>

Explain the Central Limit Theorem, and why it is useful.

<i>Answer:</i>

The CLT states that if a random variable with finite variance, regardless of its distribution, is sampled independently a large enough number of times, the sample mean will be approximately normally distributed. This lets us apply normal-based tools, such as confidence intervals and hypothesis tests for the mean, to data from almost any distribution as long as the sample size is large enough.

<h4>Question 2</h4>

How would you explain a confidence interval to a non-technical audience?

<i>Answer:</i>

A confidence interval is a range of values, with a lower and upper bound, computed from a sample to estimate a parameter of interest. If you were to repeat the sampling and rebuild the interval a large number of times, about $95\%$ of the resulting $95\%$ confidence intervals would contain the true value of the parameter. We can construct a confidence interval using the sample mean and the sample standard deviation.

<h4>Question 3</h4>

What are some common pitfalls encountered in A/B testing?

<i>Answer:</i>

- Groups may not be balanced, leading to highly skewed results. The groups need to be balanced across all relevant dimensions; otherwise, a statistically significant result may be due to factors that were not controlled for rather than to the change being tested.
- Not running an experiment for long enough to collect an adequate sample.
- Running multiple tests at once without accounting for it: the tests may interact with each other, and each additional comparison raises the chance of a false positive.

<h4>Question 4</h4>

Explain covariance and correlation formulaically; compare and contrast them.

<i>Answer:</i>

<i>refer to notes above</i>

<h4>Question 5</h4>

You flip a coin $10$ times and observe only $1$ heads. What would be your null hypothesis and p-value for testing whether the coin is fair or not?

<i>Answer:</i>

The null hypothesis is that the coin is fair. The alternative hypothesis is that the coin is biased toward tails.

$H_0: p_0 = 0.5$

$H_A: p_1 \lt 0.5$

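There are $2^{10} = 1024$ equally likely outcomes under $H_0$, and in only $10$ of them are there $9$ tails and $1$ heads, so the probability of the observed result is $10/1024 \approx 0.0098$. The p-value also counts outcomes at least as extreme in the direction of $H_A$ (here, $0$ heads, which adds $1$ outcome), giving $11/1024 \approx 0.0107$. Since this is below $0.05$, we can reject the $H_0$ at the $5\%$ significance level.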
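The counting argument can be verified with the binomial probability mass function. The sketch below computes both the probability of exactly $1$ head and the one-sided tail probability of at most $1$ head (the quantity conventionally reported as the p-value for $H_A: p \lt 0.5$); the helper name `binom_pmf` is an illustrative choice.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p0 = 10, 0.5

# Probability of the observed result: exactly 1 head.
prob_exactly_one = binom_pmf(1, n, p0)

# One-sided p-value: results at least as extreme toward tails (0 or 1 heads).
p_value_lower = sum(binom_pmf(k, n, p0) for k in range(0, 2))

print(prob_exactly_one, p_value_lower)
```

Either number is below the usual $0.05$ threshold, so the conclusion is the same in this case.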

<h4>Question 6</h4>

Describe hypothesis testing and p-values in layman's terms.

<i>Answer:</i>

- Hypothesis testing is the process of testing whether data supports particular hypotheses, and involves measuring parameters of a population's probability distribution.

- A p-value is the probability of observing results at least as extreme as the ones obtained, assuming the $H_0$ is true. The lower the p-value, the stronger the evidence against the $H_0$.

<h4>Question 7</h4>

Describe what type I and type II errors are, and the trade-offs between them.

<i>Answer:</i>

- Type I error is when one rejects the $H_0$ when it is correct, known as a $FP$. We detect a difference, when in reality there is no significant difference.

- Type II error is when the $H_0$ is not rejected when the $H_A$ is correct, known as a $FN$. We fail to detect a difference, when in reality there is a significant difference.

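The probability of a type I error is the significance level $\alpha$, whereas the probability of a type II error is $\beta$. We refer to $1-\alpha$ as the confidence level and $1-\beta$ as the power of the test. We want both $\alpha$ and $\beta$ to be small, but for a fixed sample size, lowering $\alpha$ raises $\beta$; reducing both requires a larger sample.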

<h4>Question 8</h4>

Explain the statistical background behind power.

<i>Answer:</i>

<i>Explained in question above</i>

<h4>Question 9</h4>

What is a Z-test and when would you use it vs. a t-test?

<i>Answer:</i>

We can use either test only if the mean is normally distributed, which is possible if:

a) the initial population is normally distributed, or 

b) the sample size is large enough to apply the CLT ($n \ge 30$)

In general, we use Z-tests if the population variance is known, and a t-test if population variance is unknown. But when sample size is very large ($n \ge 200$), the t-distribution will closely resemble the normal distribution regardless.

<h4>Question 10</h4>

Say you are testing hundreds of hypotheses, each with a t-test. What considerations should you take into account?

<i>Answer:</i>

As the number of tests increases, the chance that at least one of the t-tests yields a statistically significant p-value by chance alone becomes high. That is, the probability of at least one type I error ($FP$) increases.

The Bonferroni correction sets the significance threshold to $\alpha/m$, where $m$ is the number of tests being performed. While this helps to protect from type I error, it is conservative and therefore prone to type II error ($FN$). It is mostly useful when there is a small number of multiple comparisons of which a few are significant.
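To make the effect concrete, here is a minimal sketch of the correction together with the family-wise error rate it guards against. The p-values are hypothetical, and the error-rate formula assumes the tests are independent and all null hypotheses are true.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis only if its p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# With 100 uncorrected tests, a false positive is almost guaranteed.
print(familywise_error_rate(100))

# Hypothetical p-values from 4 tests: the threshold becomes 0.05 / 4 = 0.0125.
decisions = bonferroni_reject([0.001, 0.02, 0.04, 0.5])
print(decisions)
```

Two of the hypothetical p-values fall below $0.05$ but above the corrected threshold, which illustrates how the correction trades false positives for false negatives.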

<h4>Question 11</h4>

How would you derive a confidence interval for the probability of flipping heads from a series of coin tosses?

<i>Answer:</i>

A confidence interval is an interval that contains the true population parameter with degree of confidence $1-\alpha$. For a series of coin tosses, the number of heads follows a binomial distribution. If the series is large enough (the number of successes and the number of failures are each at least $10$), we can utilize the CLT and the normal approximation to the binomial distribution, where $\hat{p}$ is the observed proportion of heads and $n$ is the number of tosses:

$\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$
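The interval can be computed directly from the formula. The sketch below uses `statistics.NormalDist` from the Python standard library to obtain $z_{\alpha/2}$; the function name `wald_interval` and the example counts are illustrative assumptions.

```python
from statistics import NormalDist

def wald_interval(heads, n, confidence=0.95):
    """Normal-approximation confidence interval for the probability of heads."""
    p_hat = heads / n
    # z_{alpha/2}: the standard normal quantile leaving alpha/2 in the upper tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width

# Hypothetical experiment: 55 heads in 100 tosses.
low, high = wald_interval(heads=55, n=100)
print(low, high)
```

For these counts the interval runs from roughly $0.45$ to $0.65$, so it still contains $0.5$.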

<h4>Question 13</h4>

What is the expected number of rolls needed to see all sides of a fair die?

<i>Answer:</i>

Let $k$ denote the number of distinct sides seen from rolls. The first roll will always result in a new side being seen. If you have seen $k$ sides, where $k \lt 6$, then the probability of rolling an unseen value will be $(6-k)/6$, since there are $6-k$ values you have not seen, and $6$ possible outcomes of each roll.

Each roll is independent of previous rolls. Therefore, for the second roll ($k=1$), the time until a side not seen appears has a geometric distribution with $p=5/6$, after two sides ($k=2$), $p=4/6$, etc.

The mean for a geometric distribution is given by $1/p$. Let $X$ be the number of rolls needed to show all $6$ sides.

$E[X] = 1 + \frac{6}{5} + \frac{6}{4} + \ldots + \frac{6}{1}$

$E[X] = 6 \sum_{j=1}^6 \frac{1}{j} = 14.7 \text{ rolls}$
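The sum of geometric means can be evaluated exactly with rational arithmetic. This is a small sketch; the function name `expected_rolls` is an illustrative choice.

```python
from fractions import Fraction

def expected_rolls(sides=6):
    """Sum the geometric means 1/p, where p = (sides - k) / sides after k sides are seen."""
    return sum(Fraction(sides, sides - k) for k in range(sides))

result = expected_rolls()
print(result, float(result))
```

Using `Fraction` avoids floating-point rounding and shows the answer is exactly $147/10$.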

<h4>Question 15</h4>

A coin is flipped $1000$ times, and $550$ times it shows heads. Do you think the coin is biased?

<i>Answer:</i>

Because the sample size is large, we can use the CLT. Each flip is a Bernoulli random variable, with $p$ as the probability of heads. We want to test whether $p=0.5$.

The number of heads seen out of $n$ total flips follows a binomial distribution. If the coin is not biased, then the expected number of heads is:

$\mu = np = 1000 \cdot 0.5 = 500$

The variance and standard deviation are calculated as:

$\sigma^2 = np(1-p)$

$\sigma^2 = 1000 \cdot 0.5 \cdot 0.5 = 250$

$\sigma = \sqrt{250} \approx 15.81$

Since this mean and standard deviation specify the approximating normal distribution, we can calculate the corresponding z-score for $550$ heads as follows:

$z = \frac{550-500}{15.81} = 3.16$

This means that, if the coin were fair, seeing $550$ or more heads would occur with probability $\lt 0.1\%$, so the coin is very likely biased.
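The calculation can be reproduced with the normal approximation to the binomial distribution. This sketch computes the mean $np$, the standard deviation $\sqrt{np(1-p)}$, the z-score, and the tail probabilities; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, heads, p0 = 1000, 550, 0.5

mu = n * p0                        # expected heads under a fair coin: 500
sigma = sqrt(n * p0 * (1 - p0))    # standard deviation: sqrt(250)
z = (heads - mu) / sigma

# Probability of a z-score at least this large under the fair-coin hypothesis.
upper_tail = 1 - NormalDist().cdf(z)
two_sided = 2 * upper_tail

print(z, upper_tail, two_sided)
```

The one-sided tail is below $0.1\%$; the two-sided version, which also counts equally extreme shortfalls of heads, is about twice as large and still far below $5\%$.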

<h4>Question 16</h4>

You are drawing from a normally distributed random variable once a day. What is the approximate number of days until you get a value greater than $2$?

<i>Answer:</i>

Assuming $X$ follows the standard normal distribution ($\mu = 0$, $\sigma = 1$), we employ the CDF of the normal distribution.

$\Phi(2) = P(X \le 2) = P(X \le \mu + 2 \sigma) = 0.9772$

Therefore, $P(X \gt 2) = 1 - 0.9772 = 0.0228$ for any given day. Since each day's draw is independent, the number of days until drawing an $X \gt 2$ follows a geometric distribution with success probability $p = 0.0228$.

Letting $T$ be a random variable denoting the number of days, we have the following:

$E[T] = \frac{1}{p} = \frac{1}{0.0228} \approx 44 ~\text{days}$
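The same numbers can be obtained without a lookup table. The sketch below assumes a standard normal distribution (mean $0$, standard deviation $1$) and uses `statistics.NormalDist` for the CDF.

```python
from statistics import NormalDist

# Probability that a single standard normal draw exceeds 2.
p_exceed = 1 - NormalDist(mu=0, sigma=1).cdf(2)

# Mean of a geometric distribution with success probability p is 1 / p.
expected_days = 1 / p_exceed

print(p_exceed, expected_days)
```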

<h4>Question 17</h4>

Say you have two random variables $X$ and $Y$, each with a standard deviation. What is the variance of $aX + bY$ for constants $a$ and $b$?

<i>Answer:</i>

The variance of a sum of variables is assessed as:

$Var(X+Y) = Var(X) + Var(Y) + 2 ~Cov(X,Y)$

and scaling a random variable by a constant $a$ multiplies its variance by $a^2$, i.e., $Var(aX) = a^2 Var(X)$. We have:

$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab ~Cov(X,Y)$

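This gives the bounds on the variance: since $|Cov(X,Y)| \le \sigma_X \sigma_Y$, it lies between $(|a| \sigma_X - |b| \sigma_Y)^2$ and $(|a| \sigma_X + |b| \sigma_Y)^2$, and where it falls in that range depends on the covariance between $X$ and $Y$.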
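The identity for $Var(aX + bY)$ can be sanity-checked on data. This is a sketch assuming small hypothetical samples and population (not sample) statistics, for which the identity holds exactly; `pcov` is an illustrative helper.

```python
from statistics import fmean, pvariance

def pcov(xs, ys):
    """Population covariance E[(X - E[X])(Y - E[Y])]."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Hypothetical paired observations and constants.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 2.0]
a, b = 2.0, -3.0

# Left-hand side: variance of the combined variable computed directly.
direct = pvariance([a * x + b * y for x, y in zip(xs, ys)])

# Right-hand side: a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
formula = a**2 * pvariance(xs) + b**2 * pvariance(ys) + 2 * a * b * pcov(xs, ys)

print(direct, formula)
```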

<h4>Question 19</h4>

Say you have an unfair coin which lands heads $60\%$ of the time. How many coin flips are needed to detect that the coin is unfair?

<i>Answer:</i>

Each flip is a Bernoulli trial with success probability $p$. We can calculate a confidence interval for $p$ using the CLT. We construct a $95\%$ confidence interval ($z=1.96$); if the interval excludes $0.5$, we can reject the $H_0$ that the coin is fair. The smallest such $n$ is found by setting the lower bound of the interval equal to $0.5$ with $p = 0.6$.

$p \pm z \sqrt{ \frac{p(1-p)}{n} }$

$0.6 - 1.96 \sqrt{ \frac{0.6(1-0.6)}{n} } = 0.5$

Solving for $n$ yields $93$ flips.
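Rearranging the equation gives $n = z^2 p(1-p) / (p - 0.5)^2$, rounded up to a whole flip. The sketch below evaluates this; the function name `flips_needed` and its parameters are illustrative, and it takes the true heads probability as known, as the question does.

```python
from math import ceil
from statistics import NormalDist

def flips_needed(p=0.6, p0=0.5, confidence=0.95):
    """Smallest n for which the interval's lower bound reaches p0."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / (p - p0) ** 2)

print(flips_needed())
```

Biases closer to $0.5$ need far more flips, because $n$ grows with the inverse square of the gap $p - 0.5$.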

<h4>Question 20</h4>

Say you have $n$ numbers $1, \ldots, n$, and you uniformly sample from this distribution with replacement $n$ times. What is the expected number of distinct values you would draw?

<i>Answer:</i>

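Let $X_i = 1$ if $i$ is drawn at least once in the $n$ draws, and $X_i = 0$ otherwise. The number of distinct values drawn is then $\sum_{i=1}^n X_i$.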

We know that $p(X_i=1) = 1 - p(X_i=0)$, so the probability of a number not being drawn (where each draw is independent) is the following:

$p(X_i=0) = \left( \frac{n-1}{n} \right)^n$

Therefore, we have $p(X_i=1) = 1 - \left( \frac{n-1}{n} \right)^n$

and by linearity of expectation, we have:

$E \left[ \sum_{i=1}^n X_i \right] = \sum_{i=1}^n E[X_i] = n ~E[X_i] = n \left( 1 - \left( \frac{n-1}{n} \right)^n \right)$
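The closed form can be compared against a direct simulation. This is a sketch with illustrative choices for $n$, the number of trials, and the seed; for large $n$ the expression approaches $n(1 - 1/e)$, since $\left( \frac{n-1}{n} \right)^n \to 1/e$.

```python
import random
from math import exp

def expected_distinct(n):
    """Closed form: n * (1 - ((n - 1) / n) ** n)."""
    return n * (1 - ((n - 1) / n) ** n)

def simulate_distinct(n, trials=2000, seed=0):
    """Average number of distinct values over repeated rounds of n draws."""
    rng = random.Random(seed)
    total = sum(len({rng.randint(1, n) for _ in range(n)}) for _ in range(trials))
    return total / trials

print(expected_distinct(10), simulate_distinct(10))

# For large n, the expected fraction of distinct values is close to 1 - 1/e.
print(expected_distinct(1000) / 1000, 1 - exp(-1))
```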

<h4>Question 23</h4>

Derive the mean and variance of the Uniform distribution U(a,b).

<i>Answer:</i>

$f_X(x) = \frac{1}{b-a}$ for $a \le x \le b$ (and $0$ otherwise)

Therefore, we calculate the mean as:

$E[X] = \int_a^b x ~f_X(x) ~dx = \int_a^b \frac{x}{b-a} ~dx = \frac{x^2}{2(b-a)} \big\vert_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}$

Similarly, the variance can be expressed as:

$Var(X) = E[X^2] - E[X]^2$

Giving us:

$E[X^2] = \int_a^b x^2 f_X(x) ~dx = \int_a^b \frac{x^2}{b-a} ~dx = \frac{x^3}{3(b-a)} \big\vert_a^b = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}$

Therefore:

$Var(X) = \frac{a^2 + ab + b^2}{3} - \left( \frac{a+b}{2} \right)^2 = \frac{(b-a)^2}{12}$
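The results $\frac{a+b}{2}$ and $\frac{(b-a)^2}{12}$ can be checked by integrating numerically. The sketch below uses a midpoint Riemann sum, an arbitrary choice of integration scheme, and hypothetical endpoints $a = 2$, $b = 5$.

```python
def uniform_moments(a, b, steps=100_000):
    """Approximate E[X] and Var(X) for U(a, b) with a midpoint Riemann sum."""
    width = (b - a) / steps
    density = 1 / (b - a)  # the PDF is constant on [a, b]
    midpoints = [a + (i + 0.5) * width for i in range(steps)]
    mean = sum(x * density * width for x in midpoints)
    second_moment = sum(x * x * density * width for x in midpoints)
    return mean, second_moment - mean**2

a, b = 2.0, 5.0
mean, variance = uniform_moments(a, b)

# Compare with the closed forms (a + b) / 2 and (b - a)^2 / 12.
print(mean, (a + b) / 2)
print(variance, (b - a) ** 2 / 12)
```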

<h4>Question 29</h4>

What are MLE and MAP? What is the difference between the two?

<i>Answer:</i>

MLE and MAP are ways of estimating the parameters of a probability distribution, each producing a single point estimate.

<br>
<u><i>MLE:</i></u>

Assume we have a likelihood function $P(X|\theta)$. Given $n$ IID samples, the MLE is:

$\theta_{MLE} = \arg\max_{\theta} P(X|\theta) = \arg\max_{\theta} \prod_{i=1}^n P(x_i | \theta)$

Maximizing the log of the product is more convenient, a) to avoid computer rounding errors (underflow from multiplying many small probabilities), and b) because the log of a product is equal to the sum of logs.

$\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n ~\log ~P(x_i | \theta)$

<br>
<u><i>MAP:</i></u>

MAP uses the posterior $P(\theta|X)$, which is proportional to the likelihood multiplied by a prior $P(\theta)$; i.e., $P(X|\theta) P(\theta)$. The MAP for $\theta$ is:

$\theta_{MAP} = \arg\max_{\theta} P(X|\theta) P(\theta) = \arg\max_{\theta} P(\theta) \prod_{i=1}^n P(x_i | \theta)$

Employing the same math used in calculating the MLE, the MAP becomes

$\theta_{MAP} = \arg\max_{\theta} \left[ \sum_{i=1}^n ~\log ~P(x_i | \theta) + \log ~P(\theta) \right]$

MLE can be seen as a special case of the MAP with a uniform prior.
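A coin-flip example makes the difference concrete. The sketch below is built on assumptions not in the notes above: a Bernoulli likelihood, a Beta($\alpha$, $\beta$) prior (written up to its normalizing constant, which does not affect the argmax), hypothetical data of $7$ heads in $10$ flips, and a simple grid search in place of a proper optimizer.

```python
from math import log

def log_likelihood(theta, heads, n):
    """Bernoulli log-likelihood: sum of log P(x_i | theta) over n flips."""
    return heads * log(theta) + (n - heads) * log(1 - theta)

def log_beta_prior(theta, alpha, beta):
    """Log of an unnormalized Beta(alpha, beta) prior density."""
    return (alpha - 1) * log(theta) + (beta - 1) * log(1 - theta)

def argmax_on_grid(objective, steps=10_000):
    """Return the grid point in (0, 1) that maximizes the objective."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=objective)

heads, n = 7, 10  # hypothetical data

theta_mle = argmax_on_grid(lambda t: log_likelihood(t, heads, n))

# A Beta(5, 5) prior favors values near 0.5 and pulls the estimate toward it.
theta_map = argmax_on_grid(
    lambda t: log_likelihood(t, heads, n) + log_beta_prior(t, 5, 5)
)

# A Beta(1, 1) prior is uniform, so MAP coincides with MLE.
theta_flat = argmax_on_grid(
    lambda t: log_likelihood(t, heads, n) + log_beta_prior(t, 1, 1)
)

print(theta_mle, theta_map, theta_flat)
```

For this model the closed forms are $h/n$ for the MLE and $(h + \alpha - 1)/(n + \alpha + \beta - 2)$ for the MAP, so the grid search should land near $0.7$ and $11/18 \approx 0.611$ respectively.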

<h4>Question 30</h4>

Say you are given a random Bernoulli trial generator. How would you generate values from a standard normal distribution?

<i>Answer:</i>

The number of successes in $n$ Bernoulli trials follows a binomial distribution with probability $p$ of success ($x_i=1$ means success and $x_i=0$ means failure). Assuming IID trials, we can compute the sample proportion $\hat{p}$ as follows:

$\hat{p} = \frac{1}{n} \sum_{i=1}^n x_i$

We know that if $n$ is large enough, the sample proportion is approximately normally distributed:

$\hat{p} \sim \mathcal{N} \left( p, \frac{p(1-p)}{n} \right)$

where we need $np \ge 10$ and $n(1-p) \ge 10$ (so $n \ge 20$ for $p=0.5$).

To simulate the standard normal distribution, we normalize $\hat{p}$.

$\hat{p_0} = \frac{ \hat{p} - p }{ \sqrt{ \frac{p(1-p)}{n} } }$

At this point, we can derive the final formula for the random number generator.

$\bar{x} = \frac{ \frac{1}{n} \sum_{i=1}^n x_i - p }{ \sqrt{ \frac{p(1-p)}{n} } }$

which can be simplified to:

$\bar{x} = \frac{ \sum_{i=1}^n x_i - np }{ \sqrt{ np(1-p) } }$
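The final formula translates into a short generator. In this sketch, `rng.random() < p` stands in for the Bernoulli trial generator given in the question (an assumption), and the number of trials per draw, the number of draws, and the seed are illustrative.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

def approx_standard_normal(rng, n=400, p=0.5):
    """One approximately standard normal value from n Bernoulli(p) trials."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    # Standardize the success count: (sum x_i - n p) / sqrt(n p (1 - p)).
    return (successes - n * p) / sqrt(n * p * (1 - p))

rng = random.Random(0)
draws = [approx_standard_normal(rng) for _ in range(2000)]

# The draws should have mean near 0 and standard deviation near 1.
print(fmean(draws), pstdev(draws))
```

Because the success count is an integer, the output takes only discrete values spaced $1/\sqrt{np(1-p)}$ apart; increasing $n$ makes the approximation finer at the cost of more Bernoulli trials per draw.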