# Cheat Sheet

The source of the content is freely available online:

- https://drive.google.com/file/d/1VmkAAGOYCTORq1wxSQqy255qLJjTNvBI/view
- https://projects.iq.harvard.edu/stat110/

<h2 style="color: blue;">Counting</h2>

<h4>Multiplication Rule</h4>

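Say we have a compound experiment (an experiment with multiple components). If the first component has $n_1$ possible outcomes, the second component has $n_2$ possible outcomes, $\ldots$, and the $r^{th}$ component has $n_r$ possible outcomes, then overall there are $n_1 \cdot n_2 \cdot \ldots \cdot n_r$ possible outcomes.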
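
As a minimal sketch of the multiplication rule, the snippet below enumerates a compound experiment and compares the number of outcomes with the product of the component sizes. The ice cream menu is a hypothetical example, not part of the source.

```python
from itertools import product
from math import prod

# Hypothetical compound experiment: choose a cone, a flavor, and a topping.
components = [
    ["cake", "waffle"],                  # n_1 = 2
    ["vanilla", "chocolate", "mint"],    # n_2 = 3
    ["none", "sprinkles"],               # n_3 = 2
]

outcomes = list(product(*components))           # every possible outcome
predicted = prod(len(c) for c in components)    # n_1 * n_2 * n_3

print(len(outcomes), predicted)
```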

<h4>Sampling Table</h4>

**in notes**

<h2 style="color: blue;">Conditional Probability</h2>

<h4>Intersections via Conditioning</h4>

$P(A,B) = P(A) P(B|A)$

$P(A,B,C) = P(A) P(B|A) P(C|A,B)$

<h4>Unions via Inclusion-Exclusion</h4>

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$
P(A \cup B \cup C) = 
P(A) + P(B) + P(C) - 
P(A \cap B) - P(A \cap C) - P(B \cap C) + 
P(A \cap B \cap C)
$
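
The inclusion-exclusion formulas can be checked on a small finite sample space. This is a sketch under an assumed setup (one roll of a fair die, with events chosen only for illustration); exact fractions avoid rounding issues.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
omega = set(range(1, 7))

def prob(event):
    """Probability of an event (a subset of omega) under equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}       # roll is even
B = {4, 5, 6}       # roll is greater than 3
C = {1, 2, 3, 4}    # roll is at most 4

lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))

print(lhs, rhs)
```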


<h4>Law of Total Probability</h4>

Let $B_1, B_2, \ldots, B_n$ be a partition of the sample space (i.e., they are disjoint and their union is the entire sample space).

$P(A) = P(A|B_1) P(B_1) + \ldots + P(A|B_n) P(B_n)$

$P(A) = P(A \cap B_1) + \ldots + P(A \cap B_n)$

<h4>Law of Total Probability with Extra Conditioning</h4>

$P(A|C) = P(A|B_1,C) P(B_1|C) + \ldots + P(A|B_n,C) P(B_n|C)$

$P(A|C) = P(A \cap B_1 | C) + \ldots + P(A \cap B_n | C)$

<h4>Law of Total Probability with $B$ and $B^C$ as Partition</h4>

$P(A) = P(A|B) P(B) + P(A|B^C) P(B^C)$

$P(A) = P(A \cap B) + P(A \cap B^C)$

<h4>Bayes' Rule</h4>

$P(A|B) = \frac{ P(B|A) P(A) }{P(B)}$

$P(A|B,C) = \frac{ P(B|A,C) P(A|C) }{P(B|C)}$

We can also write:

$P(A|B,C) = \frac{ P(A,B,C) }{P(B,C)} = \frac{ P(B,C|A) P(A) }{P(B,C)}$

<h5>Odds Form of Bayes' Rule</h5>

$\frac{P(A|B)}{P(A^C|B)} = \frac{P(B|A)}{P(B|A^C)} \frac{P(A)}{P(A^C)}$

The posterior odds of A are the likelihood ratio times the prior odds.
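
A short numeric sketch ties Bayes' rule, the law of total probability, and the odds form together. The prior and test accuracies below are made-up illustrative numbers, with $A$ standing for "has the condition" and $B$ for "test is positive".

```python
from fractions import Fraction

# Hypothetical numbers: A = "has the condition", B = "test is positive".
p_A = Fraction(1, 100)             # prior P(A)
p_B_given_A = Fraction(95, 100)    # P(B|A)
p_B_given_Ac = Fraction(5, 100)    # P(B|A^C)

# Law of total probability with A and A^C as the partition.
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' rule.
p_A_given_B = p_B_given_A * p_A / p_B

# Odds form: posterior odds = likelihood ratio * prior odds.
posterior_odds = (p_B_given_A / p_B_given_Ac) * (p_A / (1 - p_A))

print(p_A_given_B, posterior_odds)
```

Converting the posterior probability to odds, `p_A_given_B / (1 - p_A_given_B)`, gives the same value as `posterior_odds`, so the two forms agree.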

<h2 style="color: blue;">RVs and Their Distributions</h2>

<h4>Probability Mass Function (PMF)</h4>

Gives the probability a discrete RV takes on the value $x$.

$p_X(x) = P(X=x)$

The PMF satisfies $p_X(x) \ge 0$ and $\sum_x p_X(x) = 1$.

<h4>Cumulative Distribution Function (CDF)</h4>

Gives the probability that a RV is less than or equal to $x$.

$F_X(x) = P(X \le x)$

The CDF is a nondecreasing, right-continuous function with $F_X(x) \rightarrow 0$ as $x \rightarrow -\infty$ and $F_X(x) \rightarrow 1$ as $x \rightarrow \infty$.

<h4>Independence</h4>

Two RVs are independent if knowing the value of one gives no information about the other. For discrete RVs, this means that for all $x$ and $y$:

$P(X=x, Y=y) = P(X=x) P(Y=y)$

<h2 style="color: blue;">Expected Value</h2>

<h4>Expected Value (a.k.a. mean, expectation, average)</h4>

A weighted average of the possible outcomes of a RV.

$E(X) = \sum_i x_i P(X=x_i)$

<h4>Linearity</h4>

For any RVs $X$ and $Y$, and constants $a$, $b$, and $c$:
    
$E(aX + bY + c)  = aE(X) + bE(Y) + c$

<h4>Conditional Expected Value</h4>

Defined like expectation, only conditioned on an event $A$.

$E(X|A) = \sum_x x P(X=x|A)$

<h4>Variance and Standard Deviation</h4>

$Var(X) = E\left( (X - E(X))^2 \right) = E(X^2) - (E(X))^2$

$SD(X) = \sqrt{Var(X)}$
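
The expectation and variance formulas translate directly into sums over a PMF. The sketch below uses a fair six-sided die as an assumed example and stores the PMF as a `{value: probability}` dictionary; the helper name `expectation` is illustrative only.

```python
from fractions import Fraction
from math import sqrt

# PMF of a fair six-sided die, stored as {value: probability}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """Weighted average of g(x) over the PMF; with the default g this is E(X)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                                  # E(X)
var = expectation(pmf, lambda x: x**2) - mean**2         # E(X^2) - (E(X))^2
var_def = expectation(pmf, lambda x: (x - mean)**2)      # E((X - E(X))^2)
sd = sqrt(var)                                           # SD(X)

print(mean, var, sd)
```

Both variance expressions give the same value, which is a quick check that the two forms of the definition agree.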

<h2 style="color: blue;">Continuous RVs, LOTUS</h2>

<h4>Probability Density Function (PDF)</h4>

The derivative of the CDF: $f(x) = F'(x)$. The expected value of a continuous RV with PDF $f$ is:

$E(X) = \int_{-\infty}^{\infty} x f(x) ~dx$

<h4>Law of the Unconscious Statistician (LOTUS)</h4>

States that you can find the expected value of a function of a random variable, $g(X)$, by replacing the $x$ in front of the PMF/PDF by $g(x)$ but still working with the PMF/PDF of $X$.

$E(g(X)) = \sum_x g(x) P(X=x)$

$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x) ~dx$
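
For the continuous case, LOTUS can be approximated numerically. The sketch below assumes $X \text{~} Unif(0,1)$ and $g(x) = x^2$, and uses a simple midpoint rule; the function name `lotus_continuous` is hypothetical.

```python
def lotus_continuous(g, pdf, lo, hi, steps=100_000):
    """Approximate the integral of g(x) * pdf(x) over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width    # midpoint of the i-th subinterval
        total += g(x) * pdf(x)
    return total * width

def unif_pdf(x):
    """PDF of Unif(0, 1) on its support [0, 1]."""
    return 1.0

second_moment = lotus_continuous(lambda x: x**2, unif_pdf, 0.0, 1.0)
print(second_moment)   # close to 1/3
```

Note that the integral uses the PDF of $X$ itself; the distribution of $X^2$ is never needed.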

<h4>Moment Generating Functions</h4>

For any RV $X$, the function $M_X(t) = E(e^{tX})$ is the MGF of $X$.

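Moments describe the shape of a distribution. Let $\mu_k = E(X^k)$ denote the $k^{th}$ moment of $X$, and let $m_k = E\left( \left( \frac{X - \mu}{\sigma} \right)^k \right)$ denote the $k^{th}$ standardized moment, where $\mu = E(X)$ and $\sigma = SD(X)$.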

Mean: $E(X) = \mu_1$

Variance: $Var(X) = \mu_2 - \mu_1^2$

Skewness: $Skew(X) = m_3$

Kurtosis: $Kurt(X) = m_4 - 3$
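
A standard property of the MGF, not stated above, is that its $k^{th}$ derivative at $t = 0$ equals the $k^{th}$ moment $E(X^k)$. The sketch below checks this with finite differences for an assumed $X \text{~} Bern(p)$, whose MGF is $1 - p + p e^t$; the value of $p$ and the step size are arbitrary choices.

```python
from math import exp

p = 0.3   # hypothetical success probability for X ~ Bern(p)

def mgf(t):
    """M_X(t) = E(e^{tX}) for X ~ Bern(p): (1 - p) * e^0 + p * e^t."""
    return (1 - p) + p * exp(t)

h = 1e-4
first_moment = (mgf(h) - mgf(-h)) / (2 * h)               # approximates M'(0) = E(X)
second_moment = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h**2    # approximates M''(0) = E(X^2)

variance = second_moment - first_moment**2                # approximately p * (1 - p)
print(first_moment, second_moment, variance)
```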

<h2 style="color: blue;">Joint PDFs and CDFs</h2>

<h4>Joint Distributions</h4>

The joint CDF of $X$ and $Y$ is:
    
$F(x,y) = P(X \le x, Y \le y)$

In the discrete case, $X$ and $Y$ have a joint PMF.

$p_{X,Y}(x,y) = P(X=x, Y=y)$

In the continuous case, they have a joint PDF.

$f_{X,Y}(x,y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x,y)$

The joint PMF/PDF is nonnegative and sums/integrates to $1$.

<h4>Conditional Distributions</h4>

<h5>Conditioning and Bayes' Rule for Discrete RVs</h5>

$P(Y=y|X=x) = \frac{P(X=x,Y=y)}{P(X=x)} = \frac{P(X=x|Y=y) P(Y=y)}{P(X=x)}$

<br>
<h5>Conditioning and Bayes' Rule for Continuous RVs</h5>

$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{ f_{X|Y}(x|y) f_Y(y) }{f_X(x)}$

<br>
<h5>Hybrid Bayes' Rule</h5>

$f_X(x|A) = \frac{ P(A|X=x) f_X(x) }{P(A)}$

<h4>Marginal Distributions</h4>

To find the distribution of one (or more) RVs from a joint PMF/PDF, sum/integrate over the unwanted RVs.

From joint PMF:

$P(X=x) = \sum_y P(X=x,Y=y)$

From joint PDF:

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) ~dy$
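
In the discrete case, marginalizing and conditioning are just sums and ratios over a table. The joint PMF below is a made-up example, stored as a `{(x, y): probability}` dictionary; the helper name `marginal` is illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF of (X, Y), stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

def marginal(joint, axis):
    """Sum the joint PMF over the other variable (axis 0 keeps X, axis 1 keeps Y)."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

p_X = marginal(joint, 0)
p_Y = marginal(joint, 1)

# Conditional PMF of Y given X = 0: P(Y=y | X=0) = P(X=0, Y=y) / P(X=0).
p_Y_given_X0 = {y: joint[(0, y)] / p_X[0] for y in p_Y}

print(p_X, p_Y, p_Y_given_X0)
```

In this example $P(X=0, Y=0) = \frac{1}{8}$ while $P(X=0) P(Y=0) = \frac{3}{16}$, so $X$ and $Y$ are not independent.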

<h4>Independence of RVs</h4>

RVs $X$ and $Y$ are independent if and only if the joint CDF is the product of the marginal CDFs or, equivalently, the joint PMF/PDF is the product of the marginal PMFs/PDFs.

<h4>Covariance and Transformations</h4>

$Cov(X,Y) = E((X-E(X))(Y-E(Y))) = E(XY) - E(X) E(Y)$

$Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X) Var(Y)}}$

<h5>Properties</h5>

For RVs $W$, $X$, $Y$, $Z$, and constants $a$, $b$:

$Cov(X,Y) = Cov(Y,X)$

$Cov(X+a, Y+b) = Cov(X,Y)$

$Cov(aX, bY) = ab Cov(X,Y)$

$Cov(W+X, Y+Z) = Cov(W,Y) + Cov(W,Z) + Cov(X,Y) + Cov(X,Z)$

Correlation is location-invariant and scale-invariant. For constants $a$, $b$, $c$, $d$ with $a > 0$ and $c > 0$:

$Corr(aX + b, cY + d) = Corr(X,Y)$
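
The covariance properties can be verified exactly on a small joint PMF. This is a sketch with made-up probabilities and constants; RVs are represented as functions of the outcome `(x, y)`, and the helper names `E` and `cov` are illustrative.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y), stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

def E(g):
    """Expected value of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def cov(u, v):
    """Cov(U, V) = E(UV) - E(U) E(V), where u and v are functions of (x, y)."""
    return E(lambda x, y: u(x, y) * v(x, y)) - E(u) * E(v)

X = lambda x, y: x
Y = lambda x, y: y
a, b = 3, -2   # arbitrary constants

cov_xy = cov(X, Y)
cov_shift = cov(lambda x, y: x + a, lambda x, y: y + b)   # Cov(X + a, Y + b)
cov_scale = cov(lambda x, y: a * x, lambda x, y: b * y)   # Cov(aX, bY)

print(cov_xy, cov_shift, cov_scale)
```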

<h2 style="color: blue;">Continuous Distributions</h2>

<h4>Exponential Distribution</h4>

<h5>Properties</h5>

Memorylessness: the Exponential distribution is the only continuous memoryless distribution.

$P(X \gt s+t | X \gt s) = P(X \gt t)$

Minimum of Exponentials: if we have independent $X_i \text{~} Expo(\lambda_i)$, then $min(X_1, X_2, \ldots, X_k) \text{~} Expo(\lambda_1 + \lambda_2 + \ldots + \lambda_k)$

Maximum of Exponentials: if we have IID $X_i \text{~} Expo(\lambda)$, then $max(X_1, \ldots, X_k)$ has the same distribution as $Y_1 + Y_2 + \ldots + Y_k$, where $Y_j \text{~} Expo(j \lambda)$ and the $Y_j$ are independent.
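
Memorylessness and the minimum-of-exponentials property both follow from the survival function $P(X \gt x) = e^{-\lambda x}$. The sketch below checks them numerically, assuming independent $X_i \text{~} Expo(\lambda_i)$ for the minimum; the rates and time points are arbitrary.

```python
from math import exp, isclose

def survival(x, lam):
    """P(X > x) for X ~ Expo(lam)."""
    return exp(-lam * x)

lam, s, t = 0.5, 2.0, 3.0   # hypothetical rate and time points

# Memorylessness: P(X > s + t | X > s) = P(X > s + t) / P(X > s) = P(X > t).
conditional = survival(s + t, lam) / survival(s, lam)
memoryless_ok = isclose(conditional, survival(t, lam))

# Minimum of independent exponentials: P(min > x) is the product of the
# individual survival probabilities, which matches Expo(sum of the rates).
rates = [0.5, 1.0, 2.5]
x = 0.7
p_min_exceeds = 1.0
for r in rates:
    p_min_exceeds *= survival(x, r)
minimum_ok = isclose(p_min_exceeds, survival(x, sum(rates)))

print(memoryless_ok, minimum_ok)
```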

<h4>Gamma</h4>

Story: You sit waiting for shooting stars, where the waiting time for a star is distributed $Expo(\lambda)$. You want to see $n$ shooting stars before you go home. The total waiting time for the $n^{th}$ shooting star is $Gamma(n,\lambda)$.

Example: You are at a bank, and there are 3 people ahead of you. The serving time for each person is $Expo(\lambda)$ with mean $2$ minutes. The distribution of your waiting time to be served is $Gamma \left( 3,\frac{1}{2} \right)$.
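
The bank example can be simulated by summing three independent exponential service times. This is a rough sketch with a fixed seed and an arbitrary number of simulations; `random.expovariate` takes the rate $\lambda$ (not the mean) as its argument. The sample mean should land near the Gamma mean $\frac{n}{\lambda} = 6$ minutes.

```python
import random

random.seed(0)   # fixed seed so the sketch is reproducible

lam = 0.5        # rate; the mean service time is 1 / lam = 2 minutes
n_people = 3
n_sims = 100_000

# Each simulated waiting time is the sum of three independent Expo(lam) service times.
waits = [sum(random.expovariate(lam) for _ in range(n_people))
         for _ in range(n_sims)]

sample_mean = sum(waits) / n_sims
print(sample_mean)   # should be near n_people / lam = 6
```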

<h4>Chi-Squared Distribution</h4>

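A $\chi^2(n)$ distribution is the distribution of the sum of squares of $n$ independent standard normal RVs.

If $Z_1, Z_2, \ldots, Z_n$ are IID $\mathcal{N}(0,1)$ and $X = Z_1^2 + Z_2^2 + \ldots + Z_n^2$, then $X \text{~} \chi^2(n)$.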

<h2 style="color: blue;">Discrete Distributions</h2>

Distributions for Sampling Schemes:
    
**table**

<h4>Binomial Distribution</h4>

$X$ is the number of successes we will achieve in $n$ independent Bernoulli trials, each with success probability $p$.

<h5>Properties</h5>

Let $X \text{~} Bin(n,p)$ and $Y \text{~} Bin(m,p)$, with $X$ independent of $Y$.

- Redefining Success: $n-X \text{~} Bin(n,1-p)$

- Sum: $X+Y \text{~} Bin(n+m,p)$

- Conditional: $X|(X+Y=r) \text{~} HGeom(n,m,r)$

- Binomial-Poisson Relationship: $Bin(n,p)$ is approximately $Pois(\lambda)$ with $\lambda = np$ if $n$ is large and $p$ is small

- Binomial-Normal Relationship: $Bin(n,p)$ is approximately $\mathcal{N}(np, np(1-p))$ if $n$ is large and $p$ is not near $0$ or $1$.
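
Several of these properties can be checked numerically from the PMFs. The sketch below uses arbitrary parameter values, takes $\lambda = np$ for the Poisson approximation, and writes the $HGeom(n,m,r)$ PMF as $\binom{n}{k} \binom{m}{r-k} / \binom{n+m}{r}$; the helper names are illustrative.

```python
from math import comb, exp, factorial, isclose

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)

n, m, p = 4, 6, 0.3   # hypothetical parameters

def sum_pmf(r):
    """P(X + Y = r) by convolving the PMFs of X ~ Bin(n, p) and Y ~ Bin(m, p)."""
    return sum(binom_pmf(k, n, p) * binom_pmf(r - k, m, p)
               for k in range(max(0, r - m), min(n, r) + 1))

# Sum: the convolution should match the Bin(n + m, p) PMF at every point.
sum_matches = all(isclose(sum_pmf(r), binom_pmf(r, n + m, p))
                  for r in range(n + m + 1))

# Conditional: P(X = k | X + Y = r) should match the HGeom(n, m, r) PMF.
r, k = 5, 2
conditional = binom_pmf(k, n, p) * binom_pmf(r - k, m, p) / binom_pmf(r, n + m, p)
hgeom = comb(n, k) * comb(m, r - k) / comb(n + m, r)

# Binomial-Poisson: large n, small p, lambda = n * p = 2.
approx_gap = abs(binom_pmf(3, 1000, 0.002) - pois_pmf(3, 2.0))

print(sum_matches, conditional, hgeom, approx_gap)
```

Note that the conditional distribution does not depend on $p$, which is why the Hypergeometric PMF has no $p$ in it.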

<h4>Geometric</h4>

Story: $X$ is the number of failures that we will achieve before we achieve our first success, where successes have probability $p$.

Example: If each Pokeball thrown has probability $\frac{1}{10}$ of making the catch, the number of failed Pokeballs before the first catch will be distributed $Geom \left( \frac{1}{10} \right)$.

<h4>First Success Distribution</h4>

Equivalent to the Geometric distribution, except includes the first success in the count. This is $1$ more than the number of failures.

If $X \text{~} FS(p)$, then $E(X) = \frac{1}{p}$

<h4>Negative Binomial Distribution</h4>

Story: $X$ is the number of failures that we will have before we achieve our $r^{th}$ success, with success probability $p$.

Example: Pokémon $A$ has $60\%$ accuracy and can destroy Pokémon $B$ in $3$ hits. The number of misses before $A$ destroys $B$ is distributed $NBin(3,0.6)$.

<h4>Hypergeometric Distribution</h4>

Story: In a population of $w$ desired objects and $b$ undesired objects, $X$ is the number of successes we will have in a draw of $n$ objects, without replacement. The draw of $n$ objects is a simple random sample; all sets of $n$ objects are equally likely.

Examples:

- The number of aces in a $5$-card hand

- You have $w$ white balls and $b$ black balls, and you draw $n$ balls without replacement. The number of white balls in your sample is $HGeom(w,b,n)$. The number of black balls is $HGeom(b,w,n)$.

- A forest has $N$ elk. You capture $n$ of them, tag them, and release them. Then you recapture a new sample of size $m$. How many tagged elk are now in the new sample? $HGeom(n,N-n,m)$.
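
The aces example can be computed exactly. The sketch below assumes a standard 52-card deck and writes the $HGeom(w,b,n)$ PMF as $\binom{w}{k} \binom{b}{n-k} / \binom{w+b}{n}$; the function name `hgeom_pmf` is illustrative.

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    """P(X = k) for X ~ HGeom(w, b, n): k desired objects in a draw of n."""
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

# Number of aces in a 5-card hand: w = 4 aces, b = 48 other cards, n = 5 draws.
aces_pmf = [hgeom_pmf(k, 4, 48, 5) for k in range(5)]
mean_aces = sum(k * p for k, p in enumerate(aces_pmf))

print(sum(aces_pmf), mean_aces)
```

The mean comes out to $\frac{nw}{w+b} = \frac{5}{13}$, as expected for a Hypergeometric RV.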

<h4>Poisson</h4>

Story: Rare events (each with low probability) can occur in many different ways (many opportunities to occur), at an average rate of $\lambda$ occurrences per unit of space or time. $X$ is the number of occurrences in that unit of space or time.

Example: A busy intersection has an average of $2$ accidents per month. It is reasonable to model the number of accidents in a month at that intersection as $Pois(2)$. The number of accidents in two months is distributed $Pois(4)$.

<h5>Properties</h5>

Let $X \text{~} Pois(\lambda_1)$ and $Y \text{~} Pois(\lambda_2)$, with $X$ independent of $Y$.

1. Sum: $X+Y \text{~} Pois(\lambda_1 + \lambda_2)$

2. Conditional: $X|(X+Y=n) \text{~} Bin \left( n, \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)$

3. Chicken-Egg: if there are $Z \text{~} Pois(\lambda)$ items and we randomly and independently 'accept' each item with probability $p$, then the number of accepted items is $Z_1 \text{~} Pois(\lambda p)$, and the number of rejected items is $Z_2 \text{~} Pois(\lambda (1-p))$, with $Z_1$ independent of $Z_2$.
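
The conditional property can be checked from the PMFs, assuming independent $X \text{~} Pois(\lambda_1)$ and $Y \text{~} Pois(\lambda_2)$. The rates and the values of $n$ and $k$ below are arbitrary.

```python
from math import comb, exp, factorial, isclose

def pois_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam1, lam2 = 2.0, 3.0   # hypothetical rates
n, k = 6, 2

# P(X = k | X + Y = n) = P(X = k) P(Y = n - k) / P(X + Y = n),
# using the sum property X + Y ~ Pois(lam1 + lam2).
conditional = pois_pmf(k, lam1) * pois_pmf(n - k, lam2) / pois_pmf(n, lam1 + lam2)

# Bin(n, lam1 / (lam1 + lam2)) PMF at k.
q = lam1 / (lam1 + lam2)
binomial = comb(n, k) * q**k * (1 - q)**(n - k)

print(conditional, binomial)
```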

<h4>Multinomial</h4>

Story: We have $n$ items, which can fall into any one of the $k$ buckets independently with the probabilities $\overrightarrow{p} = (p_1, p_2, \ldots, p_k)$.

Example: Assume that every year, 100 students in the Harry Potter universe are randomly and independently sorted into one of $4$ houses with equal probability. The number of people in each of the $4$ houses is distributed $Mult_4(100,\overrightarrow{p})$, where $\overrightarrow{p} = (0.25,0.25,0.25,0.25)$.

Note that $X_1 + X_2 + X_3 + X_4 = 100$, and they are dependent. 

<h5>Joint PMF</h5>

For $n = n_1 + n_2 + \ldots + n_k$, 

$P(\overrightarrow{X} = \overrightarrow{n}) = \frac{n!}{ n_1! n_2! \ldots n_k! } p_1^{n_1} p_2^{n_2} \ldots p_k^{n_k}$
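
A minimal sketch of the joint PMF, applied to the sorting example with $100$ students and $4$ equally likely houses. The function name `multinomial_pmf` is hypothetical, and the second check sums the PMF over every outcome of a small made-up case to confirm it totals $1$.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Joint PMF of a Multinomial evaluated at the given bucket counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)          # exact: each partial quotient is an integer
    return coef * prod(p**c for p, c in zip(probs, counts))

# Probability that 100 students split exactly 25/25/25/25 across 4 houses.
p_even = multinomial_pmf([25, 25, 25, 25], [0.25] * 4)

# Sanity check on a small case: n = 3 items, k = 3 buckets.
small_total = sum(multinomial_pmf([a, b, 3 - a - b], [0.2, 0.3, 0.5])
                  for a in range(4) for b in range(4 - a))

print(p_even, small_total)
```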

<h5>Marginal PMF, Lumping, and Conditionals</h5>

Marginally, $X_i \text{~} Bin(n,p_i)$, since we can define success to mean category $i$. If you lump together multiple categories in a Multinomial, the result is still Multinomial. For example, $X_i + X_j \text{~} Bin(n, p_i + p_j)$ for $i \neq j$, since we can define success as being in category $i$ or $j$. Similarly, with $k = 6$ categories:

$(X_1 + X_2, X_3 + X_4 + X_5, X_6) ~\text{~}~ Mult_3(n, (p_1 + p_2, p_3 + p_4 + p_5, p_6))$

<h4>Multivariate Normal</h4>

A vector $\overrightarrow{X} = (X_1, X_2, \ldots, X_k)$ is MVN if every linear combination of its components is normally distributed. The parameters of the MVN are the mean vector $\overrightarrow{\mu} = (\mu_1, \mu_2, \ldots, \mu_k)$ and the covariance matrix (where the $(i,j)^{th}$ entry is $Cov(X_i, X_j)$).

<h5>Properties</h5>

- Any subvector is also MVN

- If any two elements within a MVN are uncorrelated, then they are independent

- The joint PDF of a Bivariate Normal $(X,Y)$ with $\mathcal{N}(0,1)$ marginal distributions and correlation $\rho \in (-1,1)$ is

$f_{X,Y}(x,y) = \frac{1}{2 \pi \tau} exp \left( - \frac{1}{2 \tau^2} (x^2 + y^2 - 2 \rho xy) \right)$,

with $\tau = \sqrt{1-\rho^2}$.

<h2 style="color: blue;">Distribution Properties</h2>

<h4>Special Cases of Distributions</h4>

- $Bin(1,p) \text{~} Bern(p)$

- $Beta(1,1) \text{~} Unif(0,1)$

- $Gamma(1,\lambda) \text{~} Expo(\lambda)$

- $\chi_n^2 \text{~} Gamma \left( \frac{n}{2}, \frac{1}{2} \right)$

- $NBin(1,p) \text{~} Geom(p)$

<h2 style="color: blue;">Formulas</h2>

<h4>Geometric Series</h4>

$1 + r + r^2 + \ldots + r^{n-1} = \sum_{k=0}^{n-1} r^k = \frac{1-r^n}{1-r}$

If $|r|$ is less than $1$, then:
    
$1 + r + r^2 + \ldots = \frac{1}{1-r}$
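
Both geometric series formulas can be checked with exact arithmetic. The ratio and number of terms below are arbitrary choices.

```python
from fractions import Fraction

r, n = Fraction(1, 3), 10   # hypothetical ratio and number of terms

finite_sum = sum(r**k for k in range(n))     # 1 + r + ... + r^(n-1)
closed_form = (1 - r**n) / (1 - r)

# For |r| < 1 the partial sums approach 1 / (1 - r);
# the gap after n terms is r^n / (1 - r), which shrinks as n grows.
gap = 1 / (1 - r) - finite_sum

print(finite_sum == closed_form, float(gap))
```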

<h4>Exponential Function</h4>

$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots = \lim_{n \rightarrow \infty} \left( 1 + \frac{x}{n} \right)^n$

<h4>Euler's Approximation for Harmonic Sums</h4>

$1 + \frac{1}{2} + \frac{1}{3} + \ldots + \frac{1}{n} \approx \log n + 0.577$
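
A quick numeric comparison shows how the approximation tightens as $n$ grows. The constant $0.577$ is the Euler-Mascheroni constant, written here to more digits; the values of $n$ are arbitrary.

```python
from math import log

EULER_GAMMA = 0.5772156649   # Euler-Mascheroni constant, rounded

def harmonic(n):
    """Harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1 / k for k in range(1, n + 1))

for n in (10, 100, 1000):
    exact = harmonic(n)
    approx = log(n) + EULER_GAMMA
    print(n, exact, approx, exact - approx)
```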

<h2 style="color: blue;">Table of Distributions</h2>

<img src="img/prob_dists.png" style="height: 700px; width:auto;">