Probability Theory - the study of uncertainty 
* it is important to machine learning because the design of learning algorithms often relies on probabilistic assumptions about the data

1) Basic Concepts
* **Probability Space**:
 * $\Omega$ = space of possible outcomes (outcome space)
 * $F$ = space of measurable events (event space)
 * $P$ = probability measure or distribution
 * **six-sided dice example:**
    * $\Omega = \{1,2,3,4,5,6\}$
    * events of interest: odd or even dice throw
        * $F = \{\varnothing,\{1,3,5\},\{2,4,6\},\Omega\}$
    * define probability distribution $P$ over $F$ (defined over events)
        * $P(\{1\}) = P(\{2\}) = \dots = P(\{6\}) = \frac{1}{6}$
        * $P(\{2,4,6\}) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$
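The dice example above can be sketched in Python. This is a minimal illustration assuming a fair dice; the names `omega`, `prob`, and `F` are illustrative and not part of the notes. The probability of an event is computed as the sum of the probabilities of the outcomes it contains.

```python
from fractions import Fraction

# outcome space of a six-sided dice (fairness is an assumption)
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum((Fraction(1, 6) for outcome in event if outcome in omega), Fraction(0))

# event space for "odd or even dice throw"
F = [frozenset(), frozenset({1, 3, 5}), frozenset({2, 4, 6}), omega]

for event in F:
    print(sorted(event), prob(event))
```

Using `Fraction` keeps the arithmetic exact, so $P(\{2,4,6\})$ comes out as exactly $\frac{1}{2}$ rather than a rounded float.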
* Probability Interpretations:
 * **frequentist** - probability is the long-run relative frequency of an event over repeated trials
 * **bayesian** - probability is a degree of belief in an event
* **Random Variables**:
 * random variables are not variables, but functions that map outcomes to real values
 * allows us to abstract away from formal notion of event space, and instead use random variables to define appropriate events
 * **six-sided dice example:**
    * let $X$ be a random variable that maps outcome $i$ to the value $i$
    * for example: mapping event of throwing a "one" to the value of $1$
    * another example: map the event of throwing an odd roll to $1$ and an even roll to $0$
 * indicate probability with respect to random variable notation via: 
    * $P(X = a)$ or $P_{X}(a)$ (probability of a random variable $X$ taking on the value of $a$)
    * $Val(X)$ denotes the range of random variable $X$
* **Distributions, Joint distributions, and Marginal distributions**:
 * what is the distribution of a random variable? it gives the probability of the random variable taking on each of its possible values
 * **six-sided dice example:**
    * if dice is fair, then the distribution of $X$: (defined over random variables)
        * $P(X = 1) = P(X = 2) = \dots = P(X = 6) = \frac{1}{6}$
 * $P(X)$ denotes distribution of random variable $X$
 * **joint distributions** - refers to a distribution that is defined by more than one variable (probability is determined jointly by all variables involved)
 * **1 fair six-sided dice and 1 fair coin example:**
    * let $X$ be a random variable defined on the outcome space of a dice throw and $Y$ be an indicator variable that takes on value of 1 if coin flip is heads and 0 if tails
    * $P(X = a,Y = b)$ or $P_{X,Y}(a,b)$
    * joint distribution notated: $P(X,Y)$
    * e.g. $P(X = 1, Y = 0) = \frac{1}{12}$
    * e.g. $P(X = 2, Y = 1) = \frac{1}{12}$
    * e.g. $P(X = 3, Y = 0) = \frac{1}{12}$
 * **marginal distribution** - refers to the distribution of a single random variable on its own, ignoring the others
    * from the example above, finding the marginal distribution of $X$ or $Y$ alone
    * sum out all other random variables from the distribution: $P(X) = \sum_{b\in Val(Y)}P(X,Y = b)$ (sum out random variable $Y$ to find marginal distribution of $X$)
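Summing out a variable can be sketched with the dice and coin example above. This is a minimal sketch; the dictionary layout and the names `joint`, `marginal_x`, and `marginal_y` are illustrative choices.

```python
from fractions import Fraction
from itertools import product

val_x = range(1, 7)  # Val(X): dice outcomes
val_y = (0, 1)       # Val(Y): 1 = heads, 0 = tails

# joint distribution P(X = a, Y = b) for a fair dice and a fair coin
joint = {(a, b): Fraction(1, 12) for a, b in product(val_x, val_y)}

# marginal of X: sum out Y
marginal_x = {a: sum(joint[(a, b)] for b in val_y) for a in val_x}
# marginal of Y: sum out X
marginal_y = {b: sum(joint[(a, b)] for a in val_x) for b in val_y}

print(marginal_x[1])  # 1/6
print(marginal_y[1])  # 1/2
```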
* **Conditional Distributions**:
 * conditional distributions allow us to reason about uncertainty
 * allows us to specify distribution of random variable when the value of another random variable is known (or event is known to be true)
 * conditional probability of $X = a$ *given* $Y = b$ is defined as: 
     * $P(X = a \mid  Y = b) = \frac{P(X = a, Y = b)}{P(Y = b)}$
 * **1 fair six-sided dice example:**
     * suppose we know that a dice throw was odd and want to know the probability that a "one" was thrown
     * let $X$ be the random variable of the dice throw and $Y$ be an indicator variable that takes on the value $1$ if the dice throw is odd
     * $P(X = 1\mid  Y = 1) = \frac{P(X = 1, Y = 1)}{P(Y = 1)} = \frac {\frac{1}{6}}{\frac{1}{2}} = \frac{1}{3}$ (top: dice has 1/6 chance of throwing 1, bottom: dice has 1/2 chance of being odd)
 * when there is a distribution of random variable conditioned on several variables: 
     * $P(X = a \mid  Y = b, Z = c) = \frac{P(X = a, Y = b, Z = c)}{P(Y = b, Z = c)}$
 * $P(X\mid Y = b)$ denotes distribution of random variable $X$ when $Y = b$
 * $P(X\mid Y)$ denotes a set of distributions of $X$, one for each of the different values of $Y$
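The conditional probability definition can be checked on the odd-throw example. This is a minimal sketch assuming a fair dice; `joint` and `conditional` are illustrative names.

```python
from fractions import Fraction

# joint distribution of X (dice value) and Y (1 if the throw is odd, else 0)
joint = {(a, a % 2): Fraction(1, 6) for a in range(1, 7)}

def conditional(a, b):
    """P(X = a | Y = b) = P(X = a, Y = b) / P(Y = b)."""
    p_y = sum(p for (_, y), p in joint.items() if y == b)
    return joint.get((a, b), Fraction(0)) / p_y

print(conditional(1, 1))  # 1/3
print(conditional(2, 1))  # 0
```

Evaluating `conditional(a, 1)` for every `a` gives the whole distribution $P(X\mid Y = 1)$, which sums to 1.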
* **Independence**:
 * independence means that the distribution of a random variable does *not* change on learning the value of another random variable
 * in ML, we often make such assumptions about our data (e.g. training samples are assumed to be drawn independently; label of sample $i$ is assumed to be independent of the features of sample $j$ $(i \neq j)$)
 * probability of event A and B occurring:
     * $P(A\cap B) = P(A\mid B) * P(B) = P(B\mid A) * P(A)$
 * mathematically, a random variable $X$ is independent of $Y$ when 
     * $P(X) = P(X\mid Y)$ (holds true for any values of $X$ and $Y$)
 * mathematically, $X$ and $Y$ are independent of each other when
     * $P(X,Y) = P(X)P(Y)$
     * $P(A\cap B) = P(A)P(B)$
 * **conditional independence** - means if we know the value of a random variable (or set of random variables), then some other random variables will be independent of each other
     * $X$ and $Y$ are conditionally independent given $Z$
     * $P(X\mid Z) = P(X\mid Y,Z)$ or
     * $P(X,Y\mid Z) = P(X\mid Z)P(Y\mid Z)$
     * assumption made by Naive Bayes algorithm
         * e.g. classifying emails as spam or non-spam
         * assumes that the probability of a word $x$ appearing in the email is conditionally independent of a word $y$ appearing given whether the email is spam or not
         * why is this bad? some words tend to appear together, so the assumption rarely holds exactly and some generality is lost
         * why is this good? allows classification of spam quickly
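The factorization test $P(X,Y) = P(X)P(Y)$ can be sketched on the two dice examples used earlier. This is a minimal illustration; `is_independent` is a hypothetical helper, not a library function.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check P(X = a, Y = b) == P(X = a) P(Y = b) for every pair (a, b)."""
    xs = {a for a, _ in joint}
    ys = {b for _, b in joint}
    p_x = {a: sum(joint.get((a, b), 0) for b in ys) for a in xs}
    p_y = {b: sum(joint.get((a, b), 0) for a in xs) for b in ys}
    return all(joint.get((a, b), 0) == p_x[a] * p_y[b] for a, b in product(xs, ys))

# dice value and a separate fair coin flip
dice_and_coin = {(a, b): Fraction(1, 12) for a, b in product(range(1, 7), (0, 1))}
# dice value and the odd-indicator of the same throw
dice_and_parity = {(a, a % 2): Fraction(1, 6) for a in range(1, 7)}

print(is_independent(dice_and_coin))    # True
print(is_independent(dice_and_parity))  # False
```

The second case fails because knowing the throw is odd changes the distribution of the dice value, e.g. $P(X = 1, Y = 1) = \frac{1}{6}$ while $P(X = 1)P(Y = 1) = \frac{1}{12}$.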
* **Chain rule and Bayes rule**:
 * **chain rule** - evaluates the joint probability of some random variables (especially useful when there is conditional independence across variables)
     * important in the study of Bayesian Networks and Probabilistic Graphical Models
     * $P(X_{1},X_{2},\dots,X_{n}) = P(X_{1})P(X_{2}\mid X_{1}) \cdots P(X_{n}\mid X_{1},X_{2},\dots,X_{n-1})$
     * $P(X_{1},X_{2},\dots,X_{n}) = \prod_{i=1}^n P(X_{i}\mid X_{1},X_{2},\dots,X_{i-1})$
         * $P(X_1,X_2,X_3)=P(X_1)P(X_2\mid X_1) * P(X_3\mid X_1,X_2)$
         * where $P(X_1,X_2) = P(X_1)P(X_2\mid X_1)$
 * **bayes rule** - allows computation of conditional probability $P(X\mid Y)$ from $P(Y\mid X)$
     * $P(X\mid Y) = \frac{P(Y\mid X)P(X)}{P(Y)}$
     * if $P(Y)$ is not given, find the marginal distribution of $Y$:
     * (law of total probability) $P(Y) = \sum_{a\in Val(X)}P(X = a,Y) = \sum_{a\in Val(X)}P(Y\mid X = a)P(X = a)$
 * bayes rule for multiple random variables:
     * $P(X,Y\mid Z) = \frac{P(Z\mid X,Y)P(X,Y)}{P(Z)} = \frac{P(Y,Z\mid X)P(X)}{P(Z)}$
     * $P(X\mid Y,Z) = \frac{P(Y\mid X,Z)P(X,Z)}{P(Y,Z)} = \frac{P(Y\mid X,Z)P(X\mid Z)P(Z)}{P(Y\mid Z)P(Z)} = \frac{P(Y\mid X,Z)P(X\mid Z)}{P(Y\mid Z)}$
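Bayes rule together with the law of total probability can be sketched on the odd-throw example. This is a minimal sketch assuming a fair dice; `prior`, `likelihood`, and `posterior` are illustrative names.

```python
from fractions import Fraction

# prior P(X = a) for a fair dice
prior = {a: Fraction(1, 6) for a in range(1, 7)}
# likelihood P(Y = 1 | X = a): the throw is odd with certainty exactly when a is odd
likelihood = {a: Fraction(a % 2) for a in range(1, 7)}

# law of total probability: P(Y = 1) = sum over a of P(Y = 1 | X = a) P(X = a)
p_y = sum(likelihood[a] * prior[a] for a in prior)

# Bayes rule: P(X = a | Y = 1) = P(Y = 1 | X = a) P(X = a) / P(Y = 1)
posterior = {a: likelihood[a] * prior[a] / p_y for a in prior}

print(p_y)           # 1/2
print(posterior[1])  # 1/3
```

The result agrees with computing $P(X = 1\mid Y = 1)$ directly from the joint distribution.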
* **Sets**:
 * **complement (e.g. $A^c$ or $A'$)** - denotes all elements that are not in $A$
 * **union (e.g. $A \cup B$)** - denotes all elements that are in A or B
 * **intersection (e.g. $A \cap B$ = $A,B$ = $A\&B$)** - denotes all elements in A and B
     * disjoint - two sets are disjoint if their intersection is the empty set $\varnothing$

2) Defining Probability Distributions
* **Discrete Distribution (Probability Mass Function)**:
 * discrete - random variable of the underlying distribution can only take on finitely many or countably many different values (outcome space is finite or countably infinite)
 * **PMF** - enumerates the probability of the random variable taking on each of its possible values
     * divides up a unit mass (total probability) and places them on different values a random variable can take
     * can extend to joint distributions and conditional distributions
* **Continuous Distribution (Probability Density Function)**:
 * continuous - random variable of the underlying distribution can take on uncountably many different values, e.g. any value in an interval (outcome space is infinite)
 * cannot place a non-zero amount of mass on each of the values since it will add up to infinity (violating requirement of total probability to sum to 1)
 * **PDF**, $f$, is a non-negative, integrable function that defines the continuous distribution such that:
     * $\int_{Val(X)} f(x)dx = 1$
 * compute the probability of a random variable $X$ distributed according to a PDF $f$:
     * $P(a \leq X \leq b) = \int_{a}^{b} f(x)dx$
 * uniform distribution example: 
     * let random variable $X$ be uniformly distributed in the range $[0,1]$:
     * $f(x) = \begin{cases} 
        1  & \text{if } 0 \leq x \leq 1 \\
        0 & \text{otherwise}
        \end{cases}$
     * more generally, if $X$ is distributed uniformly over the range $[a,b]$, the PDF is:
     * $f(x) = \begin{cases} 
        \frac{1}{b-a}  & \text{if } a \leq x \leq b \\
        0 & \text{otherwise}
        \end{cases}$
 * **cumulative distribution function** - gives the probability of a random variable being less than or equal to some value
    * PDF -> CDF: $F(b) = P(X \leq b) = \int_{-\infty}^{b} f(x)dx$
    * CDF -> PDF: $F'(x) = \frac{d}{dx}F(x) = f(x)$
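The relationship between the PDF and the CDF can be sketched for the uniform distribution. This is a minimal illustration: `integrate` is a hypothetical midpoint-rule helper, and the interval $[0.25, 0.75]$ is an arbitrary choice.

```python
def uniform_pdf(x, a=0.0, b=1.0):
    """PDF of Uniform(a, b): 1 / (b - a) inside [a, b], 0 otherwise."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a=0.0, b=1.0):
    """CDF of Uniform(a, b): P(X <= x)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

# P(0.25 <= X <= 0.75) computed two ways: from the CDF and by integrating the PDF
p_from_cdf = uniform_cdf(0.75) - uniform_cdf(0.25)
p_from_pdf = integrate(uniform_pdf, 0.25, 0.75)
print(p_from_cdf, round(p_from_pdf, 6))
```

Both values should come out at approximately 0.5, and integrating the PDF over all of $[0,1]$ should give approximately 1.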

3) Expectations and Variance
* **Expectations: (aka mean, expected value, first moment)**
 * $E(X) = \sum_{a\in Val(X)}aP(X = a)$ or $E(X) = \int_{x\in Val(X)}xf(x)dx$
 * $E(X) = x_1p_1 + x_2p_2 + \cdots +  x_kp_k$ or $E(X) = \int_{-\infty}^{\infty}xf(x)dx$
 * example: let $X$ be the outcome of rolling a fair dice
     * expectation of $X$ is
     * $E(X) = (1)\frac{1}{6} + (2)\frac{1}{6} + \cdots + (6)\frac{1}{6} = \frac{7}{2} = 3.5$
 * **linearity of expectations theorem**:
     * sums of random variables
         * $E(X_{1}+X_{2}+\cdots+X_{n}) = E(X_{1})+E(X_{2})+\cdots+E(X_{n})$
         * no restrictions on whether the random variables are independent or not
     * products of random variables: if $X$ and $Y$ are independent, then:
         * $E(XY) = E(X)E(Y)$
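The expectation formulas can be checked on two independent fair dice. This is a minimal sketch; `expectation`, `pmf_sum`, and `pmf_prod` are illustrative names.

```python
from fractions import Fraction
from itertools import product

# PMF of a single fair dice
pmf = {a: Fraction(1, 6) for a in range(1, 7)}

def expectation(pmf):
    """E(X) = sum over values a of a * P(X = a)."""
    return sum(value * p for value, p in pmf.items())

print(expectation(pmf))  # 7/2

# PMFs of the sum and the product of two independent fair dice
pmf_sum, pmf_prod = {}, {}
for (a, pa), (b, pb) in product(pmf.items(), repeat=2):
    pmf_sum[a + b] = pmf_sum.get(a + b, 0) + pa * pb
    pmf_prod[a * b] = pmf_prod.get(a * b, 0) + pa * pb

print(expectation(pmf_sum))   # 7     = E(X) + E(Y)
print(expectation(pmf_prod))  # 49/4  = E(X) E(Y)
```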
* **Variance: (aka spread, second central moment)**
 * variance of a distribution is a measure of the "spread" of a distribution
 * mean squared distance between random variable $X$ and mean $\mu$
     * $\begin{align} Var(X) & = \sigma^2 \\
         & = E\big[(X-\mu)^2\big] \\
         & = E\big[(X-E(X))^2\big] \\
         & = E\big[X^2-2XE[X] + (E[X])^2\big] \\
         & = E[X^2]-2E[X]E[X] + (E[X])^2 \\
         & = E[X^2]-(E[X])^2
     \end{align}$
     * $Var(X) = E[(X-\mu)^2] = \sum_{i=1}^k p_i * (x_i-\mu)^2$
     * $Var(X) = \sigma^2 = \int (x-\mu)^2f(x)dx$
 * **standard deviation** = $\sigma = \sqrt{Var(X)}$
 * variance is not a linear function of a random variable $X$
     * $Var(aX+b) = a^2Var(X)$
 * if random variables $X$ and $Y$ are independent:
     * $Var(X+Y)=Var(X)+Var(Y)$ if $X\bot Y$
 * **covariance** - measures how strongly two random variables vary together
     * $\begin{align} Cov(X,Y) & = E((X-E(X))(Y-E(Y))) \\
         & = E[(X-\mu_x)(Y-\mu_y)] \\
         & = E(XY)-E(X)E(Y)
     \end{align}$
     * sample estimate from $n$ paired observations: $\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{n-1}$
 * **correlation** - normalized version of covariance (the correlation coefficient)
     * also measures strength of linear relationship between two variables
     * $\rho_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_x)(Y-\mu_y)]}{\sigma_x\sigma_y}$
     * sample estimate: $r_{X,Y} = \frac{ \sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) } { \sqrt{ \sum_{i=1}^n(x_i - \bar{x})^2 \sum_{i=1}^n(y_i - \bar{y})^2} }$
     * correlation measures strength of relationship between two variables but doesn't capture some qualities:
         * e.g. Anscombe's quartet
         * reflects noisiness and direction
         * does not reflect slope or non-linear qualities
 * **covariance matrix** - captures $Var(X_i)$ on the diagonal and $Cov(X_i,X_j)$ off the diagonal
      * let $X$ be a data matrix with $m$ features: $X=[X_1,X_2,\dots,X_m]$
      * variances on diagonal
      * compact representation of all covariances
      * the correlation matrix is the normalized version, with $Corr(X_i,X_j)$ entries
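The variance identities and the sample covariance and correlation formulas can be sketched as follows. This is a minimal illustration: the data in `xs` and `ys` is made up, and the helper names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# PMF of a fair dice
pmf = {a: Fraction(1, 6) for a in range(1, 7)}

def expectation(pmf, g=lambda v: v):
    """E[g(X)] for a discrete PMF."""
    return sum(g(v) * p for v, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    return expectation(pmf, lambda v: v * v) - expectation(pmf) ** 2

print(variance(pmf))  # 35/12

# Var(aX + b) = a^2 Var(X), here with a = 2 and b = 3
shifted = {2 * v + 3: p for v, p in pmf.items()}
print(variance(shifted) == 4 * variance(pmf))  # True

def sample_cov(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
ys = [2.0, 4.0, 6.0, 8.0]  # exactly 2 * xs, so the correlation should be about 1
print(sample_cov(xs, ys), sample_corr(xs, ys))
```

Note that `sample_cov(xs, xs)` is the sample variance, which is why the same helper supplies the diagonal entries of a covariance matrix.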

4) Statistical Distributions
* **Bernoulli$(p)$**:
 * description: discrete probability distribution for a Bernoulli trial (1 for success with probability $p$, 0 for failure with probability $1-p$)
 * $P(X = x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$
 * PMF: $P[success] = p$, $P[failure] = 1-p$
 * mean: $E[X] = p$
 * variance: $Var(X) = p(1-p)$
 * example: $Bernoulli(p)$ = single coin flip turns out to be heads
 * a random variable distribution that takes two possible values {0,1}
 * has a single parameter, $p$, to be $P(X=1)$
 * often indicates whether a trial was successful or not
 * e.g. a classification task using logistic regression assumes that the labels are Bernoulli distributed given the features
* **Binomial$(n,p)$**:
 * description: discrete probability distribution of obtaining exactly $k$ successes out of $n$ independent trials, each with success probability $p$
 * PMF: $\binom{n}{k}p^k(1-p)^{n-k}$ for $0 \leq k \leq n$
 * mean: $np$
 * variance: $np(1-p)$, also written $npq$ with $q = 1-p$
 * example: $Binomial(100,p)$ = # of coin flips out of 100 that turn out to be heads
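The binomial PMF, mean, and variance can be checked numerically. This is a minimal sketch; the parameters `n = 10` and `p = 0.5` are arbitrary illustrative values, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical parameters: 10 fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                          # should be close to 1
mean = sum(k * pk for k, pk in enumerate(pmf))            # should be close to n * p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # close to n * p * (1 - p)
print(total, mean, var)
```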
* **Geometric$(p)$**:
 * description: discrete probability distribution of the number $(X)$ of Bernoulli trials needed to get one success. Equivalently, it gives the probability of $(X-1)$ failures before the first success
 * PMF: $p(1-p)^{k-1}$ for $k = 1,2,\dots$
 * mean: $\frac{1}{p}$
 * variance: $\frac{1-p}{p^2}$
 * example: $Geometric(p)$ = # of trials until coin flip turns out to be heads
* **Hypergeometric**:
 * description: discrete probability distribution that describes probability of $k$ successes in $n$ draws, without replacement
 * the hypergeometric test uses the hypergeometric distribution to calculate the statistical significance of having drawn $k$ successes in $n$ total draws
     * example: urn with two types of marbles (red and green) and draw marbles without replacement
     * define drawing a green marble as a success and drawing a red as a failure (analogous to binomial distribution)
     * did I draw the **expected** number of green marbles?
     * this is not a binomial distribution since the probability of success on each trial is not the same
 * models card games such as Texas hold 'em, where cards are drawn without replacement
* **Poisson$(\lambda)$**:
 * description: given the average number of events per unit time, $\lambda$, gives the discrete probability of $k$ events happening in that unit of time
 * PMF: $P(X=k)=\frac{exp(-\lambda)\lambda^k}{k!}$ for $k = 0,1,2,\dots$
 * mean: $\lambda$
 * variance: $\lambda$
 * example: $Poisson(\lambda=10)$ = # of taxis passing a street corner in a given hour (on avg 10/hr)
 * deals with the arrival of events
     * measures the probability of the number of events happening over a fixed period of time
     * assumes a fixed average rate of occurrence
     * assumes that the events take place independently of the time since the last event
 * is parameterized by average arrival rate $\lambda$
 * mean of poisson random variable is $\lambda$
 * variance is also $\lambda$
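The Poisson PMF can be evaluated for the taxi example with $\lambda = 10$. This is a minimal sketch; truncating the infinite sum at $k = 100$ is an assumption that is safe here because the tail mass is negligible for this rate.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

lam = 10  # average of 10 taxis per hour, as in the example above

p_exactly_10 = poisson_pmf(10, lam)                      # P(X = 10)
p_at_most_5 = sum(poisson_pmf(k, lam) for k in range(6))  # P(X <= 5)

# approximate mean and variance by truncating the sum at k = 100
ks = range(100)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(p_exactly_10, p_at_most_5, mean, var)
```

Both `mean` and `var` should come out approximately equal to `lam`.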
* **Uniform$(a,b)$**:
 * PDF: $\frac{1}{b-a}$ for $x \in (a,b)$, $0$ otherwise
 * mean: $\frac{a+b}{2}$
 * variance: $\frac{(b-a)^2}{12}$
 * example: $Uniform(0,360)$ = Degrees between hour hand and minute hand
* **Gaussian$(\mu,\sigma^2)$**: (aka normal distribution)
 * description: most widely used distribution for continuous variables
 * PDF: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}exp\big(\frac{-(x-\mu)^2}{2\sigma^2}\big)$ for $x \in (-\infty, \infty)$
 * mean: $\mu$
 * variance: $\sigma^2$
 * example: $Gaussian(100,15^2)$ = IQ score (mean 100, standard deviation 15)
 * inverse of variance is called **precision** $(\tau = \frac{1}{\sigma^2})$
 * approximates binomial distribution when the number of experiments is large
 * approximates poisson distribution when the average arrival rate is high
 * related to the Central Limit Theorem: sums of many independent random variables are approximately Gaussian
 * often assume noise in the system is Gaussian distributed
 * parameterized by mean $\mu$ and variance $\sigma^2$
* **Exponential$(\lambda)$**:
 * description: models the time between events for a poisson process (subcase of gamma distribution)
 * PDF: $\lambda e^{-\lambda x}$ for $x \geq 0$, with $\lambda > 0$
 * mean: $\frac{1}{\lambda}$
 * variance: $\frac{1}{\lambda^2}$
 * example: $Exponential(\lambda=10)$ = time until taxi will pass street corner
 * it is governed by a rate parameter $\lambda$
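The exponential distribution can be sketched with its PDF and survival function $P(X > x) = e^{-\lambda x}$. This is a minimal illustration that also checks the memoryless property $P(X > s + t \mid X > s) = P(X > t)$; the values of `s` and `t` are arbitrary.

```python
from math import exp

def exponential_pdf(x, lam):
    """PDF of Exponential(lam): lam * exp(-lam * x) for x >= 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def exponential_survival(x, lam):
    """P(X > x) = exp(-lam * x) for x >= 0."""
    return exp(-lam * x) if x >= 0 else 1.0

lam = 10.0         # rate from the taxi example: 10 per hour
s, t = 0.1, 0.05   # hypothetical waiting times, in hours

# memorylessness: having already waited s does not change the remaining wait
lhs = exponential_survival(s + t, lam) / exponential_survival(s, lam)
rhs = exponential_survival(t, lam)
print(lhs, rhs)
print(1.0 / lam)  # mean waiting time in hours
```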
* **Beta$(\alpha,\beta)$**:
 * description: density function that is versatile in representing outcomes like proportions or probabilities. it works on a space between 0 and 1
 * PDF: $P(x\mid \alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$
     * where $B$ is the beta function: $B(\alpha,\beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}dt$
 * mean: $\frac{\alpha}{\alpha+\beta}$
 * variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
 * example: useful in estimating success
 * uniform is actually a special case of Beta ($Uniform(0,1) = Beta(1,1)$)
 * there are two parameters that work together to determine if the distribution has a mode in the interior of unit interval and whether it is symmetrical
* **Gamma$(k,\theta)$**:
 * example: time until $k$ events occur in a process with no memory
 * gamma with integer shape $k$ is the distribution of the sum of $k$ independent Exponentials with the same rate
* **Chi-square**:
 * example: useful for goodness of fit tests
 * chi-square with $k$ degrees of freedom is the sum of squares of $k$ independent standard Gaussian random variables
* **F-Distribution**:
 * example: useful for some statistical tests
 * f-dist is the ratio of two independent chi-squared distributed random variables, each divided by its degrees of freedom

In [2]:
# if there are 10 lottery balls and we want to draw them all, how many possible orderings are there?
import math
math.factorial(10)

3628800

In [13]:
# how many different pairs are there for afternoon sprints with 24 students? (order doesn't matter / combinations)
from itertools import combinations
spair = list(combinations("ABCDEFGHIJKLMNOPQRSTUVXY",2))
print(len(spair))

276


In [14]:
from math import factorial
def comb(n, k):
    # integer division keeps the count an exact int
    return factorial(n) // (factorial(k) * factorial(n-k))
print(comb(24,2))

276


In [12]:
# permutations (order does matter)
# on a baseball team of 20 players, how many ordered pairs of players (e.g. first and second batter) are there?
from itertools import permutations
len(list(permutations("ABCDEFGHIJKLMNOPQRST",2)))

380

In [16]:
from math import factorial
def perm(n, k):
    # integer division keeps the count an exact int
    return factorial(n) // factorial(n-k)
print(perm(20,2))

380
