# Chapter 2: Probabilities

#### Probability vs Probability Density

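* Unlike a PMF, a PDF ***can*** be > 1 and does ***not*** give the point probability $\text{Prob}\left(\mathbf{X}=\mathbf{x}_{0}\right)$, which is zero for a continuous variable. Probabilities are obtained by integrating the density over a region -> hence the need for a normalized density (one that integrates to 1).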
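
To make the distinction concrete, here is a minimal sketch using a uniform density on $[0, 0.5]$. The example and the helper names `uniform_pdf` and `integrate` are assumptions chosen for illustration, not part of the notes. The density equals 2 on its whole support, yet it still integrates to 1.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high] (hypothetical example)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


density_at_point = uniform_pdf(0.25)               # 2.0: a density value above 1
total_mass = integrate(uniform_pdf, 0.0, 0.5)      # close to 1: the density is normalized
prob_interval = integrate(uniform_pdf, 0.1, 0.2)   # close to 0.2: probability of an interval

print(density_at_point, total_mass, prob_interval)
```

Only the integrals are probabilities; the value `density_at_point` is not.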


## Bayesian learning (binary)
* *Keywords: Bernoulli, Binomial, Beta distrib.*

**Problem**: Tossing a coin $N=3$ times. Observed number of heads $m=3$ and number of tails $l=0$. Predict outcome of next tosses.

**Model**: Each toss is a <font color='blue'>Bernoulli</font> event $$p\left(t|\Theta\right)=\text{Bernoulli}\left(t|\mu\right)=\mu^{t}\left(1-\mu\right)^{1-t}=\begin{cases}
\mu & \text{if }t=1\\
1-\mu & \text{if }t=0
\end{cases}$$ where $t\in\left\{ 0,1\right\} $ and $\Theta=\left\{ \mu\right\}$ .
 Therefore $N$ (independent) tosses give the <font color='blue'>**Likelihood**</font> $$p\left(\mathcal{D}|\Theta\right)=\prod_{n=1}^{N}\mu^{t_{n}}\left(1-\mu\right)^{1-t_{n}}
 $$
where $\mathcal{D}=\left\{ t_{1..N}\right\}$. Predicting the outcome of the next tosses $\equiv$ ***estimating $\mu$***.

<font color='red'>*Note*</font>: the above likelihood can be written as 
$p\left(\mathcal{D}|\Theta\right)=\mu^{m}\left(1-\mu\right)^{N-m}$, which is the pmf of a <font color='blue'>Binomial</font> distribution without its normalizing binomial coefficient. In fact, Bernoulli is the *special case* of the Binomial with $N=1$. Additionally, the problem can be modelled using the Binomial directly via an *alternative problem*.
> Throw $N$ coins, all with the same head probability $\mu$, onto the ground and observe $m$ heads and $l$ tails. Predict the number of heads when throwing another $N'$ coins with the same head probability $\mu$.

In this case, 
$$p\left(\mathcal{D}|\Theta\right)=\left(\begin{array}{c}
N\\
m
\end{array}\right)\mu^{m}\left(1-\mu\right)^{N-m}$$
where $\mathcal{D}=\left\{ m\right\}$ and $\Theta=\left\{ N,\mu\right\} $
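
The relation between the two likelihoods can be checked numerically. The sketch below assumes a hypothetical value $\mu=0.7$ and a hypothetical toss sequence; the function names are illustrative. It shows that the Bernoulli-product likelihood and the Binomial pmf differ only by the binomial coefficient, which does not depend on $\mu$.

```python
from math import comb, isclose, prod


def bernoulli_likelihood(tosses, mu):
    """Likelihood of an ordered toss sequence: product of Bernoulli terms."""
    return prod(mu**t * (1 - mu) ** (1 - t) for t in tosses)


def binomial_pmf(m, N, mu):
    """Probability of m heads in N tosses, regardless of order."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)


mu = 0.7            # hypothetical value, chosen only for illustration
tosses = [1, 0, 1]  # hypothetical sequence: head, tail, head
m, N = sum(tosses), len(tosses)

sequence_likelihood = bernoulli_likelihood(tosses, mu)  # mu^2 * (1 - mu)
count_likelihood = binomial_pmf(m, N, mu)               # 3 * mu^2 * (1 - mu)

# The two differ only by the binomial coefficient, a constant with respect to mu.
print(isclose(count_likelihood, comb(N, m) * sequence_likelihood))
```

Because the coefficient is constant in $\mu$, both formulations lead to the same estimate of $\mu$.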


**Inference and Predictive distribution**:
* Frequentist: 
    * <font color='blue'>**Parameter** estimate</font>: 
    $$\hat{\mu}_\text{MLE}=\dfrac{m}{N}=1$$
    * <font color='blue'>**Predictive distribution**</font>: 
    $$p\left(t^{\left(\text{new}\right)}=1|\mathcal{D}\right)=\text{Bernoulli}\left(t^{\left(\text{new}\right)}=1|\mu=1\right)=1$$ i.e. all future tosses will be heads (!)
* Bayesian: MLE gives such an extreme estimate from so few observations. It seems more intuitive to 
    * formally quantify the *uncertainty $p\left(\mu|\cdot\right)$ in the parameter $\mu$*, and
    * *update the prediction* **sequentially** as new tosses are observed.
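
The frequentist result can be reproduced with a few lines. This is a minimal sketch with illustrative function names: it compares the closed-form estimate $m/N$ with a brute-force maximization of the likelihood over a grid, using the observations from the problem statement.

```python
def likelihood(mu, m, N):
    """Bernoulli likelihood mu^m * (1 - mu)^(N - m) of m heads in N tosses."""
    return mu**m * (1 - mu) ** (N - m)


def mle_closed_form(m, N):
    """Maximum-likelihood estimate of mu: the observed fraction of heads."""
    return m / N


m, N = 3, 3  # three heads in three tosses, as in the problem statement

grid = [i / 100 for i in range(101)]  # candidate values 0.00, 0.01, ..., 1.00
mu_grid = max(grid, key=lambda mu: likelihood(mu, m, N))
mu_mle = mle_closed_form(m, N)

p_next_head = mu_mle      # Bernoulli(t = 1 | mu_mle)
p_next_tail = 1 - mu_mle  # probability assigned to ever seeing a tail

print(mu_grid, mu_mle, p_next_head, p_next_tail)
```

Both routes give $\hat{\mu}=1$, so the plug-in prediction assigns zero probability to tails.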

### Bayesian inference

<font color='blue'>**Parameter Prior**</font>: $$p\left(\mu|\mathcal{H}\right)=\text{Beta}\left(\mu|a,b\right)=\dfrac{\Gamma\left(a+b\right)}{\Gamma\left(a\right)\Gamma\left(b\right)}\mu^{a-1}\left(1-\mu\right)^{b-1}$$ chosen for ***conjugacy*** with the likelihood. The hyperparameters $\mathcal{H}=\left\{ a,b\right\} $ are interpreted as the *effective* number of heads and tails respectively. They are *effective* counts because they combine prior and observed counts: in the ***sequential learning*** framework, each posterior serves as the prior for the next observations.

<font color='red'>*Note*</font>: $\Gamma\left(z+1\right)=z\cdot\Gamma\left(z\right)$ for real $z>0$, and $\Gamma\left(n+1\right)=n!$ for $n\in\mathbb{N}$
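
These Gamma-function identities can be checked with Python's standard library. The sketch below assumes a hypothetical positive real value `z = 2.5`, and the helper `beta_normalizer` is an illustrative name for the constant in front of the Beta density.

```python
from math import factorial, gamma, isclose


def beta_normalizer(a, b):
    """Gamma(a + b) / (Gamma(a) * Gamma(b)), the constant in front of the Beta density."""
    return gamma(a + b) / (gamma(a) * gamma(b))


z = 2.5  # hypothetical positive real value
recurrence_holds = isclose(gamma(z + 1), z * gamma(z))
factorial_matches = all(isclose(gamma(n + 1), factorial(n)) for n in range(10))

# For integer hyperparameters the normalizer reduces to factorials:
# Gamma(4) / (Gamma(2) * Gamma(2)) = 3! / (1! * 1!) = 6
normalizer_2_2 = beta_normalizer(2, 2)

print(recurrence_holds, factorial_matches, normalizer_2_2)
```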


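<font color='blue'>**Parameter Posterior**</font>: $$p\left(\mu|\mathcal{D},\mathcal{H}\right)\propto p\left(\mathcal{D}|\mu\right)p\left(\mu|\mathcal{H}\right)\propto\mu^{a+m-1}\left(1-\mu\right)^{b+l-1}\;\Rightarrow\;p\left(\mu|\mathcal{D},\mathcal{H}\right)=\text{Beta}\left(\mu|a+m,b+l\right)$$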
<img src="figs/beta.png">
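
The conjugate update can be sketched as follows, assuming the standard rule that observed counts are added to the hyperparameters, i.e. a posterior $\text{Beta}\left(\mu|a+m,b+l\right)$, which is the form consistent with the predictive result $\frac{m+a}{N+a+b}$. The prior values $a=b=2$ and all function names are hypothetical. The sketch also checks that likelihood times prior is proportional to that Beta density.

```python
from math import gamma, isclose


def beta_pdf(mu, a, b):
    """Beta density with hyperparameters a and b, evaluated at mu."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)


def update_beta(a, b, toss):
    """One sequential step: a head increments a, a tail increments b."""
    return (a + 1, b) if toss == 1 else (a, b + 1)


a0, b0 = 2, 2       # hypothetical prior hyperparameters
tosses = [1, 1, 1]  # three heads, as in the problem statement

# Sequential learning: the posterior after each toss is the prior for the next.
a, b = a0, b0
for toss in tosses:
    a, b = update_beta(a, b, toss)

# Batch update: add the head and tail counts in one step.
m = sum(tosses)
l = len(tosses) - m
batch = (a0 + m, b0 + l)


def unnormalized_posterior(mu):
    """Likelihood times prior, before normalization."""
    return mu**m * (1 - mu) ** l * beta_pdf(mu, a0, b0)


# If the posterior is Beta(a, b), this ratio is the same constant for every mu.
ratios = [unnormalized_posterior(mu) / beta_pdf(mu, a, b) for mu in (0.2, 0.5, 0.8)]

print((a, b), batch, ratios)
```

Sequential and batch updates end at the same hyperparameters, which is what makes the posterior reusable as a prior.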

<font color='blue'>**Predictive distribution**</font>: $$p\left(t^{\left(\text{new}\right)}=1|\mathcal{D},\mathcal{H}\right)=\int_{\mu}p\left(t^{\left(\text{new}\right)}=1|\mu\right)p\left(\mu|\mathcal{D},\mathcal{H}\right)\text{d}\mu=\dfrac{m+a}{N+a+b}$$
Concretely, since $p\left(t^{\left(\text{new}\right)}=1|\mu\right)=\mu$, we have $p\left(t^{\left(\text{new}\right)}=1|\mathcal{D},\mathcal{H}\right)=\int_{\mu}\mu\, p\left(\mu|\mathcal{D},\mathcal{H}\right)\text{d}\mu=\mathbb{E}_{\mu}\left[\mu|\mathcal{D},\mathcal{H}\right]=\dfrac{m+a}{N+a+b}$. For the coin example ($N=m=3$) this gives $\dfrac{3+a}{3+a+b}<1$, so unlike MLE the prediction never rules out tails.
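
As a sanity check on the closed form, the sketch below compares $\frac{m+a}{N+a+b}$ with a numerical integration of $\mu$ against the posterior, assuming the posterior is $\text{Beta}\left(\mu|a+m,b+l\right)$ with $l=N-m$. The prior values $a=b=2$ and the function names are hypothetical.

```python
from math import gamma


def beta_pdf(mu, a, b):
    """Beta density with hyperparameters a and b, evaluated at mu."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)


def predictive_head(m, N, a, b):
    """Closed-form probability that the next toss is a head."""
    return (m + a) / (N + a + b)


def posterior_mean_numeric(m, N, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of mu * Beta(mu | a + m, b + N - m)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        mu = (i + 0.5) * width
        total += mu * beta_pdf(mu, a + m, b + N - m)
    return total * width


m, N = 3, 3  # three heads in three tosses
a, b = 2, 2  # hypothetical prior hyperparameters

closed_form = predictive_head(m, N, a, b)     # (3 + 2) / (3 + 2 + 2) = 5/7
numeric = posterior_mean_numeric(m, N, a, b)  # should agree up to integration error

print(closed_form, numeric)
```

With this prior the predicted probability of heads is $5/7$ rather than the MLE's 1.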

## Bayesian learning (categorical)
* *Keywords: Categorical, Multinomial, Dirichlet distrib.*

**Problem**: Rolling a die. There are $K=6$ possible outcomes.

**Model**: Each roll is a <font color='blue'>Categorical</font> (or generalized Bernoulli) event
$$p\left(\mathbf{t}|\Theta\right)=\prod_{k}\mu_{k}^{t_{k}}$$
where $\mathbf{t}$ is 1-of-K encoding, $\Theta=\left\{\mu_{1..K}\right\} $ and $\sum_{k}\mu_{k}=1$. 
 
 Therefore $N$ (independent) rolls give the <font color='blue'>**Likelihood**</font> 
 $$p\left(\mathcal{D}|\Theta\right)=\prod_{n=1}^{N}\prod_{k=1}^{K}\mu_{k}^{t_{nk}}
 $$
 where $\mathcal{D}=\left\{t_{1..N,1..K}\right\} $. Writing $m_{k}=\sum_{n}t_{nk}$ for the number of rolls showing outcome $k$, we have $N=\sum_{k}m_{k}$.

<font color='red'>*Note*</font>: the above likelihood can be written as 
$p\left(\mathcal{D}|\Theta\right)=\prod_{k=1}^{K}\mu_{k}^{m_{k}}
$, which is the pmf of a <font color='blue'>Multinomial</font> distribution without its normalizing multinomial coefficient. As above, Categorical is the *special case* of the Multinomial with $N=1$, and the original problem can be modelled using the Multinomial directly via an analogous *alternative problem*. In this case, 
$$p\left(\mathcal{D}|\Theta\right)=\left(\begin{array}{c}
N\\
m_{1}\,m_{2}\,\ldots\,m_{K}
\end{array}\right)\prod_{k=1}^{K}\mu_{k}^{m_{k}}=\dfrac{N!}{m_{1}!\,m_{2}!\cdots m_{K}!}\prod_{k=1}^{K}\mu_{k}^{m_{k}}$$
where $\mathcal{D}=\left\{m_{1..K}\right\}$ and $\Theta=\left\{ N,\mu_{1..K}\right\} $
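
The step from 1-of-K encoded rolls to counts, and from the Categorical product to the Multinomial pmf, can be sketched as below. The four rolls, the fair-die parameters and all function names are hypothetical choices for illustration.

```python
from math import factorial, prod


def one_hot(face, K):
    """1-of-K encoding of a die face in 1..K."""
    return [1 if k == face else 0 for k in range(1, K + 1)]


def categorical_likelihood(rolls, mu):
    """Product over rolls n and outcomes k of mu_k^(t_nk)."""
    return prod(mu_k**t_nk for t in rolls for mu_k, t_nk in zip(mu, t))


def multinomial_coefficient(counts):
    """N! / (m_1! * m_2! * ... * m_K!) using exact integer arithmetic."""
    coeff = factorial(sum(counts))
    for m_k in counts:
        coeff //= factorial(m_k)
    return coeff


def multinomial_pmf(counts, mu):
    """Probability of the count vector, regardless of the order of the rolls."""
    return multinomial_coefficient(counts) * prod(p**m_k for p, m_k in zip(mu, counts))


K = 6
mu = [1 / 6] * K                               # hypothetical fair die
rolls = [one_hot(f, K) for f in (3, 1, 3, 6)]  # hypothetical observed faces

counts = [sum(column) for column in zip(*rolls)]  # m_k = sum_n t_nk
sequence_likelihood = categorical_likelihood(rolls, mu)
count_pmf = multinomial_pmf(counts, mu)

print(counts, multinomial_coefficient(counts), sequence_likelihood, count_pmf)
```

As in the binary case, the two quantities differ only by a coefficient that does not depend on the parameters.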


### Bayesian inference

<font color='blue'>**Parameter Prior**</font>: $$p\left(\boldsymbol{\mu}|\mathcal{H}\right)=\text{Dir}\left(\boldsymbol{\mu}|\boldsymbol{\alpha}\right)=\dfrac{\Gamma\left(\sum_{k}\alpha_{k}\right)}{\prod_{k}\Gamma\left(\alpha_{k}\right)}\prod_{k}\mu_{k}^{\alpha_{k}-1}$$ chosen for *conjugacy*. The Dirichlet hyperparameters $\mathcal{H}=\left\{\alpha_{1..K}\right\} $ are interpreted as the *effective* (prior and observed) number of occurrences of each outcome.

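<font color='blue'>**Parameter Posterior**</font>: $$p\left(\boldsymbol{\mu}|\mathcal{D},\mathcal{H}\right)=\text{Dir}\left(\boldsymbol{\mu}|\boldsymbol{\alpha}+\mathbf{m}\right)$$ where $\mathbf{m}=\left(m_{1},\ldots,m_{K}\right)$ is the vector of observed counts.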
<img src="figs/dirichlet.png">
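
The Dirichlet update mirrors the Beta case: each observed count is added to the matching hyperparameter. The sketch below assumes a hypothetical uniform prior $\alpha_k=1$ and hypothetical counts from four rolls. It also reports the posterior mean $\alpha_k/\sum_j\alpha_j$, a standard property of the Dirichlet that plays the role the posterior mean of $\mu$ plays in the binary case.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add the observed count of each outcome to its hyperparameter."""
    return [a_k + m_k for a_k, m_k in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_k divided by the sum of all alpha."""
    total = sum(alpha)
    return [a_k / total for a_k in alpha]


alpha_prior = [1, 1, 1, 1, 1, 1]  # hypothetical uniform prior over the six faces
counts = [1, 0, 2, 0, 0, 1]       # hypothetical counts m_k from four rolls

alpha_post = dirichlet_posterior(alpha_prior, counts)
mean_post = dirichlet_mean(alpha_post)

print(alpha_post)  # [2, 1, 3, 1, 1, 2]
print(mean_post)
```

Faces that were never observed keep a non-zero posterior mean, in contrast with a pure frequency estimate.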

## TODO LDA

## Bayesian learning (continuous)
*Keywords:*
* (univariate) *Gaussian, Gamma, Gaussian-Gamma distrib.* 
* (multivariate) *Gaussian, Wishart, Gaussian-Wishart distrib.*