# Chapter 6 Probability and Distributions
* **Quantifying uncertainty requires the idea of a *random variable*, which is a function that maps outcomes of random experiments to a set of properties that we are interested in.**
    * **Associated with the random variable is a function that measures the probability that a particular outcome (or a set of outcomes) will occur; this is called the probability distribution.**
    
## 6.1 Construction of a Probability Space
* **The theory of probability aims at defining a mathematical structure to describe random outcomes of experiments.**

### 6.1.1 Philosophical Issues
* **Probability theory can be considered a generalization of Boolean logic.**
* **Mathematical criteria that must apply to all plausibilities:**
1. **The degrees of plausibility are represented by real numbers.**
1. **These numbers must be based on the rules of common sense.**
1. **The reasoning must be consistent, with the three following meanings of the word "consistent":**
    1. **Consistency or non-contradiction: When the same result can be reached through different means, the same plausibility value must be found in all cases.**
    1. **Honesty: All available data must be taken into account.**
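    1. **Reproducibility: If our state of knowledge about two problems is the same, then we must assign the same degree of plausibility to both of them.**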
* **The Bayesian interpretation of probability uses probability to specify the degree of uncertainty that the user has about an event; it is sometimes referred to as "subjective probability" or "degree of belief".**
* **The frequentist interpretation considers the relative frequencies of events of interest to the total number of events that occurred. The probability of an event is defined as the relative frequency of the event in the limit when one has infinite data.**

### 6.1.2 Probability and Random Variables
1. **The sample space $ \Omega $**<br>
**The *sample space* is the set of all possible outcomes of the experiment, usually denoted by $\Omega$.**
1. **The event space $ \mathcal{A} $**<br>
**The *event space* is the space of potential results of the experiment. A subset $A$ is in the event space $\mathcal{A} $ if at the end of the experiment we can observe whether a particular outcome $\omega \in \Omega$ is in $A$.**
    * **The event space $\mathcal{A}$ is obtained by considering the collection of subsets of $\Omega$, and for discrete probability distributions $\mathcal{A}$ is often the power set of $\Omega$.**
1. **The probability $P$**<br>
**With each event $A \in \mathcal{A}$, we associate a number $P(A)$ that measures the probability or degree of belief that the event will occur. $P (A) $ is called the *probability* of $A$.**
1. **Random variable**<br>
**A function $X: \Omega \to \mathcal{T}$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\mathcal{T}$. This association/mapping from $\Omega$ to $\mathcal{T}$ is called a *random variable*.**
    * **For a finite sample space $ \Omega $ and finite $ \mathcal{T}$, the function corresponding to a random variable is essentially a look-up table.**
* **Consider the random variable $ X: \Omega \to \mathcal{T} $ and a subset $S \subseteq \mathcal{T}$. Let $ X^{-1}(S) $ be the pre-image of $S$ under $X$, i.e., the set of elements of $ \Omega $ that map to $S$ under $X$: $\{ \omega \in \Omega: X(\omega) \in S \}$. For $S \subseteq \mathcal{T} $, we have the notation
$$ P_X (S) = P ( X \in S ) = P ( X^{-1} (S)) = P(\{\omega \in \Omega: X(\omega) \in S\})$$**
* **A random variable $X$ is distributed according to a particular probability distribution $P_X$, which defines the probability mapping between an event and the probability of the outcome of the random variable.**
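
**The look-up-table view of a random variable can be made concrete with a small sketch. The experiment (two tosses of a fair coin), the choice of $X$ as the number of heads, and the helper name `prob_X_in` are illustrative assumptions, not part of the notes above.**

```python
from fractions import Fraction
from itertools import product

# Sample space Omega: all outcomes of two coin tosses (hypothetical example).
omega = list(product("HT", repeat=2))

# Probability of each outcome, assuming a fair coin.
P = {w: Fraction(1, 4) for w in omega}

# Random variable X: Omega -> T = {0, 1, 2}, the number of heads (a look-up table).
X = {w: w.count("H") for w in omega}

def prob_X_in(S):
    """P_X(S) = P({w in Omega : X(w) in S}), computed via the pre-image of S."""
    preimage = [w for w in omega if X[w] in S]
    return sum((P[w] for w in preimage), Fraction(0))

# Distribution of X over its target space.
pmf = {x: prob_X_in({x}) for x in sorted(set(X.values()))}
print(pmf)
```

**Here `prob_X_in({1, 2})` evaluates the probability of the event "at least one head" by summing over its pre-image in $\Omega$.**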

## 6.2 Discrete and Continuous Probabilities
* **When the target space $\mathcal{T} $ is discrete, we can specify the probability that a random variable $X$ takes a particular value $ x \in \mathcal{T} $, denoted as $ P(X=x) $.**
    * **The expression $P(X = x) $ for a discrete random variable $X$ is known as the *probability mass function*.**
* **When the target space $ \mathcal{T} $ is continuous, it is more natural to specify the probability that a random variable $X$ is in an interval, denoted by $ P( a \leq X \leq b) $ for $ a \lt b$.**
    * **The expression $ P(X \leq x) $ for a continuous random variable $X$ is known as the cumulative distribution function.**
    
### 6.2.1 Discrete Probabilities
* **The target space of the joint probability is the Cartesian product of the target spaces of each of the random variables. We define the *joint probability* as the entry of both values jointly
$$ P(X =  x_i, Y = y_j) = \frac{n_{ij}}{N}$$
where $n_{ij}$ is the number of events with state $x_i$ and $y_j$ and $N$ is the total number of events.**
    * **The joint probability is the probability of the intersection of both events, that is, $ P(X = x_i, Y = y_j) = P(X = x_i \cap Y = y_j) $.**
* **For two random variables $X$ and $Y$, the probability that $X = x$ and $Y = y$ is written as $p(x,y)$ and is called the joint probability.**
* **The *marginal probability* that $X$ takes the value of $x$ irrespective of the value of random variable $Y$ is written as $p(x)$.**
* **If we consider only the instances where $ X = x$, then the fraction of instances (the *conditional probability*) for which $ Y = y$ is written as $ p (y|x)$.**
* **In machine learning, we use discrete probability distributions to model *categorical variables*, i.e., variables that take a finite set of unordered values.**
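
**The joint, marginal, and conditional probabilities above can be read off a table of counts. The following is a minimal sketch; the count table `n` is made up for illustration, and exact fractions are used so that the results can be checked by hand.**

```python
from fractions import Fraction

# Hypothetical count table: n[i][j] = number of events with X = x_i and Y = y_j.
n = [[2, 1, 1],
     [1, 3, 2]]
N = sum(sum(row) for row in n)  # total number of events

# Joint probability P(X = x_i, Y = y_j) = n_ij / N.
joint = [[Fraction(c, N) for c in row] for row in n]

# Marginal probabilities: sum the joint over the other variable.
p_x = [sum(row) for row in joint]
p_y = [sum(joint[i][j] for i in range(len(n))) for j in range(len(n[0]))]

# Conditional p(y_j | x_i): fraction of the instances with X = x_i that have Y = y_j.
p_y_given_x = [[joint[i][j] / p_x[i] for j in range(len(n[0]))]
               for i in range(len(n))]

print(p_x, p_y, p_y_given_x[0])
```

**Each row of `p_y_given_x` sums to one, because conditioning on $X = x_i$ renormalizes the corresponding row of the joint table.**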

### 6.2.2 Continuous Probabilities
* **Probability Density Function. A function $f: \mathbb{R}^{D} \to \mathbb{R} $ is called a probability density function (pdf) if**
    1. $ \forall \pmb{x} \in \mathbb{R}^{D}: f(\pmb{x}) \geq 0 $
    1. **Its integral exists and 
$$ \int_{\mathbb{R}^{D}} f(\pmb{x}) \, \mathrm{d} \pmb{x} = 1$$
For the probability mass function (pmf) of a discrete random variable, the integral is replaced with a sum.**
* **We associate a random variable $X$ with this function by 
$$ P(a \leq X \leq b) = \int_a^b f(x)d x$$
where $ a, b \in \mathbb{R} $ are outcomes of the continuous random variable $X$.**

* **Cumulative Distribution Function. A *cumulative distribution function* (cdf) of a multivariate real-valued random variable $X$ with states $ \pmb{x} \in \mathbb{R}^{D}$ is given by**
$$ F_X (\pmb{x} ) = P( X_1 \leq x_1, \dots, X_D \leq x_D), $$
**where $ X = [ X_1, \dots, X_D]^T, \pmb{x} = [x_1, \dots, x_D]^T $, and the right-hand side represents the probability that random variable $X_i$ takes a value smaller than or equal to $x_i$.**
    * **The cdf can also be expressed as the integral of the probability density function $f( \pmb{x}) $ so that**
$$ F_X(\pmb{x}) = \int_{- \infty}^{x_1} \dots \int_{- \infty}^{x_D} f(z_1, \dots, z_D) \, \mathrm{d}z_1 \dots \mathrm{d}z_D $$
* **There are two distinct concepts when talking about distributions:**
    1. **The idea of a pdf (denoted by $f(x)$), which is a nonnegative function that integrates to one.**
    1. **The law of a random variable $X$, the association of random variable $X$ with the pdf $f(x)$.**
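
**The relationship between a pdf, interval probabilities, and the cdf can be checked numerically. This is a sketch only: the density $f(x) = 2x$ on $[0, 1]$ and the helper `integrate` (a midpoint Riemann sum) are assumptions chosen for illustration.**

```python
def f(x):
    """Hypothetical density: f(x) = 2x on [0, 1] and zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(g, a, b, steps=10_000):
    """Midpoint Riemann sum approximating the integral of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (k + 0.5) * h) for k in range(steps)) * h

def cdf(x):
    """F(x) = P(X <= x); the density vanishes below 0, so integrate from 0."""
    return integrate(f, 0.0, x)

total = integrate(f, 0.0, 1.0)   # normalization: should be close to 1
prob = integrate(f, 0.2, 0.5)    # P(0.2 <= X <= 0.5)

print(total, prob, cdf(0.5) - cdf(0.2))
```

**For this density the cdf is $F(x) = x^2$ on $[0, 1]$, so the interval probability equals $F(0.5) - F(0.2)$.**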

![Screen%20Shot%202020-11-27%20at%208.01.50%20AM.png](attachment:Screen%20Shot%202020-11-27%20at%208.01.50%20AM.png)

### 6.2.3 Contrasting Discrete and Continuous Distributions
![Screen%20Shot%202020-11-27%20at%208.12.14%20AM.png](attachment:Screen%20Shot%202020-11-27%20at%208.12.14%20AM.png)
**Nomenclature**<br>
* **For a value $x$ of the set of possible outcomes of the random variable $\pmb{X}$, i.e., $x \in \mathcal{T} $, $p(x)$ denotes the probability that random variable $\pmb{X}$ has the outcome $x$.**
* **For discrete random variables, this is written as $P(X=x)$, which is known as the probability mass function.**
* **For continuous variables, $p(x)$ is called the probability density function (often referred to as a density).**
* $p(\pmb{x,y})$ **is the joint distribution of the two random variables $\pmb{x,y}$. The distributions $p(\pmb{x})$ and $p(\pmb{y})$ are the corresponding marginal distributions, and $ p(\pmb{y}|\pmb{x})$ is the conditional distribution of $\pmb{y}$ given $\pmb{x}$.**

## 6.3 Sum Rule, Product Rule, and Bayes' Theorem
* **Two Fundamental Rules of Probability Theory**
* ***Sum Rule***
$$ p(\pmb{x}) = 
\begin{cases}
\sum_{\pmb{y} \in \mathcal{Y}} p(\pmb{x}, \pmb{y}) & \text{if } \pmb{y} \text{ is discrete} \\
\int_{\mathcal{Y}} p(\pmb{x}, \pmb{y}) \, \mathrm{d} \pmb{y} & \text{if } \pmb{y} \text{ is continuous}
\end{cases}
$$
**where $\mathcal{Y}$ are the states of the target space of random variable $Y$. This means that we sum out (or integrate out) the set of states $\pmb{y}$ of the random variable $Y$.**
    * **The sum rule is also known as the *marginalization property*. The sum rule relates the joint distribution to a marginal distribution.**
    * **In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables. More concretely, if $ \pmb{x} = [x_1, \dots, x_D]^T$, we obtain the marginal
$$ p(x_i)=\int p(x_1, \dots, x_D) d \pmb{x}_{\backslash i} $$
by repeated application of the sum rule where we integrate/sum out all random variables except $x_i$, which is indicated by $\backslash i$**
* **Product Rule**<br>
$$ p(\pmb{x, y}) = p(\pmb{y}|\pmb{x})p(\pmb{x})$$
**The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized into (written as a product of) two other distributions: the marginal distribution of the first random variable $p(\pmb{x})$, and the conditional distribution of the second random variable given the first $p(\pmb{y}|\pmb{x})$.**
* **Bayes' theorem**<br>
**Assume we have some prior knowledge $p(\pmb{x})$ about an unobserved random variable $\pmb{x}$ and some relationship $p(\pmb{y}|\pmb{x})$ between $\pmb{x}$ and a second random variable $\pmb{y}$, which we can observe. If we observe $\pmb{y}$, we can use Bayes' theorem to draw some conclusions about $\pmb{x}$ given the observed values of $\pmb{y}$.**
![Screen%20Shot%202020-11-28%20at%207.43.53%20AM.png](attachment:Screen%20Shot%202020-11-28%20at%207.43.53%20AM.png)
* $ p(\pmb{x})$ **is the *prior*, which encapsulates our subjective knowledge of the unobserved (latent) variable $\pmb{x}$ before observing any data.**
* **The *likelihood* $p(\pmb{y}|\pmb{x}) $ describes how $\pmb{x}$ and $\pmb{y}$ are related, and in the case of discrete probability distributions, it is the probability of the data $\pmb{y}$ if we were to know the latent variable $\pmb{x}$.**
* **The *posterior* $p(\pmb{x}|\pmb{y})$ is the quantity of interest in Bayesian statistics because it expresses exactly what we are interested in, i.e., what we know about $\pmb{x}$ after having observed $\pmb{y}$.**
* **The quantity
$$ p(\pmb{y}) = \int p(\pmb{y}|\pmb{x})p(\pmb{x}) \, \mathrm{d}\pmb{x} = \mathbb{E}_X[p(\pmb{y} | \pmb{x})]$$
is the *marginal likelihood/evidence*.**
* **The marginal likelihood is independent of $\pmb{x}$, and it ensures that the posterior $p(\pmb{x}|\pmb{y})$ is normalized. The marginal likelihood can also be interpreted as the expected likelihood where we take the expectation with respect to the prior $p(\pmb{x})$.**
* **Bayes' theorem allows us to invert the relationship between $\pmb{x}$ and $\pmb{y}$ given by the likelihood. Therefore, Bayes' theorem is sometimes called the *probabilistic inverse*.**
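
**A discrete worked example may help to see how prior, likelihood, evidence, and posterior fit together. The numbers below (a 1% prior and a test with 95% sensitivity and a 5% false-positive rate) are invented for illustration; only the sum rule, the product rule, and Bayes' theorem come from the notes.**

```python
from fractions import Fraction

# Hypothetical setup: latent x in {disease, healthy}, observation y = "positive test".
prior = {"disease": Fraction(1, 100), "healthy": Fraction(99, 100)}       # p(x)
likelihood = {"disease": Fraction(95, 100), "healthy": Fraction(5, 100)}  # p(y | x)

# Marginal likelihood (evidence): p(y) = sum_x p(y | x) p(x).
evidence = sum(likelihood[x] * prior[x] for x in prior)

# Posterior: p(x | y) = p(y | x) p(x) / p(y).
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}

print(evidence, posterior)
```

**Dividing by the evidence is what makes the posterior sum to one. With these assumed numbers the posterior probability of disease stays well below one half, because the prior is small.**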

## 6.4 Summary Statistics and Independence
### 6.4.1 Means and Covariances
* **Expected Value. The *expected value* of a function $g: \mathbb{R} \to \mathbb{R}$ of a univariate continuous random variable $ X \sim p(x)$ is given by**
$$ \mathbb{E}_X [ g(x) ] = \int_{\chi} g(x)p(x) \, \mathrm{d}x$$
**Correspondingly, the expected value of a function $g$ of a discrete random variable $ X \sim p(x)$ is given by**
$$ \mathbb{E}_X [ g(x)] = \sum_{x \in \chi} g(x) p(x)$$
**where $\chi$ is the set of possible outcomes (the target space) of the random variable $X$.**
* **We consider multivariate random variables $X$ as a finite vector of univariate random variables $[ X_1, \dots, X_D]^T$. For multivariate random variables, we define the expected value element-wise
$$ \mathbb{E}_x [g(\pmb{x})] = 
\begin {bmatrix}
\mathbb{E}_{x_1} [ g(x_1)]\\
\vdots\\
\mathbb{E}_{x_D} [ g(x_D) ]\\
\end {bmatrix}
\in \mathbb{R}^{D}
$$
where the subscript $\mathbb{E}_{x_d} $ indicates that we are taking the expected value with respect to the $d$th element of the vector $\pmb{x}$.**

* **Mean. The *mean* of a random variable $X$ with states $\pmb{x} \in \mathbb{R}^{D} $ is an average and is defined as**
$$ 
\mathbb{E}_X [\pmb{x}] = 
\begin{bmatrix}
\mathbb{E}_{X_1} [ x_1 ]\\ 
\vdots\\
\mathbb{E}_{X_D} [ x_D ]
\end{bmatrix}
\in \mathbb{R}^{D}
$$
where 
$$
\mathbb{E}_{X_d} [x_d] :=
\begin{cases}
\int_{\chi} x_d \, p(x_d) \, \mathrm{d} x_d & \text{if } X \text{ is a continuous random variable} \\
\sum_{x_i \in \chi} x_i \, p(x_d = x_i) & \text{if } X \text{ is a discrete random variable}
\end{cases}
$$
**for $d = 1, \dots, D$, where the subscript $d$ indicates the corresponding dimension of $x$. The integral and sum are over the states $ \chi$ of the target space of the random variable X.**

**Median**<br>
* **Median: the "middle" value if we sort the values, i.e., 50% of the values are greater than the median and 50% are smaller than the median.**
    * **The median provides an estimate of a typical value that is closer to human intuition than the mean value.**
    * **The median is more robust to outliers than the mean.**
    * **The generalization of the median to higher dimensions is non-trivial.**

**Mode**<br>
* **The *mode* is the most frequently occurring value.**
* **For a discrete random variable, the mode is defined as the value of $x$ having the highest frequency of occurrence.**
* **For a continuous random variable, the mode is defined as a peak in the density $p(\pmb{x})$.**

* **Covariance (Univariate). The *covariance* between two univariate random variables $X,Y \in \mathbb{R} $ is given by the expected product of their deviations from their respective means, i.e.,**
$$ \mathrm{Cov}_{X,Y} [x, y] := 
\mathbb{E}_{X,Y} [ (x-\mathbb{E}_X [x])(y - \mathbb{E}_Y [y] )] $$
* **The covariance of a variable with itself, $\mathrm{Cov}[x,x]$, is called the *variance* and is denoted by $ \mathbb{V}_X [x] $.**
* **The square root of the variance is called the *standard deviation* and is often denoted by $\sigma (x) $.**

* **Covariance (Multivariate). If we consider two multivariate random variables $X$ and $Y$ with states $\pmb{x}\in \mathbb{R}^D$ and $\pmb{y} \in \mathbb{R}^{E} $ respectively, the *covariance* between $X$ and $Y$ is defined as 
$$ \mathrm{Cov}[\pmb{x}, \pmb{y} ] = \mathbb{E}[\pmb{x}\pmb{y}^T] - \mathbb{E}[\pmb{x}] \mathbb{E} [ \pmb{y} ]^T = \mathrm{Cov} [\pmb{y}, \pmb{x} ]^T \in \mathbb{R}^{D \times E} $$**
* **The *variance* of a random variable X with states $\pmb{x} \in \mathbb{R}^D $ and a mean vector $\mu \in \mathbb{R}^{D} $ is defined as**
![Screen%20Shot%202020-11-29%20at%207.28.58%20AM.png](attachment:Screen%20Shot%202020-11-29%20at%207.28.58%20AM.png)
* **This $D \times D$ matrix is called the *covariance matrix* of the multivariate random variable $X$.**
* **The covariance matrix is symmetric and positive semidefinite and tells us something about the spread of the data.**
* **On its diagonal, the covariance matrix contains the variances of the *marginals*
$$ p(x_i) = \int p(x_1, \dots, x_D) \, \mathrm{d} \pmb{x}_{\backslash i},$$
where "$\backslash i$" denotes "all variables but $i$". The off-diagonal entries are the *cross-covariance* terms $\mathrm{Cov}[x_i, x_j] $ for $i, j =1, \dots, D, i \neq j$.**
* **Correlation. The *correlation* between two random variables $X,Y$ is given by
$$ corr[x, y] = \frac{\mathrm{Cov}[x,y]}{\sqrt{\mathbb{V}[x] \mathbb{V} [y]}} \in [-1, 1] $$**

### 6.4.2 Empirical Means and Covariances
* **Empirical Mean and Covariance. The *empirical mean* vector is the arithmetic average of the observations for each variable, and it is defined as
$$ \overline{\pmb{x} } := \frac{1}{N} \sum_{n=1}^{N} \pmb{x}_n$$**
* **To compute the statistics for a particular dataset, we would use the realizations (observations) $\pmb{x}_1, \dots, \pmb{x}_N$.**
* **The *empirical covariance* matrix is a $D \times D$ matrix
$$ \Sigma := \frac{1}{N} \sum_{n=1}^{N} (\pmb{x}_n - \overline{\pmb{x}})(\pmb{x}_n - \overline{\pmb{x}})^T$$**
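
**The empirical mean and covariance translate almost directly into NumPy. The dataset `X` below is made up for illustration, with one observation per row. Note that `np.cov` normalizes by $N-1$ by default, so `bias=True` is needed to match the $\frac{1}{N}$ definition used here.**

```python
import numpy as np

# Hypothetical dataset: N = 4 observations (rows) of a D = 2 dimensional variable.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 5.0],
              [4.0, 9.0]])
N = X.shape[0]

x_bar = X.sum(axis=0) / N            # empirical mean vector, shape (D,)
centered = X - x_bar                 # x_n - x_bar for every observation n
Sigma = centered.T @ centered / N    # (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T

print(x_bar)
print(Sigma)
```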

### 6.4.3 Three Expressions for the Variance
* **The standard definition of the variance is the expectation of the squared deviation of a random variable $X$ from its expected value $\mu$, i.e.,**
$$ \mathbb{V}_X [x] := \mathbb{E}_X [(x-\mu)^2] $$
**where the expectation is computed as a sum or an integral, depending on whether $X$ is a discrete or continuous random variable.**

**Calculating the variance through a two-pass algorithm**<br>
* **First pass through the data to calculate the mean estimate $\hat{\mu}$.**
* **Second pass using the estimate $\hat{\mu}$ to calculate the variance.**

***Raw-score formula for variance.***<br>
$$ \mathbb{V}_X [x] = \mathbb{E}_X [x^2] - (\mathbb{E}_X[x])^2 $$
* **It can be calculated empirically in one pass through the data since we can accumulate $x_i$ (to calculate the mean) and $x_i^2$ simultaneously, where $x_i$ is the $i$th observation.**

* **A third way to understand the variance is that it is a sum of pairwise differences between all pairs of observations. Consider a sample $x_1, \dots, x_N$ of realizations of random variable X, and we compute the squared difference between pairs of $x_i$ and $x_j$.**
$$ \frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = 2[\frac{1}{N} \sum_{i=1}^{N} x_i^2 - (\frac{1}{N} \sum_{i=1}^{N} x_i)^2 ]$$
* **We can express the sum of pairwise distances (of which there are $N^2$) as a sum of deviations from the mean (of which there are $N$).**
* **Geometrically, this means that there is an equivalence between the pairwise distances and the distances from the center of the set of points.**
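
**The three expressions for the variance can be compared on a small sample. This is a sketch with hypothetical function names and made-up data; all three use the $\frac{1}{N}$ normalization, and the pairwise version divides by $2N^2$ because the pairwise identity above yields twice the variance.**

```python
def variance_two_pass(xs):
    n = len(xs)
    mu_hat = sum(xs) / n                              # first pass: mean estimate
    return sum((x - mu_hat) ** 2 for x in xs) / n     # second pass: squared deviations

def variance_raw_score(xs):
    n, s, s2 = 0, 0.0, 0.0
    for x in xs:                                      # single pass over the data
        n += 1
        s += x                                        # accumulate x_i
        s2 += x * x                                   # accumulate x_i^2
    return s2 / n - (s / n) ** 2

def variance_pairwise(xs):
    n = len(xs)
    total = sum((xi - xj) ** 2 for xi in xs for xj in xs)
    return total / (2 * n * n)                        # half the mean squared pairwise difference

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance_two_pass(data), variance_raw_score(data), variance_pairwise(data))
```

**The raw-score version needs only one pass, but it subtracts two potentially large numbers, so it can lose precision in floating point when the mean is large relative to the spread.**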

### 6.4.4 Sums and Transformations of Random Variables
* **Consider two random variables $X,Y$ with states $\pmb{x,y} \in \mathbb{R}^D $. Then:**
![Screen%20Shot%202020-11-29%20at%208.46.50%20AM.png](attachment:Screen%20Shot%202020-11-29%20at%208.46.50%20AM.png)
* **Consider a random variable $X$ with mean $\pmb{\mu}$ and covariance matrix $\Sigma$ and a (deterministic) affine transformation $\pmb{y} = \pmb{Ax} + \pmb{b} $ of $\pmb{x}$. Then $\pmb{y}$ is itself a random variable whose mean vector and covariance matrix are given by**
![Screen%20Shot%202020-11-29%20at%208.52.21%20AM.png](attachment:Screen%20Shot%202020-11-29%20at%208.52.21%20AM.png)
**respectively. Furthermore,**
![Screen%20Shot%202020-11-29%20at%208.56.50%20AM.png](attachment:Screen%20Shot%202020-11-29%20at%208.56.50%20AM.png)
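
**The affine-transformation result can be checked on samples. The sketch below assumes the standard identities $\mathbb{E}[\pmb{y}] = \pmb{A\mu} + \pmb{b}$ and $\mathbb{V}[\pmb{y}] = \pmb{A}\Sigma\pmb{A}^T$, which is what the figures above are expected to show; the matrix `A`, the offset `b`, and the random samples are made up. For empirical means and covariances these identities hold exactly, not just approximately.**

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))             # hypothetical samples of x, one per row
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
b = np.array([0.5, -2.0])

Y = X @ A.T + b                           # y_n = A x_n + b for every sample

mu_x = X.mean(axis=0)
Sigma_x = np.cov(X, rowvar=False, bias=True)
mu_y = Y.mean(axis=0)
Sigma_y = np.cov(Y, rowvar=False, bias=True)

print(np.allclose(mu_y, A @ mu_x + b))           # mean transforms affinely
print(np.allclose(Sigma_y, A @ Sigma_x @ A.T))   # covariance ignores the offset b
```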

### 6.4.5 Statistical Independence
* **Independence. Two random variables $X, Y$ are *statistically independent* if and only if**
$$ p(\pmb{x,y}) = p(\pmb{x})p(\pmb{y}) $$
* **Intuitively, two random variables $X$ and $Y$ are independent if the value of $\pmb{y}$ does not add any additional information about $\pmb{x}$. If $X,Y$ are (statistically) independent, then**
    * $$ p(\pmb{y}|\pmb{x}) = p(\pmb{y}) $$
    * $$ p(\pmb{x} | \pmb{y}) = p(\pmb{x}) $$
    * $$ \mathbb{V}_{X,Y} [\pmb{x+y}] = \mathbb{V}_X [\pmb{x}] + \mathbb{V}_Y [\pmb{y}] $$
    * $$ \mathrm{Cov}_{X,Y} [\pmb{x,y}] = \pmb{0} $$

**The converse of the last point does not hold: two random variables can have zero covariance without being statistically independent, because covariance measures only linear dependence.**
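
**A classic counterexample shows zero covariance without independence. The choice of $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$ is an illustrative assumption; exact fractions are used so that the covariance is exactly zero.**

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2, so Y is a deterministic function of X.
xs = [-1, 0, 1]
p = Fraction(1, 3)
joint = {(x, x * x): p for x in xs}   # p(x, y); all other pairs have probability 0

E_x = sum(x * q for (x, y), q in joint.items())
E_y = sum(y * q for (x, y), q in joint.items())
cov = sum((x - E_x) * (y - E_y) * q for (x, y), q in joint.items())

# Marginals, used to test the factorization p(x, y) = p(x) p(y).
p_x = {x: p for x in xs}
p_y = {0: Fraction(1, 3), 1: Fraction(2, 3)}
independent = all(joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
                  for x in p_x for y in p_y)

print(cov, independent)
```

**The dependence between $X$ and $Y$ is purely nonlinear here, which is why the covariance does not detect it.**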

* **Conditional Independence. Two random variables $X$ and $Y$ are *conditionally independent* given $Z$ if and only if 
$$ p(\pmb{x,y}|\pmb{z}) = p(\pmb{x|z}) p(\pmb{y|z}) \quad \text{for all } \pmb{z} \in \mathcal{Z} $$
where $\mathcal{Z}$ is the set of states of random variable $Z$. We write $ X \perp \!\!\! \perp Y \mid Z$ to denote that $X$ is conditionally independent of $Y$ given $Z$.**
* **The interpretation can be understood as "given knowledge about $\pmb{z}$, the distribution of $\pmb{x}$ and $\pmb{y}$ factorizes".**
* **By using the product rule of probability, we can expand the left-hand side to obtain**
$$ p(\pmb{x,y} | \pmb{z}) = p(\pmb{x} | \pmb{y,z}) p(\pmb{y}|\pmb{z})$$
* **Comparing this with the definition shows that conditional independence is equivalent to $p(\pmb{x} | \pmb{y,z}) = p(\pmb{x} | \pmb{z})$: once $\pmb{z}$ is known, $\pmb{y}$ provides no further information about $\pmb{x}$.**

### 6.4.6 Inner Products of Random Variables
* **Random variables can be considered vectors in a vector space, and we can define inner products to obtain geometric properties of random variables. If we define
$$ \langle X, Y \rangle := \mathrm{Cov} [x, y] $$
for zero mean random variables $X$ and $Y$, we obtain an inner product.**
* **We see that the covariance is symmetric, positive definite, and linear in either argument. The length of the random variable is
$$ \| X \| = \sqrt{\mathrm{Cov}[x, x]} = \sqrt{\mathbb{V}[x]} = \sigma[x]$$
i.e., its standard deviation. The "longer" the random variable, the more uncertain it is; and a random variable with length 0 is deterministic.**
* **If we look at the angle $\theta$ between two random variables $X, Y$, we get
$$ \cos \theta = 
\frac{ \langle X,Y \rangle}{\| X \| \| Y \|} 
= 
\frac{\mathrm{Cov}[x,y]}{ \sqrt{\mathbb{V}[x] \mathbb{V}[y]}} $$
which is the correlation between the two random variables.**
* **This means that we can think of correlation as the cosine of the angle between two random variables when we consider them geometrically. In our case, this means that $X$ and $Y$ are orthogonal if and only if $ Cov[x, y] = 0$, i.e., they are uncorrelated.**
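
**The geometric reading of correlation can be illustrated on samples, where each centered sample vector stands in for a zero-mean random variable. The data and the helper name `inner` are assumptions made for this sketch.**

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
n = len(x)

# Center the samples so that they play the role of zero-mean random variables.
xc = [v - sum(x) / n for v in x]
yc = [v - sum(y) / n for v in y]

def inner(u, v):
    """Empirical covariance of two centered samples, used as the inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v)) / n

length_x = math.sqrt(inner(xc, xc))                   # standard deviation of x
length_y = math.sqrt(inner(yc, yc))                   # standard deviation of y
cos_theta = inner(xc, yc) / (length_x * length_y)     # correlation of x and y

print(length_x, length_y, cos_theta)
```

**By the Cauchy-Schwarz inequality the value of `cos_theta` always lies in $[-1, 1]$, matching the range of the correlation.**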

## 6.5 Gaussian Distribution
* **For a univariate random variable, the Gaussian distribution has a density that is given by
$$ p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(- \frac{(x-\mu)^2}{2\sigma^2} \right) $$**
* **The *multivariate Gaussian distribution* is fully characterized by a *mean vector* $\pmb{\mu}$ and a *covariance matrix* $\Sigma$ and is defined as 
$$ p(\pmb{x}|\pmb{\mu}, \Sigma) = (2\pi)^{-\frac{D}{2}}|\Sigma|^{-\frac{1}{2}} \exp \left(-\frac{1}{2}(\pmb{x}-\pmb{\mu})^T \Sigma^{-1} (\pmb{x}-\pmb{\mu}) \right) $$
where $\pmb{x} \in \mathbb{R}^D $. We write $p(\pmb{x}) = \mathcal{N} (\pmb{x}|\pmb{\mu}, \Sigma)$ or $ X \sim \mathcal{N} (\pmb{\mu}, \Sigma) $.**<br>
**Bivariate Gaussian Distribution**<br>
![Screen%20Shot%202020-11-30%20at%206.24.00%20AM.png](attachment:Screen%20Shot%202020-11-30%20at%206.24.00%20AM.png)
**Univariate Gaussian Distribution**<br>
![Screen%20Shot%202020-11-30%20at%206.27.18%20AM.png](attachment:Screen%20Shot%202020-11-30%20at%206.27.18%20AM.png)
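
**The multivariate density formula can be evaluated directly. The function name `gaussian_pdf` is hypothetical, and this is a readable sketch rather than a numerically robust implementation (for large $D$ one would normally work with log-densities and a Cholesky factor).**

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density N(x | mu, Sigma) for x, mu of shape (D,) and Sigma of shape (D, D)."""
    x, mu, Sigma = np.atleast_1d(x), np.atleast_1d(mu), np.atleast_2d(Sigma)
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return float(norm * np.exp(-0.5 * quad))

# Univariate case (D = 1): standard normal density at its mean.
print(gaussian_pdf(0.0, 0.0, 1.0))

# Bivariate case with a diagonal covariance: the density factorizes into univariate terms.
print(gaussian_pdf([0.5, -1.0], [0.0, 0.0], np.diag([2.0, 3.0])))
```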

### 6.5.1 Marginals and Conditionals of Gaussians are Gaussians
* **Let $X$ and $Y$ be two multivariate random variables. We explicitly write the Gaussian distribution in terms of the concatenated states $[\pmb{x}^{T}, \pmb{y}^{T}]$,
$$p(\pmb{x}, \pmb{y}) =
\mathcal{N} 
\bigg(
\begin{bmatrix}
\pmb{\mu_x}\\
\pmb{\mu_y}
\end{bmatrix}
,
\begin{bmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{bmatrix}
\bigg)
$$
where $ \Sigma_{xx} = \mathrm{Cov} [ \pmb{x,x}] $ and $\Sigma_{yy} = \mathrm{Cov} [\pmb{y,y}]$ are the marginal covariance matrices of $\pmb{x} $ and $\pmb{y}$, respectively, and $\Sigma_{xy} = \mathrm{Cov}[\pmb{x,y}] $ is the cross-covariance matrix between $\pmb{x}$ and $\pmb{y}$.**
* **The conditional distribution $p(\pmb{x|y})$ is also Gaussian and is given by**
![Screen%20Shot%202020-11-30%20at%206.43.50%20AM.png](attachment:Screen%20Shot%202020-11-30%20at%206.43.50%20AM.png)
* **The marginal distribution $p(\pmb{x})$ of a joint Gaussian distribution $p(\pmb{x,y})$ is itself Gaussian; it is computed by applying the sum rule and is given by
$$ p(\pmb{x}) = \int p(\pmb{x,y}) \mathrm{d} \pmb{y} = \mathcal{N}(\pmb{x|\mu_x, \Sigma_{xx}}) $$**
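
**Conditioning and marginalizing a joint Gaussian reduce to a few matrix operations. The sketch below assumes the standard conditional formulas $\pmb{\mu}_{x|y} = \pmb{\mu}_x + \Sigma_{xy}\Sigma_{yy}^{-1}(\pmb{y} - \pmb{\mu}_y)$ and $\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$, which is what the figure above is expected to show. The function name and the numbers are hypothetical.**

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Mean and covariance of p(x | y) for a jointly Gaussian (x, y)."""
    # gain = S_xy S_yy^{-1}; solve() avoids forming the inverse (S_yy is symmetric).
    gain = np.linalg.solve(S_yy, S_xy.T).T
    mu_cond = mu_x + gain @ (y - mu_y)
    S_cond = S_xx - gain @ S_xy.T        # S_yx = S_xy^T
    return mu_cond, S_cond

# Hypothetical bivariate example with scalar x and scalar y.
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx, S_xy, S_yy = np.array([[2.0]]), np.array([[1.0]]), np.array([[2.0]])

mu_cond, S_cond = condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y=np.array([3.0]))
print(mu_cond, S_cond)   # parameters of p(x | y = 3)
print(mu_x, S_xx)        # the marginal p(x) simply keeps the x-blocks
```

**Observing $\pmb{y}$ shifts the mean and shrinks the covariance, whereas marginalizing only reads off the corresponding blocks.**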

### 6.5.2 Product of Gaussian Densities
* **The *product* of two Gaussians $ \mathcal{N}(\pmb{x|a, A})\mathcal{N}(\pmb{x|b, B})$ is a Gaussian distribution scaled by a $c \in \mathbb{R}$, given by $c\mathcal{N}(\pmb{x|c, C})$ with
![Screen%20Shot%202020-11-30%20at%207.00.38%20AM.png](attachment:Screen%20Shot%202020-11-30%20at%207.00.38%20AM.png)
The scaling constant $c$ itself can be written in the form of a Gaussian density either in $\pmb{a}$ or in $\pmb{b}$ with an "inflated" covariance matrix $\pmb{A+B}$, i.e., $ c= \mathcal{N} (\pmb{a|b, A+B})=\mathcal{N}(\pmb{b|a, A+B})$**
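
**The product rule for Gaussian densities can be verified pointwise in the univariate case. The sketch assumes the standard result $C = (A^{-1} + B^{-1})^{-1}$ and mean $C(A^{-1}a + B^{-1}b)$, which is what the figure above is expected to show; the function names and numbers are hypothetical, and the scaling constant is called `scale` to avoid a clash with the mean.**

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def product_of_gaussians(a, A, b, B):
    """Return (scale, mean, var) such that N(x|a,A) N(x|b,B) = scale * N(x|mean,var)."""
    var = 1.0 / (1.0 / A + 1.0 / B)
    mean = var * (a / A + b / B)
    scale = normal_pdf(a, b, A + B)      # Gaussian in a with "inflated" variance A + B
    return scale, mean, var

scale, mean, var = product_of_gaussians(a=0.0, A=1.0, b=2.0, B=3.0)

x = 0.7                                   # arbitrary test point
lhs = normal_pdf(x, 0.0, 1.0) * normal_pdf(x, 2.0, 3.0)
rhs = scale * normal_pdf(x, mean, var)
print(mean, var, lhs, rhs)
```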

### 6.5.3 Sums and Linear Transformations
* **If $X, Y$ are independent Gaussian random variables (i.e., the joint distribution is given as $p(\pmb{x,y}) = p(\pmb{x})p(\pmb{y}) $) with $p(\pmb{x}) = \mathcal{N} (\pmb{x|\mu_x, \Sigma_x})$ and $ p(\pmb{y}) = \mathcal{N}(\pmb{y|\mu_y, \Sigma_y})$, then $\pmb{x+y}$ is also Gaussian distributed and given by 
$$ p(\pmb{x+y}) = \mathcal{N}(\pmb{\mu_x + \mu_y, \Sigma_x + \Sigma_y}) $$**
* **Knowing that $p(\pmb{x+y})$ is Gaussian, the mean and covariance matrix can be determined immediately.**
![Screen%20Shot%202020-11-30%20at%207.20.31%20AM.png](attachment:Screen%20Shot%202020-11-30%20at%207.20.31%20AM.png)

* **Consider a mixture of two univariate Gaussian densities
$p(x) = \alpha p_1(x) + (1-\alpha) p_2(x) $
where the scalar $0 < \alpha < 1 $ is the mixture weight, and $p_1(x)$ and $p_2(x)$ are univariate Gaussian densities with different parameters, i.e., $(\mu_1, \sigma_1^2) \neq (\mu_2, \sigma_2^2)$.<br>
Then the mean of the mixture density $p(x)$ is given by the weighted sum of the means of each random variable 
$$ \mathbb{E}[x] = \alpha \mu_1 + (1- \alpha) \mu_2 $$
The variance of the mixture density $p(x)$ is given by 
$$ \mathbb{V}[x] = [\alpha \sigma_1^2 + (1-\alpha)\sigma_2^2] + \Big([ \alpha \mu_1^2 + (1 - \alpha)\mu_2^2] - [\alpha \mu_1 + (1-\alpha)\mu_2]^2 \Big) $$**
* **This follows from the raw-score formula for the variance, $\sigma^2 = \mathbb{E}[x^2] - \mu^2$, rearranged such that the expectation of a squared random variable is the sum of the squared mean and the variance, $\mathbb{E}[x^2] = \mu^2 + \sigma^2$.**
* ***Law of total variance.* For two random variables $X$ and $Y$ it holds that $ \mathbb{V}_X [x] = \mathbb{E}_Y [ \mathbb{V}_X[x \mid y]]+ \mathbb{V}_Y [\mathbb{E}_X[x \mid y]]$, i.e., the (total) variance of $X$ is the expected conditional variance plus the variance of the conditional mean.**
* **Consider a Gaussian distributed random variable $ X \sim \mathcal{N} (\pmb{\mu, \Sigma})$. For a given matrix $\pmb{A}$ of appropriate shape, let $Y$ be a random variable such that $\pmb{y} = \pmb{Ax}$ is a transformed version of $\pmb{x}$. We can compute the mean of $\pmb{y}$ by exploiting that the expectation is a linear operator, $\mathbb{E}[\pmb{y}] = \pmb{A\mu}$, and similarly the covariance, $\mathbb{V}[\pmb{y}] = \pmb{A \Sigma A}^T$. This means that the random variable $\pmb{y}$ is distributed according to 
$$ p (\pmb{y}) = \mathcal{N} (\pmb{y|A \mu, A \Sigma A^T})$$**
* **Consider the reverse transformation, where we know that a random variable has a mean that is a linear transformation of another random variable. For a given full rank matrix $\pmb{A} \in \mathbb{R}^{M \times N}$, where $ M \geq N $, let $ \pmb{y} \in \mathbb{R}^M $ be a Gaussian random variable with mean $\pmb{Ax}$, i.e., 
$$ p(\pmb{y}) = \mathcal{N} (\pmb{y|Ax, \Sigma})$$
The corresponding probability distribution $p(\pmb{x}) $ is** 
![Screen%20Shot%202020-11-30%20at%208.03.18%20AM.png](attachment:Screen%20Shot%202020-11-30%20at%208.03.18%20AM.png)
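
**The mean and variance formulas for the mixture of two univariate Gaussians given earlier in this section can be checked against numerical integration. The parameter values and the helper `moment` (a midpoint Riemann sum over a wide interval) are assumptions made for this sketch.**

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical mixture parameters.
alpha = 0.3
mu1, var1 = -1.0, 0.5
mu2, var2 = 2.0, 1.5

def mixture_pdf(x):
    return alpha * normal_pdf(x, mu1, var1) + (1.0 - alpha) * normal_pdf(x, mu2, var2)

# Closed-form mean and variance of the mixture.
mean_mix = alpha * mu1 + (1.0 - alpha) * mu2
var_mix = (alpha * var1 + (1.0 - alpha) * var2) + (
    (alpha * mu1 ** 2 + (1.0 - alpha) * mu2 ** 2) - mean_mix ** 2
)

def moment(k, lo=-15.0, hi=15.0, steps=30_000):
    """Approximate E[x^k] under the mixture with a midpoint Riemann sum."""
    h = (hi - lo) / steps
    mids = (lo + (i + 0.5) * h for i in range(steps))
    return sum(m ** k * mixture_pdf(m) for m in mids) * h

numeric_mean = moment(1)
numeric_var = moment(2) - numeric_mean ** 2   # raw-score formula
print(mean_mix, numeric_mean, var_mix, numeric_var)
```

**The second bracket in `var_mix` is the variance of the component means, so the formula is an instance of the law of total variance with the component index as the conditioning variable.**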

### 6.5.4 Sampling from Multivariate Gaussian Distributions
1. **We need a source of pseudo-random numbers that provide a uniform sample in the interval $[ 0, 1]$.**
1. **We use a non-linear transformation such as the Box-Muller transform to obtain a sample from a univariate Gaussian.**
1. **We collate a vector of these samples to obtain a sample from a multivariate standard normal $\mathcal{N}(\pmb{0,I} )$.**
    * **To obtain samples from a multivariate normal $ \mathcal{N} (\pmb{\mu, \Sigma}) $, we can use the properties of a linear transformation of a Gaussian random variable: if $\pmb{x} \sim \mathcal{N}(\pmb{0, I} )$, then $\pmb{y = Ax + \mu}$, where $\pmb{A A^T} = \Sigma $, is Gaussian distributed with mean $\pmb{\mu}$ and covariance matrix $ \Sigma$. One convenient choice of $\pmb{A}$ is the Cholesky factor of $\Sigma$.**
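
**The three steps above can be sketched as follows. The function names `box_muller` and `sample_mvn` are hypothetical, and the Cholesky factor is used as one possible choice of $\pmb{A}$ with $\pmb{A A^T} = \Sigma$. In practice one would usually call a library routine such as `rng.multivariate_normal` instead.**

```python
import numpy as np

def box_muller(u1, u2):
    """Map two independent uniform samples (u1 in (0, 1]) to two independent N(0, 1) samples."""
    r = np.sqrt(-2.0 * np.log(u1))
    return r * np.cos(2.0 * np.pi * u2), r * np.sin(2.0 * np.pi * u2)

def sample_mvn(mu, Sigma, n, rng):
    """Draw n samples from N(mu, Sigma); one sample per row."""
    D = mu.shape[0]
    u1 = 1.0 - rng.random((n, D))        # step 1: uniforms in (0, 1], avoids log(0)
    u2 = rng.random((n, D))
    z, _ = box_muller(u1, u2)            # steps 2-3: standard normal vectors, shape (n, D)
    A = np.linalg.cholesky(Sigma)        # A A^T = Sigma
    return z @ A.T + mu                  # y = A x + mu, applied to every row

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
Y = sample_mvn(mu, Sigma, 100_000, rng)
print(Y.mean(axis=0))
print(np.cov(Y, rowvar=False))
```

**The sample mean and sample covariance should be close to `mu` and `Sigma`, with the agreement improving as the number of samples grows.**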

## 6.6 Conjugacy and the Exponential Family
* **There is some "closure property" when applying the rules of probability, e.g., Bayes' theorem. By closure, we mean that applying a particular operation returns an object of the same type.**
* **As we collect more data, we do not need more parameters to describe the distribution.**
* **Since we are interested in learning from data, we want parameter estimation to behave nicely.**
![Screen%20Shot%202020-11-30%20at%205.40.59%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%205.40.59%20PM.png)

![Screen%20Shot%202020-11-30%20at%205.44.42%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%205.44.42%20PM.png)

***Beta distribution***<br>
* **The Beta distribution is a distribution over a continuous random variable $ \mu \in [0, 1] $, which is often used to represent the probability for some binary event (e.g., the parameter governing the Bernoulli distribution). The Beta distribution $\mathrm{Beta}( \alpha, \beta)$ itself is governed by two parameters $ \alpha > 0, \beta > 0$ and is defined as 
$$ \begin{aligned}
p(\mu| \alpha, \beta) &= \frac{\Gamma (\alpha + \beta)}{\Gamma (\alpha)\Gamma (\beta) } \mu^{\alpha - 1}(1 - \mu)^{\beta - 1}, \\
\mathbb{E}[\mu] &= \frac{\alpha}{\alpha + \beta}, \\
\mathbb{V}[\mu] &= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
\end{aligned} $$
where $ \Gamma(\cdot)$ is the Gamma function defined as 
$$ \begin{aligned}
\Gamma (t) &:= \int_{0}^{\infty} x^{t-1} \exp (-x) \, \mathrm{d}x, \quad t > 0, \\
\Gamma (t + 1) &= t \, \Gamma (t)
\end{aligned} $$**
![Screen%20Shot%202020-11-30%20at%205.59.57%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%205.59.57%20PM.png)
* **For $ \alpha = 1 = \beta$, we obtain the uniform distribution $\mathcal{U} [0,1] $.**
* **For $ \alpha, \beta < 1$, we get a bimodal distribution with spikes at 0 and 1.**
* **For $\alpha, \beta > 1$, the distribution is unimodal.**
* **For $\alpha, \beta > 1$, and $ \alpha = \beta $, the distribution is unimodal, symmetric, and centered in the interval \[0, 1\], i.e., the mode/mean is at $\frac{1}{2}$**
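
**The Beta density and its moments are easy to evaluate with the standard-library Gamma function. The function names below are hypothetical; the formulas are the ones stated above.**

```python
import math

def beta_pdf(mu, alpha, beta):
    """Density of Beta(alpha, beta) at mu in (0, 1)."""
    coef = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return coef * mu ** (alpha - 1) * (1 - mu) ** (beta - 1)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_var(alpha, beta):
    return alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

print(beta_pdf(0.3, 1, 1))                        # alpha = beta = 1: uniform density
print(beta_pdf(0.5, 2, 2), beta_pdf(0.1, 2, 2))   # alpha = beta > 1: peak at 1/2
print(beta_pdf(0.05, 0.5, 0.5), beta_pdf(0.5, 0.5, 0.5))  # alpha, beta < 1: mass near 0 and 1
print(beta_mean(2, 2), beta_var(2, 2))
```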

### 6.6.1 Conjugacy
* **Conjugate Prior. A prior is *conjugate* for the likelihood function if the posterior is of the same form/type as the prior.**
    * **Conjugacy is particularly convenient because we can algebraically calculate our posterior distribution by updating the parameters of the prior distribution.**
![Screen%20Shot%202020-11-30%20at%206.32.59%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%206.32.59%20PM.png)
![Screen%20Shot%202020-11-30%20at%206.35.20%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%206.35.20%20PM.png)

* **Beta-Bernoulli Conjugacy**
![Screen%20Shot%202020-11-30%20at%206.41.16%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%206.41.16%20PM.png)
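
**With a conjugate prior, the posterior is obtained by updating parameters rather than by integration. The sketch below assumes the standard Beta-Bernoulli update, in which each observed 1 increments $\alpha$ and each observed 0 increments $\beta$; this is what the figure above is expected to derive. The function name and the data are hypothetical.**

```python
def beta_posterior(alpha, beta, observations):
    """Posterior Beta parameters after observing a list of 0/1 Bernoulli outcomes."""
    ones = sum(observations)
    zeros = len(observations) - ones
    return alpha + ones, beta + zeros

# Prior Beta(2, 2) and four hypothetical coin flips (1 = heads).
alpha_post, beta_post = beta_posterior(2, 2, [1, 0, 1, 1])
posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, posterior_mean)
```

**Updating one observation at a time gives the same result as updating with the whole batch, which is one reason conjugate pairs are convenient for sequential learning.**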

**Examples of conjugate priors for the parameters of some standard likelihoods used in probabilistic modeling.**
![Screen%20Shot%202020-11-30%20at%206.49.47%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%206.49.47%20PM.png)
* **The Beta distribution is the conjugate prior for the parameter $\mu$ in both the Binomial and the Bernoulli likelihood.**
* **In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance.**
* **In the multivariate case, we use the conjugate inverse Wishart distribution as a prior on the covariance matrix.**

### 6.6.2 Sufficient Statistics
* ***Sufficient statistics*: the idea that there are statistics that contain all available information that can be inferred from data corresponding to the distribution under consideration.**
    * **In other words, sufficient statistics carry all the information needed to make inference about the population, that is, they are statistics that are sufficient to represent the distribution.**
* **Fisher-Neyman. Let $X$ have probability density function $p(x|\theta)$. Then the statistics $\phi (x) $ are sufficient for $ \theta$ if and only if $ p(x | \theta) $ can be written in the form
$$ p(x | \theta ) = h(x) g_{\theta}(\phi(x)),$$
where $h(x)$ is a distribution independent of $\theta$ and $g_{\theta}$ captures all the dependence on $\theta$ via sufficient statistics $\phi (x) $.**

### 6.6.3 Exponential Family
* **An *exponent family* is a family of probability distributions, parametrized by $ \pmb{\theta} \in \mathbb{R}^{D} $, of the form
$$ p(\pmb{x|\theta} ) = h(\pmb{x})\mathrm{exp} (\langle \pmb{\theta}, \phi(\pmb{x})\rangle - A(\pmb{\theta})) $$
where $\phi (\pmb{x}) $ is the vector of sufficient statistics.**
    * **In general, any inner product can be used, and for conreteness we will use the standard dot product here $ ( \langle \pmb{\theta}, \phi(\pmb{x}) \rangle = \pmb{\theta}^T \phi(x) $**
    * **The form of the exponential family is essentially a particular expression of $g_{\theta}(\phi(x)) $ in the Fisher-Neyman theorem.**
    * **The factor $h(\pmb{x})$ can be absorbed into the dot product term by adding another entry $ \mathrm{log} \, h(\pmb{x}) $ to the vector of sufficient statistics $\phi (\pmb{x}) $ and constraining the corresponding parameter to $\theta_0 = 1$.**
    * **The term $A(\theta)$ is the normalization constant that ensures that the distribution sums up or integrates to one and is called the *log-partition function*.**
    * **A good intuitive notion of exponential families can be obtained by ignoring the two terms $h(\pmb{x})$ and $A(\pmb{\theta})$ and considering exponential families as distributions of the form
$$ p(\pmb{x}|\pmb{\theta}) \propto \mathrm{exp} \big(\pmb{\theta}^T \phi(\pmb{x}) \big)$$
For this form of parametrization, the parameters $\pmb{\theta}$ are called the *natural parameters*.**
![Screen%20Shot%202020-11-30%20at%207.45.55%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%207.45.55%20PM.png)

![Screen%20Shot%202020-11-30%20at%207.48.05%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%207.48.05%20PM.png)

![Screen%20Shot%202020-11-30%20at%207.48.34%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%207.48.34%20PM.png)
* **The relationship between the original Bernoulli parameter $\mu$ and the natural parameter $\theta$ is known as the *sigmoid* or logistic function.**
    * **Observe that $\mu \in (0,1) $ but $ \theta \in \mathbb{R} $, and therefore the sigmoid function squeezes a real value into the range $(0,1) $.**
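
The mapping between $\mu$ and $\theta$ can be sketched in a few lines. The code assumes the standard Bernoulli exponential-family form with $\phi(x) = x$, $h(x) = 1$, $\theta = \log\frac{\mu}{1-\mu}$ and $A(\theta) = \log(1 + \exp(\theta))$; the function names are illustrative, not taken from any library.

```python
import math


def sigmoid(theta):
    """Map a natural parameter theta in R to mu in (0, 1)."""
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    # For negative theta, this form avoids overflow in exp.
    z = math.exp(theta)
    return z / (1.0 + z)


def logit(mu):
    """Inverse of the sigmoid: map mu in (0, 1) to theta in R."""
    return math.log(mu / (1.0 - mu))


def log_partition(theta):
    """A(theta) = log(1 + exp(theta)); fine for moderate theta."""
    return math.log1p(math.exp(theta))


def bernoulli_pmf_natural(x, theta):
    """Bernoulli pmf written as exp(theta * x - A(theta)) for x in {0, 1}."""
    return math.exp(theta * x - log_partition(theta))


theta = logit(0.3)                         # natural parameter for mu = 0.3
p_one = bernoulli_pmf_natural(1, theta)    # should recover mu
p_zero = bernoulli_pmf_natural(0, theta)   # should recover 1 - mu
```

Evaluating the natural-parameter form at $x = 1$ gives back $\mu$, which is the sigmoid of $\theta$.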

* **Exponential families provide a convenient way to find conjugate pairs of distributions. Suppose the random variable $X$ is a member of the exponential family:
$$ p(\pmb{x}|\pmb{\theta}) = h(\pmb{x}) \mathrm{exp}(\langle \pmb{\theta}, \phi(\pmb{x})\rangle - A(\pmb{\theta}))$$
Every member of the exponential family has a conjugate prior
$$ p(\pmb{\theta}|\gamma) = h_c(\pmb{\theta}) \mathrm{exp} \bigg( \bigg \langle 
\begin {bmatrix}
\gamma_1 \\
\gamma_2 \\
\end {bmatrix}
,
\begin {bmatrix}
\pmb{\theta} \\
-A(\theta) \\
\end {bmatrix}
\bigg \rangle - A_c(\gamma)\bigg)$$
where $\gamma = 
\begin {bmatrix}
\gamma_1 \\
\gamma_2 \\
\end {bmatrix}$ has dimension $ \mathrm{dim}(\theta) + 1$. The sufficient statistics of the conjugate prior are $ \begin {bmatrix} \pmb{\theta}\\
-A(\pmb{\theta})\\
\end {bmatrix} $.**
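
One practical consequence of this form is that the posterior is obtained by adding to the hyperparameters. The sketch below checks this numerically for the Bernoulli case, assuming $\phi(x) = x$, $h(x) = 1$ and $A(\theta) = \log(1 + \exp(\theta))$. It works with unnormalized log densities only, so $h_c(\pmb{\theta})$ and $A_c(\gamma)$ are left out; the data and hyperparameter values are made up.

```python
import math


def log_partition(theta):
    """A(theta) for the Bernoulli family in natural parametrization."""
    return math.log1p(math.exp(theta))


def log_prior_kernel(theta, gamma1, gamma2):
    """Unnormalized log prior: <[gamma1, gamma2], [theta, -A(theta)]>."""
    return gamma1 * theta - gamma2 * log_partition(theta)


def log_likelihood(data, theta):
    """Log-likelihood of i.i.d. Bernoulli data in natural parametrization."""
    return sum(theta * x - log_partition(theta) for x in data)


def update_hyperparameters(gamma1, gamma2, data):
    """Posterior hyperparameters: add the summed statistics and the sample count."""
    return gamma1 + sum(data), gamma2 + len(data)


data = [1, 0, 1, 1]            # hypothetical observations
gamma = (1.0, 2.0)             # hypothetical prior hyperparameters
gamma_post = update_hyperparameters(*gamma, data)

for theta in (-1.0, 0.0, 0.7):
    lhs = log_prior_kernel(theta, *gamma) + log_likelihood(data, theta)
    rhs = log_prior_kernel(theta, *gamma_post)
    print(theta, lhs, rhs)
```

The two columns agree for every $\theta$: prior times likelihood has the same functional form as the prior, with $\gamma_1$ increased by $\sum_n \phi(x_n)$ and $\gamma_2$ increased by $N$.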


## 6.7 Change of Variables/Inverse Transform
* **For discrete random variables, transformations directly change the individual events (with the probabilities appropriately transformed).**
    * **Transformations of discrete random variables can be understood directly. Suppose that there is a discrete random variable $X$ with pmf $P(X = x)$ and an invertible function $U(x)$. Consider the transformed random variable $Y:=U(X)$, with pmf $P(Y = y)$. Then
$$ \begin{aligned} P(Y = y) &= P(U(X)=y)\\
    &= P(X= U^{-1} (y)) \end{aligned} $$
where we can observe that $x=U^{-1}(y)$.**
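
For a discrete random variable the transformation amounts to relabeling the outcomes. A minimal sketch, assuming a made-up pmf stored as a dictionary and the invertible map $U(x) = 2x + 1$:

```python
from fractions import Fraction


def transform_pmf(pmf_x, u):
    """Return the pmf of Y = u(X) for an invertible u, as a dict {y: P(Y = y)}."""
    pmf_y = {}
    for x, probability in pmf_x.items():
        y = u(x)
        if y in pmf_y:
            # Two outcomes map to the same y, so u has no inverse on this support.
            raise ValueError("u is not invertible on the support of X")
        # P(Y = y) = P(X = u^{-1}(y)): the probability moves with the outcome.
        pmf_y[y] = probability
    return pmf_y


# Hypothetical pmf of X on {1, 2, 3}.
pmf_x = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
pmf_y = transform_pmf(pmf_x, lambda x: 2 * x + 1)
```

The probabilities are unchanged; only the labels of the events move from $\{1, 2, 3\}$ to $\{3, 5, 7\}$.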

### 6.7.1 Distribution Function Technique
* **For a random variable $X$ and a function $U$, we find the pdf of the random variable $Y := U(X)$ by**
1. **Finding the cdf:** 
$$ F_Y(y) = P(Y \leq y) $$
1. **Differentiating the cdf $F_Y(y)$ to get the pdf $f(y)$:**
$$ f(y) = \frac{\mathrm{d}}{\mathrm{d}y} F_Y(y)$$
![Screen%20Shot%202020-11-30%20at%209.00.07%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%209.00.07%20PM.png)
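
The two steps can also be checked numerically. The sketch below uses an example chosen here for illustration: $X$ is exponential with rate 1 and $Y = \sqrt{X}$, so that $F_Y(y) = P(X \leq y^2) = 1 - \exp(-y^2)$ and $f(y) = 2y\exp(-y^2)$ for $y > 0$. The derivative of the cdf is approximated by a central difference.

```python
import math


def cdf_x(x):
    """cdf of an exponential random variable with rate 1."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def cdf_y(y):
    """Step 1: F_Y(y) = P(sqrt(X) <= y) = P(X <= y^2) = F_X(y^2) for y > 0."""
    return cdf_x(y * y) if y > 0 else 0.0


def pdf_y_numeric(y, step=1e-5):
    """Step 2: differentiate the cdf, here with a central difference."""
    return (cdf_y(y + step) - cdf_y(y - step)) / (2.0 * step)


def pdf_y_analytic(y):
    """Closed-form derivative of F_Y for comparison."""
    return 2.0 * y * math.exp(-y * y) if y > 0 else 0.0


for y in (0.3, 1.0, 1.7):
    print(y, pdf_y_numeric(y), pdf_y_analytic(y))
```

The numerical derivative and the closed form agree to several decimal places at each test point.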

* **Let $X$ be a continuous random variable with a strictly monotonic cumulative distribution function $F_X(x)$. Then the random variable $Y$ defined as 
$$ Y := F_X(X)$$ has a uniform distribution.**
    * **Probability integral transform: it is used to derive algorithms for sampling from distributions by transforming the result of sampling from a uniform random variable.**
        * **The algorithm works by first generating a sample from a uniform distribution, then transforming it by the inverse cdf (assuming this is available) to obtain a sample from the desired distribution.**
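
The sampling procedure described above can be sketched for a distribution whose inverse cdf is known in closed form. The example assumes an exponential distribution with rate $\lambda$, for which $F(x) = 1 - \exp(-\lambda x)$ and $F^{-1}(u) = -\log(1-u)/\lambda$; the rate, seed and sample size are arbitrary choices.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one Exponential(rate) sample by inverse transform sampling."""
    u = rng.random()                    # uniform sample in [0, 1)
    return -math.log(1.0 - u) / rate    # apply the inverse cdf to u


rng = random.Random(0)                  # fixed seed so the run is repeatable
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

# The mean of Exponential(rate) is 1 / rate, so this should be close to 0.5.
print(sample_mean)
```

Using `1.0 - u` rather than `u` keeps the argument of the logarithm in $(0, 1]$, since `rng.random()` can return exactly 0 but never 1.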



### 6.7.2 Change of Variables 
* **The name "change of variables" comes from the idea of changing the variable of integration when faced with a difficult integral:
$$ \int f(g(x))g'(x) \mathrm{d} x = \int f(u) \mathrm{d} u , \quad \text{where} \quad u = g(x) $$
The fundamental theorem of calculus formalizes the fact that integration and differentiation are somehow "inverses" of each other.**
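
The substitution rule can be checked numerically on a definite integral. The sketch assumes $f = \cos$ and $g(x) = x^2$ on $x \in [0, 2]$, so the limits for $u$ become $g(0) = 0$ and $g(2) = 4$; the midpoint rule and the step count are arbitrary choices.

```python
import math


def midpoint_integral(func, lower, upper, steps=10_000):
    """Approximate the integral of func over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


# Left-hand side: integrate f(g(x)) * g'(x) = cos(x^2) * 2x over x in [0, 2].
lhs = midpoint_integral(lambda x: math.cos(x * x) * 2.0 * x, 0.0, 2.0)

# Right-hand side: integrate f(u) = cos(u) over u in [g(0), g(2)] = [0, 4].
rhs = midpoint_integral(math.cos, 0.0, 4.0)

# Both approximate sin(4) - sin(0).
print(lhs, rhs, math.sin(4.0))
```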

* **Consider a univariate random variable $X$ and an *invertible* function $U$, which gives another random variable $Y=U(X)$. We assume that random variable $X$ has states $x \in [a, b] $. By the definition of the cdf, we have 
$$ F_Y(y) = P(Y \leq y) $$
We are interested in a function $U$ of the random variable:
$$ P(Y \leq y) = P(U(X) \leq y) $$
where we assume that the function $U$ is invertible.**
    * **An invertible function on an interval is either strictly increasing or strictly decreasing. If $U$ is strictly increasing, then its inverse $U^{-1}$ is also strictly increasing. By applying the inverse $U^{-1}$ to the arguments of $P(U(X) \leq y) $, we obtain
$$ P(U(X) \leq y) = P(U^{-1}(U(X)) \leq U^{-1}(y)) = P(X\leq U^{-1}(y))$$**

* **Let $f(\pmb{x})$ be the value of the probability density of the multivariate continuous random variable $X$. If the vector-valued function $\pmb{y} = U(\pmb{x})$ is differentiable and invertible for all values within the domain of $\pmb{x}$, then for corresponding values of $\pmb{y}$, the probability density of $Y = U(X)$ is given by
$$ f(\pmb{y}) = f_{\pmb{x}}(U^{-1}(\pmb{y})) \cdot \bigg| \mathrm{det} \bigg( \frac{\partial}{\partial \pmb{y}} U^{-1}(\pmb{y}) \bigg) \bigg|$$**
    * **We need to work out the inverse transform and substitute it into the density of $\pmb{x}$.**
    * **Then we calculate the determinant of the Jacobian and multiply the result.**
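
The two steps can be sketched for a linear map, where the answer is known independently. The example assumes $X$ is a standard bivariate Gaussian and $\pmb{y} = \pmb{A}\pmb{x}$ with an invertible $2 \times 2$ matrix $\pmb{A}$ (the entries below are made up), so that $U^{-1}(\pmb{y}) = \pmb{A}^{-1}\pmb{y}$, the Jacobian of $U^{-1}$ is $\pmb{A}^{-1}$, and $Y$ is Gaussian with covariance $\pmb{A}\pmb{A}^T$. The small matrix helpers are written by hand to keep the sketch dependency-free.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matvec2(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def std_normal_pdf2(x):
    """Density of the standard bivariate Gaussian N(0, I)."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def pdf_y_change_of_variables(y, a):
    """f(y) = f_x(U^{-1}(y)) * |det(Jacobian of U^{-1})| for U(x) = A x."""
    a_inv = inv2(a)
    x = matvec2(a_inv, y)                    # inverse transform U^{-1}(y)
    return std_normal_pdf2(x) * abs(det2(a_inv))


def pdf_y_gaussian(y, a):
    """Reference: density of N(0, A A^T) evaluated directly."""
    off = a[0][0] * a[1][0] + a[0][1] * a[1][1]
    sigma = [[a[0][0] ** 2 + a[0][1] ** 2, off],
             [off, a[1][0] ** 2 + a[1][1] ** 2]]
    z = matvec2(inv2(sigma), y)
    quad = y[0] * z[0] + y[1] * z[1]
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(sigma)))


a = [[2.0, 1.0], [0.5, 3.0]]                 # hypothetical invertible matrix
for y in ([0.0, 0.0], [1.0, -2.0], [3.0, 0.5]):
    print(y, pdf_y_change_of_variables(y, a), pdf_y_gaussian(y, a))
```

Both columns agree, which is a useful sanity check when the sign or direction of the Jacobian is in doubt: the determinant is that of the inverse map, not of $U$ itself.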

![Screen%20Shot%202020-12-01%20at%207.16.44%20AM.png](attachment:Screen%20Shot%202020-12-01%20at%207.16.44%20AM.png)
![Screen%20Shot%202020-12-01%20at%207.17.24%20AM.png](attachment:Screen%20Shot%202020-12-01%20at%207.17.24%20AM.png)