# Probability and Distributions

Probability, loosely speaking, concerns the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as a degree of belief about an event. We then would like to use this probability to measure the chance of something occurring in an experiment. We often quantify uncertainty in the data, uncertainty in the machine learning model, and uncertainty in the predictions produced by the model. Quantifying uncertainty requires the idea of a random variable, which is a function that maps outcomes of random experiments to a set of properties that we are interested in. Associated with the random variable is a function that measures the probability that a particular outcome (or set of outcomes) will occur; this is called the probability distribution.

Probability distributions are used as a building block for other concepts, such as probabilistic modeling, graphical models, and model selection. In the next section, we present the three concepts that define a probability space and how they are related to a fourth concept called the random variable.

## Construction of a Probability Space

The theory of probability aims at defining a mathematical structure to describe random outcomes of experiments. For example, when tossing a single coin, we cannot determine the outcome, but by doing a large number of coin tosses, we can observe a regularity in the average outcome. Using this mathematical structure of probability, the goal is to perform automated reasoning, and in this sense, probability generalizes logical reasoning.

In machine learning and statistics, there are two major interpretations of probability: the Bayesian and frequentist interpretations. The Bayesian interpretation uses probability to specify the degree of uncertainty that the user has about an event. It is sometimes referred to as "subjective probability" or "degree of belief". The frequentist interpretation considers the relative frequencies of events of interest to the total number of events that occurred. The probability of an event is defined as the relative frequency of the event in the limit when one has infinite data.
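The frequentist reading can be illustrated with a small simulation. The sketch below is only an illustration: the function name `relative_frequency`, the fixed seed, and the choice of a fair coin are assumptions made here, not part of the text. It shows the relative frequency of heads settling near the underlying probability as the number of tosses grows.

```python
import random

def relative_frequency(p_heads, n_tosses, seed=0):
    """Fraction of simulated tosses that come up heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return heads / n_tosses

# The relative frequency fluctuates for small samples and stabilizes for large ones.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(0.5, n))
```

With finite data the relative frequency is only an estimate; the frequentist definition refers to its limit.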

#### Probability and Random Variables

There are three distinct ideas that are often confused when discussing probabilities. First is the idea of a probability space, which allows us to quantify the idea of a probability. However, we mostly do not work directly with this basic probability space. Instead, we work with random variables (the second idea), which transfer the probability to a more convenient (often numerical) space. The third idea is the idea of a distribution or law associated with a random variable. We will introduce the first two ideas in this section and expand on the third idea later.

Modern probability is based on a set of axioms that introduce the three concepts of sample space, event space, and probability measure. The probability space models a real-world process (referred to as an experiment) with random outcomes.

**The Sample Space $\Omega$**

The sample space is the set of all possible outcomes of the experiment, usually denoted by $\Omega$. For example, two successive coin tosses have a sample space of {hh, tt, ht, th}, where "h" denotes "heads" and "t" denotes "tails".

**The event space $\mathcal{A}$**

The event space is the space of potential results of the experiment. A subset $A$ of the sample space $\Omega$ is in the event space $\mathcal{A}$ if at the end of the experiment we can observe whether a particular outcome $\omega \in \Omega$ is in $A$. The event space $\mathcal{A}$ is obtained by considering a collection of subsets of $\Omega$, and for discrete probability distributions $\mathcal{A}$ is often the power set of $\Omega$.

**The probability $P$**

With each event $A \in \mathcal{A}$, we associate a number $P(A)$ that measures the probability or degree of belief that the event will occur. $P(A)$ is called the probability of $A$.

The probability of a single event must lie in the interval \[0, 1\], and the total probability over all outcomes in the sample space $\Omega$ must be 1, i.e., $P(\Omega) = 1$. Given a probability space $(\Omega, \mathcal{A}, P)$, we want to use it to model some real-world phenomenon.

In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which take values in a set $\tau$. We refer to $\tau$ as the target space and to elements of $\tau$ as states. We introduce a function $X : \Omega \rightarrow \tau$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\tau$. This association/mapping from $\Omega$ to $\tau$ is called a random variable. For example, in the case of tossing two coins and counting the number of heads, a random variable $X$ maps to the three possible outcomes: $X(hh) = 2$, $X(ht) = 1$, $X(th) = 1$, and $X(tt) = 0$. In this particular case, $\tau = \{0, 1, 2\}$, and it is the probabilities on elements of $\tau$ that we are interested in. For a finite sample space $\Omega$ and finite $\tau$, the function corresponding to a random variable is essentially a lookup table. For any subset $S \subseteq \tau$, we associate $P_X(S) \in [0, 1]$ (the probability) to a particular event occurring corresponding to the random variable $X$.


Consider a statistical experiment where we model a funfair game consisting of drawing two coins from a bag (with replacement). There are coins from USA (denoted as \\$) and UK (denoted as £) in the bag, and since we draw two coins from the bag, there are four outcomes in total. The state space or sample space $\Omega$ of this experiment is then (\\$, \\$), (\\$, £), (£, \\$), (£, £). Let us assume that the composition of the bag of coins is such that a draw returns at random a \\$ with probability 0.3.
The event we are interested in is the total number of times the repeated draw returns \\$. Let us define a random variable $X$ that maps the sample space $\Omega$ to $\tau$, which denotes the number of times we draw \\$ out of the bag. We can see from the preceding sample space that we can get zero \\$, one \\$, or two \\$s, and therefore $\tau = \{0,1,2\}$. The random variable $X$ (a function or lookup table) can be represented as a table like the following:

\begin{equation}
\begin{aligned}
X((\$,\$)) &= 2\\
X((\$,£)) &= 1\\
X((£,\$)) &= 1\\
X((£,£)) &= 0
\end{aligned}
\end{equation}

Since we return the first coin we draw before drawing the second, the two draws are independent of each other. Note that there are two experimental outcomes that map to the same event, where only one of the draws returns \\$. Therefore, the probability mass function of $X$ is given by

\begin{equation}
\begin{aligned}
P(X=2) &= P((\$,\$))\\
&= P(\$) \cdot P(\$)\\
&= 0.3 \cdot 0.3 = 0.09
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
P(X=1) &= P((\$,£)\ \cup\ (£,\$))\\
&= P((\$,£)) + P((£,\$))\\
&= 0.3 \cdot (1-0.3) + (1-0.3) \cdot 0.3 = 0.42
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
P(X=0) &= P((£,£))\\
&= P(£) \cdot P(£)\\
&= (1 - 0.3) \cdot (1 - 0.3) = 0.49
\end{aligned}
\end{equation}
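The same calculation can be reproduced by enumerating the sample space and pushing each outcome's probability through the random variable. This is a minimal sketch; the names `p_coin`, `omega`, and `pmf` are chosen here for illustration, and only the value 0.3 comes from the example.

```python
from itertools import product

p_dollar = 0.3  # probability of drawing $ in a single draw (from the example)
p_coin = {"$": p_dollar, "£": 1 - p_dollar}

# Sample space: all ordered pairs of draws (with replacement).
omega = list(product(p_coin, repeat=2))

def X(outcome):
    """Random variable: number of $ coins in the outcome."""
    return outcome.count("$")

# Accumulate the probability of every outcome under the value X assigns to it.
pmf = {}
for outcome in omega:
    prob = p_coin[outcome[0]] * p_coin[outcome[1]]  # independent draws
    pmf[X(outcome)] = pmf.get(X(outcome), 0.0) + prob

for value in sorted(pmf):
    print(value, round(pmf[value], 2))
```

The two outcomes with exactly one \\$ land on the same key, which is why their probabilities add.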

#### Statistics

Probability theory and statistics are often presented together, but they concern different aspects of uncertainty. One way of contrasting them is by the kinds of problems that are considered. Using probability, we can consider a model of some process, where the underlying uncertainty is captured by random variables, and we use the rules of probability to derive what happens. In statistics, we observe that something has happened and try to figure out the underlying process that explains the observations. In this sense, machine learning is close to statistics in its goals to construct a model that adequately represents the process that generated the data. We can use the rules of probability to obtain a "best-fitting" model for some data.

Another aspect of machine learning systems is that we are interested in generalization error. This means that we are actually interested in the performance of our system on instances that we will observe in future, which are not identical to the instances that we have seen so far. This analysis of future performance relies on probability and statistics.

## Discrete and Continuous Probabilities

When the target space $\tau$ is discrete, we can specify the probability that a random variable $X$ takes a particular value $x \in \tau$, denoted as $P(X = x)$. The expression $P(X = x)$ for a discrete random variable $X$ is known as the probability mass function. When the target space $\tau$ is continuous, e.g., the real line $\mathbb{R}$, it is more natural to specify the probability that a random variable $X$ is in an interval, denoted by $P(a \leqslant X \leqslant b)$ for $a < b$. By convention, we specify the probability that a random variable $X$ is less than a particular value $x$, denoted by $P(X \leqslant x)$. The expression $P(X \leqslant x)$ for a continuous random variable $X$ is known as the cumulative distribution function.

### Discrete Probabilities

When the target space is discrete, we can imagine the probability distribution of multiple random variables as filling out a (multidimensional) array of numbers, as in the example below:

![Screenshot 2021-06-01 at 16.22.15.png](attachment:df82aa44-8cfb-4e16-9322-1d4e49d87733.png)

The target space of the joint probability is the Cartesian product of the target spaces of each of the random variables. We define the joint probability as the entry of both values jointly:

\begin{equation}
P(X=x_i, Y=y_j)=\frac{n_{ij}}{N}
\end{equation}

where $n_{ij}$ is the number of events with state $x_i$ and $y_j$ and $N$ the total number of events. The joint probability is the probability of the intersection of both events, that is, $P(X=x_i, Y=y_j) = P(X=x_i \cap Y=y_j)$.

The figure above illustrates the probability mass function (pmf) of a discrete probability distribution. For two random variables $X$ and $Y$, the probability that $X = x$ and $Y = y$ is (lazily) written as $p(x, y)$ and is called the joint probability. One can think of a probability as a function that takes state $x$ and $y$ and returns a real number, which is the reason we write $p(x,y)$.

The marginal probability that $X$ takes the value $x$ irrespective of the value of random variable $Y$ is (lazily) written as $p(x)$. We write $X \sim p(x)$ to denote that the random variable $X$ is distributed according to $p(x)$. If we consider only the instances where $X = x$, then the fraction of instances (the conditional probability) for which $Y = y$ is written (lazily) as $p(y | x)$.

**Marginal probability is the probability of an event irrespective of the outcome of another variable**

Consider two random variables $X$ and $Y$, where $X$ has five possible states and $Y$ has three possible states, as shown above. We denote by $n_{ij}$ the number of events with state $X = x_i$ and $Y = y_j$, and denote by $N$ the total number of events. The value $c_i$ is the sum of the individual frequencies for the $i$th column, that is, $c_i = \sum^3_{j=1}n_{ij}$. Similarly, the value $r_j$ is the row sum, that is, $r_j = \sum^5_{i=1} n_{ij}$. Using these definitions, we can compactly express the distribution of $X$ and $Y$.

The probability distribution of each random variable, the marginal probability, can be seen as the sum over a row or column:

\begin{equation}
P(X=x_i) = \frac{c_i}{N} = \frac{\sum^3_{j=1}n_{ij}}{N}
\end{equation}

and

\begin{equation}
P(Y=y_j) = \frac{r_j}{N} = \frac{\sum^5_{i=1}n_{ij}}{N}
\end{equation}

where $c_i$ and $r_j$ are the sums of the $i$th column and $j$th row of the table of counts, respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is:

\begin{equation}
\sum^5_{i=1}P(X=x_i)=1
\end{equation}

and

\begin{equation}
\sum^3_{j=1}P(Y=y_j)=1
\end{equation}

The conditional probability is the fraction of a row or column in a particular cell. For example, the conditional probability of $Y$ given $X$ is:

\begin{equation}
P(Y=y_j|X=x_i) = \frac{n_{ij}}{c_i}
\end{equation}

and the conditional probability of $X$ given $Y$ is:

\begin{equation}
P(X=x_i|Y=y_j) = \frac{n_{ij}}{r_j}
\end{equation}
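The joint, marginal, and conditional formulas above can be checked on a small table of counts. The counts below are hypothetical (the figure's actual values are not reproduced here); rows index the three states of $Y$ and columns the five states of $X$.

```python
# Hypothetical counts n[j][i]: rows are states of Y (3), columns are states of X (5).
n = [
    [2, 4, 1, 0, 3],
    [1, 3, 5, 2, 1],
    [0, 2, 2, 3, 1],
]
N = sum(sum(row) for row in n)  # total number of events

joint = [[n_ji / N for n_ji in row] for row in n]   # P(X=x_i, Y=y_j)
c = [sum(row[i] for row in n) for i in range(5)]    # column sums c_i
r = [sum(row) for row in n]                         # row sums r_j
p_x = [c_i / N for c_i in c]                        # marginal P(X=x_i)
p_y = [r_j / N for r_j in r]                        # marginal P(Y=y_j)

# Conditionals: P(Y=y_j | X=x_i) = n_ij / c_i and P(X=x_i | Y=y_j) = n_ij / r_j
p_y_given_x = [[n[j][i] / c[i] for j in range(3)] for i in range(5)]
p_x_given_y = [[n[j][i] / r[j] for i in range(5)] for j in range(3)]

print(p_x, p_y)
```

Each list in `p_y_given_x` is a distribution over $Y$ for one fixed state of $X$, so it sums to one; the sketch assumes every column and row sum is nonzero.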

In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values. They could be categorical features, such as the degree taken at university when used for predicting the salary of a person, or categorical labels, such as letters of the alphabet when doing handwriting recognition. Discrete distributions are also often used to construct probabilistic models that combine a finite number of continuous distributions.

### Continuous Probabilities

We consider real-valued random variables in this section, i.e., we consider target spaces that are intervals of the real line $\mathbb{R}$.

The size of a set is called its measure. For example, the cardinality of discrete sets, the length of an interval in $\mathbb{R}$, and the volume of a region in $\mathbb{R}^d$ are all measures. We consider a random variable with values in $\mathbb{R}^D$ to be a vector of real-valued random variables.


**Probability Density Function**
A function $f : \mathbb{R}^D \rightarrow \mathbb{R}$ is called a probability density function (pdf) if:
1. $\forall x \in \mathbb{R}^D: f(x)  \geqslant 0$
2. Its integral exists and

\begin{equation}
\int_{\mathbb{R}^D} f(x)dx=1
\end{equation}

For probability mass functions (pmf) of discrete random variables, the integral in the condition above is replaced with a sum.

Observe that the probability density function is any function $f$ that is non-negative and integrates to one. We associate a random variable $X$ with this function $f$ by:

\begin{equation}
P(a \leqslant X \leqslant b) = \int^b_{a} f(x)dx
\end{equation}

where $a, b \in \mathbb{R}$ and $x \in \mathbb{R}$ are outcomes of the continuous random variable $X$. States $x \in \mathbb{R}^D$ are defined analogously by considering a vector of $x \in \mathbb{R}$. This association is called the law or distribution of the random variable $X$.

### Contrasting Discrete and Continuous Distributions

Recall from the construction of a probability space that probabilities are nonnegative and the total probability sums up to one. For discrete random variables, this implies that the probability of each state must lie in the interval \[0,1\]. However, for continuous random variables the normalization does not imply that the value of the density is less than or equal to 1 for all values. We illustrate this in the figure below using the uniform distribution for both discrete and continuous random variables.

<img src="attachment:76bddb1f-92b6-4405-a97d-7b3edf9ebd99.png" style="width: 800px;"/>

We consider two examples of the uniform distribution, where each state is equally likely to occur. This example illustrates some differences between discrete and continuous probability distributions. Let $Z$ be a discrete uniform random variable with three states $\{z = -1.1, z = 0.3, z = 1.5\}$. The probability mass function can be represented as a table of probability values:

<img src="attachment:0dbe915e-ebf5-4b60-afc1-68dc09b56574.png" style="width: 400px;"/>

Alternatively, we can think of this as a graph (above left), where we use the fact that the states can be located on the $x$-axis, and the $y$-axis represents the probability of a particular state.

Let $X$ be a continuous random variable taking values in the range $0.9 \leqslant X \leqslant 1.6$, as represented by the figure above (right). Observe that the height of the density can be greater than 1. However, it needs to hold that:

\begin{equation}
\int^{1.6}_{0.9} f(x)dx=1
\end{equation}
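A quick numerical check makes the point concrete. The sketch below assumes the density is uniform on $[0.9, 1.6]$, as in the example, and uses a simple midpoint rule written here for illustration (the helper `integrate` is not from the text).

```python
a, b = 0.9, 1.6          # support of X from the example
height = 1.0 / (b - a)   # constant density value of a uniform distribution

def f(x):
    """Uniform density on [a, b]; zero elsewhere."""
    return height if a <= x <= b else 0.0

def integrate(func, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

print(height)              # density value, larger than 1
print(integrate(f, a, b))  # total probability, approximately 1
```

The density exceeds 1 because the interval is shorter than 1; only the area under the curve is constrained.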


Summary of nomenclature:

<img src="attachment:3b4f5335-837e-4b35-a479-8d64f6baef70.png" style="width: 600px;"/>

### Sum Rule, Product Rule, and Bayes’ Theorem

Once we have defined probability distributions corresponding to the uncertainties of the data and our problem, it turns out that there are only two fundamental rules, the sum rule and the product rule.

Recall that $p(x, y)$ is the joint distribution of the two random variables $x$, $y$. The distributions $p(x)$ and $p(y)$ are the corresponding marginal distributions, and $p(y | x)$ is the conditional distribution of $y$ given $x$. Given the definitions of the marginal and conditional probability for discrete and continuous random variables, we can now present the two fundamental rules in probability theory.

The first rule, the sum rule, states that:

\begin{equation}
  p(x) =
    \begin{cases}
      \sum_{y \in Y}p(x,y)\ \text{if y is discrete}\\
      \int_{Y}p(x,y)dy\ \text{if y is continuous}\\
    \end{cases} 
\end{equation}

where $y$ are the states of the target space of random variable $Y$. This means that we sum out (or integrate out) the set of states $y$ of the random variable $Y$. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable.

The second rule, known as the product rule, relates the joint distribution to the conditional distribution via

\begin{equation}
p(x, y) = p(y|x)p(x)
\end{equation}

The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized (written as a product) of two other distributions. The two factors are the marginal distribution of the first random variable $p(x)$, and the conditional distribution of the second random variable given the first $p(y | x)$. Since the ordering of random variables is arbitrary in $p(x, y)$, the product rule also implies $p(x, y) = p(x | y)p(y)$.

In machine learning and Bayesian statistics, we are often interested in making inferences of unobserved (latent) random variables given that we have observed other random variables. Let us assume we have some prior knowledge $p(x)$ about an unobserved random variable $x$ and some relationship $p(y | x)$ between $x$ and a second random variable $y$, which we can observe. If we observe $y$, we can use Bayes’ theorem to draw some conclusions about $x$ given the observed values of $y$.

Bayes’ theorem:

\begin{equation}
p(x|y) = \frac{p(y|x)p(x)}{p(y)}
\end{equation}

- $p(x|y)$ = posterior
- $p(y|x)$ = likelihood
- $p(x)$ = prior
- $p(y)$ = evidence

Bayes’ theorem is a direct consequence of the product rule.
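To spell out that step: writing the product rule in both orderings and equating the two factorizations gives the theorem, assuming $p(y) > 0$ so that the division is defined.

\begin{equation}
\begin{aligned}
p(x, y) &= p(y|x)p(x) = p(x|y)p(y)\\
\Rightarrow\quad p(x|y) &= \frac{p(y|x)p(x)}{p(y)}
\end{aligned}
\end{equation}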

In Bayes’ theorem, $p(x)$ is the prior, which encapsulates our subjective prior knowledge of the unobserved (latent) variable $x$ before observing any data. We can choose any prior that makes sense to us, but it is critical to ensure that the prior has a nonzero pdf (or pmf) on all plausible $x$, even if they are very rare.

The likelihood $p(y | x)$ describes how $x$ and $y$ are related, and in the case of discrete probability distributions, it is the probability of the data $y$ if we were to know the latent variable $x$. Note that the likelihood is not a distribution in $x$, but only in $y$. We call $p(y | x)$ either the “likelihood of $x$ (given $y$)” or the “probability of $y$ given $x$” but never the likelihood of $y$.

The posterior $p(x | y)$ is the quantity of interest in Bayesian statistics because it expresses exactly what we are interested in, i.e., what we know about $x$ after having observed $y$.

In Bayesian statistics, the posterior distribution is the quantity of interest as it encapsulates all available information from the prior and the data. Instead of carrying the posterior around, it is possible to focus on some statistic of the posterior, such as the maximum of the posterior.
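A small discrete example shows how prior, likelihood, evidence, and posterior fit together. Everything in the sketch is hypothetical: a latent variable with two states ("fair" or "biased" coin) and a single observed toss, with numbers chosen only for illustration.

```python
# Hypothetical two-state latent variable x and binary observation y.
prior = {"fair": 0.8, "biased": 0.2}      # p(x)
likelihood = {                            # p(y | x)
    "fair": {"heads": 0.5, "tails": 0.5},
    "biased": {"heads": 0.9, "tails": 0.1},
}

def posterior(y):
    """Return p(x | y) for every state x via Bayes' theorem."""
    unnormalized = {x: likelihood[x][y] * prior[x] for x in prior}
    evidence = sum(unnormalized.values())  # p(y), obtained with the sum rule
    return {x: value / evidence for x, value in unnormalized.items()}

post = posterior("heads")
map_state = max(post, key=post.get)        # state maximizing the posterior
print(post, map_state)
```

Observing heads shifts belief toward the biased coin, but with this prior the fair coin remains the most probable state, which is the kind of summary the maximum of the posterior provides.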

### Summary Statistics and Independence

A statistic of a random variable is a deterministic function of that random variable. The summary statistics of a distribution provide one useful view of how a random variable behaves, and as the name suggests, provide numbers that summarize and characterize the distribution. We describe the mean and the variance, two well-known summary statistics.

#### Means and Covariances
Mean and (co)variance are often useful to describe properties of probability distributions (expected values and spread).

The concept of the expected value is central to machine learning, and the foundational concepts of probability itself can be derived from the expected value.

The expected value of a function $g : \mathbb{R} \rightarrow \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is given by:
\begin{equation}
\mathbb{E}_X[g(x)] = \int_\mathcal{X} g(x)p(x)dx
\end{equation}

Correspondingly, the expected value of a function $g$ of a discrete random variable $X \sim p(x)$ is given by:
\begin{equation}
\mathbb{E}_X[g(x)] = \sum_{x \in \mathcal{X}} g(x)p(x)
\end{equation}

where $\mathcal{X}$ is the set of possible outcomes (the target space) of the random variable $X$.

The definition above defines the meaning of the notation $\mathbb{E}_X$ as the operator indicating that we should take the integral with respect to the probability density (for continuous distributions) or the sum over all states (for discrete distributions). The mean is a special case of the expected value, obtained by choosing $g$ to be the identity function.
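For a discrete random variable the definition is a weighted sum, which is short enough to write directly. The helper `expectation` below is a name chosen for this sketch; the pmf is the one computed earlier for the number of \\$ coins in two draws.

```python
def expectation(pmf, g=lambda x: x):
    """E_X[g(x)] for a discrete pmf given as {state: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

# pmf of the number of $ coins in two draws, taken from the earlier example
pmf = {0: 0.49, 1: 0.42, 2: 0.09}

mean = expectation(pmf)                            # g is the identity
second_moment = expectation(pmf, lambda x: x**2)   # g(x) = x^2
print(mean, second_moment)
```

Leaving `g` at its default recovers the mean, matching the remark that the mean is the expected value with the identity function.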

The mean of a random variable X with states $x \in \mathbb{R}^D$ is an average and is defined as

\begin{equation}
\mathbb{E}_X[x] = \begin{bmatrix}
\mathbb{E}_{X_1}[x_1]\\
\vdots \\
\mathbb{E}_{X_D}[x_D]\\
\end{bmatrix}
\end{equation}

where

\begin{equation}
    \mathbb{E}_{X_d}[x_d] :=
    \begin{cases}
      \int_\mathcal{X} x_dp(x_d)dx_d\ \text{if $X$ is a continuous random variable}\\
      \sum_{x_i \in \mathcal{X}} x_ip(x_d = x_i)\ \text{if $X$ is a discrete random variable}\\
    \end{cases} 
\end{equation}

for $d = 1, \ldots, D$, where the subscript $d$ indicates the corresponding dimension of $x$. The integral and sum are over the states $\mathcal{X}$ of the target space of the random variable $X$.

#### Empirical Means and Covariances

The definitions above are often also called the population mean and covariance, as they refer to the true statistics for the population. In machine learning, we need to learn from empirical observations of data. Consider a random variable $X$. There are two conceptual steps to go from population statistics to the realization of empirical statistics. First, we use the fact that we have a finite dataset (of size $N$) to construct an empirical statistic that is a function of a finite number of identical random variables, $X_1, \ldots, X_N$. Second, we observe the data, that is, we look at the realization $x_1, \ldots, x_N$ of each of the random variables and apply the empirical statistic.

Specifically, for the mean, given a particular dataset we can obtain an estimate of the mean, which is called the empirical mean or sample mean. The same holds for the empirical covariance.

The empirical mean vector is the arithmetic average of the observations for each variable, and it is defined as

\begin{equation}
\overline{x} := \frac{1}{N} \sum^N_{n=1} x_n
\end{equation}

where $x_n \in \mathbb{R}^D$. Similar to the empirical mean, the empirical covariance matrix is a $D \times D$ matrix

\begin{equation}
\Sigma := \frac{1}{N} \sum^N_{n=1} (x_n - \overline{x})(x_n - \overline{x})^\top
\end{equation}

To compute the statistics for a particular dataset, we would use the realizations (observations) $x_1, \ldots, x_N$ in the definitions of the empirical mean and covariance above.
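As a minimal sketch of that computation, the code below evaluates both statistics on a hypothetical dataset of four observations in $\mathbb{R}^2$. It uses the $\frac{1}{N}$ normalization from the definitions above rather than the $\frac{1}{N-1}$ variant that some libraries use by default.

```python
# Hypothetical dataset: N = 4 observations in R^2.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 9.0]]
N, D = len(data), len(data[0])

# Empirical mean: average each coordinate over the observations.
mean = [sum(x[d] for x in data) / N for d in range(D)]

# Empirical covariance: average of outer products of centered observations.
cov = [
    [sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in data) / N for j in range(D)]
    for i in range(D)
]

print(mean)
print(cov)
```

The result is a symmetric $D \times D$ matrix whose diagonal holds the per-coordinate variances.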

### Sums and Transformations of Random Variables

We may want to model a phenomenon that cannot be well explained by textbook distributions and hence may perform simple manipulations of random variables (such as adding two random variables).

Consider two random variables $X$, $Y$ with states $x$, $y \in \mathbb{R}^D$ . Then:

\begin{equation}
\begin{aligned}
\mathbb{E}[x+y] &= \mathbb{E}[x] + \mathbb{E}[y]\\
\mathbb{E}[x-y] &= \mathbb{E}[x] - \mathbb{E}[y]\\
\mathbb{V}[x+y] &= \mathbb{V}[x] + \mathbb{V}[y] + \mathrm{Cov}[x,y] + \mathrm{Cov}[y,x]\\
\mathbb{V}[x-y] &= \mathbb{V}[x] + \mathbb{V}[y] - \mathrm{Cov}[x,y] - \mathrm{Cov}[y,x]
\end{aligned}
\end{equation}
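The variance identities can be checked numerically in the scalar case, since they hold for empirical variances and covariances as well. The paired observations below are hypothetical, and the helpers `mean` and `cov` are written here only for the check.

```python
def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)

def cov(a, b):
    """Empirical covariance of two equal-length samples (1/N normalization)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Hypothetical paired scalar observations of X and Y.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 6.0]

total = [xi + yi for xi, yi in zip(x, y)]
lhs = cov(total, total)                               # V[x + y]
rhs = cov(x, x) + cov(y, y) + cov(x, y) + cov(y, x)   # right-hand side of the rule
print(lhs, rhs)
```

For scalars the two covariance terms coincide; they are kept separate because for vector-valued variables $\mathrm{Cov}[x,y]$ and $\mathrm{Cov}[y,x]$ are transposes of each other.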

### Statistical Independence

Two random variables $X$, $Y$ are statistically independent if and only if

\begin{equation}
p(x, y) = p(x)p(y)
\end{equation}

Intuitively, two random variables $X$ and $Y$ are independent if the value of $y$ (once known) does not add any additional information about $x$ (and vice versa). If $X$, $Y$ are (statistically) independent, then

\begin{equation}
\begin{aligned}
p(y|x) &= p(y)\\
p(x|y) &= p(x)
\end{aligned}
\end{equation}
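For discrete variables, the factorization criterion can be tested directly on a joint table. This is a sketch under stated assumptions: both joint tables are hypothetical, and `is_independent` is a helper written here that compares every entry with the product of its marginals up to a small tolerance.

```python
def is_independent(joint, tol=1e-12):
    """Check p(x, y) = p(x) p(y) for a joint table indexed as joint[i][j]."""
    p_x = [sum(row) for row in joint]         # marginal of X: sum over y
    p_y = [sum(col) for col in zip(*joint)]   # marginal of Y: sum over x
    return all(
        abs(joint[i][j] - p_x[i] * p_y[j]) <= tol
        for i in range(len(p_x))
        for j in range(len(p_y))
    )

# Hypothetical joint tables over two binary variables.
independent = [[0.06, 0.14], [0.24, 0.56]]  # outer product of (0.2, 0.8) and (0.3, 0.7)
dependent = [[0.4, 0.1], [0.1, 0.4]]        # mass concentrated on the diagonal

print(is_independent(independent), is_independent(dependent))
```

In the second table both marginals are uniform, yet knowing one variable changes the distribution of the other, so the product of marginals does not reproduce the joint.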

In machine learning, we often consider problems that can be modeled as independent and identically distributed (i.i.d.) random variables, $X_1, \ldots, X_N$. For more than two random variables, the word "independent" usually refers to mutually independent random variables, where all subsets are independent.

The phrase "identically distributed" means that all the random variables are from the same distribution.

Another concept that is important in machine learning is conditional independence. Two random variables $X$ and $Y$ are conditionally independent given $Z$ if and only if

\begin{equation}
p(x, y | z) = p(x|z)p(y|z)\ \text{for all}\ z \in \mathcal{Z}
\end{equation}

where $\mathcal{Z}$ is the set of states of the random variable $Z$.

### Inner Products of Random Variables
