# Probability and Distributions

Probability, loosely speaking, concerns the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as a degree of belief about an event. We then would like to use this probability to measure the chance of something occurring in an experiment. We often quantify uncertainty in the data, uncertainty in the machine learning model, and uncertainty in the predictions produced by the model. Quantifying uncertainty requires the idea of a random variable, which is a function that maps outcomes of random experiments to a set of properties that we are interested in. Associated with the random variable is a function that measures the probability that a particular outcome (or set of outcomes) will occur; this is called the probability distribution.

Probability distributions are used as a building block for other concepts, such as probabilistic modeling, graphical models, and model selection. In the next section, we present the three concepts that define a probability space and how they are related to a fourth concept called the random variable.

## Construction of a Probability Space

The theory of probability aims at defining a mathematical structure to describe random outcomes of experiments. For example, when tossing a single coin, we cannot determine the outcome, but by doing a large number of coin tosses, we can observe a regularity in the average outcome. Using this mathematical structure of probability, the goal is to perform automated reasoning, and in this sense, probability generalizes logical reasoning.

In machine learning and statistics, there are two major interpretations of probability: the Bayesian and frequentist interpretations. The Bayesian interpretation uses probability to specify the degree of uncertainty that the user has about an event. It is sometimes referred to as "subjective probability" or "degree of belief". The frequentist interpretation considers the relative frequencies of events of interest to the total number of events that occurred. The probability of an event is defined as the relative frequency of the event in the limit when one has infinite data.
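The frequentist reading can be illustrated with a small simulation. The following is a minimal sketch, not a definition: the function name `relative_frequency`, the fair-coin default `p_heads=0.5`, and the fixed seed are assumptions made for illustration. It estimates the probability of heads as the fraction of simulated tosses that land heads, for an increasing number of tosses.

```python
import random


def relative_frequency(num_tosses, p_heads=0.5, seed=0):
    """Return the fraction of simulated coin tosses that land heads.

    p_heads is the assumed true probability of heads (hypothetical value).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < p_heads for _ in range(num_tosses))
    return heads / num_tosses


# The relative frequency tends to settle near p_heads as the number of tosses grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With only a few tosses the estimate can be far from 0.5; with many tosses it typically lies close to it, which is the regularity the frequentist definition appeals to.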

#### Probability and Random Variables

There are three distinct ideas that are often confused when discussing probabilities. First is the idea of a probability space, which allows us to quantify the idea of a probability. However, we mostly do not work directly with this basic probability space. Instead, we work with random variables (the second idea), which transfer the probability to a more convenient (often numerical) space. The third idea is the idea of a distribution or law associated with a random variable. We will introduce the first two ideas in this section and expand on the third idea later.

Modern probability is based on a set of axioms that introduce the three concepts of sample space, event space, and probability measure. The probability space models a real-world process (referred to as an experiment) with random outcomes.

**The sample space $\Omega$**

The sample space is the set of all possible outcomes of the experiment, usually denoted by $\Omega$. For example, two successive coin tosses have a sample space of $\{hh, tt, ht, th\}$, where "h" denotes "heads" and "t" denotes "tails".

**The event space $\mathcal{A}$**

The event space is the space of potential results of the experiment. A subset $A$ of the sample space $\Omega$ is in the event space $\mathcal{A}$ if at the end of the experiment we can observe whether a particular outcome $\omega \in \Omega$ is in $A$. The event space $\mathcal{A}$ is obtained by considering the collection of subsets of $\Omega$, and for discrete probability distributions $\mathcal{A}$ is often the power set of $\Omega$.

**The probability $P$**

With each event $A \in \mathcal{A}$, we associate a number $P(A)$ that measures the probability or degree of belief that the event will occur. $P(A)$ is called the probability of $A$.

The probability of a single event must lie in the interval $[0, 1]$, and the total probability over all outcomes in the sample space $\Omega$ must be 1, i.e., $P(\Omega) = 1$. Given a probability space $(\Omega, \mathcal{A}, P)$, we want to use it to model some real-world phenomenon.
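The three ingredients can be written out explicitly for the two-coin-toss example. This is a minimal sketch under one loud assumption: the coin is fair, so each of the four outcomes has probability 1/4 (the text does not fix these values). The helper names `power_set` and `P` are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space: all outcomes of two successive coin tosses.
omega = ("hh", "ht", "th", "tt")


def power_set(outcomes):
    """Return every subset of `outcomes` as a frozenset."""
    items = list(outcomes)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


# Event space: for a discrete sample space, often the power set (2**4 = 16 events).
event_space = power_set(omega)

# ASSUMPTION: a fair coin, so each outcome has probability 1/4.
outcome_prob = {w: Fraction(1, 4) for w in omega}


def P(event):
    """Probability of an event: the sum of the probabilities of its outcomes."""
    return sum((outcome_prob[w] for w in event), Fraction(0))


at_least_one_head = frozenset({"hh", "ht", "th"})
print(len(event_space))      # 16
print(P(at_least_one_head))  # 3/4
print(P(frozenset(omega)))   # 1
```

Exact fractions are used instead of floats so that the check that the whole sample space has probability 1 holds exactly.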

In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which take values in a set $\tau$. We refer to $\tau$ as the target space and to its elements as states. We introduce a function $X : \Omega \rightarrow \tau$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\tau$. This association/mapping from $\Omega$ to $\tau$ is called a random variable. For example, in the case of tossing two coins and counting the number of heads, a random variable $X$ maps the four outcomes to three possible values: $X(hh) = 2$, $X(ht) = 1$, $X(th) = 1$, and $X(tt) = 0$. In this particular case, $\tau = \{0, 1, 2\}$, and it is the probabilities on elements of $\tau$ that we are interested in. For a finite sample space $\Omega$ and finite $\tau$, the function corresponding to a random variable is essentially a lookup table. For any subset $S \subseteq \tau$, we associate a probability $P_X(S) \in [0, 1]$ with the event that the random variable $X$ takes a value in $S$.
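The lookup-table view of a random variable translates directly into code. The sketch below assumes fair coins (each outcome has probability 1/4, which the text does not specify) and computes $P_X(S)$ as the total probability of all outcomes $\omega$ with $X(\omega) \in S$. The names `P_outcome` and `P_X` are illustrative.

```python
from fractions import Fraction

# The random variable X as a lookup table: outcome -> number of heads.
X = {"hh": 2, "ht": 1, "th": 1, "tt": 0}

# Target space: the set of values that X can take.
target_space = set(X.values())

# ASSUMPTION: fair coins, so each outcome in the sample space has probability 1/4.
P_outcome = {w: Fraction(1, 4) for w in X}


def P_X(S):
    """Probability that X takes a value in S.

    Sums the probabilities of all outcomes w whose value X(w) lies in S.
    """
    return sum((P_outcome[w] for w in X if X[w] in S), Fraction(0))


print(target_space == {0, 1, 2})  # True
print(P_X({1}))                   # 1/2
print(P_X({0, 1, 2}))             # 1
```

Two different outcomes, `ht` and `th`, map to the same state 1, which is why `P_X({1})` collects the probability of both.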


Consider a statistical experiment where we model a funfair game consisting of drawing two coins from a bag (with replacement). There are coins from the USA (denoted as \$) and the UK (denoted as £) in the bag, and since we draw two coins from the bag, there are four outcomes in total. The state space or sample space $\Omega$ of this experiment is then (\$, \$), (\$, £), (£, \$), (£, £). Let us assume that the composition of the bag of coins is such that a draw returns at random a \$ with probability 0.3.
The event we are interested in is the total number of times the repeated draw returns \$. Let us define a random variable $X$ that maps the sample space $\Omega$ to $\tau$, which denotes the number of times we draw \$ out of the bag. From the preceding sample space, we can see that we get zero \$, one \$, or two \$s, and therefore $\tau = \{0, 1, 2\}$. The random variable $X$ (a function or lookup table) can be represented as a table like the following:

\begin{equation}
\begin{aligned}
X((\$, \$)) &= 2\\
X((\$, \text{£})) &= 1\\
X((\text{£}, \$)) &= 1\\
X((\text{£}, \text{£})) &= 0
\end{aligned}
\end{equation}

Since we return the first coin we draw before drawing the second, the two draws are independent of each other. Note that there are two experimental outcomes that map to the same event, where only one of the draws returns \$. Therefore, the probability mass function of $X$ is given by

\begin{equation}
\begin{aligned}
P(X=2) &= P((\$, \$))\\
&= P(\$) \cdot P(\$)\\
&= 0.3 \cdot 0.3 = 0.09
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
P(X=1) &= P((\$, \text{£}) \cup (\text{£}, \$))\\
&= P((\$, \text{£})) + P((\text{£}, \$))\\
&= 0.3 \cdot (1-0.3) + (1-0.3) \cdot 0.3 = 0.42
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
P(X=0) &= P((\text{£}, \text{£}))\\
&= P(\text{£}) \cdot P(\text{£})\\
&= (1 - 0.3) \cdot (1 - 0.3) = 0.49
\end{aligned}
\end{equation}
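The same probability mass function can be obtained by enumerating the four outcomes and accumulating their probabilities per value of $X$. This is a small sketch of that bookkeeping, using the probability 0.3 for drawing a \$ from the example and the independence of the two draws; the variable names are illustrative.

```python
from itertools import product

p_dollar = 0.3  # probability that a single draw returns $ (from the example)
coin_prob = {"$": p_dollar, "£": 1 - p_dollar}

# Accumulate the probability of each outcome under the value X assigns to it.
pmf = {0: 0.0, 1: 0.0, 2: 0.0}
for first, second in product(coin_prob, repeat=2):
    x = (first == "$") + (second == "$")  # X = number of $ in the two draws
    # Draws are independent, so the outcome probability is a product.
    pmf[x] += coin_prob[first] * coin_prob[second]

for x in sorted(pmf):
    print(f"P(X={x}) = {pmf[x]:.2f}")
# P(X=0) = 0.49
# P(X=1) = 0.42
# P(X=2) = 0.09
```

The three values sum to 1, as they must for a probability mass function over $\tau = \{0, 1, 2\}$.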

#### Statistics

Probability theory and statistics are often presented together, but they concern different aspects of uncertainty. One way of contrasting them is by the kinds of problems that are considered. Using probability, we can consider a model of some process, where the underlying uncertainty is captured by random variables, and we use the rules of probability to derive what happens. In statistics, we observe that something has happened and try to figure out the underlying process that explains the observations. In this sense, machine learning is close to statistics in its goal of constructing a model that adequately represents the process that generated the data. We can use the rules of probability to obtain a "best-fitting" model for some data.
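The two directions can be contrasted on the coin-bag game. In the sketch below, `simulate_games` goes from a known model to data (the probability direction), and `estimate_p` goes from data back to a guess about the model (the statistics direction). Both function names, the seed, and the choice of estimator (the plain fraction of draws that returned \$) are assumptions for illustration, not a method prescribed by the text.

```python
import random


def simulate_games(num_games, p_dollar=0.3, seed=0):
    """Probability direction: given the model, generate values of X.

    Each game draws two coins with replacement; X counts the $ coins drawn.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < p_dollar for _ in range(2))
        for _ in range(num_games)
    ]


def estimate_p(observations):
    """Statistics direction: given observed values of X, estimate p_dollar.

    Uses the fraction of all individual draws that returned $
    (an illustrative estimator; each game contributes two draws).
    """
    return sum(observations) / (2 * len(observations))


data = simulate_games(50_000)
print(estimate_p(data))  # typically close to the value 0.3 used to simulate
```

The estimate depends on the particular sample, so it is generally close to, but not exactly equal to, the probability used to generate the data.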

Another aspect of machine learning systems is that we are interested in generalization error. This means that we are actually interested in the performance of our system on instances that we will observe in the future, which are not identical to the instances that we have seen so far. This analysis of future performance relies on probability and statistics.

## Discrete and Continuous Probabilities