# Notes on Lecture 2

## 1. Probability

**Table of Contents**

- 1.1. Probability Spaces
  - 1.1.1. Definitions
  - 1.1.2. Countability
  - 1.1.3. The Lebesgue measure
  - 1.1.4. Coin tossing
  - 1.1.5. Standard and discrete probability spaces
  - 1.1.6. Product measures
  - 1.1.7. Construction of the Lebesgue measure
- 1.2. Random Variables
  - 1.2.1. Introduction
  - 1.2.2. Random variables
  - 1.2.3. Distributions
  - 1.2.4. Simple functions
  - 1.2.5. Construction of the expectation
  - 1.2.6. $L_p$ spaces
  - 1.2.7. Computation of the expectation
  - 1.2.8. Borel measurability
  - 1.2.9. Lebesgue&ndash;Stieltjes integration
  - 1.2.10. Probability density functions
  - 1.2.11. Statistics of a probability distribution
  - 1.2.12. Covariance
  - 1.2.13. Random variables on product spaces
- 1.3. Probability Distributions
  - 1.3.1. Binomial and Bernoulli distributions
  - 1.3.2. Poisson approximation of the binomial distribution
  - 1.3.3. Poisson distribution
  - 1.3.4. Normal approximation to the binomial distribution
  - 1.3.5. Gaussian integrals
  - 1.3.6. Normal distribution
- 1.4. Conditioning
  - 1.4.1. Conditional probability
  - 1.4.2. Independence
  - 1.4.3. Conditional expectation
  - 1.4.4. Conditional probabilities
  - 1.4.5. Conditional distributions
- 1.5. Multivariate Random Variables
  - 1.5.1. Random vectors
  - 1.5.2. Variance-covariance matrix
  - 1.5.3. Joint probability distributions
  - 1.5.4. Multivariate Gaussian

### 1.1. Probability Spaces

**1.1.1. Definitions.** Probability theory begins with three notions: 

1. a **sample space** that determines the scope of an experiment,  
2. an **event**, which is a subset of the sample space, that represents a particular collection of outcomes of the experiment, and  
3. a **probability measure** that assigns a likelihood to each event.

Formally, a **probability space** is an ordered triple $(\Omega, \mathcal{F}, \mathbb{P})$ consisting of a set $\Omega$ denoting the sample space, a **$\sigma$-algebra** $\mathcal{F}$ of events, and a **probability measure** $\mathbb{P}$ on the measurable space $(\Omega, \mathcal{F})$.

By a $\sigma$-algebra, we mean a set $\mathcal{F}$ of subsets of $\Omega$ that satisfies the following properties:

1. The full set $\Omega$ and the empty set $\varnothing$ are elements of $\mathcal{F}$;
2. if $E$ is an element of $\mathcal{F}$, then its complement $\Omega \smallsetminus E$ is an element of $\mathcal{F}$;
3. if $\{E_n\}_{n=1}^{\infty}$ is a collection of sets in $\mathcal{F}$, then its union $\bigcup_{n=1}^\infty E_n$ and its intersection $\bigcap_{n=1}^\infty E_n$ are elements of $\mathcal{F}$.  

The axioms above formalize the basic, intuitive properties of a sample space. Indeed, we would want to be able to examine, for each event $E$, the complement event $\Omega \smallsetminus E$ of $E$ not occurring. We would also want to be able to examine the event that at least one of the events in a collection of events happens, and the event that all events in a collection happen. Lastly, we throw in the two trivial events for completeness: the event that something happens, and the event that nothing happens.

The ordered pair $(\Omega, \mathcal{F})$ of a set and a $\sigma$-algebra on it is called a **measurable space**, because we can define a probability measure on it. A **probability measure** on $(\Omega, \mathcal{F})$ is a function $\mathbb{P}:\mathcal{F} \to [0,1]$ such that $\mathbb{P}(\Omega) = 1$, and that the **countable additivity** criterion

$$\mathbb{P}\left(\bigcup_{n=1}^\infty E_n \right) = \sum_{n=1}^\infty \mathbb{P}(E_n)$$

holds whenever $\{E_n\}_{n=1}^\infty$ is a disjoint collection of events. In other words, the probability that at least one of the events in a disjoint collection happens should be the sum of the individual probabilities.
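The axioms above can be checked mechanically on a small finite example. The following is a minimal sketch, assuming a four-point sample space with the uniform measure; the names `power_set`, `is_sigma_algebra`, and `uniform_measure` are illustrative helpers, not part of the notes. On a finite space, closure under pairwise unions and intersections is enough to give closure under countable ones.

```python
from fractions import Fraction
from itertools import chain, combinations

def power_set(omega):
    """Return every subset of omega as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(omega, events):
    """Check the three axioms on a finite space (pairwise closure suffices here)."""
    omega = frozenset(omega)
    if omega not in events or frozenset() not in events:
        return False
    if any(omega - e not in events for e in events):
        return False
    return all(a | b in events and a & b in events for a in events for b in events)

def uniform_measure(omega):
    """Normalized counting measure: P(E) = |E| / |omega|."""
    size = len(omega)
    return lambda event: Fraction(len(event), size)

omega = {1, 2, 3, 4}          # hypothetical sample space
events = power_set(omega)     # the largest sigma-algebra on omega
prob = uniform_measure(omega)
```

For instance, `prob(frozenset({1, 2}))` evaluates to `Fraction(1, 2)`, and additivity can be confirmed on any pair of disjoint events.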

**1.1.2. Countability.** Observe that we consider only *countable* unions (and intersections) in formalizing probability theory. This is to prevent pathological behaviors, such as events of probability zero adding up to be an event of probability one.

**1.1.3. The Lebesgue measure.** The standard example of a probability space is $\mathcal{S}  = ([0,1],\mathscr{B}_{[0,1]},\mathscr{L}_{[0,1]})$, the Lebesgue measure on the Borel $\sigma$-algebra on the unit interval $[0,1]$. The Lebesgue measure is the standard *length* of sets of real numbers. For example, the Lebesgue measure of $(0,1/3) \cup (2/3,1)$ is $(1/3-0) + (1-2/3) = 2/3$. The Borel $\sigma$-algebra collects the sets of real numbers that are *nice enough* to have a reasonable notion of length. (Not so nice sets exist: see [Vitali set](https://en.wikipedia.org/wiki/Vitali_set).) A formal construction of the Lebesgue measure (**§1.1.7**) is done via the Carathéodory extension theorem (**§1.1.6**).

Since <a href="https://en.wikipedia.org/wiki/Singleton_(mathematics)">singleton sets</a> have Lebesgue measure zero, we typically consider two events whose intersection consists of finitely many points to be disjoint. This is an example of a property that holds **almost surely**, i.e., a statement that is true except on a set of measure zero. 

**1.1.4. Coin tossing.** Many probabilistic contexts can be modeled on $\mathcal{S}$. For example, consider the experiment of tossing an unbiased coin twice, resulting in four outcomes, each with probability $1/4$: `HH`, `HT`, `TH`, `TT`. We can let $[0,1/4]$, $[1/4, 1/2]$, $[1/2, 3/4]$, $[3/4,1]$ represent the four outcomes to model this experiment. 

With this basic setting, compound events can be represented by subsets of $[0,1]$ as well. For example, the event that the first coin toss turns up heads is $[0, 1/4] \cup [1/4,1/2]$. All things considered, the $\sigma$-algebra for our coin-toss experiment is

$$\begin{align*}\mathcal{F} =& \{[p_1,q_1] \cup [p_2, q_2] : p_1,p_2,q_1,q_2 \in \{0,1/4,1/2,3/4,1\} \\ &\mbox{ and } p_1 < q_1 \leq p_2 < q_2\} \cup \{\varnothing\},\end{align*}$$

and the probability measure is defined by the Lebesgue measure on each set in $\mathcal{F}$.

Analogously, the experiment of tossing an unbiased coin $n$ times can be modeled with the $\sigma$-algebra

$$\mathcal{F}_n =\sigma \left(\left\{\left[ \frac{k}{2^n}, \frac{k+1}{2^n}\right] : k \in \{0,1,\ldots,2^n-1\}\right\}\right)$$

and the Lebesgue measure as the probability measure. Here, $\sigma(\mathcal{A})$ denotes the $\sigma$-algebra constructed from a collection $\mathcal{A}$ by taking countable unions, countable intersections, and set complements. Such a $\sigma$-algebra is called the **$\sigma$-algebra generated by $\mathcal{A}$.**
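As a rough illustration of a generated $\sigma$-algebra, the sketch below enumerates the events obtainable from the $2^n$ dyadic intervals. It assumes each interval $[k/2^n,(k+1)/2^n]$ is represented by its index $k$, so that shared endpoints (which have Lebesgue measure zero) are ignored; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def generated_sigma_algebra(n):
    """All unions of the 2**n dyadic atoms; atom k stands for [k/2**n, (k+1)/2**n]."""
    atoms = range(2 ** n)
    return {frozenset(c) for r in range(2 ** n + 1) for c in combinations(atoms, r)}

def lebesgue(event, n):
    """Total length of a union of dyadic atoms, each of width 2**-n."""
    return Fraction(len(event), 2 ** n)

F2 = generated_sigma_algebra(2)
# Atoms 0 and 1 correspond to the outcomes HH and HT, i.e. "first toss is heads".
first_toss_heads = frozenset({0, 1})
```

With two tosses there are four atoms and hence $2^4 = 16$ events, and `lebesgue(first_toss_heads, 2)` equals `Fraction(1, 2)`.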

How about tossing an unbiased coin *infinitely* many times? It would be reasonable to take the $\sigma$-algebra to be

$$\mathcal{F}_\infty = \sigma\left( \bigcup_{n=1}^\infty \mathcal{F}_n \right),$$

with the Lebesgue measure as the probability measure. Since every real number in $[0,1]$ can be written as an infinite sum of [dyadic rational numbers](https://en.wikipedia.org/wiki/Dyadic_rational), every subinterval of $[0,1]$ can be built from dyadic intervals by countable unions and intersections, and so $\mathcal{F}_\infty$ contains all subintervals of $[0,1]$. Since the Borel $\sigma$-algebra $\mathscr{B}_{[0,1]}$ can be generated by taking countable unions and intersections of the subintervals of $[0,1]$, we conclude that $\mathcal{F}_\infty \supseteq \mathscr{B}_{[0,1]}$. Conversely, every dyadic interval generating $\mathcal{F}_\infty$ is a Borel set, so $\mathcal{F}_\infty \subseteq \mathscr{B}_{[0,1]}$, and thus $\mathcal{F}_\infty = \mathscr{B}_{[0,1]}$. It follows that the probability space $([0,1], \mathscr{B}_{[0,1]}, \mathscr{L}_{[0,1]})$ models the experiment of tossing an unbiased coin infinitely many times.

**1.1.5. Standard and discrete probability spaces.** Often, experiments that appear to have nothing to do with the unit interval at first glance turn out to admit $([0,1], \mathscr{B}_{[0,1]}, \mathscr{L}_{[0,1]})$ as their probabilistic model. A probability space modeling an experiment that can also be modeled on $([0,1], \mathscr{B}_{[0,1]}, \mathscr{L}_{[0,1]})$ is called a [standard probability space](https://en.wikipedia.org/wiki/Standard_probability_space).

Although standard probability spaces can be modeled on $([0,1], \mathscr{B}_{[0,1]}, \mathscr{L}_{[0,1]})$, there is often a more abstract probabilistic model that is more in tune with the nature of the experiment. For example, we can model the experiment of tossing an unbiased coin $n$ times quite directly, using symbols `H` and `T`. Indeed, we take the $n$-fold Cartesian product

$$\Omega_n = \{H,T\}^n = \{(t_1,\ldots,t_n) : t_1,\ldots,t_n \in \{H,T\}\}$$

to be the sample space, the [power set](https://en.wikipedia.org/wiki/Power_set) 

$$\mathcal{F}_n = \mathcal{P}(\Omega_n) = \{A : A \subseteq \Omega_n\}$$

of the sample space to be the $\sigma$-algebra of events, and the normalized [counting measure](https://en.wikipedia.org/wiki/Counting_measure)

$$\mathbb{P}[E] = \frac{|E|}{2^n} \mbox{ for each } E \in \mathcal{F}_n$$

to be the probability measure. $(\Omega_n, \mathcal{F}_n,\mathbb{P})$ is an example of a **discrete probability space**, whose sample space $\Omega_n$ consists of at most countably many elements and whose $\sigma$-algebra is the power set of the sample space.
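The discrete model is easy to enumerate directly. The sketch below assumes $n = 3$ and represents outcomes as tuples of the characters `H` and `T`; `coin_space` and `prob` are illustrative names.

```python
from fractions import Fraction
from itertools import product

def coin_space(n):
    """Sample space {H, T}^n as a list of n-tuples."""
    return list(product("HT", repeat=n))

def prob(event, n):
    """Normalized counting measure |E| / 2**n."""
    return Fraction(len(event), 2 ** n)

n = 3
omega = coin_space(n)
# The event "exactly two heads" is a subset of the sample space.
exactly_two_heads = [w for w in omega if w.count("H") == 2]
```

Here `prob(exactly_two_heads, n)` is `Fraction(3, 8)`, since three of the eight outcomes contain exactly two heads.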

On the other hand, it is not easy to see what the natural probability measure for the infinite coin toss space $(\{H,T\}^\infty, \mathcal{P}(\{H,T\}^\infty))$ should be. In these cases, it makes sense to model the experiment on the unit interval and use the Lebesgue measure.

**1.1.6. Product measures.** To formalize the process of taking the product of probability spaces, we consider the Cartesian product $\Omega_1 \times \Omega_2$ of the sample spaces $\Omega_1$ and $\Omega_2$ underlying the probability spaces $(\Omega_1,\mathcal{F}_1,\mathbb{P}_1)$ and $(\Omega_2,\mathcal{F}_2,\mathbb{P}_2)$, respectively. A **rectangle** on $\Omega_1 \times \Omega_2$ is a set of the form

$$E_1 \times E_2,$$

where $E_1 \in \mathcal{F}_1$ and $E_2 \in \mathcal{F}_2$. We define the **product probability measure** $\mathbb{P}_1 \otimes \mathbb{P}_2$ on rectangles by setting

$$(\mathbb{P}_1 \otimes \mathbb{P}_2)(E_1 \times E_2) = \mathbb{P}_1(E_1)\mathbb{P}_2(E_2).$$

Since the collection of all rectangles is not a $\sigma$-algebra, $\mathbb{P}_1 \otimes \mathbb{P}_2$ is not yet a *bona fide* probability measure.

Let us now consider a collection $\mathcal{F}$ of all finite unions, intersections, and complements of rectangles on $\Omega_1 \times \Omega_2$. The resulting $\mathcal{F}$ is an **algebra**, as it is closed under finite unions, intersections, and complementation. The following extension theorem now extends $\mathbb{P}_1 \otimes \mathbb{P}_2$ to $\sigma(\mathcal{F})$, the $\sigma$-algebra generated by $\mathcal{F}$ (**§1.1.4**).

**Carathéodory extension theorem.** If $\mathbb{P}$ is a countably additive (**§1.1.1**) function on an algebra $(\Omega, \mathcal{A})$ with $\mathbb{P}(\Omega) = 1$, then there exists a unique probability measure on $(\Omega, \sigma(\mathcal{A}))$ that agrees with $\mathbb{P}$ on $\mathcal{A}$.

The resulting $\sigma$-algebra is called the **product $\sigma$-algebra** of $\mathcal{F}_1$ and $\mathcal{F}_2$ and is denoted by $\mathcal{F}_1 \otimes \mathcal{F}_2$.
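On finite spaces, the extension step is transparent: the product measure of any event is the sum of the products of point masses. The sketch below assumes two hypothetical discrete spaces (a biased coin and a fair three-sided die) and checks the defining identity on a rectangle.

```python
from fractions import Fraction
from itertools import product

# Hypothetical point masses for two finite probability spaces.
p1 = {"H": Fraction(2, 3), "T": Fraction(1, 3)}
p2 = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

def measure(pmf, event):
    """Probability of an event as the sum of its point masses."""
    return sum((pmf[w] for w in event), Fraction(0))

# Point masses of the product measure on the Cartesian product.
p12 = {(a, b): p1[a] * p2[b] for a, b in product(p1, p2)}

def rectangle(e1, e2):
    """The rectangle E1 x E2 as a set of pairs."""
    return set(product(e1, e2))

E1, E2 = {"H"}, {1, 3}
lhs = measure(p12, rectangle(E1, E2))
rhs = measure(p1, E1) * measure(p2, E2)
```

Both `lhs` and `rhs` equal `Fraction(4, 9)`. Events that are not rectangles, such as `{("H", 1), ("T", 2)}`, are still assigned a probability by `measure(p12, ...)`, which is what the extension provides.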

**1.1.7. Construction of the Lebesgue measure.** With the Carathéodory extension theorem, we can give a more detailed construction of the Lebesgue measure (**§1.1.3**).

Given a metric space $(M,d_M)$, the $\sigma$-algebra $\mathscr{B}_M$ generated by the open subsets of $M$ is called the **Borel $\sigma$-algebra** on $M$. Equivalently, $\mathscr{B}_M$ is the $\sigma$-algebra generated by the closed subsets of $M$.

We now define the **length** $\mathscr{L}$ of an open subinterval $(a,b)$ of $[0,1]$ to be $b-a$. Applying the Carathéodory extension theorem to $\mathscr{L}$ on the field of all finite unions, intersections, and complements of open subintervals of $[0,1]$, we obtain a probability measure on $\mathscr{B}_{[0,1]}$, called the **Lebesgue measure** on $[0,1]$.

In general, a **measure** is a countably additive function on a $\sigma$-algebra. The Carathéodory extension theorem continues to hold for general measures, producing not a probability measure but merely a measure.

Defining the length of an open interval $(a,b)$ in $\mathbb{R}$ to be $b-a$, we can use the Carathéodory extension theorem to construct a measure on $\mathscr{B}_\mathbb{R}$, called the **Lebesgue measure on $\mathbb{R}$**. The product $\sigma$-algebra

$$\mathscr{B}_\mathbb{R} \otimes \cdots \otimes \mathscr{B}_{\mathbb{R}}$$

agrees with the Borel $\sigma$-algebra $\mathscr{B}_{\mathbb{R}^n}$ with respect to the standard Euclidean metric, and the resulting product measure is called the **Lebesgue measure on $\mathbb{R}^n$.**

### 1.2. Random Variables

**1.2.1. Introduction.** To attach meanings to the outcomes of an experiment, it makes sense to consider functions on the corresponding sample space, the distribution of the values of such a function over the sample space, and the weighted average of the values that takes the distribution into account. These correspond to random variables, probability distributions, and expectations, respectively.

**1.2.2. Random variables.** For various technical reasons, we do not consider every function on a sample space. Instead, we consider **measurable functions** on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, which are functions $X: \Omega \to \mathbb{R}$  such that

$$X^{-1}((-\infty,\alpha]) = \{X \leq \alpha\} = \{\omega : X(\omega) \leq \alpha\}$$

is an element of the $\sigma$-algebra $\mathcal{F}$ for each real number $\alpha$. A measurable function defined over a probability space is called a **random variable**.

**1.2.3. Distributions.** One reason for this restriction is that we are interested in the **probability distribution**, or the **cumulative distribution function**, 

$$F_X(\alpha) = \mathbb{P}[X \leq \alpha] = \mathbb{P}[\{\omega: X(\omega) \leq \alpha\}]$$

of a random variable $X$, which tells us how the values of $X$ are distributed throughout the sample space. The expression $\mathbb{P}[X \leq \alpha]$ only makes sense if $\{X \leq \alpha\}$ is in $\mathcal{F}$, i.e., $X$ is measurable.

Note that there is a canonical interpretation of $F_X$ as a probability measure on $\mathbb{R}$. Indeed, we set

$$\mathscr{L}_X((-\infty, \alpha]) = F_X(\alpha)$$

for each $\alpha \in \mathbb{R}$ and apply the Carathéodory extension theorem (**§1.1.6**) to extend $\mathscr{L}_X$ to a probability measure on $(\mathbb{R},\mathscr{B}_\mathbb{R})$. This measure is called the **law** with respect to the random variable $X$.

We remark that $F_X$ satisfies the following properties:

- $0 \leq F_X \leq 1$;
- $F_X$ is increasing;
- $F_X(\alpha) \to 0$ as $\alpha \to -\infty$;
- $F_X(\alpha) \to 1$ as $\alpha \to \infty$;
- $F_X$ is [right-continuous](https://en.wikipedia.org/wiki/Continuous_function#Directional_and_semi-continuity).

Conversely, any function $F$ that satisfies the above properties gives rise to a random variable $X$ such that $F = F_X$ (**§1.2.9**).

**1.2.4. Simple functions.** Another reason for the measurability restriction is that measurable functions, however complicated they might be, can be approximated by a class of functions known as **simple functions**. A simple function on $(\Omega, \mathcal{F}, \mathbb{P})$ is a [linear combination](https://en.wikipedia.org/wiki/Linear_combination) of [indicator functions](https://en.wikipedia.org/wiki/Indicator_function), i.e.,

$$s(\omega) = \sum_{i=1}^k a_i \boldsymbol{1}_{E_i}(\omega),$$

where $\boldsymbol{1}_{E_i}(\omega)$ is 1 if $\omega \in E_i$ and 0 elsewhere. We assume that $E_1,\ldots,E_k$ are disjoint events, i.e., elements of the $\sigma$-algebra $\mathcal{F}$. 

It is not hard to work out the probability distribution and the weighted average of a simple function. Indeed, given a simple function $s(\omega) = \sum_{i=1}^k a_i \boldsymbol{1}_{E_i}(\omega)$, we see that the probability distribution of $s$ is

$$F_s(\alpha) = \mathbb{P}[s \leq \alpha] = \sum_{\substack{1 \leq i \leq k \\\ a_i \leq \alpha}} \mathbb{P}(E_i).$$

The weighted average, or the **expectation**, of $s$ is simply the average of the values $a_1,\ldots,a_k$ weighted by the probabilities of the events $E_1,\ldots,E_k$, i.e.,

$$\mathbb{E}[s] = \sum_{i=1}^k a_i \mathbb{P}[E_i].$$
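Both formulas reduce to finite sums, as the following sketch shows. It assumes a hypothetical simple function stored as pairs $(a_i, \mathbb{P}[E_i])$, with the events $E_i$ disjoint and covering the whole sample space so that the probabilities sum to one.

```python
from fractions import Fraction

# Hypothetical simple function s = sum_i a_i 1_{E_i}, stored as (a_i, P[E_i]) pairs.
terms = [
    (Fraction(-1), Fraction(1, 4)),
    (Fraction(0), Fraction(1, 4)),
    (Fraction(2), Fraction(1, 2)),
]

def cdf(terms, alpha):
    """F_s(alpha): total probability of the events whose value a_i is at most alpha."""
    return sum((p for a, p in terms if a <= alpha), Fraction(0))

def expectation(terms):
    """E[s]: the values a_i weighted by the probabilities P[E_i]."""
    return sum((a * p for a, p in terms), Fraction(0))
```

For this choice, `cdf(terms, 0)` is `Fraction(1, 2)` and `expectation(terms)` is `Fraction(3, 4)`.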

**1.2.5. Construction of the expectation.** Whenever a random variable $X$ on $\Omega$ is nonnegative, there exists a sequence $(s_n)_{n=1}^\infty$ of simple functions such that $0 \leq s_1 \leq s_2 \leq \ldots \leq X$ and that

$$\lim_{n \to \infty} s_n(\omega) = X(\omega)$$

for all $\omega \in \Omega$. With this approximation, we can generalize the notion of expectation to a larger class of random variables. Indeed, we define

$$\mathbb{E}[X] = \lim_{n \to \infty} \mathbb{E}[s_n],$$

for any nonnegative random variable $X$. While $X$ could have more than one simple function approximation, the [machinery of measure theory](https://terrytao.wordpress.com/2010/09/25/245a-notes-3-integration-on-abstract-measure-spaces-and-the-convergence-theorems/) guarantees that the limit is the same in each case.
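To see the limit in action, the sketch below uses one standard choice of approximating sequence, the dyadic truncations $s_n = \lfloor 2^n X \rfloor / 2^n$, for the hypothetical random variable $X(\omega) = \omega^2$ on $([0,1],\mathscr{B}_{[0,1]},\mathscr{L}_{[0,1]})$. The level sets of $s_n$ are intervals, so $\mathbb{E}[s_n]$ is a finite sum of values times lengths. The single point $\omega = 1$ is ignored, as it has measure zero.

```python
import math

def expectation_of_dyadic_approx(n):
    """E[s_n] for s_n = floor(2**n * X) / 2**n, where X(w) = w**2 on [0, 1]."""
    total = 0.0
    for k in range(2 ** n):
        lo, hi = k / 2 ** n, (k + 1) / 2 ** n
        # {s_n = lo} = {lo <= w**2 < hi} = [sqrt(lo), sqrt(hi)), of length sqrt(hi) - sqrt(lo).
        total += lo * (math.sqrt(hi) - math.sqrt(lo))
    return total

approximations = [expectation_of_dyadic_approx(n) for n in range(1, 11)]
```

The values increase with $n$ and approach $1/3$, which is the value of $\int_0^1 \omega^2 \, d\omega$. Since $0 \leq X - s_n \leq 2^{-n}$, the error at step $n$ is at most $2^{-n}$.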

Now, every real-valued random variable $X$ can be written as the difference $X^+ - X^-$, where $X^+ = \max(X,0)$ and $X^- = \max(-X,0)$. Whenever at least one of $\mathbb{E}[X^+]$ and $\mathbb{E}[X^-]$ is finite, we define the **expectation,** or the **mean**, of $X$ to be the difference

$$\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-].$$

By construction, $\mathbb{E}$ is linear. In other words, if $X$ and $Y$ are random variables with finite expectations and $a$ and $b$ are real numbers, then

$$\mathbb{E}[aX+bY] = a\mathbb{E}[X]+ b\mathbb{E}[Y].$$

We remark that we often drop the brackets and write $\mathbb{E}X$ to denote the expectation of $X$.

The expectation constructed this way satisfies the limit theorem

$$\lim_{n \to \infty} \mathbb{E}[X_n] = \mathbb{E}\left[ \lim_{n \to \infty} X_n \right]$$

whenever $0 \leq X_1 \leq X_2 \leq \cdots \leq X_n \leq \cdots $, or whenever $X_n$ converges almost surely and there exists a random variable $Y$ such that $\mathbb{E}[\vert Y \vert] < \infty$ and $\vert X_n \vert \leq Y$ for all $n$. The former is called the **monotone convergence theorem**; the latter is called the **dominated convergence theorem**.

**1.2.6. $L_p$ spaces.** For each $1 \leq p < \infty$, we define $L_p(\Omega,\mathcal{F},\mathbb{P})$ to be the collection of all random variables $X$ such that

$$\|X\|_p = \mathbb{E}[\vert X \vert^p]^{1/p} < \infty.$$

Identifying any pair of random variables $X$ and $Y$ satisfying $\|X - Y\|_p = 0$, we see that the vector space $L_p(\Omega,\mathcal{F},\mathbb{P})$ is a Banach space under the $L_p$-norm $\|\cdot\|_p$. If $p =2 $, then $L_p(\Omega,\mathcal{F},\mathbb{P})$ is a Hilbert space with inner product

$$\langle X, Y \rangle_{L_2} = \mathbb{E}[XY].$$

We now define the **essential supremum** of the absolute value $\vert X \vert$ of a random variable $X$ to be the quantity

$$\inf \{t \in \mathbb{R} : \mathbb{P}[\vert X \vert > t] = 0\}$$

and denote it by $\|X\|_\infty$. The collection of all random variables with finite essential supremum is denoted by $L_\infty(\Omega,\mathcal{F},\mathbb{P})$. Identifying any pair of random variables $X$ and $Y$ satisfying $\|X - Y\|_\infty = 0$, we see that $L_\infty(\Omega,\mathcal{F},\mathbb{P})$ is a Banach space under the $L_\infty$-norm $\|\cdot\|_\infty$.

The $L_p$ norms are related by **Hölder's inequality**

$$\|XY\|_1 \leq \|X\|_p \|Y\|_{p'},$$

where $\frac{1}{p} + \frac{1}{p'} = 1$. Here we consider $1/\infty=0$. The $p=2$ case can be written as

$$\left\vert \langle X, Y \rangle_{L_2} \right\vert  \leq \|X\|_2 \|Y\|_2,$$

known as the **Cauchy&ndash;Schwarz inequality**.

Whenever $1 \leq p \leq q < \infty$, the $L_p$ norms are ordered: $\|X\|_p \leq \|X\|_q$. To see this, suppose that $X \in L_q(\Omega,\mathcal{F},\mathbb{P})$ and apply Hölder's inequality to $\vert X \vert^p$ and the constant function $1$ with the exponent $q/p \geq 1$:

$$\|X\|_p = \|\vert X \vert^p\|_1^{1/p} \leq \|\vert X \vert^p\|_{q/p}^{1/p}\|1\|_{(q/p)'}^{1/p},$$

where $1/(q/p) + 1/(q/p)' = 1$. Since the total measure of $\Omega$ is 1, we see that $\|1\|_{(q/p)'} = 1$, and so

$$\|X\|_p \leq \|\vert X \vert^p\|_{q/p}^{1/p} = \left(\mathbb{E}\left[\left(\vert X \vert ^p\right)^{q/p}\right]^{p/q}\right)^{1/p} = \mathbb{E}[ \vert X \vert^q ]^{1/q} = \|X\|_q.$$

In particular, $L_q(\Omega,\mathcal{F},\mathbb{P}) \subseteq L_p(\Omega,\mathcal{F},\mathbb{P})$.

Similarly,

$$\|X\|_p = \mathbb{E}[\vert X \vert^p]^{1/p} \leq \mathbb{E}[ \|X\|_\infty^p ] ^{1/p} = \|X\|_\infty \mathbb{E}[1]^{1/p} = \|X\|_\infty$$

for all $p \geq 1$.
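The ordering of the norms can be observed numerically on a finite probability space. The sketch below assumes a hypothetical random variable taking four values with the stated probabilities; `lp_norm` and `sup_norm` are illustrative helpers.

```python
def lp_norm(values, probs, p):
    """||X||_p = E[|X|^p]^(1/p) on a finite probability space."""
    return sum(abs(x) ** p * w for x, w in zip(values, probs)) ** (1.0 / p)

def sup_norm(values, probs):
    """Essential supremum of |X|: the largest |x| carried by positive probability."""
    return max(abs(x) for x, w in zip(values, probs) if w > 0)

# Hypothetical values X(omega_i) and probabilities P[{omega_i}].
values = [-3.0, 0.5, 1.0, 2.0]
probs = [0.1, 0.4, 0.3, 0.2]

norms = [lp_norm(values, probs, p) for p in (1, 2, 4, 8)]
```

The list `norms` is nondecreasing and bounded above by `sup_norm(values, probs)`, which is `3.0` here. Note that this ordering relies on the total measure being 1; it fails for general measures.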

**1.2.7. Computation of the expectation.** On $([0,1],\mathscr{B}_{[0,1]},\mathscr{L}_{[0,1]})$, this construction of the expectation agrees with the standard integral on $[0,1]$, a fact we often use to compute the expectations of random variables.

On a discrete probability space, the **probability mass function**

$$p_X(a) = \mathbb{P}[X = a]$$

is also of interest. Note that, in this case, there are at most countably many values in the <a href="https://en.wikipedia.org/wiki/Range_(mathematics)">range</a> $\operatorname{im} X$ of $X$, and so we can write the expectation of $X$ as a sum:

$$\mathbb{E}X = \sum_{a \in \operatorname{im} X} a p_X(a).$$

Observe, in this case, that 
$\mathbb{E}X^k = \sum_{a \in \operatorname{im} X} a^k p_X(a)$
whenever $k \in \mathbb{N}$. To see this, we note that $\{X = a\}$ and $\{X = b\}$ are disjoint whenever $a \neq b$, so that $\boldsymbol{1}_{\{X = a\}} \boldsymbol{1}_{\{X = b\}} = 0$. Therefore, all cross terms vanish in the expansion

$$X^k = \left(\sum_{a \in \operatorname{im} X} a \boldsymbol{1}_{\{X = a\}} \right)^k = \sum_{a \in \operatorname{im} X} a^k \boldsymbol{1}_{\{X = a\}},$$

and taking the expectation of both sides yields the formula.
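As a quick check of the moment formula $\mathbb{E}X^k = \sum_a a^k p_X(a)$, the sketch below assumes that $X$ is the outcome of a fair six-sided die, an illustrative choice that does not appear in the notes.

```python
from fractions import Fraction

# Probability mass function of a fair die: p_X(a) = 1/6 for a = 1, ..., 6.
pmf = {a: Fraction(1, 6) for a in range(1, 7)}

def moment(pmf, k):
    """E[X^k] computed as the sum of a^k * p_X(a) over the range of X."""
    return sum((a ** k * p for a, p in pmf.items()), Fraction(0))
```

Here `moment(pmf, 1)` is `Fraction(7, 2)` and `moment(pmf, 2)` is `Fraction(91, 6)`.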

Another useful tool for computing expectations is the **change-of-variables formula**

$$\mathbb{E}g(X) = \int_{-\infty}^\infty g(\alpha) \, dF_X(\alpha),$$  

which holds whenever $g:\mathbb{R} \to \mathbb{R}$ is a function that makes $g(X)$ measurable and satisfies the integrability condition $\mathbb{E} \vert g(X) \vert < \infty$. The integral is to be understood as a **Lebesgue&ndash;Stieltjes integral** (**§1.2.9**).

**1.2.8. Borel measurability.** What functions $g$ would preserve the measurability of a random variable $X$ upon composition? Recall that $X$ is a (measurable) random variable on $(\Omega,\mathcal{F},\mathbb{P})$ if $X^{-1}((-\infty,\alpha])$ is in $\mathcal{F}$ for each real number $\alpha$. Since 

$$\begin{align*} X^{-1} \left(\bigcup_{n=1}^\infty E_n\right) &= \bigcup_{n=1}^\infty X^{-1}(E_n) \\ X^{-1} \left( \bigcap_{n=1}^\infty E_n \right) &= \bigcap_{n=1}^\infty X^{-1}(E_n) \\ X^{-1}(\mathbb{R} \smallsetminus E_1) &= \Omega \smallsetminus X^{-1}(E_1)\end{align*}$$

for any collection of subsets $\{E_n\}_{n=1}^\infty$ of $\mathbb{R}$, it follows that $X$ is measurable if and only if $X^{-1}(E) \in \mathcal{F}$ for all $E \in \sigma(\mathcal{I})$, the $\sigma$-algebra generated by the collection $\mathcal{I}$ of all intervals of the form $(-\infty, \alpha]$. Since $(-\infty, b] \smallsetminus (-\infty, a] = (a,b]$ whenever $a \leq b$, the $\sigma$-algebra $\sigma(\mathcal{I})$ contains all half-open intervals of the form $(a,b]$. Now, whenever $a < b$,

$$(a,b) = \bigcup_{n=1}^\infty \left(a , b - \frac{b - a}{3^n} \right],$$

and so $\sigma(\mathcal{I})$ contains all open intervals. It follows that $\sigma(\mathcal{I}) = \mathscr{B}_\mathbb{R}$, the Borel $\sigma$-algebra on $\mathbb{R}$ (**§1.1.7**).

We now observe that, given a random variable $X$, the composite function $g(X)$ remains measurable if $g$ preserves the Borel-ness of sets. In other words, if $g^{-1}(E) \in \mathscr{B}_{\mathbb{R}}$ whenever $E \in \mathscr{B}_{\mathbb{R}}$, then we have $(g(X))^{-1}(E) = X^{-1}(g^{-1}(E)) \in \mathcal{F}$ whenever $E \in \mathscr{B}_{\mathbb{R}}$, thereby preserving the measurability of $X$. Such a function $g:\mathbb{R} \to \mathbb{R}$ is called **Borel measurable on $\mathbb{R}$**. We remark that all [continuous functions](https://en.wikipedia.org/wiki/Continuous_function) are Borel measurable.

**1.2.9. Lebesgue&ndash;Stieltjes integration.** We recall that the distribution function

$$F_X(\alpha) = \mathbb{P}[X \leq \alpha]$$

of a random variable $X$ satisfies the following properties:

- $0 \leq F_X \leq 1$;
- $F_X$ is increasing;
- $F_X(\alpha) \to 0$ as $\alpha \to -\infty$;
- $F_X(\alpha) \to 1$ as $\alpha \to \infty$;
- $F_X$ is [right-continuous](https://en.wikipedia.org/wiki/Continuous_function#Directional_and_semi-continuity).

We now suppose that $F:\mathbb{R} \to [0,1]$ satisfies the above properties. The set function $dF$ given by the formula

$$dF((-\infty,\alpha]) = F(\alpha)$$

can be extended (**§1.1.6**)  to a probability measure on $(\mathbb{R},\mathscr{B}_\mathbb{R})$. The measure $dF$ is called the **Lebesgue&ndash;Stieltjes measure associated with $F$**.

Now, the function $X:[0,1] \to \mathbb{R}$ defined by the formula

$$X(\omega) = \sup\{t : F(t) \leq \omega\}$$

is a random variable on $([0,1],\mathscr{B}_{[0,1]},\mathscr{L}_{[0,1]})$ such that

$$F_X = F.$$

It follows that there is a one-to-one correspondence between probability distributions and Lebesgue&ndash;Stieltjes measures.
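The construction of $X$ from $F$ can be tried out on a concrete distribution. The sketch below assumes $F(t) = 1 - e^{-t}$ for $t \geq 0$ (the exponential distribution with rate 1, chosen only for illustration), for which the formula for $X$ reduces to $X(\omega) = -\log(1-\omega)$ because $F$ is continuous and strictly increasing on $[0,\infty)$. Instead of random draws, an evenly spaced grid stands in for the Lebesgue measure on $[0,1]$, which keeps the check deterministic.

```python
import math

def cdf(t):
    """F(t) = 1 - exp(-t) for t >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-t) if t >= 0 else 0.0

def quantile(omega):
    """X(omega) for omega in (0, 1): the inverse of F on [0, infinity)."""
    return -math.log(1.0 - omega)

N = 10_000
grid = [(i + 0.5) / N for i in range(N)]   # evenly spaced points of [0, 1]
samples = [quantile(w) for w in grid]

def empirical_cdf(alpha):
    """Fraction of grid points omega with X(omega) <= alpha."""
    return sum(1 for x in samples if x <= alpha) / N
```

For each fixed $\alpha$, `empirical_cdf(alpha)` agrees with `cdf(alpha)` up to an error of order $1/N$, which is the discrete counterpart of $F_X = F$.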


As an example, we consider the [Heaviside function](https://en.wikipedia.org/wiki/Heaviside_step_function)

$$H(x) = \begin{cases} 0 & \mbox{ if } x < 0; \\ 1 & \mbox{ if } x \geq 0.\end{cases}$$

The corresponding probability measure is $dH = \delta_0$, where $\delta_{a}$ denotes the [Dirac delta measure](https://en.wikipedia.org/wiki/Dirac_delta_function#As_a_measure)

$$\delta_{a}(E) = \begin{cases} 1 & \mbox{ if } a \in E \\ 0 & \mbox{ otherwise}\end{cases}$$

on $\mathbb{R}$. More generally, if $X = \sum_{i=1}^n a_i \boldsymbol{1}_{E_i}$ is a simple random variable (**§1.2.4**) taking distinct values $a_1,\ldots,a_n$ on events $E_1,\ldots,E_n$ that partition $\Omega$, then $F_X(x) = \sum_{i=1}^n \mathbb{P}[E_i] H(x - a_i)$, and the corresponding probability measure is

$$dF_X = \sum_{i=1}^n \mathbb{P}[E_i] \delta_{a_i}.$$

Therefore, the expectation of a simple function $s$ on $(\mathbb{R},\mathscr{B}_{\mathbb{R}},dF_X)$ is

$$\sum_{i=1}^n s(a_i) \mathbb{P}[E_i],$$

whence it follows from the construction of the expectation (**§1.2.5**) that

$$\mathbb{E}g(X) = \int_{-\infty}^\infty g(\alpha) \, dF_X(\alpha) = \sum_{i=1}^n g(a_i) \mathbb{P}[E_i].$$

**1.2.10. Probability density functions.** We now suppose that $F_X$ is differentiable, and that $f_X$ is the derivative of $F_X$. In this case, a standard result in the theory of Lebesgue&ndash;Stieltjes integration is that

$$\mathbb{E}g(X) = \int_{-\infty}^\infty g(\alpha) dF_X(\alpha) = \int_{-\infty}^\infty g(\alpha) f_X(\alpha) \, d\alpha.$$

In particular, if $g(x) = x$, then

$$\mathbb{E}X = \int_{-\infty}^\infty \alpha f_X(\alpha) \, d\alpha.$$

$f_X$ is referred to as the **probability density function** of $X$.

Probability density functions are useful for computational purposes, as they convert the task of computing the expectation into integration on the real line. Moreover, some probability distributions that do not admit neat characterizations have closed-form expressions for their density functions. For example, the **standard normal distribution** (**§1.3.6**) has probability density function

$$f_X(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2},$$

but its distribution function

$$F_X(\alpha) = \int_{-\infty}^\alpha \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$$

does not have a closed-form expression.
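In practice, the distribution function of the standard normal is evaluated numerically. The sketch below compares a trapezoidal approximation of the integral of the density with the expression $\frac{1}{2}\left(1 + \operatorname{erf}(\alpha/\sqrt{2})\right)$ in terms of the error function, a special function available as `math.erf` in Python. The cutoff `lower=-10.0` and the step count are arbitrary choices made for illustration.

```python
import math

def normal_pdf(t):
    """Density of the standard normal distribution."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def normal_cdf_numeric(alpha, lower=-10.0, steps=20_000):
    """Trapezoidal approximation of the integral of the density over [lower, alpha]."""
    h = (alpha - lower) / steps
    total = 0.5 * (normal_pdf(lower) + normal_pdf(alpha))
    total += sum(normal_pdf(lower + i * h) for i in range(1, steps))
    return total * h

def normal_cdf_erf(alpha):
    """The same quantity expressed through the error function."""
    return 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
```

The two functions agree to several decimal places; for example, both give a value close to $0.975$ at $\alpha = 1.96$.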

**1.2.11. Statistics of a probability distribution.** We now turn to a few **statistics** of a probability distribution, i.e., numbers that summarize information about the distribution.

We already know about the mean $\mathbb{E}X$ of $X$. Another useful *middle value* is the **median** $F_X^{-1}(1/2)$, where the **quantile function** $F_X^{-1}$ is given by the formula

$$F_X^{-1}(q) = \inf \{x : F_X(x) > q \}.$$

$F_X^{-1}(1/4)$ and $F_X^{-1}(3/4)$ are called the **first quartile** and the **third quartile**, respectively.

Often, we are interested not only in the mean of $X$, but also the **variance**

$$\operatorname{var}[X] = \mathbb{E}[(X-\mathbb{E}X)^2]$$

of $X$, which measures the *spread* of a distribution. While it might make sense, at a first glance, to use $\mathbb{E}[X - \mathbb{E}X]$ to measure the spread, the linearity of $\mathbb{E}$ implies that

$$\begin{align*} \mathbb{E}[X-\mathbb{E}X] &= \mathbb{E}[X - \mathbb{E}X\boldsymbol{1}_{\Omega}] \\\ &= \mathbb{E}X - \mathbb{E}[\mathbb{E}X\boldsymbol{1}_{\Omega}] \\\ &= \mathbb{E}X - \mathbb{E}X \, \mathbb{E}[\boldsymbol{1}_{\Omega}] \\\ &= \mathbb{E}X - \mathbb{E}X = 0.\end{align*}$$

Similar calculations yield $\operatorname{var}[X] = \mathbb{E}X^2 - (\mathbb{E}X)^2$.

The **standard deviation** of $X$ is

$$\operatorname{std}[X] = \sqrt{\operatorname{var}[X]}$$

and is often denoted by $\sigma$. The standard deviation is useful for modeling purposes because it has the same <a href="https://en.wikipedia.org/wiki/Base_unit_(measurement)">unit</a> as the random variable $X$.

In general, the **$n$th moment** of $X$ is the expectation

$$\mu_n = \mu_n(X) = \mathbb{E}[X^n].$$

The first moment, the mean, is typically denoted by $\mu$.

A neat way to compute the moments of $X$ is to consider its [moment-generating function](https://en.wikipedia.org/wiki/Moment-generating_function)

$$M_X(t) = \mathbb{E}[e^{tX}].$$

Indeed, we observe that

$$\mathbb{E}[e^{tX}] = \mathbb{E} \left[ \sum_{n=0}^\infty \frac{(tX)^n}{n!}\right] = \sum_{n=0}^\infty \frac{t^n \mu_n}{n!}$$

by the [uniform convergence](https://en.wikipedia.org/wiki/Uniform_convergence) of Taylor series. It then follows that 

$$ \frac{d^kM_X(t)}{dt^k}= \sum_{n=k}^\infty \frac{t^{n-k} \mu_n}{(n-k)!},$$

whence

$$\frac{d^kM_X(t)}{dt^k}\bigg|_{t = 0} = \mu_k.$$
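The relation between the derivatives of $M_X$ at $0$ and the moments can be checked numerically. The sketch below again assumes a fair six-sided die (an illustrative choice) and approximates $M_X'(0)$ and $M_X''(0)$ by central finite differences with a hypothetical step size `h`; it also evaluates the identity $\operatorname{var}[X] = \mathbb{E}X^2 - (\mathbb{E}X)^2$.

```python
import math

# Fair die: p_X(a) = 1/6 for a = 1, ..., 6.
pmf = {a: 1.0 / 6.0 for a in range(1, 7)}

def mgf(t):
    """M_X(t) = E[exp(tX)] as a finite sum over the range of X."""
    return sum(p * math.exp(t * a) for a, p in pmf.items())

def moment(k):
    """mu_k = E[X^k] computed directly from the probability mass function."""
    return sum(p * a ** k for a, p in pmf.items())

h = 1e-3
first_derivative = (mgf(h) - mgf(-h)) / (2 * h)                   # approximates M'(0)
second_derivative = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2    # approximates M''(0)
variance = moment(2) - moment(1) ** 2
```

Up to the discretization error of the finite differences, `first_derivative` matches $\mu_1 = 7/2$ and `second_derivative` matches $\mu_2 = 91/6$, while `variance` equals $35/12$.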

**1.2.12. Covariance.** Recall that the **variance** of a random variable $X$ is

$$\mathbb{E}[(X-\mathbb{E}X)^2] = \mathbb{E}[(X-\mathbb{E}X)(X -\mathbb{E}X)].$$

The **covariance** between two random variables $X$ and $Y$ is defined analogously:

$$\operatorname{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)].$$

Note that the covariance is zero precisely when $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$. This is often obtained as a consequence of independence (**§1.4.2**).

Variance and covariance are related in the following manner:

$$\begin{align*}
\operatorname{var}[X+Y] &= \operatorname{var}[X] + \operatorname{var}[Y] + 2 \operatorname{Cov}(X,Y); \\
\operatorname{var}[X-Y] &= \operatorname{var}[X] + \operatorname{var}[Y] - 2 \operatorname{Cov}(X,Y).
\end{align*}$$

Normalizing the covariance, we obtain the **correlation coefficient**

$$\rho = \rho_{X,Y} = \rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{var}[X] \operatorname{var}[Y]}}.$$

By the Cauchy&ndash;Schwarz inequality (**§1.2.6**),

$$\vert \operatorname{Cov}(X,Y) \vert \leq \sqrt{\operatorname{var}[X]\operatorname{var}[Y]},$$

and so

$$-1 \leq \rho(X,Y) \leq 1.$$

Equality in the Cauchy&ndash;Schwarz inequality is achieved if the two variables are linearly dependent, and so $\vert \rho \vert = 1$ if

$$Y = aX + b$$

for some constants $a \neq 0$ and $b$. Note, in addition, that $a > 0$ implies $\rho = 1$, and that $a < 0$ implies $\rho = -1$.
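The behavior of the correlation coefficient under affine transformations can be verified on a finite probability space. The sketch below assumes hypothetical values and probabilities; `expectation`, `covariance`, and `correlation` are illustrative helpers that implement the definitions above.

```python
import math

def expectation(values, probs):
    """E[X] on a finite probability space."""
    return sum(v * p for v, p in zip(values, probs))

def covariance(xs, ys, probs):
    """Cov(X, Y) = E[(X - EX)(Y - EY)]."""
    mx, my = expectation(xs, probs), expectation(ys, probs)
    return expectation([(x - mx) * (y - my) for x, y in zip(xs, ys)], probs)

def correlation(xs, ys, probs):
    """rho(X, Y) = Cov(X, Y) / sqrt(var[X] var[Y])."""
    return covariance(xs, ys, probs) / math.sqrt(
        covariance(xs, xs, probs) * covariance(ys, ys, probs)
    )

# Hypothetical probabilities and values of X.
probs = [0.2, 0.3, 0.1, 0.4]
xs = [1.0, 2.0, 4.0, 7.0]
ys_up = [3.0 * x + 1.0 for x in xs]      # Y = aX + b with a > 0
ys_down = [-2.0 * x + 5.0 for x in xs]   # Y = aX + b with a < 0
```

Up to rounding, `correlation(xs, ys_up, probs)` is $1$ and `correlation(xs, ys_down, probs)` is $-1$.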

**1.2.13. Random variables on product spaces.** Let $(\Omega_1,\mathcal{F}_1,\mathbb{P}_1)$ and $(\Omega_2,\mathcal{F}_2,\mathbb{P}_2)$ be probability spaces. We recall that the **product probability space** $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2, \mathbb{P}_1 \otimes \mathbb{P}_2)$ consists of the Cartesian product $\Omega_1 \times \Omega_2$ of sample spaces, the $\sigma$-algebra $\mathcal{F}_1 \otimes \mathcal{F}_2$ generated by rectangles $E_1 \times E_2$ with $E_1 \in \mathcal{F}_1$ and $E_2 \in \mathcal{F}_2$, and the multiplicatively defined product probability measure $\mathbb{P}_1 \otimes \mathbb{P}_2$ (**§1.1.6**).

Given fixed $\omega_1 \in \Omega_1$ and $\omega_2 \in \Omega_2$, the **canonical injection mappings** $\iota^1_{\omega_2}:\Omega_1 \to \Omega_1 \times \Omega_2$ and $\iota^2_{\omega_1}:\Omega_2 \to \Omega_1 \times \Omega_2$ defined by the formulas

$$\iota^1_{\omega_2}(\omega) = (\omega, \omega_2) \hspace{1em}\mbox{and}\hspace{1em} \iota^2_{\omega_1}(\omega) = (\omega_1, \omega)$$

are $(\mathcal{F}_i, \mathcal{F}_1 \otimes \mathcal{F}_2)$-measurable for $i = 1, 2$, respectively. This implies that the mappings 

$$X_1(\omega) = X(\omega, \omega_2) \hspace{1em}\mbox{and}\hspace{1em} X_2(\omega) = X(\omega_1, \omega)$$

are random variables whenever $X:\Omega_1 \times \Omega_2 \to \mathbb{R}$ is $\mathcal{F}_1 \otimes \mathcal{F}_2$-measurable. Indeed,

$$X_1 = X \circ \iota^1_{\omega_2} \hspace{1em} \mbox{and} \hspace{1em} X_2 = X \circ \iota^2_{\omega_1}.$$

**Fubini's theorem** states that

$$\int_{\Omega_1 \times \Omega_2} X(\omega_1,\omega_2) \, d(\omega_1,\omega_2) = \int_{\Omega_1} \int_{\Omega_2} X(\omega_1,\omega_2) \, d\omega_2 \, d\omega_1 = \int_{\Omega_2} \int_{\Omega_1} X(\omega_1,\omega_2) \, d\omega_1 \, d\omega_2$$

whenever

$$\int_{\Omega_1 \times \Omega_2} \left\vert X(\omega_1,\omega_2) \right\vert \, d(\omega_1,\omega_2) < \infty.$$

**Tonelli's theorem** states that the above identity holds whenever $X$ is a nonnegative $(\mathcal{F}_1 \otimes \mathcal{F}_2)$-measurable random variable on $\Omega_1 \times \Omega_2$.

We often refer to both theorems simultaneously as the **Fubini&ndash;Tonelli theorem**. We note that the Fubini&ndash;Tonelli theorem continues to hold for products of measure spaces $(\Omega,\mathcal{F},\mu)$ that are **$\sigma$-finite**, i.e., for which there exists a sequence $(E_n)_{n=1}^\infty$ of sets in $\mathcal{F}$ of finite $\mu$-measure such that

$$\bigcup_{n=1}^\infty E_n = \Omega.$$

As a simple application, we consider the product measure space $(\Omega \times \mathbb{R}, \mathcal{F} \otimes \mathscr{B}_{\mathbb{R}}, \mathbb{P} \otimes \mathscr{L})$ constructed from a probability space $(\Omega,\mathcal{F}, \mathbb{P})$ and the Lebesgue measure space $(\mathbb{R},\mathscr{B}_{\mathbb{R}},\mathscr{L})$. We fix a nonnegative random variable $X:\Omega \to [0,\infty)$ and define

$$A_X = \{(\omega, t) : 0 \leq t \leq X(\omega)\}.$$

Observe that the indicator function $\boldsymbol{1}_{A_X}:\Omega \times \mathbb{R} \to \mathbb{R}$ satisfies the following identities for each $t \geq 0$ and each $\omega \in \Omega$:

$$\mathbb{E}[\boldsymbol{1}_{A_X} \circ \iota^1_t] = \mathbb{P}[X \geq t] \hspace{1em}\mbox{and}\hspace{1em} \mathbb{E}[\boldsymbol{1}_{A_X} \circ \iota^2_{\omega}] = \mathscr{L}([0,X(\omega)]) = X(\omega).$$

It follows from the Fubini&ndash;Tonelli theorem that

$$
\begin{align*}
\mathbb{E}[\boldsymbol{1}_{A_X}] &= \int_{\Omega} \int_{-\infty}^\infty \boldsymbol{1}_{A_X}(\omega, t) \, dt \, d\omega = \mathbb{E}[X] \\
\mathbb{E}[\boldsymbol{1}_{A_X}] &= \int_{-\infty}^\infty \int_{\Omega} \boldsymbol{1}_{A_X}(\omega, t) \, d\omega \, dt = \int_0^\infty \mathbb{P}[X \geq t] \, dt,
\end{align*}
$$

whence

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X \geq t] \, dt.$$
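
As a sanity check of this identity, the sketch below works out both sides exactly for a fair six-sided die. The die and the helper `tail` are assumptions made for illustration; they do not come from the lecture.

```python
from fractions import Fraction

# Assumed example: X is the face value of a fair six-sided die.
faces = range(1, 7)
pmf = {k: Fraction(1, 6) for k in faces}

expectation = sum(k * p for k, p in pmf.items())

def tail(t):
    """P[X >= t]."""
    return sum(p for k, p in pmf.items() if k >= t)

# P[X >= t] is constant on each interval (k - 1, k], so the integral over
# [0, infinity) reduces to the finite sum of P[X >= k] over k = 1, ..., 6.
tail_integral = sum(tail(k) for k in faces)

print(expectation, tail_integral)
```

Because $X$ is integer-valued here, the integral collapses to a finite sum, which is why exact rational arithmetic suffices.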

### 1.3. Probability Distributions

**1.3.1. Binomial and Bernoulli distributions.** Given the usual discrete probability space $(\{H,T\}^n, \mathcal{P}(\{H,T\}^n), \mathbb{P})$ modeling the tossing of a fair coin $n$ times (**§1.1.4**), we define the random variable $X$ on it by the formula

$$X(\omega_1,\ldots,\omega_n) = \sum_{i=1}^n \boldsymbol{1}_{H}(\omega_i),$$

where $\boldsymbol{1}_H(\omega_i)$ is 1 if $\omega_i = H$ and 0 otherwise. In this case, the probability mass function $p_X(k)$ equals the number of possible coin-tossing outcomes with exactly $k$ heads, divided by $2^n$. It is not hard to see that

$$2^{-n}(h+t)^n = \sum_{j=0}^n p_X(j)h^jt^{n-j}$$

where $p_X(a) = \mathbb{P}[X=a]$. It follows that

$$p_X(k) = \begin{pmatrix} n \\ k \end{pmatrix} 2^{-n} = \frac{n!}{(n-k)!k!} 2^{-n}$$

whenever $k \in \{0,1,\ldots,n-1,n\}$.

More generally, if the probability of heads is $\theta$, then the above polynomial formulation yields

$$p_X(k) = \begin{pmatrix} n \\ k \end{pmatrix} \theta^k (1-\theta)^{n-k}$$

whenever $k \in \{0,1,\ldots,n-1,n\}.$ In this case, we say that $X$ has a **binomial distribution**, and write $\operatorname{Bin}(k \mid n, \theta)$ to denote $p_X(k)$. If $n= 1$, then we say that $X$ has a **Bernoulli distribution** and write $\operatorname{Ber}(x \mid \theta)$ to denote $p_X(x)$.

We remark that an experiment consisting of repeatedly performing *independent* tasks with only two outcomes is called a **Bernoulli trial**. Our coin-tossing example is the archetypal example of a Bernoulli trial.  Many other scenarios can be modeled as Bernoulli trials, so long as *success* and *failure* can be clearly defined.

To compute the moments of the binomial distribution, we consider its moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$ (**§1.2.10**). By the change-of-variables formula (**§1.2.7**), we have the identity

$$\begin{align*} \mathbb{E}[e^{tX}] &= \sum_{k=0}^n e^{tk} \operatorname{Bin}(k \mid n, \theta) \\ &= \sum_{k=0}^n e^{tk} \begin{pmatrix} n \\ k \end{pmatrix} \theta^k (1-\theta)^{n-k} \\ &= \sum_{k=0}^n \begin{pmatrix} n \\ k \end{pmatrix}  (\theta e^t)^k (1-\theta)^{n-k} \\ &= (\theta e^t + (1-\theta))^n. \end{align*}$$

Since

$$M_X'(t) = n(\theta e^t + (1-\theta))^{n-1}(\theta e^t),$$

we have $\mu = M_X'(0) = n\theta.$ Moreover,

$$\begin{align*}M_X''(t) =& n(n-1)(\theta e^t + (1-\theta))^{n-2}(\theta e^t)^2 \\ &+ n(\theta e^t + (1-\theta))^{n-1}(\theta e^t),\end{align*}$$

and so

$$\mu_2 = M_X''(0) = n(n-1) \theta^2 + n\theta = n^2 \theta^2 + n\theta(1- \theta).$$

It follows that

$$\operatorname{var}(X) = n^2 \theta^2 + n\theta(1-\theta) - n^2\theta^2 = n\theta(1-\theta).$$
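
The mean and variance formulas can be confirmed directly from the probability mass function. The sketch below uses exact rational arithmetic; the parameters $n = 12$ and $\theta = 1/3$ and the helper `binomial_pmf` are illustrative choices of ours.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, theta):
    """Bin(k | n, theta) as an exact fraction."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n, theta = 12, Fraction(1, 3)  # illustrative parameters
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)                                      # should equal 1
mean = sum(k * p for k, p in enumerate(pmf))          # first moment
second_moment = sum(k * k * p for k, p in enumerate(pmf))
variance = second_moment - mean**2

print(total, mean, variance)
print(mean == n * theta, variance == n * theta * (1 - theta))
```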

**1.3.2. Poisson approximation of the binomial distribution.** Suppose that we have performed a large number of Bernoulli trials with small $\theta$, so that the mean $\mu = n\theta$ of the binomial distribution is of "moderate magnitude". In such cases, we can derive a convenient approximation to the distribution, due to Poisson.

Recall that $\operatorname{Bin}(0 \mid n, \theta) = (1-\theta)^n = (1-\frac{\mu}{n})^n$.  Since

$$\log (1 + x) = \sum_{j=1}^\infty \frac{(-1)^{j+1}}{j} x^j = x + O(x^2),$$

we see that

$$\begin{align*} \log \left(\operatorname{Bin}(0 \mid n, \theta)\right) &= n \log \left( 1 - \frac{\mu}{n}\right) \\ &= n \sum_{j=1}^\infty - \frac{(\mu/n)^j}{j} \\ &= n\left( \frac{-\mu}{n} + O(n^{-2}) \right) \\ &= -\mu + O(n^{-1}). \end{align*}$$

Therefore, $\operatorname{Bin}(0 \mid n, \theta) \approx e^{-\mu}$.

Now,

$$\frac{\operatorname{Bin}(k \mid n, \theta)}{\operatorname{Bin}(k-1 \mid n, \theta)} = \frac{(n-k+1)\theta}{k(1-\theta)} \approx \frac{n \theta}{k} = \frac{\mu}{k},$$

as we have assumed that $n$ is large and $\theta$ is small. Therefore,

$$\operatorname{Bin}(k \mid n, \theta) \approx \frac{\mu}{k} \operatorname{Bin}(k-1 \mid n, \theta) \approx \cdots \approx \frac{\mu^k}{k!} e^{-\mu}.$$

What scenarios might have large $n$ and small $\theta$? We could consider, for example, the probability of $k$ people having their birthdays on New Year's Day ($\theta = 1/365$), given that the group of people is large. We could also consider the probability of $k$ factory-produced items being defective, given a low fraction defective. In other words, the Poisson approximation is appropriate for modeling *rare events*.
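
To get a feel for the quality of the approximation, the sketch below compares the two probability mass functions term by term. The parameters $n = 1000$ and $\theta = 0.003$ are assumed purely for illustration, as are the helper names.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, theta):
    """Bin(k | n, theta) in floating point."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)

def poisson_pmf(k, mu):
    """The Poisson approximation exp(-mu) * mu^k / k!."""
    return exp(-mu) * mu**k / factorial(k)

n, theta = 1000, 0.003   # illustrative: many trials, rare success
mu = n * theta

max_gap = 0.0
for k in range(11):
    b, p = binomial_pmf(k, n, theta), poisson_pmf(k, mu)
    max_gap = max(max_gap, abs(b - p))
    print(f"k={k:2d}  Bin={b:.5f}  Poi={p:.5f}")

print("largest pointwise gap:", max_gap)
```

Increasing `theta` while keeping `n` fixed should visibly widen the gap, which matches the assumption that $\theta$ is small.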

**1.3.3. Poisson distribution.** In light of the Poisson approximation of the binomial distribution, we say that a natural-number-valued random variable $X$ has a **Poisson distribution** with parameter $\lambda > 0$ if its probability mass function equals

$$p_X(k) = \operatorname{Poi}(k \mid \lambda) = e^{-\lambda} \frac{\lambda^k}{k!}.$$

Let us consider an example from [Rutherford, Chadwick, and Ellis, *Radiations From Radioactive Substances*](https://books.google.com/books?id=hOx-ZDIztiQC&lpg=PA171&vq=172&pg=PA172#v=snippet&q=172&f=false). In a radioactive decay experiment, emissions of $\alpha$-particles were counted over 2608 time intervals of 7.5 seconds each. The total count of $\alpha$-particles observed was 10097, so that the average number of new particles appearing during a unit time interval (7.5 seconds) was about 3.87. Using the Poisson distribution with $\lambda = 3.87$, we expect the following:

| $k$ | $2608 \times \operatorname{Poi}(k \mid 3.87)$ |
| ----- | ---------------------------------------- |
| 0     | 54                                       |
| 1     | 210                                      |
| 2     | 407                                      |
| 3     | 525                                      |
| 4     | 508                                      |
| 5     | 394                                      |
| 6     | 254                                      |
| 7     | 140                                      |
| 8     | 68                                       |
| 9     | 29                                       |
| 10    | 11                                       |
| 11    | 4                                        |
| 12    | 1                                        |
| 13    | 1                                        |
| 14    | 1                                        |

Here $\operatorname{Poi}(k \mid 3.87)$ represents the expected probability that $k$ new particles are observed in a unit time interval. Therefore, $ 2608 \times \operatorname{Poi}(k \mid 3.87)$ yields the expected number of the unit time intervals in which $k$ new particles are observed.
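
The expected counts can be recomputed in a few lines. The sketch below takes the values 2608 and 3.87 from the text; the helper `poisson_pmf` is our own. It prints unrounded values, so agreement with the tabulated entries should only be expected up to rounding (within about one interval).

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poi(k | lam) = exp(-lam) * lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

intervals = 2608   # number of unit time intervals, from the text
lam = 3.87         # average count per interval, from the text

expected = [intervals * poisson_pmf(k, lam) for k in range(15)]
for k, count in enumerate(expected):
    print(f"k={k:2d}  expected intervals = {count:7.2f}")
```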

Compare the above with the actual numbers of occurrences:

| Number of particles observed in interval | Observed number of occurrences |
| ---------------------------------------- | ------------------------------ |
| 0                                        | 57                             |
| 1                                        | 203                            |
| 2                                        | 383                            |
| 3                                        | 525                            |
| 4                                        | 532                            |
| 5                                        | 408                            |
| 6                                        | 273                            |
| 7                                        | 139                            |
| 8                                        | 45                             |
| 9                                        | 27                             |
| 10                                       | 10                             |
| 11                                       | 4                              |
| 12                                       | 0                              |
| 13                                       | 1                              |
| 14                                       | 1                              |

Other scenarios that conform to the Poisson distribution include chromosome interchanges in cells exposed to X-ray irradiation, telephone connections to wrong numbers, the distribution of bacterial colonies in a Petri dish, and so on.

Using the change-of-variables formula, we can compute the moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$ of the Poisson distribution as follows: 

$$\begin{align*} M_X(t) &= \sum_{k=0}^\infty e^{tk} \operatorname{Poi}(k \mid \lambda) \\ &= e^{-\lambda} \sum_{k=0}^\infty  \frac{(\lambda e^t)^k}{k!} \\ &= e^{\lambda(e^t - 1)}. \end{align*}$$

Since $M_X'(t) = e^{\lambda(e^t - 1)}(\lambda e^t)$, we see that the mean is $\mu = M_X'(0) = \lambda$. Moreover,

$$M_X''(t) = e^{\lambda (e^t - 1)} (\lambda e^t)^2 + e^{\lambda (e^t - 1)} (\lambda e^t),$$

and so $\mu_2 = M_X''(0) = \lambda^2 + \lambda$. Therefore,

$$\operatorname{var}[X] = \mu_2 - \mu^2 = \lambda.$$

**1.3.4. Normal approximation to the binomial distribution.** We once again consider a scenario in which we perform a large number of Bernoulli trials&mdash;this time, with $\theta = 1/2$. We let $n = 2\nu$ and define

$$a_k = \operatorname{Bin}\left(\nu + k \mid 2\nu, \frac{1}{2} \right)$$

for each $-\nu \leq k \leq \nu$, so that the terms of the binomial distribution $\operatorname{Bin}(\cdot \mid n, \frac{1}{2} )$ are represented by the sequence

$$a_{-\nu}, a_{-\nu+1}, \cdots, a_{-1},a_0,a_1,\cdots,a_{\nu-1},a_{\nu}.$$

For $k \geq 0$, we have that

$$\begin{align*}
a_k
&= \operatorname{Bin} \left( \nu + k \mid 2\nu, \frac{1}{2} \right) \\
&= \frac{(2\nu)!}{(\nu - k)! (\nu + k)!} 2^{-2\nu} \\
&= \frac{\nu ! \nu !}{(\nu - k)! (\nu + k)!} \frac{(2\nu)!}{\nu ! \nu !} 2^{-2\nu} \\
&= \frac{\nu \cdot (\nu - 1) \cdots (\nu -k + 1)}{(\nu + 1) \cdot (\nu + 2) \cdots (\nu + k)} a_0 \\
&= \frac{1 \cdot (1 - \frac{1}{\nu}) \cdots (1 - \frac{k-1}{\nu})}{(1 + \frac{1}{\nu}) (1 + \frac{2}{\nu}) \cdots (1 + \frac{k}{\nu})} a_0.
\end{align*}$$

We, of course, have $a_k = a_{-k}$, and so the above formula characterizes all terms of $\operatorname{Bin}(\cdot \mid n, \frac{1}{2})$ in terms of $a_0$. Now, <a href="https://en.wikipedia.org/wiki/Taylor's_theorem">Taylor's theorem</a> implies that

$$1 + \frac{j}{\nu} = e^{j/\nu} + o(\nu^{-1}),$$

and so, for large enough $\nu$,

$$a_k \approx \frac{e^{0/\nu} \cdot e^{-1/\nu} \cdots e^{-(k-1)/\nu}}{e^{0/\nu} \cdot e^{1/\nu} \cdots e^{k/\nu}} a_0 \approx e^{-k^2/\nu} a_0.$$

It follows from <a href="https://en.wikipedia.org/wiki/Stirling's_approximation">Stirling's formula</a> that

$$a_k \approx e^{-k^2/\nu} a_0 = e^{-k^2/\nu} \begin{pmatrix} 2\nu \\ \nu \end{pmatrix} 2^{-2\nu} \approx e^{-k^2/\nu} \cdot \frac{1}{\sqrt{\pi \nu}}.$$
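
The sketch below compares the exact terms $a_k$ with the approximation $e^{-k^2/\nu}/\sqrt{\pi\nu}$. The choice $\nu = 500$ and the range $0 \leq k \leq 20$ are assumptions made for illustration; the approximation is only expected to be good when $k$ is small relative to $\nu$.

```python
from math import comb, exp, pi, sqrt

nu = 500  # illustrative; n = 2 * nu tosses of a fair coin

def exact(k):
    """a_k = Bin(nu + k | 2 nu, 1/2), computed from exact integers."""
    return comb(2 * nu, nu + k) / 2 ** (2 * nu)

def approx(k):
    """The approximation exp(-k^2 / nu) / sqrt(pi * nu)."""
    return exp(-k * k / nu) / sqrt(pi * nu)

rel_errors = [abs(approx(k) / exact(k) - 1.0) for k in range(21)]
for k in (0, 5, 10, 20):
    print(f"k={k:2d}  exact={exact(k):.6e}  approx={approx(k):.6e}")
print("largest relative error for 0 <= k <= 20:", max(rel_errors))
```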


**1.3.5. Gaussian integrals.** In light of the normal approximation of the binomial distribution, we consider random variables with a **Gaussian function**

$$a e^{-b(x-c)^2}$$

as their probability density functions (**§1.2.10**). It might appear initially that Gaussian distributions have *three* parameters, namely $a$, $b > 0$, and $c$, but one of them, $a$, depends on the other two. Given $b$ and $c$, $a$ must be chosen to ensure that the resulting function integrates to 1.

We begin our study of Gaussian functions by computing the integral of $e^{-x^2}$. Note first that

$$\left( \int_{-\infty}^\infty e^{-x^2} \, dx \right)^2 = \int_{-\infty}^\infty e^{-x^2} \, dx \int_{-\infty}^\infty e^{-y^2} \, dy = \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-(x^2+y^2)} \, dx \, dy.$$

Now, we take the [polar coordinate transform](https://en.wikipedia.org/wiki/Multiple_integral#Polar_coordinates), which yields $r^2 = x^2 + y^2$, and observe that

$$\begin{align*}
\int_{-\infty}^\infty \int_{-\infty}^\infty e^{-(x^2+y^2)} \, dx \, dy
&= \int_0^\infty \int_0^{2\pi} e^{-r^2}r \, d\theta \, dr \\
&= 2\pi \int_0^\infty e^{-r^2}r \, dr \\
&= 2\pi \int_{-\infty}^0 \frac{1}{2} e^s \, ds \\
&= \pi.
\end{align*}$$

Here, an additional coordinate transform $-r^2 \mapsto s$ has been made. It follows that

$$ \int_{-\infty}^\infty e^{-x^2} \, dx = \sqrt{\pi},$$

and so

$$\int_{-\infty}^\infty a e^{-b(x-c)^2} \, dx = \frac{a}{\sqrt{b}} \int_{-\infty}^\infty e^{-t^2} \, dt = a \sqrt{\frac{\pi}{b}}.$$

From this, we see that any Gaussian probability density function must be of the form

$$\sqrt{\frac{b}{\pi}} e^{-b(x-c)^2}.$$

Now, we let $X$ be a random variable with a Gaussian probability density function, so that

$$\begin{align*}
\mu &= \int_{-\infty}^\infty x \sqrt{ \frac{b}{\pi}} e^{-b(x-c)^2} \, dx \\
\mu_2 &= \int_{-\infty}^\infty x^2 \sqrt{ \frac{b}{\pi}} e^{-b(x-c)^2} \, dx.
\end{align*}$$

We evaluate the first integral by taking the coordinate transform $\sqrt{b}(x-c) = u$:

$$\begin{align*}
\mu &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^\infty \left( \frac{1}{\sqrt{b}} u + c \right) e^{-u^2} \, du \\
&= \frac{1}{\sqrt{b \pi}} \int_{-\infty}^\infty u e^{-u^2} \, du
+ \frac{c}{\sqrt{\pi}} \int_{-\infty}^\infty e^{-u^2} \, du \\
&= \frac{1}{\sqrt{b \pi}} \int_{-\infty}^\infty u e^{-u^2} \, du + c.
\end{align*}$$

Another coordinate transform, $u^2 = t$, yields

$$\begin{align*}
\int_{-\infty}^\infty u e^{-u^2} \, du
&= \frac{1}{2} \int_{u = -\infty}^{u = \infty} e^{-t} \, dt \\
&= -\frac{e^{-t}}{2} \bigg|_{u = -\infty}^{u = \infty} \\
&= -\frac{e^{-u^2}}{2} \bigg|_{u=-\infty}^{u = \infty} = 0,
\end{align*}$$

and so we conclude that $\mu = c$.

As for the variance, we begin, once again, by taking the coordinate transform $\sqrt{b}(x-c) = u$:

$$\begin{align*}
\mu_2
&= \frac{1}{\sqrt{\pi}} \int_{-\infty}^\infty \left(\frac{1}{\sqrt{b}} u + c\right)^2 e^{-u^2} \, du \\
&= \frac{1}{\sqrt{\pi}}
\left(
\frac{1}{b}  \int_{-\infty}^\infty u^2 e^{-u^2} \, du
+ \frac{2c}{\sqrt{b}} \int_{-\infty}^\infty u e^{-u^2} \, du
+ c^2 \int_{-\infty}^\infty e^{-u^2} \, du \right) \\
  &= \frac{1}{\sqrt{\pi}} \left( \frac{1}{b}  \int_{-\infty}^\infty u^2 e^{-u^2} \, du + 0 + c^2 \sqrt{\pi} \right) \\
  &= \frac{1}{b\sqrt{\pi}} \int_{-\infty}^\infty u^2  e^{-u^2} \, du + c^2.
  \end{align*}$$

To evaluate

$$\int_{-\infty}^\infty u^2 e^{-u^2} \, du,$$

we note that

$$\int_{-\infty}^\infty e^{-vu^2} \, du = \sqrt{\frac{\pi}{v}}$$

and [differentiate the integral](https://en.wikipedia.org/wiki/Leibniz_integral_rule) with respect to $v$:

$$-\frac{\sqrt{\pi}}{2} v^{-3/2} = \frac{d}{dv} \sqrt{\frac{\pi}{v}} = \int_{-\infty}^\infty \frac{d}{dv} e^{-vu^2} \, du
= \int_{-\infty}^\infty -u^2e^{-vu^2} \, du.$$

We now let $v = 1$ to conclude that

$$\int_{-\infty}^\infty u^2 e^{-u^2} \, du = \frac{\sqrt{\pi}}{2},$$

whence it follows that

$$\mu_2 = \frac{1}{2b} + c^2 .$$

We compute the variance as follows:

$$\sigma^2 = \mu_2 - \mu^2 = \frac{1}{2b} + c^2 - c^2 = \frac{1}{2b}.$$


All in all, we have shown that $a = \frac{1}{\sqrt{2 \pi \sigma^2}}$,  $b = \frac{1}{2\sigma^2}$, and $c = \mu$.
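
These relations can be checked numerically. The sketch below integrates the normalized Gaussian $\sqrt{b/\pi}\, e^{-b(x-c)^2}$ with a simple midpoint rule; the parameters $b = 2$ and $c = 1.5$, the integration window, and the step count are all illustrative assumptions.

```python
from math import exp, pi, sqrt

b, c = 2.0, 1.5  # illustrative parameters with b > 0

def density(x):
    """The normalized Gaussian sqrt(b / pi) * exp(-b (x - c)^2)."""
    return sqrt(b / pi) * exp(-b * (x - c) ** 2)

# Midpoint rule on [c - 10, c + 10]; the mass outside this window is negligible here.
steps = 20_000
width = 20.0 / steps
grid = [c - 10.0 + (i + 0.5) * width for i in range(steps)]

mass = sum(density(x) for x in grid) * width
mean = sum(x * density(x) for x in grid) * width
second_moment = sum(x * x * density(x) for x in grid) * width
variance = second_moment - mean**2

print(mass, mean, variance)   # compare with 1, c, and 1 / (2 b)
```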

**1.3.6. Normal distribution.** We discussed how to approximate the binomial distribution using Gaussian functions $a e^{-b(x-c)^2}$. We have shown that a random variable with $a e^{-b(x-c)^2}$ as its probability density function has mean $\mu = c$ and variance $\sigma^2 = \frac{1}{2b}$. Since $a$ must equal $\sqrt{\frac{b}{\pi}}$ for the integral of the Gaussian probability density function to be 1, it makes sense to write

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

to represent the pdf of the normal distribution with mean $\mu$ and variance $\sigma^2$.

The cumulative distribution function of the normal distribution is then

$$\Phi(x; \mu, \sigma^2) = \int_{-\infty}^x \mathcal{N}(t \mid \mu, \sigma^2) \, dt.$$

While the above integral has no closed-form expression, the **error function**

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt$$

allows us to write the cumulative distribution function as follows:

$$\Phi(x; \mu, \sigma^2) = \frac{1}{2} \left( 1 + \operatorname{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right).$$
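
A minimal sketch of this computation in Python, assuming the standard identity $\Phi(x; \mu, \sigma^2) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$ and using `math.erf` from the standard library. The parameter values and helper names are illustrative, and the result is compared against a direct midpoint-rule integration of the density.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2)."""
    return exp(-((x - mu) ** 2) / (2.0 * sigma2)) / sqrt(2.0 * pi * sigma2)

def normal_cdf(x, mu, sigma2):
    """Phi(x; mu, sigma^2) expressed through the error function."""
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2.0 * sigma2)))

mu, sigma2 = 1.0, 4.0   # illustrative parameters
x = 2.5

# Midpoint-rule approximation of the integral of the pdf over [mu - 40, x];
# the mass below mu - 40 (twenty standard deviations) is negligible.
steps = 50_000
lower = mu - 40.0
width = (x - lower) / steps
numeric = sum(normal_pdf(lower + (i + 0.5) * width, mu, sigma2) for i in range(steps)) * width

print(normal_cdf(x, mu, sigma2), numeric)
```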

### 1.4. Conditioning

**1.4.1. Conditional probability.** Given a probability space $(\Omega,\mathcal{F}, \mathbb{P})$ and two events $A, B \in \mathcal{F}$, what should the **probability of $A$ given $B$** be? To answer this question, we can create a new sample space in which $B$ always happens, i.e., every event is intersected with $B$. Formally, we take $\Omega_B = \Omega \cap B = B$ and

$$\mathcal{F}_B = \{E \cap B : E \in \mathcal{F}\}.$$

The construction implies that every $E \in \mathcal{F}_B$ admits $E' \in \mathcal{F}$ such that $E = E' \cap B$. It thus makes sense to define the new probability measure $\mathbb{P}_B[E]$ in terms of $\mathbb{P}[E' \cap B]$. Note, however, that $\mathbb{P}$ restricted to $\mathcal{F}_B$ is not necessarily a probability measure, as its maximum value is $\mathbb{P}[B]$. We therefore take the normalization

$$\mathbb{P}_B[E] = \frac{\mathbb{P}[E' \cap B]}{\mathbb{P}[B]},$$

so that $\mathbb{P}_B$ is a *bona fide* probability measure.

With this construction, it would be reasonable to say that the probability of $A$ given $B$ is $\mathbb{P}_B[A]$. We thus define

$$\mathbb{P}[A \mid B] = \mathbb{P}_B[A] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]},$$

provided that $\mathbb{P}[B] > 0$. If $\mathbb{P}[B] = 0$, then this quotient is undefined: $\mathbb{P}[A \cap B] \leq \mathbb{P}[B] = 0$ for every event $A$, and so the formula reads $0/0$.

Given the above notion of conditional probability, we declare two events $A$ and $B$ to be **independent** if

$$\mathbb{P}[A \mid B] = \mathbb{P}[A] \hspace{1em} \mbox{and} \hspace{1em} \mathbb{P}[B \mid A] = \mathbb{P}[B].$$

This holds if and only if $\mathbb{P}[A \cap B] = \mathbb{P}[A]\mathbb{P}[B]$, which is often taken as the definition of independence.

As an example of independence, consider the experiment of tossing an unbiased coin twice. The event $A$ that the first toss turns up heads is, intuitively, independent of the event $B$ that the second toss turns up heads. Since $A = \{$ `HH`, `HT`$\}$, $B = \{$ `HH`, `TH` $\}$, and $A \cap B = \{$ `HH` $\}$, we see that

$$\mathbb{P}[A \cap B] = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = \mathbb{P}[A] \mathbb{P}[B].$$

Therefore, our intuition coincides with the definition in this case.
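
The same computation can be carried out by enumerating the sample space. The sketch below is an illustration only; the helper `prob` and the string encoding of outcomes are our own choices.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(w) for w in product("HT", repeat=2)]  # HH, HT, TH, TT

def prob(event):
    """Uniform probability measure on the four outcomes."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}  # first toss turns up heads
B = {w for w in omega if w[1] == "H"}  # second toss turns up heads

p_a_given_b = prob(A & B) / prob(B)    # P[A | B] = P[A and B] / P[B]

print(prob(A & B), prob(A) * prob(B))  # both sides of the product formula
print(p_a_given_b, prob(A))            # conditioning on B does not change P[A]
```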

**1.4.2. Independence.** Recall that two events $A$ and $B$ from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ are **independent** if $\mathbb{P}[A \cap B] = \mathbb{P}[A] \mathbb{P}[B]$. It is not hard to extend this definition to several events by declaring $E_1,\ldots,E_n$ to be independent if

$$\mathbb{P}\left[ \bigcap_{j \in J} E_j \right] = \prod_{j \in J} \mathbb{P}[E_j]$$

for every nonempty index set $J \subseteq \{1,\ldots,n\}$. The product formula for the full collection alone does not suffice, as it does not imply the formula for subcollections.

Extending the definition further, we say that $\sigma$-algebras $\mathcal{G}_1,\ldots,\mathcal{G}_n$ such that $\mathcal{G}_j \subseteq \mathcal{F}$ for all $1 \leq j \leq n$ are **independent $\sigma$-subalgebras of $\mathcal{F}$** in case

$$\mathbb{P} \left[ \bigcap_{j=1}^n G_j \right] = \prod_{j=1}^n \mathbb{P}[G_j]$$

whenever $G_j \in \mathcal{G}_j$ for each $1 \leq j \leq n$.

How do we make sense of the notion of independent $\sigma$-algebras? Suppose we wish to say that two collections of events $\mathscr{E} = \{E_1,\ldots,E_n\}$ and $\mathscr{F} = \{F_1,\ldots,F_m\}$ are independent. We should then expect any probabilistic information we can deduce from $\mathscr{E}$ to not depend on $\mathscr{F}$, and vice versa. The natural way to guarantee this is to stipulate that any event that can be generated from $\mathscr{E}$ be independent of any event generated from $\mathscr{F}$. In other words, $\sigma(\mathscr{E})$ and $\sigma(\mathscr{F})$ must be independent.

We push further the idea of $\sigma$-algebras as the total collection of available information and define the **$\sigma$-algebra generated by a random variable $X$** to be the $\sigma$-algebra generated by the set of all preimages of $X$, i.e.,

$$\sigma(X) = \sigma \left( \left\{ \{\omega: X(\omega) \in B\} : B \in \mathscr{B}_{\mathbb{R}} \right\} \right),$$

where $\mathscr{B}_{\mathbb{R}}$ is the Borel $\sigma$-algebra on $\mathbb{R}$.

With this definition, we say that random variables $X_1,\ldots,X_n$ are **independent random variables** if the corresponding $\sigma$-algebras $\sigma(X_1),\ldots,\sigma(X_n)$ are independent.


**1.4.3. Conditional expectation.** Given a probability space $(\Omega,\mathcal{F},\mathbb{P})$, we consider the Hilbert space $L_2(\Omega,\mathcal{F},\mathbb{P})$ of random variables $X:\Omega \to \mathbb{R}$ such that $\mathbb{E}[\vert X \vert^2] < \infty$. We wish to compute the expectation of $X$ when we do not have all the information in $\mathcal{F}$ available to us.

To formalize, we let $\mathcal{G}$ be a $\sigma$-subalgebra of $\mathcal{F}$. Since a $\mathcal{G}$-measurable function on $\Omega$ is always $\mathcal{F}$-measurable, we see that the Hilbert space $L_2(\Omega,\mathcal{G},\mathbb{P})$ (**§1.2.6**) is a Hilbert subspace of $L_2(\Omega,\mathcal{F},\mathbb{P})$. The limited availability of information is modeled by considering the least-squares estimate $Y \in L_2(\Omega,\mathcal{G},\mathbb{P})$, i.e., the $\mathcal{G}$-measurable random variable $Y$ that minimizes

$$\|Y-X\|_{L_2(\Omega,\mathcal{F},\mathbb{P})}.$$

A standard result from Hilbert space theory tells us that such an estimation is computed by taking the orthogonal projection $P:L_2(\Omega,\mathcal{F},\mathbb{P}) \to L_2(\Omega,\mathcal{G},\mathbb{P})$, under which $PX = Y$. The linear operator $P$ is called the **conditional expectation** given $\mathcal{G}$ and is denoted by

$$PX = \mathbb{E}[X \mid \mathcal{G}].$$

If $\mathcal{G} = \sigma(Z)$ for some random variable $Z$, then we typically write

$$\mathbb{E}[X \mid Z]$$

to denote $\mathbb{E}[X \mid \sigma(Z)]$. The conditional expectation constructed this way satisfies the monotone convergence theorem and the dominated convergence theorem (**§1.2.5**).

Since $\mathbb{E}[\cdot\mid \mathcal{G}]$ is constructed to be a linear operator, we have that

$$\mathbb{E}[aX + bY\mid \mathcal{G}] = a \mathbb{E}[X\mid \mathcal{G}] + b \mathbb{E}[Y \mid \mathcal{G}].$$

Moreover $\mathbb{E}[X \mid \mathcal{G}] \geq 0$ whenever $X \geq 0$.

The conditional expectation satisfies a form of **Jensen's inequality**: if $c:\mathbb{R} \to \mathbb{R}$ is [convex](https://en.wikipedia.org/wiki/Convex_function), then

$$c(\mathbb{E}[X \mid \mathcal{G}]) \leq \mathbb{E}[c(X) \mid \mathcal{G}].$$

In particular, $t \mapsto \vert t \vert^p$ is convex for all $p \geq 1$, and so

$$\mathbb{E}[\vert X \vert \mid \mathcal{G}]^p \leq \mathbb{E}[ \vert X \vert^p \mid \mathcal{G}].$$

Moreover, $t \mapsto \vert t \vert$ is convex, and so

$$\mathbb{E}[\vert X \vert \mid \mathcal{G}]^p \geq \left \vert \mathbb{E}[X \mid \mathcal{G}] \right\vert^p.$$

Taking the expectation, we see that

$$\|\mathbb{E}[X \mid \mathcal{G}]\|_p \leq \mathbb{E}[\mathbb{E}[\vert X \vert^p \mid \mathcal{G}]]^{1/p}= \|X\|_p.$$

The last equality follows from the **law of total expectation**:

$$\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}]] = \mathbb{E}[Y].$$

If $\mathcal{H}$ is a $\sigma$-subalgebra of $\mathcal{G}$, then the **tower property**

$$\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]$$

holds. This is obtained by projecting $X$ onto $L_2(\Omega,\mathcal{G},\mathbb{P})$, and then onto $L_2(\Omega,\mathcal{H},\mathbb{P})$.
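
On a finite sample space, these identities can be verified by hand. The sketch below assumes a uniform probability measure on six outcomes and takes $\mathcal{G}$ and $\mathcal{H}$ to be generated by finite partitions, in which case the conditional expectation is the average over each cell of the partition (a standard fact, used here without proof). The random variable and the partitions are illustrative choices.

```python
from fractions import Fraction

omega = list(range(6))                      # six equally likely outcomes
X = {w: Fraction(w * w) for w in omega}     # illustrative random variable

G = [{0, 1}, {2, 3}, {4, 5}]                # partition generating G
H = [{0, 1, 2, 3}, {4, 5}]                  # coarser partition generating H

def expectation(Y):
    """E[Y] under the uniform measure on omega."""
    return sum(Y[w] for w in omega) / len(omega)

def cond_exp(Y, partition):
    """E[Y | sigma(partition)]: on each cell, the average of Y over that cell."""
    Z = {}
    for cell in partition:
        avg = sum(Y[w] for w in cell) / len(cell)
        for w in cell:
            Z[w] = avg
    return Z

X_given_G = cond_exp(X, G)
print(expectation(X_given_G) == expectation(X))   # law of total expectation
print(cond_exp(X_given_G, H) == cond_exp(X, H))   # tower property
```

Every cell of `H` is a union of cells of `G`, which is what makes $\mathcal{H}$ a $\sigma$-subalgebra of $\mathcal{G}$ in this example.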

By Hölder's inequality (**§1.2.6**), the product $XY$ is integrable and

$$\mathbb{E}[XY \mid \mathcal{G}] = X \mathbb{E}[Y \mid \mathcal{G}]$$

whenever $X \in L_p(\Omega,\mathcal{G},\mathbb{P})$ and $Y \in L_{p'}(\Omega,\mathcal{F},\mathbb{P})$ with $\frac{1}{p} + \frac{1}{p'} = 1$. In particular, if $Y$ is an integrable $\mathcal{F}$-measurable random variable and $X$ is a bounded $\mathcal{G}$-measurable random variable, then

$$\mathbb{E}[XY \mid \mathcal{G}] = X\mathbb{E}[Y \mid \mathcal{G}].$$

Since we interpret $\mathbb{E}[\cdot \mid \mathcal{G}]$ in the context of *knowing only the information provided by $\mathcal{G}$*, we can think of the above identity as *taking out what is known*, as per <a href="https://books.google.com/books?id=e9saZ0YSi-AC&lpg=PP1&vq=%22taking%20out%20what's%20known%22&pg=PA89#v=onepage&q&f=false">David Williams</a>.

Finally, if $X$ is independent of $\mathcal{G}$, then

$$\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X],$$

as knowing $\mathcal{G}$ has no bearing on the expected value of $X$.

**1.4.4. Conditional probabilities.** A general form of conditional probabilities can be derived as a special case of the conditional expectation. Given a probability space $(\Omega,\mathcal{F},\mathbb{P})$, a $\sigma$-subalgebra $\mathcal{G}$ of $\mathcal{F}$, and an element $A$ of $\mathcal{F}$, we define

$$\mathbb{P}[A \mid \mathcal{G}] = \mathbb{E}[\boldsymbol{1}_A \mid \mathcal{G}].$$

Observe that, for each $G \in \mathcal{G}$,

$$\int_G \mathbb{P}[A \mid \mathcal{G}] \, d\mathbb{P} = \mathbb{E}[\boldsymbol{1}_G\mathbb{E}[\boldsymbol{1}_A \mid \mathcal{G}]] = \mathbb{E}[\mathbb{E}[\boldsymbol{1}_G\boldsymbol{1}_A \mid \mathcal{G}]]$$

because $\boldsymbol{1}_G$ is bounded and $\mathcal{G}$-measurable. The law of total expectation now implies that

$$\int_G \mathbb{P}[A \mid \mathcal{G}] \, d\mathbb{P} = \mathbb{E}[\boldsymbol{1}_G \boldsymbol{1}_A] = \mathbb{P}[G \cap A].$$

We remark that $\mathbb{P}[A \mid \mathcal{G}] = \mathbb{P}[A]$ almost surely (**§1.1.3**) if and only if $\boldsymbol{1}_A$ is independent of $\mathcal{G}$.

Since a conditional probability is the conditional expectation of a nonnegative function,

$$\mathbb{P}[A \mid \mathcal{G}] \geq 0.$$

Moreover,

$$\mathbb{P}[A \mid \mathcal{G}] \leq \mathbb{P}[\Omega \mid \mathcal{G}] = \mathbb{E}[\boldsymbol{1}_\Omega] = 1$$

because  $\boldsymbol{1}_\Omega$ is independent of $\mathcal{G}$. We note, in particular, that $\mathbb{P}[\varnothing \mid \mathcal{G}] = 0$ and $\mathbb{P}[\Omega \mid \mathcal{G}] = 1$.

By the monotone convergence theorem (**§1.2.5**, **§1.4.3**) and the linearity of the conditional expectation,

$$\begin{align*}
\sum_{n=1}^\infty \mathbb{P}[A_n \mid \mathcal{G}]
&= \sum_{n=1}^\infty \mathbb{E}[\boldsymbol{1}_{A_n} \mid \mathcal{G}] \\
&= \lim_{N \to \infty} \sum_{n=1}^N \mathbb{E} [\boldsymbol{1}_{A_n} \mid \mathcal{G}] \\
&= \lim_{N \to \infty} \mathbb{E} \left[\sum_{n=1}^N \boldsymbol{1}_{A_n} \mid \mathcal{G}\right] \\
&= \lim_{N \to \infty} \mathbb{E} \left[\boldsymbol{1}_{\bigcup_{n=1}^N A_n} \mid \mathcal{G}\right] \\
&= \mathbb{E} \left[\lim_{N \to \infty} \boldsymbol{1}_{\bigcup_{n=1}^N A_n} \mid \mathcal{G}\right] \\
&= \mathbb{E} \left[ \boldsymbol{1}_{\bigcup_{n=1}^\infty A_n} \mid \mathcal{G} \right] \\
&= \mathbb{P} \left[ \bigcup_{n=1}^\infty A_n \mid \mathcal{G} \right]
\end{align*}$$

whenever $(A_n)_{n=1}^\infty$ is a disjoint collection of events. It follows that $\mathbb{P}[\cdot \mid \mathcal{G}]$ can be considered a probability measure on $(\Omega,\mathcal{F})$.

**1.4.5. Conditional distributions.** Recall that there is a one-to-one correspondence between probability distributions and the Lebesgue&ndash;Stieltjes measures (**§1.2.9**). In light of this, we shall construct conditional distributions as probability measures.

We let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, $\mathcal{G}$ a $\sigma$-subalgebra of $\mathcal{F}$, and $X$ an $\mathcal{F}$-measurable random variable. A function $\mu:\mathscr{B}_\mathbb{R} \times \Omega \to [0,1]$ such that

- $\mu(\cdot,\omega)$ is a probability measure on $(\mathbb{R},\mathscr{B}_{\mathbb{R}})$ for each $\omega \in \Omega$, and that
- $\mu(E,\cdot) = \mathbb{P}[X \in E \mid \mathcal{G}]$ almost surely for each $E \in \mathscr{B}_{\mathbb{R}}$

is called a **conditional distribution of $X$ given $\mathcal{G}$**.

We also recall that, given a random variable $Y$ on $(\Omega,\mathcal{F})$ and a Borel measurable function $\varphi:\mathbb{R} \to \mathbb{R}$, we have the **change-of-variables formula**

$$\mathbb{E}[\varphi(Y)] = \int_{-\infty}^\infty \varphi(\alpha) \, dF_Y(\alpha)$$

(**§1.2.7**, **§1.2.8**, **§1.2.9**). Similarly, conditional distributions are linked to the conditional expectation via the conditional version of the change-of-variables formula:

$$\int_{-\infty}^\infty \varphi(\alpha) \, d\mu(\alpha,\omega) = \mathbb{E}[\varphi(X) \mid \mathcal{G}](\omega).$$

In particular, if $\varphi(\alpha) = \alpha$, then

$$\int_{-\infty}^\infty \alpha \, d\mu(\alpha,\omega) = \mathbb{E}[X \mid \mathcal{G}](\omega).$$

### 1.5. Multivariate Random Variables

**1.5.1. Random vectors.** Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. A **random $n$-vector** is a function $X:\Omega \to \mathbb{R}^n$ whose coordinate functions $X_1,\ldots,X_n$ are random variables on $(\Omega,\mathcal{F},\mathbb{P})$.

Since $\mathscr{B}_{\mathbb{R}}^{\otimes n}$ agrees with $\mathscr{B}_{\mathbb{R}^n}$, the above definition is equivalent to $X:\Omega \to \mathbb{R}^n$ being $(\mathcal{F},\mathscr{B}_{\mathbb{R}^n})$-measurable, i.e., $E \in \mathscr{B}_{\mathbb{R}^n}$ implies $X^{-1}(E) \in \mathcal{F}$. This, in particular, implies that Borel measurable functions preserve random vectors. Indeed, if $g:\mathbb{R}^n \to \mathbb{R}^m$ is $(\mathscr{B}_{\mathbb{R}^n},\mathscr{B}_{\mathbb{R}^m})$-measurable, then $g(X):\Omega \to \mathbb{R}^m$ is a random $m$-vector.

As an example, the sum function $(x_1,\ldots,x_n) \mapsto x_1 + \cdots + x_n$ is Borel measurable, and so it follows that the sum of random variables is a random variable.

The expectation of a random vector is defined componentwise:

$$\mathbb{E}[X] = (\mathbb{E}X_1,\ldots,\mathbb{E}X_n).$$

**1.5.2. Variance-covariance matrix.** The **variance-covariance matrix** of a random vector $X = (X_1,\ldots,X_n)$ is the $n$-by-$n$ matrix $\Sigma$ whose entries are given by the formula

$$\Sigma_{ij} = \operatorname{Cov}(X_i, X_j).$$

We normalize the variance-covariance matrix to define the **correlation matrix**:

$$\rho(X)_{ij} = \rho(X_i,X_j).$$

We remark that the variance-covariance matrix is symmetric and positive-semidefinite. (Conversely, every symmetric and positive-semidefinite matrix is the variance-covariance matrix of a random vector.) By the spectral theorem, $\Sigma$ can be orthogonally diagonalized with nonnegative entries on the diagonal, a fact useful in [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). If all of these diagonal entries are positive, then the inverse $\Sigma^{-1}$ exists and is called the **precision matrix**. The [square root](https://en.wikipedia.org/wiki/Square_root_of_a_matrix) of $\Sigma^{-1}$ is commonly used for [whitening transforms](https://en.wikipedia.org/wiki/Whitening_transformation).
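
Symmetry and positive-semidefiniteness can be observed on a small example. The sketch below assumes a random 2-vector taking four equally likely values; the sample values and the helper names are illustrative. It uses the fact that $v^\top \Sigma v$ is the variance of the linear combination $v_1 X_1 + v_2 X_2$.

```python
from fractions import Fraction

# Illustrative random 2-vector on four equally likely outcomes.
samples = [(0, 1), (1, 3), (2, 2), (5, 6)]
n = len(samples)
dim = 2

mean = [Fraction(sum(s[i] for s in samples), n) for i in range(dim)]

def cov(i, j):
    """Cov(X_i, X_j) under the uniform measure on the samples."""
    return sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n

sigma = [[cov(i, j) for j in range(dim)] for i in range(dim)]

def quadratic_form(v):
    """v^T Sigma v, which equals var(v_1 X_1 + v_2 X_2) and is therefore >= 0."""
    return sum(v[i] * sigma[i][j] * v[j] for i in range(dim) for j in range(dim))

print(sigma)
print([quadratic_form(v) for v in [(1, 0), (1, -1), (-2, 3)]])
```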

**1.5.3. Joint probability distributions.** Given a random vector $X = (X_1,\ldots,X_n)$, we define the **distribution** of $X$ to be the mapping $F_X:\mathbb{R}^n \to [0,1]$, given by the formula

$$F_X(t_1,\ldots,t_n) = \mathbb{P}[X_1 \leq t_1; \cdots ; X_n \leq t_n].$$

$F_X$ is also called the **joint probability distribution** of $X_1,\ldots,X_n$.

A function $f_X:\mathbb{R}^n \to \mathbb{R}$ is the **joint probability density function** of $X_1,\ldots,X_n$ in case

$$\mathbb{P}[X \in E] =  \int_E f_X \, d\mathscr{L}_{\mathbb{R}^n}$$

for all $E \in \mathscr{B}_{\mathbb{R}^n}$, where $\mathscr{L}_{\mathbb{R}^n} = \mathscr{L}_{\mathbb{R}} \otimes \cdots \otimes \mathscr{L}_{\mathbb{R}}$ is the Lebesgue measure on $\mathbb{R}^n$.

**1.5.4. Multivariate Gaussian.**
