## 0. Introduction

This notebook continues our exploration of random variables, following Chapter 2 of *All of Statistics* (Wasserman, 2004).

## 1. Bivariate Distributions

Given a pair of discrete random variables $X$ and $Y$, define the joint mass function by $f(x, y) = \mathbb{P}(X = x, Y = y)$. We write $f$ as $f_{X,Y}$ when we want to be more explicit. 

Here is a bivariate distribution for two random variables $X$ and $Y$ each taking values $0$ or $1$:

$$
\begin{array}{c|cc|c}
    & Y=0 & Y=1 & \text{Total} \\
\hline
X=0 & 1/9 & 2/9 & 1/3 \\
X=1 & 2/9 & 4/9 & 2/3 \\
\hline
\text{Total} & 1/3 & 2/3 & 1 \\
\end{array}
$$

Thus, $f(1, 1) = \mathbb{P}(X = 1, Y = 1) = 4/9$.
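As a minimal sketch, the table above can be stored as a NumPy array and queried by index. The array name `joint` and the convention that rows index $x$ and columns index $y$ are choices made here for illustration.

```python
import numpy as np

# Joint PMF from the table above: rows index x (0, 1), columns index y (0, 1).
joint = np.array([[1, 2],
                  [2, 4]]) / 9.0

# f(1, 1) = P(X = 1, Y = 1)
print(joint[1, 1])

# A valid joint PMF is non-negative and sums to 1 over all cells.
print(bool((joint >= 0).all()), joint.sum())
```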

In the continuous case, we call a function $f(x, y)$ a PDF for the random variables $(X, Y)$ if

(i) $f(x, y) \geq 0$ for all $(x, y)$,

(ii) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, dx \, dy = 1$ and,

(iii) for any set $A \subset \mathbb{R} \times \mathbb{R}$, $\mathbb{P}((X, Y) \in A) = \int \int_A f(x, y) \, dx \, dy$.

In the discrete or continuous case, we define the joint CDF as $F_{X,Y}(x, y) = \mathbb{P}(X \leq x, Y \leq y)$.
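For a discrete distribution on a grid, the joint CDF can be obtained by accumulating the joint PMF along both axes. The sketch below reuses the $2 \times 2$ table from earlier in this section; the array names are illustrative.

```python
import numpy as np

# Joint PMF: rows index x (0, 1), columns index y (0, 1).
joint = np.array([[1, 2],
                  [2, 4]]) / 9.0

# Cumulative sums over x (axis 0) and then y (axis 1) give
# cdf[i, j] = P(X <= i, Y <= j).
cdf = joint.cumsum(axis=0).cumsum(axis=1)

print(cdf)
```

The bottom-right entry is $F_{X,Y}(1, 1)$, which covers every cell of the table and therefore equals 1.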

<center><img src="../figures/bi_dist.png"/></center>

## 2. Marginal Distributions

A **marginal distribution** is a probability distribution that describes the probability of one random variable having a specific value, regardless of the value of any other random variables.

If $(X, Y)$ have a joint distribution with mass function $f_{X,Y}$, then the **marginal mass function** for $X$ is defined by

$$ f_X(x) = \mathbb{P}(X = x) = \sum_{y} \mathbb{P}(X = x, Y = y) = \sum_{y} f(x, y) $$

and the marginal mass function for $Y$ is defined by

$$ f_Y(y) = \mathbb{P}(Y = y) = \sum_{x} \mathbb{P}(X = x, Y = y) = \sum_{x} f(x, y) $$

Suppose that $f_{X,Y}$ is given in the table that follows. The marginal distribution for $X$ corresponds to the row totals and the marginal distribution for $Y$ corresponds to the column totals:

$$
\begin{array}{c|cc|c}
    & Y=0 & Y=1 & \text{Total} \\
\hline
X=0 & 1/10 & 2/10 & 3/10 \\
X=1 & 3/10 & 4/10 & 7/10 \\
\hline
\text{Total} & 4/10 & 6/10 & 1 \\
\end{array}
$$

For example, $f_X(0) = 3/10$ and $f_X(1) = 7/10$.
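The row and column totals can be reproduced by summing the joint PMF over the other variable. This is a small sketch with NumPy; the names `joint`, `f_X` and `f_Y` are chosen here for illustration.

```python
import numpy as np

# Joint PMF from the table above: rows index x, columns index y.
joint = np.array([[1, 2],
                  [3, 4]]) / 10.0

# Marginal of X: sum over y (across columns), giving the row totals.
f_X = joint.sum(axis=1)

# Marginal of Y: sum over x (down rows), giving the column totals.
f_Y = joint.sum(axis=0)

print("f_X:", f_X)
print("f_Y:", f_Y)
```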

For continuous random variables, the **marginal densities** are

$$ f_X(x) = \int f(x, y) \, dy, \quad \text{and} \quad f_Y(y) = \int f(x, y) \, dx. $$

The corresponding marginal distribution functions are denoted by $F_X$ and $F_Y$.

For example, suppose that

$$ f_{X, Y}(x, y) = e^{-(x + y)} $$

for $x, y \geq 0$. Then $f_X(x) = e^{-x} \int_{0}^{\infty} e^{-y} \, dy = e^{-x}$.
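The closed form $f_X(x) = e^{-x}$ can be checked numerically by integrating out $y$. The sketch below uses a hand-written trapezoid rule on a truncated range; the cutoff `upper` and the grid size `n` are arbitrary assumptions, chosen so that the neglected tail is negligible.

```python
import numpy as np

def joint_density(x, y):
    """f_{X,Y}(x, y) = exp(-(x + y)) for x, y >= 0."""
    return np.exp(-(x + y))

def marginal_x(x, upper=50.0, n=200_001):
    """Approximate f_X(x) by integrating the joint density over y in [0, upper]."""
    y = np.linspace(0.0, upper, n)
    vals = joint_density(x, y)
    dy = y[1] - y[0]
    # Trapezoid rule: full weight on interior points, half weight on the ends.
    return float(dy * (vals.sum() - 0.5 * (vals[0] + vals[-1])))

for x in (0.0, 0.5, 2.0):
    print(x, marginal_x(x), np.exp(-x))
```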

<center><img src="../figures/joint_dist.png"/></center>

## 3. Independent Distributions

Informally, two random variables are **independent** when knowing the value of one tells us nothing about the value of the other.

Two random variables $X$ and $Y$ are independent if, for all sets $A$ and $B$,

$$ \mathbb{P}(X \in A, Y \in B) = \mathbb{P}(X \in A) \mathbb{P}(Y \in B) $$

and we write $X \perp Y$. Otherwise, we say that $X$ and $Y$ are dependent.

Let $X$ and $Y$ have the following distribution:

$$
\begin{array}{c|cc|c}
    & Y = 0 & Y = 1 & \text{Total} \\
\hline
X = 0 & 1/4 & 1/4 & 1/2 \\
X = 1 & 1/4 & 1/4 & 1/2 \\
\hline
\text{Total} & 1/2 & 1/2 & 1 \\
\end{array}
$$

Then, $f_X(0) = f_X(1) = 1/2$ and $f_Y(0) = f_Y(1) = 1/2$. $X$ and $Y$ are independent because $f_X(0) f_Y(0) = f(0, 0)$, $f_X(0) f_Y(1) = f(0, 1)$, $f_X(1) f_Y(0) = f(1, 0)$, $f_X(1) f_Y(1) = f(1, 1)$.

Suppose instead that $X$ and $Y$ have the following distribution:

$$
\begin{array}{c|cc|c}
    & Y = 0 & Y = 1 & \text{Total} \\
\hline
X = 0 & 1/2 & 0 & 1/2 \\
X = 1 & 0 & 1/2 & 1/2 \\
\hline
\text{Total} & 1/2 & 1/2 & 1 \\
\end{array}
$$

These are not independent because $f_X(0) f_Y(1) = (1/2)(1/2) = 1/4$ yet $f(0, 1) = 0$. If $X$ and $Y$ were independent, the event $\{X = 0, Y = 1\}$ would have probability $1/4$; in fact it has probability zero, because under this distribution $X$ and $Y$ always take the same value.
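The cell-by-cell comparison used in both examples can be automated: compute the marginals, form their outer product, and compare it with the joint table. The helper name `is_independent` and the tolerance are assumptions made for this sketch.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Check whether f(x, y) == f_X(x) * f_Y(y) holds in every cell."""
    joint = np.asarray(joint, dtype=float)
    f_X = joint.sum(axis=1)          # row totals
    f_Y = joint.sum(axis=0)          # column totals
    product = np.outer(f_X, f_Y)     # product[i, j] = f_X[i] * f_Y[j]
    return bool(np.allclose(joint, product, rtol=0.0, atol=tol))

uniform = [[0.25, 0.25],
           [0.25, 0.25]]
diagonal = [[0.5, 0.0],
            [0.0, 0.5]]

print(is_independent(uniform))
print(is_independent(diagonal))
```

The first table passes the check and the second fails it, matching the hand calculations above.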

## 4. Conditional Distributions

A **conditional distribution** is a probability distribution that describes the probability of one random variable having a specific value, given that another random variable has a specific value. 

If $X$ and $Y$ are discrete, then we can compute the conditional distribution of $X$ given that we have observed $Y = y$. Specifically, $\mathbb{P}(X = x | Y = y) = \mathbb{P}(X = x, Y = y) / \mathbb{P}(Y = y)$. This leads us to define the **conditional probability mass function** as

$$ f_{X \mid Y}(x \mid y) = \mathbb{P}(X = x \mid Y = y) = \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)} $$

if $f_Y(y) > 0$.
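As an illustration, the conditional PMF can be computed by dividing each column of a joint table by the corresponding marginal of $Y$. The sketch below reuses the table from Section 2 (an arbitrary choice made here); the array names are illustrative.

```python
import numpy as np

# Joint PMF from Section 2: rows index x, columns index y.
joint = np.array([[1, 2],
                  [3, 4]]) / 10.0

f_Y = joint.sum(axis=0)  # marginal of Y (column totals), all positive here

# Broadcasting divides column j by f_Y[j], so
# cond_X_given_Y[i, j] = f_{X|Y}(i | j) = f(i, j) / f_Y(j).
cond_X_given_Y = joint / f_Y

print(cond_X_given_Y)
print(cond_X_given_Y.sum(axis=0))  # each column is itself a PMF over x
```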

For continuous distributions, we use the same definitions. The interpretation differs: in the discrete case, $f_{X \mid Y}(x \mid y)$ is $\mathbb{P}(X = x \mid Y = y)$, but in the continuous case, we must integrate to get a probability. For continuous random variables, the **conditional probability density function** is

$$ f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} $$

assuming that $f_Y(y) > 0$. Then,

$$ \mathbb{P}(X \in A \mid Y = y) = \int_A f_{X \mid Y}(x \mid y) \, dx. $$
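To see the integration step concretely, here is a small numerical sketch. It assumes an illustrative density that does not appear elsewhere in this notebook, $f(x, y) = x + y$ on the unit square, for which $f_Y(y) = y + 1/2$. The function names and the grid size are likewise assumptions.

```python
import numpy as np

def joint_density(x, y):
    """Illustrative joint density f(x, y) = x + y on 0 <= x, y <= 1."""
    return x + y

def marginal_y(y):
    """f_Y(y): integral of (x + y) over x in [0, 1], which is y + 1/2."""
    return y + 0.5

def conditional_prob(lo, hi, y, n=10_001):
    """P(lo <= X <= hi | Y = y), integrating f_{X|Y}(x | y) over [lo, hi]."""
    x = np.linspace(lo, hi, n)
    dens = joint_density(x, y) / marginal_y(y)   # conditional density in x
    dx = x[1] - x[0]
    # Trapezoid rule over the grid.
    return float(dx * (dens.sum() - 0.5 * (dens[0] + dens[-1])))

p = conditional_prob(0.0, 0.25, 1.0 / 3.0)
print(p)
```

Working the same integral by hand gives $\mathbb{P}(X < 1/4 \mid Y = 1/3) = 11/80$, which the numerical value should match closely.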

<center><img src="../figures/cond_dist.png"/></center>

## 5. Multivariate Distributions and IID Samples

Let $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are random variables. We call $X$ a **random vector**. Let $f(x_1, \ldots, x_n)$ denote the PDF. It is possible to define their marginals, conditionals, etc., much the same way as in the bivariate case. We say that $X_1, \ldots, X_n$ are independent if, for every $A_1, \ldots, A_n$,

$$ \mathbb{P}(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n \mathbb{P}(X_i \in A_i). $$

It suffices to check that $f(x_1, \ldots, x_n) = \prod_{i=1}^n f_{X_i}(x_i)$.

If $X_1, \ldots, X_n$ are independent and each has the same marginal distribution with CDF $F$, we say that $X_1, \ldots, X_n$ are IID (independent and identically distributed) and we write

$$ X_1, \ldots, X_n \sim F. $$

If $F$ has density $f$, we also write $X_1, \ldots, X_n \sim f$. We also call $X_1, \ldots, X_n$ a **random sample of size** $n$ **from** $F$.

Much of statistical theory and practice begins with IID observations, as we will see in later notebooks.
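A short sketch of an IID sample and of the factorisation of its joint mass function. The Bernoulli distribution, the value `p = 0.3`, the seed, and the helper name `joint_pmf` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the draw is reproducible
p = 0.3                          # hypothetical success probability
n = 5

# X_1, ..., X_n IID Bernoulli(p): independent draws from the same distribution.
sample = rng.binomial(1, p, size=n)
print(sample)

def joint_pmf(xs, p):
    """Joint PMF of independent Bernoulli(p) variables: product of the marginals."""
    xs = np.asarray(xs)
    marginals = np.where(xs == 1, p, 1.0 - p)   # f_{X_i}(x_i) for each i
    return float(np.prod(marginals))

print(joint_pmf(sample, p))
```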

### 5.1 Multinomial Distributions

The multivariate version of a Binomial is called a Multinomial. Consider drawing a ball from an urn which has balls with $k$ different colours labeled "colour 1, colour 2, $\ldots$, colour $k$." Let $p = (p_1, \ldots, p_k)$ where $p_j \ge 0$ and $\sum_{j=1}^k p_j = 1$, and suppose that $p_j$ is the probability of drawing a ball of colour $j$. Draw $n$ times (independent draws with replacement) and let $X = (X_1, \ldots, X_k)$ where $X_j$ is the number of times that colour $j$ appears. Hence, $n = \sum_{j=1}^k X_j$. We say that $X$ has a Multinomial$(n, p)$ distribution, written $X \sim \text{Multinomial}(n, p)$. The probability function is

$$ f(x) = \binom{n}{x_1, \ldots, x_k} p_1^{x_1} \cdots p_k^{x_k} $$

where

$$ \binom{n}{x_1, \ldots, x_k} = \frac{n!}{x_1! \cdots x_k!}. $$
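The probability function can be evaluated directly from the two formulas above. This is a minimal sketch using only the standard library; the function name and the example counts and probabilities are illustrative.

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """f(x) = n! / (x_1! ... x_k!) * p_1^{x_1} ... p_k^{x_k}, with n = sum(x)."""
    n = sum(x)
    coef = factorial(n)
    for xj in x:
        coef //= factorial(xj)   # exact at every step: each quotient is an integer
    return coef * prod(pj ** xj for pj, xj in zip(p, x))

# Hypothetical example: n = 4 draws from an urn with k = 3 colours.
print(multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2)))
```

With $k = 2$ the function reduces to the Binomial probability function, which gives a quick sanity check.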

### 5.2 Multivariate Normal Distributions

The univariate normal has two parameters, $\mu$ and $\sigma$. In the multivariate version, $\mu$ is a vector and $\sigma$ is replaced by a matrix $\Sigma$. To begin, let

$$ 
Z =
\begin{pmatrix}
Z_1 \\
\vdots \\
Z_k
\end{pmatrix}
$$

where $Z_1, \ldots, Z_k \sim N(0, 1)$ are independent. The density of $Z$ is

$$
\begin{align*}
f(z) \quad & = \quad \prod_{i=1}^{k} f(z_i) = \frac{1}{(2\pi)^{k/2}} \exp \left\{ -\frac{1}{2} \sum_{j=1}^{k} z_j^2 \right\} \\
     \quad & = \quad \frac{1}{(2\pi)^{k/2}} \exp \left\{ -\frac{1}{2} z^T z \right\}.
\end{align*}
$$

We say that $Z$ has a standard multivariate normal distribution written $Z \sim N(0, I)$ where it is understood that $0$ represents a vector of $k$ zeroes and $I$ is the $k \times k$ identity matrix. More generally, a vector $X$ has a multivariate normal distribution, denoted by $X \sim N(\mu, \Sigma)$, if it has density

$$ f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp \left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} $$

where $|\Sigma|$ denotes the determinant of $\Sigma$, $\mu$ is a vector of length $k$, and $\Sigma$ is a $k \times k$ symmetric, positive definite matrix. Setting $\mu = 0$ and $\Sigma = I$ gives back the standard normal.
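The density can be written out term by term with NumPy. This is a sketch for illustration rather than a numerically robust implementation (a linear solve is used instead of forming $\Sigma^{-1}$ explicitly); the function name `mvn_density` is chosen here.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the N(mu, Sigma) density at a single point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    k = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2.0 * np.pi) ** (k / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm_const)

# With mu = 0 and Sigma = I this is the standard multivariate normal density.
z = np.array([1.0, -1.0])
print(mvn_density(z, np.zeros(2), np.eye(2)))
print(np.exp(-0.5 * z @ z) / (2.0 * np.pi))
```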

Since $\Sigma$ is symmetric and positive definite, it can be shown that there exists a matrix $\Sigma^{1/2}$, called the square root of $\Sigma$, with the following properties:

(i) $\Sigma^{1/2}$ is symmetric, 

(ii) $\Sigma = \Sigma^{1/2} \Sigma^{1/2}$, and 

(iii) $\Sigma^{1/2} \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma^{1/2} = I$, where $\Sigma^{-1/2} = (\Sigma^{1/2})^{-1}$.
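One way to construct such a square root (an approach assumed here, not one the text prescribes) is through the eigendecomposition $\Sigma = Q \Lambda Q^T$, taking $\Sigma^{1/2} = Q \Lambda^{1/2} Q^T$. The sketch below checks the three properties numerically for a hypothetical $2 \times 2$ matrix.

```python
import numpy as np

def sqrt_spd(Sigma):
    """Symmetric square root of a symmetric positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = Q diag(eigvals) Q^T
    return eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

# Hypothetical covariance matrix (symmetric, positive definite).
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

root = sqrt_spd(Sigma)
inv_root = np.linalg.inv(root)

print(np.allclose(root, root.T))                 # (i)   symmetric
print(np.allclose(root @ root, Sigma))           # (ii)  squares to Sigma
print(np.allclose(root @ inv_root, np.eye(2)))   # (iii) inverse relation
```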

If $Z \sim N(0, I)$ and $X = \mu + \Sigma^{1/2} Z$, then $X \sim N(\mu, \Sigma)$. Conversely, if $X \sim N(\mu, \Sigma)$, then $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$.

Suppose we partition a normal random vector $X$ as $X = (X_a, X_b)$. We can similarly partition $\mu = (\mu_a, \mu_b)$ and

$$
\Sigma = 
\begin{pmatrix}
\Sigma_{aa} & \Sigma_{ab} \\
\Sigma_{ba} & \Sigma_{bb}
\end{pmatrix}.
$$

Let $X \sim N(\mu, \Sigma)$. Then:

(1) The marginal distribution of $X_a$ is $X_a \sim N(\mu_a, \Sigma_{aa})$. 

(2) The conditional distribution of $X_b$ given $X_a = x_a$ is $X_b \mid X_a = x_a \sim N \left( \mu_b + \Sigma_{ba} \Sigma_{aa}^{-1} (x_a - \mu_a), \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab} \right)$.

(3) If $a$ is a vector, then $a^T X \sim N(a^T \mu, a^T \Sigma a)$.

(4) $V = (X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_k$.
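These properties can be explored with a short simulation. The sketch below draws from $N(\mu, \Sigma)$ through $X = \mu + \Sigma^{1/2} Z$ and then evaluates the formulas in (2) and (3) for a two-dimensional example in which $X_a$ and $X_b$ are the first and second coordinates. The values of `mu`, `Sigma`, `x_a`, `a`, the seed and the sample size are all assumptions. The check on (4) uses the fact that a $\chi^2_k$ variable has mean $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])                    # illustrative mean vector
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])               # symmetric, positive definite
k = mu.size

# Symmetric square root of Sigma via its eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(Sigma)
root = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

# X = mu + Sigma^{1/2} Z with Z ~ N(0, I); each row of X is one draw.
Z = rng.standard_normal((200_000, k))
X = mu + Z @ root                            # root is symmetric, so no transpose needed

sample_mean = X.mean(axis=0)                 # should be close to mu
sample_cov = np.cov(X, rowvar=False)         # should be close to Sigma

# (4): V = (X - mu)^T Sigma^{-1} (X - mu), computed for every draw.
diff = X - mu
V = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)

# (2): X_b | X_a = x_a, where both blocks are 1 x 1 in this example.
x_a = 2.0
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x_a - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

# (3): a^T X ~ N(a^T mu, a^T Sigma a).
a = np.array([1.0, -1.0])
lin_mean = a @ mu
lin_var = a @ Sigma @ a

print(sample_mean, V.mean())
print(cond_mean, cond_var, lin_mean, lin_var)
```

The sample mean and covariance only approximate $\mu$ and $\Sigma$; the agreement tightens as the number of draws grows.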