
# Background Note 

# Probability: Conditional Distributions

# By Albert S. Kyle

$\require{\newcommand}$
$\require{\renewcommand}$
$\renewcommand{\sm}{ {\scriptstyle{\text{*}}}}$ 
$\renewcommand{\mm}{{\scriptsize @}}$
$\newcommand{\E}{\mathrm{E}}$
$\newcommand{\e}{\mathrm{e}}$
$\newcommand{\drm}{\mathrm{\, d}}$
$\newcommand{\var}{\mathrm{var}}$
$\newcommand{\cov}{\mathrm{cov}}$
$\newcommand{\corr}{\mathrm{corr}}$
$\newcommand{\stdev}{\mathrm{stdev}}$
$\renewcommand{\t}{^{\mathsf{T}}}$
$\renewcommand{\comma}{\, , \,}$
$\renewcommand{\vec}[1]{\mathbf{#1}}$
$\newcommand{\skew}{\mathrm{skew}}$
$\newcommand{\kurt}{\mathrm{kurt}}$
$\newcommand{\prob}{\textrm{prob}}$
$\newcommand{\midx}{\, \mid \,}$


### Summary

This Background Note reviews probability theory, focussing on conditional probabilities.

#### Exercises

There are several exercises.  I encourage you to attempt the exercises.  Answers to the exercises are provided in a separate notebook.

##### Conventions

These notes define the function `f()` so that its definition changes from cell to cell.  By using local variables to make the examples self-contained, this approach avoids name clashes, except for the name of the function `f()` itself. 

In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib
import matplotlib.pyplot as plt
import sys
import datetime
import timeit
import math
import statistics
import nbconvert

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('NumPy version ' + np.__version__)
print('SciPy version ' + scipy.__version__)
print('matplotlib version ' + matplotlib.__version__)

timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print("Timestamp:", timestamp)
tstart = timeit.default_timer()


Python version 3.8.11 (default, Aug  6 2021, 09:57:55) [MSC v.1916 64 bit (AMD64)]
Pandas version 1.5.3
NumPy version 1.23.5
SciPy version 1.10.1
matplotlib version 3.7.1
Timestamp: 2023-09-05 16:49:03



### Covariance and Correlation

Let $X$ be a random variable with mean $\mu_X$ and variance $\sigma^2_X$. Similarly, let $Y$ be a random variable with mean $\mu_Y$ and variance $\sigma^2_Y$. Assume the random variables do not have probability distributions concentrated at a single point with probability one (i.e., $\sigma^2_X > 0$ and $\sigma^2_Y > 0$).

The covariance between two random variables $X$ and $Y$ is defined by

\begin{equation}
\cov [ X, Y ] := \E [ (X - \mu_X) \sm (Y - \mu_Y) ].
\end{equation}

The correlation between $X$ and $Y$ is defined by

\begin{equation}
\corr [ X, Y ] := \frac{\cov[X, Y]}{\sigma_X \sm \sigma_Y}.
\end{equation}

It can be shown mathematically that

\begin{equation}
-1 \le \corr [ X, Y ] \le 1.
\end{equation}

The correlation is the covariance scaled to be dimensionless and to range between $-1$ and $+1$.

The covariance of a random variable with itself is its variance: $\cov[X,X] = \var[X]$. The correlation of a random variable with itself is $+1$, and the correlation of a random variable with its negative is $-1$:

\begin{equation}
\cov [X,X] = \sigma^2_X, \qquad \corr [ X, X ] = 1, \qquad \corr[X, -X] = -1.
\end{equation}

If two random variables are **independent**, their covariance and correlation are both zero.

If we define new random variables $Z_X$ and $Z_Y$ which standardize $X$ and $Y$ to have means of zero and variances of one, then both the correlation and the covariance of $Z_X$ and $Z_Y$ equal the expectation of their product:

\begin{equation}
Z_X := \frac{X - \mu_X}{\sigma_X},
\qquad
Z_Y := \frac{Y - \mu_Y}{\sigma_Y}
\qquad \text{implies} \qquad
\corr [ Z_X, Z_Y ] = \cov[ Z_X, Z_Y] = \E [ Z_X \sm Z_Y ]
\end{equation}
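The definitions above can be checked numerically with sample moments in place of population moments. The following is a minimal sketch, not part of the original notes: the distributions, the seed, and the sample size are arbitrary assumptions, and the function name `f()` simply follows the convention described earlier.

```python
import numpy as np

def f():
    rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
    n = 100_000
    x = rng.normal(loc=2.0, scale=3.0, size=n)
    # y is correlated with x by construction (hypothetical data-generating process).
    y = 0.5 * x + rng.normal(loc=-1.0, scale=2.0, size=n)

    # Sample analogues of the definitions of covariance and correlation.
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    corr_xy = cov_xy / (x.std() * y.std())

    # Standardize both variables, then average the product Z_X * Z_Y.
    z_x = (x - x.mean()) / x.std()
    z_y = (y - y.mean()) / y.std()
    mean_zz = np.mean(z_x * z_y)
    return cov_xy, corr_xy, mean_zz

cov_xy, corr_xy, mean_zz = f()
print(f"cov = {cov_xy:.4f}, corr = {corr_xy:.4f}, E[Z_X Z_Y] = {mean_zz:.4f}")
```

In this construction the population covariance is $0.5 \sm 9 = 4.5$ and the population correlation is $0.6$, so the printed sample values should be close to those numbers, and the last two printed values should agree up to rounding.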



### Exercise 1:

1. Let $Z_0$ and $Z_1$ be independently distributed random variables with mean 0 and variance 1. For some number $\rho$ with $-1 \le \rho \le 1$, define a new random variable $Z_2 = \rho \sm Z_0 + \sqrt{1-\rho^2} \sm Z_1$. Calculate the mean and variance of $Z_2$, the covariance and correlation between $Z_0$ and $Z_2$, and the covariance and correlation between $Z_1$ and $Z_2$. 

2. For $\rho \in \{ -0.95, -0.50, 0.00, +0.25, +0.50, +0.80, +0.95, +0.99 \}$, simulate 100 outcomes of $Z_0$ and $Z_1$ under the assumption that both are normally distributed, then plot the values of $Z_0$ and $Z_2$.

### Skewness and Kurtosis

The **skewness** of a random variable $X$ with mean $\mu_X$ and standard deviation $\sigma_X > 0$ is defined as

\begin{equation}
\skew[X] = \E[Z_X^3], \qquad \text{where} \qquad Z_X := \frac{X - \mu_X}{\sigma_X}.
\end{equation}

A random variable tends to have positive skewness if realizations far from the mean tend to be much larger than the mean. It tends to have negative skewness if realizations far from the mean tend to be much smaller than the mean.  If the probability density function or probability mass function is symmetric about its mean, then the skewness is zero. The normal distribution has a skewness of zero (due to symmetry about its mean). 

The **kurtosis** of $X$ is defined as 

\begin{equation}
\kurt[X] = \E[Z_X^4], \qquad \text{where} \qquad Z_X := \frac{X - \mu_X}{\sigma_X}.
\end{equation}

The kurtosis tends to be large when there is more probability in the tails of the distribution. Kurtosis is related to the variance of $Z_X^2$ about its expectation. Indeed, since $\E[Z_X^2] = 1$, it can be shown that $\kurt[X] = \var[Z_X^2] + 1$.

The kurtosis of any normally distributed random variable is 3. Many statisticians like to work with **excess kurtosis**, defined as $\kurt[X] - 3$.
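As a rough numerical illustration, skewness and kurtosis can be estimated as sample averages of $Z_X^3$ and $Z_X^4$. The sketch below is an assumption-laden example rather than part of the original notes: it compares a normal sample with an exponential sample (a distribution chosen here only because it is visibly skewed and fat-tailed), and it also checks the identity $\kurt[X] = \var[Z_X^2] + 1$ in sample.

```python
import numpy as np

def f():
    rng = np.random.default_rng(1)        # arbitrary seed
    n = 200_000
    samples = {
        "normal": rng.normal(size=n),
        "exponential": rng.exponential(size=n),   # illustrative skewed distribution
    }
    out = {}
    for name, x in samples.items():
        z = (x - x.mean()) / x.std()      # standardized sample
        skew = np.mean(z ** 3)            # sample analogue of E[Z^3]
        kurt = np.mean(z ** 4)            # sample analogue of E[Z^4]
        kurt_alt = np.var(z ** 2) + 1.0   # var[Z^2] + 1, should match kurt
        out[name] = (skew, kurt, kurt_alt)
    return out

moments = f()
for name, (skew, kurt, kurt_alt) in moments.items():
    print(f"{name:>12s}: skew = {skew:.3f}, kurt = {kurt:.3f}, var[Z^2] + 1 = {kurt_alt:.3f}")
```

The normal sample should show skewness near zero and kurtosis near 3, while the exponential sample should show clearly positive skewness and larger kurtosis. Sample kurtosis is a noisy estimate for fat-tailed distributions, so the exponential figure can move noticeably with the seed.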


### Joint probability distributions

Two random variables have a **joint probability distribution**. 

To understand this, consider the following toy example.  A stock's price can go up or down from open to close during a trading day. Let $R$ denote the random **gross return** on the stock. If you buy at a price of 100 dollars at the beginning of the day and sell at a price of 105 dollars at the end of the day, your gross return is $1.05 = 105/100$; in addition to getting your 100 dollars back, you also obtain a profit of 5 dollars, for a **net return** of 0.05, or 5 percent. 

For simplicity in this **toy example**, assume there are two outcomes for $R$, up $U=1.05$ and down $D=0.95$. At the end of the day, there is an earnings announcement, $Y$, which can be high $H$ or low $L$. The particular numerical value for the earnings announcement does not matter here. We can think of $H=1$ and $L=0$ or $H=+1$ and $L=-1$. The joint probability mass function $f_{RY}$ describes the probability of all combinations of outcomes. Using obvious notation, we have:

\begin{equation}
f_{R Y}(r, y) = 
\left\{ 
\begin{array}{ll}
p_{U H} & \text{if } r = U \text{ and } y = H,\\
p_{U L} & \text{if } r = U \text{ and } y = L,\\
p_{D H} & \text{if } r = D \text{ and } y = H,\\
p_{D L} & \text{if } r = D \text{ and } y = L\\
\end{array} 
\right. .
\end{equation}

Since there are four possible joint outcomes and the probabilities sum to one, $p_{D L} + p_{U H} + p_{U L} + p_{D H} = 1$, the joint distribution is described by three parameters.  For example, the three parameters might be the probabilities of the first three outcomes, which define the probability of the fourth outcome implicitly as $p_{D L} = 1 - p_{U H} - p_{U L} - p_{D H}$.

Theoretically, the four probabilities could be any set of weights that defines a **convex combination** of the four outcomes. Here, the term **convex combination** means that all probabilities are nonnegative and sum exactly to one. Obviously, there are many such sets of weights. For four probabilities, the set of possible weights is a three-dimensional tetrahedron. It has three dimensions because there are three free parameters.

### Marginal probability distributions

We can also define probabilities for each outcome individually. The stock return $R$ has a **marginal probability distribution** defined by

\begin{equation}
f_{R}(r) = 
\left\{ 
\begin{array}{ll}
p_U := p_{U H} + p_{U L} & \text{if } r = U ,\\
p_D := p_{D H} + p_{D L} & \text{if } r = D \\
\end{array} 
\right. .
\end{equation}

The earnings announcement $Y$ has a **marginal probability distribution** defined by

\begin{equation}
f_{Y}(y) = 
\left\{ 
\begin{array}{ll}
p_H := p_{U H} + p_{D H} & \text{if } y = H ,\\
p_L := p_{U L} + p_{D L} & \text{if } y = L \\
\end{array} 
\right. .
\end{equation}

These marginal **probability mass functions** (**pmf**s) are obtained by summing over all of the relevant outcomes.

Since there are only two outcomes for $R$, the pmf for $R$ can be described by one parameter, say $p_U$, with the other probability defined as $p_D := 1 - p_U$.  Similarly, since there are only two outcomes for $Y$, the pmf for $Y$ can be described by one parameter, say $p_H$, with the other probability defined as $p_L := 1 - p_H$.

These probabilities lie on a one-dimensional set of possible convex combinations, defined by the line segment connecting the points $(0,1)$ and $(1,0)$ in the Euclidean plane.
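As a minimal sketch of the joint and marginal pmfs described above, the four joint probabilities can be stored in a $2 \times 2$ NumPy array and the marginals obtained by summing over rows or columns. The numerical values are hypothetical, chosen only for illustration, and the layout (rows for $R$, columns for $Y$) is an assumption of this sketch.

```python
import numpy as np

def f():
    # Hypothetical joint probabilities (illustrative values, not from the text).
    # Rows index R in the order (U, D); columns index Y in the order (H, L).
    joint = np.array([[0.35, 0.15],
                      [0.20, 0.30]])
    # A valid pmf has nonnegative entries that sum to one.
    assert np.all(joint >= 0.0) and np.isclose(joint.sum(), 1.0)

    marg_r = joint.sum(axis=1)   # (p_U, p_D): sum over the Y outcomes
    marg_y = joint.sum(axis=0)   # (p_H, p_L): sum over the R outcomes
    return joint, marg_r, marg_y

joint, marg_r, marg_y = f()
print("p_U, p_D =", marg_r)
print("p_H, p_L =", marg_y)
```

With these particular numbers, the marginal for $R$ is $(0.5, 0.5)$ and the marginal for $Y$ is $(0.55, 0.45)$.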

### Joint pmf for independently distributed random variables

Since the pmf for the joint distribution is described by three "free" parameters and the pmf for each marginal distribution is described by one "free" parameter each, knowledge of the two marginal pmfs does not make it possible to determine the pmf for the joint distribution. At least one more parameter must be specified. This requires more information about how the two random variables are related.

If the two random variables are independently distributed, then the joint probabilities are the **products** of the marginal probabilities:

\begin{equation}
f_{R Y}(r, y) = 
\left\{ 
\begin{array}{ll}
p_{U} \sm p_{H} & \text{if } r = U \text{ and } y = H,\\
p_{U} \sm p_{L} & \text{if } r = U \text{ and } y = L,\\
p_{D} \sm p_{H} & \text{if } r = D \text{ and } y = H,\\
p_{D} \sm p_{L} & \text{if } r = D \text{ and } y = L\\
\end{array} 
\right. .
\end{equation}

The analysis so far generalizes to two jointly distributed discrete random variables with more than two outcomes.  For example, if there are 100 stock price outcomes and 50 possible earnings outcomes (presumably specified numerically rather than with letters like $H$ and $L$), then there are $100 \sm 50 = 5000$ probabilities defining the joint probability distribution (4999 free parameters), 100 probabilities describing the marginal distribution for the stock price (99 free parameters), and 50 probabilities describing the marginal distribution for the earnings announcement (49 free parameters).

If there are more than two random variables, the analysis also generalizes in a natural way.
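The point that marginal pmfs alone do not pin down the joint pmf can be illustrated with a short sketch. Under independence, the joint pmf is the outer product of the marginals; a different joint pmf with the same marginals is also shown. All numerical values below are hypothetical.

```python
import numpy as np

def f():
    marg_r = np.array([0.50, 0.50])   # hypothetical (p_U, p_D)
    marg_y = np.array([0.55, 0.45])   # hypothetical (p_H, p_L)

    # Under independence, each joint probability is a product of marginals.
    joint_indep = np.outer(marg_r, marg_y)

    # A different (dependent) joint pmf; rows are R = (U, D), columns are Y = (H, L).
    joint_dep = np.array([[0.35, 0.15],
                          [0.20, 0.30]])
    same_marginals = bool(
        np.allclose(joint_dep.sum(axis=1), marg_r)
        and np.allclose(joint_dep.sum(axis=0), marg_y)
    )
    is_independent = bool(np.allclose(joint_dep, joint_indep))
    return joint_indep, same_marginals, is_independent

joint_indep, same_marginals, is_independent = f()
print(joint_indep)
print("same marginals:", same_marginals, "| independent:", is_independent)
```

Both joint pmfs share the same marginals, yet only the outer-product version satisfies independence, which is why at least one additional parameter is needed.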

### Conditional probabilities

We can use **conditional probabilities** to describe the joint probability distribution of two random variables.

For example, traders in the stock market might be buying or selling based on whether they anticipate the earnings announcement to be high or low.  If the stock price rises during the day, the **conditional probability** of a high earnings announcement is greater than it would be if the stock price fell during the day.

Let $p_{H \vert U}$ denote the **conditional probability** of a high earnings announcement $H$ if the stock price goes up $U$. We have

\begin{equation}
f_{Y \vert R}(y \vert r) = 
\left\{ 
\begin{array}{ll}
p_{H \vert U} & \text{if } r = U \text{ and } y = H ,\\
p_{H \vert D} & \text{if } r = D \text{ and } y = H ,\\
p_{L \vert U} & \text{if } r = U \text{ and } y = L ,\\
p_{L \vert D} & \text{if } r = D \text{ and } y = L .\\
\end{array} 
\right. 
\end{equation}

The conditional probabilities must also add up to one: $p_{H \vert U} + p_{L \vert U} = 1$ and $p_{H \vert D} + p_{L \vert D} = 1$. Therefore, the conditional pmf $f_{Y\midx R}(y\midx r)$ is defined by two free parameters.

We can condition on $Y$ rather than $R$ and define $f_{R\midx Y}(r\midx y)$ analogously.

The joint pmf is related to the conditional pmfs and marginal pmfs by

\begin{equation}
f_{R Y}(r,y) = f_{R\midx Y}(r\midx y) \sm f_{Y}(y) = f_{Y\midx R}(y\midx r) \sm f_{R}(r) .
\end{equation}

This equation implies that joint probabilities can be obtained from conditional probabilities and marginal probabilities as follows:

\begin{equation}
\begin{aligned}
p_{U H} &= p_{H \vert U} \sm p_U = p_{U \vert H} \sm p_H,\\
p_{U L} &= p_{L \vert U} \sm p_U = p_{U \vert L} \sm p_L,\\
p_{D H} &= p_{H \vert D} \sm p_D = p_{D \vert H} \sm p_H,\\
p_{D L} &= p_{L \vert D} \sm p_D = p_{D \vert L} \sm p_L.\\
\end{aligned}
\end{equation}
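The two factorizations above can be verified numerically. In this sketch, which uses a hypothetical joint pmf, the conditional pmfs are obtained by dividing the joint pmf by the appropriate marginal, and multiplying back recovers the joint pmf either way.

```python
import numpy as np

def f():
    # Hypothetical joint pmf; rows index R = (U, D), columns index Y = (H, L).
    joint = np.array([[0.35, 0.15],
                      [0.20, 0.30]])
    marg_r = joint.sum(axis=1)
    marg_y = joint.sum(axis=0)

    # Conditional pmf of Y given R: divide each row by its row total.
    cond_y_given_r = joint / marg_r[:, np.newaxis]
    # Conditional pmf of R given Y: divide each column by its column total.
    cond_r_given_y = joint / marg_y[np.newaxis, :]

    # Both factorizations should reproduce the joint pmf.
    joint_from_r = cond_y_given_r * marg_r[:, np.newaxis]
    joint_from_y = cond_r_given_y * marg_y[np.newaxis, :]
    return joint, cond_y_given_r, cond_r_given_y, joint_from_r, joint_from_y

joint, cond_y_given_r, cond_r_given_y, joint_from_r, joint_from_y = f()
print("f(y | r):\n", cond_y_given_r)
print("f(r | y):\n", cond_r_given_y)
```

Each row of `cond_y_given_r` sums to one and each column of `cond_r_given_y` sums to one, matching the adding-up conditions on conditional probabilities.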







### Bayes' Law

Suppose that we know one conditional pmf $f_{Y\midx R}(y\midx r)$ and know the associated marginal pmf $f_R(r)$. We can obtain the joint pmf from

\begin{equation}
f_{R Y}(r,y) = f_{Y\midx R}(y\midx r) \sm f_{R}(r) .
\end{equation}

We can then obtain the other marginal pmf from

\begin{equation}
f_{Y}(y) = \sum_{r \in \{ U, D \}} f_{R Y}(r,y) .
\end{equation}

We can then obtain the conditional pmf $f_{R\midx Y}(r\midx y)$ by solving the equation

\begin{equation}
f_{R Y}(r,y) = f_{R\midx Y}(r\midx y) \sm f_{Y}(y) = f_{Y\midx R}(y\midx r) \sm f_{R}(r)
\end{equation}

for $f_{R\midx Y}(r\midx y)$ to obtain

\begin{equation}
f_{R\midx Y}(r\midx y) = \frac{f_{R Y}(r,y)}{f_{Y}(y)} = \frac{f_{Y\midx R}(y\midx r) \sm f_{R}(r)}{f_{Y}(y)} .
\end{equation}

This result is called **Bayes' Law** or **Bayes' Theorem**.

For our specific case where both $R$ and $Y$ have two outcomes, the conditional pmf $f_{Y\midx R}(y\midx r)$ requires two free parameters and the marginal pmf for $R$ requires one free parameter. This is consistent with the joint pmf requiring three free parameters. These three parameters are all that is needed to specify all probabilities fully.  

Of course, Bayes' Law applies to discrete probabilities with an arbitrary number of outcomes. It also generalizes to continuous distributions like the normal distribution.
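The three steps of Bayes' Law (form the joint pmf, sum to get the other marginal, divide) can be sketched as a small function. This is an illustrative implementation under stated assumptions: the array layout and the input numbers are hypothetical and deliberately differ from those in the exercises, and every outcome of $Y$ is assumed to have positive probability so that the division is well defined.

```python
import numpy as np

def f(cond_y_given_r, marg_r):
    """Return (joint, marg_y, cond_r_given_y) computed from f_{Y|R} and f_R.

    cond_y_given_r[i, j] = prob(Y = y_j | R = r_i); marg_r[i] = prob(R = r_i).
    Assumes every outcome of Y has strictly positive probability.
    """
    cond_y_given_r = np.asarray(cond_y_given_r, dtype=float)
    marg_r = np.asarray(marg_r, dtype=float)
    joint = cond_y_given_r * marg_r[:, np.newaxis]   # f_{RY} = f_{Y|R} * f_R
    marg_y = joint.sum(axis=0)                       # sum over outcomes r
    cond_r_given_y = joint / marg_y[np.newaxis, :]   # Bayes' Law
    return joint, marg_y, cond_r_given_y

# Hypothetical inputs: rows are R = (U, D), columns are Y = (H, L).
joint, marg_y, cond_r_given_y = f([[0.8, 0.2], [0.3, 0.7]], [0.4, 0.6])
print("f(r | y):\n", cond_r_given_y)
```

Because the function works with arrays of any shape `(number of r outcomes, number of y outcomes)`, the same sketch covers discrete distributions with more than two outcomes.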

### Conditional Expectation

Just as the (**unconditional**) expectation is the weighted sum of outcomes with probabilities as weights, the **conditional expectation** is the weighted sum of conditional outcomes with conditional probabilities as weights.

In the two-outcome example, the unconditional expectation of the return $R$ is

\begin{equation}
\begin{aligned}
\E [ R ] &= \sum_{r \in \{ U, D \}}  r \sm f_{R}(r) \\
&= U \sm p_{U} + D \sm p_{D}
\end{aligned}
\end{equation}


In the two-outcome example, the conditional expectation of the return $R$ given the earnings announcement $Y$ is

\begin{equation}
\begin{aligned}
\E [ R \midx  Y=y] &= \sum_{r \in \{ U, D \}}  r \sm f_{R \midx  Y}(r \midx  y) \\
&= \left\{ 
\begin{array}{ll}
U \sm p_{U \midx  H} + D \sm p_{D \midx  H} & \text{if }  y = H ,\\
U \sm p_{U \midx  L} + D \sm p_{D \midx  L} & \text{if }  y = L .\\
\end{array} 
\right.
\end{aligned}
\end{equation}

This formula generalizes in an obvious way to random variables with more than two outcomes.

The conditional expectation depends only on the conditional probabilities, $f_{R \midx  Y}(r \midx  y)$, not on the entire set of joint probabilities, $f_{R Y}(r, y)$. It typically takes much more information to specify the joint probabilities than the conditional probabilities.  For example, if there are 100 possible outcomes for $R$ and 50 possible outcomes for $Y$, specifying arbitrary conditional probabilities of $R$ for a specific outcome $y$ requires specifying 99 values (Why?), the same as specifying the marginal probability distribution of $R$. But specifying the joint pmf requires specifying 4999 values (Why?).
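The unconditional and conditional expectations in the two-outcome example can be computed as weighted sums. The sketch below uses the gross returns $U = 1.05$ and $D = 0.95$ from the toy example together with a hypothetical joint pmf, which is an assumption made only for illustration.

```python
import numpy as np

def f():
    r = np.array([1.05, 0.95])         # gross returns (U, D) from the toy example
    # Hypothetical joint pmf: rows are R = (U, D), columns are Y = (H, L).
    joint = np.array([[0.35, 0.15],
                      [0.20, 0.30]])
    marg_r = joint.sum(axis=1)
    marg_y = joint.sum(axis=0)
    cond_r_given_y = joint / marg_y[np.newaxis, :]

    uncond_mean = r @ marg_r           # E[R] = U * p_U + D * p_D
    cond_mean = r @ cond_r_given_y     # (E[R | Y=H], E[R | Y=L])
    return uncond_mean, cond_mean

uncond_mean, cond_mean = f()
print(f"E[R] = {uncond_mean:.4f}")
print(f"E[R | Y=H] = {cond_mean[0]:.4f}, E[R | Y=L] = {cond_mean[1]:.4f}")
```

With these hypothetical numbers, the conditional expectation is above the unconditional expectation after a high announcement and below it after a low announcement.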


### Exercise 2

Suppose the unconditional (marginal) open-to-close stock return is $R=+5$ percent for $U$ and the unconditional expected net return is zero (which defines $D$), with the earnings announcement made after the close.  Suppose that the earnings announcement is more likely to be high after the stock price goes up than after the stock price goes down.  To be specific, the conditional probability of earnings given stock prices is

\begin{equation}
f_{Y\midx R}(y\midx r) = 
\left\{ 
\begin{array}{ll}
0.65 & \text{if } r = U \text{ and } y = H ,\\
0.35 & \text{if } r = U \text{ and } y = L ,\\
0.55 & \text{if } r = D \text{ and } y = H ,\\
0.45 & \text{if } r = D \text{ and } y = L .\\
\end{array} 
\right.
\end{equation}

1. Describe the joint pmf of $R$ and $Y$.

2. Describe the marginal pmf for $Y$.

3. Describe the conditional pmf for $R$ given $Y$.

4. If you knew whether the earnings announcement was going to be high $H$ or low $L$ before the stock opened, how large a return could you make day-trading the stock, buying or selling on the open and unwinding the position on the close? Are your profits larger when you observe $H$ or when you observe $L$? (Note that you are liquidating your bet before the earnings are announced because you are betting on how well the market can predict the earnings announcement, not what the earnings announcement will actually be.) 



### Generalization to multiple discrete outcomes and continuous random variables

The concepts of **joint probability distribution**, **marginal probability distribution**, **conditional probability distribution**, **conditional expectation**, and **Bayes' Law** generalize in an obvious manner to more than two discrete random variables, each of which potentially has more than two outcomes. 

These concepts also generalize to continuous random variables. For example, three continuous random variables $X_0$, $X_1$, $X_2$ may have a **joint density function** $f_{X_0 X_1 X_2}( x_0, x_1, x_2)$ which is nonnegative and satisfies 

$$
\int_{x_0=-\infty}^{+\infty} \int_{x_1=-\infty}^{+\infty} \int_{x_2=-\infty}^{+\infty}
f_{X_0 X_1 X_2}( x_0, x_1, x_2) \, \drm x_0 \drm x_1 \drm x_2 = 1 .
$$

The joint probability distribution function is

\begin{equation}
\begin{aligned}
F_{X_0 X_1 X_2}( \bar x_0, \bar x_1, \bar x_2) 
&= \text{prob}(X_0 \le \bar x_0 \text{ and } X_1 \le \bar x_1 \text{ and } X_2 \le \bar x_2)  \\
&= \int_{x_0=-\infty}^{\bar x_0} \int_{x_1=-\infty}^{\bar x_1} \int_{x_2=-\infty}^{\bar x_2}
f_{X_0 X_1 X_2}( x_0, x_1, x_2) \, \drm x_0 \drm x_1 \drm x_2 .
\end{aligned}
\end{equation}

The **marginal probability density** function for $X_0$ is obtained by "integrating out" $x_1$ and $x_2$:

$$
f_{X_0}(x_0) = \int_{x_1=-\infty}^{+\infty} \int_{x_2=-\infty}^{+\infty}
f_{X_0 X_1 X_2}( x_0, x_1, x_2) \, \drm x_1 \drm x_2 .
$$

Conditional probabilities, conditional expectations, and Bayes' Law generalize in an obvious manner.

If the random variables $X_0$, $X_1$, and $X_2$ are mutually independent, then their joint density function is the product of their marginal densities:

$$f_{X_0 X_1 X_2}( x_0, x_1, x_2 ) = f_{X_0}(x_0) \sm f_{X_1}(x_1) \sm f_{X_2}(x_2) .
$$

Note: It is theoretically possible for $X_0$ and $X_1$ to be **pairwise independent**, $X_0$ and $X_2$ to be pairwise independent, and $X_1$ and $X_2$ to be pairwise independent, while the three random variables $X_0$, $X_1$, $X_2$ are nevertheless not **mutually independent**.
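A standard discrete illustration of this note (not taken from the original text) uses two independent fair coin flips $X_0$ and $X_1$ and defines $X_2$ as their exclusive-or. The sketch below builds the joint pmf as a $2 \times 2 \times 2$ array and checks pairwise and mutual independence by comparing joint probabilities with products of marginals.

```python
import itertools
import numpy as np

def f():
    # joint[x0, x1, x2] = prob(X_0 = x0, X_1 = x1, X_2 = x2), with X_2 = X_0 XOR X_1.
    joint = np.zeros((2, 2, 2))
    for x0, x1 in itertools.product((0, 1), repeat=2):
        joint[x0, x1, x0 ^ x1] = 0.25

    m0 = joint.sum(axis=(1, 2))
    m1 = joint.sum(axis=(0, 2))
    m2 = joint.sum(axis=(0, 1))

    # Pairwise independence: each two-variable marginal equals an outer product.
    pairwise = bool(
        np.allclose(joint.sum(axis=2), np.outer(m0, m1))
        and np.allclose(joint.sum(axis=1), np.outer(m0, m2))
        and np.allclose(joint.sum(axis=0), np.outer(m1, m2))
    )
    # Mutual independence: the full joint pmf equals the triple product of marginals.
    mutual = bool(np.allclose(joint, np.einsum("i,j,k->ijk", m0, m1, m2)))
    return pairwise, mutual

pairwise, mutual = f()
print("pairwise independent:", pairwise, "| mutually independent:", mutual)
```

Any two of the three variables look independent, but knowing two of them determines the third exactly, so the triple is not mutually independent.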


### Conditional expectation and theoretical regression

Let $X$ and $Y$ be random variables with finite means and variances $\mu_X$, $\mu_Y$, $\sigma^2_X$, $\sigma^2_Y$.

The **conditional expectation** of $Y$ given $X$ is a function of realizations of $X$. Call this function $g_{Y\midx X}$:

\begin{equation}
g_{Y\midx X}(x) = \E [ Y \midx  X=x ].
\end{equation}

It can be shown that the conditional expectation $g_{Y\midx X}(x)$ minimizes the **mean squared error**:

\begin{equation}
\min_{h \in \mathcal{H}} \E \left[(Y - h(X))^2 \right] = \E \left[ (Y - g_{Y\midx X}(X))^2 \right], 
\end{equation}

where $\mathcal{H}$ is the set of all functions $h$ such that $h(X)$ defines a meaningful random variable.

This conditional expectation is also called a **regression**. With a near infinite number of realizations of $X$ and $Y$, one could theoretically calculate the conditional expectation function (**regression function**) $g(x)$ almost perfectly accurately. I like to call $g(x)$ the **theoretical regression** function.

### Intuition for why conditional expectation minimizes mean squared error

To obtain some intuition for why the conditional expectation minimizes the mean squared error rather than, say, the mean absolute error $\E \left[ \lvert Y - h(X) \rvert \right]$, suppose that $g(x)$ is changed by adding a small increment $\alpha \cdot \Delta g(x)$.  The first-order condition for the value of $\alpha$ which minimizes $\E \left[(Y - g(X) - \alpha \sm \Delta g(X)  )^2 \right]$ is

\begin{align}
0 &= \frac{\partial}{\partial \alpha} \, \E \left[\big( Y - g(X) - \alpha \sm \Delta g(X) \big)^2 \mid X = x \right] \Big\vert_{\alpha=0} \\
&= -2 \sm \E \left[ \big(Y - g(X) - \alpha \sm \Delta g(X) \big) \sm \Delta g(X) \mid X = x  \right] \Big\vert_{\alpha=0} \\ 
&= -2 \sm \big( \E [Y \mid X = x] - g(x) \big) \sm \Delta g(x). 
\end{align}

Since this must be true for every direction $\Delta g(x)$, it must be the case that $\E [Y \mid X = x] = g(x)$.
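The contrast between squared error and absolute error can be illustrated for a single fixed value $x$. The sketch below assumes a hypothetical conditional pmf for $Y$ given $X = x$ and searches a grid of candidate predictions: the squared-error criterion is minimized at the conditional mean, while the absolute-error criterion is minimized at the conditional median.

```python
import numpy as np

def f():
    # Hypothetical conditional pmf of Y given X = x, for one fixed x.
    y_vals = np.array([0.0, 1.0, 10.0])
    probs = np.array([0.4, 0.4, 0.2])
    cond_mean = y_vals @ probs                      # E[Y | X = x]

    candidates = np.linspace(0.0, 10.0, 101)        # candidate predictions h(x)
    errors = y_vals[np.newaxis, :] - candidates[:, np.newaxis]
    mse = (errors ** 2) @ probs                     # E[(Y - c)^2 | X = x] for each c
    mae = np.abs(errors) @ probs                    # E[|Y - c| | X = x] for each c

    best_mse = candidates[np.argmin(mse)]
    best_mae = candidates[np.argmin(mae)]
    return cond_mean, best_mse, best_mae

cond_mean, best_mse, best_mae = f()
print(f"conditional mean = {cond_mean:.2f}")
print(f"minimizer of squared error  = {best_mse:.2f}")
print(f"minimizer of absolute error = {best_mae:.2f}")
```

In this example the conditional mean is 2.4 and the conditional median is 1, so the two criteria pick different predictions whenever the conditional distribution is asymmetric.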



### Restricting the set of functions

The set of all possible functions $\mathcal{H}$ includes functions with complicated shapes.  In applications, the functional form of $g$ may not be known. To simplify analysis, it is common to restrict $\mathcal{H}$ to a smaller set of functions which is more tractable. For example, $\mathcal{H}$ may be restricted to constant functions, to linear functions, to polynomials, or to ratios of polynomials. 

1. If $\mathcal{H}$ is restricted to the subset of $\mathcal{H}$ consisting of constant functions $g(x) := \alpha$, the choice of $\alpha$ which minimizes mean squared error $\E \left[(Y - \alpha)^2 \right]$ is the unconditional mean of $Y$, defined by the function $g_0(x)=\mu_Y$, and the minimized mean squared error is the variance of $Y$ (by definition). 

2. If $\mathcal{H}$ is restricted to linear functions of the form $a + b \sm x$, then mean squared error is minimized by the **simple linear least squares regression** function $g_1(x) := \alpha + \beta \sm x$, where the constant term $\alpha$ and coefficient $\beta$ are coefficients in the **theoretical linear regression** function. If the unrestricted $g(x)$ is not actually linear in $x$, the linear regression function will have a higher mean squared error than the mean squared error associated with the actual conditional expectation function $g(x)$.  If $\beta \ne 0$, the mean squared error will be less than the unconditional variance of $Y$.

3. If $\mathcal{H}$ is restricted to polynomials of degree three or less, $b_0 + b_1 \sm x + b_2 \sm x^2 + b_3 \sm x^3$, then the mean squared error will be minimized by the theoretical **multiple linear regression** function $g_3(x) := \beta_0 + \beta_1 \sm x + \beta_2 \sm x^2 + \beta_3 \sm x^3$. The minimized mean squared error is less than the mean squared error from the linear regression function if the best value for $\beta_2$ or $\beta_3$ is nonzero. The mean squared error is greater than for the conditional expectation $g(x)$ unless the conditional expectation can be expressed as a cubic polynomial. In this polynomial example, the regression function is not linear in $x$, but it is linear in the coefficients $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$. This is an example of **multiple linear regression** because the function minimizing the mean squared error is defined by a linear combination of multiple functions of $X$.

4. If $\mathcal{H}$ is restricted to a ratio of polynomials (**rational function**) of the form
$$
\frac{b_0 + b_1 \sm x + b_2 \sm x^2 + b_3 \sm x^3}{1 + a_0 \sm (x - a_1)^2} ,
$$
then the mean-squared-error-minimizing function $g_{3,1}(x) :=
\frac{\beta_0 + \beta_1 \sm x + \beta_2 \sm x^2 + \beta_3 \sm x^3}{1 + \alpha_0 \sm (x - \alpha_1)^2}$, which is defined by six parameters $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\alpha_0$, $\alpha_1$, may achieve an even lower mean squared error than the case where $\mathcal{H}$ is restricted to cubic functions of $x$. This is an example of **nonlinear regression** because the function is not linear in $\alpha_0$ and $\alpha_1$.
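The effect of restricting $\mathcal{H}$ can be illustrated with simulated data. The sketch below is an assumption-based example, not the author's method: it takes the true conditional expectation to be $g(x) = \sin(2x)$ (a hypothetical choice that no cubic polynomial matches exactly), fits constant, linear, and cubic functions by least squares with `np.polyfit`, and compares in-sample mean squared errors. The rational-function case is omitted because it requires a nonlinear optimizer.

```python
import numpy as np

def f():
    rng = np.random.default_rng(2)                 # arbitrary seed
    n = 100_000
    x = rng.normal(size=n)
    g_true = np.sin(2.0 * x)                       # hypothetical conditional expectation
    y = g_true + rng.normal(scale=0.5, size=n)     # noise with variance 0.25

    def mse(pred):
        return np.mean((y - pred) ** 2)

    out = {}
    for name, degree in (("constant", 0), ("linear", 1), ("cubic", 3)):
        coefs = np.polyfit(x, y, degree)           # least-squares polynomial fit
        out[name] = mse(np.polyval(coefs, x))
    out["conditional expectation"] = mse(g_true)
    return out

mse_by_model = f()
for name, value in mse_by_model.items():
    print(f"{name:>24s}: {value:.4f}")
```

The mean squared error should fall as the set of allowed functions grows, with the true conditional expectation attaining approximately the noise variance of 0.25.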

### Law of Iterated Expectations, Law of Total Variance

The conditional expectation satisfies the **law of iterated expectations** (**law of total expectation**), which says

\begin{equation}
\E[Y] = \E \big[ \E[Y\midx X] \big].
\end{equation}

The **conditional variance**, defined as $\var[Y\midx X] := \E \big[ (Y - \E[Y\midx X])^2 \mid X \big]$, satisfies the **law of total variance**:

\begin{equation}
\var[Y] = \var \big[ \E[Y\midx X] \big] + \E \big[ \var[Y\midx X] \big].
\end{equation}
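Both laws can be verified exactly for a small discrete distribution. In the sketch below, the return $R$ from the toy example plays the role of $Y$ in the formulas and the earnings announcement plays the role of the conditioning variable $X$; the joint pmf is hypothetical.

```python
import numpy as np

def f():
    r = np.array([1.05, 0.95])        # gross returns (U, D) from the toy example
    # Hypothetical joint pmf: rows are R = (U, D), columns are the announcement (H, L).
    joint = np.array([[0.35, 0.15],
                      [0.20, 0.30]])
    marg_r = joint.sum(axis=1)
    marg_y = joint.sum(axis=0)
    cond_r_given_y = joint / marg_y[np.newaxis, :]

    mean_r = r @ marg_r                                  # E[R]
    var_r = (r - mean_r) ** 2 @ marg_r                   # var[R]

    cond_mean = r @ cond_r_given_y                       # E[R | announcement]
    cond_var = (r ** 2) @ cond_r_given_y - cond_mean ** 2   # var[R | announcement]

    iterated_mean = cond_mean @ marg_y                   # E[ E[R | announcement] ]
    var_of_cond_mean = (cond_mean - iterated_mean) ** 2 @ marg_y
    mean_of_cond_var = cond_var @ marg_y
    return mean_r, iterated_mean, var_r, var_of_cond_mean, mean_of_cond_var

mean_r, iterated_mean, var_r, var_of_cond_mean, mean_of_cond_var = f()
print(f"E[R] = {mean_r:.6f}, E[E[R|Y]] = {iterated_mean:.6f}")
print(f"var[R] = {var_r:.6f}, var[E[R|Y]] + E[var[R|Y]] = {var_of_cond_mean + mean_of_cond_var:.6f}")
```

The two printed means agree, and the variance of the conditional mean plus the mean of the conditional variance adds up to the unconditional variance.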

All of the mean-squared-error-minimizing random variables $g(X)$, $g_1(X)$, $g_3(X)$, $g_{3,1}(X)$ are uncorrelated with their **residuals** (prediction errors):

\begin{equation}
\begin{aligned}
\corr [ Y - g(X) \comma g(X)] &= \cov [ Y - g(X) \comma g(X)] = 0,  \\
\corr [ Y - g_1(X) \comma g_1(X)] &= \cov [ Y - g_1(X) \comma g_1(X)] = 0, \\ 
\corr [ Y - g_3(X) \comma g_3(X)] &= \cov [ Y - g_3(X) \comma g_3(X)] = 0, \\
\corr [ Y - g_{3,1}(X) \comma g_{3,1}(X)] &= \cov [ Y - g_{3,1}(X) \comma g_{3,1}(X)] = 0 .\\
\end{aligned}
\end{equation}

More generally, any random variable in the set of random variables defined by restrictions on $\mathcal{H}$ is uncorrelated with the residuals:

\begin{equation}
\begin{aligned}
\corr [ Y - g(X) \comma h(X)] &= \cov [ Y - g(X) \comma h(X)] = 0,  \\
\corr [ Y - g_1(X) \comma a + b \sm X] &= \cov [ Y - g_1(X) \comma a + b \sm X] = 0, \\ 
\corr [ Y - g_3(X) \comma b_0 + b_1 \sm X + b_2 \sm X^2 + b_3 \sm X^3] &= \cov [ Y - g_3(X) \comma  b_0 + b_1 \sm X + b_2 \sm X^2 + b_3 \sm X^3] = 0, \\
\corr [ Y - g_{3,1}(X) \comma \frac{b_0 + b_1 \sm X + b_2 \sm X^2 + b_3 \sm X^3}{1 + a_0 \sm (X - a_1)^2}] &= \cov [ Y - g_{3,1}(X) \comma \frac{b_0 + b_1 \sm X + b_2 \sm X^2 + b_3 \sm X^3}{1 + a_0 \sm (X - a_1)^2}] = 0 .
\end{aligned}
\end{equation}

The first line in the above equations is another way of stating the **law of total variance**.

This implies that the variance of $Y$ can be decomposed as

\begin{equation}
\begin{aligned}
\var[Y] 
&= \var [ Y - g(X) ] + \var [g(X)], \\
&= \var [ Y - g_{3,1}(X) ] + \var [g_{3,1}(X)] ,\\
&= \var [ Y - g_3(X) ] + \var [g_3(X)], \\
&= \var [ Y - g_1(X) ] + \var [g_1(X)]. 
\end{aligned}
\end{equation}

Since we have

\begin{equation}
\var [ Y - g(X) ] 
\le \var [ Y - g_{3,1}(X) ]
\le \var [ Y - g_3(X) ]
\le \var [ Y - g_1(X) ]
\le \var[Y],
\end{equation}

we must also have

\begin{equation}
\var [g(X) ] 
\ge \var [g_{3,1}(X) ]
\ge \var [g_3(X) ]
\ge \var [g_1(X) ]
\ge 0.
\end{equation}

Obviously, these inequalities depend on the various restrictions of $\mathcal{H}$ being nested. The set of constants is contained in the set of linear functions; the set of linear functions is contained in the set of cubic polynomials; the set of cubic polynomials is contained within the set of rational functions with cubic numerators; and the specific set of rational functions is contained within the set of all functions.

All of these results pertain to **theoretical regressions** defined by the joint probability distribution of $X$ and $Y$.  Although they are not based on data, we shall see that empirical relationships mimic the theoretical relationships discussed here.
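As a preview of how sample quantities mimic these theoretical relationships, the sketch below fits linear and cubic least-squares regressions to simulated data and checks that residuals are uncorrelated with fitted values and that the variance of $Y$ splits into residual variance plus fitted-value variance. The data-generating process is a hypothetical assumption; the rational-function case is omitted because it requires a nonlinear optimizer.

```python
import numpy as np

def f():
    rng = np.random.default_rng(3)                  # arbitrary seed
    n = 100_000
    x = rng.normal(size=n)
    y = np.sin(2.0 * x) + rng.normal(scale=0.5, size=n)   # hypothetical data

    rows = []
    for degree in (1, 3):
        fitted = np.polyval(np.polyfit(x, y, degree), x)  # least-squares fit
        resid = y - fitted
        cov_resid_fitted = np.mean((resid - resid.mean()) * (fitted - fitted.mean()))
        rows.append((degree, cov_resid_fitted, np.var(resid), np.var(fitted), np.var(y)))
    return rows

rows = f()
for degree, cov_rf, var_resid, var_fitted, var_y in rows:
    print(f"degree {degree}: cov(resid, fitted) = {cov_rf:.2e}, "
          f"var(resid) + var(fitted) = {var_resid + var_fitted:.4f}, var(y) = {var_y:.4f}")
```

Moving from the linear to the cubic fit lowers the residual variance and raises the variance of the fitted values by the same amount, consistent with the nested inequalities above.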

### Multiple conditioning variables

The concept of conditional expectation generalizes in an obvious manner to multiple conditioning variables.  For example, if $X_0$, $X_1$, $\ldots$, $X_{N-1}$ are $N$ random variables with finite means and variances, we can define

\begin{equation}
g(x_0, \ldots, x_{N-1}) = \E [ Y \midx X_0=x_0, \ldots, X_{N-1} = x_{N-1} ].
\end{equation}

We can also calculate conditional expectations of functions of $Y$, $h(Y)$, when such functions define a meaningful random variable:

\begin{equation}
g_{h(Y)\midx X_0,\ldots,X_{N-1}}(x_0, \ldots, x_{N-1}) = \E [ h(Y) \midx X_0=x_0, \ldots, X_{N-1} = x_{N-1} ].
\end{equation}
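When the conditioning variables are discrete, a sample analogue of these conditional expectations is a group-by average. The sketch below is a hypothetical illustration with two conditioning variables and $h(Y) = Y^2$; the data-generating process, column names, and seed are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def f():
    rng = np.random.default_rng(4)              # arbitrary seed
    n = 200_000
    df = pd.DataFrame({
        "x0": rng.integers(0, 2, size=n),       # takes values 0 or 1
        "x1": rng.integers(0, 3, size=n),       # takes values 0, 1, or 2
    })
    # Hypothetical model: E[Y | x0, x1] = 1 + 2*x0 - x1, with unit-variance noise.
    df["y"] = 1.0 + 2.0 * df["x0"] - df["x1"] + rng.normal(size=n)
    df["y_sq"] = df["y"] ** 2                   # h(Y) = Y^2
    # Averages within each (x0, x1) cell estimate E[Y | x0, x1] and E[h(Y) | x0, x1].
    return df.groupby(["x0", "x1"])[["y", "y_sq"]].mean()

cond_means = f()
print(cond_means)
```

Each row of the printed table corresponds to one combination $(x_0, x_1)$, and the `y` column should be close to $1 + 2 \sm x_0 - x_1$.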

In [2]:
timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
tfinish = timeit.default_timer()
print(f"Finished: {timestamp = }\nExecution time = {tfinish - tstart} s")


Finished: timestamp = '2023-09-05 16:49:03'
Execution time = 0.02548729999999999 s
