# Decision Theory
Decision theory is about finding "optimal" actions, under various definitions of optimality.


## Typical Sequence of Events
* Many problem domains can be formalized as follows:
  * Observe input $x$
  * Take action $a$
  * Observe outcome $y$
    * Outcome $y$ is often **independent** of action $a$
    * But this is **not always the case**:
      * search result ranking
      * automated driving
      * stock market predictions from analysts, which might affect market movement
  * Evaluate action in relation to the outcome: $L(a,y)$
  
## The Three Spaces
* Input space: $\mathcal X$
* Action space: $\mathcal A$
* Outcome space: $\mathcal Y$

### Action
* Definition: An **action** is the generic term for what is produced by our system (as I understand it, the "system" here means the prediction function).

* Examples of Actions
  * Produce a $0/1$ classification [classical ML]
  * Reject hypothesis that $\theta = 0$ [classical Statistics]
  * Written English text [image captioning, speech recognition, machine translation]
  
### Decision Function
Definition: A **decision function** (or **prediction function**) takes an input $x \in \mathcal{X}$ and produces an action $a \in \mathcal{A}$:

\begin{equation}
f: \mathcal{X} \to \mathcal{A}
\end{equation}

\begin{equation}
x \mapsto f(x)
\end{equation}

### Loss Function
Definition: A **loss function** evaluates an action in the context of the outcome $y$:

\begin{equation}
L: \mathcal{A} \times \mathcal{Y} \to \mathbb{R}
\end{equation}

\begin{equation}
(a,y) \mapsto L(a,y)
\end{equation}
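
A minimal Python sketch of these two definitions, assuming a toy setting that is not in the notes: real-valued inputs, binary actions and outcomes, and a hypothetical threshold rule standing in for the decision function.

```python
# Assumed toy spaces: X = real numbers, A = Y = {0, 1}.

def decision_function(x: float) -> int:
    """f: X -> A. A hypothetical threshold rule; 0.5 is an arbitrary choice."""
    return 1 if x >= 0.5 else 0

def zero_one_loss(a: int, y: int) -> float:
    """L: A x Y -> R. Costs 1 when the action disagrees with the outcome."""
    return 0.0 if a == y else 1.0

def square_loss(a: float, y: float) -> float:
    """Another common choice of L when actions and outcomes are real-valued."""
    return (a - y) ** 2

# One pass through the sequence: observe x, take action a, observe y, evaluate.
x, y = 0.8, 0
a = decision_function(x)
loss = zero_one_loss(a, y)
print(a, loss)
```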

## Formalizing a "Data Science" Problem
1. First two steps to formalizing a problem  
  1. Define the *action space* (i.e. the set of possible actions)
  1. Specify the evaluation criterion.
1. When a "stakeholder" asks the data scientist to solve a problem, she
  1. may have an opinion on what the action space should be, and
  1. hopefully has an opinion on the evaluation criterion, but
  1. she really cares about your **producing a "good" decision function**.
1. Typical sequence:
  1. Stakeholder presents problem to data scientist
  1. Data scientist produces decision function.
  1. Engineer deploys "industrial strength" version of decision function.

## Evaluating a Decision Function
* Loss function $L$ only evaluates a single action
* How to evaluate the decision function as a whole? (Answer: Statistical Learning Theory)

***

# Statistical Learning Theory

## A Simplifying Assumption
* Assume action has no effect on the output
* Assume there is a data generating distribution $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.
* All input/output pairs $(x,y)$ are generated i.i.d. from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.
  * no covariate shift
  * no concept drift
* Want decision function $f(x)$ that generally "does well on average":
\begin{equation}
L(f(x), y) \;\;\;\text{ is usually small, in some sense}
\end{equation}

## Risk of a Decision Function
Definition: Given a decision function (or prediction function) $f: \mathcal{X} \to \mathcal{A}$, the **risk** of this decision function is defined as:

\begin{equation}
R(f) = E[L(f(x), y)]
\end{equation}

where $L$ is the **loss function** and the expectation is taken over $(x,y) \sim \mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.

In words, it's the **expected loss** of $f$ on a new example $(x,y)$ drawn randomly from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.

* We usually don't know $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$, so we cannot compute the expectation. But we can estimate it.

### The Bayes Decision Function
Definition: A **Bayes decision function** $f^* : \mathcal{X} \to \mathcal{A}$ is a function that achieves the *minimal risk* among all possible functions:

\begin{equation}
f^* = \operatorname*{arg\,min}_f R(f)
\end{equation}

where the minimum is taken over all functions from $\mathcal{X}$ to $\mathcal{A}$.

* The risk of a Bayes decision function is called the **Bayes Risk**.
  * There can be multiple Bayes decision functions that achieve the same minimal risk.
* A Bayes decision function is often called the **target function**, since it's the best decision function we can possibly produce.


## The Empirical Risk Functional
### The Empirical Risk of a Decision Function
Let $\mathcal{D}_n=((x_1,y_1),\dots,(x_n,y_n))$ be drawn i.i.d. from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.

Definition: The **empirical risk** of $f:\mathcal{X}\to \mathcal{A}$ with respect to $\mathcal{D}_n$ is:

\begin{equation}
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n}L(f(x_i),y_i)
\end{equation}

For a fixed $f$ with finite risk, the Strong Law of Large Numbers gives

\begin{equation}
\lim_{n\to \infty}\hat{R}_n(f)=R(f) \;\;\;\text{ almost surely.}
\end{equation}
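
The convergence above can be checked numerically. The sketch below assumes a made-up data generating distribution ($x \sim \text{Uniform}(0,1)$, $y \sim \text{Bernoulli}(x)$) and a fixed threshold rule with 0-1 loss, for which the true risk works out to $0.25$. None of these choices come from the notes.

```python
import random

def f(x):
    """A fixed, hypothetical decision function."""
    return 1 if x >= 0.5 else 0

def zero_one_loss(a, y):
    return 0.0 if a == y else 1.0

def sample(n, rng):
    """Draw n i.i.d. pairs with x ~ Uniform(0, 1) and y ~ Bernoulli(x)."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if rng.random() < x else 0
        data.append((x, y))
    return data

def empirical_risk(f, data, loss):
    """Average loss of f over the sample."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Under the assumed distribution:
# R(f) = P(x < 0.5, y = 1) + P(x >= 0.5, y = 0) = 0.125 + 0.125
TRUE_RISK = 0.25

rng = random.Random(0)
for n in (100, 10_000, 100_000):
    print(n, empirical_risk(f, sample(n, rng), zero_one_loss))
```

The printed values should settle near `TRUE_RISK` as `n` grows. This holds for a fixed $f$; it does not by itself say anything about a function chosen by looking at the data.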

### Empirical Risk Minimization (ERM)
Definition: A function $\hat{f}$ is an **empirical risk minimizer** if 

\begin{equation}
\hat{f} = \operatorname*{arg\,min}_f \hat{R}_n(f)
\end{equation}

where the minimum is taken over all functions from $\mathcal{X}$ to $\mathcal{A}$.


### Constrained Empirical Risk Minimization (CERM)
* Unconstrained ERM can return a function $f$ that just memorizes the training data, i.e. ERM can lead to overfitting.
* How to spread information or "generalize" from training inputs to new inputs?
* Need to smooth things out somehow...
  * A lot of modeling is about spreading and extrapolating information from one part of the input space $\mathcal{X}$ into unobserved parts of the space.
* One approach: "Constrained ERM"
  * Instead of minimizing empirical risk over all decision functions,
  * constrain to a particular subset, called a **hypothesis space**.
* The restrictions are called an *inductive bias*.
* A fundamental question in learning theory is over which hypothesis classes ERM learning does not result in overfitting.

#### Hypothesis Spaces 
Definition: A **hypothesis space** $\mathcal{F}$ is a set of decision functions mapping $\mathcal{X} \to \mathcal{A}$. It is the collection of decision functions we are considering.

#### CERM
* **Empirical Risk Minimizer** (ERM) in $\mathcal{F}$ is

\begin{equation}
\hat{f}_n = \operatorname*{arg\,min}_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}
\end{equation}

* **Risk minimizer** in $\mathcal{F}$ is $f_{\mathcal{F}} \in \mathcal{F}$, where

\begin{equation}
f_{\mathcal{F}} = \operatorname*{arg\,min}_{f\in \mathcal{F}}E[L(f(x),y)]
\end{equation}

### Procedure of ERM
* Given a loss function $L:\mathcal{A}\times \mathcal{Y} \to \mathbb{R}$
* Choose hypothesis space $\mathcal{F}$
* Use an optimization method to find ERM $\hat{f}_n \in \mathcal{F}$
  
\begin{equation}
\hat{f}_n = \operatorname*{arg\,min}_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}
\end{equation}
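
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the training set is made up, the hypothesis space is a hypothetical finite family of threshold rules $f_t(x) = 1[x \ge t]$, and the "optimization method" is plain exhaustive search.

```python
# Step 1: the loss function (0-1 loss here).
def zero_one_loss(a, y):
    return 0.0 if a == y else 1.0

def make_threshold_rule(t):
    """Hypothetical hypothesis f_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

def empirical_risk(f, data, loss):
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Toy training set D_n, made up for illustration.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 1), (0.6, 1), (0.7, 0), (0.9, 1)]

# Step 2: a finite hypothesis space F = {f_t : t = 0.0, 0.1, ..., 1.0}.
thresholds = [i / 10 for i in range(11)]

# Step 3: find the empirical risk minimizer in F by exhaustive search.
best_t = min(
    thresholds,
    key=lambda t: empirical_risk(make_threshold_rule(t), data, zero_one_loss),
)
best_risk = empirical_risk(make_threshold_rule(best_t), data, zero_one_loss)
print(best_t, best_risk)
```

On this toy data the search selects $t = 0.4$ with empirical risk $1/7$: no threshold rule classifies the point $(0.7, 0)$ correctly without misclassifying others, so the constrained minimizer cannot simply memorize the sample.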


## Excess Risk Decomposition

### Error Decomposition

Consider the following decision functions:

* Risk minimizer (i.e. Bayes decision function) over all functions

\begin{equation}
f^{*} = \operatorname*{arg\,min}_{f} E[L(f(X),Y)]
\end{equation}

* Risk minimizer over all functions within hypothesis space $\mathcal{F}$

\begin{equation}
f_{\mathcal{F}} = \operatorname*{arg\,min}_{f\in \mathcal{F}} E[L(f(X),Y)]
\end{equation}

* Empirical Risk minimizer over all functions within hypothesis space $\mathcal{F}$

\begin{equation}
\hat{f}_n = \operatorname*{arg\,min}_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}
\end{equation}

![Risk Error Decomposition](resources/risk_error_decomposition.gif)

### Approximation Error (of $\mathcal{F}$)  =  $R(f_{\mathcal{F}}) - R(f^*)$
* Approximation error is a property of the class $\mathcal{F}$
* It is the penalty for restricting to $\mathcal{F}$ (rather than considering all possible functions)
* Bigger $\mathcal{F}$ means smaller approximation error.
* Approximation error is not random: it is a fixed quantity that does not depend on the training data.

### Estimation Error (of $\hat{f}_n \in \mathcal{F}$)  =  $R(\hat{f}_n) - R(f_{\mathcal{F}})$
* Note, $R(\hat{f}_n) = E[L(\hat{f}_n(X),Y)]$, where the expectation is over a new example $(X,Y)$ with $\hat{f}_n$ held fixed.
* Estimation error is a random variable: the data $(x_i,y_i), i = 1, 2, \dots, n$, used to compute $\hat{f}_n = \operatorname*{arg\,min}_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}$ are randomly sampled from the distribution $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$, so the risk $R(\hat{f}_n)$ depends on the training data.
* Estimation error is the performance hit for choosing $f$ using finite training data
* It is the performance hit for minimizing empirical risk rather than true risk
* With smaller $\mathcal{F}$ we expect smaller estimation error. 
* Under typical conditions: "With infinite training data, estimation error goes to zero"

### Optimization Error
* In practice, we often don't find the exact empirical risk minimizer $\hat{f}_n \in \mathcal{F}$.
  * For nice choices of loss functions and classes $\mathcal{F}$, we can get arbitrarily close to an empirical risk minimizer, but that takes time. Is it worth it?
  * For some hypothesis spaces (e.g. neural networks), we don't know how to find $\hat{f}_n \in \mathcal{F}$.
* Instead, in practice, we find $\tilde{f}_n \in \mathcal{F}$ that we hope is good enough.

Definition: If $\tilde{f}_n$ is the function our optimization method returns, and $\hat{f}_n$ is the empirical risk minimizer, then 

\begin{equation}
\text{Optimization Error } = R(\tilde{f}_n) - R(\hat{f}_n)
\end{equation}

* Note: optimization error can be negative, since $\tilde{f}_n$ may happen to have lower true risk than $\hat{f}_n$.
* But by definition of $\hat{f}_n$, the difference in empirical risk is non-negative:

\begin{equation}
\hat{R}_n(\tilde{f}_n) - \hat{R}_n(\hat{f}_n) \ge 0
\end{equation}


### Excess Risk
Definition: The **excess risk** compares the risk of $f$ to the Bayes optimal $f^*$

\begin{equation}
\text{Excess Risk}(f)=R(f) - R(f^*)
\end{equation}

Note: excess risk can never be negative, since $f^*$ minimizes the risk over all functions.

### Excess Risk Decomposition for ERM
#### The excess risk of the ERM $\hat{f}_n$ can be decomposed

\begin{equation}
\begin{aligned}
\text{Excess Risk}(\hat{f}_n) &= R(\hat{f}_n)-R(f^*) \\
&= \underbrace{R(\hat{f}_n)-R(f_{\mathcal{F}})}_{\text{Estimation Error}} + \underbrace{R(f_{\mathcal{F}}) - R(f^*)}_{\text{Approximation Error}}
\end{aligned}
\end{equation}

* Data scientist's job
  * choose $\mathcal{F}$ to balance between approximation and estimation error
  * as we get more training data, use a bigger $\mathcal{F}$.
  
#### The excess risk for function $\tilde{f}_n$ can be decomposed  

\begin{equation}
\begin{aligned}
\text{Excess Risk}(\tilde{f}_n) &= R(\tilde{f}_n)-R(f^*) \\
&= \underbrace{R(\tilde{f}_n)-R(\hat{f}_n)}_{\text{Optimization Error}} + \underbrace{R(\hat{f}_n)-R(f_{\mathcal{F}})}_{\text{Estimation Error}} + \underbrace{R(f_{\mathcal{F}}) - R(f^*)}_{\text{Approximation Error}}
\end{aligned}
\end{equation}
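
A small numerical illustration of the decomposition, assuming a made-up discrete distribution for which every risk can be computed exactly. The distribution, the threshold hypothesis space, and the three-point training set are all hypothetical. Because $\mathcal{F}$ is finite and searched exhaustively, the optimization error is zero in this sketch, so only the estimation and approximation terms appear.

```python
# Assumed distribution: X uniform on {0, 1, 2}, with P(Y = 1 | X = x) given below.
P_X = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
P_Y1 = {0: 0.1, 1: 0.8, 2: 0.3}

def risk(f):
    """Exact 0-1 risk: sum over x of P(x) * P(Y != f(x) | x)."""
    return sum(px * ((1 - P_Y1[x]) if f(x) == 1 else P_Y1[x])
               for x, px in P_X.items())

def bayes(x):
    """Bayes decision function: predict the more probable label at each x."""
    return 1 if P_Y1[x] > 0.5 else 0

def threshold(t):
    """Hypothetical hypothesis f_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

F = [threshold(t) for t in (0, 1, 2, 3)]

# Risk minimizer within F.
f_F = min(F, key=risk)

# Empirical risk minimizer within F on a tiny, made-up training sample.
train = [(0, 0), (1, 0), (2, 1)]

def empirical_risk(f):
    return sum(f(x) != y for x, y in train) / len(train)

f_hat = min(F, key=empirical_risk)

approximation_error = risk(f_F) - risk(bayes)
estimation_error = risk(f_hat) - risk(f_F)
excess_risk = risk(f_hat) - risk(bayes)
print(approximation_error, estimation_error, excess_risk)
```

The Bayes function here predicts $(0, 1, 0)$, which no threshold rule can represent, so the approximation error is positive. The unlucky sample makes ERM pick $t = 2$ instead of the best-in-class $t = 1$, which is the estimation error.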

### Function Approximation and Estimation

* The goal is to obtain a useful approximation to $f(x)$ for all $x$ in some region of $\mathbb{R}^p$ (the input space), given the representations in the training data $T$.
* This contrasts with the 'learning by example' paradigm.
* ESL Approach: Treating supervised learning as a problem in function approximation and estimation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem.


# Applications

## Regression Function

### Squared Error Loss
Let $X \in \mathbb{R}^p$ denote a real valued random input vector, and $Y \in \mathbb{R}$ a real valued random output variable, with joint distribution $P(X,Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$.

Consider the **squared error loss** function, where $L(f(x), y) = (y-f(x))^2$. We have the risk or **Expected Prediction Error (EPE in ESL)** as:

\begin{equation}
R(f) = E[(Y-f(X))^2] = \int [y-f(x)]^2 \, P(dx, dy)
\end{equation}

By conditioning on $X$, we can write the risk as

\begin{equation}
R(f) = E_XE_{Y|X}[(Y-f(X))^2|X]
\end{equation}

We see that it suffices to minimize the risk $R(f)$ pointwise:

\begin{equation}
f(x) = \operatorname*{arg\,min}_c E_{Y|X}[(Y-c)^2|X=x]
\end{equation}

The solution is the target function or Bayes decision function:

\begin{equation}
f(x) = E[Y|X=x]
\end{equation}

The conditional expectation is also known as the **regression function**. Thus the best prediction of $Y$ at any point $X=x$ is the conditional mean, when best is measured by average squared error.
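
One way to fill in the pointwise minimization step, writing $\mu(x)$ for $E[Y|X=x]$ (a shorthand introduced only for this derivation): expand the square around $\mu(x)$ and note that the cross term $2(\mu(x)-c)\,E[Y-\mu(x)|X=x]$ vanishes.

\begin{equation}
E[(Y-c)^2|X=x] = E[(Y-\mu(x))^2|X=x] + (\mu(x)-c)^2
\end{equation}

The first term does not depend on $c$, and the second is minimized by $c=\mu(x)$, which gives the conditional mean.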

### $L_1$ Loss

The loss function is the absolute error $L(f(x), y) = |y-f(x)|$, so the risk is $E|Y-f(X)|$. The Bayes function in this case is the conditional median

\begin{equation}
f(x) = \text{median}(Y|X=x)
\end{equation}

Its estimates are more robust than those for the conditional mean.
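
A quick numerical check of both claims, treating a small made-up sample as draws of $Y$ at one fixed $x$. The sample values and the integer search grid are assumptions for illustration only.

```python
import statistics

# Hypothetical draws of Y at a fixed x, with one outlier.
ys = [1, 2, 3, 4, 100]

def mean_abs_loss(c):
    """Average L1 loss when predicting the constant c."""
    return sum(abs(y - c) for y in ys) / len(ys)

def mean_sq_loss(c):
    """Average squared error loss when predicting the constant c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

# Brute-force search over candidate constants 0, 1, ..., 100.
grid = range(0, 101)
best_abs = min(grid, key=mean_abs_loss)
best_sq = min(grid, key=mean_sq_loss)

print(best_abs, statistics.median(ys))
print(best_sq, statistics.mean(ys))
```

The single outlier pulls the squared-loss minimizer (the mean) far from the bulk of the data, while the absolute-loss minimizer (the median) stays put.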

## Classification Function with 0-1 Loss
* The loss function is 0-1 loss.
* The Bayes Function or *Bayes classifier* is found to be

\begin{equation}
\begin{aligned}
f(x) &= \operatorname*{arg\,min}_{g \in \mathcal{G}} [1-P(Y=g|X=x)], \text{ or simply} \\
f(x) &= g_k \text{ if } P(Y=g_k|X=x) = \max_{g \in \mathcal{G}} P(Y=g|X=x)
\end{aligned}
\end{equation}

The error rate of the Bayes classifier is called the Bayes rate (i.e. Bayes Risk).

The Bayes classifier says that we classify to the most probable class, using the conditional (discrete) distribution $P(Y|X)$.
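
When the conditional distribution is known, the Bayes classifier and the Bayes rate can be read off directly. The sketch below assumes a made-up discrete input space and posterior table; the names `P_X` and `POSTERIOR` are hypothetical.

```python
# Assumed marginal distribution over three input values.
P_X = {"a": 0.5, "b": 0.3, "c": 0.2}

# Assumed conditional distribution P(Y = g | X = x) for three classes.
POSTERIOR = {
    "a": {"g1": 0.7, "g2": 0.2, "g3": 0.1},
    "b": {"g1": 0.1, "g2": 0.6, "g3": 0.3},
    "c": {"g1": 0.3, "g2": 0.3, "g3": 0.4},
}

def bayes_classifier(x):
    """Classify to the most probable class under P(Y | X = x)."""
    probs = POSTERIOR[x]
    return max(probs, key=probs.get)

# Bayes rate: expected 0-1 loss of the Bayes classifier,
# i.e. E_X[1 - max_g P(Y = g | X)].
bayes_rate = sum(P_X[x] * (1 - max(POSTERIOR[x].values())) for x in P_X)

for x in P_X:
    print(x, bayes_classifier(x))
print(bayes_rate)
```

Even the optimal rule errs here with probability $0.5 \cdot 0.3 + 0.3 \cdot 0.4 + 0.2 \cdot 0.6 = 0.39$, because the labels are not a deterministic function of the input.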

### Connection with the Bayes function for Regression Problem
Suppose we have a two-class problem and use the dummy-variable approach, followed by squared error loss estimation. Then

\begin{equation}
f(X)=E(Y|X)=P(Y=g_1|X) \quad \text{if } g_1 \text{ corresponds to } Y=1
\end{equation}

Likewise for a K-class problem

\begin{equation}
E(Y_k|X) = P(Y=g_k|X)
\end{equation}

This shows that our dummy-variable regression procedure, followed by classification to the largest fitted value, is another way of representing the Bayes classifier. 

### The Method of Least Squares


* The hypothesis space is the set of functions that are linear in the inputs:

\begin{equation}
\hat{f}(X) = X^T \hat{\beta}
\end{equation}

* The loss function is squared loss.
* We then minimize the empirical risk without the $\frac{1}{N}$ term, i.e.

\begin{equation}
\hat{\beta} = \operatorname*{arg\,min}_{\beta} RSS(\beta), \quad RSS(\beta)=\sum_{i=1}^{N}(y_i - x^T_i\beta)^2
\end{equation}

This is normally called the **Residual Sum of Squares** (RSS) in linear regression.

The optimal value $\hat{\beta}$ gives us the ERM, i.e. $\hat{f}(X) = X^T \hat{\beta}$.

* The minimization can be done analytically. By differentiating w.r.t. $\beta$, we obtain the *normal equations*:

\begin{equation}
\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0
\end{equation}

where $\mathbf{X}$ is an $N \times p$ matrix and $\mathbf{y}$ is an $N$-vector.

If $\mathbf{X}^T\mathbf{X}$ is nonsingular, then the solution is unique, and it is given by:

\begin{equation}
\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.
\end{equation}

* If we substitute the linear model into the risk formula (i.e. EPE) and try to find the best decision function within the linear class, we have

\begin{equation}
\beta = [E(XX^T)]^{-1}E(XY)
\end{equation}

* Compare this with the general Bayes decision function (without a linear hypothesis space assumption).
  * We have not conditioned on $X$; rather we have used our knowledge of the functional relationship to pool over values of $X$. The least squares solution from ERM amounts to replacing the expectations above by averages over the training data.
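
The closed-form least squares solution can be worked through by hand for a small case. The sketch below assumes a single input plus an intercept, so that the design matrix has two columns and the $2 \times 2$ system can be inverted explicitly. The data and the helper name `fit_least_squares` are made up for illustration.

```python
def fit_least_squares(xs, ys):
    """Solve the normal equations for the model y ~ b0 + b1 * x."""
    rows = [(1.0, x) for x in xs]  # each row of the design matrix is (1, x_i)

    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)

    # Entries of the 2-vector X^T y = (u, v).
    u = sum(r[0] * y for r, y in zip(rows, ys))
    v = sum(r[1] * y for r, y in zip(rows, ys))

    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular; the solution is not unique")

    # beta_hat = (X^T X)^{-1} X^T y, using the explicit 2x2 inverse.
    return ((d * u - b * v) / det, (a * v - b * u) / det)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
b0, b1 = fit_least_squares(xs, ys)

# Check the normal equations: the residuals are orthogonal to both columns.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
normal_eq = (sum(residuals), sum(x * r for x, r in zip(xs, residuals)))
print(b0, b1, normal_eq)
```

Each sum over the training data is, up to the factor $1/N$, the sample counterpart of one of the expectations in the population formula.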

### The Nearest-Neighbor Methods (KNN)

* What's the hypothesis space for the nearest-neighbor method?
  * Locally constant (i.e. piecewise constant) functions with knots determined by the training data.
  * Can we have a hypothesis space that depends on $k$?
    * Probably yes: the prediction function $f(x)$ can be viewed as a piecewise constant function on $N/k$ non-overlapping regions (a strong assumption?), i.e. with $N/k-1$ knots. For example, when $k=1$ there are $N$ regions, and when $k=2$ there are $N/2$ regions. The positions of the knots and the function values on each region are determined by the empirical risk minimization procedure.
    * The hypothesis space with $N$ regions is a superset of the hypothesis space with $N/2$ regions: each region of a function in the latter can be split in half while keeping the same value, which gives a function with $N$ regions, but the reverse is not possible in general. Hence any function in the $N/2$-region space is also in the $N$-region space, and we have: $\mathcal{F}_{N} \subseteq \mathcal{F}_{N-1} \subseteq \dots \subseteq \mathcal{F}_i \subseteq \dots \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_1$
    
    where the subscript $i$ is the number of points used to compute the average, i.e. $k$.
* What's the loss function?
  * 0-1 loss for classification (majority vote); squared error loss for regression (local average).
* Can we derive the ERM for the nearest-neighbor method?
  * Seems to be straightforward for $k=1$.
  * When $k>1$, if we use a hypothesis space that depends on $k$, it's easy to see that the ERM on each region should be the average.
  
* The prediction function for KNN is

\begin{equation}
\hat{f}(x)=\frac{1}{k}\sum_{x_i \in N_k(x)} y_i
\end{equation}

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
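
A minimal sketch of this prediction function for one-dimensional inputs, assuming absolute difference as the distance and a made-up training sample. Ties in distance are broken by training order, which is an arbitrary choice of this sketch.

```python
def knn_predict(x, train, k):
    """Average the outcomes of the k training points closest to x."""
    # sorted() is stable, so equidistant points keep their training order.
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

# Toy training sample (x_i, y_i), made up for illustration.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0), (4.0, 16.0)]

print(knn_predict(2.1, train, k=1))  # uses only the point at x = 2.0
print(knn_predict(2.1, train, k=3))  # averages the points at x = 2.0, 3.0, 1.0
print(knn_predict(2.1, train, k=5))  # averages the whole sample

# With k = 1, every training point is its own nearest neighbor.
train_errors_k1 = [knn_predict(x, train, 1) - y for x, y in train]
print(train_errors_k1)
```

With $k=1$ the fit reproduces every training outcome, and with $k=N$ it collapses to the global average, which matches the two extremes of model complexity.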

* The error on the training data should be approximately an increasing function of $k$, and will always be $0$ for $k=1$.
* The *effective* number of parameters of k-nearest neighbors is $N/k$ and is generally bigger than $p$ in least-squares fits, and decreases with increasing $k$.
* KNN attempts to approximate the Bayes decision function, i.e. $f(x)=E(Y|X=x)$, using the training data. Since the fitted function depends on the sample, it plays the role of $\hat{f}_n$ rather than $f_{\mathcal{F}}$.
  * Expectation is approximated by averaging over sample data.
  * Conditioning at a point is relaxed to conditioning on some region "close" to the target point.
* Under mild regularity conditions on the joint probability distribution $P(X,Y)$, one can show that as $N,k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to E(Y|X=x)$.
* However:
  * We often do not have very large samples. Linear or other structured models can then be more stable than KNN.
  * As dimension $p$ gets large, so does the metric size of the $k$-nearest neighborhood. The convergence above still holds, but the rate of convergence decreases as the dimension increases.
  
* Compare KNN with the Bayes classifier for the classification problem (derived above): the KNN classifier directly approximates the Bayes classifier. A majority vote in a nearest neighborhood amounts to exactly this, except that conditional probability at a point is relaxed to conditional probability within a neighborhood of a point, and probabilities are estimated by training-sample proportions.

#### Curse of Dimensionality
* If the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors.
  * In high dimension, capturing even a small fraction of the data to form a local average requires a neighborhood that spans most of the range of each input. For example, for inputs uniformly distributed in a $p=10$ dimensional unit hypercube, a hypercubical neighborhood holding 10% of the volume needs an expected edge length of $0.1^{1/10} \approx 0.80$, i.e. 80% of the range of each dimension. Such neighborhoods are no longer "local". Reducing the neighborhood size instead leaves fewer points to average, which increases the variance of our fit.
  * In high dimension, all sample points are close to an edge of the sample. Most data points are closer to the boundary of the sample space than to any other data point. Prediction is much more difficult near the edges of the training sample, since one must extrapolate from neighboring sample points rather than interpolate between them.
  * In high dimension, the sampling density is proportional to $N^{1/p}$, where $p$ is the dimension of the input space and $N$ is the sample size. Thus all feasible training samples sparsely populate the input space. You need an enormous amount of data to have a meaningful sampling density.
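
The edge-length and sampling-density arguments above are easy to reproduce. The sketch assumes inputs uniformly distributed in a $p$-dimensional unit hypercube, where a sub-cube capturing a fraction $r$ of the volume has edge length $r^{1/p}$. The function names are hypothetical.

```python
def edge_length(fraction, p):
    """Edge length of a sub-cube holding `fraction` of a unit p-cube's volume."""
    return fraction ** (1.0 / p)

def samples_for_same_density(n_one_dim, p):
    """Sample size in p dimensions matching the density of n_one_dim points in 1-D.

    Sampling density is proportional to N**(1/p), so N must grow as n_one_dim**p.
    """
    return n_one_dim ** p

for p in (1, 2, 10):
    print(p, edge_length(0.01, p), edge_length(0.10, p))

print(samples_for_same_density(100, 10))
```

In ten dimensions, capturing 1% of the data already needs about 63% of the range of each input, and 10% needs about 80%. Matching the density of 100 points on a line would take $100^{10}$ points.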

* By imposing some heavy restrictions on the class of models being fitted, we can reduce both the bias and the variance of the estimates, provided the restrictions are appropriate for the problem.
  * The complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as functions in low dimensions, then we need the size of our training set to grow exponentially as well.


### Relationships to Other Methods
* Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective $0/1$ weights used by k-nearest neighbors. 
* Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
* Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
* Projection pursuit and neural network models consist of sums of non-linearly transformed linear models.

### Additive Models
* Additive Error Model

\begin{equation}
Y=f(X)+\epsilon
\end{equation}

where the random error $\epsilon$ has $E(\epsilon)=0$ and is independent of $X$, and $f(X)$ is a deterministic function. For most systems the input-output pairs $(X,Y)$ will not have a deterministic relationship. The additive model assumes that we can capture all these departures from a deterministic relationship via the error $\epsilon$.

  * Some problems do have a deterministic relationship $Y=f(X)$. Many classification problems studied in machine learning are of this form; there the randomness enters through the $x$ location of the training points.
  

* The hypothesis space of additive models is:

\begin{equation}
f(X) = \sum_{j=1}^{p}f_j(X_j)
\end{equation}

This retains the additivity of the linear model, but each coordinate function $f_j$ is arbitrary.



