# Decision Theory
Decision theory is about finding "optimal" actions, under various definitions of optimality.


## Typical Sequence of Events
* Many problem domains can be formalized as follows:
  * Observe input $x$
  * Take action $a$
  * Observe outcome $y$
    * Outcome $y$ is often **independent** of action $a$
    * But this is **not always the case**:
      * search result ranking
      * automated driving
      * stock market predictions from analysts might affect market movement
  * Evaluate action in relation to the outcome: $L(a,y)$
  
## The Three Spaces
* Input space: $\mathcal X$
* Action space: $\mathcal A$
* Outcome space: $\mathcal Y$

### Action
* Definition: An **action** is the generic term for what our system produces. (Here "the system" can be read as the decision function, or prediction function, defined below.)

* Examples of Actions
  * Produce a $0/1$ classification [classical ML]
  * Reject hypothesis that $\theta = 0$ [classical Statistics]
  * Written English text [image captioning, speech recognition, machine translation]
  
### Decision Function
Definition: A **decision function** (or **prediction function**) takes an input $x \in \mathcal{X}$ and produces an action $a \in \mathcal{A}$:

\begin{equation}
f: \mathcal{X} \to \mathcal{A}
\end{equation}

\begin{equation}
x \mapsto f(x)
\end{equation}

### Loss Function
Definition: A **loss function** evaluates an action in the context of the outcome $y$:

\begin{equation}
L: \mathcal{A} \times \mathcal{Y} \to \mathbb{R}
\end{equation}

\begin{equation}
(a,y) \mapsto L(a,y)
\end{equation}
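
As a minimal sketch of these two definitions, a decision function and a loss function can be written as plain Python functions. The threshold rule, the label set $\{0, 1\}$, and the helper name `evaluate` are illustrative assumptions, not part of the definitions above.

```python
# Hypothetical setting: inputs are real numbers, actions and outcomes are labels in {0, 1}.

def threshold_classifier(x: float) -> int:
    """A decision function f: X -> A; takes action 1 when x is positive."""
    return 1 if x > 0 else 0


def zero_one_loss(a: int, y: int) -> float:
    """A loss L: A x Y -> R; 0 when the action matches the outcome, 1 otherwise."""
    return 0.0 if a == y else 1.0


def squared_loss(a: float, y: float) -> float:
    """Another common loss, used when actions and outcomes are real numbers."""
    return (a - y) ** 2


def evaluate(f, loss, x, y):
    """Observe x, take action f(x), observe outcome y, and score the action."""
    return loss(f(x), y)


print(evaluate(threshold_classifier, zero_one_loss, 2.5, 1))
print(evaluate(threshold_classifier, zero_one_loss, -1.0, 1))
```

Note that the loss only ever sees the action and the outcome; the input $x$ enters only through the decision function.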

## Formalizing a "Data Science" Problem
1. First two steps to formalizing a problem  
  1. Define the *action space* (i.e. the set of possible actions)
  1. Specify the evaluation criterion.
1. When a "stakeholder" asks the data scientist to solve a problem, she
  1. may have an opinion on what the action space should be, and
  1. hopefully has an opinion on the evaluation criterion, but
  1. she really cares about your **producing a "good" decision function**.
1. Typical sequence:
  1. Stakeholder presents problem to data scientist
  1. Data scientist produces decision function.
  1. Engineer deploys "industrial strength" version of decision function.

## Evaluating a Decision Function
* Loss function $L$ only evaluates a single action
* How to evaluate the decision function as a whole? (Answer: Statistical Learning Theory)

***

# Statistical Learning Theory

## A Simplifying Assumption
* Assume action has no effect on the output
* Assume there is a data generating distribution $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.
* All input/output pairs $(x,y)$ are generated i.i.d. from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.
  * no covariate shift
  * no concept drift
* Want decision function $f(x)$ that generally "does well on average":
\begin{equation}
L(f(x), y) \;\;\;\text{ is usually small, in some sense}
\end{equation}

## Risk of a Decision Function
Definition: Given a decision function (or prediction function) $f: \mathcal{X} \to \mathcal{A}$, the **risk** of this decision function is defined as:

\begin{equation}
R(f) = E[L(f(x), y)]
\end{equation}

where $L$ is the **loss function** and the expectation is taken over $(x,y)$ drawn from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.

In words, it's the **expected loss** of $f$ on a new example $(x,y)$ drawn randomly from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.

* We usually don't know $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$, so we cannot compute the expectation. But we can estimate it.

### The Bayes Decision Function
Definition: A **Bayes decision function** $f^* : \mathcal{X} \to \mathcal{A}$ is a function that achieves the *minimal risk* among all possible functions:

\begin{equation}
f^* = argmin_f{R(f)}
\end{equation}

where the minimum is taken over all functions from $\mathcal{X}$ to $\mathcal{A}$.

* The risk of a Bayes decision function is called the **Bayes Risk**.
  * There can be multiple Bayes decision functions that achieve the same minimal risk.
* A Bayes decision function is often called the "target function", since it's the best decision function we can possibly produce.
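
When $\mathcal{X}$, $\mathcal{A}$ and $\mathcal{Y}$ are small finite sets and the distribution is known, a Bayes decision function can be found by brute force. The sketch below assumes a made-up joint distribution `P` on $\{0,1\}\times\{0,1\}$ and the 0/1 loss; both are illustrative choices only.

```python
from itertools import product

# Hypothetical joint distribution P(x, y) on X = {0, 1}, Y = {0, 1}.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
X_SPACE = (0, 1)
ACTIONS = (0, 1)


def zero_one_loss(a, y):
    return 0.0 if a == y else 1.0


def risk(f):
    """Exact risk R(f) = sum over (x, y) of P(x, y) * L(f(x), y)."""
    return sum(p * zero_one_loss(f[x], y) for (x, y), p in P.items())


# Enumerate every function f: X -> A, each represented as a dict {x: action}.
all_functions = [
    dict(zip(X_SPACE, actions))
    for actions in product(ACTIONS, repeat=len(X_SPACE))
]

bayes_f = min(all_functions, key=risk)  # a Bayes decision function
bayes_risk = risk(bayes_f)              # the Bayes risk

print(bayes_f, bayes_risk)
```

In realistic problems neither step is available: the distribution is unknown and the set of all functions is far too large to enumerate.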


### Regression Function and Squared Error Loss
Let $X \in \mathbb{R}^p$ denote a real-valued random input vector and $Y \in \mathbb{R}$ a real-valued random output variable, with joint distribution $P(X,Y)$.

Consider the **squared error loss** function, $L(f(x), y) = (y-f(x))^2$. The risk is then:

\begin{equation}
R(f) = E[(Y-f(X))^2] = \int [y-f(x)]^2 \, P(dx,dy)
\end{equation}

By conditioning on $X$, we can write the risk as

\begin{equation}
R(f) = E_XE_{Y|X}[(Y-f(X))^2|X]
\end{equation}

We see that it suffices to minimize $R(f)$ pointwise, that is, separately for each value $x$:

\begin{equation}
f(x) = argmin_cE_{Y|X}[(Y-c)^2|X=x]
\end{equation}

The solution is

\begin{equation}
f(x) = E[Y|X=x]
\end{equation}

the conditional expectation, also known as the **regression function**. Thus the best prediction of $Y$ at any point $X=x$ is the conditional mean, when best is measured by average squared error.
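
One way to see why the conditional mean is the pointwise minimizer is the following expansion. It is a sketch that assumes $Y$ has a finite second moment, and it writes $m(x) = E[Y|X=x]$ as shorthand for this derivation only. For any constant $c$:

\begin{equation}
E[(Y-c)^2|X=x] = E[(Y-m(x))^2|X=x] + 2(m(x)-c)E[Y-m(x)|X=x] + (m(x)-c)^2
\end{equation}

The middle term vanishes because $E[Y-m(x)|X=x] = 0$, which leaves

\begin{equation}
E[(Y-c)^2|X=x] = Var(Y|X=x) + (m(x)-c)^2
\end{equation}

The first term does not depend on $c$, and the second is minimized (at zero) by $c = m(x)$.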

## The Empirical Risk Functional
### The Empirical Risk of a Decision Function
Let $\mathcal{D}_n=((x_1,y_1),\dots,(x_n,y_n))$ be drawn i.i.d. from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$.

Definition: The **empirical risk** of $f:\mathcal{X}\to \mathcal{A}$ with respect to $\mathcal{D}_n$ is:

\begin{equation}
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n}L(f(x_i),y_i)
\end{equation}

For a fixed $f$ with finite risk, the Strong Law of Large Numbers gives

\begin{equation}
\lim_{n\to \infty}\hat{R}_n(f)=R(f) \;\;\;\text{ almost surely.}
\end{equation}
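
The convergence can be observed numerically. The sketch below assumes a made-up data generating distribution ($X$ uniform on $[0,1]$, $Y = 2X$ plus Gaussian noise with standard deviation $0.5$) and a fixed decision function $f(x) = 2x$, for which the true risk under squared loss is $0.5^2 = 0.25$. The function names are illustrative.

```python
import numpy as np


def squared_loss(a, y):
    return (a - y) ** 2


def empirical_risk(f, loss, xs, ys):
    """Average loss of f over the sample: (1/n) * sum_i L(f(x_i), y_i)."""
    return float(np.mean(loss(f(xs), ys)))


def sample(n, rng):
    """Hypothetical distribution: X ~ U(0, 1), Y = 2X + N(0, 0.5^2)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return x, y


def f(x):
    """A fixed decision function; its true risk here is 0.25."""
    return 2.0 * x


rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    xs, ys = sample(n, rng)
    print(n, empirical_risk(f, squared_loss, xs, ys))
```

The printed values should fluctuate noticeably for small $n$ and settle near $0.25$ as $n$ grows.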

### Empirical Risk Minimization (ERM)
Definition: A function $\hat{f}$ is an **empirical risk minimizer** if 

\begin{equation}
\hat{f} = argmin_f{\hat{R}_n(f)}
\end{equation}

where the minimum is taken over all functions from $\mathcal{X}$ to $\mathcal{A}$.


### Constrained Empirical Risk Minimization (CERM)
* Unconstrained ERM can return a function $f$ that just memorizes the data.
* How to spread information or "generalize" from training inputs to new inputs?
* Need to smooth things out somehow...
  * A lot of modeling is about spreading and extrapolating information from one part of the input space $\mathcal{X}$ into unobserved parts of the space.
* One approach: "Constrained ERM"
  * Instead of minimizing empirical risk over all decision functions,
  * constrain to a particular subset, called a **hypothesis space**.
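
The memorization problem can be made concrete with a small sketch. It assumes a made-up distribution ($X$ uniform on $[0,1]$, $Y = 2X$ plus noise) and squared loss, and builds a hypothetical "memorizer" that returns the stored $y$ for each training input and a default value everywhere else.

```python
import numpy as np


def sample(n, rng):
    """Hypothetical distribution: X ~ U(0, 1), Y = 2X + N(0, 0.5^2)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return x, y


def make_memorizer(xs, ys, default=0.0):
    """Return f with f(x_i) = y_i on the training inputs and `default` elsewhere."""
    table = dict(zip(xs.tolist(), ys.tolist()))
    return lambda x: table.get(x, default)


def empirical_risk(f, xs, ys):
    """Average squared loss of f over the given sample."""
    return float(np.mean([(f(x) - y) ** 2 for x, y in zip(xs.tolist(), ys.tolist())]))


rng = np.random.default_rng(0)
x_train, y_train = sample(50, rng)
x_test, y_test = sample(5_000, rng)

f_mem = make_memorizer(x_train, y_train)
train_risk = empirical_risk(f_mem, x_train, y_train)  # zero: a minimizer of empirical risk
test_risk = empirical_risk(f_mem, x_test, y_test)     # large: nothing was learned

print(train_risk, test_risk)
```

The memorizer is an empirical risk minimizer over all functions, yet on fresh data it does no better than always predicting the default value.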

#### Hypothesis Spaces 
Definition: A **hypothesis space** $\mathcal{F}$ is a set of decision functions mapping $\mathcal{X} \to \mathcal{A}$. It is the collection of decision functions we are considering.

#### CERM
* **Empirical Risk Minimizer** (ERM) in $\mathcal{F}$ is

\begin{equation}
\hat{f}_n = argmin_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}
\end{equation}

* **Risk minimizer** in $\mathcal{F}$ is $f_{\mathcal{F}} \in \mathcal{F}$, where

\begin{equation}
f_{\mathcal{F}} = argmin_{f\in \mathcal{F}}E[L(f(x),y)]
\end{equation}

### Procedure of ERM
* Given a loss function $L:\mathcal{A}\times \mathcal{Y} \to \mathbb{R}$
* Choose hypothesis space $\mathcal{F}$
* Use an optimization method to find ERM $\hat{f}_n \in \mathcal{F}$
  
\begin{equation}
\hat{f}_n=argmin_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}
\end{equation}
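
The three steps can be sketched for one concrete choice: squared loss, the hypothesis space of affine functions $\mathcal{F} = \{x \mapsto wx + b\}$, and least squares as the optimization method. The data generating process and the name `fit_linear_erm` are assumptions made for illustration.

```python
import numpy as np


def squared_loss(a, y):
    # Step 1: the loss function.
    return (a - y) ** 2


def empirical_risk(w, b, xs, ys):
    """Empirical risk of the affine function x -> w*x + b."""
    return float(np.mean(squared_loss(w * xs + b, ys)))


def fit_linear_erm(xs, ys):
    """Steps 2 and 3: restrict to F = {x -> w*x + b} and minimize empirical risk.

    For squared loss this is ordinary least squares, solved with numpy.linalg.lstsq.
    """
    design = np.column_stack([xs, np.ones_like(xs)])
    (w, b), *_ = np.linalg.lstsq(design, ys, rcond=None)
    return float(w), float(b)


# Hypothetical training data: Y = 2X + 1 plus small Gaussian noise.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=500)
ys = 2.0 * xs + 1.0 + rng.normal(0.0, 0.1, size=500)

w_hat, b_hat = fit_linear_erm(xs, ys)
print(w_hat, b_hat, empirical_risk(w_hat, b_hat, xs, ys))
```

For other losses or hypothesis spaces there is usually no closed-form solver, and step 3 becomes an iterative optimization.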


## Excess Risk Decomposition

### Error Decomposition

Consider the following decision functions:

* Risk minimizer (i.e. Bayes decision function) over all functions

\begin{equation}
f^{*}=argmin_{f} E[L(f(X),Y)]
\end{equation}

* Risk minimizer over all functions within hypothesis space $\mathcal{F}$

\begin{equation}
f_{\mathcal{F}} = argmin_{f\in \mathcal{F}} E[L(f(X),Y)]
\end{equation}

* Empirical Risk minimizer over all functions within hypothesis space $\mathcal{F}$

\begin{equation}
\hat{f}_n = argmin_{f\in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}{L(f(x_i),y_i)}
\end{equation}

![Risk Error Decomposition](resources/risk_error_decomposition.gif)

### Approximation Error (of $\mathcal{F}$)  =  $R(f_{\mathcal{F}}) - R(f^*)$
* Approximation error is a property of the class $\mathcal{F}$
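* It is the penalty for restricting to $\mathcal{F}$ (rather than considering all possible functions)
* A bigger $\mathcal{F}$ (a superset) means smaller, or at least no larger, approximation error.
* Approximation error is not random: it is a fixed, nonnegative quantity determined by $\mathcal{F}$ and the data distribution.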

### Estimation Error (of $\hat{f}_n \in \mathcal{F}$)  =  $R(\hat{f}_n) - R(f_{\mathcal{F}})$
* Note that $R(\hat{f}_n) = E[L(\hat{f}_n(X),Y)]$, where the expectation is over a new example $(X,Y)$ with the training data held fixed.
* Estimation error is a random variable: $\hat{f}_n$ is computed from the training data $(x_i,y_i)$, $i = 1, 2, \dots, n$, which is randomly sampled from $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$, so the risk $R(\hat{f}_n)$ depends on the training data.
* Estimation error is the performance hit for choosing $f$ using finite training data
* It is the performance hit for minimizing empirical risk rather than true risk
* With smaller $\mathcal{F}$ we expect smaller estimation error. 
* Under typical conditions: "With infinite training data, estimation error goes to zero"

### Optimization Error
* In practice, we often don't find the exact empirical risk minimizer $\hat{f}_n \in \mathcal{F}$
  * For nice choices of loss functions and classes $\mathcal{F}$, we can get arbitrarily close to an empirical risk minimizer, but that takes time. Is it worth it?
  * For some hypothesis spaces (e.g. neural networks), we don't know how to find $\hat{f}_n \in \mathcal{F}$
* Instead, in practice, we find $\tilde{f}_n \in \mathcal{F}$ that we hope is good enough

Definition: If $\tilde{f}_n$ is the function our optimization method returns, and $\hat{f}_n$ is the empirical risk minimizer, then 

\begin{equation}
\text{Optimization Error } = R(\tilde{f}_n) - R(\hat{f}_n)
\end{equation}

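* Note: optimization error can be negative, because $\hat{f}_n$ minimizes the empirical risk rather than the true risk, so $\tilde{f}_n$ may happen to have lower true risk.
* But by definition of $\hat{f}_n$, in terms of empirical risk we have

\begin{equation}
\hat{R}_n(\tilde{f}_n) - \hat{R}_n(\hat{f}_n) \ge 0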
\end{equation}
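
The empirical-risk inequality above can be illustrated with a sketch in which the exact minimizer is available for comparison. It assumes squared loss over affine functions, a made-up dataset, and plain gradient descent stopped early as the "practical" optimizer; the step size and step counts are arbitrary choices.

```python
import numpy as np

# Hypothetical training data: Y = 2X + 1 plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=200)
design = np.column_stack([x, np.ones_like(x)])


def empirical_risk(params):
    """Empirical squared-loss risk of the affine function with parameters (w, b)."""
    return float(np.mean((design @ params - y) ** 2))


# Exact empirical risk minimizer over F = {x -> w*x + b}, via least squares.
f_hat, *_ = np.linalg.lstsq(design, y, rcond=None)


def gradient_descent(steps, lr=0.1):
    """Approximate minimizer: a fixed number of gradient steps from zero."""
    params = np.zeros(2)
    for _ in range(steps):
        grad = 2.0 * design.T @ (design @ params - y) / len(y)
        params = params - lr * grad
    return params


f_tilde = gradient_descent(steps=5)  # stopped early, so not yet at the minimizer
gap = empirical_risk(f_tilde) - empirical_risk(f_hat)
print(gap)
```

The gap in empirical risk is nonnegative and shrinks as more steps are taken. The optimization error itself is defined with the true risk $R$, which this sketch does not compute.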


### Excess Risk
Definition: The **excess risk** compares the risk of $f$ to that of the Bayes decision function $f^*$:

\begin{equation}
\text{Excess Risk}(f)=R(f) - R(f^*)
\end{equation}

Note: excess risk can never be negative, since $f^*$ minimizes the risk over all functions.

### Excess Risk Decomposition for ERM
#### The excess risk of the ERM $\hat{f}_n$ can be decomposed

\begin{equation}
\begin{aligned}
\text{Excess Risk}(\hat{f}_n) &= R(\hat{f}_n)-R(f^*) \\
&= \underbrace{[R(\hat{f}_n)-R(f_{\mathcal{F}})]}_{\text{Estimation Error}} + \underbrace{[R(f_{\mathcal{F}}) - R(f^*)]}_{\text{Approximation Error}}
\end{aligned}
\end{equation}

* Data scientist's job
  * choose $\mathcal{F}$ to balance between approximation and estimation error
  * as we get more training data, use a bigger $\mathcal{F}$.
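
The trade-off can be explored in a simulation where the Bayes decision function is known. The sketch below assumes a made-up distribution ($X$ uniform on $[-1,1]$, $Y = \sin(3X)$ plus Gaussian noise), squared loss, and polynomial hypothesis spaces of a given degree. Two approximations are used and should be kept in mind: risks are estimated on a large held-out sample, and $f_{\mathcal{F}}$ is replaced by a fit on a very large sample.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_SD = 0.3  # hypothetical noise level; the Bayes risk is then NOISE_SD**2


def bayes_f(x):
    """E[Y | X = x] for the hypothetical distribution below."""
    return np.sin(3.0 * x)


def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, bayes_f(x) + rng.normal(0.0, NOISE_SD, size=n)


x_test, y_test = sample(200_000)  # large held-out sample used to approximate R(.)
x_big, y_big = sample(100_000)    # large sample used to approximate f_F


def risk(predict):
    """Monte Carlo estimate of the squared-loss risk of `predict`."""
    return float(np.mean((y_test - predict(x_test)) ** 2))


def fit_poly(x, y, degree):
    """ERM over polynomials of the given degree (least squares via numpy.polyfit)."""
    coeffs = np.polyfit(x, y, degree)
    return lambda t: np.polyval(coeffs, t)


def decompose(degree, n):
    f_F = fit_poly(x_big, y_big, degree)   # stand-in for the risk minimizer in F
    f_hat = fit_poly(*sample(n), degree)   # ERM on n training points
    approximation = risk(f_F) - risk(bayes_f)
    estimation = risk(f_hat) - risk(f_F)
    return approximation, estimation


for degree in (1, 5):
    for n in (20, 2000):
        approx, est = decompose(degree, n)
        print(f"degree={degree} n={n} approximation={approx:.4f} estimation={est:.4f}")
```

Under these assumptions, the higher-degree space should show a much smaller approximation error, while its estimation error should be visibly larger at $n = 20$ than at $n = 2000$.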
  
#### The excess risk for function $\tilde{f}_n$ can be decomposed  

\begin{equation}
\begin{aligned}
\text{Excess Risk}(\tilde{f}_n) &= R(\tilde{f}_n)-R(f^*) \\
&= \underbrace{[R(\tilde{f}_n)-R(\hat{f}_n)]}_{\text{Optimization Error}} + \underbrace{[R(\hat{f}_n)-R(f_{\mathcal{F}})]}_{\text{Estimation Error}} + \underbrace{[R(f_{\mathcal{F}}) - R(f^*)]}_{\text{Approximation Error}}
\end{aligned}
\end{equation}
