$\DeclareMathOperator*{\argmin}{argmin} $

Notes from [this excellent article](http://www.statsathome.com/2017/10/12/bayesian-decision-theory-made-ridiculously-simple/)

## Formalizing Decisions
The article starts, as so many math papers do, by laying out definitions. Justin introduces several useful terms and notations, which are collected here.

| Term | Notation | Definition |
|:----:|:--------:|:----------:|
| Decision Space | $\mathcal{A}$ | The space of all possible decisions |
| Decision | $a \in \mathcal{A}$ | A particular decision drawn from the decision space |
| Information Space | $\Theta$ | The space of all information for a particular decision |
| A piece of information | $\theta \in \Theta$ | A particular piece of information |
| Beliefs on $\theta$ | $p(\theta)$ | The probability distribution reflecting our beliefs about the value of $\theta$ when it is uncertain |

### Examples of Decisions
* If I am trying to decide on a price for a cell phone, then the decision space might be the non-negative real numbers, $\mathcal{A} = [0, +\infty)$, with each price $a \in \mathcal{A}$.
* If I am trying to decide between two brands of cereal, then the decision space might simply be the set $\mathcal{A} = \{a_1, a_2\}$.

### Examples of Information
* In the cell phone case, we might develop a model based on previous online listings to predict the probability that the phone will be sold at a given price. In this case $\Theta = [0, 1]$, and $\theta \in \Theta$ would be the probability of a sale at a given price.
* For breakfast cereal, we might use the grams of sugar per serving as a piece of information. In this case we would have $\Theta = \mathcal{R}^{2+}$, the non-negative quadrant of 2-dimensional real space, where $\theta_1$ and $\theta_2$ are the amounts of sugar in the two cereals.

### The Loss Function
At this point we know what our decisions are and we have information with which to make them. However, we don't yet have a way of determining which decision is best. That is the job of the loss function $\mathcal{L}$. The purpose of $\mathcal{L}$ is to quantify how good or bad a given decision $a$ is given some information $\theta$, typically as a real number. We can thus think of loss, utility, and acquisition functions alike as a mapping $$\mathcal{L}:\Theta \times \mathcal{A} \rightarrow \mathcal{R}$$

Defining a reasonable loss function is often one of the hardest parts of this problem, because it is subjective and has to capture everything that is meaningful to you. 

### Loss Function Examples
* Take the cell phone example. If my goal is to maximize my return, I might define the loss function as $\mathcal{L}(\theta, a) = -\theta a$, the negative of the expected revenue at price $a$ when the probability of a sale is $\theta$. The negative sign is there because we want to *minimize* $\mathcal{L}$: as the expected revenue $\theta a$ increases, $\mathcal{L}$ becomes more negative.
* For the decision between two cereals, if we want to eat as little sugar as possible, we might use a fairly simple loss function that returns the sugar content of the chosen cereal:
$$ \mathcal{L}(\theta_1, \theta_2, a) = \left\{
\begin{array}{ll}
      \theta_1 & \text{if } a = a_1 \\
      \theta_2 & \text{if } a = a_2
\end{array} 
\right.$$
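
The two loss functions above can be sketched in Python as follows. This is only an illustration: the function names, the string labels `"a1"` and `"a2"`, and the numbers in the usage lines are assumptions made for the example, not something taken from the article.

```python
def phone_loss(theta: float, price: float) -> float:
    """Cell phone loss L(theta, a) = -theta * a (negative expected revenue)."""
    return -theta * price


def cereal_loss(theta_1: float, theta_2: float, action: str) -> float:
    """Cereal loss: the sugar content (grams per serving) of the chosen cereal."""
    if action == "a1":
        return theta_1
    if action == "a2":
        return theta_2
    raise ValueError(f"unknown decision: {action!r}")


# Hypothetical inputs, chosen only for illustration.
print(phone_loss(theta=0.5, price=200.0))  # -100.0
print(cereal_loss(12.0, 9.0, "a2"))        # 9.0
```

Both functions take the information first and the decision last, mirroring the signature $\mathcal{L}:\Theta \times \mathcal{A} \rightarrow \mathcal{R}$.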

### Building in Uncertainty
In practice, we often don't have perfect information for the decision we would like to make. Instead, we only have some beliefs about what the value of a piece of information might be. The framework handles this nicely by expressing that information as a probability distribution $p(\theta)$. Where we would previously have minimized the loss function directly, we now minimize our expectation of the loss: $$\text{Expected Loss}(a) = \int_{\Theta} \mathcal{L}(\theta, a)p(\theta)d\theta$$
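
As a quick worked case, take the cell phone loss $\mathcal{L}(\theta, a) = -\theta a$ from the examples earlier. Because that loss is linear in $\theta$, the integral reduces to the mean of our beliefs about the sale probability at price $a$ (assuming that mean exists): $$\text{Expected Loss}(a) = \int_{\Theta} -\theta a \, p(\theta)d\theta = -a \int_{\Theta} \theta \, p(\theta)d\theta = -a \, \mathbb{E}[\theta]$$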

One could try to evaluate this integral analytically, but in these notes we will always approach it computationally. We draw $N$ samples $(\theta^{(1)}, \ldots, \theta^{(N)})$ from the distribution $p(\theta)$ and approximate the integral above with the Monte Carlo average $$\text{Expected Loss}(a) \approx \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(\theta^{(n)}, a)$$
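
A minimal sketch of this sample average for the cereal example is below, using only the standard library. The uniform beliefs about each cereal's sugar content are invented for illustration, and the helper names `cereal_loss` and `expected_loss` are hypothetical. In a real analysis the samples would more likely come from a fitted model.

```python
import random
from typing import Callable, Sequence, Tuple

# theta = (sugar in cereal 1, sugar in cereal 2), in grams per serving
Theta = Tuple[float, float]


def cereal_loss(theta: Theta, action: str) -> float:
    """Loss is the sugar content of the chosen cereal."""
    theta_1, theta_2 = theta
    return theta_1 if action == "a1" else theta_2


def expected_loss(loss: Callable[[Theta, str], float],
                  samples: Sequence[Theta],
                  action: str) -> float:
    """Monte Carlo estimate: average the loss over draws from p(theta)."""
    return sum(loss(theta, action) for theta in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 20_000

# Assumed beliefs (illustrative only):
# theta_1 ~ Uniform(10, 14) and theta_2 ~ Uniform(6, 12).
samples = [(rng.uniform(10, 14), rng.uniform(6, 12)) for _ in range(N)]

for action in ("a1", "a2"):
    print(action, round(expected_loss(cereal_loss, samples, action), 2))
```

Under these assumed beliefs the estimates should land close to the belief means of 12 and 9 grams, and the approximation tightens as $N$ grows.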

Given the estimated expected loss, we are now in a position to choose the "best" decision. Called the "Bayes action", this is the decision that minimizes our expected loss. Formally, $$\hat{a} \approx \argmin_{a \in \mathcal{A}} \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(\theta^{(n)}, a)$$
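
To make the argmin concrete, here is a rough sketch for the cell phone example over a small grid of candidate prices. Everything numeric is an assumption for illustration: the candidate prices and the Beta beliefs about the sale probability at each price are made up, and the continuous decision space is replaced by a finite grid so the minimum can be found by simple enumeration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 10_000

# Hypothetical beliefs: the sale probability at each candidate price
# follows a Beta(alpha, beta) distribution (values invented for illustration).
beliefs = {
    100.0: (9, 1),  # mean sale probability 0.9
    150.0: (7, 3),  # mean sale probability 0.7
    200.0: (4, 6),  # mean sale probability 0.4
    250.0: (1, 9),  # mean sale probability 0.1
}


def loss(theta: float, price: float) -> float:
    """Cell phone loss L(theta, a) = -theta * a."""
    return -theta * price


def expected_loss(price: float) -> float:
    """Average the loss over N draws from the belief for this price."""
    alpha, beta = beliefs[price]
    draws = (rng.betavariate(alpha, beta) for _ in range(N))
    return sum(loss(theta, price) for theta in draws) / N


estimates = {price: expected_loss(price) for price in beliefs}

# The Bayes action is the candidate with the smallest estimated expected loss.
bayes_action = min(estimates, key=estimates.get)
print(bayes_action)  # 150.0 under these assumed beliefs
```

With these made-up beliefs the expected revenues are roughly 90, 105, 80, and 25, so the estimated Bayes action is a price of 150. A finer grid or a numerical optimizer could stand in for the enumeration when the decision space is continuous.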

```python
import pandas as pd
import pymc3 as pm
```

```python
pd.DataFrame({})
```