# KIN 482D: Computational modeling of human sensorimotor control and learning

## Chapter 4: The response distribution

## Housekeeping

- Make sure you have someone you can consult and pair program with if you are new or a little rusty with Python (also, come speak with me if you're concerned)
- Problem Set 3 will be time consuming, by necessity: fitting models to data takes lots of time, even when you know how!
- Broken record time: Get started early, and have fun!
- In two weeks, you will learn one of the most popular (and useful) Bayesian models ever, and it will likely feel like a breeze
- Start thinking about final projects 
    - Novel modeling idea (must be Bayesian)
    - Take a challenging, influential paper (I can recommend one) and create a combined tutorial/simulation based on its model


## The big question in Chapter 4

*How does a Bayesian model predict a human observer's responses on a perceptual task?*

A model formalizes our understanding of a behavior. If our understanding is correct, it must be able to make accurate predictions. 

## Plan

- Discuss why, upon repeated presentations of the same stimulus, a Bayesian observer’s posterior mean estimate (i.e., the observer’s response) is a random variable

- Derive probability distribution of the Bayesian observer’s responses (Step 3 of Bayesian modeling)

- The derived response distribution allows researchers to compare the predictions of the Bayesian model to the observer’s actual behavior in the context of a psychophysical experiment

- Discuss bias and variance of the posterior mean estimate (PME) and compare to the maximum-likelihood estimate (MLE)

- Discuss optimality


## Inherited variability

- To compare our model with observer's behavior, we must specify what the model predicts for observer's responses when true stim is $s$

- In other words, we must derive $p(\hat{s}|s)$ - the "estimate distribution" or "response distribution"

- Because $x_\text{obs}$ is a RV for a given $s$, so is the stimulus estimate

## "Inherited variability": Likelihood function, posterior, and PME are not fixed!

&nbsp;
<img src="images/fig4-1.png" width=500>

- Remember: The posterior probability is realized on a single trial
- Distributions move around on a trial-by-trial basis, even when $s$ is constant, because $x_{\text{obs}}$ is a RV (why we need lots of trials in our experiments) 
- Stochasticity in the stimulus estimate, or the subject’s response, is “inherited from” the stochasticity in the measurement.
- In the figure above, note the relationship between $s_n$ and $x_{\text{obs}_n}$. They will usually not be aligned because of measurement noise/variability
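
To make inherited variability concrete, here is a minimal simulation sketch. It assumes a Gaussian measurement distribution and uses the posterior mean formula from this chapter; the specific values of `s`, `sigma`, `mu`, and `sigma_s` are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Illustrative values (assumptions, not from the chapter)
s = 5.0          # true stimulus, held constant across trials
sigma = 2.0      # measurement noise SD
mu = 0.0         # prior mean
sigma_s = 8.0    # prior SD
n_trials = 10_000

J = 1 / sigma**2        # measurement precision
J_s = 1 / sigma_s**2    # prior precision

# Each trial yields a different noisy measurement of the same stimulus
x_obs = rng.normal(loc=s, scale=sigma, size=n_trials)

# The posterior mean is a deterministic function of x_obs, so it varies too
s_hat_pm = (J * x_obs + J_s * mu) / (J + J_s)

print("first five measurements:", np.round(x_obs[:5], 2))
print("first five estimates:   ", np.round(s_hat_pm[:5], 2))
print("SD of measurements:", x_obs.std())
print("SD of estimates:   ", s_hat_pm.std())
```

Even though `s` never changes, the estimates differ from trial to trial, and their spread is a scaled-down copy of the spread in the measurements.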


## The Response Distribution

\begin{equation}
\begin{aligned}
\mu_\text{post} &= \dfrac{Jx_\text{obs} + J_s\mu}{J + J_s} \\
\hat{s}_\text{PM} &= \mu_\text{post} \\
\end{aligned}
\end{equation}

That is, on a given trial, the stimulus estimate is the posterior mean.

The rest goes up on the board, including "Properties of Linear Combinations of Random Variables"
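
For reference, here is a sketch of where that derivation lands. It assumes the Gaussian measurement model $x_\text{obs} \sim N(s, \sigma^2)$ with $J = 1/\sigma^2$. The posterior mean is a linear function $ax_\text{obs} + b$ of the measurement, with $a = \dfrac{J}{J + J_s}$ and $b = \dfrac{J_s\mu}{J + J_s}$. A linear function of a Gaussian random variable is itself Gaussian, with mean $a\,\mathbb{E}[x_\text{obs}] + b$ and variance $a^2\,\text{Var}[x_\text{obs}]$, so:

\begin{equation}
\begin{aligned}
\mathbb{E}[\hat{s}_\text{PM}|s] &= \dfrac{Js + J_s\mu}{J + J_s} \\
\text{Var}[\hat{s}_\text{PM}|s] &= \left(\dfrac{J}{J + J_s}\right)^2 \sigma^2 = \dfrac{J}{(J + J_s)^2} \\
p(\hat{s}_\text{PM}|s) &= N\left(\hat{s}_\text{PM};\ \dfrac{Js + J_s\mu}{J + J_s},\ \dfrac{J}{(J + J_s)^2}\right)
\end{aligned}
\end{equation}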

## Note on notation

- **Expected value** is a fancy term for the average, or mean
- Precision notation: $J = \dfrac{1}{\sigma^2}$

## First 3 steps of Bayesian modeling

<img src="images/table4-1.png" width=600>

## Belief versus Response Distributions

- Priors, normalized likelihoods, and posteriors represent degree of belief observer has in different hypothesized world states
    - Beliefs are defined on each individual trial, internal to observer, and not directly measurable (they are estimated based on assumptions of our model)
- Response distribution is a summary of observer's behavior across many trials
    - Directly measurable and exists even if observer is not Bayesian    
- Difference manifests in their variances (see next figures)

## First 3 steps of Bayesian modeling

<img src="images/table3-1.png" width=600>
<img src="images/table4-2.png" width=600>

## Variance of belief versus response distributions

<img src="images/fig4-2.png" width=500>

Exercise 4.2 asks you why SD of response distribution eventually starts to decrease. Hint: Remember that we are assuming $s$ is held constant; therefore, what contributes to variability of $\hat{s}$ across trials? 
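
The two curves can be reproduced numerically, as in the sketch below. It assumes the Gaussian model of this chapter, for which the posterior variance is $1/(J + J_s)$ and the variance of the PME across trials is $J/(J + J_s)^2$. The prior SD `sigma_s = 8.0` and the range of `sigma` are illustrative assumptions. The code only computes the curves; the "why" is left to the exercise.

```python
import numpy as np

sigma_s = 8.0                          # prior SD (illustrative assumption)
sigma = np.linspace(0.1, 30.0, 300)    # range of measurement noise SDs

J = 1 / sigma**2       # measurement precision for each noise level
J_s = 1 / sigma_s**2   # prior precision

posterior_sd = np.sqrt(1 / (J + J_s))   # width of belief on a single trial
response_sd = np.sqrt(J) / (J + J_s)    # spread of the PME across trials

# Locate the noise level at which the response SD is largest
peak_sigma = sigma[np.argmax(response_sd)]
print(f"response SD is largest near sigma = {peak_sigma:.2f}")
```

Plotting `posterior_sd` and `response_sd` against `sigma` (for example with matplotlib) should give curves with the same qualitative shape as the figure: the belief keeps widening with noise, while the response spread rises and then falls.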

## 4.4 Maximum-likelihood estimate

- $\hat{s}_{\text{ML}} = x_{\text{obs}}$
- The distribution of the MLE of $s$ is equivalent to the measurement distribution: $N(s, \sigma^2)$
- MLE ignores stimulus (prior) distribution
- Just as we studied distribution of $\hat{s}_\text{PM}$ for given $s$, we can study distribution of MLE, $\hat{s}_\text{ML}$, for given $s$
- We already know this latter distribution. What is it?

<img src="images/fig4-3_tophalf.png" width=800>

## 4.5 Bias and mean squared error

- **The posterior mean is biased from the stimulus towards the mean of the prior.**

- The *bias* of an estimate $\hat{s}$ is defined as the difference between the average estimate and the true stimulus:

$$ \text{Bias}[\hat{s}|s] \equiv \mathbb{E}[\hat{s}|s] - s$$

- If the posterior mean estimate is biased, how can it be "optimal," as is frequently claimed about Bayesian models?

- Turns out that PME is good in the sense that it minimizes the overall *mean squared error* between the estimate and the true stimulus. 

## Bias-Variance Decomposition of MSE

Go over on board. 

<img src="images/fig4-3.png" width=800>

The same mean squared error can arise from different combinations of bias and variance.
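
The decomposition $\text{MSE}[\hat{s}|s] = \text{Bias}[\hat{s}|s]^2 + \text{Var}[\hat{s}|s]$ can be checked by simulation. The sketch below holds one stimulus fixed and compares the MLE and PME. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values (assumptions)
s, sigma = 10.0, 4.0       # true stimulus and measurement noise SD
mu, sigma_s = 0.0, 8.0     # prior mean and SD
n_trials = 200_000

J, J_s = 1 / sigma**2, 1 / sigma_s**2
x_obs = rng.normal(s, sigma, n_trials)   # many measurements of the same s

estimates = {
    "MLE": x_obs,                                # s_hat_ML = x_obs
    "PME": (J * x_obs + J_s * mu) / (J + J_s),   # posterior mean
}

results = {}
for name, s_hat in estimates.items():
    bias = s_hat.mean() - s               # average estimate minus true stimulus
    variance = s_hat.var()                # spread of the estimate across trials
    mse = np.mean((s_hat - s) ** 2)       # stimulus-conditioned MSE
    results[name] = (bias, variance, mse)
    print(f"{name}: bias={bias:.3f}, var={variance:.3f}, "
          f"bias^2+var={bias**2 + variance:.3f}, MSE={mse:.3f}")
```

For both estimators the last two printed numbers agree. The MLE has bias near zero but the larger variance, while the PME trades some bias toward `mu` for lower variance.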

## More comparisons of PME vs MLE and why PME is optimal

<img src="images/fig4-3.png" width=800>

**Figure 4.3:** Comparison between the posterior mean estimate (PME) and the maximum-likelihood
estimate (MLE). In this example, the stimulus distribution has $\mu = 0$ and $\sigma_s = 8$. (A) Scatterplots
of PMEs and MLEs against the true stimulus. Dashed lines indicate the expected values. The
larger the noise, the lower the slope of the expected value of the PME. (B) Mean squared error as
a function of the stimulus for the PMEs and MLEs. Mean squared error (solid lines) is the sum
of squared bias and variance. Although the PME is biased, its variance (dashed light blue line) is
lower than that of the MLE (green line). The stimuli that occur often according to the stimulus
distribution (shading indicates probability) are such that the overall (stimulus-averaged) MSE of
the PME (light blue number, in parentheses) is always lower than that of the MLE (green number).
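
The overall (stimulus-averaged) comparison can be sketched by drawing a new stimulus from the stimulus distribution on every trial. The prior parameters below follow the caption ($\mu = 0$, $\sigma_s = 8$); the noise level `sigma = 4.0` is an illustrative assumption and need not match any panel of the figure.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma_s = 0.0, 8.0     # stimulus distribution from the Figure 4.3 caption
sigma = 4.0                # measurement noise SD (illustrative assumption)
n_trials = 500_000

J, J_s = 1 / sigma**2, 1 / sigma_s**2

s = rng.normal(mu, sigma_s, n_trials)   # a new stimulus on each trial
x_obs = rng.normal(s, sigma)            # noisy measurement of that stimulus

s_hat_ml = x_obs
s_hat_pm = (J * x_obs + J_s * mu) / (J + J_s)

# Average squared error over both stimuli and measurement noise
overall_mse_ml = np.mean((s_hat_ml - s) ** 2)
overall_mse_pm = np.mean((s_hat_pm - s) ** 2)

print(f"overall MSE, MLE: {overall_mse_ml:.2f} (sigma^2 = {sigma**2:.2f})")
print(f"overall MSE, PME: {overall_mse_pm:.2f} (1/(J+J_s) = {1/(J+J_s):.2f})")
```

Under these assumptions the simulated overall MSE of the MLE should sit close to $\sigma^2$, and that of the PME close to the posterior variance $1/(J + J_s)$, which is always the smaller of the two.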

## The world according to E.T. Jaynes (possibly the G.O.A.T. of Bayesian statistics)

<img src="images/ETJaynes2.jpg">

“When we call the quantity...‘bias’, that makes it sound like something awfully reprehensible, which we must get rid of at all costs. If it had been called instead the ‘component of error orthogonal to the variance’,...it would have been clear to all that these two contributions to the error are on an equal footing...This is just the price one pays for choosing a technical terminology that carries an emotional load, implying value judgments...” (Jaynes, 2003, p. 514).

## Brief aside about response noise

- We have assumed observer's response is equal to stimulus estimate
- In reality, response (e.g., motor) noise could exist (e.g., accuracy in choosing a response with cursor, reach to inferred target location, etc.)
- We will see in a few weeks how $\sigma^2_\text{motor}$ can be incorporated into model (not difficult, but adds extra parameter)
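
As a preview, here is one simple way response noise could enter. It assumes the response is the stimulus estimate plus independent, zero-mean Gaussian motor noise. This is only a sketch under that assumption: the treatment later in the course may differ, and `sigma_motor` and the other values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative values (assumptions)
s, sigma = 5.0, 2.0        # true stimulus and measurement noise SD
mu, sigma_s = 0.0, 8.0     # prior mean and SD
sigma_motor = 1.5          # hypothetical motor noise SD
n_trials = 200_000

J, J_s = 1 / sigma**2, 1 / sigma_s**2

x_obs = rng.normal(s, sigma, n_trials)
s_hat_pm = (J * x_obs + J_s * mu) / (J + J_s)

# Assumed: response = estimate + independent zero-mean Gaussian motor noise
response = s_hat_pm + rng.normal(0.0, sigma_motor, n_trials)

# Variances of independent terms add
predicted_var = J / (J + J_s) ** 2 + sigma_motor**2
print(f"simulated response variance: {response.var():.3f}")
print(f"predicted response variance: {predicted_var:.3f}")
```

Under this assumption the motor noise widens the response distribution without changing its mean, which is why it costs only one extra parameter.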

## Reflections on Bayesian models

- This model is representative of Bayesian modeling in general
- Essence of Bayesian *ideal observer* is that they consider all possible values of world state, and compute probabilities of those values (most non-Bayesian models work with only point estimates)
- A great advantage of Bayesian modeling is that you can build a complete model of a psychophysical task before collecting any data
    - Model specifies how observer should do task in order to be optimal
    - Bayesian models are also called *normative* models, as they set the norm/standard

## Summary

- Bayesian modeling consists of three steps: defining the generative model, deriving an expression for the observer’s posterior mean estimate (PME), and deriving the distribution of the posterior mean estimate over many trials.
- We introduced the mean squared error (MSE) as a measure of the performance of an estimate, and distinguished stimulus-conditioned MSE from overall MSE.
- Stimulus-conditioned MSE is a sum of squared bias and variance.
- We compared the distribution of the PME to that of the MLE. Although the latter is unbiased, it has higher variance for frequently occurring stimuli, and it is worse overall as a result.
- It is easy to confuse the functions and distributions in the different steps, as they look similar (in this chapter, they are all Gaussian). They must be distinguished carefully, as we did in Table 4.1.
- Many conceptual mistakes can be made if the three steps are not followed. In particular, attempts at calculating the response distribution through shortcuts are bound to fail.
- The PME model is a minimal model. Different forms of decision noise and response noise can be considered as variants and extensions.
