## Week 8: Decoding Models and Information Theory

Up until now, we have mostly been considering models of how various stimulus features are **encoded** in neural activity.

We'll now consider the other side of the coin: given a pattern of neural activity, what can we tell about what stimulus was presented?


In [None]:
# load matplotlib inline mode
%matplotlib inline

# import some useful libraries
import sys
import numpy as np                # numerical analysis and linear algebra
import matplotlib as mpl
import matplotlib.pyplot as plt   # plotting
sys.path.insert(0,"/project/psyc5270-cdm8j/comp-neurosci")
from comp_neurosci_uva import dists

# set some style options
mpl.rcParams['image.origin'] = 'lower'
mpl.rcParams['image.aspect'] = 'auto'
mpl.rcParams['image.cmap'] = 'jet'

## Encoding vs decoding models

We can think of the sensory stimulus and the response it evokes as being random variables. In each trial, a stimulus is selected from the stimulus distribution, and this results in some spiking pattern which comes from a large (but finite) set of possible responses.

<img src="images/l9_distributions.png" alt="V1 tuning" style="width: 500px;"/>

### Encoding models

The goal of an encoding model is to describe how the response depends on which stimulus was presented. More formally, it is a probabilistic model of the response conditional on the stimulus, $p(r|s)$.

The receptive field (RF) models covered in our last lesson are an example. The response at time $t_i$ is modeled as a weighted sum of the present and past values of the stimulus, plus some random noise, $\varepsilon_i$.

$$
r_i = \mathbf{s}_i \cdot \mathbf{h} + \varepsilon_i
$$

If the errors are normally distributed with variance $\sigma^2$, then the conditional probability is simply

$$
p(r|\mathbf{s}) = \mathrm{N}(\mathbf{s} \cdot \mathbf{h}, \sigma^2)
$$
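
As a minimal sketch of how this likelihood can be evaluated, the example below uses Python's standard-library `statistics.NormalDist` rather than the course's `dists` module. The stimulus window `s`, the filter `h`, and the noise level `sigma` are made-up values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical values chosen only for illustration
s = [1.0, 0.5, 0.0]      # present and past stimulus values at one time point
h = [2.0, 1.0, 0.5]      # receptive field weights
sigma = 1.0              # standard deviation of the noise

# The mean response is the dot product of the stimulus and the filter
mean_response = sum(si * hi for si, hi in zip(s, h))

# p(r|s) is a normal distribution centered on that mean
encoding_model = NormalDist(mu=mean_response, sigma=sigma)

def likelihood(r):
    """Probability density of observing response r given the stimulus s."""
    return encoding_model.pdf(r)

print("mean response:", mean_response)
print("p(r=2.5|s):", likelihood(2.5))
print("p(r=4.5|s):", likelihood(4.5))
```

Responses near the predicted mean have the highest density, and the density falls off as the observed response moves away from $\mathbf{s} \cdot \mathbf{h}$.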

### Decoding models

In contrast, a decoding model attempts to reconstruct the stimulus based on the response. More formally, it represents $p(\mathbf{s}|r)$, the probability that a stimulus was presented conditional on the response that was observed.

Decoding models are related to encoding models through Bayes' rule:

$$
p(\mathbf{s}|r) = \frac{p(r|\mathbf{s})p(\mathbf{s})}{p(r)}
$$

## Signal detection

Let's consider a simple problem in which we have to detect a signal against a noisy background. We'll start by describing the encoding model and then develop a decoding model.

On a given trial, the stimulus $s$ is either present ($s = 1$) or absent ($s = 0$).

Similarly, in each trial, the neuron that we're monitoring generates a response that we'll summarize by the rate $r$.

Let's assume that the conditional distribution of the responses is Gaussian:

$$
p(r|s) = \mathrm{N}(\mu, \sigma^2)
$$

Furthermore, let's say that the average response $\mu$ is a simple linear function of whether or not the stimulus is present:

$$
\mu = \beta_0 + \beta_1 s
$$

Let's look at the distributions in Python:

In [None]:
beta  = [10, 5]
sigma = np.sqrt(5)

r = np.arange(0, 30, 0.1)
pr_noise = dists.normal(mean=beta[0], std=sigma)
pr_signal = dists.normal(mean=beta[0] + beta[1], std=sigma)
plt.plot(r, pr_noise.pdf(r), lw=2, label="p(r|s=0)")
plt.plot(r, pr_signal.pdf(r), lw=2, label="p(r|s=1)")
plt.legend()
plt.xlabel("Rate")
plt.ylabel("Probability");

Now let's use Bayes's Rule to derive the decoding model, $p(s|r)$.

First, note that $p(s=1|r) = 1 - p(s=0|r)$, so let's just consider the probability that the stimulus is present.

Next, we need to calculate the marginal response probability $p(r)$, which we obtain by summing the joint distribution over the possible stimuli:

\begin{align}
p(r) & = \sum_{s \in S} p_{S,R}(s, r) \\
     & = \sum_{s \in S} p_R(r|s) p_S(s) \\
     & = p_R(r|s=0)p_S(s=0) + p_R(r|s=1)p_S(s=1)
\end{align}

If the probability of the signal being present $p_S(s=1) = 0.5$, then

$$p(r) = \frac{1}{2} p(r|s=0) + \frac{1}{2} p(r|s=1)$$

In [None]:
ps = 0.5
pr = (1 - ps) * pr_noise.pdf(r) + ps * pr_signal.pdf(r)
plt.plot(r, pr)
plt.xlabel("Rate")
plt.ylabel("p(r)")

Finally, we'll use Bayes's Rule to calculate $p(s|r)$. 

In [None]:
# Bayes's Rule
psr = (pr_signal.pdf(r) * ps) / pr
plt.plot(r, psr)
plt.xlabel("Rate")
plt.ylabel("p(s=1|r)");
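
With equal priors, the posterior plotted here also has a closed form. As a sketch under the Gaussian, equal-variance assumptions made above, substituting the two likelihoods into Bayes' rule and cancelling the shared normalizing constants gives a logistic function of the rate:

\begin{align}
p(s=1|r) & = \frac{p(r|s=1)}{p(r|s=0) + p(r|s=1)} \\
 & = \frac{1}{1 + \exp\left(-\frac{(r - \beta_0)^2 - (r - \beta_0 - \beta_1)^2}{2\sigma^2}\right)} \\
 & = \frac{1}{1 + \exp\left(-\frac{\beta_1}{\sigma^2}\left(r - \beta_0 - \frac{\beta_1}{2}\right)\right)}
\end{align}

The steepness of the curve is set by $\beta_1/\sigma^2$: a larger separation between the means, or less noise, makes the posterior switch more sharply from 0 to 1.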

### Exercise

Write a **function** to calculate $p(s|r)$ for any value of $p(s)$. Plot $p(s|r)$ when $p(s) = 0.1$ and when $p(s) = 0.9$. How does the decoding model shift its output based on this prior knowledge?

Consult the Software Carpentry [module for functions](http://swcarpentry.github.io/python-novice-inflammation/08-func/index.html) if you need a refresher.

## ROC analysis

The foregoing approach is called a **naive Bayesian decoder**. It works well if you happen to know or control the stimulus probabilities, and if you can collect enough data to accurately model $p(r|s)$.

However, in the real world, we often don't know the prior probability of a stimulus. In this case, a simple strategy is to set a **criterion**, or threshold, on the response for deciding whether the signal is present or not.

In [None]:
plt.plot(r, pr_noise.pdf(r), lw=2, label="p(r|s=0)")
plt.plot(r, pr_signal.pdf(r), lw=2, label="p(r|s=1)")
plt.vlines(12, 0, 0.2, label="criterion")
plt.legend()
plt.xlabel("Rate")
plt.ylabel("Probability");

If the response is above the criterion, the signal is assumed to be present, and if it's below, it's assumed to be absent.

### Exercise

With $p(s) = 0.5$, what criterion would you choose to minimize the probability of an error? Give your best guess and explain why. We'll come up with a more formal solution later.

This problem is at the heart of **signal detection theory** (SDT). SDT applies not only to neurons but to the receiver at the end of any noisy communication channel. A core concept is the **receiver operating characteristic** (ROC), which describes how the receiver's ability to accurately detect the signal depends on where it sets its criterion.

Because the decision and the signal are both binary, there is a 2-by-2 table of possible outcomes. Let $y = 0$ represent a response of "no signal" (i.e. below criterion) and $y = 1$ represent a response of "signal" (above criterion):


|  -  | s = 0 | s = 1 |
|----|------|------|
|y = 0  | correct rejection| miss |
|y = 1  | false alarm | hit |

Within each column of the table, the two outcomes are complementary: for a given stimulus, $p(\text{correct rejection}) = 1 - p(FA)$ and $p(\text{miss}) = 1 - p(H)$. Thus, we only need two values from the table, the false alarm rate and the hit rate. If we let $\gamma$ be the criterion, then these are:

\begin{align}
p(FA) & = p(r > \gamma|s = 0) \\
p(H) & = p(r > \gamma|s = 1)
\end{align}

As the criterion is changed, the false alarm and hit rate also change. You should be able to see this by looking at the plot of the two conditional response distributions above.
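
To make the outcome table concrete, here is a small simulation sketch that draws responses for each stimulus condition, applies the decision rule, and counts the outcomes. It uses the standard-library `random` module instead of the course's `dists` module, and the number of trials and the random seed are arbitrary choices.

```python
import random

# Parameters from the signal detection example above
beta_0, beta_1 = 10.0, 5.0
sigma = 5.0 ** 0.5
criterion = 12.0          # same criterion as in the plot above
n_trials = 5000           # hypothetical number of trials per condition

rng = random.Random(1)    # fixed seed so the simulation is repeatable

# Simulate responses on noise-only (s = 0) and signal (s = 1) trials
r_noise = [rng.gauss(beta_0, sigma) for _ in range(n_trials)]
r_signal = [rng.gauss(beta_0 + beta_1, sigma) for _ in range(n_trials)]

# Decision rule: report "signal" (y = 1) whenever the rate exceeds the criterion
p_fa = sum(r > criterion for r in r_noise) / n_trials
p_hit = sum(r > criterion for r in r_signal) / n_trials

# The other two cells of the table follow by complementarity
p_cr = 1 - p_fa
p_miss = 1 - p_hit

print(f"p(FA) = {p_fa:.3f}, p(CR) = {p_cr:.3f}")
print(f"p(H)  = {p_hit:.3f}, p(M)  = {p_miss:.3f}")
```

Rerunning with a different value of `criterion` shows the false alarm and hit rates moving together.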

Let's illustrate in Python. An important concept we need to cover first is the **cumulative distribution function** or **cdf**. This is a function that gives the area under the pdf up to the specified value. Given a pdf $p(x)$, the cdf is defined as:

$$
P(y) = p(x \leq y) = \int_{-\infty}^y p(x) dx
$$
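
The definition can be checked numerically. The sketch below approximates the integral with a sum over small intervals and compares it to the `cdf` method of the standard-library `statistics.NormalDist` (used here in place of the course's `dists` module). The helper `cdf_by_integration`, its lower limit, and its step count are illustrative assumptions.

```python
from statistics import NormalDist

# Response distribution on noise-only trials, from the example above
noise = NormalDist(mu=10.0, sigma=5.0 ** 0.5)

def cdf_by_integration(dist, y, lower=-20.0, n_steps=32000):
    """Approximate the cdf at y by summing pdf(x) * dx from `lower` up to y."""
    dx = (y - lower) / n_steps
    # midpoint rule: evaluate the pdf at the center of each small interval
    return sum(dist.pdf(lower + (i + 0.5) * dx) for i in range(n_steps)) * dx

y = 12.0
numerical = cdf_by_integration(noise, y)
exact = noise.cdf(y)
print("numerical integral:", numerical)
print("cdf method:        ", exact)
```

The lower limit stands in for $-\infty$; it only needs to be far enough below the mean that the pdf is effectively zero there.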

### Exercise

Define $p(FA)$ and $p(H)$ in terms of the **cdfs** for $p(r|s)$:

\begin{align}
p(FA) & =  \\
p(H) & = 
\end{align}

The Python implementation of the normal distribution has a method `cdf()` that will evaluate the **cdf** for you.

This allows us to generate a plot of $p(H)$ versus $p(FA)$ for different values of the criterion. This is called the ROC curve.

In [None]:
gamma = np.arange(5, 20)
pfa = 1 - pr_noise.cdf(gamma)
phit = 1 - pr_signal.cdf(gamma)
plt.plot(pfa, phit, 'o', label=r"$\sigma^2 = {}$".format(5))
plt.legend(loc='lower right')
plt.xlabel("P(FA)");
plt.ylabel("P(H)");

Note that there is an inherent tradeoff. You can only increase the hit rate by also increasing the false alarm rate.

### Exercise

Go back to the original definitions of $p(r|s)$. What do you think will happen if you increase or decrease $\sigma^2$?

Write a function to calculate the ROC as a function of $\sigma^2$. Plot ROCs (using the same array of $\gamma$ values as above) for $\sigma^2 = 2, 10, 20, 100$.

### Excursion

If we assume that the conditional response distributions are normally distributed with equal variance, then we can summarize the relationship between them using a single number, $d'$ (d-prime). This is defined as the difference between the means divided by the standard deviation:

$$
d' = \frac{\beta_1}{\sigma} = \frac{\mu_s - \mu_n}{\sigma}
$$

Furthermore, because both conditional distributions are normal with the same standard deviation, the criterion cancels out when the hit and false alarm rates are converted to z-scores, so we can estimate $d'$ from the false alarm and hit rate at a single criterion value:

$$
d' = z(H) - z(FA)
$$

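where $z(H)$ is the z-score of the hit rate and $z(FA)$ is the z-score of the false alarm rate. Here the z-score of a probability is the value at which the standard normal cdf equals that probability; in other words, it is the inverse of the standard normal cdf. We can calculate it in Python using the `ppf` method: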

In [None]:
pfa = 1 - pr_noise.cdf(10)
phit = 1 - pr_signal.cdf(10)

# standard normal distribution (mean 0, standard deviation 1)
std_normal = dists.normal(mean=0, std=1)
d_prime = std_normal.ppf(phit) - std_normal.ppf(pfa)
print("d' =", d_prime)
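
To see why the criterion drops out of this calculation, here is a short derivation under the equal-variance Gaussian assumption. It writes $\Phi$ for the standard normal cdf (a symbol introduced here for convenience), so that $z(p) = \Phi^{-1}(p)$:

\begin{align}
p(H) & = 1 - \Phi\left(\frac{\gamma - \mu_s}{\sigma}\right) = \Phi\left(\frac{\mu_s - \gamma}{\sigma}\right) \\
p(FA) & = 1 - \Phi\left(\frac{\gamma - \mu_n}{\sigma}\right) = \Phi\left(\frac{\mu_n - \gamma}{\sigma}\right) \\
z(H) - z(FA) & = \frac{\mu_s - \gamma}{\sigma} - \frac{\mu_n - \gamma}{\sigma} = \frac{\mu_s - \mu_n}{\sigma} = d'
\end{align}

The second equality in each of the first two lines uses the symmetry of the normal distribution, $1 - \Phi(x) = \Phi(-x)$.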

### Exercise

1. With $\sigma^2 = 5$, verify that $d'$ is the same for any value of $\gamma$ in the support.
2. Verify that this empirical calculation is correct given the value defined for $\beta_1$ above.
3. What is $d'$ when $\sigma^2$ is increased to 10?

## Information Theory

Signal detection theory is powerful, but it's not trivial to apply to **discrimination** problems where there are more than two stimuli.

Another way of summarizing decoding models is through information theory, which builds on the foundational work of Claude Shannon.

We need to understand two key concepts: **entropy** and **information**.

### Entropy

The idea of uncertainty is inherent to probability theory. If some event has a probability that's less than 1.0, that means that we're uncertain about whether it will happen.

A probability distribution quantifies our uncertainty about all the values a particular random variable can take on.

One way of summarizing the uncertainty of the whole distribution is by its **entropy**, which is defined as follows:

$$
H(R) = - \sum_{r \in R} p(r) \log_2 p(r)
$$

For continuous distributions, the entropy is defined by an integral over the support of the pdf:

$$
H(R) = - \int_{R} p(r) \log_2 p(r) dr
$$

If, as above, a base-2 logarithm is used, then entropy has units of **bits**. For a discrete distribution, you can interpret the entropy as the average number of binary digits needed to encode a value drawn from the distribution.

For example, a fair coin has equal probability of being heads or tails, which gives it an entropy of 1 bit:

In [None]:
coin = dists.bernoulli(p=0.5)
coin.entropy() / np.log(2)

When the natural log is used, the units are called **nats**. The `scipy` distribution functions return entropy in nats, so we have to convert to bits by dividing by $\log 2$.
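
As a sketch of the discrete entropy formula and the nats-to-bits conversion, the example below computes both directly with the standard-library `math` module. The function names and the two four-outcome distributions are made up for illustration.

```python
import math

def entropy_nats(probs):
    """Entropy of a discrete distribution, using the natural log (nats)."""
    # outcomes with zero probability contribute nothing to the sum
    return sum(-p * math.log(p) for p in probs if p > 0)

def entropy_bits(probs):
    """Same entropy in bits: divide the value in nats by log(2)."""
    return entropy_nats(probs) / math.log(2)

uniform4 = [0.25, 0.25, 0.25, 0.25]   # four equally likely outcomes
skewed4 = [0.7, 0.1, 0.1, 0.1]        # one outcome is much more likely

print("uniform:", entropy_nats(uniform4), "nats =", entropy_bits(uniform4), "bits")
print("skewed: ", entropy_nats(skewed4), "nats =", entropy_bits(skewed4), "bits")
```

Four equally likely outcomes need 2 bits, while the skewed distribution has lower entropy because its outcome is easier to predict.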

### Exercise

In contrast, if we have a coin that always lands heads, then there is no uncertainty about the outcome, and we need 0 bits to store it.

Thus, we can infer that the entropy of the Bernoulli distribution depends on its parameter.

Plot the entropy of the Bernoulli distribution as a function of its parameter $p$ (the probability that the value will be 1).

## Information

Information has a lot of colloquial usages, but in information theory it means a reduction in uncertainty.

**Mutual information** is the reduction in uncertainty about the value of one random variable when you know the value of another.

$$
I(S;R) = \sum_{s \in S, r \in R} p(s, r) \log_2 \frac{p(s,r)}{p(s) p(r)}
$$

To understand this formula, think about the definition of a joint probability distribution. If the two variables are independent, that means that their joint distribution is just the product of the marginal distributions, 
$p(s, r) = p(s) p(r)$.

In this case, the logarithm in every term of the sum above is zero, so $I(S;R) = 0$:

$$
\log_2 \frac{p(s,r)}{p(s) p(r)} = \log_2 \frac{p(s)p(r)}{p(s)p(r)} = \log_2 1 = 0
$$

In contrast, if $S$ and $R$ are not independent, then $p(s,r) \neq p(s)p(r)$ for at least some values of $s$ and $r$, and $I(S; R) > 0$. MI is never negative.
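
The formula can be tried out on small discrete tables before tackling the continuous case. This is a minimal sketch using only the standard library; the function `mutual_information` and the three joint distributions for a binary stimulus and a binary response are hypothetical examples.

```python
import math

def mutual_information(joint):
    """Mutual information in bits for a joint table with joint[s][r] = p(s, r)."""
    p_s = [sum(row) for row in joint]            # marginal over responses
    p_r = [sum(col) for col in zip(*joint)]      # marginal over stimuli
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_sr in enumerate(row):
            if p_sr > 0:                         # zero-probability cells add nothing
                mi += p_sr * math.log2(p_sr / (p_s[i] * p_r[j]))
    return mi

# Hypothetical joint distributions for a binary stimulus and a binary response
independent = [[0.25, 0.25], [0.25, 0.25]]   # response ignores the stimulus
noisy = [[0.4, 0.1], [0.1, 0.4]]             # response matches the stimulus 80% of the time
perfect = [[0.5, 0.0], [0.0, 0.5]]           # response always matches the stimulus

for name, joint in [("independent", independent), ("noisy", noisy), ("perfect", perfect)]:
    print(name, mutual_information(joint))
```

The independent table gives zero information, the perfect table gives the full 1 bit of stimulus entropy, and the noisy table falls in between.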

### Exercise

For the signal detection problem, with $\beta_0 = 10$, $\beta_1 = 5$, $\sigma^2 = 5$, and $p(s) = 0.5$, calculate the mutual information between the signal $S$ and the response $R$.

Calculating the joint probability distributions can be a little tricky, so it helps to factor them out:

\begin{align}
I(S;R) & = \sum_{s \in S, r \in R} p(s, r) \log_2 \frac{p(s,r)}{p(s) p(r)} \\
 & = \sum_{s \in S, r \in R} p(r|s)p(s) \log_2 \frac{p(r|s)}{p(r)} \\
 & = \sum_{s \in S} p(s) \sum_{r \in R} p(r|s) \log_2 \frac{p(r|s)}{p(r)}
\end{align}


In [None]:
beta  = [10, 5]
sigma = np.sqrt(5)
r = np.arange(0, 30, 0.1)
pr_noise = dists.normal(mean=beta[0], std=sigma)
pr_signal = dists.normal(mean=beta[0] + beta[1], std=sigma)
ps = 0.5

What happens when you change the discretization of $r$? Why? What would you need to do to remove this dependence?

## Information and entropy

Mutual information can also be calculated as a reduction in entropy:

\begin{align}
I(S;R) & = H(S) - H(S|R) \\
I(S;R) & = H(R) - H(R|S)
\end{align}

The second term in each of the equations above is a **conditional entropy**, which can be calculated much like the standard entropy, but averaged over all the possible values of the conditioning variable:

$$
H(R|S) = - \sum_{s \in S} p(s) \sum_{r \in R} p(r|s) \log_2 p(r|s)
$$

In the neural setting, the conditional entropy represents how much noise there is in the response. In other words, if the stimulus is known, how much does the response vary from trial to trial?

One way of interpreting this relationship is as an *information channel* between the stimulus and the neural response. 

<img src="images/l9_information_channel.png" alt="entropy and mutual information" style="width: 400px;"/>

The amount of information carried by this channel can be thought of in two ways. It is the total amount of information the neuron *could* represent in its response distribution, $H(R)$, reduced by the variability in the response that is independent of the stimulus, $H(R|S)$. Equivalently, it is the total uncertainty about the stimulus, $H(S)$, reduced by the uncertainty about the stimulus that remains when the response is known, $H(S|R)$.
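
Both decompositions can be checked on a small discrete example. The sketch below uses only the standard library; the joint table (a binary stimulus and a binary response that matches it 80% of the time) and the helper `entropy` are assumptions made for illustration, not part of the course code.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint table, joint[s][r] = p(s, r), for a binary stimulus and response
joint = [[0.4, 0.1], [0.1, 0.4]]
p_s = [sum(row) for row in joint]
p_r = [sum(col) for col in zip(*joint)]

# H(R|S): entropy of each conditional distribution p(r|s), weighted by p(s)
h_r_given_s = sum(
    ps * entropy([p / ps for p in row]) for ps, row in zip(p_s, joint)
)
# H(S|R): the same calculation with the roles of stimulus and response swapped
h_s_given_r = sum(
    pr * entropy([p / pr for p in col]) for pr, col in zip(p_r, zip(*joint))
)

mi_from_response = entropy(p_r) - h_r_given_s
mi_from_stimulus = entropy(p_s) - h_s_given_r
print("H(R) - H(R|S) =", mi_from_response)
print("H(S) - H(S|R) =", mi_from_stimulus)
```

The two routes give the same value, which reflects the symmetry of mutual information in $S$ and $R$.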

### Exercise

Use one of the entropy formulas to calculate mutual information for the SDT problem, and verify that it agrees with the value you calculated previously.

## Summary and parting thoughts

Decoding models represent the probability of the stimulus conditional on the response.

They can only represent a **best-case scenario**. They don't actually tell us how information in the neural response is interpreted by neurons downstream.

MI can be used to identify stimulus regimes where a neuron carries more information.

MI can also be used to compare neurons and brain regions.

Calculating MI with small amounts of data requires a lot of assumptions.