Let $\mathcal{D} = \{y, \bold{X}\}$
where $y$ is the target variable and $\bold{X}$ holds the source variables.

We are interested in what makes $y$ tick.

If the generating process for $y$ is ergodic, then the total uncertainty in $y$ is captured by the Shannon entropy.

### Shannon entropy (discrete)

We can think of the Shannon entropy as the average surprise, $-\log_2 p(y_i)$, of an observation $y_i$, where the average is taken over the distribution of $y$:

$$H(y) = - \langle \log_2 p(y) \rangle $$

$$H(y) = - \sum_y p(y) \log_2 p(y) $$

The Shannon entropy is measured in bits.
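The sum above can be sketched in a few lines of Python. This is a minimal plug-in estimate that assumes the probabilities are simply the observed frequencies; the function name `shannon_entropy` and the coin samples are made up for illustration.

```python
import math
from collections import Counter

def shannon_entropy(samples):
    """Plug-in estimate of H(y) in bits from a sequence of discrete observations."""
    counts = Counter(samples)
    n = len(samples)
    # Sum -p * log2(p) over observed values; unobserved values contribute nothing.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

fair_coin = ["H", "T", "H", "T"]   # hypothetical sample, p = 0.5 each
biased_coin = ["H", "H", "H", "T"] # hypothetical sample, p = 0.75 / 0.25

print(shannon_entropy(fair_coin))              # 1.0
print(round(shannon_entropy(biased_coin), 3))  # 0.811
```

The fair coin is maximally uncertain for two outcomes (1 bit); the biased coin is more predictable, so its entropy is lower.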



### Differential entropy (continuous)

If $y$ is 'continuous' (i.e. we are willing to interpolate the discrete histogram into a density), then the differential entropy is

$$ H(y) = - \int_{- \infty}^{\infty} p(y) \log p(y) dy $$ 

The differential entropy:
- Is integrated only over the support of $p(y)$, i.e. it excludes points where $p(y) = 0$, since $\log p(y)$ is undefined there.
- Conventionally uses natural logs, which gives a measure in nats.
- Can be negative.
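
To see how a negative value can arise, a small worked case may help. Assume, purely for illustration, that $y$ is uniform on $[0, a]$, so $p(y) = 1/a$ on that interval and zero elsewhere:

$$ H(y) = - \int_{0}^{a} \frac{1}{a} \log \frac{1}{a} \, dy = \log a $$

For $a = 1/2$ this gives $\log \tfrac{1}{2} \approx -0.693$ nats. Any density that is concentrated enough to exceed 1 over most of its support will produce a negative differential entropy, which is why it cannot be read as an absolute amount of uncertainty in the way the discrete entropy can.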


# Conditional entropy

Let's say $H(y) = 10$ bits.

The conditional entropy, $H(y | \bold{X})$, measures the remaining uncertainty in $y$ after accounting for $\bold{X}$.

Let's assume $H(y|\bold{X}) = 2$ bits.

This is a good outcome, because the conditional entropy $H(y|\bold{X})$ is much lower than the entropy $H(y)$. Conditioning on $\bold{X}$ removed most of the uncertainty in $y$.


### Mutual information

The difference between the entropy and the conditional entropy is the mutual information.

$$MI(y;\bold{X}) = H(y) - H(y|\bold{X})$$

In the above example, we had 10 bits of uncertainty, and after conditioning we only had 2. This means the mutual information is $10 - 2 = 8$ bits.
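
The same subtraction can be sketched for a small discrete case. The code below assumes a single source variable $x$ and a fully known joint distribution $p(y, x)$; the joint table and the helper names are hypothetical, and the conditional entropy is obtained from the chain rule $H(y|x) = H(y, x) - H(x)$.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(joint, axis):
    """Marginalise a joint dict {(y, x): p} onto one coordinate (0 for y, 1 for x)."""
    out = {}
    for key, q in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + q
    return out

def conditional_entropy(joint):
    """H(y|x) = H(y, x) - H(x), by the chain rule."""
    return entropy(joint) - entropy(marginal(joint, 1))

def mutual_information(joint):
    """MI(y; x) = H(y) - H(y|x)."""
    return entropy(marginal(joint, 0)) - conditional_entropy(joint)

# Hypothetical joint distribution p(y, x) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(round(entropy(marginal(joint, 0)), 3))  # H(y)    = 1.0
print(round(conditional_entropy(joint), 3))   # H(y|x)  = 0.722
print(round(mutual_information(joint), 3))    # MI(y;x) = 0.278
```

Here $x$ agrees with $y$ 80% of the time, so knowing $x$ removes roughly 0.28 of the 1 bit of uncertainty in $y$.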

**Explained information**

An important ratio is the ratio of mutual information to total entropy, which I will call explained information.

$$E(y|\bold{X}) = \dfrac{MI(y;\bold{X})}{H(y)}$$

The "explained information" is the proportion of the entropy of $y$ that is "explained" by conditioning on $\bold{X}$.
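
As a quick check with the numbers from the running example (10 bits of entropy, 8 bits of mutual information):

$$E(y|\bold{X}) = \dfrac{8}{10} = 0.8$$

So 80% of the uncertainty in $y$ is accounted for by $\bold{X}$. For discrete variables $0 \le MI(y;\bold{X}) \le H(y)$, so the ratio lies between 0 and 1; with differential entropy that bound is not guaranteed.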




# MI in depth

The formula for MI is relatively simple: it's the total entropy minus the entropy that's left after conditioning. In other words, it's the information gain.

$$MI(y;\bold{X}) = H(y) - H(y|\bold{X})$$

This is massively complicated by the fact that $\bold{X}$ is a matrix, i.e. it has several columns. For a simple data set with two source variables, what we really have is:

$$MI(y; x_1, x_2) = H(y) - H(y|x_1, x_2)$$

### Partial information decomposition

$$MI(y; x_1, x_2) = H(y) - H(y|x_1, x_2)$$

This total information gain from the covariates can be decomposed into three kinds of contribution:
- Unique information
- Synergistic information
- Redundant information


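$$MI(y; x_1, x_2) = \text{Uni}(x_1) + \text{Uni}(x_2) + \text{Syn}(x_1,x_2) + \text{Red}(x_1,x_2) $$

Or, more explicitly:

$$MI(y; x_1, x_2) = \text{Uni}(y: x_1 | x_2) + \text{Uni}(y: x_2 | x_1) + \text{Syn}(y: (x_1,x_2)) + \text{Red}(y: (x_1,x_2)) $$

The redundant term enters with a plus sign: it is part of the total information, counted once even though both sources carry it.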
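
A classic illustration of synergy is $y = x_1 \oplus x_2$ (XOR) with independent fair binary inputs. The sketch below does not implement a partial information decomposition; it only computes the three ordinary mutual informations, which is enough to see the effect. The helper name and the equally-likely-rows assumption are mine.

```python
import math
from collections import defaultdict
from itertools import product

def mutual_information(pairs):
    """MI in bits between the two coordinates of equally likely (a, b) pairs."""
    n = len(pairs)
    p_ab, p_a, p_b = defaultdict(float), defaultdict(float), defaultdict(float)
    for a, b in pairs:
        p_ab[(a, b)] += 1 / n
        p_a[a] += 1 / n
        p_b[b] += 1 / n
    # Sum p(a, b) * log2( p(a, b) / (p(a) p(b)) ) over observed pairs.
    return sum(p * math.log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items())

# All four equally likely inputs, with y = x1 XOR x2.
rows = [(x1 ^ x2, x1, x2) for x1, x2 in product([0, 1], repeat=2)]

mi_x1 = mutual_information([(y, x1) for y, x1, x2 in rows])
mi_x2 = mutual_information([(y, x2) for y, x1, x2 in rows])
mi_joint = mutual_information([(y, (x1, x2)) for y, x1, x2 in rows])

print(mi_x1, mi_x2, mi_joint)  # 0.0 0.0 1.0
```

Neither source alone says anything about $y$, yet together they determine it completely, so the full 1 bit is usually attributed to synergy. How the terms are assigned in less clear-cut cases depends on which redundancy measure one adopts.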
