In [12]:
import pandas as pd
import plotly.express as px
import numpy as np

# Maximum Likelihood Estimation

Suppose we start with a very simple model:

$$y_i=mx_i + \epsilon_i \sim \mathcal{N}(mx_i, \sigma^2)$$

In this case the probability density function $f(y;m)$ is:

$$f(y;m)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y - mx}{\sigma})^2}$$

Given this we can define the likelihood of a set of observations $\bold{y}=(y_1, y_2, ..., y_n)$ given $m$ as:

$$L(\bold{y};m)=\prod_i f(y_i;m)$$

However, since typically $f(y;m) \ll 1$ and the product of many small numbers is far smaller still, it is more useful to work with the logarithm of this quantity:

$$l(\bold{y};m)=\ln{L(\bold{y};m)}=\sum_i \ln{f(y_i;m)}$$
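To see why the logarithm matters in practice, here is a minimal sketch that compares the raw product of densities with the sum of their logs. The simulated data (2000 points, $\sigma = 1$, true slope 2) and the helper name `normal_pdf` are assumptions made up for this illustration.

```python
import numpy as np

def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
m, sigma, n = 2.0, 1.0, 2000   # illustrative values, not taken from the text
x = rng.uniform(0.0, 1.0, size=n)
y = m * x + rng.normal(0.0, sigma, size=n)

densities = normal_pdf(y, m * x, sigma)
likelihood = np.prod(densities)              # product of many numbers below 1
log_likelihood = np.sum(np.log(densities))   # same information, summed in log space

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # a finite negative number
```

With $\sigma = 1$ every density is at most about 0.4, so the product of 2000 of them is far below the smallest representable double, while the sum of logs stays well behaved.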

Now you can imagine that the best estimate of $m$ is the value that maximizes the overall likelihood. From calculus you may remember that wherever a function $h(u)$ is maximized we have $\partial_u h=0$. Therefore we are really looking for:

$$\partial_m l(\bold{y};m) = 0$$

Let's work this out for our case:

$$\ln{f(y;m)} = -\ln{\sigma \sqrt{2\pi}} - \frac{1}{2}(\frac{y-mx}{\sigma})^2$$

$$\partial_m \ln{f(y;m)} = -\frac{y-mx}{\sigma}(\frac{-x}{\sigma})=\frac{x}{\sigma^2}(y-mx)$$

$$\partial_m l(\bold{y};m) = \sum_i \frac{x_i}{\sigma^2}(y_i-mx_i)$$
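Setting this sum to zero and solving for $m$ gives $\hat{m} = \sum_i x_i y_i / \sum_i x_i^2$ (the $\sigma^2$ cancels). The sketch below checks this on simulated data; the symbol $\hat{m}$, the helper names `score` and `mle_slope`, and the simulated values are assumptions introduced for illustration.

```python
import numpy as np

def score(m, x, y, sigma):
    """Derivative of the log-likelihood with respect to m."""
    return np.sum(x / sigma**2 * (y - m * x))

def mle_slope(x, y):
    """Value of m where the score is zero: sum(x*y) / sum(x^2)."""
    return np.sum(x * y) / np.sum(x**2)

rng = np.random.default_rng(1)
m_real, sigma = 2.0, 0.1   # assumed true slope and noise level
x = np.array([0.1, 0.5, 1.0, 1.5])
y = m_real * x + rng.normal(0.0, sigma, size=x.size)

m_hat = mle_slope(x, y)
print(m_hat)                      # close to m_real, but not exactly equal
print(score(m_hat, x, y, sigma))  # zero up to floating-point rounding
```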

Let's take a quick look at the individual terms of this sum and plot them as a function of $m$ (assuming noise-free observations $y=m_{real}x$, where $m_{real}$ is the true slope):

$$\chi=\frac{x}{\sigma^2}(y-mx)$$

We'll call this our penalty (because we want it to be zero).

In [13]:
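m_real = 2.0  # true slope
sigma = 0.1  # noise standard deviation
M = np.arange(m_real-0.5, m_real+0.5, 0.01)  # candidate slopes around the true value
X = np.array([0.1, 0.5, 1., 1.5])  # sample locations
Y = m_real * X  # noise-free observations
rows = []
for x, y in zip(X, Y):
    # Penalty for this sample, evaluated at every candidate slope
    penalties = x / (sigma ** 2) * (y - M * x)
    for m, penalty in zip(M, penalties):
        rows.append({
            'x': x,
            'm': m,
            'penalty': penalty,
        })
df = pd.DataFrame(rows)
px.line(df, x='m', y='penalty', color='x', title='m vs penalty')  # one line per sample location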

Now this is super interesting, because if we get $m$ wrong, the magnitude of our penalty is actually a function of $x$! If $x$ is really close to zero, then changes to $m$ make almost no difference to our penalty. If you think of a machine trying to find the $m$ that gives us a penalty near zero, this would obviously make the machine's work a lot harder. However, if $x$ is far from zero, then small changes in $m$ result in large changes in the penalty, making life a lot easier for the machine.

To put all of this another way: if we were to sample points near zero, we'd make finding $m$ very difficult. However, if we sample far away from zero, we can find $m$ much faster.

Another, perhaps more intuitive, way to think about this is that for low $x$ our $mx$ is also going to be relatively small. That means our error term $\epsilon$ can have a dominating presence: the noise exceeds the signal. However, if $x$, and therefore $mx$, is quite large, $\epsilon$ is going to get washed out: our signal will overpower the noise.

Different samples give us different amounts of information about our model parameter $m$. 
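A quick simulation makes this concrete. The sketch below repeatedly draws noisy data at two hypothetical sets of sample locations, one near zero and one far from it, and fits $m$ each time using the value that zeroes the summed penalty, $\sum_i x_i y_i / \sum_i x_i^2$. The locations, the number of trials, and the helper name `fit_spread` are illustrative assumptions.

```python
import numpy as np

def fit_spread(x, m_real, sigma, n_trials, rng):
    """Standard deviation of the fitted slope over many simulated datasets."""
    noise = rng.normal(0.0, sigma, size=(n_trials, x.size))
    y = m_real * x + noise                       # one simulated dataset per row
    m_hat = (y * x).sum(axis=1) / np.sum(x**2)   # slope that zeroes the summed penalty
    return m_hat.std()

rng = np.random.default_rng(0)
m_real, sigma = 2.0, 0.1
x_near = np.array([0.1, 0.2, 0.3])   # samples close to zero
x_far = np.array([1.0, 2.0, 3.0])    # samples far from zero

spread_near = fit_spread(x_near, m_real, sigma, 5000, rng)
spread_far = fit_spread(x_far, m_real, sigma, 5000, rng)
print(spread_near, spread_far)   # the near-zero samples scatter the fit far more
```

Both sets contain the same number of points with the same noise, yet the fitted slope from the near-zero samples varies much more from trial to trial.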

Let's see if we can formalize this. 

# Fisher Information

Our penalty (per observation) is just:

$$\chi = \partial_m \ln{f(y;m)}$$

Therefore our sensitivity to changes in $m$ is simply:

$$\partial_m \chi = \partial_m^2 \ln{f(y;m)}$$

In our case this is just:

$$\partial_m \chi = -\frac{x^2}{\sigma ^2}$$

It turns out this quantity (with the sign flipped and an expectation taken) has a name. The quantity

$$I = -E[\partial_m^2 \ln{f(y;m)}]$$

is called the Fisher Information. And it has exactly the interpretation you'd expect: the larger the Fisher Information, the easier it's going to be to fit your parameter, and the better the signal-to-noise ratio.

Note that here we're using the expectation of our second derivative. This expectation is taken over $y$: in general the second derivative can be a function of $y$, so we need the expectation to get a single number for the information. In our case the second derivative does not depend on $y$, so the expectation changes nothing and $I = x^2/\sigma^2$.
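As a sanity check, the second derivative can be approximated numerically with a central finite difference and compared with $-x^2/\sigma^2$. The sample values and the helper name `log_f` below are assumptions chosen for illustration.

```python
import numpy as np

def log_f(y, m, x, sigma):
    """Log density of N(m*x, sigma^2) evaluated at y."""
    return -np.log(sigma * np.sqrt(2 * np.pi)) - 0.5 * ((y - m * x) / sigma) ** 2

x, y, m, sigma = 1.5, 3.1, 2.0, 0.1   # arbitrary illustrative values
h = 1e-3                              # finite-difference step in m

# Central difference approximation of the second derivative with respect to m
second_derivative = (log_f(y, m + h, x, sigma)
                     - 2 * log_f(y, m, x, sigma)
                     + log_f(y, m - h, x, sigma)) / h**2

fisher_information = x**2 / sigma**2
print(second_derivative)    # approximately -fisher_information
print(fisher_information)
```

Changing `y` in this sketch leaves the result unchanged, which is why the expectation over $y$ has no effect for this model.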

# Multiparameter Estimation

Let's make our model a little more complicated by adding in another parameter:

$$y_i=mx_i + b + \epsilon_i \sim \mathcal{N}(mx_i +b, \sigma^2)$$

In this case we have:

$$f(y;m,b)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y - mx - b}{\sigma})^2}$$

And we now have two penalties (usually called scores), one per parameter:

$$\chi_m = \partial_m \ln{f(y;m,b)} = \frac{x}{\sigma^2}(y-mx - b)$$

$$\chi_b = \partial_b \ln{f(y;m,b)} = \frac{1}{\sigma^2}(y-mx - b)$$

Given we have two parameters to vary, each score now has two sensitivities to look at. In other words, we are interested in four quantities:

$$\partial_m \chi_m = - \frac{x^2}{\sigma^2}$$

$$\partial_b \chi_m = - \frac{x}{\sigma^2}$$

$$\partial_b \chi_b = - \frac{1}{\sigma^2}$$

$$\partial_m \chi_b = - \frac{x}{\sigma^2}$$

The Fisher Information for each of these is just the negative of its expectation, as in the single-parameter case.

We can make our lives easier by thinking of this as a matrix:

$$
\bold{I}=
\begin{pmatrix}
-E[\partial_m \chi_m] & -E[\partial_m \chi_b] \\
-E[\partial_b \chi_m] & -E[\partial_b \chi_b]
\end{pmatrix} =
\begin{pmatrix}
x^2/\sigma^2 & x/\sigma^2 \\
x/\sigma^2 & 1/\sigma^2
\end{pmatrix}
$$

Now you should know that one sample does not two parameters make. Specifically, if I only sampled at, say, $x=1$, then I could fit my data using just $b$ or just $m$; I would not need both. Therefore we will *have* to sample at more than one $x$. What's the information of two samples? It turns out it's just the sum of the information matrices of each sample! (This makes sense because we are assuming that our samples are independent, so their log-likelihoods add.)

Therefore what we really have is:

$$
\bold{I}=
\begin{pmatrix}
-\sum_{x_i} E[\partial_m \chi_m] & -\sum_{x_i} E[\partial_m \chi_b] \\
-\sum_{x_i} E[\partial_b \chi_m] & -\sum_{x_i} E[\partial_b \chi_b]
\end{pmatrix} =
\begin{pmatrix}
\sum_{x_i} x_i^2/\sigma^2 & \sum_{x_i} x_i/\sigma^2 \\
\sum_{x_i} x_i/\sigma^2 & \sum_{x_i} 1/\sigma^2
\end{pmatrix}
$$
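This sum is easy to compute directly. In the sketch below, each sample contributes the outer product of the vector $(x_i, 1)$ with itself, scaled by $1/\sigma^2$, which is just another way of writing the per-sample matrix. The function name `fisher_information` is a hypothetical helper, not a library call, and the sample locations are made up.

```python
import numpy as np

def fisher_information(xs, sigma):
    """Information matrix for (m, b) from independent samples at locations xs."""
    info = np.zeros((2, 2))
    for x in xs:
        g = np.array([x, 1.0])              # derivative of the mean m*x + b w.r.t. (m, b)
        info += np.outer(g, g) / sigma**2   # per-sample matrix [[x^2, x], [x, 1]] / sigma^2
    return info

# With sigma = 1 the entries are just sum(x^2), sum(x), and the sample count
info = fisher_information([1.0, 3.0], sigma=1.0)
print(info)   # entries 10, 4 in the first row and 4, 2 in the second
```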

Alright, for illustration's sake, suppose we sampled only at $x=2$. We'd end up with:

$$
\bold{I}=\begin{pmatrix}
4/\sigma^2 & 2/\sigma^2 \\
2/\sigma^2 & 1/\sigma^2
\end{pmatrix}
$$

Let's take a little leap now. If we have a 2 x 2 matrix, we can imagine a parallelogram made from the rows of that matrix. It turns out the absolute value of the determinant is the area of that parallelogram (the two-dimensional analogue of a volume). This is pretty useful because each of our rows corresponds to the sensitivity of our scores to one parameter. For example, the first row in our matrix is the vector of sensitivities with respect to $m$. Now what's the size of the determinant of the matrix above?

$$\det
\begin{pmatrix}
4/\sigma^2 & 2/\sigma^2 \\
2/\sigma^2 & 1/\sigma^2
\end{pmatrix} = 0
$$

This means that our vectors are parallel! In some sense this means we can't distinguish between sensitivities to $m$ and sensitivities to $b$, which is exactly what we'd expect.

Let's now assume we sample at $x_1=2$ and $x_2=0$. This gives us:

$$\bold{I} =
\begin{pmatrix}
4/\sigma^2 & 2/\sigma^2 \\
2/\sigma^2 & 2/\sigma^2
\end{pmatrix}
$$

from which we get:

$$\det
\begin{pmatrix}
4/\sigma^2 & 2/\sigma^2 \\
2/\sigma^2 & 2/\sigma^2
\end{pmatrix} = \frac{4 \cdot 2 - 2 \cdot 2}{\sigma^4} = 4 / \sigma^4
$$

What's nice about this geometric interpretation is that it captures two things for us. First, the area of the parallelogram is obviously going to get larger as the information for each parameter increases. But it will also increase as the sensitivity vectors get closer to orthogonal. Therefore we can use:

$$|\det{\bold{I}}|$$

as a good indication of how informative the samples we are taking are going to be. And if it's zero we know we just don't have enough distinct points sampled!
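To tie this together, the sketch below computes $|\det{\bold{I}}|$ for a few candidate sets of sample locations. The helper `fisher_information` and the particular sets are assumptions for illustration; $\sigma = 1$ is chosen so the numbers are easy to read.

```python
import numpy as np

def fisher_information(xs, sigma):
    """Information matrix for (m, b): sums of x^2, x, and 1, divided by sigma^2."""
    xs = np.asarray(xs, dtype=float)
    return np.array([[np.sum(xs**2), np.sum(xs)],
                     [np.sum(xs), xs.size]]) / sigma**2

designs = {
    'single point': [2.0],
    'repeated point': [2.0, 2.0],
    'two points': [2.0, 0.0],
    'wider spread': [4.0, 0.0],
}
dets = {name: abs(np.linalg.det(fisher_information(xs, sigma=1.0)))
        for name, xs in designs.items()}
for name, value in dets.items():
    print(f'{name}: {value:.2f}')
```

Repeating the same location adds information but leaves the determinant at zero, while spreading the two locations further apart increases it.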



# Some Useful Sources

[MLE and Fisher Information](https://minerva.it.manchester.ac.uk/~saralees/statbook2.pdf)

[Determinant for Maximization](https://onlinelibrary.wiley.com/doi/book/10.1002/9781119974017)

[Determinant as Volume](https://textbooks.math.gatech.edu/ila/determinants-volumes.html#:~:text=When%20A%20is%20a%202,adjacent%20sides%20of%20the%20parallelogram.)

[Parametrized Density Functions](https://courses.cs.washington.edu/courses/cse599i/18wi/resources/lecture8/lecture8.pdf)
