# Prereading

We start with some mostly optional material. Among these optional sections is one on **AutoDiff**, which will likely be of general interest and may also be a helpful review of the **chain rule** that is implicitly leveraged in the **logistic regression** example in the homework.

After the optional sections, the **Jacobian**, the **Hessian**, **Taylor series approximations**, and the **Gauss-Newton method** they produce are covered. Familiarity with these topics is again useful for the **logistic regression** example, which is built upon them.

The juxtaposition of the more **AutoDiff** oriented topics and the more **least squares** $Ax=b$ oriented topics is meant to create a clear contrast between these two domains. The homework then emphasizes the role of **Newton's method** in the latter context en route to introducing the statistically important **IRLS** method.

The notion of convexity creates a natural divide between the relevance of "modern" **gradient descent** and "classic" **Newton's method** but within the statistical domain, we are often working with concave likelihoods, so we statisticians should not be so quick to dispense with **Newton's method**.

---



### [OPTIONAL] Non-Analytical Derivative Numerical Approximation

[Finite differences numerical differentiation](https://en.wikipedia.org/wiki/Numerical_differentiation) 

$
\begin{align*}
\frac{\partial f(x_1 \cdots x_i \cdots x_m)}{\partial x_i} 
&={} \underset{h \rightarrow 0}{\lim} \frac{f(x_1 \cdots x_i+h \cdots x_m)-f(x_1 \cdots x_i \cdots x_m)}{h}\\
&\approx{}\frac{f(x_1 \cdots x_i^{(k)} \cdots x_m)-f(x_1 \cdots x_i \cdots x_m)}{x_i^{(k)}-x_i} \longrightarrow c \approx \frac{\partial f(x)}{\partial x_i}\\
& \quad \; {} \text{ as $x^{(k)}_i \longrightarrow  x_i$ for $k=1,2,\ldots$}
\end{align*}$

will be necessary for functions without known analytical derivatives. 
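
As a minimal sketch of the finite difference idea above, the following approximates a gradient one coordinate at a time with a small fixed step `h`. The helper name `finite_difference_gradient`, the test function `f`, and the step size are assumptions chosen for illustration only.

```python
import numpy as np

def finite_difference_gradient(f, x, h=1e-6):
    """Forward-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += h                  # perturb only coordinate i
        grad[i] = (f(x_step) - fx) / h  # difference quotient for dx_i
    return grad

# Hypothetical test function with known gradient (2*x_1, cos(x_2))
f = lambda x: x[0] ** 2 + np.sin(x[1])
x0 = np.array([1.5, 0.3])

fd_grad = finite_difference_gradient(f, x0)
exact_grad = np.array([2 * x0[0], np.cos(x0[1])])
```

The two results agree only up to an error that depends on `h`: too large a step gives truncation error, while too small a step gives floating point cancellation error.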

But if functional compositions are restricted to those with analytical ([derivative](https://en.wikipedia.org/wiki/Chain_rule_(probability))) [chain rule](https://en.wikipedia.org/wiki/Chain_rule) differentiations, then gradients can be derived algorithmically via the so-called [AutoDiff](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/readings/L06%20Automatic%20Differentiation.pdf) algorithm. 


## AutoDiff

The specifications of neural network frameworks are predicated on leveraging the analytical chain rule functionalities of AutoDiff.

The optimization of 
$\quad \displaystyle \min_{w,b} \sum_{i=1}^n\overbrace{\frac{1}{2}(y_i-h(x_i^Tw + b))^2}^{L_i(w,b) = f_1(f_2(f_3(f_4(w,b))))} \quad $ can proceed  


- on the basis of $\quad \frac{\partial L_i(w,b)}{\partial w} = \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}\frac{\partial f_4}{\partial w} \quad$ and $\quad \frac{\partial L_i(w,b)}{\partial b} = \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}\frac{\partial f_4}{\partial b}$

using the chain rule with the analytically known derivatives of the function decomposition

- thus providing $\quad w^{(k+1)} = w^{(k)} - \alpha \frac{\partial L(w^{(k)},b^{(k)})}{\partial w^{(k)}} \quad $  and $\quad  b^{(k+1)} = b^{(k)} - \alpha \frac{\partial L(w^{(k)},b^{(k)})}{\partial b^{(k)}}$ 



The [AutoDiff algorithm](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/readings/L06%20Automatic%20Differentiation.pdf) detects the functional decomposition simply as a computational order of operations decomposition, and then (by restricting all such computational steps to those with known analytical derivatives) collects the analytical partial derivative evaluations

$$\scriptsize \begin{array}{lllll} z_1=y_i-h(x_i^Tw + b) & z_2=h(x_i^Tw + b) & z_3=x_i^Tw + b & z_4=w & \tilde z_4=b \\
f_1(z_1) = \frac{z_1^2}{2} & f_2(z_2) = y_i-z_2 & f_3(z_3) = h(z_3) & f_4(z_4) =  x_i^Tz_4 + b & f_4(\tilde z_4) =  x_i^Tw + \tilde z_4 \\
\frac{\partial f_1}{\partial f_2} = f_1'(z_1) = z_1 & \frac{\partial f_2}{\partial f_3} = f_2'(z_2) = -1 & \frac{\partial f_3}{\partial f_4} = f_3'(z_3) = h'(z_3) & \frac{\partial f_4}{\partial w} = f_4'(z_4) = x_i & \frac{\partial f_4}{\partial b}= f_4'(\tilde z_4)=1\end{array}$$

and thus 

- $\frac{\partial L_i(w^{(k)},b^{(k)})}{\partial b^{(k)}} = \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}\frac{\partial f_4}{\partial b^{(k)}} = -(y_i-h(x_i^Tw^{(k)} + b^{(k)}))h'(x_i^Tw^{(k)} + b^{(k)})$
- $\frac{\partial L_i(w^{(k)},b^{(k)})}{\partial w^{(k)}} = \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}\frac{\partial f_4}{\partial w^{(k)}} = \frac{\partial L_i(w^{(k)},b^{(k)})}{\partial b^{(k)}}x_i$
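
The chain rule product above can be sketched directly in NumPy. This is a hand-written illustration of the decomposition, not an AutoDiff implementation; the sigmoid choice for $h$, the function names, and the random inputs are assumptions. Central finite differences are included as a sanity check.

```python
import numpy as np

def h(z):
    # assumed activation: the sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def h_prime(z):
    return h(z) * (1.0 - h(z))

def loss_i(w, b, x_i, y_i):
    return 0.5 * (y_i - h(x_i @ w + b)) ** 2

def grad_loss_i(w, b, x_i, y_i):
    z3 = x_i @ w + b                 # input to f_3
    z2 = h(z3)                       # input to f_2
    z1 = y_i - z2                    # input to f_1
    df1, df2, df3 = z1, -1.0, h_prime(z3)
    dL_db = df1 * df2 * df3 * 1.0    # df_4/db = 1
    dL_dw = dL_db * x_i              # df_4/dw = x_i
    return dL_dw, dL_db

rng = np.random.default_rng(0)
x_i, y_i = rng.normal(size=3), 1.0
w, b = rng.normal(size=3), 0.2
dL_dw, dL_db = grad_loss_i(w, b, x_i, y_i)

# central finite difference check
eps = 1e-6
fd_db = (loss_i(w, b + eps, x_i, y_i) - loss_i(w, b - eps, x_i, y_i)) / (2 * eps)
fd_dw = np.zeros(3)
for j, e in enumerate(np.eye(3)):
    fd_dw[j] = (loss_i(w + eps * e, b, x_i, y_i)
                - loss_i(w - eps * e, b, x_i, y_i)) / (2 * eps)
```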

### [OPTIONAL] Forward Pass and Backpropagation

The example above is for the final output of a regression neural network, while the entire functional decomposition of a neural network would include $L$ preceding "layers" of "feature engineering" of input feature $\tilde x_i$ leading to the above

$$x_i = h_L \circ (W_L \{ \cdots \{h_2 \circ (W_2\{h_1 \circ (W_1\tilde x_i + b_1)\} + b_2)\} \cdots \} + b_L)$$

and the recursively defined intermediate layer outputs 

$$x_i^l = h_l \circ (W_l x_i^{l-1}  + b_l)$$

The extension of AutoDiff back through this multilayer context is a straightforward (albeit tedious bookkeeping) exercise that begins analogously to the demonstration above but must now continue through $\frac{\partial x_i^l}{\partial x_i^{l-1}}$ in order to eventually arrive at

$$\frac{\partial x_i^l}{\partial W_l} = h_l' \circ (W_l x_i^{l-1}  + b_l) x_i^{l-1} \quad \textrm{ and } \quad \frac{\partial L_i(\cdots)}{\partial W_l} = \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}\frac{\partial f_4}{\partial x_i}\frac{\partial x_i}{\partial x_i^L}\frac{\partial x_i^{L}}{\partial x_i^{L-1}}\cdots \frac{\partial x_i^{l+1}}{\partial x_i^l}\frac{\partial x_i^l}{\partial W_l}$$

and 

$$ \frac{\partial x_i^l}{\partial b_l} = h_l' \circ (W_l x_i^{l-1}  + b_l)
 \quad \textrm{ and } \quad \frac{\partial L_i(\cdots)}{\partial b_l} =  \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}\frac{\partial f_4}{\partial x_i}\frac{\partial x_i}{\partial x_i^L}\frac{\partial x_i^{L}}{\partial x_i^{L-1}}\cdots \frac{\partial x_i^{l+1}}{\partial x_i^l}\frac{\partial x_i^l}{\partial b_l}$$

The sequential computation of the $x_i^l$ layers is known as the **forward pass** and is necessary since all the **partial derivatives** for **gradient descent** depend on these values. 

**Gradient descent** updates are then made to all the model parameters in the so-called **backpropagation** manner, meaning that as the sequential computation of the chain rule is completed, the gradients for the layers of the neural network become sequentially available and the parameters are updated as this occurs

1. $\frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\frac{\partial f_3}{\partial f_4}$ are computed
2. $\frac{\partial f_4}{\partial b^{(k)}}$ and $\frac{\partial f_4}{\partial w^{(k)}}$ are computed
    1. meaning $\frac{\partial L_i(w^{(k)},b^{(k)})}{\partial b^{(k)}}$ and $\frac{\partial L_i(w^{(k)},b^{(k)})}{\partial w^{(k)}}$ may now be computed 
    2. so $b^{(k+1)}$ and $w^{(k+1)}$ are thus now updated 
3. $\frac{\partial f_4}{\partial x_i}\frac{\partial x_i}{\partial b_L^{(k)}}$ and $\frac{\partial f_4}{\partial x_i}\frac{\partial x_i}{\partial W_L^{(k)}}$ are computed
    1. meaning $\frac{\partial L_i(\cdots)}{\partial b_L^{(k)}}$ and $\frac{\partial L_i(\cdots)}{\partial W^{(k)}_L}$ may now be computed 
    2. so $b^{(k+1)}_L$ and $W^{(k+1)}_L$ are thus now updated 
4. now sequentially for each $l = L-1, \cdots, 2, 1$ with $x_i = x_i^{L}$ and $\tilde x_i = x_i^{0}$

   $\frac{\partial x_i^{l+1}}{\partial x_i^l}\frac{\partial x_i^l}{\partial b_l}$ and $ \frac{\partial x_i^{l+1}}{\partial x_i^l}\frac{\partial x_i^l}{\partial W_l}$ are computed
    1. meaning $\frac{\partial L_i(\cdots)}{\partial b_l^{(k)}}$ and $\frac{\partial L_i(\cdots)}{\partial W^{(k)}_l}$ may now be computed 
    2. so $b^{(k+1)}_l$ and $W^{(k+1)}_l$ are thus now updated 
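
The forward pass and the backward accumulation of chain rule factors can be sketched for a network with a single hidden layer. This is a minimal illustration under assumed choices (sigmoid activations, the function names `forward` and `backward`, random parameters); it collects all gradients from one forward pass rather than performing the parameter updates, and it checks one entry against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x_tilde):
    W1, b1, w, b = params
    x = sigmoid(W1 @ x_tilde + b1)     # hidden layer output (engineered features)
    y_hat = sigmoid(x @ w + b)         # final output h(x^T w + b)
    return x, y_hat

def loss(params, x_tilde, y):
    _, y_hat = forward(params, x_tilde)
    return 0.5 * (y - y_hat) ** 2

def backward(params, x_tilde, y):
    W1, b1, w, b = params
    x, y_hat = forward(params, x_tilde)               # forward pass values
    delta_out = -(y - y_hat) * y_hat * (1 - y_hat)    # df1/df2 * df2/df3 * df3/df4
    dw, db = delta_out * x, delta_out                 # output layer gradients
    dx = delta_out * w                                # df4/dx continues the chain
    delta1 = dx * x * (1 - x)                         # through the hidden activation
    dW1, db1 = np.outer(delta1, x_tilde), delta1      # hidden layer gradients
    return dW1, db1, dw, db

rng = np.random.default_rng(1)
params = [rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4), 0.1]
x_tilde, y = rng.normal(size=3), 1.0
dW1, db1, dw, db = backward(params, x_tilde, y)

# central finite difference check of a single entry of W1
eps = 1e-6
W1_plus, W1_minus = params[0].copy(), params[0].copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
fd_W1_21 = (loss([W1_plus] + params[1:], x_tilde, y)
            - loss([W1_minus] + params[1:], x_tilde, y)) / (2 * eps)
```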




## [OPTIONAL] Stochastic Gradient Descent

The demonstration above is for **gradient descent** with a single data point, but strictly speaking interest lies in optimizing the surface over all available data points

$$0 = E_x\left[ \nabla_\theta g_x(\theta^*) \right] \approx \nabla_\theta \frac{1}{n} \sum_{i=1}^n g_{x_i}(\theta^*)$$

**Stochastic gradient descent (SGD)** divides a dataset up into small (often size $m=32 \ll n$) **batches**, each of which provides the gradient for a single update step

$$E_x\left[ \nabla_\theta g_x(\theta_{t-1}) \right] \approx \frac{1}{m} \sum_{i=1}^m \nabla_\theta g_{x_i}(\theta_{t-1}) = \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1}) \quad \longrightarrow \quad \theta_{t} = \theta_{t-1} - \alpha \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1})$$

and uses multiple passes (**epochs**) through these **batches** of the dataset to achieve an optimum. The **SGD** gradient $\nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1})$ avoids the volatility of the single data point gradient $\nabla_\theta g_{x_i}(\theta_{t-1})$ and better estimates the appropriate descent direction, while also avoiding the (often utterly intractable) computational burden of the full dataset gradient $\nabla_\theta \frac{1}{n} \sum_{i=1}^n g_{x_i}(\theta_{t-1})$.

> For $m \ll n$ there are huge computational savings in the $O(smp)$ "steps $\times$ samples $\times$ parameter gradients" computational cost of ***SGD***. 
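
A minimal sketch of the batch update described above, using a linear least squares loss so the batch gradient has a simple closed form. The synthetic data, the learning rate, the number of epochs, and the helper name `batch_gradient` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 2000, 3, 32                      # dataset size, parameters, batch size
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def batch_gradient(theta, X_batch, y_batch):
    # gradient of (1/(2m)) * ||y_batch - X_batch theta||^2 with respect to theta
    return -X_batch.T @ (y_batch - X_batch @ theta) / len(y_batch)

theta, alpha = np.zeros(p), 0.1
for epoch in range(20):                    # multiple passes through the data
    order = rng.permutation(n)             # reshuffle the batches each epoch
    for start in range(0, n, m):
        idx = order[start:start + m]
        theta = theta - alpha * batch_gradient(theta, X[idx], y[idx])
```

Each update touches only `m` rows of `X`, which is where the savings relative to the full dataset gradient come from.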

An initial objective in modern large parameter non-convex (neural network) optimization problems is to make sufficient initial progress while avoiding suboptimal **local optima** "traps" early in the optimization process. **Momentum** and **RMSprop** (and **Adam**) optimizers largely address this, and **SGD** further supports avoiding **local optima** through its stochastic nature. 

Finding "optimal out of sample performance" before **gradient norms** vanish is actually quite common, and typically indicates that **local optima** have been sufficiently avoided, and generalizable prediction capabilities have been achieved. While large parameter models are at major risk of highly idiosyncratic overfitting, this can be largely avoided through a careful and measured optimization process. 

### [OPTIONAL] Vanishing Gradients

Vanishing gradients are the primary challenge facing deep learning, and hence the primary obstacle to realizing the increased capacity for generalization that deep learning offers. 

If partial derivatives are less than $1$ then the chain rule product increasingly shrinks towards zero with each added neural network layer, which is the vanishing gradient problem. A vanishing gradient thus limits how "deep" neural networks can be since later gradients (deeper into the chain rule product) will be increasingly shrunk towards zero, making learning the parameters at deeper neural network layers a much slower process (and thus a much more intractable computational problem) than learning the parameters at earlier layers.
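
The shrinking chain rule product can be illustrated numerically. The sketch below makes a strong simplifying assumption (a scalar chain of sigmoid activations with unit weights, so that each layer contributes only its activation derivative, which is at most $0.25$); real networks also multiply by weight matrices.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
depths = [1, 5, 10, 20, 40]
pre_activations = rng.normal(size=max(depths))   # hypothetical layer inputs
layer_factors = sigmoid_prime(pre_activations)   # one chain rule factor per layer

# product of the first d factors, for each depth d
chain_products = [np.prod(layer_factors[:d]) for d in depths]
```

Since every factor is below $1$, the entries of `chain_products` decrease with depth, and by 40 layers the product is far below anything that would produce a meaningful gradient step.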

The gradient of the **relu activation function** is only ever 0 or 1, so it avoids contributing to the vanishing gradient problem globally in favor of creating some zero valued outputs. Standardizing outputs using **batch norm** means zero valued outputs become negative lower bounds on outputs, which "reactivates" their signals and corresponding potential for non-zero downstream gradients. The standardization also means gradient surfaces are more spherical (as opposed to variance elongated or correlation diagonalized ellipsoid) shaped, which equalizes the partial derivatives within a gradient, further reducing the potential for vanishing gradients along any particular axis. And **momentum** and **RMSprop** (and **Adam**) like algorithms further work to limit the vanishing gradients problem. 

Another recently introduced mechanism to even further reduce the vanishing gradient problem is the so-called **residual connection**, where the output of the layer is stacked along with the input to the layer centered by the output.

$$x_i^l = \left[\begin{array}{cc} \mu_i^l = h_l \circ (W_l x_i^{l-1}  + b_l)\\x_i^{l-1} - \mu_i^l\end{array}\right]$$

The purpose of these **residual connections** is that they make
$\partial x_i^{l}/ \partial x_i^{l-1} = 1$ and avoid any decay of a vanishing gradient from layer $l-1$ to $l$. Effectively, this allows the traditional feature engineering to still proceed as usual through $\mu_i^l$, while at the same time creating a path through which a non-decayed gradient can flow deeper into the neural network to keep deep parameters from experiencing vanishing gradient problems so they can continue moving towards an optimum through gradient descent. 

> While the **attention mechanism** has proved to be a powerful tool in deep learning architectures, it is the ability to create deep networks (through the other mechanisms noted above) which has quietly enabled the generalization capabilities of the **attention mechanism** to be leveraged (in conjunction with the significant data resources it leverages) 


## The Jacobian $J$ 

---

The **Hessian** $H_{f(z')}$ matrix of **second order partial derivatives** of $f(z)$ is (of course) distinct from the **Jacobian** $J$, which is a (different) matrix of **first order partial derivatives** for the **multivariate** $y = g(z)$ which maps $z \in {\rm I\!R}^p$ to $y\in {\rm I\!R}^q$.

The **Jacobian** orientation naturally concatenates, as columns, the partial derivatives of the vector output with respect to each input $z_j$

$$g(z) = \left[ \begin{array}{c}g_1(z)\\\vdots \\ g_q(z) \end{array}\right] \quad\quad \Longrightarrow \quad\quad J g(z') = \nabla_z^T g(z') = \left[ \begin{array}{c:c:c} \frac{\partial}{\partial z_1} g_1(z') & \longrightarrow & \frac{\partial}{\partial z_p} g_1(z') \\\vdots \\ \frac{\partial}{\partial z_1} g_q(z') & \longrightarrow &\frac{\partial}{\partial z_p} g_q(z') \end{array}\right]$$

where $y_i = g_i(z)$ is the $i^{th}$ element of the multivariate output of $g(z)$. Some other expressions of this are 

$$ [Jg(z')]_{ij} = \frac{\partial g_i(z')}{\partial z_j}  
 \quad \text{ or } \quad Jg(z') = \nabla_z^T g(z') = \begin{array}{c}\overset{y_1}{\underset{y_q}{\Bigg \downarrow}}\end{array} \overset{z_1 \overset{\partial}{\;-\!-\!-\!-\!-\!-\!-\!-\!\!\longrightarrow} \; z_p}{\left[ \begin{array}{ccc}
\frac{\partial g_1(z')}{\partial z_1} & \cdots & \frac{\partial g_1(z')}{\partial z_p}\\
\vdots & \ddots & \vdots \\
\frac{\partial g_q(z')}{\partial z_1} & \cdots & \frac{\partial g_q(z')}{\partial z_p}
\end{array} \right]} = \left[ \begin{array}{c}\nabla_z g_1(z')^T\\\vdots  \\ \nabla_z g_q(z')^T \end{array}\right]$$ 
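
To make the $q \times p$ orientation concrete, here is a small sketch that builds a Jacobian analytically and compares it against a column-by-column central difference approximation. The map `g` from ${\rm I\!R}^2$ to ${\rm I\!R}^3$ and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def g(z):
    # hypothetical map from R^2 (p = 2) to R^3 (q = 3)
    return np.array([z[0] * z[1], np.sin(z[0]), z[1] ** 2])

def jacobian_analytic(z):
    # row i is the transposed gradient of g_i; column j is dg/dz_j
    return np.array([[z[1],         z[0]],
                     [np.cos(z[0]), 0.0],
                     [0.0,          2 * z[1]]])

def jacobian_fd(g, z, h=1e-6):
    z = np.asarray(z, dtype=float)
    cols = []
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = h
        cols.append((g(z + step) - g(z - step)) / (2 * h))  # column j
    return np.column_stack(cols)

z_prime = np.array([0.7, -1.2])
J = jacobian_analytic(z_prime)
J_fd = jacobian_fd(g, z_prime)
```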



### The Jacobian, Hessian, and Multi-Multivariate Taylor Series Approximations 

---

As noted previously and now seen clearly from the definition of the **Jacobian**, the **Hessian** (requiring second order derivatives) is

$$H_{f(\theta^*)} = J\nabla_\theta f(\theta^*) = \nabla_\theta^T \nabla_\theta f(\theta^*) = \left[ \frac{\partial}{\partial \theta_1}\nabla_\theta f(\theta^*) \;\;\cdots\;\; \frac{\partial}{\partial \theta_j}\nabla_\theta f(\theta^*) \;\;\cdots\;\; \frac{\partial}{\partial \theta_p}\nabla_\theta f(\theta^*) \right]$$

For functions with both multivariate outputs *and inputs*...<br>the **first order multi-multivariate Taylor Series approximation** replaces the **gradient** with the **Jacobian** 

$$\underbrace{f(\theta) \approx f(\theta^*) + \nabla_\theta f(\theta^*)^T(\theta-\theta^*)}_{\text{when $f$ has multivariate input and univariate output}} \quad \text{ generalizes to } \quad \underbrace{f(\theta) \approx f(\theta^*) + J f (\theta^*)(\theta-\theta^*)}_{\text{when $f$ has both multivariate output } \textbf{and input}} $$

So the $i^{th}$ approximation vector element is the **first order Taylor Series approximation** for the $i^{th}$ univariate output $f_i$

$$\scriptsize\begin{align*}& \quad \min_\theta \frac{1}{2}||y - f_\theta(x)||_2^2 \\
&\approx {} \min_\theta \frac{1}{2}\big|\big| \,y - \big(\overbrace{f_x(\theta^*)}^{f_{\theta^*}(x)}+\overbrace{J f_x(\theta^*)}^{Jf_{\theta^*}(x)}(\theta-\theta^*)\big)\big|\big|_2^2 \\
&={} \min_\theta \frac{1}{2}\left( y - f_x(\theta^*) - J f_x(\theta^*)(\theta-\theta^*) \right)^T\left( y - f_x(\theta^*) - Jf_x(\theta^*)(\theta-\theta^*)\right) \\
&= {} \min_\theta \underbrace{ \frac{1}{2} (\theta-\theta^*)^T J f_x(\theta^*)^T  J f_x(\theta^*)(\theta-\theta^*) - (y -  f_x(\theta^*))^T J f_x(\theta^*)(\theta-\theta^*) + \frac{1}{2}||y - f_x(\theta^*)||_2^2
}_{g(\theta)}
\end{align*}$$

so the **Hessian** of a **least squares objective function** $g(\theta)$ 
for a **first order multi-multivariate Taylor series approximation** of prediction function $f_\theta(x) \equiv f_x(\theta)$ around $\theta^*$
is the **inner product** of the **Jacobian** $H_{g(\theta)} = \left(Jf_x(\theta^*)\right)^T\left(J f_x(\theta^*)\right)$ which depends only on first order derivatives.


## Gauss-Newton 

---

The previous approximation replaces a **nonlinear least squares** with an $Ax=b$ **least squares** problem 

$|| y - f_x(\theta^*) - J f_x(\theta^*)(\theta - \theta^*) || _2^2$

that can be expressed as $\hat \beta = (\tilde X^T \tilde X)^{-1} \tilde X^T \tilde y = \underset{\beta}{\arg\min}\,||\tilde y-\tilde X\beta ||_2^2$ 


$$\min_\theta \Bigg|\Bigg|\; \overbrace{\left[ \begin{array}{c}y_1\\\vdots\\y_i\\\vdots\\y_n\end{array}\right] - \left[ \begin{array}{c} f_{\theta^{(t)}}(x_1) \\\vdots\\f_{\theta^{(t)}}(x_i)\\\vdots\\f_{\theta^{(t)}}(x_n)\end{array}\right]}^{\tilde y^{(t)}} \;\; -  \overbrace{\left[ \begin{array}{c} (\nabla_\theta f_{\theta^{(t)}}(x_1))^T \\\vdots\\(\nabla_\theta f_{\theta^{(t)}}(x_i))^T\\\vdots\\(\nabla_\theta f_{\theta^{(t)}}(x_n))^T\end{array}\right]}^{\tilde X^{(t)} \,=\, J f_{x}(\theta^{(t)})}\overbrace{\left[ \begin{array}{c}\theta_1-\theta^{(t)}_1\\\vdots\\\theta_k-\theta^{(t)}_k\\\vdots\\\theta_p-\theta^{(t)}_p\end{array}\right]}^{{\tilde \beta^{(t+1)}_\Delta}} \; \Bigg|\Bigg|^2_2 \quad \text{ where } \quad f_\theta(x) \equiv f_x(\theta)$$

> $Jf_x(\theta^{(t)})$ here might be **artificially ill-conditioned** but if so the rows of the original problem could be scaled to mitigate this issue. The columns cannot be centered and scaled in this case as that would destroy the approximation. 

This would be solved rather than inverted, but nonetheless we have

$$\begin{align*}
\quad\;\tilde \beta^{(t+1)}_\Delta &= {}  \left((\tilde X^{(t)})^T\tilde X^{(t)}\right)^{-1} (\tilde X^{(t)})^T \tilde y^{(t)} \; \text{ or}\\ 
 \theta^{(t+1)} & = {}  \theta^{(t)} + \bigg(Jf_x(\theta^{(t)})^TJf_x(\theta^{(t)})\bigg)^{-1} Jf_x(\theta^{(t)})^T \tilde y^{(t)} \\
& = {}  \theta^{(t)} + \bigg(\sum_{i=1}^n \nabla_\theta f_{x_i}(\theta^{(t)}) [\nabla_\theta f_{x_i}(\theta^{(t)}) ]^T \bigg)^{-1} Jf_x(\theta^{(t)})^T \tilde y^{(t)}\\
 &={} \theta^{(t)} + H_{g_x(\theta^{(t)})}^{-1} \sum_{i=1}^n \nabla_\theta f_{x_i}(\theta^{(t)}) (\underbrace{y_i - f_{x_i}(\theta^{(t)})}_{\text{residual }i})
 \end{align*}$$

which can be updated as `𝜃[t+1] = 𝜃[t] + np.linalg.solve(H, grad.T@residuals)`.  

Note that this is exactly **Newton's method** for the **least squares** objective function where $f_\theta(x)$ is replaced with its **first order Taylor series approximation** since, as previously noted alongside the introduction of the **Jacobian**, the **Hessian** of this objective function is the inner product of the **Jacobian** of $f$. Note also that the **Jacobian** inner product is the sum of the outer products of the gradients, making the sum of the outer products of the gradients an approximation of the **Hessian** in the same **first order Taylor series approximation** sense. 

- **Modified Gauss-Newton** adds **step size factor** $\alpha$ for possible **backtracking** or improved **line search**.
- The **Gauss-Newton method** will probably not converge for poorly fitting models, but it will converge quickly when the model fits well or $f$ is nearly linear (assuming $Jf_x(\theta^{(t)})$ is **well conditioned**).
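
The update above can be sketched end to end on a small nonlinear least squares problem. The exponential model $f_\theta(x) = \theta_1 e^{\theta_2 x}$, the noise-free data, the starting value, and the fixed iteration count are all assumptions made for illustration; no step size or backtracking is included, so this is the plain (unmodified) method.

```python
import numpy as np

def f(theta, x):
    # assumed nonlinear model: theta_1 * exp(theta_2 * x)
    return theta[0] * np.exp(theta[1] * x)

def jacobian(theta, x):
    # n x p matrix whose i-th row is the gradient of f_theta(x_i) in theta
    e = np.exp(theta[1] * x)
    return np.column_stack([e, theta[0] * x * e])

x = np.linspace(0.0, 2.0, 30)
theta_true = np.array([2.0, -1.0])
y = f(theta_true, x)                  # noise-free data, so the model fits well

theta = np.array([1.5, -0.8])         # starting value reasonably near the solution
for t in range(20):
    residuals = y - f(theta, x)       # y tilde at iteration t
    J = jacobian(theta, x)            # X tilde at iteration t
    H = J.T @ J                       # Jacobian inner product
    theta = theta + np.linalg.solve(H, J.T @ residuals)
```

Because the model fits the data exactly here, the iterations settle quickly; with a poorly fitting model or a distant starting value the same loop may fail to converge, as noted above.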


# Lecture

First hour of class

---


Suppose $\quad f(x) \approx f(\tilde x) + (x - \tilde x) f'(\tilde x) + f''(\tilde x)\frac{(x-\tilde x)^2}{2} \quad $ then

- at $x \approx \tilde x = x_0$ a **root**, changes in $f(x)$ are proportional to changes in $x$ since $(\underbrace{x-\tilde x}_{\epsilon_{machine}})f'(\tilde x)$ dominates $f''(\tilde x)\frac{(x-\tilde x)^2}{2}$

- but at $x \approx \tilde x = x^*$ a  **(stationary point) optimum**, changes in $f(x)$ are proportional to squared changes in $x$ as given by $\frac{1}{2}(\underbrace{x-\tilde x}_{\sqrt{\epsilon_{machine}}})^2f''(\tilde x)$, since $f'(\tilde x \approx x^*)\approx 0$

Thus differentiating changes in $f(x)$ requires twice as much numeric resolution in $x$ near $x^*$, an **optimization problem solution**, as for $x$ near $x_0$, a **root**

- $(0.1)^2f''(x^*)$ corresponds to $(0.01)f'(\tilde x)$
- $(0.01)^2f''(x^*)$ corresponds to $(0.0001)f'(\tilde x)$

so there is about half as much numeric precision for differentiating function outputs near an **optimum** of a function as in a **linear regime** of a function. 
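
This loss of resolution is easy to observe in double precision. The sketch below compares a function in a linear regime with one at its minimum, both taking the value $1$ at $x=1$; the two functions and the perturbation sizes are assumptions chosen so the perturbation falls between $\epsilon_{machine}$ and $\sqrt{\epsilon_{machine}}$.

```python
import numpy as np

eps = np.finfo(float).eps                  # machine epsilon, about 2.2e-16

f_lin = lambda x: 2.0 * x - 1.0            # linear regime, f(1) = 1
f_opt = lambda x: (x - 1.0) ** 2 + 1.0     # minimum at x = 1, f(1) = 1

small = 1e-9                               # between eps and sqrt(eps)
lin_detects_small = f_lin(1.0 + small) != f_lin(1.0)
opt_detects_small = f_opt(1.0 + small) != f_opt(1.0)

larger = 1e-7                              # above sqrt(eps), about 1.5e-8
opt_detects_larger = f_opt(1.0 + larger) != f_opt(1.0)
```

The linear function registers the `1e-9` perturbation, whereas at the optimum the squared perturbation `1e-18` is lost when added to `1.0`; only perturbations on the order of $\sqrt{\epsilon_{machine}}$ or larger change the computed value.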

## The Score Function and Maximum Likelihood Estimation (MLE) 

---

The ***score function*** is the gradient of the ***log likelihood***

$$\nabla_\theta l(\theta') = \left( \frac{\partial l(\theta')}{\partial \theta_1}, \cdots, \frac{\partial l(\theta')}{\partial \theta_p} \right)^T
\quad \text{ where } \quad l(\theta) = \log f(x|\theta) \overset{iid}{=} \log \prod_{i=1}^n f(x_i|\theta)$$

**Maximum Likelihood Estimates** (**MLEs**) come from solving the system of (**nonlinear**) **score equations** which sets the **score function** equal to $\mathbf{0}$, and for the **true value** of the parameter $\theta^{\text{true}}$ the **score function** has expected value $\mathbf{0}$ (with respect to $f(x|\theta^{\text{true}})$ the distribution of the data)

$$\underbrace{\nabla_\theta l(\hat \theta) = \mathbf{0}}_\text{score equation} \quad \text{ and } \quad E_X \!\underbrace{\left[\nabla_\theta l(\theta^{\text{true}})\right]}_{\text{score function}} \!= \mathbf{0}$$

The expected value of the ***score function*** follows since

$$\scriptsize
\begin{align*}
E \left[\nabla_\theta l(\theta)\right] 
= {} & \int \nabla_\theta l(\theta) f(x|\theta) dx = \int \left( \frac{\partial l(\theta)}{\partial \theta_1}, \cdots, \frac{\partial l(\theta)}{\partial \theta_p} \right)^T f(x|\theta) dx\\
= {} &\int \left( \frac{1}{f(x|\theta)}\frac{\partial f(x|\theta)}{\partial \theta_1}, \cdots, \frac{1}{f(x|\theta)}\frac{\partial f(x|\theta)}{\partial \theta_p} \right)^T f(x|\theta) dx \\
= {} & \int \left( \frac{\partial f(x|\theta)}{\partial \theta_1}, \cdots, \frac{\partial f(x|\theta)}{\partial \theta_p} \right)^T dx = \int \nabla_\theta f(x|\theta) dx \\
= {}& \nabla_\theta \int  f(x|\theta) dx = \nabla_\theta \, 1 = \mathbf{0} \\
\end{align*}$$
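
The zero expectation of the score can be checked exactly for a simple discrete case. The sketch assumes a single Bernoulli observation with success probability `p_true` (a choice made only for illustration), so the expectation is a two-term sum.

```python
import numpy as np

def score(x, p):
    # derivative in p of log f(x|p) = x log(p) + (1 - x) log(1 - p)
    return x / p - (1 - x) / (1 - p)

p_true = 0.3
outcomes = np.array([0.0, 1.0])
probs = np.array([1 - p_true, p_true])     # f(x | p_true) for x = 0, 1

# expectation taken with respect to the true data distribution
expected_score_at_truth = np.sum(score(outcomes, p_true) * probs)
expected_score_elsewhere = np.sum(score(outcomes, 0.5) * probs)
```

The expectation vanishes when the score is evaluated at the true parameter, but not when it is evaluated at a different value such as `0.5`.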

## Fisher Information 

---

The **Fisher information matrix** $I(\theta^{\text{true}})$ or **expected Fisher information matrix** is the expected value of the **outer product** of the **score function** with itself and [is equal to](https://math.stackexchange.com/questions/3585130/why-is-the-fisher-information-matrix-both-an-expected-outer-product-and-a-hessia) the expected value of the negative of the **Hessian** of the log likelihood $l(\theta^{\text{true}}) =  \log f(x|\theta^{\text{true}})$ at the true value of the parameter $\theta^{\text{true}}$

$$\mathcal I(\theta^{\text{true}}) = E_X\left[\nabla_\theta l(\theta^{\text{true}})\nabla_\theta l(\theta^{\text{true}})^T\right] = E_X[-H_{l(\theta)}(\theta^{\text{true}})] \quad \text{ with respect to the distribution of the data } f(x|\theta^{\text{true}})$$

> **Fisher information** is the negative of the **expected hessian** since **Fisher information** is **positive definite** whereas the **hessian** is **negative definite** since **log likelihood (MLE) optimization** is a **maximization** problem
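
The equality of the two expressions for the Fisher information can be verified numerically for a one-parameter model. The sketch assumes a single Poisson observation with rate `lam` (an illustrative choice) and approximates the expectations by summing over a truncated support that carries essentially all of the probability mass.

```python
import numpy as np
from math import lgamma

lam = 3.0
xs = np.arange(0, 200)                               # truncated support
log_pmf = xs * np.log(lam) - lam - np.array([lgamma(k + 1) for k in range(200)])
pmf = np.exp(log_pmf)                                # f(x | lam)

score = xs / lam - 1.0                               # d/dlam of log f(x|lam)
neg_hessian = xs / lam ** 2                          # -d^2/dlam^2 of log f(x|lam)

info_outer = np.sum(score ** 2 * pmf)                # E[score * score^T]
info_hessian = np.sum(neg_hessian * pmf)             # E[-Hessian]
```

Both quantities agree with the known Poisson Fisher information $1/\lambda$.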

The **observed Fisher information** refers to
$$\begin{align*}
\text{ either } \quad \hat{\mathcal I( \theta)} = {} & -H_{l(\theta)}(\hat \theta) = \overbrace{-J\nabla_\theta l(\hat \theta) = - \sum_{i=1}^n J \nabla_\theta \log f(x_i|\theta)\big|_{\hat \theta}}^{\text{$J(\nabla_\theta l) (\hat \theta)$ }\textit{Jacobian}\text{ of }\textit{score function} \text{ evaluated at } \hat \theta} \\
\text{ or } \quad \hat{\mathcal I( \theta)} \approx {} & \sum_{i=1}^n \left(\nabla_\theta \log f(x_i|\theta)\big|_{\hat \theta}\right)\left(\nabla_\theta \log f(x_i| \theta)\big|_{\hat \theta}\right)^T  \quad \longleftarrow \quad \text{ score function outer product}\\
= {} & (J \log f(x|\theta)|_{\hat \theta})^TJ \log f(x|\theta)|_{\hat \theta} \quad \longleftarrow \quad \text{ inner product of the Jacobian of the log likelihood}\\
& \text{where row $i$ of $J \log f(x|\theta)|_{\hat \theta}$ is $\left(\nabla_\theta \log f(x_i|\theta)\big|_{\hat \theta}\right)^T$,}\\
& \text{which looks like the hessian of squared $L_2$ loss for nonlinear least squares}\\
& \text{where the nonlinear function is a first order Taylor series approximation...}
\end{align*}$$

And this is exactly correct for very good reason, because the [**asymptotic distribution** of the **MLE**](https://gregorygundersen.com/blog/2019/11/28/asymptotic-normality-mle/) is 

$$ p(\hat \theta) \overset{n \rightarrow \infty}{\longrightarrow} N\!\left(\theta^{\text{true}}, \Sigma = \frac{\mathcal I(\theta^{\text{true}})^{-1}}{n}\right) \approx N\!\left(\theta^{\text{true}}, \Sigma = \frac{\hat{\mathcal I( \theta)}{}^{-1}}{n} \approx \frac{\mathcal I(\theta^{\text{true}})^{-1}}{n}\right)$$

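where either plug-in estimate of the **expected Fisher information matrix** $\mathcal I(\theta^{\text{true}})$ (the **Hessian** form or the **score function** outer product form of the **observed Fisher information**) might be the preferred choice for a given context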

so then 

$$\Large
\begin{align*}
 -\log p(\hat \theta) &\propto{} \left(\hat \theta - \theta^{\text{true}} \right)^T  \hat{\mathcal I( \theta)} \left(\hat \theta - \theta^{\text{true}} \right)\\
& = {} \big|\big| J \log f(x|\theta)|_{\hat \theta} \left(\hat \theta - \theta^{\text{true}}\right) \big|\big|_2^2
\end{align*}$$


##  Fisher Scoring (and all the rest) [are just Newton's method]

---

In the MLE context where we're optimizing relative to $l_i(\theta) = \log f(x_i| \theta)$ the negative of the (positive definite) [Fisher information](https://math.stackexchange.com/questions/3585130/why-is-the-fisher-information-matrix-both-an-expected-outer-product-and-a-hessia) $I(\theta) = {E[\nabla_\theta l(\theta)\nabla_\theta l(\theta)^T]} = E[ {{-\underbrace{H_{l(\theta)}(\theta)}_{J(\nabla_\theta l(\theta))(\theta)}}}]$ can replace the **Hessian**, and doing so in an iterative manner is called **Fisher scoring**

$$
\begin{align*} 
\theta^{(t+1)} & {} = \theta^{(t)} + \hat{I(\theta^{(t)})}{}^{-1} \nabla_\theta l(\theta)|_{\theta^{(t)}} = \theta^{(t)} + \hat{I(\theta^{(t)})}{}^{-1} \sum_{i=1}^n \nabla_\theta l_i(\theta)|_{\theta^{(t)}}\\
& {} = \textstyle \theta^{(t)} + \left[\sum_{i=1}^n \nabla_\theta \log f(x_i|\theta)\big|_{\theta^{(t)}}\left(\nabla_\theta \log f(x_i| \theta)\big|_{\theta^{(t)}}\right)^T \right]^{-1} \left[\sum_{i=1}^n \nabla_\theta \log f(x_i|\theta)\big|_{\theta^{(t)}}\right]\\
& {} \overset{\text{or}}{=} \textstyle  \theta^{(t)} - \left[\sum_{i=1}^n J \nabla_\theta \log f(x_i|\theta) \big|_{\theta^{(t)}} \right]^{-1} \left[\sum_{i=1}^n \nabla_\theta \log f(x_i|\theta)\big|_{\theta^{(t)}}\right]\\
&{} \quad \text{approximating the negative expected }\textbf{Hessian }\text{with }\textbf{observed information}\\
& {} \approx \theta^{(t)} - H_{l(\theta)}(\theta^{(t)})^{-1}\nabla_\theta l(\theta^{(t)})  = \theta^{(t)} - J(\nabla_\theta l)(\theta^{(t)})^{-1}\nabla_\theta l(\theta^{(t)}) \\
\end{align*}$$

- Adding a learning rate to **Fisher scoring** or **Newton's method** makes them a "**damped**" version
- These matrices could also be **Tikhonov regularized** which would make them a "**modified**" version
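
As a minimal sketch of the iteration above, consider logistic regression, where the Hessian of the log likelihood does not depend on $y$ and so the expected and observed information coincide, making Fisher scoring and Newton's method the same update. The simulated data, the function names, and the fixed iteration count are assumptions for illustration; no damping or regularization is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
beta_true = np.array([-0.5, 1.0, -1.5])                      # hypothetical parameters
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def score(beta):
    # gradient of the Bernoulli log likelihood: X^T (y - mu)
    mu = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (y - mu)

def fisher_information(beta):
    # negative Hessian of the log likelihood: X^T diag(mu (1 - mu)) X
    mu = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (X * (mu * (1 - mu))[:, None])

beta = np.zeros(3)
for t in range(25):
    # theta[t+1] = theta[t] + I(theta[t])^{-1} * score(theta[t]), solved not inverted
    beta = beta + np.linalg.solve(fisher_information(beta), score(beta))
```

At convergence the score equations are satisfied, that is `score(beta)` is numerically zero.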


<!--

$$\begin{align*}
  M = -I(\theta^{(t)}) = {} & -E[\nabla_\theta l(\theta^{(t)})\nabla_\theta l(\theta^{(t)})^T] = E[H_{l(\theta)}l(\theta^{(t)})]\\
  \approx {} & -\sum_{i=1}^n \nabla_\theta log f_{x_i}(\theta^{(t)})\left(\nabla_\theta log f_{x_i}(\theta^{(t)})\right)^T \approx  H_{l(\theta)}l(\theta^{(t)}) 
  \end{align*}$$



> ***Fisher information*** $I(\theta^{(t)})$ is [***positive semi-definite***](https://stats.stackexchange.com/questions/49942/why-is-the-fisher-information-matrix-positive-semidefinite), so for some small step size factor $\alpha^{(t)}>0$
   > - the update $\theta^{(t+1)} = \theta^{(t)} + \underbrace{\alpha^{(t)}[I(\theta^{(t)})]^{-1}\nabla_\theta l(\theta^{(t)})}_{\text{will have the same sign as }\nabla_\theta g(\theta^{(t)})}$ 
   > - guarantees that $f(x^{(t+1)}) > f(x^{(t)})$
   >
   > and at a (local) maximum $\theta^*$ where $\nabla_\theta l(\theta^*)=0$, both $H_{l(\theta)}(\theta^*)$ and $E[H_{l(\theta)}(\theta^*)] = -I(\theta^*)$ will be ***negative semi-definite***.

-->

- **Gauss-Newton** above has the form $\displaystyle \underline{\theta^{(t+1)} = \theta^{(t)} + H_{g_x(\theta^{(t)})}^{-1} \sum_{i=1}^n \nabla_\theta f_{x_i}(\theta^{(t)}) ({y_i - f_{x_i}(\theta^{(t)})})}$

    - Adding a learning rate to **Gauss-Newton** is called **modified Gauss-Newton**

- **Gradient descent** generalized to ***stochastic gradient descent***
has the form $\displaystyle \underline{\theta^{(t+1)} = \theta^{(t)} - \alpha I \sum_{i=1}^m \nabla_\theta f_{x_i}(\theta^{(t)})}$

- **Batch norm** (considering $L_2^2$ for a single layer) and **RMSprop/Adam** (ignoring momentum) are $\displaystyle \underline{\theta^{(t+1)} = \theta^{(t)} - \alpha \text{Diag}(H_{f_x(\theta^{(t)})}^{-1}) \sum_{i=1}^m \nabla_\theta f_{x_i}(\theta^{(t)})}$
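
The common structure of these updates, a gradient multiplied by a different matrix in each case, can be sketched on a toy quadratic objective. This is only an illustration of the three matrix choices ($\alpha I$, $\text{Diag}(H^{-1})$, and the full $H^{-1}$); the objective, the step size, and the variable names are assumptions, and the code is not an implementation of RMSprop, Adam, or batch norm.

```python
import numpy as np

A = np.array([[10.0, 0.5], [0.5, 1.0]])    # Hessian of the toy objective
b = np.array([1.0, 1.0])

def objective(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def gradient(theta):
    return A @ theta - b

theta0 = np.zeros(2)
g0 = gradient(theta0)
H_inv = np.linalg.inv(A)

theta_gd = theta0 - 0.05 * np.eye(2) @ g0            # alpha * I (gradient descent)
theta_diag = theta0 - np.diag(np.diag(H_inv)) @ g0   # Diag(H^{-1}) scaling
theta_newton = theta0 - np.linalg.solve(A, g0)       # full H^{-1} (Newton's method)
```

On this quadratic, a single step with the full inverse Hessian lands exactly on the minimizer, the diagonal scaling gets most of the way there, and the identity-scaled step makes the least progress.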

