Math087 - Mathematical Modeling
===============================
[Tufts University](http://www.tufts.edu) -- [Department of Math](http://math.tufts.edu)  
[Arkadz Kirshtein](https://math.tufts.edu/people/facultyKirshtein.htm) <arkadz.kirshtein@tufts.edu>  
*Fall 2023*

Course material (Class 17): Calculus and nonlinear optimization background
----------------------------------------------------------------

In this notebook we review the basic concepts required to understand and implement nonlinear optimization algorithms. Let us start with the review of some vector calculus concepts.

Directional derivative and the gradient
------------------------------

In single-variable calculus, the derivative of a function $f$ at a point $p$ is defined as the limit of the ratio of the change in value, $f(p+h)-f(p)$, to the step size $h$, as $h$ tends to zero:
$$\lim_{h\rightarrow0}\frac{f(p+h)-f(p)}{h}$$
However, in multiple dimensions the points at distance $h$ from a given point are no longer just two (one on each side); they form a circle, sphere, or hypersphere, depending on the dimension. Consequently, there are infinitely many directions in which we could step a distance $h$. This leads to the concept of the directional derivative:

Consider a unit vector $\mathbf{v}$ (that is, $\|\mathbf{v}\|=1$). Then the directional derivative of a function $f:\ \mathbb{R}^{n}\mapsto \mathbb{R}$ at a point $\mathbf{p}$ in the direction of $\mathbf{v}$ is the limit $$\partial_{\mathbf{v}}f=\lim_{h\rightarrow0}\frac{f(\mathbf{p}+h\mathbf{v})-f(\mathbf{p})}{h}.$$

The next step is to compute the directional derivative. As it turns out, we can use the vector of derivatives taken along each coordinate axis, called partial derivatives. To take a partial derivative, one treats all variables except the one in focus as constants and takes the "regular" derivative with respect to that variable.

For example, to take the derivative with respect to $x$ of $f(x,y)=x^2y+xy^2$, one would treat $y$ as a constant and obtain
$$
\partial_x(x^2y+xy^2)=\partial_x(x^2)y+\partial_x(x)y^2=(2x)y+(1)y^2=2xy+y^2.
$$

In vector calculus, the gradient of a scalar-valued differentiable function $f$ of several variables is the vector field (or vector-valued function) $\nabla f$ whose value at a point $p$ is the vector whose components are the partial derivatives of $f$ at $p$. That is, for $f:\ \mathbb{R}^{n}\mapsto \mathbb{R}$, its gradient $\nabla f:\ \mathbb{R}^{n}\mapsto \mathbb{R} ^{n}$ is defined at the point $p=(x_{1},\dots ,x_{n})$ in $n$-dimensional space as the vector:
$$
\nabla f=
\left[\begin{array}{c}
\partial_{x_1}f\\
\partial_{x_2}f\\
\vdots\\
\partial_{x_n}f
\end{array}\right]
$$
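
As a quick sanity check, the gradient of the example function $f(x,y)=x^2y+xy^2$ can be compared against a numerical approximation. The sketch below is only illustrative: the helper name `numerical_gradient`, the step `h`, and the evaluation point $(1,2)$ are assumptions made for this example, and central differences are just one of several ways to approximate partial derivatives.

```python
def f(x, y):
    """Example function from the text: f(x, y) = x^2 y + x y^2."""
    return x**2 * y + x * y**2


def grad_f(x, y):
    """Analytic gradient [df/dx, df/dy], obtained by treating the other variable as constant."""
    return [2 * x * y + y**2, x**2 + 2 * x * y]


def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative by a central difference (hypothetical helper)."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # step forward along axis i only
        backward[i] -= h  # step backward along axis i only
        grad.append((func(*forward) - func(*backward)) / (2 * h))
    return grad


point = [1.0, 2.0]
print(grad_f(*point))                # analytic gradient at (1, 2)
print(numerical_gradient(f, point))  # should agree to several digits
```

The two printed vectors should match closely; a large mismatch usually signals a mistake in the hand-computed partial derivatives.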

Then, to obtain the directional derivative in an arbitrary direction $\mathbf{v}$, one takes the inner product of the direction vector with the gradient:
$$
\partial_{\mathbf{v}}f=\mathbf{v}\cdot\nabla f.
$$
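
The identity above can be checked numerically by comparing the difference quotient from the limit definition with the inner product $\mathbf{v}\cdot\nabla f$. This is a minimal sketch under stated assumptions: the point $(1,2)$, the unit vector $(3/5, 4/5)$, the step `h`, and the helper name `directional_derivative_fd` are all chosen here for illustration.

```python
def f(x, y):
    """Example function from the text: f(x, y) = x^2 y + x y^2."""
    return x**2 * y + x * y**2


def grad_f(x, y):
    """Analytic gradient of f."""
    return [2 * x * y + y**2, x**2 + 2 * x * y]


def directional_derivative_fd(func, p, v, h=1e-6):
    """Difference quotient (f(p + h v) - f(p)) / h for a small, fixed h."""
    shifted = [pi + h * vi for pi, vi in zip(p, v)]
    return (func(*shifted) - func(*p)) / h


p = [1.0, 2.0]
v = [3.0 / 5.0, 4.0 / 5.0]  # unit vector: (3/5)^2 + (4/5)^2 = 1

via_limit = directional_derivative_fd(f, p, v)
via_gradient = sum(vi * gi for vi, gi in zip(v, grad_f(*p)))
print(via_limit, via_gradient)  # the two values should nearly coincide
```

Because `h` is small but not zero, the difference quotient only approximates the limit, so the two numbers agree up to a small error rather than exactly.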

Additionally, because the inner product is related to the angle between vectors, the gradient vector gives the direction of the largest directional derivative (and its opposite gives the smallest), which is the basis for the nonlinear minimization technique discussed next.
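
To spell out this step, let $\theta$ be the angle between the unit vector $\mathbf{v}$ and the gradient, and assume $\nabla f\neq\mathbf{0}$ at the point in question. Then
$$
\partial_{\mathbf{v}}f=\mathbf{v}\cdot\nabla f=\|\mathbf{v}\|\,\|\nabla f\|\cos\theta=\|\nabla f\|\cos\theta,
$$
so the directional derivative is largest, equal to $\|\nabla f\|$, when $\theta=0$ (that is, $\mathbf{v}=\nabla f/\|\nabla f\|$), and smallest, equal to $-\|\nabla f\|$, when $\theta=\pi$ (that is, $\mathbf{v}=-\nabla f/\|\nabla f\|$).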

Gradient descent
---------

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.

*Gradient descent is generally attributed to Cauchy, who first suggested it in 1847. Hadamard independently proposed a similar method in 1907. Its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944, with the method becoming increasingly well-studied and used in the following decades, also often called steepest descent.*

Since the negative of the gradient of $f$ gives the direction of fastest decrease, the following iteration, with a suitably small step size $\gamma>0$, should theoretically lead to a local minimum:
$$
\mathbf{x}^{n+1}=\mathbf{x}^{n}-\gamma\nabla f(\mathbf{x}^{n}).
$$
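
The iteration above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the function name `gradient_descent`, the stopping rule based on the step length, the default value of `gamma`, and the quadratic test function $f(x,y)=(x-1)^2+2(y+3)^2$, whose minimum at $(1,-3)$ is known in advance.

```python
def gradient_descent(grad, x0, gamma=0.1, max_iter=1000, tol=1e-10):
    """Iterate x <- x - gamma * grad(x) until the update becomes tiny."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        x_new = [xi - gamma * gi for xi, gi in zip(x, g)]
        step = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if step < tol:  # stop once successive iterates barely differ
            break
    return x


def grad_quadratic(x):
    """Gradient of the hypothetical test function (x - 1)^2 + 2 (y + 3)^2."""
    return [2 * (x[0] - 1), 4 * (x[1] + 3)]


x_min = gradient_descent(grad_quadratic, [0.0, 0.0])
print(x_min)  # expected to be close to the known minimizer (1, -3)
```

For this test function each coordinate error shrinks by a constant factor per step (0.8 and 0.6 with `gamma=0.1`), so the iterates approach the minimizer geometrically.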

This works very similarly to Newton's method. However, since we are not looking for a zero of a function, we do not know how far to step. This is the reason for the additional parameter $\gamma$ (the step size), whose choice is a separate topic of discussion.
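
To see why the choice of $\gamma$ matters, consider the one-dimensional function $f(x)=x^2$ (an example chosen here, not taken from the course material). Its derivative is $2x$, so the iteration becomes $x^{n+1}=(1-2\gamma)x^{n}$, which shrinks toward the minimum at $0$ only when $|1-2\gamma|<1$. The sketch below, with the hypothetical helper `run`, tries three step sizes.

```python
def run(gamma, x0=1.0, steps=20):
    """Apply x <- x - gamma * f'(x) to f(x) = x^2, where f'(x) = 2x."""
    x = x0
    for _ in range(steps):
        x = x - gamma * 2 * x
    return x


for gamma in (0.1, 0.5, 1.1):
    # 0.1: error shrinks by a factor 0.8 per step
    # 0.5: lands on the minimum in a single step
    # 1.1: error grows by a factor 1.2 per step (divergence)
    print(gamma, run(gamma))
```

A step that is too small converges slowly, while a step that is too large overshoots the minimum and can diverge, which is why step-size selection deserves separate treatment.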

See the [Wikipedia article](https://en.wikipedia.org/wiki/Gradient_descent) on gradient descent for more details.

Nonlinear least squares
-------

While the linear version of least squares also relies on the gradient, it can be solved for the optimal value directly. In the nonlinear case, one would use gradient descent or a similar algorithm to find the optimum.

Specifically, consider a vector-valued function $\mathbf{F}:\ \mathbb{R}^{n}\mapsto \mathbb{R} ^{m}$
$$
\mathbf{F}(\mathbf{x})=\left[\begin{array}{c}
F_1(\mathbf{x})\\
F_2(\mathbf{x})\\
\vdots\\
F_m(\mathbf{x})
\end{array}\right],\quad \mathbf{x}=\left[\begin{array}{c}
x_1\\
x_2\\
\vdots\\
x_n
\end{array}\right].
$$

Then to find the $\mathbf{x}$ for which $\mathbf{F}(\mathbf{x})\approx \mathbf{y}$ as closely as possible, one would minimize $\left\|\mathbf{F}(\mathbf{x})- \mathbf{y}\right\|^2$, leading to the following descent algorithm:
$$
\mathbf{x}^{n+1}=\mathbf{x}^{n}-\gamma\nabla(\left\|\mathbf{F}(\mathbf{x}^n)- \mathbf{y}\right\|^2)=\mathbf{x}^{n}-2\gamma\nabla\mathbf{F}(\mathbf{x}^n)\left(\mathbf{F}(\mathbf{x}^n)- \mathbf{y}\right),
$$
where the factor $2$ comes from differentiating the square (it can be absorbed into $\gamma$), the superscript in $\mathbf{x}^n$ counts iterations and is unrelated to the dimension $n$, and $\nabla\mathbf{F}$ is the $n\times m$ matrix of partial derivatives (the transpose of the Jacobian of $\mathbf{F}$):
$$
\nabla\mathbf{F}(\mathbf{x})=\left[\begin{array}{cccc}
\partial_{x_1}F_1(\mathbf{x})&\partial_{x_1}F_2(\mathbf{x})&\dots&\partial_{x_1}F_m(\mathbf{x})\\
\partial_{x_2}F_1(\mathbf{x})&\partial_{x_2}F_2(\mathbf{x})&\dots&\partial_{x_2}F_m(\mathbf{x})\\
\vdots&\vdots&\ddots&\vdots\\
\partial_{x_n}F_1(\mathbf{x})&\partial_{x_n}F_2(\mathbf{x})&\dots&\partial_{x_n}F_m(\mathbf{x})
\end{array}\right]
$$
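
As an illustration of the descent algorithm for nonlinear least squares, the sketch below fits a two-parameter exponential model $F_i(\mathbf{x})=x_1e^{x_2t_i}$ to synthetic data. The model, the sample points `t_data`, the data generated from $y=2e^{-t}$, the starting guess, the step size, and the iteration count are all assumptions made for this example. The factor $2$ from differentiating the square is kept explicitly in the code; it could equally be absorbed into $\gamma$.

```python
import math

# Hypothetical data: samples of y = 2 * exp(-t), so the exact parameters are (2, -1).
t_data = [0.0, 0.25, 0.5, 0.75, 1.0]
y_data = [2.0 * math.exp(-t) for t in t_data]


def F(x):
    """Model values F_i(x) = x_1 * exp(x_2 * t_i), one per data point (m = 5)."""
    return [x[0] * math.exp(x[1] * t) for t in t_data]


def grad_F(x):
    """n-by-m matrix of partial derivatives: row j holds dF_i/dx_j for all i."""
    row_x1 = [math.exp(x[1] * t) for t in t_data]
    row_x2 = [x[0] * t * math.exp(x[1] * t) for t in t_data]
    return [row_x1, row_x2]


def loss(x):
    """Squared norm of the residual, ||F(x) - y||^2."""
    return sum((Fi - yi) ** 2 for Fi, yi in zip(F(x), y_data))


def fit(x0, gamma=0.02, steps=20000):
    """Gradient descent on ||F(x) - y||^2 with a fixed step size."""
    x = list(x0)
    for _ in range(steps):
        residual = [Fi - yi for Fi, yi in zip(F(x), y_data)]
        # Gradient of ||F(x) - y||^2 is 2 * grad_F(x) * (F(x) - y).
        grad = [2 * sum(d * r for d, r in zip(row, residual)) for row in grad_F(x)]
        x = [xi - gamma * gi for xi, gi in zip(x, grad)]
    return x


x_fit = fit([1.0, 0.0])
print(x_fit, loss(x_fit))  # parameters should approach (2, -1)
```

A fixed step size is the simplest choice and works for this small, well-behaved problem; other models or starting points may need a smaller `gamma` or a different method.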