In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# VANA Summary

What to include:
* Definition of partial differentiation
* Chain rule
* Hessian matrix
* Tangent plane equation
* Lagrange multiplier
* Taylor polynomial
* Definition of matrix definiteness
* Classification of critical points
* Local min/max
* Least squares formula
* Directional derivative

# Equations of Planes

https://youtu.be/YBajUR3EFSM

Assume a plane through the origin with normal vector $\vec n = <1, 5, 10>$. For a point $P = (x,y,z)$ in the plane, the vector $\vec{OP}$ has to be perpendicular to the normal vector, i.e. $\vec n \cdot \vec{OP} = 0$, which leads us to the plane equation:

$$
1x + 5y + 10z = 0
$$

Now assume a plane with the same normal vector $\vec n$ but going through the point $P_0 = (2,1,-1)$. For a point $P$ in the plane, the vector $\overline{P_0 P}$ has to be perpendicular to $\vec n$, so $\vec n \cdot \overline{P_0 P} = 0$. Now $\overline{P_0 P} = <x-2, y-1, z+1>$, which leads us to the equation:

$$
(x-2) + 5(y-1) + 10(z+1) = 0
$$

Which is equal to:

$$
x + 5y + 10z = -3
$$

What that means is that moving to a parallel plane does not change the left hand side of the equation, only the right hand side. That is because the coefficients on the left hand side are the components of the normal vector, and a parallel plane shares the same normal vector. The larger the absolute value of the right hand side, the farther away the plane is from the origin.

A $3 \times 3$ linear system is nothing more than three plane equations that require a point to lie in all planes at the same time. If a point has to be in two planes, you take the intersection of the planes, which (if they are not parallel) is a line. If you then intersect this line with a third plane, you will in general get a single point. If two planes are parallel, you will get either infinitely many solutions (the planes are identical) or no solution (the planes are distinct).

So assume the following linear system:

$$
A \cdot X = B
$$

The solution to this system is:

$$
X = A^{-1} \cdot B
$$

But what if there is no or infinitely many solutions, because the planes are parallel? The problem is you can not always calculate $A^{-1}$. Recall:

$$
A^{-1} = \frac{1}{det(A)} \cdot adj(A)
$$

$adj(A)$ can always be calculated, but if $det(A) = 0$ the division is undefined and $A^{-1}$ does not exist.
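As a rough numerical sketch of the above, NumPy can check the determinant before solving. The first row is the plane $x + 5y + 10z = -3$ from above; the other two rows and the helper name `solve_planes` are made-up assumptions for illustration:

```python
import numpy as np

# Hypothetical 3x3 system: row 1 is the plane x + 5y + 10z = -3 from above,
# rows 2 and 3 are made-up planes that also pass through (2, 1, -1).
A = np.array([[1.0, 5.0, 10.0],
              [2.0, -1.0, 0.0],
              [0.0, 1.0, 1.0]])
B = np.array([-3.0, 3.0, 0.0])

def solve_planes(A, B):
    """Return the unique intersection point, or None if det(A) is (numerically) 0."""
    if np.isclose(np.linalg.det(A), 0.0):
        return None  # no unique solution: A cannot be inverted
    return np.linalg.solve(A, B)

X = solve_planes(A, B)

# Replace row 2 by a multiple of row 1: two parallel planes, so det(A) = 0.
A_parallel = A.copy()
A_parallel[1] = 2.0 * A[0]
X_parallel = solve_planes(A_parallel, B)
```

Here `X` should come out as the point $(2, 1, -1)$, while the parallel variant has no unique solution and returns `None`.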

A homogeneous linear system is defined as planes going through the origin:

$$
A \cdot X = 0
$$

There is always an obvious solution to homogeneous systems, which is $(0,0,0)$ (also called the trivial solution), as all planes go through the origin. In general, a homogeneous system has either the one solution $(0, 0, 0)$ or infinitely many solutions. The case of no solution does not exist: if two planes are parallel they still share at least one point, because both go through the origin, and two parallel planes that share a point have to be identical.

# Equations of Lines

https://youtu.be/57jzPlxf4fk

One way to define a line is as the intersection of two planes, which can be given by two equations. Thinking of lines this way is not always handy because it takes effort to solve the equations. Instead think of a line as the trajectory of a moving point, which is described by a **parametric equation**.

Parametric equations are of the form

$$
Q(t) = Q_0 + t \cdot \vec{Q_0 Q_1} \\
Q(t) = \begin{cases}
x(t) = q_1 + t \cdot d_1 \\
y(t) = q_2 + t \cdot d_2 \\
z(t) = q_3 + t \cdot d_3
\end{cases}
$$

With $Q_0 = (q_1, q_2, q_3)$ and $\vec{Q_0 Q_1} = <d_1, d_2, d_3>$.

## Intersection of a Line With a Plane

Consider the plane $x + 2y + 4z = 7$ and the line given by the points $Q_0 = (-1,2,2)$ and $Q_1 = (1,3,-1)$. Where does the line intersect the plane? To answer this question, one could first ask: are these two points on the same side of the plane, on opposite sides, or in the plane itself? The last case, being in the plane, can be checked by putting the points into the equation:

$$
Q_0: x + 2y + 4z = -1 + 4 + 8 = 11 > 7 \\
Q_1: x + 2y + 4z = 1 + 6 - 4 = 3 < 7
$$

The result shows that neither of the points is in the plane. To decide whether they lie on the same side or on opposite sides of the plane, recall that the constant term of a plane equation determines how far away the plane is from the origin. Increasing the constant term moves the plane in the direction of the normal vector, decreasing it moves the plane in the opposite direction. As a consequence, if the left hand side evaluated at one point is smaller than the constant term and at the other point is bigger, the points lie on opposite sides, which is the case in this example. Another way to think of it: when you move along the line from $Q_0$ to $Q_1$, the value of $x + 2y + 4z$ changes continuously from $11$ to $3$, so it has to pass the value $7$ somewhere in between, and that is the intersection point.

To find the exact intersection point, one can plug in the functions for $x, y, z$ of the parametric equation into the plane equation.

$$
Q(t) = \begin{cases}
x(t) = -1 + 2t \\
y(t) = 2 + t \\
z(t) = 2 - 3t
\end{cases} \\
x + 2y + 4z = (-1 + 2t) + 2(2+t) + 4(2-3t) = 7
$$

Solving for $t$ gives $11 - 8t = 7$, so $t = \frac{1}{2}$, and the intersection point is $Q(\frac{1}{2}) = (0, \frac{5}{2}, \frac{1}{2})$.
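The same computation can be sketched in NumPy. Plugging $Q(t) = Q_0 + t \cdot \vec{Q_0 Q_1}$ into $\vec n \cdot Q = d$ gives $t = \frac{d - \vec n \cdot Q_0}{\vec n \cdot \vec{Q_0 Q_1}}$; the variable names below are illustrative only:

```python
import numpy as np

n = np.array([1.0, 2.0, 4.0])     # normal vector of the plane x + 2y + 4z = 7
d = 7.0                           # constant term of the plane
Q0 = np.array([-1.0, 2.0, 2.0])
Q1 = np.array([1.0, 3.0, -1.0])
direction = Q1 - Q0               # direction vector of the line

# n . (Q0 + t * direction) = d   =>   t = (d - n . Q0) / (n . direction)
# (this assumes n . direction != 0, i.e. the line is not parallel to the plane)
t = (d - n @ Q0) / (n @ direction)
intersection = Q0 + t * direction
```

A value of `t` between 0 and 1 confirms that the intersection lies between $Q_0$ and $Q_1$.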

## General Parametric Equations

More generally, parametric equations can be used for arbitrary motion in the plane or in space.

Consider a cycloid which is the trajectory of a point which is fixed on a moving wheel (check [wikipedia](https://en.wikipedia.org/wiki/Cycloid) for a visualization of the cycloid). The wheel has a radius of $a$ and is rolling on the x-axis. The point $P$ which is fixed on the wheel is initially at the origin but as time goes by it moves together with the rotating wheel. Can we find the position of this point in relation to time, which is $x(t), y(t)$?

Instead of using time as the parameter of the motion, we can also use the distance travelled or, more conveniently, the angle by which the wheel has rotated.

So let's find the parametric equation

$$
Q(\theta) = \begin{cases}
x(\theta) \\
y(\theta)
\end{cases}
$$

Now let $P$ be the point on the rotated wheel, $A$ the point where the wheel touches the x-axis and $B$ the center of the wheel.

$$
\vec{OP} = \vec{OA} + \vec{AB} + \vec {BP} \\
$$

$\vec{OA}$ is the distance travelled on the x-axis, which is equal to the arc length on the wheel from $A$ to $P$ (remember $a$ represents the wheel radius):

$$
\vec{OA} = <a \cdot \theta , 0>
$$

$\vec{AB}$ points straight up from $A$ to the center of the wheel:

$$
\vec{AB} = <0, a>
$$

And $\vec{BP}$ is

$$
\vec{BP} = <-a\ sin\ \theta, -a\ cos\ \theta>
$$

Which gives

$$
\vec{OP} = < a \theta - a\ sin\ \theta, a - a\ cos\ \theta>
$$

Which is the answer to the parametric equation

$$
Q(\theta) = \begin{cases}
x(\theta) = a \theta - a\ sin\ \theta \\
y(\theta) = a - a\ cos\ \theta
\end{cases}
$$

## Velocity Vector

https://youtu.be/0D4BbCa4gHo

Now if you want to know the velocity of the point at any given time, you take the derivative of each component of the parametric equation $Q$. For the cycloid with $a = 1$ and the wheel turning at unit rate, so that $\theta = t$:

$$
\vec v(t) = <1 - cos\ t, sin\ t>
$$

What you'll get is a parametric equation for velocity, the velocity vector. This vector will give you the speed and the direction at any given point in time.

Now ask yourself, when is the point fastest and when is it slowest? When $t = 0$ then $\vec v(0) = <0, 0>$, so the point is slowest when it touches the ground, with no velocity at all (which may be counterintuitive). On the other hand, if $t = \pi$ then $\vec v(\pi) = <2, 0>$, which tells us that the point moves fastest when it is $2a$ above the ground, turned by $180˚$ from its initial position. Why is that so? When the point is at the top of the wheel, the forward velocity of the wheel and the velocity due to rotation point in the same direction and add up. When the point is on the ground, the velocity due to rotation points opposite to the forward motion of the wheel and the two cancel out to 0.
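A small numerical sketch, assuming a wheel of radius $a = 1$ that turns at unit rate ($\theta = t$), compares the analytic velocity with a finite-difference derivative of the cycloid. The function names are made up for illustration:

```python
import numpy as np

def cycloid(t, a=1.0):
    """Position of the point on the wheel after rotating by angle t."""
    return np.array([a * t - a * np.sin(t), a - a * np.cos(t)])

def velocity(t, a=1.0):
    """Component-wise derivative of the cycloid with respect to t."""
    return np.array([a - a * np.cos(t), a * np.sin(t)])

# Central finite difference as an independent check of the derivative.
h = 1e-6
v_numeric_top = (cycloid(np.pi + h) - cycloid(np.pi - h)) / (2 * h)

speed_bottom = np.linalg.norm(velocity(0.0))   # point touches the ground
speed_top = np.linalg.norm(velocity(np.pi))    # point is at height 2a
```

With these assumptions the speed is 0 on the ground and 2 at the top of the wheel.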

## Acceleration Vector

$$
\vec a = \frac{d\vec v}{dt}
$$

For our example:

$$
\vec a(t) = <sin\ t, cos\ t>
$$

Now at the initial position $t = 0$ you will see $\vec a(0) = <0, 1>$, which means that the acceleration points straight up, perpendicular to the ground.

## Kepler's Second Law (1609)

The motion of a planet takes place in a plane, and the line from the sun to the planet sweeps out area at a constant rate. That means as soon as you know the orbit of a planet, you know how fast it moves along that orbit. Because the area swept out is the same for equal amounts of time, the planet has to move faster when it is closer to the sun and slower when it is farther away.

Newton later explained this using formulas for gravitational attraction.

But what is Kepler's law in terms of vectors?

Let $\vec r$ be the position vector of the planet (its current place on the orbit, with the sun at the origin). If the planet moves by a displacement $\Delta \vec r \approx \vec v \cdot \Delta t$ (distance travelled = velocity times time), the area swept out by that movement is approximately half the area of the parallelogram spanned by $\vec r$ and $\Delta \vec r$, which is given by the cross product $\vec r \times \Delta \vec r$. This leads us to the formula:

$$
Area \approx \frac{1}{2} |\vec r \times \Delta \vec r| \approx \frac{1}{2} |\vec r \times \vec v| \Delta t
$$

This holds when $\Delta t$ is small. The law says that area is swept out at a constant rate, so:

$$
|\vec r \times \vec v| = constant
$$

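Because the motion stays in one plane, the direction of $\vec r \times \vec v$ (the normal of that plane) is constant as well, so the whole vector $\vec r \times \vec v$ is constant. What Kepler's law says is: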

$$
\begin{equation}
\begin{split}
\vec r \times \vec v = constant & \iff \frac{d}{dt} (\vec r \times \vec v) = 0 \\
& \iff \frac{d \vec r}{dt} \times \vec v + \vec r \times \frac{d \vec v}{dt} = 0\ (product\ rule)  \\
& \iff \vec v \times \vec v + \vec r \times \vec a = 0 \\
& \iff \vec r \times \vec a = 0\ (because\ \vec v \times \vec v = 0) \\
& \iff \vec r \parallel \vec a
\end{split}
\end{equation}
$$

$\vec r$ is parallel to $\vec a$ because two vectors whose cross product is 0 are parallel. What Kepler's second law ultimately says is that the acceleration vector of a planet always points along the line of the position vector. That means the gravitational force always points towards the sun (Kepler's law alone would also allow the opposite direction, away from the sun, but as we know that is not what happens).

# Partial Derivative

The partial derivative is defined as:

$$
\frac{\partial f}{\partial x}(x_0, y_0) = \lim\limits_{\Delta x \rightarrow 0} \frac{f(x_0 + \Delta x, y_0) - f(x_0, y_0)}{\Delta x}
$$

Which means $y$ is treated as a constant, its rate of change is not considered. The same is done with respect to $y$, holding $x$ constant. In general, a function $f: \mathbb R^n \rightarrow \mathbb R$ has $n$ partial derivatives, one per variable.

An alternative way of writing a partial derivative is the subscript notation:

$$
\frac{\partial f}{\partial x} = f_x
$$

## Tangent Plane Approximation

One important formula is the approximation formula. How much does $f$ change when both $x$ and $y$ (or any number of variables) change at the same time? To first order, the effects of the individual variables on $f$ just add up. Mathematically speaking, for $z = f(x,y)$ the approximate change in $z$ can be described as:

$$
\Delta z \approx f_x \cdot \Delta x + f_y \cdot \Delta y
$$

The formula can be justified by the tangent plane approximation. We know that $f_x$ and $f_y$ are the slopes of tangent lines to the graph of $f$. Assume:

$$
\frac{\partial f}{\partial x}(x_0, y_0) = a \implies L_1 = \begin{cases}
z = z_0 + a \cdot (x - x_0) \\
y = y_0
\end{cases}
\\
\frac{\partial f}{\partial y}(x_0, y_0) = b \implies L_2 = \begin{cases}
z = z_0 + b \cdot (y - y_0) \\
x = x_0
\end{cases}
$$

Both of these lines $L_1, L_2$ are tangent to the graph of $f$ at $(x_0, y_0, z_0)$ with $z_0 = f(x_0, y_0)$, and together they determine a plane:

$$
z = z_0 + a \cdot (x - x_0) + b \cdot (y - y_0)
$$

The approximation formula says the graph of $f$ is close to its tangent plane near $(x_0, y_0)$.
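To get a feeling for how good the approximation is, here is a minimal sketch with a made-up function $f(x,y) = x^2 + 3xy$ (an assumption for illustration, not from the lecture), comparing the exact change $\Delta z$ with $f_x \Delta x + f_y \Delta y$:

```python
def f(x, y):
    return x**2 + 3 * x * y

def f_x(x, y):
    return 2 * x + 3 * y   # partial derivative with respect to x

def f_y(x, y):
    return 3 * x           # partial derivative with respect to y

x0, y0 = 1.0, 2.0
dx, dy = 0.01, -0.02       # small changes in both variables

delta_z_exact = f(x0 + dx, y0 + dy) - f(x0, y0)
delta_z_approx = f_x(x0, y0) * dx + f_y(x0, y0) * dy
approximation_error = abs(delta_z_exact - delta_z_approx)
```

The error is of second order in the step sizes, so halving `dx` and `dy` should shrink it by roughly a factor of four.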

## Min / Max

https://youtu.be/UYe98CcxPbs

Min/max points of a differentiable function occur where all partial derivatives are equal to 0, because then $f$ does not change to first order when $x$ and $y$ are moved by a small amount. Another way of thinking about it: at a minimum/maximum the slope changes its sign, from positive to negative or vice versa (the graph goes up and then down, which is basically what a hill is). To do that the slope has to cross 0, and that happens exactly at the top of the hill. This works the same way for $n$ variables: the slopes in the individual directions may flip signs differently, but they all have to flip at that particular min/max point, i.e. all partials have to be 0.

$$
f_x(x_0, y_0) = 0 \\
f_y(x_0, y_0) = 0
$$

In fact, the condition is necessary but not sufficient, as there are points which fulfill the condition but are not a min/max. Points fulfilling the condition are called critical points. The tangent plane at a critical point is horizontal, because the coefficients $a$ and $b$ are both $0$ and what is left is $z = z_0$:

$$
(x_0,y_0)\ is\ critical\ point \implies z = z_0 + a \cdot (x - x_0) + b \cdot (y - y_0) = z_0
$$

## Least Square Interpolation

Least squares interpolation is used to fit a function to some real world data. It allows us to make predictions based on past experience (data).

Given experimental data $(x_1, y_1), ..., (x_n, y_n)$, find the "best fit" line of the form $y = ax + b$. The unknowns in the equations are $a, b$ and not $x, y$. The task is to find the "best" $a, b$. This is done by minimizing a function of $a, b$ that measures the error with respect to the experimental data (i.e. how far the points are away from the line). There are many ways to measure the error. One way is to square the difference between $y_i$ and the prediction $a \cdot x_i + b$ and sum over all data points:

$$
D(a, b) = \sum_{i=1}^n [y_i - (a \cdot x_i + b)]^2
$$

Minimizing the total squared error/deviation means minimizing $D$, which means setting its partial derivatives to 0.

$$
\begin{equation}
\begin{split}
\frac{\partial D}{\partial a} & = \sum_{i=1}^n 2 \cdot [y_i - (a \cdot x_i + b)] \cdot (-x_i)\ & = 0 \\
\frac{\partial D}{\partial b} & = \sum_{i=1}^n 2 \cdot [y_i - (a \cdot x_i + b)] \cdot (-1) & = 0
\end{split}
\end{equation}
$$

The equations can be simplified by dividing by 2 and expanding:

$$
\begin{equation}
\begin{split}
\sum_{i=1}^n \left(x_i^2 a + x_i b - x_i y_i\right) & = 0 \\
\sum_{i=1}^n \left(x_i a + b - y_i\right) & = 0
\end{split}
\end{equation}
$$

Another re-grouping leads us to:

$$
\begin{equation}
\begin{split}
\left(\sum_{i=1}^n x_i^2 \right) a + \left(\sum_{i=1}^n x_i \right) b & = \sum_{i=1}^n x_i y_i \\
\left(\sum_{i=1}^n x_i\right) a + nb & = \sum_{i=1}^n y_i
\end{split}
\end{equation}
$$

This is a $2 \times 2$ linear system which can be solved for $(a, b)$. Remember that these sums over $x_i$ and $y_i$ are just numbers, because the $(x_i, y_i)$ are the given experimental data.
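The $2 \times 2$ system can be set up and solved directly. The following sketch uses made-up measurements (an assumption for illustration) and cross-checks the result against `np.polyfit`:

```python
import numpy as np

# Made-up measurements that roughly follow y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
n = len(x)

# Normal equations:
#   (sum x_i^2) a + (sum x_i) b = sum x_i y_i
#   (sum x_i)   a + n         b = sum y_i
M = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(M, rhs)

# Cross-check against NumPy's own degree-1 polynomial fit (slope, intercept).
a_ref, b_ref = np.polyfit(x, y, 1)
```

Both approaches minimize the same squared error $D(a, b)$, so the coefficients should agree up to rounding.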

The least square interpolation is not easy to do for any kind of function. Consider experimental data with an exponential tendency:
$$
\begin{equation}
\begin{split}
y & = c e^{ax} \\
\iff ln(y) & = ln(c) + ax
\end{split}
\end{equation}
$$

Doing the least squares fit directly with $c e^{ax}$ leads to equations that are hard to solve, but doing it for $ln(y) = ln(c) + ax$ is a linear problem in the unknowns $ln(c)$ and $a$.
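A minimal sketch of this log-transform trick, using synthetic data generated from assumed parameters $c = 2$ and $a = 0.5$ (chosen only for illustration):

```python
import numpy as np

c_true, a_true = 2.0, 0.5            # assumed parameters for the synthetic data
x = np.linspace(0.0, 4.0, 9)
y = c_true * np.exp(a_true * x)      # noise-free exponential data

# Fit a straight line to (x, ln y): ln(y) = ln(c) + a * x
slope, intercept = np.polyfit(x, np.log(y), 1)
a_fit = slope
c_fit = np.exp(intercept)
```

Note that with noisy data this minimizes the squared error of $ln(y)$, not of $y$ itself, so the result can differ slightly from a direct nonlinear fit.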

On the other hand, fitting $y = ax^2 + bx + c$ is already a linear $3 \times 3$ problem, because the unknowns $a, b, c$ appear linearly:

$$
D(a,b,c) = \sum_{i=1}^n [y_i - (ax_i^2 + bx_i + c)]^2
$$

# Implicit Differentiation

https://youtu.be/7eZVshlT33Q

In general, implicit differentiation means:

$$
\begin{split}
y &= f(x) \\
dy &= f'(x) dx
\end{split}
$$

https://youtu.be/sL6MC-lKOrw

Explicit differentiation means that one variable is given *explicitly*, like:

$$
y = x^2 + 2x + 3
$$

Given $x$, the value of $y$ can be calculated explicitly by evaluating the right hand side of the equation. Differentiation works as usual:

$$
\begin{split}
\frac{d}{dx} y & = \frac{d}{dx} (x^2 + 2x + 3) \\
\frac{dy}{dx} & = 2x + 2 
\end{split}
$$

Implicit differentiation on the other hand deals with an equation where $y$ is given only implicitly:

$$
x^2 + y^2 = 100
$$

This equation can be *implicitly* differentiated:

$$
\begin{split}
\frac{d}{dx} (x^2 + y^2) &= \frac{d}{dx} 100 \\
\frac{d}{dx} (x^2) + \frac{d}{dx} (y^2) &= 0 \\
2x + 2y \frac{dy}{dx} &= 0 \\
\frac{dy}{dx} &= - \frac{x}{y} \\
\end{split}
$$

Now $\frac{d}{dx} (y^2) = 2y \frac{dy}{dx}$ comes from the chain rule. You have to imagine $y^2$ as a separate function $g$:

$$
g = y^2
$$

If you differentiate $\frac{dg}{dx}$ the chain rule says:

$$
\frac{dg}{dx} = \frac{dg}{dy} \cdot \frac{dy}{dx}
$$

Chain rule refresher: informally, $dy$ stands for an infinitely small change in $y$, and the derivatives can be read as fractions of such changes, so $dy$ cancels out on the right hand side of the equation.
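The result $\frac{dy}{dx} = -\frac{x}{y}$ can be checked numerically. This sketch picks the point $(6, 8)$ on the upper half of the circle (an arbitrary choice) and compares the implicit slope with a finite-difference slope of the explicit function $y = \sqrt{100 - x^2}$:

```python
import numpy as np

def y_upper(x):
    """Upper half of the circle x^2 + y^2 = 100, solved explicitly for y."""
    return np.sqrt(100.0 - x**2)

x0 = 6.0
y0 = y_upper(x0)                     # the point (6, 8) lies on the circle

slope_implicit = -x0 / y0            # from implicit differentiation

h = 1e-6
slope_numeric = (y_upper(x0 + h) - y_upper(x0 - h)) / (2 * h)
```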

# Difference Between Derivative and Differential

The derivative stands for the **rate** of change of a function:

$$
\frac{df}{dx} = f'(x)
$$

The differential represents the **change** in a function with respect to a change in the variables:

$$
df = f'(x) dx
$$

In other words, the derivative measures the rate of change while the differential gives the actual change (using the rate of change, also known as the slope).

# Total Differential

The total differential includes all the contributions of the variables that cause the value of the function to change.

Given $f(x,y,z)$ the total differential is:

$$
\begin{split}
df &= f_x \cdot dx + f_y \cdot dy + f_z \cdot dz \\
   &= \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy + \frac{\partial f}{\partial z} dz
\end{split}
$$

The equation creates a relation between $x, y, z$ and $f$.

It is important to note that differentials have their own rules of manipulation, because they are neither scalars nor vectors nor matrices. Important: $df$ is NOT $\Delta f$. All the $d$-something can be imagined as placeholders into which you put values to get a tangent approximation.

What $df$ can do:
1. Encode how changes in $x, y, z$ affect $f$
2. Placeholder for small variations $\Delta x, \Delta y, \Delta z$ to get an approximation formula $\Delta f \approx f_x \Delta x + f_y \Delta y + f_z \Delta z$
3. Divide by something like $dt$ to get a rate of change when $x=x(t), y=y(t), z=z(t)$: $\frac{df}{dt} = f_x \frac{dx}{dt} + f_y \frac{dy}{dt} + f_z \frac{dz}{dt}$ (chain rule)

## Justify Product/Quotient Rule

The product rule can be justified using the total differential. Given:

$$
f = uv \\
u=u(t) \\
v=v(t)
$$ 

Calculate $\frac{d}{dt}f$. By definition of the total differential (and the rule that allows us to divide the differential by $dt$):

$$
\begin{split}
\frac{d(uv)}{dt} &= f_u \frac{du}{dt} + f_v \frac{dv}{dt} \\
              &= v \frac{du}{dt} + u \frac{dv}{dt} \\
              &= v u' + u v'
\end{split}
$$

The same for the quotient rule.

$$
\begin{split}
g &= \frac{u}{v} \\
u &= u(t) \\
v &= v(t) \\
\frac{d(u/v)}{dt} &= \frac{1}{v} \frac{du}{dt} + (-\frac{u}{v^2} \frac{dv}{dt}) \\
&= \frac{vu' - uv'}{v^2}
\end{split}
$$

## Chain Rule with more Variables

Let

$$
w = f(x,y) \\
x = x(u,v) \\
y = y(u,v)
$$

One way to deal with this situation is to substitute $x$ and $y$, such that $w = f(x(u,v), y(u,v))$, but sometimes that leads to expressions that are hard to differentiate. In those cases it can be easier to work with the derivatives and the chain rule in the first place.

Question: Express $\frac{\partial w}{\partial u}$, $\frac{\partial w}{\partial v}$ in terms of $\frac{\partial w}{\partial x}$, $\frac{\partial w}{\partial y}$, $\frac{\partial x}{\partial u}$, $\frac{\partial x}{\partial v}$, $\frac{\partial y}{\partial u}$, $\frac{\partial y}{\partial v}$.

Start by applying the total differential to $w$:

$$
dw = f_x dx + f_y dy
$$

The total differential can now be applied to $dx$ and $dy$ as well:

$$
dx = x_u du + x_v dv \\
dy = y_u du + y_v dv
$$

Which will lead us to:

$$
\begin{split}
dw &= f_x dx + f_y dy \\
   &= f_x (x_u du + x_v dv) + f_y (y_u du + y_v dv) \\
   &= (f_x x_u + f_y y_u) du + (f_x x_v + f_y y_v) dv
\end{split}
$$

From that we can conclude:

$$
\frac{\partial w}{\partial u} = f_x x_u + f_y y_u \\
\frac{\partial w}{\partial v} = f_x x_v + f_y y_v
$$
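As a sanity check, the two formulas can be compared with finite differences. The sketch below assumes $f(x,y) = xy$ with polar coordinates $x = u \cos v$, $y = u \sin v$ (a made-up example, not from the lecture):

```python
import numpy as np

def f(x, y):
    return x * y

def w(u, v):
    """w = f(x(u, v), y(u, v)) with x = u cos v, y = u sin v."""
    return f(u * np.cos(v), u * np.sin(v))

u0, v0 = 2.0, 0.7
x0, y0 = u0 * np.cos(v0), u0 * np.sin(v0)

# Partials of f, evaluated at (x0, y0), and partials of x, y at (u0, v0).
f_x, f_y = y0, x0
x_u, x_v = np.cos(v0), -u0 * np.sin(v0)
y_u, y_v = np.sin(v0), u0 * np.cos(v0)

# Chain rule
w_u_chain = f_x * x_u + f_y * y_u
w_v_chain = f_x * x_v + f_y * y_v

# Central finite differences of the substituted function
h = 1e-6
w_u_numeric = (w(u0 + h, v0) - w(u0 - h, v0)) / (2 * h)
w_v_numeric = (w(u0, v0 + h) - w(u0, v0 - h)) / (2 * h)
```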

# Gradient and Level Curves

The gradient of a function is defined as:

$$
\nabla f(x, y, z) = <\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}>
$$

Of course you can take any amount of variables.

For any given point $(x, y, z)$ a vector is defined by the gradient. I.e. the gradient creates a vector field. If $f: \mathbb R^n \rightarrow \mathbb R$ then $\nabla f: \mathbb R^n \rightarrow \mathbb R^n$.

Level curves of a function $f$ are given by the points for which the function equals a constant value $c$:

$$
f = c
$$

## Gradient Is Perpendicular to Level Curves

https://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/part-b-chain-rule-gradient-and-directional-derivatives/session-36-proof/MIT18_02SC_notes_19.pdf

Intuitive explanation: Consider a curve defined by a parameter $t$ that always stays on a specific level curve of a function, i.e. all points on this curve give the same constant value when put into the function. As a consequence, no matter how $t$ is changed, the rate of change of the function value along the curve is always 0. By the chain rule, a change in $t$ causes a change in all variables of the function, and each of these changes contributes to the change of the function value. The contributions sum to 0, and that sum can be written as the dot product of the gradient vector and the velocity vector of the curve.

Let $f: \mathbb R^n \rightarrow \mathbb R$ and $f(x) = c, x \in \mathbb R^n$ be a level set.

Claim: At every point $P$ of the level set, the gradient $\nabla f(P)$ is perpendicular to the level set, i.e. perpendicular to every curve in the level set that passes through $P$.

Let $\vec r(t) = <x_1(t), ..., x_n(t)>$ be a curve in the level set, such that $g(t) = f(x_1(t),...,x_n(t)) = c$, i.e. no matter how $t$ changes, $g(t)$ always stays at the constant $c$. In other words, its derivative is equal to 0. Based on the chain rule:

$$
\frac{d g}{d t} = \frac{\partial f}{\partial x_1} \frac{d x_1}{d t} + ... + \frac{\partial f}{\partial x_n} \frac{d x_n}{d t} = 0
$$

This sum can be written as a dot product:

$$
\frac{d g}{d t} = <\frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n}> \cdot <\frac{d x_1}{d t}, ..., \frac{d x_n}{d t}> = \nabla f \cdot \frac{d \vec r}{dt} = 0
$$

The velocity vector $\frac{d \vec r}{dt}$ is tangent to the curve. Since this holds for any curve in any level set, the gradient is always perpendicular to the level set.
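A quick numerical illustration, assuming $f(x,y) = x^2 + y^2$ (whose level curves are circles) and a circle of radius 3 as the curve $\vec r(t)$; both are made-up choices:

```python
import numpy as np

def grad_f(points):
    """Gradient of f(x, y) = x^2 + y^2, evaluated row-wise."""
    return 2.0 * points

R = 3.0
t = np.linspace(0.0, 2.0 * np.pi, 50)

# Curve on the level set f = R^2 and its velocity (tangent) vectors.
curve = np.stack([R * np.cos(t), R * np.sin(t)], axis=1)
tangent = np.stack([-R * np.sin(t), R * np.cos(t)], axis=1)

# Dot product of gradient and tangent at every sampled point.
dots = np.sum(grad_f(curve) * tangent, axis=1)
```

All entries of `dots` vanish up to rounding, which is the perpendicularity claimed above.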

# Differentiation Rules

Product rule

$$
f(x) = u(x) \cdot v(x) \implies f'(x) = (uv)' = u'v + uv'
$$

Quotient rule

$$
f(x) = \frac{u(x)}{v(x)} \implies f'(x) = \left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}
$$

Chain rule

$$
f(x) = u(v(x)) \implies f'(x) = u'(v(x)) \cdot v'(x)
$$

## Derivatives of Special Functions

* $(ln(x))' = \frac{1}{x}$
* $(sin\ x)' = cos\ x$
* $(cos\ x)' = -sin\ x$
* $(tan\ x)' = \frac{1}{cos^2\ x} = 1 + tan^2\ x$

# Definition Partial Derivative

$$
\frac{\partial f}{\partial x} (x_0, y_0) = \lim\limits_{\Delta x \rightarrow 0} \frac{f(x_0 + \Delta x, y_0) - f(x_0, y_0)}{\Delta x}
$$

# Directional Derivative

Let $\vec e$ be the unit vector pointing in the direction in which you want to differentiate. The directional derivative can be calculated as follows:

$$
\nabla f \cdot  \vec e 
$$

Using scalar product rules, one can make the following transformation:

$$
\begin{split}
\nabla f \cdot  \vec e &= \nabla f \cdot  \vec e \\
                       &= \left|\nabla f \right| \cdot \left|\vec e\right| \cdot cos\ \varphi \\
                       &= \left|\nabla f \right| \cdot cos\ \varphi
\end{split}
$$

$\left|\nabla f \right| \cdot cos\ \varphi$, where $\varphi$ is the angle between $\nabla f$ and $\vec e$, becomes largest when $\varphi = 0$, in other words when $\vec e$ and $\nabla f$ point in the same direction. This implies that $\nabla f$ always points in the direction of the function's steepest increase. Similarly $\left|\nabla f \right| \cdot cos\ \varphi$ becomes smallest when $\varphi = \pi = 180˚$, i.e. the direction opposite to the gradient is the direction of the function's steepest decrease.
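This can be illustrated by brute force: evaluate $\nabla f \cdot \vec e$ for many unit vectors and see which one gives the largest value. The gradient below belongs to the made-up function $f(x,y) = x^2 + 3xy$ at the point $(1, 2)$, an assumption for illustration:

```python
import numpy as np

grad = np.array([8.0, 3.0])   # gradient of f(x, y) = x^2 + 3xy at (1, 2)

# Unit vectors in evenly spaced directions around the circle.
angles = np.linspace(0.0, 2.0 * np.pi, 3601)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Directional derivative for each direction: grad . e
derivs = directions @ grad

steepest_ascent = directions[np.argmax(derivs)]
steepest_descent = directions[np.argmin(derivs)]
unit_grad = grad / np.linalg.norm(grad)
```

Up to the angular resolution of the scan, `steepest_ascent` matches `unit_grad`, `steepest_descent` matches `-unit_grad`, and the largest directional derivative equals $|\nabla f|$.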

# Chain Rule

Let $x = x(u,v)$, $y=y(u,v)$ and $f = f(x,y) = f(x(u,v), y(u,v))$.

Then the chain rule says:

$$
\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial u}
$$

And for $v$ respectively.

$$
\frac{\partial f}{\partial v} = \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial v} + \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial v}
$$

# Hessian Matrix

The Hessian matrix $H_f$ is a square matrix that contains all second order partial derivatives of a function. For $f: \mathbb R^2 \rightarrow \mathbb R$:

$$
H_f = \begin{pmatrix}  
f_{xx} & f_{xy} \\ 
f_{yx} & f_{yy} 
\end{pmatrix}
$$

Likewise the Hessian matrix of a function $f$ with 3 variables is a $3 \times 3$ matrix, and so forth.

According to Schwarz's theorem, if the second partial derivatives are continuous, the mixed partials are equal:

$$
f_{xy} = f_{yx}
$$

# Definition Matrix Definiteness

Let $M$ be a symmetric matrix. Let $\lambda$ range over the eigenvalues of $M$.

Definiteness of $M$ is defined as:
* $\forall \lambda > 0 \iff positive\ definite$
* $\forall \lambda < 0 \iff negative\ definite$
* $\forall \lambda \geq 0 \iff positive\ semi\ definite$
* $\forall \lambda \leq 0 \iff negative\ semi\ definite$
* $Indefinite$ if there are both positive and negative eigenvalues

For a diagonal matrix the eigenvalues are simply the diagonal elements. For a general symmetric matrix the diagonal elements alone do not determine definiteness.

# Critical Point

$p \in \mathbb R^n$ is a critical point of $f: \mathbb R^n \rightarrow \mathbb R$ if:

$$
\nabla f(p) = 0
$$

# Critical Point Classification

Let $f: \mathbb R^n \rightarrow \mathbb R$. Let $p \in \mathbb R^n$ be a critical point, i.e. $\nabla f(p) = 0$. Let $H$ be the hessian matrix of $f$, i.e. $H(p)$ is the evaluated hessian matrix at point $p$.

The critical point can be classified by looking at the definiteness of the evaluated hessian matrix.

* $H(p)\ is\ positive\ definite \implies local\ minimum$
* $H(p)\ is\ negative\ definite \implies local\ maximum$
* $H(p)\ is\ indefinite \implies saddle\ point$
* Otherwise there's lack of information for classification
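The classification can be sketched in a few lines, assuming definiteness is judged by the signs of the eigenvalues of the (symmetric) Hessian. The function name `classify_critical_point` and the example matrices are made up for illustration:

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from its evaluated Hessian H (symmetric)."""
    eig = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "local minimum"           # positive definite
    if np.all(eig < -tol):
        return "local maximum"           # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"            # indefinite
    return "inconclusive"                # semi definite: not enough information

# Hessians at the origin for some simple functions:
H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])      # f = x^2 + y^2
H_dome = np.array([[-2.0, 0.0], [0.0, -2.0]])    # f = -x^2 - y^2
H_saddle = np.array([[1.0, 2.0], [2.0, 1.0]])    # f = (x^2 + y^2)/2 + 2xy
H_flat = np.array([[2.0, 0.0], [0.0, 0.0]])      # f = x^2

results = [classify_critical_point(H) for H in (H_bowl, H_dome, H_saddle, H_flat)]
```

`H_saddle` has only positive diagonal entries but eigenvalues $-1$ and $3$, so it is indefinite.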

# Function Convexity (Global Min/Max)

Let $f: D \subset \mathbb R^n \rightarrow \mathbb R$ be twice differentiable on a convex set $D$, and let $H$ be the Hessian matrix of $f$.

$$
\begin{split}
\forall p \in D:\ &H(p)\ is\ positive\ semi\ definite &\implies f\ is\ convex \\
\forall p \in D:\ &H(p)\ is\ positive\ definite &\implies f\ is\ strictly\ convex \\
\forall p \in D:\ &H(p)\ is\ negative\ semi\ definite &\implies f\ is\ concave \\
\forall p \in D:\ &H(p)\ is\ negative\ definite &\implies f\ is\ strictly\ concave \\
\end{split}
$$

By implication of the function's convexity:

$$
\begin{split}
p\ is\ a\ critical\ point\ of\ f\ and\ f\ is\ convex &\implies p\ is\ a\ global\ min \\
p\ is\ a\ critical\ point\ of\ f\ and\ f\ is\ concave &\implies p\ is\ a\ global\ max \\
\end{split}
$$


# Tangent Plane

* Let $f: \mathbb R^n \rightarrow \mathbb R$
* Let $x \in \mathbb R^n$
* Let $f_{T,P}$ be the tangent plane of $f$ at point $P \in \mathbb R^n$.


$$
z = f_{T,P}(x) = f(P) + \nabla f(P) \cdot (x-P)
$$

Note that the components of $\nabla f(P)$ are the slopes in each coordinate direction. At the same time they are the coefficients of the plane variables $x_1, x_2, ...$ (via the dot product in the equation). Moving $z$ to the other side gives $\nabla f(P) \cdot (x-P) - (z - f(P)) = 0$, so the normal vector of the plane in $\mathbb R^{n+1}$ is $<\nabla f(P), -1>$. Since the plane is tangent to the function's graph, this normal vector is also perpendicular to the graph at the point $(P, f(P))$.

Another way to think about it: Since the components of $\nabla f(P)$ are the slopes in each coordinate direction (each holding all other variables constant), each slope defines a tangent line to the function's graph at $(P, f(P))$. These lines together form a plane.

# Lagrange Multiplier

Lagrange multipliers are used to find the critical points of a function $f$ subject to a constraint $g(x) = c$.

* Let $f: \mathbb R^n \rightarrow \mathbb R$ be the function
* Let $g: \mathbb R^n \rightarrow \mathbb R$ be the constraint function

The constrained critical points $P \in \mathbb R^n$ of $f$ are those which fulfill the following equation:

$$
\nabla f(P) = \lambda \cdot \nabla g(P)
$$

The equation gives you $n$ scalar equations, but together with $\lambda$ we have $n+1$ unknowns. The last equation is provided by the constraint $g(P) = c$ itself.

That means the gradients of the function and the constraint are parallel to each other. That happens when the level curves of $f$ and $g$ are tangent to each other at $P$ (i.e. they share the same tangent line there), because the gradient is always perpendicular to the level curves.

## Example

https://youtu.be/15HVevXRsBA

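Consider the constraint $f = xy = 3$ and find the point on this curve whose distance to the origin is smallest. How do you measure the distance to the origin? Minimizing the distance is the same as minimizing the squared distance $g = x^2 + y^2$, so $g$ has to become as small as possible. (Compared to the general formulation above, the roles of $f$ and $g$ are swapped here, which does not matter because the condition $\nabla f = \lambda \nabla g$ is symmetric up to replacing $\lambda$ by $\frac{1}{\lambda}$.)

Here's a plot of various $g$ values (black circles) as well as the hyperbola $y = \frac{3}{x}$.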

![](lagrange_visualization.png)

At $x^2 + y^2 = 20$ you can see that the circle intersects $xy = 3$, so solutions exist. But you can decrease $g$ down to $x^2 + y^2 = 6$ and still get a solution. Going further down to $x^2 + y^2 = 3$ leads to no intersections, hence no solutions. The minimum is found where the level curves of $g$ and $f$ are tangent to each other. As we know, the gradient of a function is perpendicular to its level curves. The blue line is the $f=3$ level curve while the black circle is the $g=6$ level curve. If they are tangent to each other at the point $(x,y)$, we can conclude that their gradient vectors at $(x,y)$ are parallel to each other. That is what $\nabla f(P) = \lambda \cdot \nabla g(P)$ expresses.
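The example can be double-checked by brute force. This sketch scans the branch $x > 0$ of the hyperbola (the scan range and resolution are arbitrary assumptions), finds the smallest $g$, and tests whether the two gradients are parallel there:

```python
import numpy as np

# Scan the branch x > 0 of the constraint xy = 3 and evaluate g = x^2 + y^2.
xs = np.linspace(0.5, 6.0, 100001)
ys = 3.0 / xs
g_vals = xs**2 + ys**2

i = np.argmin(g_vals)
x_best, y_best, g_min = xs[i], ys[i], g_vals[i]

# Gradients at the minimizer: grad f = (y, x), grad g = (2x, 2y).
grad_f = np.array([y_best, x_best])
grad_g = np.array([2.0 * x_best, 2.0 * y_best])

# Parallel 2D vectors have a vanishing determinant ("2D cross product").
cross = grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0]
lam = grad_f[0] / grad_g[0]   # lambda in grad f = lambda * grad g
```

The scan lands near $x = y = \sqrt 3$ with $g \approx 6$ and $\lambda \approx \frac{1}{2}$, in line with the tangency argument.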

# Taylor Polynomial

Taylor polynomials are approximations to a function.

## Taylor Approximation 1d

Let $f: \mathbb R \rightarrow \mathbb R$ be $(n+1)$ times differentiable and let $x_0 \in \mathbb R$ be the expansion point. The $n$-th Taylor polynomial together with its remainder is given by:

$$
f(x) = \sum_{k=0}^n \frac{f^{(k)}(x_0)}{k!} \cdot (x - x_0)^k + R_{n,x_0}(x)
$$

$R$ is the Lagrange remainder and accounts for the deviation of the approximation. For well-behaved (analytic) functions, the Taylor series with $n \rightarrow \infty$ is *equal* to the actual function $f$. But as $n$ is a finite number, infinitely many terms of the Taylor series are left out, and the remainder compensates for exactly this error.

$$
R_{n,x_0}(x) = \frac{(x-x_0)^{n+1}}{(n + 1)!} \cdot f^{(n+1)}(\hat x (x))
$$

$\hat x$ depends on $x$ and is unknown. $\hat x$ is always between $x_0$ and $x$.
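As a small worked example (chosen for illustration, not from the lecture), take $f = \sin$ with $x_0 = 0$ and $n = 3$. Every derivative of $\sin$ is bounded by 1 in absolute value, so the remainder satisfies $|R_{3,0}(x)| \leq \frac{|x|^4}{4!}$ without knowing $\hat x$:

```python
import math

def taylor_sin(x, n):
    """Taylor polynomial of sin around x0 = 0 up to degree n."""
    # The derivatives of sin at 0 cycle through 0, 1, 0, -1.
    derivs_at_0 = [0.0, 1.0, 0.0, -1.0]
    return sum(derivs_at_0[k % 4] / math.factorial(k) * x**k
               for k in range(n + 1))

x, n = 0.5, 3
approx = taylor_sin(x, n)                  # x - x^3 / 6
error = abs(math.sin(x) - approx)          # actual approximation error

# Upper bound from the Lagrange remainder, using |f^(n+1)| <= 1.
bound = abs(x) ** (n + 1) / math.factorial(n + 1)
```

The actual error stays below the bound. Here it is considerably smaller, because the degree-4 term of the sine series is zero.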

## Taylor Approximation Multivariate

Let $f: D \subset \mathbb R^n \rightarrow \mathbb R$ be a 3 times differentiable function and $p \in D$ be the expansion point. The second degree Taylor expansion of $f$ is:

$$
f(x) = f(p) + \nabla f(p) \cdot (x-p) + \frac{1}{2} (x-p)^T \cdot H(p) \cdot (x-p) + R_p^{(3)}(x)
$$

$H(p)$ is the Hessian matrix evaluated at $p$ and $R_p^{(3)}$ is the third-order remainder.

Explicit form for a function with 2 variables $f: \mathbb R^2 \rightarrow \mathbb R$, being approximated at $p=(x_0,y_0) \in \mathbb R^2$:

$$
f(x, y) \approx f(p) + f_x(p) (x-x_0) + f_y(p) (y-y_0) + \frac{1}{2} (f_{xx}(p) (x-x_0)^2 + 2 f_{xy}(p) (x-x_0)(y-y_0) + f_{yy}(p) (y-y_0)^2)
$$
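A minimal sketch of the second degree approximation in matrix form, assuming the example function $f(x,y) = e^x \sin y$ expanded at $p = (0,0)$ (a made-up choice; the gradient and Hessian were worked out by hand):

```python
import numpy as np

def f(x, y):
    return np.exp(x) * np.sin(y)

p = np.array([0.0, 0.0])
f_p = 0.0                           # f(0, 0)
grad_p = np.array([0.0, 1.0])       # (f_x, f_y) = (e^x sin y, e^x cos y) at p
H_p = np.array([[0.0, 1.0],         # f_xx = e^x sin y,  f_xy = e^x cos y
                [1.0, 0.0]])        # f_yx = e^x cos y,  f_yy = -e^x sin y

def taylor2(point):
    """Second degree Taylor approximation of f around p (remainder dropped)."""
    d = point - p
    return f_p + grad_p @ d + 0.5 * d @ H_p @ d

point = np.array([0.1, 0.1])
approx = taylor2(point)             # y + x*y at (0.1, 0.1)
exact = f(*point)
```

Close to $p$ the two values agree well; the gap is what the third-order remainder accounts for.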

## Remainder Term

If the Taylor series of a (well-behaved) function is extended to infinitely many terms, one gets back the exact function. If the series is limited to finitely many terms, giving a Taylor polynomial of degree $n$, infinitely many terms are missing for an exact representation of the function (which is why it is only an approximation). The remainder $R$ in the Taylor approximation is a function that gives the exact error of the approximation. For the $n$-th Taylor polynomial, the remainder requires the $(n+1)$-th derivative of the function.

$$
R_n(x) = \frac{f^{(n+1)}(\hat x)}{(n+1)!}(x - x_0)^{n+1}
$$

That means that adding the error to the approximation gives not just an approximation but the exact function.

$$
f(x) = \sum_{k=0}^n \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k + R_n(x)
$$

The difficulty lies in finding the right $\hat x$. It is known that $\hat x$ has to lie between $x$ and $x_0$ (the expansion point). This makes it possible to determine an upper error bound by estimating which value $|R_n(x)|$ cannot exceed. If, for example, $f^{(n+1)}$ consists only of products of sines and cosines, it is certain that its absolute value can never exceed $1$.

# Sigmoid Function

The sigmoid function is defined as:

$$
\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}
$$

The derivative is:

$$
\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
$$

# Softplus Function

The softplus function is defined as:

$$
s(x) = ln(1+e^x)
$$

The derivative is the sigmoid function:

$$
s'(x) = \sigma(x)
$$
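Both derivative identities can be checked numerically with central finite differences. This is a sketch; the sample points and step size are arbitrary choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))       # ln(1 + e^x)

x = np.linspace(-4.0, 4.0, 17)
h = 1e-6

# Central finite differences approximate the derivatives.
sigmoid_numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
softplus_numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)

# Closed forms from the text.
sigmoid_closed = sigmoid(x) * (1.0 - sigmoid(x))
softplus_closed = sigmoid(x)
```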

# Kronecker delta

The Kronecker delta is a function of two variables (usually integer indices):

$$
\delta_{ij} = \begin{cases}
0\ if\ i \neq \ j \\
1\ else
\end{cases}
$$

# Limit for e

$$
\lim\limits_{n \rightarrow \infty} \left(1 + \frac{x}{n}\right)^n = e^x
$$

# Logarithm Function

The logarithm function is defined on $\mathbb R^+$ (excluding 0). Its range is $]-\infty, \infty[$.

$$
f(x) = ln\ x
$$

![](logarithmus_funktion.png)

The logarithm of $x$ answers the question: $e$ to the power of which number gives $x$?

Negative values are reached as soon as $0 < x < 1$:

$$
ln\ \frac{1}{e} = ln\ e^{-1} = -1 \cdot ln\ e = -1
$$