# Partials etc. involving matrices

## Preliminaries

In [4]:
#%matplotlib widget
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import sympy

### A few ways to get test numpy arrays

In [6]:
np.arange(3), np.arange(4,8), np.arange(5,1,-2)

(array([0, 1, 2]), array([4, 5, 6, 7]), array([5, 3]))

For experiments with multiplication, arrays of primes may be helpful:

In [7]:
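def arangep(n, starting_index=0):
    # Return the n primes that start at zero-based position starting_index in the list of primes.
    # Note: this reads sympy's private sieve._list attribute after extending the sieve.
    sympy.sieve.extend_to_no(starting_index + n)
    return np.array(sympy.sieve._list[starting_index:starting_index + n])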

In [8]:
arangep(5), arangep(4,2)

(array([ 2,  3,  5,  7, 11]), array([ 5,  7, 11, 13]))

In [27]:
M = arangep(4).reshape(2,2)
x = arangep(2,4)
# x = np.arange(2)+1
M,x

(array([[2, 3],
        [5, 7]]),
 array([11, 13]))

## Einstein summation notation

NumPy provides [Einstein summation](https://mathworld.wolfram.com/EinsteinSummation.html) operations with [einsum](https://numpy.org/devdocs/reference/generated/numpy.einsum.html). The convention follows three rules:
1. Repeated indices are implicitly summed over.
1. Each index can appear at most twice in any term.
1. Each term must contain identical non-repeated indices.

In [28]:
es = np.einsum

 $$a_{ik}a_{ij} \equiv \sum_{i} a_{ik}a_{ij}$$

In [29]:
es('ij,j', M, x), es('ij,i', M, x)

(array([ 61, 146]), array([ 87, 124]))
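The two calls above rely on einsum's implicit output. As a small sketch (the names `y_right` and `y_left` are introduced here for illustration only), the same contractions can be written with explicit output indices, which makes it easier to see which index is summed:

```python
import numpy as np

M = np.array([[2, 3], [5, 7]])
x = np.array([11, 13])

# 'ij,j->i' sums over the repeated index j: y_i = sum_j M_ij x_j, the same as M @ x
y_right = np.einsum('ij,j->i', M, x)

# 'ij,i->j' sums over the repeated index i: z_j = sum_i M_ij x_i, the same as M.T @ x
y_left = np.einsum('ij,i->j', M, x)

print(y_right, M @ x)
print(y_left, M.T @ x)
```

Summing over the first index of `M` instead of the second is what turns the product into a multiplication by the transpose.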

___

# Partials

## Preliminaries

A matrix __M__ multiplies a (column) vector __x__ to its right to produce a (column) vector __y__:
$$ \normalsize \mathbf{M} \mathbf{x} = \mathbf{y} $$
where
$$ \normalsize
\mathbf{x} = \sum_{j=1}^{n} x_j \mathbf{\hat{x}}_j \\
\mathbf{y} = \sum_{i=1}^{m} y_i \mathbf{\hat{y}}_i
$$
and $\mathbf{M}$ can be written
$$ \normalsize
\begin{bmatrix}
    m_{1,1} & \dots & m_{1,n} \\
    \vdots & \ddots & \vdots \\
    m_{m,1} & \dots & m_{m,n}
\end{bmatrix} \\
$$

A `python` example:

In [30]:
y = M @ x
y

array([ 61, 146])

Using Einstein summation notation, $y_i = m_{ij}x_j$

In [31]:
np.einsum('ij,j', M, x)

array([ 61, 146])

## Partial derivative of a matrix multiply of a vector

Wikipedia [defines](https://en.wikipedia.org/wiki/Partial_derivative#Formal_definition) the partial derivative thus: \
Let _U_ be an open subset of $\mathbb{R}^n$ and ${\displaystyle f:U\to \mathbb {R} }$ a function. The partial derivative of _f_ at the point ${\displaystyle \mathbf {a} =(a_{1},\ldots ,a_{n})\in U}$ with respect to the _i_-th variable $x_i$ is defined as

$$
\begin{align}
\frac{\partial }{\partial x_i }f(\mathbf{a}) & = \lim_{h \to 0} \frac{f(a_1, \ldots , a_{i-1}, a_i+h, a_{i+1}, \ldots ,a_n) -
f(a_1, \ldots, a_i, \dots ,a_n)}{h} \\ 
& = \lim_{h \to 0} \frac{f(\mathbf{a}+he_i) -
f(\mathbf{a})}{h} \tag{2.1}
\end{align}
$$

Where $f$ is linear, $f(\mathbf{a}+he_i) = f(\mathbf{a}) + f(he_i) = f(\mathbf{a}) + h f(e_i)$, and we have
$$ \begin{align}
\frac{\partial }{\partial x_i }f(\mathbf{a}) &= \lim_{h \to 0} \frac{f(\mathbf{a}+he_i) - f(\mathbf{a})}{h} \\
 & = \lim_{h \to 0} \frac{f(\mathbf{a}) + h f(e_i) - f(\mathbf{a})}{h} \\
 & = \lim_{h \to 0} \frac{h f(e_i)}{h} \\
 & = \lim_{h \to 0} {f(e_i)} \\
 &= f(e_i) \tag{2.2}
\end{align}
$$
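A quick numerical check of (2.2), offered as a sketch rather than a proof: it assumes the linear map is multiplication by a fixed matrix and applies the difference quotient from (2.1) to each output component. The names `basis`, `quotients` and `images` are illustrative.

```python
import numpy as np

M = np.array([[2.0, 3.0], [5.0, 7.0]])
a = np.array([11.0, 13.0])   # point at which the partials are taken
h = 1e-3

def f(v):
    # A linear map, used here as the example f
    return M @ v

basis = np.eye(a.size)                                  # rows are e_1, e_2
quotients = [(f(a + h * e) - f(a)) / h for e in basis]  # difference quotient from (2.1)
images = [f(e) for e in basis]                          # f(e_i), which is column i of M

for q, img in zip(quotients, images):
    print(q, img)
```

For a linear map the quotient does not depend on `h` or on `a` (up to rounding), which is why no limit is needed in (2.2).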

### $\partial\mathbf{y} / \partial\mathbf{x}$

How does vector $\mathbf{y}$ vary with vector $\mathbf{x}$, with $\mathbf{M}$ held constant? I.e. what is $\partial\mathbf{y}/\partial\mathbf{x}$?

With
$$ %\normalsize
\mathbf{x} = \sum_{j=1}^{n} x_j \mathbf{\hat{x}}_j, \;\;
\mathbf{y} = \sum_{i=1}^{m} y_i \mathbf{\hat{y}}_i
$$

The matrix equation $\mathbf{y} = \mathbf{M} \mathbf{x}$ can be written as
$$ \normalsize
\begin{align}
\mathbf{y} &= \sum_i y_i \mathbf{\hat{y}}_i 
  = \mathbf{M}\mathbf{x}  \tag{2.3} \label{mmul}
\end{align}
$$
where
$$ \normalsize
\begin{align}
y_i &= f_i(x_1, x_2, \dots x_n) \\[6pt]
  &= \sum_j m_{ij}x_j \tag{2.4}
\end{align}
$$

We have
$$ \normalsize
\begin{align}
 \frac{\partial\mathbf{y}}{\partial\mathbf{x}}
 &= \frac{\partial\sum_{i=1}^{m} y_i \mathbf{\hat{y}}_i}{\partial\mathbf{x}} \\[10pt]
 &= \frac{\partial\sum_{i=1}^{m} f_i(x_1, x_2, \dots x_n) \mathbf{\hat{y}}_i}{\partial\mathbf{x}} \\[10pt]
 &= \sum_{i=1}^{m} \frac{\sum_{j=1}^{n} \partial(m_{ij}x_j) \mathbf{\hat{y}}_i}{{\partial x_j} \mathbf{\hat{x}_j}} \\[10pt]
 &= \sum_{i=1}^{m}
     \sum_{j=1}^{n} 
      \frac{\partial(m_{ij}x_j)}
           {\partial x_j} 
        \frac{\mathbf{\hat{y}}_i}{\mathbf{\hat{x}_j}}  \\[10pt]
 &= \sum_{i=1}^{m}
     \sum_{j=1}^{n} m_{ij}
      \frac{\partial x_j}
           {\partial x_j} 
        \frac{\mathbf{\hat{y}}_i}{\mathbf{\hat{x}_j}}  \\[10pt]
 &= \sum_{i=1}^{m}
     \sum_{j=1}^{n} m_{ij}
      \frac{\mathbf{\hat{y}}_i}{\mathbf{\hat{x}_j}}  \\[10pt]
\end{align}
$$

The basis vectors for $\partial\mathbf{y} / \partial\mathbf{x}$ are $\mathbf{\hat{y}}_i / \mathbf{\hat{x}_j}$. We can array the components in a matrix to say \
\
$$ \normalsize
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
%\large
\begin{bmatrix}
m_{1,1}\frac{\mathbf{\hat{y}}_1}{\mathbf{\hat{x}_1}} & \cdots &
m_{1,n}\frac{\mathbf{\hat{y}}_1}{\mathbf{\hat{x}_n}} \\
\vdots & \ddots & \vdots \\
m_{m,1}\frac{\mathbf{\hat{y}}_m}{\mathbf{\hat{x}_1}} & \cdots &
m_{m,n}\frac{\mathbf{\hat{y}}_m}{\mathbf{\hat{x}_n}}
\end{bmatrix}
$$
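The component matrix above can be checked symbolically. The following is a minimal sketch using sympy's `Matrix.jacobian`; the sizes `m = 2`, `n = 3` and the symbol names are assumptions chosen for illustration, and the basis-vector ratios are left out so that only the coefficients $m_{ij}$ are compared.

```python
import sympy

m, n = 2, 3  # illustrative sizes: y has m components, x has n

# Symbolic m x n matrix with entries m_11 ... m_23, and a symbolic column vector x
M = sympy.Matrix(m, n, lambda i, j: sympy.Symbol(f"m_{i+1}{j+1}"))
x = sympy.Matrix(sympy.symbols(f"x_1:{n+1}"))

y = M * x            # y_i = sum_j m_ij x_j
J = y.jacobian(x)    # J[i, j] = d y_i / d x_j

print(J)
print(J == M)
```

Each entry of the Jacobian is the single coefficient $m_{ij}$, since every other term in $y_i$ is constant with respect to $x_j$.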

Then
\
$$ \normalsize
\partial \mathbf{y} =
%\large
\begin{bmatrix}
m_{1,1}\frac{\mathbf{\hat{y}}_1}{\mathbf{\hat{x}_1}} & \cdots &
m_{1,n}\frac{\mathbf{\hat{y}}_1}{\mathbf{\hat{x}_n}} \\
\vdots & \ddots & \vdots \\
m_{m,1}\frac{\mathbf{\hat{y}}_m}{\mathbf{\hat{x}_1}} & \cdots &
m_{m,n}\frac{\mathbf{\hat{y}}_m}{\mathbf{\hat{x}_n}}
\end{bmatrix}
\partial \mathbf{x}
$$
and
$$ \normalsize
\begin{align}
\partial \mathbf{x} &=
%\large
\begin{bmatrix}
m_{1,1}\frac{\mathbf{\hat{y}}_1}{\mathbf{\hat{x}_1}} & \cdots &
m_{1,n}\frac{\mathbf{\hat{y}}_1}{\mathbf{\hat{x}_n}} \\
\vdots & \ddots & \vdots \\
m_{m,1}\frac{\mathbf{\hat{y}}_m}{\mathbf{\hat{x}_1}} & \cdots &
m_{m,n}\frac{\mathbf{\hat{y}}_m}{\mathbf{\hat{x}_n}}
\end{bmatrix}^\mathsf{T}
\partial\mathbf{y} \\[10pt]
&=
%\large
\begin{bmatrix}
m_{1,1}\frac{\mathbf{\hat{x}}_1}{\mathbf{\hat{y}_1}} & \cdots &
m_{m,1}\frac{\mathbf{\hat{x}}_1}{\mathbf{\hat{y}_m}} \\
\vdots & \ddots & \vdots \\
m_{1,n}\frac{\mathbf{\hat{x}}_n}{\mathbf{\hat{y}_1}} & \cdots &
m_{m,n}\frac{\mathbf{\hat{x}}_n}{\mathbf{\hat{y}_m}}
\end{bmatrix}
\partial\mathbf{y}
\end{align}
$$

Approximating ([2.1](#mjx-eqn-partial)) numerically with our example:

In [32]:
M, (M@(x + np.array([0.001, 0])) - M@x) / 0.001, (M@(x + np.array([0, 0.001])) - M@x) / 0.001

(array([[2, 3],
        [5, 7]]),
 array([2., 5.]),
 array([3., 7.]))

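Test numerically that $\mathbf{M}(\mathbf{x} + \boldsymbol{\epsilon}) - \mathbf{M}\mathbf{x} = \mathbf{M}\boldsymbol{\epsilon}$ for a small random $\boldsymbol{\epsilon}$, i.e. that $\partial\mathbf{y} = \mathbf{M}\,\partial\mathbf{x}$. Find the maximum squared error over 1000 random trials: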

In [33]:
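def linearity_error(M, x, veps):
    # Squared norm of (M(x + veps) - Mx) - M veps; zero up to rounding for a linear map
    err = (M@(x + veps) - M@x) - M@veps
    return err.dot(err)

max(linearity_error(np.random.randn(2,2), np.random.randn(2), np.random.randn(2) * 0.001)
    for i in range(1000))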

1.4982115870801231e-30

### $\partial\mathbf{y} / \partial\mathbf{M}$

How does vector $\mathbf{y}$ vary with matrix $\mathbf{M}$, with vector $\mathbf{x}$ held constant? I.e. what is $\partial\mathbf{y}/\partial\mathbf{M}$?

From (2.4):
$$\begin{align}
 y_i &= \sum_j m_{ij}x_j \\
 \partial y_i &= \sum_j \partial m_{ij}x_j \\
% \frac{\partial y_i}{\partial M_{ij}} &= 2
\end{align}
$$

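Collecting the components $\partial y_i$ into a vector, with $\mathbf{x}$ held constant, and writing the quotient informally: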
$$
 \partial\mathbf{y} = \partial\mathbf{M}\mathbf{x} \\
 \frac{\partial\mathbf{y}}{\partial\mathbf{M}} = \mathbf{x}
$$
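Written out per component, $\partial y_i / \partial m_{kj} = \delta_{ik} x_j$, so the full derivative is a three-index array rather than a vector. The sketch below builds that array with einsum and compares it to finite differences; the names `T`, `T_num` and the index order `[i, k, j]` are choices made here for illustration.

```python
import numpy as np

M = np.array([[2.0, 3.0], [5.0, 7.0]])
x = np.array([11.0, 13.0])
m, n = M.shape

# Analytic array: T[i, k, j] = d y_i / d m_kj = delta_ik * x_j
T = np.einsum('ik,j->ikj', np.eye(m), x)

# Finite-difference estimate of the same array, perturbing one entry of M at a time
h = 1e-6
T_num = np.zeros((m, m, n))
for k in range(m):
    for j in range(n):
        dM = np.zeros_like(M)
        dM[k, j] = h
        T_num[:, k, j] = ((M + dM) @ x - M @ x) / h

# Contracting T with any perturbation dM gives dM @ x, matching dy = dM x
dM = np.array([[1.0, -2.0], [0.5, 4.0]])
dy = np.einsum('ikj,kj->i', T, dM)
print(dy, dM @ x)
```

Read this way, "$\partial\mathbf{y}/\partial\mathbf{M} = \mathbf{x}$" is shorthand: the only nonzero entries of the array are copies of $\mathbf{x}$, one for each row of $\mathbf{M}$.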

Numeric demonstration

In [34]:
M, x, M@x

(array([[2, 3],
        [5, 7]]),
 array([11, 13]),
 array([ 61, 146]))

In [35]:
k11 = np.array([[1, 0], [0, 0]])
k12 = np.fliplr(k11)
k21 = np.flipud(k11)
k22 = np.fliplr(k21)
singles = (k11, k12, k21, k22)
singles

(array([[1, 0],
        [0, 0]]),
 array([[0, 1],
        [0, 0]]),
 array([[0, 0],
        [1, 0]]),
 array([[0, 0],
        [0, 1]]))

In [36]:
[((M+(e*0.001))@x - M@x) / 0.001 for e in singles]

[array([11.,  0.]), array([13.,  0.]), array([ 0., 11.]), array([ 0., 13.])]

In [37]:
[e@x for e in singles]

[array([11,  0]), array([13,  0]), array([ 0, 11]), array([ 0, 13])]

Test numerically: create a random vector x and random matrices M and dM. Use an approximation of (2.1) to estimate the rate of change of $\mathbf{y}$ along the direction `dM`, and compare it to $\partial\mathbf{M}\mathbf{x}$ (`dM@x`). Find the maximum squared error over a number of random trials.

In [38]:
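def dM_error(M, dM, x):
    # Squared norm of dM@x minus the finite-difference change in y per unit step along dM
    v = dM@x - (((M+(dM*0.001))@x - M@x) / 0.001)
    return v.dot(v)

max(dM_error(np.random.randn(2,2), np.random.randn(2,2), np.random.randn(2))
    for i in range(1000))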

2.4641463207114085e-24

## Gradient

From [Wikipedia](https://en.wikipedia.org/wiki/Gradient):

In vector calculus, the **gradient** of a scalar-valued differentiable function $f$ of several variables is the vector field (or vector-valued function) $\nabla f$ whose value at a point $p$ is the vector whose components are the partial derivatives of $f$ at $p$.

That is, for $f \colon \mathbf{R}^n \to \mathbf{R}$, its gradient $\nabla f \colon \mathbf{R}^n \to \mathbf{R}^n$ is defined at the point $p = (x_1,\ldots,x_n)$ in *n-*dimensional space as the vector:

$$\nabla f(p) = \begin{bmatrix}\frac{\partial f}{\partial x_1}(p) \\ \vdots \\ \frac{\partial f}{\partial x_n}(p) \end{bmatrix}.$$

Strictly speaking, the gradient is a vector field $\nabla f \colon \mathbf{R}^n \to T\mathbf{R}^n$, and the value of the gradient at a point is a tangent vector in the tangent space at that point, $T_p \mathbf{R}^n$, not a vector in the original space $\mathbf{R}^n$. However, all the tangent spaces can be naturally identified with the original space $\mathbf{R}^n$, so these do not need to be distinguished.

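The dot product of the gradient with a vector $\mathbf{v}$ gives the directional derivative of $f$ along $\mathbf{v}$:

${\displaystyle \nabla f(p)\cdot \mathrm {v} = {\tfrac {\partial f}{\partial \mathbf {v} }}(p)=df_{\mathrm {v} }(p)}$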

Computationally, given a tangent vector, the vector can be _multiplied_ by the derivative (as matrices), which is equal to taking the dot product with the gradient: \
${\displaystyle (df_{p})(v)={\begin{bmatrix}{\frac {\partial f}{\partial x_{1}}}(p)\cdots {\frac {\partial f}{\partial x_{n}}}(p)\end{bmatrix}}{\begin{bmatrix}v_{1}\\\vdots \\v_{n}\end{bmatrix}}=\sum _{i=1}^{n}{\frac {\partial f}{\partial x_{i}}}(p)v_{i}={\begin{bmatrix}{\frac {\partial f}{\partial x_{1}}}(p)\\\vdots \\{\frac {\partial f}{\partial x_{n}}}(p)\end{bmatrix}}\cdot {\begin{bmatrix}v_{1}\\\vdots \\v_{n}\end{bmatrix}}=\nabla f(p)\cdot v}
$
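The identity between the directional derivative and the dot product with the gradient can be spot-checked numerically. This is a sketch under stated assumptions: the scalar field `f`, its hand-derived gradient `grad_f`, the point `p` and the direction `v` are all hypothetical examples, not taken from the text above.

```python
import numpy as np

def f(p):
    # Hypothetical scalar field f(x, y, z) = x^2 y + sin(z)
    return p[0]**2 * p[1] + np.sin(p[2])

def grad_f(p):
    # Its gradient, derived by hand: (2xy, x^2, cos z)
    return np.array([2 * p[0] * p[1], p[0]**2, np.cos(p[2])])

p = np.array([1.0, 2.0, 0.5])    # point of evaluation
v = np.array([0.3, -0.4, 1.2])   # direction (not normalised)
h = 1e-6

# Central-difference estimate of the derivative of f along v at p
directional = (f(p + h * v) - f(p - h * v)) / (2 * h)
dot_with_gradient = grad_f(p) @ v

print(directional, dot_with_gradient)
```

Because `v` is not a unit vector, both numbers scale with its length, which is consistent with treating $df_p$ as a linear function of $v$.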

In Euclidean 3-space,
$$ \nabla\phi(x, y, z) =
\frac{\partial\phi}{\partial x}\mathbf{\hat{x}} +
\frac{\partial\phi}{\partial y}\mathbf{\hat{y}} +
\frac{\partial\phi}{\partial z}\mathbf{\hat{z}}
$$
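For a concrete field, the three components can be produced symbolically. A minimal sketch with sympy follows; the field `phi` is a made-up example, and the list `grad_phi` holds the coefficients of $\mathbf{\hat{x}}$, $\mathbf{\hat{y}}$ and $\mathbf{\hat{z}}$ in that order.

```python
import sympy

x, y, z = sympy.symbols('x y z')
phi = x**2 * y + sympy.sin(z)   # hypothetical scalar field

# One partial derivative per coordinate: the components along x-hat, y-hat, z-hat
grad_phi = [sympy.diff(phi, s) for s in (x, y, z)]

print(grad_phi)
```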

# Numerical Approximations

In [46]:
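if __name__ == '__main__':
    class VC():
        @staticmethod
        def grad(f, x, eps=1e-6):
            """Central-difference estimate of the gradient of f at the 1-D array x.

            Row i of the result holds the partials of f's output(s) with respect to x[i].
            """
            epsihat = np.eye(x.size) * eps  # row i perturbs only x[i]
            yp = np.apply_along_axis(f, 1, x + epsihat)
            ym = np.apply_along_axis(f, 1, x - epsihat)
            return (yp - ym)/(2 * eps)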
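The symmetric quotient used in `grad` is generally more accurate than the one-sided quotient from (2.1) for the same step size. The sketch below illustrates this on a hypothetical test function with a known derivative; the function, the point and the step size are assumptions chosen only to make the difference visible.

```python
f = lambda t: t**3          # hypothetical test function; its derivative is 3 t^2
t, h = 1.0, 1e-3
exact = 3 * t**2

forward = (f(t + h) - f(t)) / h            # one-sided quotient, error of order h
central = (f(t + h) - f(t - h)) / (2 * h)  # symmetric quotient, error of order h^2

forward_err = abs(forward - exact)
central_err = abs(central - exact)

print(forward_err, central_err)
```

The trade-off is two evaluations of `f` per input component instead of one extra evaluation.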

## Examples

### gradient

of a constant scalar $f(x) = c$

In [47]:
VC.grad(lambda x: 42, np.array([3]))

array([0.])

of a scalar polynomial $x(1-x) = -x^2 + x$

In [48]:
VC.grad(lambda x: x * (1-x), np.array([3]))

array([[-5.]])

of an element-wise multiply by a constant vector:

In [49]:
f = lambda v: np.multiply(v, np.arange(v.size) + 1)
VC.grad(f, np.arange(4))

array([[1., 0., 0., 0.],
       [0., 2., 0., 0.],
       [0., 0., 3., 0.],
       [0., 0., 0., 4.]])

of a matrix multiply. Here's a non-square matrix:

In [50]:
v = np.random.rand(3)
np.arange(v.size * (v.size+1)).reshape((v.size, v.size+1))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

The gradient of the matrix multiplication at a given point:

In [55]:
f = lambda v: v @ np.arange(v.size * (v.size+1)).reshape((v.size, v.size+1))
x = np.arange(3)
y = f(x)
g = VC.grad(f, x)
x, y, g

(array([0, 1, 2]),
 array([20, 23, 26, 29]),
 array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]]))
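Note the layout of the result: row `i` of `g` corresponds to input component `x[i]`, which is the transpose of the usual outputs-by-inputs Jacobian. The sketch below makes this explicit; the standalone `grad` function is a copy of the central-difference scheme written here so the example is self-contained, and `A` and `dx` are illustrative names.

```python
import numpy as np

def grad(f, x, eps=1e-6):
    # Standalone copy of the central-difference scheme, one perturbed input per row
    epsihat = np.eye(x.size) * eps
    yp = np.apply_along_axis(f, 1, x + epsihat)
    ym = np.apply_along_axis(f, 1, x - epsihat)
    return (yp - ym) / (2 * eps)

A = np.arange(12).reshape(3, 4)
f = lambda v: v @ A          # y_j = sum_i x_i A_ij
x = np.arange(3)
g = grad(f, x)               # g[i, j] approximates d y_j / d x_i

dx = np.array([0.1, -0.2, 0.3])
change = f(x + dx) - f(x)    # exact change in y, since f is linear

print(g.shape)
print(change, dx @ g, g.T @ dx)
```

With this layout a change in the input maps to a change in the output as `dx @ g`, or equivalently `g.T @ dx`.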

The gradient of an affine transformation:

In [52]:
f = lambda v: v @ np.arange(v.size * (v.size+1)).reshape((v.size, v.size+1)) + np.arange(v.size+1)
x = np.arange(3)
y = f(x)
g = VC.grad(f, x)
x, y, g

(array([0, 1, 2]),
 array([20, 24, 28, 32]),
 array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]]))

In [56]:
x @ g, x.dot(g), g @ y

(array([20., 23., 26., 29.]),
 array([20., 23., 26., 29.]),
 array([162.00000006, 554.00000003, 946.00000009]))

### Gradient back-propagation

Consider a loss function $\displaystyle loss(\mathbf{x}) = \frac{\| \mathbf{x} - \mathbf{x}_{ideal}\|^2}{2}$

In [111]:
ideal = np.array([2,3,5])
loss = lambda v: (v - ideal).dot(v - ideal) / 2.0

The gradient of this loss function at $\mathbf{x} = (-2, 0, 1) \circ \mathbf{\hat x}$

In [113]:
y = np.array([-2,0,1])
loss_at_y = loss(y)
g = VC.grad(loss, y)
y, loss_at_y, g

(array([-2,  0,  1]), 20.5, array([-4., -3., -4.]))
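For this quadratic loss the gradient has a closed form: differentiating half the squared distance gives the difference between the point and the ideal. The sketch below compares that closed form with a central-difference estimate; `analytic` and `numeric` are names introduced here for illustration.

```python
import numpy as np

ideal = np.array([2, 3, 5])
loss = lambda v: (v - ideal).dot(v - ideal) / 2.0

y = np.array([-2.0, 0.0, 1.0])

# Closed form: the gradient of half the squared distance is the difference vector
analytic = y - ideal

# Central-difference estimate, one coordinate direction at a time
eps = 1e-6
numeric = np.array([(loss(y + eps * e) - loss(y - eps * e)) / (2 * eps)
                    for e in np.eye(y.size)])

print(analytic, numeric)
```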

In [114]:
f = lambda v: v @ np.arange(v.size * (v.size+1)).reshape((v.size, v.size+1)) + np.arange(v.size+1)
x = np.array([-1,1])
y = f(x)
loss_at_y = loss(y)
print(f"x = {x}, y = {y}, loss at y = {loss_at_y}")
print(f"∇𝑙𝑜𝑠𝑠(𝑦) = {VC.grad(loss, y)}")
print(f"∇𝑙𝑜𝑠𝑠(𝑥) = {VC.grad(lambda x:loss(f(x)), x)}")
g_at_x = VC.grad(f, x)

x = [-1  1], y = [3 4 5], loss at y = 1.0
∇𝑙𝑜𝑠𝑠(𝑦) = [1. 1. 0.]
∇𝑙𝑜𝑠𝑠(𝑥) = [1. 7.]


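By the chain rule, $\nabla loss(x)$ can be recovered from `g_at_x` and $\nabla loss(y)$: since the rows of `g_at_x` are indexed by the components of $x$, $\nabla loss(x)$ equals `g_at_x @` $\nabla loss(y)$, which here is $(1, 7)$.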
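A minimal sketch of this back-propagation step, assuming the same affine map and loss as in the cell above: the gradient with respect to `y` is pushed back through `f` by one matrix product and compared with differentiating the composition directly. The standalone `grad` function and the names `A`, `backprop` and `direct` are introduced here so the example is self-contained.

```python
import numpy as np

def grad(f, x, eps=1e-6):
    # Standalone copy of the central-difference scheme, one perturbed input per row
    epsihat = np.eye(x.size) * eps
    yp = np.apply_along_axis(f, 1, x + epsihat)
    ym = np.apply_along_axis(f, 1, x - epsihat)
    return (yp - ym) / (2 * eps)

ideal = np.array([2, 3, 5])
loss = lambda v: (v - ideal).dot(v - ideal) / 2.0

A = np.arange(6).reshape(2, 3)
f = lambda v: v @ A + np.arange(3)      # affine map from 2 inputs to 3 outputs

x = np.array([-1, 1])
y = f(x)

g_at_x = grad(f, x)                     # shape (2, 3): rows follow x, columns follow y
grad_loss_y = grad(loss, y)             # shape (3,): gradient of the loss with respect to y

backprop = g_at_x @ grad_loss_y                 # chain rule: sum over the components of y
direct = grad(lambda v: loss(f(v)), x)          # differentiate the composition directly

print(backprop, direct)
```

The product only needs the local derivative of `f` and the gradient arriving from the next stage, which is what lets the same step be repeated through a chain of maps.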

# END
---