But the loss function is tensor-valued?

What is the derivative of a tensor operation?


An affine transformation of $(x_1, x_2)^T$ is 

\begin{eqnarray}
y = \left(\begin{array}{c} y_1 \\
                       y_2 \end{array} \right)
=
\left( \begin{array}{cc} w_{11} & w_{12} \\ 
                        w_{21} & w_{22} \end{array} \right)
\left( \begin{array}{c} x_1  \\ 
                        x_2  \end{array}  \right)
+
\left( \begin{array}{c} b_1  \\ 
                        b_2         \end{array}  \right)
\end{eqnarray}

For example:
\begin{eqnarray} 
\left( \begin{array}{cc} 1 & -2 \\ 
                         2 & 1  \end{array}  \right)
\left( \begin{array}{c} 1  \\ 
                        2  \end{array}  \right)
+
\left( \begin{array}{c} 0  \\ 
                        -1\end{array}  \right)
=
\left( \begin{array}{c}-3  \\ 
                        3         \end{array}  \right)
\end{eqnarray}


                   

We will demonstrate tensor operator differentiation by considering a 2D affine transformation. What is the derivative - the gradient function - of an affine transformation? And how can we use this gradient to move our point 'downhill'? 

The affine transformation has six parameters - four w's and two b's. The slide shows the transformation function and the result of transforming the point $x_1 = 1, x_2 = 2$.  

In [1]:
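import tensorflow as tf

# Weight matrix, bias vector and input point from the worked example.
w = tf.Variable([[1., -2.], [2., 1.]])
b = tf.Variable([[0.], [-1.]])
x = tf.Variable([[1.], [2.]])

# Affine transformation: matrix product plus bias.
y = w @ x + b

print(y.numpy())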

[[-3.]
 [ 3.]]


Here is the example as TensorFlow code. The `@` symbol between $w$ and $x$ is the matrix multiplication operator.

Differentiation rules


|$f$ |$\frac{df}{dx}$ | Rule |
|:---|:---|:---|
|$x^n$|$nx^{n-1}$| Power |
|$g(x) + h(x)$ | $\frac{dg}{dx} + \frac{dh}{dx}$| Linearity |
|$af(x)$ | $a\frac{df}{dx}$ | Linearity |
|$f(y(x))$ | $\frac{df}{dy}\frac{dy}{dx}$ | Chain rule | 



We just need to say a few more things about differentiation. The first row of this table repeats the rule we saw previously - the derivative of a power. Differentiation is linear: the second and third rows tell us that the derivative of a sum of functions is the sum of the derivatives, and that the derivative of a constant times a function is the constant times the derivative. The chain rule in the last row tells us how to differentiate a composite function. Let's clarify with examples.  

Differentiation examples

|$f$ |$\frac{df}{dx}$ |
|:---|:---|
| $5x^3$ | $15 x^2$ |
| $5x^2 - 2x + 3$ | $10x - 2$ |
| $(3x + 2x^2)^2$ | $2(3x + 2x^2)\times(3 + 4x)$ |

The derivative of $5x^3$ is $15x^2$. 

The second row shows the derivative of the sum of three functions: $5x^2$, $-2x$ and the constant function $3$. The derivative of $5x^2$ is $10x$, the derivative of $-2x$ is $-2$, and the derivative of $3$ is $0$ because $3 = 3x^0$ and the power rule brings down a factor of zero. We see that the derivatives of the separate parts have been summed.
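The first two rows of the table can be checked numerically. The sketch below is not part of the original slides: it uses a central finite difference (the helper name `central_difference`, the step size `h` and the test point `x0 = 2` are all assumptions chosen for illustration) and compares the estimate with the formula from the table.

```python
def central_difference(f, x, h=1e-5):
    """Approximate df/dx at x with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 2.0  # arbitrary test point (an assumption; any value would do)

# Row 1: the derivative of 5x^3 should be 15x^2.
numeric_cubic = central_difference(lambda x: 5 * x**3, x0)
exact_cubic = 15 * x0**2

# Row 2: the derivative of 5x^2 - 2x + 3 should be 10x - 2.
numeric_quadratic = central_difference(lambda x: 5 * x**2 - 2 * x + 3, x0)
exact_quadratic = 10 * x0 - 2

print(numeric_cubic, exact_cubic)
print(numeric_quadratic, exact_quadratic)
```

The two numbers on each printed line should agree to several decimal places; the small discrepancy is the error of the finite-difference approximation, not of the rules.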


$(3x + 2x^2)^2$ is a composition of $f = y^2$ and $y = 3x + 2x^2$. The next slide spells out the calculation.

Writing $(3x + 2x^2)^2$ as a composition:

\begin{align*}
f &= y^2 \text{ where } y = 3x + 2x^2  \\
\end{align*}

$f$ is differentiated with respect to $y$, and then $y$ is differentiated with respect to $x$: 

\begin{align*}
\frac{df}{dx} &= \frac{df}{dy}\frac{dy}{dx} \\
               &= 2y \  \times \  (3 + 4x) \\
               &= 2(3x + 2x^2)(3 + 4x)
\end{align*}
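The same calculation can be mirrored in a few lines of plain Python. This is only an illustrative sketch: the function names and the test point `x0 = 1` are assumptions, and the finite-difference estimate is there purely as a sanity check on the chain-rule result.

```python
def y(x):
    """Inner function: y = 3x + 2x^2."""
    return 3 * x + 2 * x**2


def f(x):
    """Composite function: f = y^2."""
    return y(x) ** 2


def df_dx(x):
    """Chain rule: df/dx = df/dy * dy/dx."""
    df_dy = 2 * y(x)   # derivative of y^2 with respect to y
    dy_dx = 3 + 4 * x  # derivative of 3x + 2x^2 with respect to x
    return df_dy * dy_dx


x0 = 1.0  # arbitrary test point (an assumption)
h = 1e-6  # finite-difference step (an assumption)

analytic = df_dx(x0)
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)

print(analytic)  # 2 * 5 * 7 = 70.0
print(numeric)   # should be very close to the analytic value
```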

The gradient of a two-variable function, $f = f(x, y)$, with respect to either variable, is written $\frac{\partial f}{\partial x}$ or $\frac{\partial f}{\partial y}$.

We almost have all we need to explain automatic gradient descent: how the optimiser tweaks layer weights and biases, using an efficient implementation known as backpropagation. The final bit of theory is partial differentiation. 

Partial differentiation means the differentiation of a function of several variables. Here, $f$ is a function of two variables. The gradient with respect to either variable is written with 'curly' d's. 

Differentiation follows the single-variable rules 

The undifferentiated variable is held constant

| $f$|$\frac{\partial f}{\partial x}$ |$\frac{\partial f}{\partial y}$ | 
|:---|:---|:---|
|$xy$|$y$| $x$ |
| $ax^n + by^m$|$anx^{n-1}$| $bmy^{m-1}$ | 

In fact there is not much extra to learn: the undifferentiated variables are treated as constants. Then the normal rules of differentiation - powers, linearity and the chain rule - are applied. The table shows the partial derivatives with respect to $x$ and $y$.
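"Holding the other variable constant" translates directly into code: nudge one argument and leave the rest alone. The sketch below is an illustration rather than anything from the slides. The helper `partial`, the test point $(3, 4)$ and the constants $a = 2$, $n = 3$, $b = 5$, $m = 2$ are all assumed values.

```python
def partial(f, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative of f with
    respect to point[index], holding the other arguments fixed."""
    up = list(point)
    down = list(point)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)


def product(x, y):
    """First table row: f = xy."""
    return x * y


def power_sum(x, y, a=2.0, n=3, b=5.0, m=2):
    """Second table row: f = a x^n + b y^m (constants are assumptions)."""
    return a * x**n + b * y**m


point = (3.0, 4.0)  # arbitrary test point (an assumption)

d_product_dx = partial(product, point, 0)  # table says y
d_product_dy = partial(product, point, 1)  # table says x
d_power_dx = partial(power_sum, point, 0)  # table says a n x^(n-1)
d_power_dy = partial(power_sum, point, 1)  # table says b m y^(m-1)

print(d_product_dx, d_product_dy)
print(d_power_dx, d_power_dy)
```

With these values the table predicts $4$ and $3$ for the first row, and $2 \cdot 3 \cdot 3^2 = 54$ and $5 \cdot 2 \cdot 4 = 40$ for the second.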

Now we can demonstrate tensor differentiation.

Suppose we have a very simple network that implements a single 2D affine transformation

Input: $(x_1, x_2)$, prediction: $(y_1, y_2)$, target: $(0, 0)$ 

Loss: $(y_1 - 0)^2 + (y_2 - 0)^2 = y_1^2 + y_2^2$ 

Consider a very simple layer that performs a 2D affine transformation on the input $(x_1, x_2)$. We suppose that the target is the vector $(0, 0)$. The loss is then the sum of the squares of the outputs.

We wish to adjust the weights in order to lower the loss on the input $(1, 2)^T$ 

We need to find the gradients in $w, b$-space 

Differentiate with respect to $w_{11}, w_{12}, w_{21}, w_{22}, b_1, b_2$

We wish to adjust the weights in order to lower the loss on the input $(1, 2)^T$. We need to find the gradients in $w, b$-space and move in the direction of the negative gradient. That means differentiating the loss with respect to each of $w_{11}, w_{12}, w_{21}, w_{22}, b_1, b_2$.

\begin{align*}
y_1 &= w_{11}x_1 + w_{12}x_2 + b_1 \\
y_2 &= w_{21}x_1 + w_{22}x_2  + b_2 \\
\\
f &= y_1^2 + y_2^2 \\
  &= (w_{11}x_1 + w_{12}x_2 + b_1)^2 + (w_{21}x_1 + w_{22}x_2  + b_2)^2
\end{align*}

The loss is a fairly horrible function of the weights and biases.

\begin{align*}
\frac{\partial f}{\partial w_{11}} &= \frac{\partial f}{\partial y_1} \frac{\partial y_1}{\partial w_{11}} 
                                   = 2y_1 \frac{\partial y_1}{\partial w_{11}} 
                                   = 2y_1 x_1
\end{align*}

\begin{eqnarray}
x = \left( \begin{array}{c} 1  \\ 
                            2  \end{array}  \right),\quad y = \left( \begin{array}{c} -3 \\ 3 \end{array} \right)
\end{eqnarray}
then
\begin{align*}
\frac{\partial f}{\partial w_{11}} = -6
\end{align*}

The gradient of the loss with respect to the weight $w_{11}$ is easily calculated by the chain rule: $f$ is differentiated with respect to $y_1$ and then multiplied by the derivative of $y_1$ with respect to $w_{11}$. In this case $y_1 = -3$ and $x_1 = 1$, so $\frac{\partial f}{\partial w_{11}} = 2 \times (-3) \times 1 = -6$.
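As a sanity check, the value $-6$ can be reproduced without any calculus by nudging $w_{11}$ and watching the loss change. This is a sketch in plain Python, not the method used in the rest of the material; the function name `loss` and the step size `h` are assumptions.

```python
def loss(w11, w12, w21, w22, b1, b2, x1=1.0, x2=2.0):
    """Sum-of-squares loss of the 2D affine layer for the input (x1, x2)."""
    y1 = w11 * x1 + w12 * x2 + b1
    y2 = w21 * x1 + w22 * x2 + b2
    return y1**2 + y2**2


# Parameter values from the worked example.
params = dict(w11=1.0, w12=-2.0, w21=2.0, w22=1.0, b1=0.0, b2=-1.0)

# Nudge only w11, keeping every other parameter fixed.
h = 1e-6  # finite-difference step (an assumption)
up = dict(params, w11=params["w11"] + h)
down = dict(params, w11=params["w11"] - h)
numeric = (loss(**up) - loss(**down)) / (2 * h)

# Chain-rule result: 2 * y1 * x1 with y1 = -3 and x1 = 1.
y1 = params["w11"] * 1.0 + params["w12"] * 2.0 + params["b1"]
analytic = 2 * y1 * 1.0

print(loss(**params))  # 18.0
print(analytic)        # -6.0
print(numeric)         # should be very close to -6
```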

\begin{align*}
\frac{\partial f}{\partial w}
&=
\left( 
\begin{array}{cc} \frac{\partial f}{\partial w_{11}} & \frac{\partial f}{\partial w_{12}} \\
                  \frac{\partial f}{\partial w_{21}} & \frac{\partial f}{\partial w_{22}} \end{array} 
\right) \\
&=
\left( 
\begin{array}{cc} 2y_1\frac{\partial y_1}{\partial w_{11}} & 2y_1\frac{\partial y_1}{\partial w_{12}} \\
                  2y_2\frac{\partial y_2}{\partial w_{21}} & 2y_2\frac{\partial y_2}{\partial w_{22}} \end{array} 
\right) 
=
\left( 
\begin{array}{rr} -6 \frac{\partial y_1}{\partial w_{11}} & -6\frac{\partial y_1}{\partial w_{12}} \\
                   6 \frac{\partial y_2}{\partial w_{21}} &  6\frac{\partial y_2}{\partial w_{22}} \end{array} 
\right) \\
&=
\left( 
\begin{array}{rr} -6 x_1 & -6 x_2 \\
                   6 x_1 &  6 x_2 \end{array} 
\right) 
=
\left( 
\begin{array}{rr} -6  & -12 \\
                   6  &  12 \end{array} 
\right)
\end{align*} 

The weights $w_{11}, w_{12}, w_{21} \text{ and } w_{22}$ are elements of the weight tensor - in this case a matrix. The tensor derivative is just the tensor of partial derivatives. I've written it out here but you will never need to write code for this by hand - that's why we have TensorFlow and other similar deep learning frameworks.
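The pattern in the final matrix is worth noticing: entry $(i, j)$ is $2 y_i x_j$, which is the outer product of $2y$ with $x$. The sketch below spells this out in NumPy rather than TensorFlow so that the arithmetic is explicit. The bias gradient $2y$ is not written out above; it follows from the same chain rule because $\frac{\partial y_i}{\partial b_i} = 1$.

```python
import numpy as np

# Values from the worked example.
w = np.array([[1.0, -2.0], [2.0, 1.0]])
b = np.array([[0.0], [-1.0]])
x = np.array([[1.0], [2.0]])

y = w @ x + b  # forward pass: [[-3.], [3.]]

# Entry (i, j) of the weight gradient is 2 * y_i * x_j: an outer product.
df_dw = (2 * y) @ x.T

# Each bias enters y_i with coefficient 1, so the bias gradient is 2 * y_i.
df_db = 2 * y

print(df_dw)
print(df_db)
```

The weight gradient printed here matches the matrix derived by hand, with rows $(-6, -12)$ and $(6, 12)$.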

In [2]:
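# Record the forward pass so that TensorFlow can differentiate it.
with tf.GradientTape(persistent=True) as tape:
  y = w @ x + b
  f = tf.reduce_sum(y**2)  # loss: sum of the squared outputs

# Gradients of the loss with respect to the weights and the biases.
[df_dw, df_db] = tape.gradient(f, [w, b])
print(df_dw.numpy())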

[[ -6. -12.]
 [  6.  12.]]


Here is how TensorFlow can be used to find tensor derivatives. Don't panic: we won't need this code - everything happens automatically underneath `network.fit`.  
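To connect the gradient back to the goal of moving 'downhill', here is a sketch of a single gradient descent step on the worked example. It is written in NumPy, not TensorFlow, and it is not what `network.fit` literally executes. The helper `loss_and_grads` and the learning rate of $0.05$ are assumptions made for illustration.

```python
import numpy as np


def loss_and_grads(w, b, x):
    """Return the sum-of-squares loss and its gradients w.r.t. w and b."""
    y = w @ x + b
    loss = float(np.sum(y**2))
    return loss, (2 * y) @ x.T, 2 * y


# Values from the worked example.
w = np.array([[1.0, -2.0], [2.0, 1.0]])
b = np.array([[0.0], [-1.0]])
x = np.array([[1.0], [2.0]])

learning_rate = 0.05  # assumed step size, chosen only for this illustration

loss_before, df_dw, df_db = loss_and_grads(w, b, x)

# Move each parameter a small distance against its gradient.
w_new = w - learning_rate * df_dw
b_new = b - learning_rate * df_db

loss_after, _, _ = loss_and_grads(w_new, b_new, x)

print(loss_before)  # 18.0
print(loss_after)   # smaller than 18: the step went downhill
```

With this step size the outputs shrink from $(-3, 3)$ to approximately $(-1.2, 1.2)$, so the loss falls from $18$ to about $2.88$. A much larger learning rate could overshoot and increase the loss instead.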