## Backpropagation in tensor notation

We have a network consisting of three layers:
 - layer sizes 3, 4, 2, 3 (3 inputs, followed by layers with 4, 2 and 3 neurons),
 - the activation is the sigmoid,
 - the loss function is the mean squared error.

We will calculate the derivatives with respect to any weight or bias in two cases:

 - a single example
 - many examples (a batch)

In [None]:
import numpy as np 

W1,b1,W2,b2,W3,b3 = [np.array([[0.1, 0.1, 0.1, 0.1],
        [-0.1, 0.1, 0.2, 0.1],
        [0.1, 0.2, 0.1, 0.1]]),
 np.array([0.2, -0.2, 0.2, 0.2]),
 np.array([[-0.2, 0.2],
        [0.2, 0.2],
        [0.2, -0.2],
        [0.12, 0.2]]),
 np.array([0.4, 0.4]),
 np.array([[-0.3, 0.3, 0.13],
        [0.3, 0.3, 0.3]]),
 np.array([0.16, -0.6, 0.6] )]

s = lambda x:1/(1+np.exp(-x))

## 1. Processing of a single example

In [None]:
X = np.array([1,.3,.2])
y_ = np.array([.1,.2,.1])

### Forward pass

$$
\begin{aligned}
z^1 &= x^1 \cdot W^1 + b^1\\
y^1 &= \sigma(z^1)
\end{aligned}
$$

$$
\begin{aligned}
z^2 &= y^1\cdot W^2 + b^2\\
y^2 &= \sigma(z^2)
\end{aligned}
$$

$$
\begin{aligned}
z^3 &= y^2\cdot W^3 + b^3\\
y^3 &= \sigma(z^3)
\end{aligned}
$$

In [None]:
z1 = X.dot(W1) + b1
y1 = s(z1)

z2 = y1.dot(W2) + b2
y2 = s(z2)

z3 = y2.dot(W3) + b3
y3 = s(z3)

y3

### Backward pass

We propagate the error (sensitivity) from the last layer to the first.

Suppose we want to calculate:

$$\frac{\partial L}{\partial \mathbf{w^1}}$$



$$\frac{\partial L}{\partial w^1_{\alpha\beta}}=
\sum_k \underbrace{\sum_{i,j}\frac{2}{N^3}(y^3_i-\hat y_i) 
\frac{\partial y^3_i}{\partial z^3_i}  \frac{\partial z^3_i}{\partial y^2_j}\cdot
\frac{\partial y^2_j}{\partial z^2_j}  \frac{\partial z^2_j}{\partial y^1_k}
  }_{\Delta^1_k}\frac{\partial y^1_k}{\partial z^1_k} x^1_{\alpha} \delta_{k\beta}  
$$

$$\frac{\partial L}{\partial w^1_{\alpha\beta}}=
 \sum_k \Delta^1_k \frac{\partial y^1_k}{\partial z^1_k}  x^1_{\alpha} \delta_{k\beta}   = x^1_{\alpha} \frac{\partial y^1_\beta}{\partial z^1_\beta}  \Delta^1_\beta 
$$

$$\frac{\partial L}{\partial \mathbf{w^1}}= \underbrace{\mathbf{x^1}}_{n^0}\otimes \underbrace{\mathbf{\Delta^1} \odot \mathbf{\sigma'}(\mathbf{z^1})}_{n^1}
$$

Here the first factor has as many elements as the network has inputs (3), the second has as many elements as layer 1 has neurons (4), and $\odot$ is the elementwise product.



1. We calculate the sensitivity of the loss to the output (activation) of the last layer:
$$
\mathrm{error_i} =\frac{2}{N^3}(y^3_i-\hat y_i)  
$$
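For reference, this expression follows directly from the loss. A short sketch of the step, assuming the mean squared error is averaged over the outputs of the last layer (which is what the factor in front suggests):

$$
L = \frac{1}{N^3}\sum_i (y^3_i-\hat y_i)^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial y^3_i} = \frac{2}{N^3}(y^3_i-\hat y_i)
$$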

In [None]:
error = None

### BEGIN SOLUTION
error = 2/(y3.shape[-1])*(y3-y_)
### END SOLUTION
error

$$
\mathrm{error_j} \to \underbrace{\frac{2}{N^3}(y^3_i-\hat y_i)}_{\mathrm{error_i}}  
\underbrace{\frac{\partial y^3_i}{\partial z^3_i}  \frac{\partial z^3_i}{\partial y^2_j}}_{M(n_3\times n_2)}= \mathrm{error_i}\frac{\partial y^3_i}{\partial z^3_i} \cdot
(w^{3\,T})_{ij}$$
$$
\frac{\partial z^3_i}{\partial y^2_j} =  w^3_{ji} = (w^{3\,T})_{ij}
$$

In [None]:
# propagate the error from the output of layer 3 back to the input of layer 3

### BEGIN SOLUTION
error = (y3*(1-y3)*error).dot(W3.T)
### END SOLUTION

error
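The matrix product in the cell above is a compact way of writing the index sum from the formula, and the factor `y3*(1-y3)` is the sigmoid derivative, since for the sigmoid the derivative equals the output times one minus the output. The sketch below checks both facts on small illustrative arrays; the values of `z3` and `upstream` are assumptions made up for this example, not results from the notebook.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Illustrative values (assumptions for this sketch).
W3 = np.array([[-0.3, 0.3, 0.13],
               [0.3, 0.3, 0.3]])           # shape (n_2, n_3) = (2, 3)
z3 = np.array([0.2, -0.4, 0.7])            # pre-activations of layer 3
upstream = np.array([0.05, -0.02, 0.1])    # plays the role of error_i

y3 = sigmoid(z3)
sigma_prime = y3 * (1 - y3)                # analytic sigmoid derivative

# Central-difference estimate of the sigmoid derivative.
eps = 1e-6
sigma_prime_numeric = (sigmoid(z3 + eps) - sigmoid(z3 - eps)) / (2 * eps)

# Explicit index sum: error_j = sum_i error_i * sigma'_i * w3_{ji}
explicit = np.einsum('i,i,ji->j', upstream, sigma_prime, W3)

# Vectorised form used in the notebook cell.
vectorised = (sigma_prime * upstream).dot(W3.T)

print(np.allclose(sigma_prime, sigma_prime_numeric))
print(np.allclose(explicit, vectorised))
```

Both comparisons should agree up to floating point error. The `einsum` subscripts mirror the indices of the formula one to one, which makes it a convenient way to check a vectorised expression.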


$$
\mathrm{error_j} \to   
\mathrm{error_i}\frac{\partial y^2_i}{\partial z^2_i} \cdot
(w^{2\,T})_{ij}$$

We propagate the error from the output of layer 2 back to its input, i.e. to the output of layer 1.

In [None]:
### BEGIN SOLUTION
error = (y2*(1-y2)*error).dot(W2.T)
### END SOLUTION

error

We calculate the derivative with respect to the layer 1 weights as the outer product of the input vector
$$x^1_j$$
 and $$\mathrm{error_i} \frac{\partial y^1_i}{\partial z^1_i}$$

In [None]:
dw1 = None

### BEGIN SOLUTION
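# Outer product written with broadcasting: (3, 1) * (1, 4) -> (3, 4)
dw1 = X[:,None]*( error*(y1*(1-y1)) )[None,:]
# Equivalent result using np.outer
dw1 = np.outer(X, error*(y1*(1-y1)) )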
### END SOLUTION

dw1

In [None]:
np.testing.assert_allclose(dw1,[[0.0006476156413555145,
  0.0006892113015055656,
  -0.0006380165577866137,
  0.0006632386357523501],
 [0.0001942847011378035,
  0.00020676340500358492,
  -0.0001914049789775163,
  0.00019897159654647112],
 [0.000129523134091869,
  0.00013784226030111313,
  -0.00012760331446770579,
  0.00013264773588161916]] ,rtol=1e-3)
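
Besides comparing against hard-coded reference numbers, the analytic gradient can be cross-checked with finite differences. The following is a minimal sketch of such a check, not part of the original exercise: the helper `loss_for`, the step size `eps` and the variable names are assumptions introduced here, and the loss is assumed to be the mean squared error over the three outputs.

```python
import numpy as np

# Weights and the single example, copied from the cells above.
W1 = np.array([[0.1, 0.1, 0.1, 0.1],
               [-0.1, 0.1, 0.2, 0.1],
               [0.1, 0.2, 0.1, 0.1]])
b1 = np.array([0.2, -0.2, 0.2, 0.2])
W2 = np.array([[-0.2, 0.2], [0.2, 0.2], [0.2, -0.2], [0.12, 0.2]])
b2 = np.array([0.4, 0.4])
W3 = np.array([[-0.3, 0.3, 0.13], [0.3, 0.3, 0.3]])
b3 = np.array([0.16, -0.6, 0.6])
x = np.array([1, .3, .2])
target = np.array([.1, .2, .1])

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def loss_for(W1_candidate):
    """MSE of the network output for a given first-layer weight matrix (hypothetical helper)."""
    y1 = sigmoid(x.dot(W1_candidate) + b1)
    y2 = sigmoid(y1.dot(W2) + b2)
    y3 = sigmoid(y2.dot(W3) + b3)
    return np.mean((y3 - target) ** 2)

# Analytic gradient, following the backward pass steps of this section.
y1 = sigmoid(x.dot(W1) + b1)
y2 = sigmoid(y1.dot(W2) + b2)
y3 = sigmoid(y2.dot(W3) + b3)
delta = 2 / y3.shape[-1] * (y3 - target)
delta = (y3 * (1 - y3) * delta).dot(W3.T)
delta = (y2 * (1 - y2) * delta).dot(W2.T)
dw1_analytic = np.outer(x, delta * y1 * (1 - y1))

# Central differences: perturb one weight at a time.
eps = 1e-6
dw1_numeric = np.zeros_like(W1)
for a in range(W1.shape[0]):
    for b in range(W1.shape[1]):
        step = np.zeros_like(W1)
        step[a, b] = eps
        dw1_numeric[a, b] = (loss_for(W1 + step) - loss_for(W1 - step)) / (2 * eps)

# Largest absolute disagreement between the two gradients.
print(np.max(np.abs(dw1_analytic - dw1_numeric)))
```

The printed difference should be several orders of magnitude smaller than the gradient entries themselves. Finite differences need two forward passes per weight, so they are useful only as a test, not as a replacement for backpropagation.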

The derivative with respect to the bias is:

$$\mathrm{error_i} \frac{\partial y^1_i}{\partial z^1_i}$$
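
This follows from the same chain as for the weights; only the last factor changes. A sketch of the step, in the notation of the weight derivation earlier in this section (with $\Delta^1$ denoting the error propagated to the output of layer 1):

$$
\frac{\partial z^1_k}{\partial b^1_\beta} = \delta_{k\beta}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial b^1_\beta} = \sum_k \Delta^1_k \frac{\partial y^1_k}{\partial z^1_k}\,\delta_{k\beta} = \Delta^1_\beta \frac{\partial y^1_\beta}{\partial z^1_\beta}
$$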

In [None]:
db1 = None

### BEGIN SOLUTION
db1 = error*(y1*(1-y1)) 
### END SOLUTION

db1

In [None]:
np.testing.assert_allclose(db1,[0.0006476156413555145,
 0.0006892113015055656,
 -0.0006380165577866137,
 0.0006632386357523501], rtol=1e-3)
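
The steps above were carried out only for the first layer, but the same recipe gives the derivatives for every weight and bias. The sketch below wraps it in a loop over layers. The function `backprop_single` and the list names `weights` and `biases` are hypothetical, introduced here for illustration; sigmoid activations and the mean squared error are assumed, as in the rest of this section.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def backprop_single(x, target, weights, biases):
    """Return lists (dW, db) of gradients for one example (hypothetical helper)."""
    # Forward pass, remembering the output of every layer (index 0 is the input).
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(activations[-1].dot(W) + b))

    # Sensitivity of the MSE loss to the network output.
    error = 2 / activations[-1].shape[-1] * (activations[-1] - target)

    dW, db = [], []
    for layer in reversed(range(len(weights))):
        y_out = activations[layer + 1]
        y_in = activations[layer]
        delta = error * y_out * (1 - y_out)    # error w.r.t. z of this layer
        dW.insert(0, np.outer(y_in, delta))    # outer product of layer input and delta
        db.insert(0, delta)
        error = delta.dot(weights[layer].T)    # error w.r.t. the input of this layer
    return dW, db

# Values copied from the cells above.
weights = [np.array([[0.1, 0.1, 0.1, 0.1],
                     [-0.1, 0.1, 0.2, 0.1],
                     [0.1, 0.2, 0.1, 0.1]]),
           np.array([[-0.2, 0.2], [0.2, 0.2], [0.2, -0.2], [0.12, 0.2]]),
           np.array([[-0.3, 0.3, 0.13], [0.3, 0.3, 0.3]])]
biases = [np.array([0.2, -0.2, 0.2, 0.2]),
          np.array([0.4, 0.4]),
          np.array([0.16, -0.6, 0.6])]
x = np.array([1, .3, .2])
target = np.array([.1, .2, .1])

dW, db = backprop_single(x, target, weights, biases)
print([g.shape for g in dW])
print([g.shape for g in db])
```

Each gradient has the same shape as the parameter it belongs to, and the first entries of `dW` and `db` correspond to `dw1` and `db1` computed step by step above.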

## 2. Batch processing

What if we have many examples?

In [None]:
X = np.array([[1,.3,.2],
              [.1,.2,.1]])
y_ = np.array([[.1,.2,.1],
               [.2,.1,.2]])



### Forward pass

Let's apply the same formulas:

In [None]:
z1 = X.dot(W1) + b1
y1 = s(z1)

z2 = y1.dot(W2) + b2
y2 = s(z2)

z3 = y2.dot(W3) + b3
y3 = s(z3)

y3
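
The single-example formulas carry over unchanged because the matrix product treats every row of the input independently, and the bias vector is broadcast over the rows. A small sketch for the first layer illustrates this; the names `X_batch`, `z1_batch` and `z1_rows` are assumptions used only in this example.

```python
import numpy as np

# First-layer parameters and the batch, copied from the cells above.
W1 = np.array([[0.1, 0.1, 0.1, 0.1],
               [-0.1, 0.1, 0.2, 0.1],
               [0.1, 0.2, 0.1, 0.1]])
b1 = np.array([0.2, -0.2, 0.2, 0.2])
X_batch = np.array([[1, .3, .2],
                    [.1, .2, .1]])

# (2, 3) @ (3, 4) -> (2, 4); b1 of shape (4,) is added to every row.
z1_batch = X_batch.dot(W1) + b1

# The same computation done one example at a time.
z1_rows = np.stack([x.dot(W1) + b1 for x in X_batch])

print(z1_batch.shape)                    # (2, 4)
print(np.allclose(z1_batch, z1_rows))    # True
```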

### Backward pass

In [None]:
error = None

### BEGIN SOLUTION
error = 2/(y3.shape[-1])*(y3-y_)
### END SOLUTION
error

$$
\mathrm{error_j} \to \underbrace{\frac{2}{N^3}(y^3_i-\hat y_i)}_{\mathrm{error_i}}  
\underbrace{\frac{\partial y^3_i}{\partial z^3_i}  \frac{\partial z^3_i}{\partial y^2_j}}_{M(n_3\times n_2)}= \mathrm{error_i}\frac{\partial y^3_i}{\partial z^3_i} \cdot
(w^{3\,T})_{ij}$$
$$
\frac{\partial z^3_i}{\partial y^2_j} =  w^3_{ji} = (w^{3\,T})_{ij}
$$

We propagate the error from the output of layer 3 back to the input of layer 3.

In [None]:

### BEGIN SOLUTION
error = (y3*(1-y3)*error).dot(W3.T)
### END SOLUTION

error


$$
\mathrm{error_j} \to   
\mathrm{error_i}\frac{\partial y^2_i}{\partial z^2_i} \cdot
(w^{2\,T})_{ij}$$

We propagate the error from the output of layer 2 back to its input, i.e. to the output of layer 1.

In [None]:
### BEGIN SOLUTION
error = (y2*(1-y2)*error).dot(W2.T)
### END SOLUTION

error

We calculate the derivative with respect to the layer 1 weights as the outer product of the input vector
$$x^1_j$$
 and $$\mathrm{error_i} \frac{\partial y^1_i}{\partial z^1_i}$$

If we use broadcasting rules to calculate the outer product, we get the values for each example separately:

In [None]:
dw1 = X[:,:,None]*( error*(y1*(1-y1)) )[:,None,:]
dw1.shape
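
To make the broadcasting step less opaque, the sketch below compares it with an explicit loop over examples and with an `einsum` that names the indices. The random arrays are illustrative stand-ins (an assumption of this example): `X_batch` for the inputs and `delta` for `error*(y1*(1-y1))`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_batch = rng.normal(size=(2, 3))   # 2 examples, 3 inputs
delta = rng.normal(size=(2, 4))     # 2 examples, 4 neurons in layer 1

# Broadcasting: (2, 3, 1) * (2, 1, 4) -> (2, 3, 4), one outer product per example.
per_example = X_batch[:, :, None] * delta[:, None, :]

# The same result built one example at a time.
stacked = np.stack([np.outer(x, d) for x, d in zip(X_batch, delta)])

# The same result with named indices: n = example, i = input, j = neuron.
via_einsum = np.einsum('ni,nj->nij', X_batch, delta)

print(per_example.shape)
print(np.allclose(per_example, stacked), np.allclose(per_example, via_einsum))
```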

To obtain the derivative with respect to all weights in this layer, average the result over the examples (i.e. over axis 0).

In [None]:
### BEGIN SOLUTION
dw1 = np.mean(X[:,:,None]*( error*(y1*(1-y1)) )[:,None,:],axis=0)
### END SOLUTION

dw1
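
Averaging the per-example outer products is equivalent to a single matrix product divided by the number of examples, which avoids building the intermediate three-dimensional array. A minimal sketch on illustrative random data follows; `X_batch` and `delta` are assumed stand-ins for the inputs and for `error*(y1*(1-y1))`.

```python
import numpy as np

rng = np.random.default_rng(1)
X_batch = rng.normal(size=(5, 3))   # 5 examples, 3 inputs
delta = rng.normal(size=(5, 4))     # 5 examples, 4 neurons in layer 1

# Mean of per-example outer products, as in the cell above.
dw_mean = np.mean(X_batch[:, :, None] * delta[:, None, :], axis=0)

# Same quantity as one matrix product: (3, 5) @ (5, 4) -> (3, 4).
dw_matmul = X_batch.T.dot(delta) / X_batch.shape[0]

print(np.allclose(dw_mean, dw_matmul))
```

The matrix-product form sums over the example index inside the product, so memory use does not grow with an extra axis of size equal to the batch.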

In [None]:
np.testing.assert_allclose(dw1,[[0.0003496380231808871,
   0.00038124978891573846,
   -0.0003446616174187511,
   0.00036582283792085946],
  [0.0001488027919549495,
   0.00017666998610366136,
   -0.00014700925385113806,
   0.00016789280925877392],
  [9.059178410097957e-05,
   0.0001055652683135122,
   -8.945503941504285e-05,
   0.00010052736615762115]] ,rtol=1e-3)

The derivative with respect to the bias is the same expression as before, averaged over the examples:

$$\mathrm{error_i} \frac{\partial y^1_i}{\partial z^1_i}$$

In [None]:
db1 = None
### BEGIN SOLUTION
db1 = np.mean(error*(y1*(1-y1)), axis=0)
### END SOLUTION

db1

In [None]:
np.testing.assert_allclose(db1,[0.0005821100203320384,
  0.0007110470905900002,
  -0.0005755421589128673,
  0.0006736543728038669], rtol=1e-3)

## Addendum: outer product

$$P_{ij} =  a_i b_j$$

In [None]:
a = np.array([1,2,3])
b = np.array([5,4])

In [None]:
np.outer(a,b)

In [None]:
a[:,None]*b[None,:]

In [None]:
[[a[i]*b[j] for j in range(b.shape[0])] for i in range(a.shape[0])] 