# Neural Networks 2

Today we'll talk about backpropagation and more complex neural networks.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")


In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/intro-stat-learning/ISLP/main/ISLP/data/Credit.csv")

In [3]:
df = df[["Income", "Limit", "Balance"]]
df = df.rename(columns=str.lower)
df = (df - df.mean()) / df.std()
df.shape

(400, 3)

## Where we ended last time

We finished our last lecture with this neural network:

<img src="one_feature_one_neuron.png" width="1000">


The loss of this network is:

$$
\mathcal{L}\bigl(w^{(1)},b^{(1)},w^{(2)},b^{(2)}\bigr)
  = \sum_{i=1}^{m}
    \left(
      w^{(2)}\,\sigma\!\bigl(w^{(1)}x_i + b^{(1)}\bigr)
      + b^{(2)}
      - y_i
    \right)^{2}
$$



In [4]:
X = df["limit"].values
y = df["balance"].values
X.shape

(400,)

In [5]:
# activation
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# loss
def sum_of_squares(a, y):
    return np.sum((a - y) ** 2)


In [6]:
w1 = np.float32(1)
b1 = np.float32(1)
w2 = np.float32(1)
b2 = np.float32(1)


In [7]:
z1 = X * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2
L = sum_of_squares(z2, y)
L

np.float64(1444.3526409019069)

Now we want to perform gradient descent, for which we have to find these four gradients:

$$
\frac{\partial\mathcal{L}}{\partial w^{(2)}};
\frac{\partial\mathcal{L}}{\partial b^{(2)}};
\frac{\partial\mathcal{L}}{\partial w^{(1)}};
\frac{\partial\mathcal{L}}{\partial b^{(1)}}
$$


The problem is that the network now contains functions of functions: $z^{(2)}$ is a function of $a^{(1)}$, $w^{(2)}$ and $b^{(2)}$, and $a^{(1)}$ is itself a function of $w^{(1)}$ and $b^{(1)}$. To find partial derivatives of a composite function $f(g(x))$, we turn to the chain rule.
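As a worked instance, here is the chain rule applied to $w^{(1)}$ for a single observation. The intermediate names $z^{(1)}_i$, $a^{(1)}_i$ and $z^{(2)}_i$ mirror the variables in the forward-pass code; the per-observation loss $\ell_i$ is notation introduced here only for illustration.

$$
\begin{aligned}
\ell_i &= \bigl(z^{(2)}_i - y_i\bigr)^{2}, \qquad
z^{(2)}_i = w^{(2)} a^{(1)}_i + b^{(2)}, \qquad
a^{(1)}_i = \sigma\bigl(z^{(1)}_i\bigr), \qquad
z^{(1)}_i = w^{(1)} x_i + b^{(1)} \\
\frac{\partial \ell_i}{\partial w^{(1)}}
  &= \frac{\partial \ell_i}{\partial z^{(2)}_i}
     \cdot \frac{\partial z^{(2)}_i}{\partial a^{(1)}_i}
     \cdot \frac{\partial a^{(1)}_i}{\partial z^{(1)}_i}
     \cdot \frac{\partial z^{(1)}_i}{\partial w^{(1)}}
   = 2\bigl(z^{(2)}_i - y_i\bigr) \cdot w^{(2)} \cdot \sigma'\bigl(z^{(1)}_i\bigr) \cdot x_i
\end{aligned}
$$

Summing $\partial \ell_i / \partial w^{(1)}$ over all $m$ observations gives the gradient of the full loss.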

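Writing $a^{(1)}_i = \sigma\bigl(w^{(1)}x_i + b^{(1)}\bigr)$ for the hidden activation, the derivative of the sigmoid with respect to its input can be expressed through its own output:

$$
\bigl(a^{(1)}_i\bigr)' = a^{(1)}_i \,\bigl(1 - a^{(1)}_i\bigr)
$$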

The formulas for the gradients are:

$$
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial w^{(2)}} &= 2\sum_{i=1}^{m}\Bigl( w^{(2)}a^{(1)}_i + b^{(2)} - y_i \Bigr)\,a^{(1)}_i
\\
\frac{\partial\mathcal{L}}{\partial b^{(2)}} &= 2\sum_{i=1}^{m}\Bigl( w^{(2)}a^{(1)}_i + b^{(2)} - y_i \Bigr)
\\
\frac{\partial\mathcal{L}}{\partial w^{(1)}} &= 2\sum_{i=1}^{m}\Bigl( w^{(2)}a^{(1)}_i + b^{(2)} - y_i \Bigr)\,w^{(2)}\,\bigl(a^{(1)}_i\bigr)'\,x_i
\\
\frac{\partial\mathcal{L}}{\partial b^{(1)}} &= 2\sum_{i=1}^{m}\Bigl( w^{(2)}a^{(1)}_i + b^{(2)} - y_i \Bigr)\,w^{(2)}\,\bigl(a^{(1)}_i\bigr)'
\end{aligned}
$$


We can drop the factor of 2 in front of these formulas: it only rescales the gradients, and the learning rate rescales them anyway.
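One way to sanity-check gradient formulas like these is to compare them against finite differences. The sketch below does this on synthetic data. The arrays `x_demo` and `y_demo` and the helpers `loss`, `analytic_grads` and `numeric_grads` are made up for illustration and are not part of the lecture code. It keeps the factor of 2 so that the analytic and numerical values are directly comparable.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(params, x, y):
    # Sum-of-squares loss of the one-feature, one-neuron network
    w1, b1, w2, b2 = params
    a1 = sigmoid(x * w1 + b1)
    return np.sum((a1 * w2 + b2 - y) ** 2)

def analytic_grads(params, x, y):
    # Gradients from the chain rule, in the order (w1, b1, w2, b2)
    w1, b1, w2, b2 = params
    a1 = sigmoid(x * w1 + b1)
    err = a1 * w2 + b2 - y
    d_a1 = a1 * (1 - a1)  # sigmoid derivative, written via its output
    return np.array([
        2 * np.sum(err * w2 * d_a1 * x),  # dL/dw1
        2 * np.sum(err * w2 * d_a1),      # dL/db1
        2 * np.sum(err * a1),             # dL/dw2
        2 * np.sum(err),                  # dL/db2
    ])

def numeric_grads(params, x, y, eps=1e-6):
    # Central finite differences, one parameter at a time
    grads = np.zeros_like(params)
    for k in range(len(params)):
        step = np.zeros_like(params)
        step[k] = eps
        grads[k] = (loss(params + step, x, y) - loss(params - step, x, y)) / (2 * eps)
    return grads

# Synthetic data (assumption: any small random sample works for the check)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=50)
y_demo = rng.normal(size=50)
params = np.array([0.5, -0.3, 1.2, 0.1])

g_analytic = analytic_grads(params, x_demo, y_demo)
g_numeric = numeric_grads(params, x_demo, y_demo)
print(np.max(np.abs(g_analytic - g_numeric)))  # should be close to zero
```

If the two disagree by more than a small numerical error, the usual suspects are a missing factor or a derivative evaluated at the wrong variable.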


In [10]:
def d_sigmoid(a):
    # a is the sigmoid output, a = sigmoid(z), not the pre-activation z
    return a * (1 - a)


### Task

Finish the code for the backward pass below.


In [9]:
# Initialize weights
w1 = np.float32(2)
b1 = np.float32(2)
w2 = np.float32(2)
b2 = np.float32(2)

lr = 0.001

# forward pass
z1 = X * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2

# loss
L = sum_of_squares(z2, y)

# backward pass (backpropagation)
err = z2 - y
grad_w2 = ...
grad_b2 = ...
grad_w1 = ...
grad_b1 = ...


### Task

Complete the code below using the code you wrote for the backward pass. Also implement the gradient descent part.

Print out the weights and visualize the function that your model is fitting to the data.

In [16]:
# Initialize weights
w1 = np.float32(2)
b1 = np.float32(2)
w2 = np.float32(2)
b2 = np.float32(2)

lr = 0.001

for i in range(100000):
    # forward pass
    z1 = X * w1 + b1
    a1 = sigmoid(z1)
    z2 = a1 * w2 + b2

    # loss
    L = sum_of_squares(z2, y)

    # backprop

    # gradient descent
    


In [None]:
# plot


## More neurons

Let's stick with one feature, but add more neurons to the hidden layer. The complication is that we now have more weights. Even though each individual weight is still a scalar, writing backprop for every weight separately quickly becomes unmanageable and inefficient. Therefore, from now on we will use matrix operations.

<img src="one_feature_two_neurons.png" width="1000">



In [11]:
X = df[["limit"]].values
y = df[["balance"]].values

### Task

Implement the forward pass of the illustrated network.


In [21]:
# Relevant numpy operations
X1 = np.ones((3, 3))
X2 = np.ones((3, 1))
X1 @ X2


array([[3.],
       [3.],
       [3.]])
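To connect the matrix product above to the network, here is a small sketch of a forward pass with two hidden neurons. The layout (rows are observations, columns are neurons), the toy input `X_demo` and the constant weights are assumptions chosen so the shapes are easy to follow; they are not the lecture's solution.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

m, n_hidden = 4, 2
X_demo = np.arange(m, dtype=float).reshape(m, 1)  # (m, 1): one feature

W1 = np.ones((1, n_hidden))   # (1, n_hidden): one weight per hidden neuron
b1 = np.zeros((1, n_hidden))  # (1, n_hidden): one bias per hidden neuron
W2 = np.ones((n_hidden, 1))   # (n_hidden, 1): output weights
b2 = np.zeros((1, 1))         # (1, 1): output bias

Z1 = X_demo @ W1 + b1  # (m, n_hidden); b1 is broadcast across the rows
A1 = sigmoid(Z1)       # (m, n_hidden)
Z2 = A1 @ W2 + b2      # (m, 1): one prediction per observation

print(Z1.shape, A1.shape, Z2.shape)
```

Note that `X` and `y` are now 2-D arrays of shape `(m, 1)`, which is why the cell above selects the columns with double brackets.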

### Task

Implement a single step of gradient descent for the neural network. Since we are no longer working with scalars, our gradient calculations become a little more complex as well. The exact formulas are given below.

Run the cell that performs gradient descent several times, while printing out the weights. What do you notice?
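For reference, one standard way to write the gradients in matrix form is shown here. It assumes $X$ has shape $m \times 1$, $W^{(1)}$ and $b^{(1)}$ have shape $1 \times h$ for $h$ hidden neurons, $W^{(2)}$ has shape $h \times 1$, and the factor of 2 is dropped. The symbols $E$, $\Delta^{(1)}$, the all-ones column vector $\mathbf{1}$ and the elementwise product $\odot$ are notation introduced for this sketch.

$$
\begin{aligned}
E &= Z^{(2)} - y \\
\frac{\partial\mathcal{L}}{\partial W^{(2)}} &= \bigl(A^{(1)}\bigr)^{\top} E \\
\frac{\partial\mathcal{L}}{\partial b^{(2)}} &= \mathbf{1}^{\top} E \\
\Delta^{(1)} &= \Bigl(E\,\bigl(W^{(2)}\bigr)^{\top}\Bigr) \odot A^{(1)} \odot \bigl(1 - A^{(1)}\bigr) \\
\frac{\partial\mathcal{L}}{\partial W^{(1)}} &= X^{\top} \Delta^{(1)} \\
\frac{\partial\mathcal{L}}{\partial b^{(1)}} &= \mathbf{1}^{\top} \Delta^{(1)}
\end{aligned}
$$

Each gradient has the same shape as the parameter it belongs to, which is a quick check when translating these into code.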

### Task

Implement a function `gradient_descent` that runs gradient descent for the neural network. Use it to train the network, then print out the weights. What do you notice?


In [None]:
def gradient_descent(X, y, W1, b1, W2, b2, lr=0.0001, n_iter=10000):
    pass


### Task

Fix the problem above by choosing weights randomly. Plot the resulting fit.


In [13]:
np.random.randn(5, 5)

array([[ 1.81777698,  1.1174691 ,  0.87424499,  0.04300084, -0.0602238 ],
       [ 0.72956651, -0.71185215,  0.95446663, -0.06446925, -0.20007131],
       [-1.41299271,  0.05206524,  0.05042646,  0.12670112, -0.69544944],
       [-0.4858928 , -1.237632  , -0.25682642, -0.10415388, -1.86087575],
       [ 1.07840139,  0.37102853,  0.41772486, -1.77804709,  0.85468556]])
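The sketch below illustrates why the initialization matters, using synthetic data. The function `gradient_descent_sketch`, the arrays `X_demo` and `y_demo`, and the target function are hypothetical names and choices made for this example, and the update rule is one possible implementation rather than the lecture's reference solution.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_descent_sketch(X, y, W1, b1, W2, b2, lr=0.0001, n_iter=10000):
    # Work on copies so the caller's arrays are not modified in place
    W1, b1, W2, b2 = W1.copy(), b1.copy(), W2.copy(), b2.copy()
    for _ in range(n_iter):
        # forward pass
        A1 = sigmoid(X @ W1 + b1)
        E = A1 @ W2 + b2 - y
        # backward pass (factor of 2 omitted)
        D1 = (E @ W2.T) * A1 * (1 - A1)
        grad_W2 = A1.T @ E
        grad_b2 = E.sum(axis=0, keepdims=True)
        grad_W1 = X.T @ D1
        grad_b1 = D1.sum(axis=0, keepdims=True)
        # gradient descent step
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return W1, b1, W2, b2

# Synthetic data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 1))
y_demo = np.sin(2 * X_demo)

# Identical starting weights: both hidden neurons receive identical updates
W1_same, b1_same, W2_same, b2_same = gradient_descent_sketch(
    X_demo, y_demo,
    np.ones((1, 2)), np.ones((1, 2)), np.ones((2, 1)), np.ones((1, 1)),
    n_iter=2000,
)

# Random starting weights: the neurons can learn different things
W1_rand, b1_rand, W2_rand, b2_rand = gradient_descent_sketch(
    X_demo, y_demo,
    rng.normal(size=(1, 2)), rng.normal(size=(1, 2)),
    rng.normal(size=(2, 1)), rng.normal(size=(1, 1)),
    n_iter=2000,
)

print(W1_same)  # the two columns match
print(W1_rand)  # the two columns differ
```

With identical starting values the two hidden neurons compute the same activation and get the same gradient, so they never diverge from each other and the layer behaves like a single neuron.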

### Task

Implement a network with 5 neurons in the hidden layer, and run gradient descent for 20k iterations.


## Two input features, single neuron


<img src="two_features_one_neuron.png" width="1000">



### Task

Implement the forward pass of the neural network above. Then perform gradient descent and print the weights. Use random initialization for the weights.
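As a hint on shapes, here is a sketch of a forward pass with two input features and one hidden neuron. The random `X_demo` stands in for two columns of the dataset, and all names are illustrative assumptions. Compared with the one-feature case, the main change is that `W1` gets one row per feature.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
m = 6
X_demo = rng.normal(size=(m, 2))  # (m, 2): two features per observation

W1 = rng.normal(size=(2, 1))  # (2, 1): one weight per feature, one neuron
b1 = rng.normal(size=(1, 1))
W2 = rng.normal(size=(1, 1))
b2 = rng.normal(size=(1, 1))

A1 = sigmoid(X_demo @ W1 + b1)  # (m, 1): weighted sum of both features
Z2 = A1 @ W2 + b2               # (m, 1)

print(A1.shape, Z2.shape)
```

With two hidden neurons, `W1` would have shape `(2, 2)` and `W2` shape `(2, 1)`, while the rest of the code stays the same.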


## Two features, two neurons

Now let's look at the most complex neural network we will fit today.

<img src="two_features_two_neurons.png" width="1000">





### Task

Implement the forward pass of the network shown above. Train the network.

### Task

So far we have worked with neural networks that are suited for regression tasks. Write a forward pass of a neural network that is suited for classification.

Also consider and answer the following - do we need to change the procedure for gradient descent? If so, in what ways?
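One common choice for binary classification, sketched below under stated assumptions, is to pass the output through a sigmoid and replace the sum of squares with binary cross-entropy. The synthetic labels, the three hidden neurons and the helper `binary_cross_entropy` are made up for this example. Other setups, such as a softmax output for several classes, are equally valid.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def binary_cross_entropy(p, y, eps=1e-12):
    # Clip to avoid log(0) when a prediction is exactly 0 or 1
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic binary labels from two features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 2))
y_demo = (X_demo[:, [0]] + X_demo[:, [1]] > 0).astype(float)  # (8, 1)

W1 = rng.normal(size=(2, 3))
b1 = rng.normal(size=(1, 3))
W2 = rng.normal(size=(3, 1))
b2 = rng.normal(size=(1, 1))

# forward pass: same hidden layer, but the output is squashed into (0, 1)
A1 = sigmoid(X_demo @ W1 + b1)
Z2 = A1 @ W2 + b2
P = sigmoid(Z2)  # predicted probability of class 1

L = binary_cross_entropy(P, y_demo)

# With a sigmoid output and cross-entropy loss, dL/dZ2 simplifies to P - y
dZ2 = P - y_demo

print(L)
```

The gradient descent loop itself keeps the same structure in this setup. What changes is the loss and the error term that starts the backward pass, which here is `P - y_demo` instead of `Z2 - y`.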

## Recap

Congrats, you just implemented and trained a bunch of neural networks!

Here are the key components and concepts to remember:
- A feedforward neural network is essentially a series of matrix transformations and nonlinear transformations stacked together.
- Even though we only looked at networks with one hidden layer, we can have as many hidden layers as we like (or as many as our machine allows). The same goes for the number of neurons within each layer.
- Nonlinear activation functions are what allow the neural network to fit complex functions to the data. Sigmoid is one such function, but there are many more.
- The trickiest part (at least conceptually) about deep neural networks is training them, since we have to figure out partial derivatives of the loss function with respect to every parameter, which requires the application of the chain rule.
- We can use these derivatives to calculate the gradient of the loss function with respect to each weight, via the process called backpropagation.
- The process of adjusting weights according to these gradients is called gradient descent - we move weights in the direction that reduces the loss the fastest.
- The learning rate controls the size of each gradient descent step.

You do not need to remember the formulas of the derivatives or how to derive them.

