<img src="https://drive.google.com/uc?export=view&id=1x-QAgitB-S5rxGGDqxsJ299ZQTfYtOhb" width=180, align="center"/>

Master's degree in Intelligent Systems

Subject: 11754 - Deep Learning

Year: 2022-2023

Professor: Miguel Ángel Calafat Torrens

# Lab 2 - The perceptron

The perceptron is the smallest unit of a neural network; that is, a network with a single neuron. Its structure, as seen in the theoretical contents, is shown below.

<img src="https://drive.google.com/uc?export=view&id=1as6Vm-uivPHatB_ly73LlxEr2n9wHPDw" width=600>


Note that the number of inputs shown matches the number of parameters needed in the equation of a line in a two-dimensional space: two weights ($w_1$, $w_2$) and an independent term ($w_3$), which plays the role of the bias.

$$ w_{1} x_{1} + w_{2} x_{2}+ w_{3}=0 $$
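
As a minimal sketch of how these three parameters define a line, the implicit form can be rearranged to $x_2 = -\frac{w_1}{w_2}x_1 - \frac{w_3}{w_2}$ (only valid when $w_2 \neq 0$). The parameter values and the two sample points below are assumptions chosen purely for illustration.

```python
import numpy as np

# Example parameters (assumed for illustration only)
w1, w2, w3 = 0.1, 1.0, -0.6

# Slope-intercept form: x2 = slope * x1 + intercept (valid only if w2 != 0)
slope = -w1 / w2
intercept = -w3 / w2
print('x2 = {:.2f} * x1 + {:.2f}'.format(slope, intercept))

# The sign of w1*x1 + w2*x2 + w3 tells on which side of the line a point lies
points = np.array([[0.5, 0.9], [0.5, 0.2]])
side = points @ np.array([w1, w2]) + w3
print(side)
```

A positive value places the point on one side of the line and a negative value on the other; a value of exactly zero means the point lies on the line.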

In [None]:
# This cell connects to your Drive. This is necessary because we are going to
# import files from there
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Replace the string in the following line with the path where you have this
# file. If your account is in Spanish, "MyDrive" is 'Mi unidad'.
%cd '/content/gdrive/MyDrive/Colab Notebooks/2022-2023-Lab.DL'
%ls -l

In [None]:
# Here the path of the project folder (which is where this file is) is inserted
# into the python path. There's nothing to do; just execute the cell.
import pathlib
import sys

PROJECT_DIR = str(pathlib.Path().resolve())
sys.path.append(PROJECT_DIR)

In [None]:
# And here we import a few more libraries, among them the one for custom helper
# functions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import helper_PR2 as hp

# from ipywidgets import interact, interactive, fixed, interact_manual
# import ipywidgets as widgets
# from IPython.display import display

## Problem to solve
Imagine you have a cloud of dots. Each dot symbolizes two given characteristics of an individual that are to be used by a bank to determine whether to grant credit. The value of the abscissa could indicate the flow of monthly income, while the value of the ordinate would indicate the amount of money accumulated in a savings account. In this way, the position of the dots in the plane would determine a given pair of features of the economic situation of an individual.

The following graph shows the dots that correspond to the last credit requests. The blue dots are accepted requests, while the red ones correspond to denied requests. It can be seen that in general, the larger the values of abscissa and ordinate, the more likely it is that credit will be granted, although this is not a rigid rule.

<img src="https://drive.google.com/uc?export=view&id=1B2LW8iyDCYVlOzYnCJ26dR0RYh2ZA01a" width="400" align="center">

If one wanted to assess the possibility of granting a loan based on a simple linear model (that is, with the equation of a line) that separated the data, one of the lines that would best separate them would be the one shown below.

<img src="https://drive.google.com/uc?export=view&id=1uaPg44DatxAl9seD-fmCwpHhdQXLUW67" width="400" align="center">

In this way, although there would be exceptions, it could be said that the model generalizes reasonably well when it comes to predicting whether a loan will be granted to a person with given characteristics of monthly income and savings.

Next, we will build a model that can be fitted to any given set of points.

In [None]:
# Seed for random numbers fixed to ensure reproducibility
np.random.seed(42)  # The answer to the great question of “life, the universe
                    # and everything” is 42, but you can choose any value.

In [None]:
# Generation of random dots with slight separation between groups
n = 20

x1 = (0.4 * np.ones((1, n)) + 0.5 * np.random.random((1, n))).flatten()
y1 = (0.4 * np.ones((1, n)) + 0.5 * np.random.random((1, n))).flatten()
labels1 = n * [0]

p1 = [(xs, ys) for xs, ys in zip(x1, y1)]

x2 = (0.6 * np.ones((1, n)) + 0.5 * np.random.random((1, n))).flatten()
y2 = (0.6 * np.ones((1, n)) + 0.5 * np.random.random((1, n))).flatten()
labels2 = n * [1]

p2 = [(xs, ys) for xs, ys in zip(x2, y2)]

features = p1 + p2
correct_outputs = labels1 + labels2

In [None]:
# Now change the values of weight1, weight2 and bias, and execute this cell
# Do it as many times as you consider necessary in order to understand how
# the parameters work. Try to find a combination of parameters
# that returns the best possible fit

weight1 = 0.1
weight2 = 1.0
bias = -0.6

hp.evaluate(weight1, weight2, bias, features, correct_outputs, extended=False)

What you just did in the previous cell is basically what the perceptron training algorithm is intended to do. That is, it starts from initial values that give a poorly fitted model and then modifies the parameters to get closer to the result that is considered optimal.

In this case, the way to do it is by observing the number of points that are well classified; however, a more objective and precise measure could be reached. Let's see how the calculations are made.

## Forward pass

In [None]:
# It starts with some initial weights. For example:
weight1 = 0.1
weight2 = 1.0
bias = -0.6

# Matrix of weights is called W
W = np.array([[weight1, weight2, bias]])

# Dots are converted to an array
X = np.array(features)

# The first point to be considered is selected.
x = X[0]

In [None]:
# Observe the dimensions of each array
print('Array W: {}\nShape: {}\n'.format(W, W.shape))
print('Array x:\n{}\nShape: {}\n'.format(x, x.shape))

In [None]:
# Note that with these dimensions you cannot perform
# the multiplication directly. They must be adapted first.
x = np.expand_dims(np.concatenate((x, np.array([1]))), axis=1)
print(x.shape)

In [None]:
# Now you can multiply matrices directly
h = np.dot(W, x)
print(h)

In [None]:
# In fact, this multiplication could be done by selecting several dots
# at once. For example, we want to select the first 10 dots
Xb = X[:10]

# A vector of 1's is appended to it to be able to do matrix multiplication
Xb = np.concatenate((Xb.T, (np.array([len(Xb) * [1]]))), axis=0)

In [None]:
# The matrix multiplication is performed and at this moment we already have
# the results of h for the first 10 dots
h = np.dot(W, Xb)
print(h)

Following the scheme in the figure below, now we just have to apply the activation function to obtain the first 10 values of ŷ.

<img src="https://drive.google.com/uc?export=view&id=1b_furS1eCvOlv947S8A69ErvzMbkDrWj" width="600" align="center">


In this case the activation function will be a step function; that is, a function that returns the value 1 in case h is greater than or equal to zero, and returns zero otherwise.

In [None]:
def stepFcn(h):
    out = np.zeros_like(h)
    out[h>=0] = 1
    return out

In [None]:
# The result is:
y_hat = stepFcn(h)
print(y_hat)

# And it should have been:
print(correct_outputs[:10])

In [None]:
# Comparing the prediction with the ground truth tells you how many dots
# were misclassified
print('You got {} out of {} wrong answers'.format(
     np.size(y_hat) - np.sum(y_hat == correct_outputs[:10]), np.size(y_hat)))

Now that the _forward pass_ calculations have been done for a first batch of dots, let's see how we can wrap them into functions for clarity.

In [None]:
# The activation function will be the previously defined step function
activFcn = stepFcn

In [None]:
def fwdPass(X, W, activFcn):
    # A vector of 1's is appended to be able to do matrix multiplication
    X = np.concatenate((X.T, (np.array([len(X) * [1]]))), axis=0)
    # Matrix multiplication is performed
    h = np.dot(W, X)
    # Activation function is applied
    y_hat = activFcn(h)
    
    return y_hat

In [None]:
# Note that in this way it can be applied to any length of the batch
print(fwdPass(X[:10], W, activFcn))

print(fwdPass(X[:20], W, activFcn))

# Or it can be done with all the dots at once
y_hat = fwdPass(X, W, activFcn)
print('You got {} out of {} wrong answers'.format(
    np.size(y_hat) - np.sum(y_hat == correct_outputs), np.size(y_hat)))

## Backpropagation

Here comes the important point of training neural networks. At this point it is necessary to do the backward propagation, which is nothing more than updating the values of the weights in order to reduce the error (_loss_).

The formulas below rely on derivatives. For this reason, activation functions that cause problems when differentiated, such as the step function, should be avoided. So before starting with the first steps of backpropagation, let's update the activation function: we will use the sigmoid function instead of the step function.
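
To see why the step function is a problem, the following sketch estimates its derivative with central finite differences at a few sample points. The function name `step` and the sample values are assumptions made for this illustration. Away from $h=0$ the estimate is zero, so an update proportional to the derivative would never change the weights.

```python
import numpy as np

def step(h):
    # Step activation: 1 where h >= 0, 0 elsewhere
    return np.where(h >= 0, 1.0, 0.0)

# Central finite-difference estimate of the derivative at a few points
h = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-4
step_grad = (step(h + eps) - step(h - eps)) / (2 * eps)
print(step_grad)
```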

Sigmoid function:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

In [None]:
# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# It is assigned as activation function
activFcn = sigmoid

# A forward pass is performed with the new activation function
y_hat = fwdPass(X, W, activFcn)

# Check for results
print('You got {} out of {} wrong answers'.format(
    np.size(y_hat) - np.sum(np.around(y_hat) == correct_outputs), np.size(y_hat)))

Until now, errors have been counted simply by seeing if a dot was or was not in the corresponding area; but now it is necessary to be more precise because derivatives are going to be used to calculate how much it is appropriate to vary the weights W.

The first thing to do is define and calculate a loss function. In this case it will be calculated as half the sum of the squared differences between the actual outputs and the estimated outputs. That is to say:

$$ loss=\frac{1}{2}\sum_{i=1}^{n}\left ( y_i-\hat{y}_i \right )^{2} $$
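
As a small worked example of this formula, the sketch below evaluates the loss for three dots. The labels and predictions are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical labels and predictions for three dots
y = np.array([0.0, 1.0, 1.0])
y_hat = np.array([0.2, 0.7, 0.9])

# 0.5 * (0.04 + 0.09 + 0.01), approximately 0.07
loss = 0.5 * np.sum((y - y_hat) ** 2)
print(loss)
```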

The derivative of the loss function with respect to the weights W is what tells us how much the losses vary with the modification of the weights. In this case, we will have the following:

$$ \frac{\partial loss}{\partial W}=\frac{\partial loss}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W} $$

The derivative of the activation function (in this case the sigmoid) is $f'(h)=\sigma'(h)=\sigma(h)\,(1-\sigma(h))$. Therefore, the expression above is equivalent to:

$$ \frac{\partial loss}{\partial W}=\frac{\partial loss}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W}=
\overbrace{-(y-\hat{y})}^{\frac{\partial loss}{\partial \hat{y}}} \cdot \overbrace{\sigma(h)\,(1-\sigma(h))}^{\frac{\partial \hat{y}}{\partial h}} \cdot \overbrace{X}^{\frac{\partial h}{\partial W}} $$
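
The identity $\sigma'(h)=\sigma(h)\,(1-\sigma(h))$ can be checked numerically. The following sketch compares it against a central finite-difference estimate; the sample points and the step `eps` are arbitrary choices for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

h = np.linspace(-4, 4, 9)
eps = 1e-6

# Closed-form derivative of the sigmoid
analytic = sigmoid(h) * (1 - sigmoid(h))
# Central finite-difference approximation of the same derivative
numeric = (sigmoid(h + eps) - sigmoid(h - eps)) / (2 * eps)

# Largest absolute discrepancy between both estimates
print(np.max(np.abs(analytic - numeric)))
```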

The weight update is proportional to the derivative of the loss function. The proportionality factor is called the _learning rate_, and it is represented by the Greek letter eta ($\eta$):

$$ \Delta W = \eta\, \frac{\partial loss}{\partial W} $$

Finally, once the increments to be applied to the weights have been calculated, they are applied and the process starts again with the next epoch.

$$ W^{(epoch+1)}=W^{(epoch)}-\Delta W^{(epoch)}=W - \eta\, \frac{\partial loss}{\partial W} = W + \eta\,  (y-\hat{y}) \cdot \sigma(h)\,(1-\sigma(h)) \cdot X   $$

In some texts, the derivative of the loss function with respect to h is referred to with the lowercase letter delta ($\delta$). In this way, the previous formula would be as follows:

$$ W^{(epoch+1)}=W^{(epoch)}-\Delta W^{(epoch)}=W - \eta\, \delta\, X   $$
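
As a sketch of a single application of this update rule, the code below performs one step on one dot and compares the loss before and after. The dot, its label and the helper `loss_fn` are hypothetical and introduced only for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One hypothetical dot with the constant 1 appended for the bias, and its label
x = np.array([0.5, 0.5, 1.0])
y = 0.0

W = np.array([0.1, 1.0, -0.6])   # [weight1, weight2, bias]
eta = 0.1                        # learning rate

def loss_fn(W):
    # Loss of this single dot for a given set of weights
    return 0.5 * (y - sigmoid(np.dot(W, x))) ** 2

# Forward pass
h = np.dot(W, x)
y_hat = sigmoid(h)

# delta = dloss/dh = -(y - y_hat) * sigma'(h)
delta = -(y - y_hat) * y_hat * (1 - y_hat)

# Update rule: W <- W - eta * delta * x
W_new = W - eta * delta * x

print('Loss before: {:.6f}'.format(loss_fn(W)))
print('Loss after:  {:.6f}'.format(loss_fn(W_new)))
```

With this small learning rate, the step moves the weights in the direction that lowers the loss for this dot.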

In short, the complete scheme would be as seen in the following figure:

<img src="https://drive.google.com/uc?export=view&id=1eRjgfHYnWFPuFpD0CCWiVYbLqnwEUHzD" width="600" align="center">


## The training

It's time for the first training. The following code is thoroughly commented. Pay particular attention to the dimensions of each array.

In order to better emulate subsequent training, the dots will be delivered to the training algorithm 10 at a time. Since there are 40 points in this example, this implies that we will have 4 batches in each epoch.

In [None]:
# Data preparation

# Batch size
batch_size = 10

# Initial values of the weights (3 x 1)
W = np.array([[weight1, weight2, bias]]).T

# X is a 40 x 2 array of dots. The first column holds the abscissas and the
# second the ordinates; but now it will be delivered in batches, so it is
# convenient to have it sized in 4 batches. That is:
# 4 batches x 10 dots x 2 coordinates.
X = np.array(features).reshape(4, -1, 2)

# Following the same criteria, the correct labels will be
# arranged in a 4 x 10 x 1 array
Y = np.array([correct_outputs]).T.reshape(4, -1, 1)

In [None]:
# Training step
def trainStep(X, Y, W, lr):
    loss = 0.0
    for Xb, Yb in zip(X, Y):
        # Add the ones column to Xb (due to bias)
        # 10x3
        Xb = np.concatenate((Xb, np.ones((Xb.shape[0], 1))), axis=1)
        
        # Forward pass
        # Calculate H. (10 x 3)·(3 x 1)==>(10 x 1)
        H = np.dot(Xb, W)
        # Activation function is applied. 10 x 1
        Yp = sigmoid(H)
        # Calculate the derivative of the activation function,
        # since it will be used in the backprop. It is dŶ/dh. 10 x 1
        dYp_dh = Yp * (1 - Yp)
        
        # Accumulated losses are calculated (with the 1/2 factor of the
        # loss formula)
        loss += 0.5 * np.square(Yb - Yp).sum()

        # Backward pass
        # Calculate dloss/dŶ
        dloss_dYp = -(Yb - Yp)

        # Delta error term is calculated. (10 x 1)
        delta = dloss_dYp * dYp_dh
        
        # The increments of W are calculated
        incW = lr * np.dot(Xb.T, delta)
        
        # Weights update
        W -= incW
        
    return W, loss
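
A common way to gain confidence in a backward pass is a numerical gradient check. The sketch below is self-contained and uses random data (not the lab's dots). It compares the analytic gradient, computed with the same expressions as the backward pass, against central finite differences of the loss defined with the 1/2 factor. The helper `batch_loss` is an assumption introduced for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def batch_loss(W, Xb, Yb):
    # Loss with the 1/2 factor, as in the formula of the loss function
    return 0.5 * np.square(Yb - sigmoid(np.dot(Xb, W))).sum()

# Random batch, for illustration only
rng = np.random.default_rng(0)
Xb = np.concatenate((rng.random((10, 2)), np.ones((10, 1))), axis=1)  # 10 x 3
Yb = rng.integers(0, 2, size=(10, 1)).astype(float)                   # 10 x 1
W = np.array([[0.1, 1.0, -0.6]]).T                                    # 3 x 1

# Analytic gradient, computed as in the backward pass: X^T . delta
Yp = sigmoid(np.dot(Xb, W))
delta = -(Yb - Yp) * Yp * (1 - Yp)
grad_analytic = np.dot(Xb.T, delta)

# Numerical gradient by central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    dW = np.zeros_like(W)
    dW[i, 0] = eps
    grad_numeric[i, 0] = (batch_loss(W + dW, Xb, Yb)
                          - batch_loss(W - dW, Xb, Yb)) / (2 * eps)

# Largest absolute discrepancy between both gradients
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

If the two gradients agree up to a small numerical error, the expressions for `delta` and for the increment of `W` are consistent with the loss formula.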

In [None]:
# This cell executes a training step. Feel free to run it as many times as you
# wish, in order to observe how the weights and losses change with different
# values of lr (learning rate)

# The learning rate is initialized
lr = 0.1

W, loss = trainStep(X, Y, W, lr)
print('Loss: {}'.format(loss))
print('W:\n{}\n'.format(W))

hp.evaluate(W[0], W[1], W[2], features, correct_outputs, extended=False)

Finally, look at what happens with different values of _lr_, whether they are large values or small values. You can also experiment with the number of epochs, and even with different starting values.

In [None]:
# Basic training of the network
# Hyperparameters' values
lr = 0.1
num_epochs = 800

# Training
for epoch in range(num_epochs):
    W, loss = trainStep(X, Y, W, lr)

# Results
print('Epoch: {} Loss: {}'.format(epoch, loss))
hp.evaluate(W[0], W[1], W[2], features, correct_outputs, extended=False)
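
To compare several learning rates side by side, a loop like the following can be used. This is only a sketch under stated assumptions: it regenerates synthetic dots in the same way as this lab (with a different random generator, so the dots are not identical), it uses full-batch updates instead of batches of 10, and the function `train` is a hypothetical helper, not part of `helper_PR2`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Synthetic dots generated like the ones in this lab
rng = np.random.default_rng(42)
n = 20
X0 = 0.4 + 0.5 * rng.random((n, 2))   # label 0
X1 = 0.6 + 0.5 * rng.random((n, 2))   # label 1
X = np.concatenate((np.vstack((X0, X1)), np.ones((2 * n, 1))), axis=1)  # 40 x 3
Y = np.concatenate((np.zeros((n, 1)), np.ones((n, 1))))                 # 40 x 1

def train(lr, num_epochs=800):
    # Full-batch gradient descent (a simplification: no mini-batches here)
    W = np.array([[0.1, 1.0, -0.6]]).T
    for _ in range(num_epochs):
        Yp = sigmoid(np.dot(X, W))
        delta = -(Y - Yp) * Yp * (1 - Yp)
        W = W - lr * np.dot(X.T, delta)
    # Final loss, with the 1/2 factor of the loss formula
    Yp = sigmoid(np.dot(X, W))
    return 0.5 * np.square(Y - Yp).sum()

final_losses = {lr: train(lr) for lr in (0.001, 0.01, 0.1)}
for lr, loss in final_losses.items():
    print('lr = {:<6} final loss = {:.4f}'.format(lr, loss))
```

Comparing the final losses gives a quick view of how much progress each learning rate makes in the same number of epochs.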