# Introduction to neural network optimization

* The complexity of the problem largely determines how deep a neural network needs to be to succeed. But deep neural networks bring optimization problems of their own.

* Gradient descent: a way to avoid brute-forcing the optimization, which is not viable, especially in complex scenarios. The search takes place on the surface of the cost function. The original algorithm is susceptible to getting stuck in a local minimum and never reaching the global minimum of the surface.

* When the parameters are not continuous and well behaved, gradient descent may fall short, since the method relies on the mathematical definition of the gradient.

* Optimization libraries/methods for hyperparameter and architecture search: grid search, hyperopt, optuna, NAS (Neural Architecture Search).

* A computational graph stores the intermediate results of the calculations involved in backpropagation, in other words, the results needed to apply the chain rule.
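The local-minimum issue mentioned above can be illustrated with a minimal sketch in plain Python. The cost function `f`, the step size and the starting points are assumptions chosen for illustration only; they do not come from the notes.

```python
# Hypothetical 1-D cost with two minima: a global one near x = -1.30
# and a shallower local one near x = 1.13.
def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    # Analytical derivative of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=500):
    """Plain gradient descent: repeatedly step against the derivative."""
    x = x0
    for _ in range(steps):
        x = x - lr * df(x)
    return x

x_right = gradient_descent(2.0)    # starts on the right-hand slope
x_left = gradient_descent(-2.0)    # starts on the left-hand slope

print(f"start  2.0 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
print(f"start -2.0 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
```

Both runs stop where the derivative vanishes, but only the run started at -2.0 ends in the lower of the two minima. The result depends entirely on the starting point.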


## PyTorch perceptron implementation

The following code implements a regression perceptron in PyTorch. The emphasis is on the feedforward and gradient descent (backpropagation) steps.

In [1]:
import numpy as np
import torch

In [2]:
class Perceptron():
    # constructor: receives the number of inputs (num_inputs) and a learning rate (lr)
    def __init__(self, num_inputs, lr=.01):
        # w -> sampled from a normal distribution with mean 0 and standard deviation 1
        self.w = torch.normal(mean=0, std=1, size=(num_inputs,1), requires_grad=True)
        # b -> bias, initialized to a constant (0 here; 1 or even 1/total number of classes are alternatives)
        self.b = torch.zeros(1, requires_grad=True)
        # instance attribute storing the number of inputs
        self.num_inputs = num_inputs
        # instance attribute storing the learning rate
        self.lr = lr  

    # Define a ReLU activation function
    def activation_relu(self, x):
        a = torch.zeros_like(x)
        return torch.max(x, a)

    # forward pass -> propagates an input X through the perceptron to produce the output
    def foward(self, X):
        linear = X@self.w + self.b
        return self.activation_relu(linear)
    
    # backward -> runs a forward pass and returns the prediction error (y - y_hat)
    def backward(self, X, y):
        y_hat = self.foward(X) # forward pass
        errors = (y.reshape(y_hat.shape) - y_hat) # residual; no derivative is computed here
        return errors

    # quadratic loss
    def loss(self, y_hat, y):
        l = (y.reshape(y_hat.shape) - y_hat)**2/2
        return l.mean()
    
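    # train: one pass over the samples, updating the parameters after each one
    def train_step(self, X, y):
        for i in range(y.shape[0]):
            error = self.backward(X[i].reshape(1, self.num_inputs), y[i]).reshape(-1) # flattens the (1, 1) error into a one-element 1-D tensor
            # manual gradient descent step on the quadratic loss (the ReLU derivative is not applied)
            # note: w and b are reassigned to non-leaf tensors, so the autograd history grows with each update
            self.w = self.w + self.lr*error*X[i].reshape(self.num_inputs, 1)
            self.b = self.b + self.lr*error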
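As a sanity check on the manual update in `train_step`, the sketch below compares the direction `error * x` with the gradient that autograd computes for the same quadratic loss. The input, target and weights are made-up values, chosen so that the ReLU is active (positive pre-activation). When the unit is inactive, autograd returns a zero gradient while the manual rule still moves the weights.

```python
import torch

# Made-up single sample and parameters (assumptions for illustration)
x = torch.tensor([[1.0, 2.0, 3.0]])
y = torch.tensor([2.0])
w = torch.tensor([[0.5], [0.1], [0.2]], requires_grad=True)  # x @ w = 1.3 > 0
b = torch.zeros(1, requires_grad=True)

# Forward pass and quadratic loss, as in the Perceptron class
y_hat = torch.relu(x @ w + b)
loss = ((y.reshape(y_hat.shape) - y_hat) ** 2 / 2).mean()
loss.backward()  # autograd fills w.grad and b.grad

# Direction used by the manual rule: error * x (for w) and error (for b)
error = (y.reshape(y_hat.shape) - y_hat).detach().reshape(-1)
manual_w_step = error * x.reshape(3, 1)

print(-w.grad.reshape(-1), manual_w_step.reshape(-1))
print(-b.grad, error)
```

With an active ReLU, the negative gradient and the manual step coincide, so `w + lr * error * x` is an ordinary gradient descent step for this sample.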

In [3]:
model = Perceptron(3, lr=0.001)
model.w, model.b, model.lr

(tensor([[1.1306],
         [1.6895],
         [1.8499]], requires_grad=True),
 tensor([0.], requires_grad=True),
 0.001)

In [4]:
# Generating example data
X = torch.arange(30, dtype=torch.float32).reshape((10,3)) + torch.normal(0, 2, (10,3))
print(X)
y = torch.arange(10, dtype=torch.float32)
print(y)

tensor([[ 1.8904,  6.7750,  0.6223],
        [ 1.4300,  4.4493,  5.1651],
        [ 3.2495,  3.9364,  9.0632],
        [11.2349,  8.8244, 11.9049],
        [11.8409, 16.5702, 17.1188],
        [12.9469, 17.1069, 13.7759],
        [16.4468, 19.7310, 19.8193],
        [20.7280, 22.3385, 21.6008],
        [24.9066, 22.9328, 26.6466],
        [28.6048, 29.0486, 29.3416]])
tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])


In [5]:
model.foward(X), y, model.loss(model.foward(X), y)

(tensor([[ 14.7349],
         [ 18.6889],
         [ 27.0908],
         [ 49.6344],
         [ 73.0513],
         [ 69.0243],
         [ 88.5946],
         [101.1361],
         [116.1990],
         [135.6982]], grad_fn=<MaximumBackward0>),
 tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]),
 tensor(2782.2212, grad_fn=<MeanBackward0>))

In [6]:
model.train_step(X, y)
print('Parameters:', model.w, model.b)
print(f'Squared loss:, {model.loss(model.foward(X), y).item():.6f}')

Parameters: tensor([[-0.0887],
        [ 0.1888],
        [ 0.1955]], grad_fn=<AddBackward0>) tensor([-0.1467], grad_fn=<AddBackward0>)
Squared loss:, 0.224431


In [7]:
epochs = 100
for _ in range(epochs):
    model.train_step(X, y)
    print('Parameters:', model.w.detach().numpy(), model.b.item())
    print(f'Squared loss:, {model.loss(model.foward(X), y).item():.6f}')

Parameters: [[-0.08677812]
 [ 0.18390043]
 [ 0.1922458 ]] -0.1477540284395218
Squared loss:, 0.243705
Parameters: [[-0.08296393]
 [ 0.18144888]
 [ 0.19160113]] -0.14859594404697418
Squared loss:, 0.237253
Parameters: [[-0.07922538]
 [ 0.17907315]
 [ 0.19094062]] -0.14942799508571625
Squared loss:, 0.231081
Parameters: [[-0.07556455]
 [ 0.17676663]
 [ 0.19026239]] -0.15025046467781067
Squared loss:, 0.225222
Parameters: [[-0.07198013]
 [ 0.17452696]
 [ 0.1895689 ]] -0.1510632187128067
Squared loss:, 0.219657
Parameters: [[-0.06847092]
 [ 0.17235175]
 [ 0.18886238]] -0.15186618268489838
Squared loss:, 0.214367
Parameters: [[-0.06503566]
 [ 0.17023873]
 [ 0.1881449 ]] -0.1526593267917633
Squared loss:, 0.209335
Parameters: [[-0.06167302]
 [ 0.16818589]
 [ 0.18741846]] -0.1534426063299179
Squared loss:, 0.204546
Parameters: [[-0.05838189]
 [ 0.16619098]
 [ 0.18668461]] -0.1542159914970398
Squared loss:, 0.199988
Parameters: [[-0.05516097]
 [ 0.16425216]
 [ 0.18594512]] -0.15497945249080658

In [8]:
model.foward(X), y

(tensor([[0.6885],
         [1.0952],
         [1.7376],
         [3.1673],
         [4.7335],
         [4.3700],
         [5.7473],
         [6.5504],
         [7.6272],
         [8.8702]], grad_fn=<MaximumBackward0>),
 tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]))

In [10]:
# Test:
# Generating example data
X = torch.arange(15, dtype=torch.float32).reshape((5,3)) + torch.normal(0, 2, (5,3))
print(X)
y = torch.arange(5, dtype=torch.float32)
print(y)

tensor([[-0.9571,  3.3450,  2.8526],
        [ 1.8246,  3.6858,  6.0044],
        [ 4.3695,  5.0810, 11.6046],
        [ 9.7113, 10.4942, 13.4864],
        [10.1406, 14.9435, 15.0168]])
tensor([0., 1., 2., 3., 4.])


In [11]:
model.foward(X), y

(tensor([[0.4884],
         [1.1695],
         [2.2974],
         [3.4617],
         [4.1520]], grad_fn=<MaximumBackward0>),
 tensor([0., 1., 2., 3., 4.]))

In [12]:
model.loss(model.foward(X), y)

tensor(0.0592, grad_fn=<MeanBackward0>)

## Autograd example in PyTorch

[torch.autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) is PyTorch’s automatic differentiation engine that powers neural network training.

In [13]:
import torch
from torch.autograd import grad
import torch.nn.functional as F

In [14]:
x = torch.tensor([5.])
w = torch.tensor([1.2], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)

act = F.relu(w*x + b)
print(act)

tensor([6.5000], grad_fn=<ReluBackward0>)


In [15]:
grad(act, w, retain_graph=True)

(tensor([5.]),)
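The value 5 is simply `x`: for a positive pre-activation, the derivative of `relu(w*x + b)` with respect to `w` is `x`. The sketch below repeats the computation for both parameters and adds a case where the ReLU is inactive. The bias value `b_neg` is an assumption picked to make the pre-activation negative.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

x = torch.tensor([5.])
w = torch.tensor([1.2], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)

# Active ReLU: w*x + b = 6.5 > 0
act = F.relu(w * x + b)
dw, db = grad(act, [w, b])  # one gradient per requested input
print(dw, db)               # d(act)/dw = x, d(act)/db = 1

# Inactive ReLU: w*x + b_neg = -4 < 0, so the output is constant (0)
b_neg = torch.tensor([-10.0], requires_grad=True)
act_off = F.relu(w * x + b_neg)
(dw_off,) = grad(act_off, [w])
print(dw_off)               # gradient is zero where the ReLU is off
```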

## Designing dense neural networks (MLP) with PyTorch: layers, gradients and optimizers

* PyTorch pre-implemented layers
* Parameter access options
* Gradient use
* Backpropagation and optimization algorithms

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

We only need to define the ```forward``` function. The ```backward``` function (where the gradients are computed) is derived automatically by autograd.

In [2]:
class Network(nn.Module):

    def __init__(self):
        # Calls the constructor (__init__) of the superclass nn.Module
        super(Network, self).__init__()
        # Defines the network architecture
        # The problem of interest is image classification (handwritten digits from 0 to 9)
        # 28x28 pixel images = 784 values
        # Network output = 10 classes
        self.fc1 = nn.Linear(784, 32) # fully connected layer (first dense layer)
        self.fc2 = nn.Linear(32, 10) # fully connected layer (second dense layer)

    def forward(self, X):
        # X is a 28x28 image -> not directly usable by the fully connected layer, which expects a 1-D vector per sample
        # Receives mini-batches
        X = torch.flatten(X, 1) # flattens every dimension except the batch dimension
        X = F.relu(self.fc1(X)) # fully connected layer + ReLU
        X = self.fc2(X) # fully connected layer
        return X


In [3]:
net = Network()
print(net)

Network(
  (fc1): Linear(in_features=784, out_features=32, bias=True)
  (fc2): Linear(in_features=32, out_features=10, bias=True)
)


In [4]:
input_random = torch.randn(1, 1, 28, 28)
print(input_random.shape)
output = net(input_random)
print(output)

torch.Size([1, 1, 28, 28])
tensor([[ 0.0847,  0.1732,  0.1617, -0.0896,  0.0639, -0.2462, -0.0805, -0.1427,
         -0.1804,  0.0198]], grad_fn=<AddmmBackward0>)
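To make the flattening step in `forward` concrete, here is a small sketch of what `torch.flatten(X, 1)` does to a mini-batch. The batch size of 4 is an arbitrary assumption.

```python
import torch

# Hypothetical mini-batch: 4 single-channel 28x28 images
batch = torch.randn(4, 1, 28, 28)

# start_dim=1 keeps dimension 0 (the batch) and merges all the others
flat = torch.flatten(batch, 1)
print(flat.shape)  # one row of 1*28*28 = 784 values per image

# Equivalent reshape written with view
same = batch.view(batch.size(0), -1)
print(torch.equal(flat, same))
```

Each row of `flat` has exactly the 784 features that `nn.Linear(784, 32)` expects.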


In [5]:
output = net(input_random) # executes the forward method
print(output)

tensor([[-0.3431, -0.2508,  0.2336,  0.3299, -0.1642,  0.0559,  0.0749, -0.1185,
          0.1411,  0.2383]], grad_fn=<AddmmBackward0>)


In [5]:
params = list(net.parameters())
print(len(params))
# first layer parameters
print(params[0])
print(params[0].size())

4
Parameter containing:
tensor([[-0.0242,  0.0085, -0.0324,  ...,  0.0316,  0.0316,  0.0337],
        [ 0.0020, -0.0196, -0.0106,  ..., -0.0182,  0.0049,  0.0075],
        [ 0.0319,  0.0310,  0.0191,  ...,  0.0058,  0.0356,  0.0270],
        ...,
        [-0.0209,  0.0206, -0.0283,  ..., -0.0212,  0.0146, -0.0142],
        [ 0.0266, -0.0299, -0.0311,  ..., -0.0237, -0.0317,  0.0179],
        [-0.0087, -0.0064,  0.0016,  ...,  0.0048, -0.0063, -0.0286]],
       requires_grad=True)
torch.Size([32, 784])
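The four entries returned by `parameters()` are the weight and bias of each layer. A sketch of name-based access with `named_parameters()` follows. `TinyNet` is a hypothetical stand-in that only declares the same two layers, so the block runs on its own.

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in with the same layers as the network above."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 32)
        self.fc2 = nn.Linear(32, 10)

net_demo = TinyNet()

# named_parameters yields (name, tensor) pairs instead of bare tensors
shapes = {name: tuple(p.shape) for name, p in net_demo.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

# Total number of trainable values: 784*32 + 32 + 32*10 + 10
total = sum(p.numel() for p in net_demo.parameters())
print(total)

# Direct attribute access reaches the same tensors
print(net_demo.fc1.weight.shape, net_demo.fc1.bias.shape)
```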


In [6]:
# Backpropagation
# zero the gradient buffers
net.zero_grad()
# output is not a scalar, so backward needs an upstream gradient of the same shape (random here)
output.backward(torch.randn(1, 10))

In [7]:
output = net(input_random)
target = torch.randn(10) # random target just for example purposes
print(target)
target = target.view(1,-1)
print(target)

criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)

tensor([-0.7661,  0.6755, -0.0827,  1.9172, -1.2256,  0.2290, -0.0954,  0.7112,
         1.1470, -1.3312])
tensor([[-0.7661,  0.6755, -0.0827,  1.9172, -1.2256,  0.2290, -0.0954,  0.7112,
          1.1470, -1.3312]])
tensor(1.1268, grad_fn=<MseLossBackward0>)
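For reference, with its default settings `nn.MSELoss` is the mean of the squared element-wise differences. The sketch below checks this on random tensors. The seed and values are assumptions and are unrelated to the numbers printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output_demo = torch.randn(1, 10)  # stands in for the network output
target_demo = torch.randn(1, 10)  # same shape as the output

loss_builtin = nn.MSELoss()(output_demo, target_demo)
loss_manual = ((output_demo - target_demo) ** 2).mean()

print(loss_builtin.item(), loss_manual.item())
```

Giving the target the same shape as the output, as done with `target.view(1, -1)`, avoids relying on broadcasting between a `(1, 10)` and a `(10,)` tensor.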


In [8]:
print(loss.grad_fn)
print(loss.grad_fn.next_functions[0][0]) # linear layer fc2 (addmm)
print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) # first input of addmm: AccumulateGrad of the fc2 bias (not the ReLU)

<MseLossBackward0 object at 0x000001672599BE20>
<AddmmBackward0 object at 0x0000016734144B20>
<AccumulateGrad object at 0x0000016734145960>
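Following `next_functions` by hand only reaches one branch at a time. A possible way to see the whole graph is a small recursive walk, sketched below. `walk_graph` is a hypothetical helper, not part of PyTorch, and the two layers are re-created here so the block is self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def walk_graph(fn, depth=0, names=None):
    """Collect the type names of autograd nodes, depth-first."""
    if names is None:
        names = []
    if fn is None:  # inputs that do not require gradients appear as None
        return names
    names.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk_graph(next_fn, depth + 1, names)
    return names

# Same layer sizes as the network above, on random data
fc1 = nn.Linear(784, 32)
fc2 = nn.Linear(32, 10)
x = torch.randn(1, 784)
loss_demo = nn.MSELoss()(fc2(F.relu(fc1(x))), torch.randn(1, 10))

names = walk_graph(loss_demo.grad_fn)
print("\n".join(names))
```

The indentation of each printed name reflects how far the node is from the loss. The `AccumulateGrad` leaves are the places where gradients are written into the parameters.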


In [9]:
# Backpropagation
net.zero_grad()

print('before')
print(net.fc1.bias.grad)

loss.backward()
print('after')
print(net.fc1.bias.grad)

before
None
after
tensor([ 0.0033,  0.0217,  0.0000,  0.0000, -0.0581,  0.0713,  0.0346,  0.0000,
         0.0440,  0.0000,  0.1349,  0.0000, -0.0099,  0.1201,  0.0000,  0.1233,
        -0.0656,  0.0000,  0.0000,  0.0000,  0.0598, -0.0395,  0.0000,  0.0000,
         0.0093,  0.0000,  0.0000,  0.0376,  0.0000, -0.0149,  0.0000,  0.0079])
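The reason for clearing gradients before `backward()` is that PyTorch accumulates them: every call adds to `.grad` instead of overwriting it. A minimal sketch on a made-up tensor:

```python
import torch

w_demo = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3*w))/dw = 3 for every element
(3 * w_demo).sum().backward()
first = w_demo.grad.clone()

# Second backward pass without clearing: gradients are added up
(3 * w_demo).sum().backward()
second = w_demo.grad.clone()

# Clearing the buffer restores the single-pass value
w_demo.grad.zero_()
(3 * w_demo).sum().backward()
third = w_demo.grad.clone()

print(first, second, third)
```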


In [13]:
print(params[0][0][:10])   

tensor([-0.0242,  0.0085, -0.0324, -0.0352, -0.0321,  0.0334,  0.0240,  0.0271,
         0.0334,  0.0179], grad_fn=<SliceBackward0>)


In [16]:
import torch.optim as optim

# creates the optimizer object
optimizer = optim.SGD(net.parameters(), lr=0.05)

# inside each iteration of the training loop:

## zero the gradient buffers
optimizer.zero_grad()
## compute the output and the gradients of the loss with respect to the parameters
output = net(input_random)
loss = criterion(output, target)
loss.backward()
## weight update
optimizer.step()

In [17]:
print(params[0][0][:10])

tensor([-0.0239,  0.0083, -0.0323, -0.0352, -0.0321,  0.0329,  0.0236,  0.0276,
         0.0337,  0.0185], grad_fn=<SliceBackward0>)
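The small change in the printed weights is the SGD update `p <- p - lr * grad`. The sketch below first verifies that rule for a single step and then repeats the same four calls in a loop. The model is an equivalent `nn.Sequential` stand-in, the data is random, and the learning rate of 0.005 is an assumption, smaller than the 0.05 used above, to keep this toy loop well behaved.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr = 0.005  # assumed value for this sketch
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=lr)

inputs = torch.randn(1, 1, 28, 28)  # random "image"
target = torch.randn(1, 10)         # random target, for illustration only

# One step, checked against the plain SGD rule
weight = model[1].weight
before = weight.detach().clone()
optimizer.zero_grad()
criterion(model(inputs), target).backward()
optimizer.step()
rule_holds = torch.allclose(weight.detach(), before - lr * weight.grad)
print("p_new == p_old - lr * grad:", rule_holds)

# The same four calls repeated as a training loop
losses = []
for step in range(20):
    optimizer.zero_grad()                    # clear accumulated gradients
    loss = criterion(model(inputs), target)  # forward pass
    loss.backward()                          # backpropagation
    optimizer.step()                         # weight update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```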
