# Artificial Intelligence Workshop Day 1

By Juan-Pablo Silva (https://github.com/juanpablos) jpsilva@dcc.uchile.cl, Alexandre Bergel (abergel@dcc.uchile.cl) and Alejandro Infante (ainfante@dcc.uchile.cl).

## PyTorch Basics

We start by importing our basic packages: PyTorch, NumPy (for some auxiliary functions) and matplotlib (to plot our results).

In [0]:
import numpy as np
import torch
import math
from matplotlib import pyplot as plt

### Tensors

Tensors are the basic data structure in PyTorch and work much like NumPy arrays. If you are not familiar with NumPy arrays, think of MATLAB's matrices.

A tensor is just an N-dimensional array: a scalar has 0 dimensions, a vector has 1 and a matrix has 2.

In [0]:
# We create tensors with torch.tensor(...)
tensor_1 = torch.tensor(10)
print("Printing a tensor: ", tensor_1)
print(f"You can ask the shape of a tensor with .shape or .size(): {tensor_1.shape}")
print(f"Ask for a tensor data type with .dtype: {tensor_1.dtype}")
print(f"Single element tensors can have their value retrieved with .item(): {tensor_1.item()}")

Printing a tensor:  tensor(10)
You can ask the shape of a tensor with .shape or .size(): torch.Size([])
Ask for a tensor data type with .dtype: torch.int64
Single element tensors can have their value retrieved with .item(): 10


In [0]:
# We can define a tensor from a sequence, for example a list
tensor_1 = torch.tensor([1,2,3,4,5,6])
# The ones(shape) constructor lets us define a tensor filled with ones
tensor_2 = torch.ones(6)

# Let's print our new tensors
print(tensor_1, tensor_2)
# These are their shapes
print(tensor_1.shape, tensor_2.shape)

# But their types are different?
print(tensor_1.dtype, tensor_2.dtype)

tensor([1, 2, 3, 4, 5, 6]) tensor([1., 1., 1., 1., 1., 1.])
torch.Size([6]) torch.Size([6])
torch.int64 torch.float32


The types of our new tensors are different: tensor_1 is an integer tensor and tensor_2 is a float tensor. PyTorch is strict about data types: many operations, such as matrix multiplication, expect both operands to have the same dtype, even though recent versions promote types automatically for simple element-wise arithmetic. Since we are more interested in floating point numbers, let's convert tensor_1 to a FloatTensor.

In [0]:
# Here we make tensor_1 into a FloatTensor.
tensor_1 = tensor_1.type(torch.FloatTensor)
# We could also have defined it like that in the beginning with the dtype parameter
# tensor_1 = torch.tensor([1,2,3,4,5,6], dtype=torch.float)
print(tensor_1.dtype)

torch.float32
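
The following is a small sketch of how mixed data types behave. It assumes a recent PyTorch release with automatic type promotion; the names `ints`, `floats`, `summed` and `dot` are only illustrative and are not used elsewhere in the workshop.

```python
import torch

ints = torch.tensor([1, 2, 3])   # dtype: torch.int64
floats = torch.ones(3)           # dtype: torch.float32

# Element-wise arithmetic promotes the integer tensor to float
# (assumption: a PyTorch version with type promotion).
summed = ints + floats
print(summed.dtype)

# The dot product expects both operands to share a dtype,
# so we guard the call instead of assuming it succeeds.
try:
    ints @ floats
except RuntimeError as err:
    print(f"Mixed-dtype dot product failed: {err}")

# Converting explicitly works in every case.
dot = ints.float() @ floats
print(dot)
```

Converting once, right after creating the data, is usually simpler than relying on promotion in each operation.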


We can do simple operations with the tensors, just like matrices in matlab and numpy arrays.

In [0]:
# Element-wise addition
print(tensor_1 + tensor_2)
# Element-wise multiplication
print(tensor_1 * tensor_2)
# Dot product, scalar product
print(tensor_1 @ tensor_2)

tensor([2., 3., 4., 5., 6., 7.])
tensor([1., 2., 3., 4., 5., 6.])
tensor(21.)


We often need to change a tensor's shape. To do this we can use the .view() method.
We pass this method the shape we want the tensor to have. If we pass -1 for one dimension, PyTorch infers its value from the number of elements and the other dimensions.
You could call the .reshape() method with the same parameters and get the same result. They are not the same method internally: .view() never copies data and requires a compatible memory layout, while .reshape() copies the data when it has to. Most of the time .view() is what you want. See the PyTorch documentation for more details.

In [0]:
# Let's create a matrix.
# Create an array of 9 elements, from 1 to 9
tensor_1 = torch.tensor(list(range(1,10)), dtype=torch.float)
print(tensor_1)

# Right now our tensor is 1-dimensional, with 9 elements
# We can reshape it into a 3x3 matrix with .view()
# We ask for a tensor with 3 rows and let pytorch infer how many columns (3 in this case)
# because we have 9 elements, 3*X=9 => X=3
tensor_1 = tensor_1.view(3,-1)

# We create a 3x3 tensor with random values.
tensor_2 = torch.randn(3,3)

# You can see the results here. Both tensors have the same shape.
print(tensor_1, tensor_1.shape)
print(tensor_2, tensor_2.shape)

tensor([1., 2., 3., 4., 5., 6., 7., 8., 9.])
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]]) torch.Size([3, 3])
tensor([[ 1.9460, -1.8746,  0.2707],
        [-0.7172, -1.7842, -2.1367],
        [-2.7372, -1.2388, -1.7430]]) torch.Size([3, 3])
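
To illustrate the difference between .view() and .reshape(), here is a minimal sketch using a transposed tensor, whose memory layout is no longer contiguous. The variable names (`base`, `transposed`, `flat`) are made up for this example.

```python
import torch

base = torch.arange(6, dtype=torch.float).view(2, 3)
# Transposing returns a non-contiguous view over the same memory.
transposed = base.t()
print(transposed.is_contiguous())

# .view() cannot reinterpret this layout without copying, so it raises.
try:
    transposed.view(6)
    view_worked = True
except RuntimeError:
    view_worked = False
print(f"view worked: {view_worked}")

# .reshape() falls back to copying the data when it has to.
flat = transposed.reshape(6)
print(flat)
```

For tensors created directly (as in this workshop) both methods behave the same, which is why .view() is used throughout.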


As before, when we showed operations on one-dimensional tensors (arrays), the same operators work on matrices.

In [0]:
# Element-wise addition
print(tensor_1 + tensor_2)
# Element-wise multiplication
print(tensor_1 * tensor_2)
# Matrix multiplication
print(tensor_1 @ tensor_2)

tensor([[2.9460, 0.1254, 3.2707],
        [3.2828, 3.2158, 3.8633],
        [4.2628, 6.7612, 7.2570]])
tensor([[  1.9460,  -3.7491,   0.8121],
        [ -2.8687,  -8.9208, -12.8200],
        [-19.1604,  -9.9101, -15.6871]])
tensor([[ -7.7000,  -9.1592,  -9.2317],
        [-12.2252, -23.8516, -20.0586],
        [-16.7504, -38.5441, -30.8855]])


As we know, matrix multiplication has conditions on the shapes. Let A be an MxN matrix and B a KxL matrix. We can calculate the matrix product of A and B only if N=K, and the result is an MxL matrix.
For the element-wise operations, the shapes of the tensors must be the same (or compatible under PyTorch's broadcasting rules).

In [0]:
# We have a 2x5 matrix here
tensor_1 = torch.tensor(list(range(1,11)), dtype=torch.float).view(2,-1)
print(tensor_1)
# and a 5x9 matrix here
tensor_2 = torch.randn(5,9)

print(f"tensor_1: {tensor_1.shape}")
print(f"tensor_2: {tensor_2.shape}")

# print(tensor_1 + tensor_2) not the same size
# print(tensor_1 * tensor_2) not the same size
# the resulting shape is a 2x9 matrix
print((tensor_1 @ tensor_2).shape)

tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.]])
tensor_1: torch.Size([2, 5])
tensor_2: torch.Size([5, 9])
torch.Size([2, 9])
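
The shape rules can be checked directly. The sketch below uses throwaway tensors (`a`, `b`, `row` are illustrative names) to show a valid matrix product, an element-wise addition that fails, and one that works thanks to broadcasting.

```python
import torch

a = torch.ones(2, 5)
b = torch.ones(5, 9)

# Matrix multiplication: (2x5) @ (5x9) -> (2x9)
product_shape = tuple((a @ b).shape)
print(product_shape)

# Element-wise addition with incompatible shapes raises an error.
try:
    a + b
    add_failed = False
except RuntimeError:
    add_failed = True
print(f"a + b failed: {add_failed}")

# Broadcasting: a 5-element vector is added to every row of the 2x5 matrix.
row = torch.arange(5, dtype=torch.float)
broadcast_sum = a + row
print(broadcast_sum)
```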


### Gradients

A gradient is just the multi-variable generalization of the derivative. For a function with several outputs, the derivatives are collected in the Jacobian matrix.

PyTorch takes care of all the differentiation for us. This is what is called an _automatic differentiation engine_.

We will start with a simple example:

$
y= x ^3 \\
\frac{\partial y}{\partial x} = \frac{\partial x^3}{\partial x} = 3x^2
$

In [0]:
# We define a single value tensor
tensor_1 = torch.tensor(10, requires_grad=True, dtype=torch.float)
# cube it
tensor_2 = tensor_1 ** 3
print(f"Lets call this x, x={tensor_1}")
print(f"Lets call this y(x)=x^3, y(10)={tensor_2}")

# We have the equation y = x^3
# and its derivative is dy/dx = 3x^2
# Let's let Pytorch take care of the gradient
tensor_2.backward()

# We know that the answer should be 3x^2 with x=10 so
# x=10 => dy=300
# Pytorch thinks the same!
print(f"dy/dx|x=10={tensor_1.grad}")

Lets call this x, x=10.0
Lets call this y(x)=x^3, y(10)=1000.0
dy/dx|x=10=300.0


Okay, maybe that was too easy. Let's try with matrices now.

Let $x$ and $y$ be our variables.

$
x = 
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}\\
y = 
\begin{pmatrix}
2 & 3 & 4 \\
3 & 4 & 5 \\
4 & 5 & 6
\end{pmatrix}
$

The function we want to calculate is: $f(x, y) = sum((x \cdot y)^2)$. Here $x$ and $y$ are both $3 \times 3$ matrices, $x \cdot y$ is their matrix product, and the square is applied element-wise. We use the $sum(\cdot)$ function because of a slight complication in _machine learning_: all we do is optimize. The usual way to optimize is over a single objective; multi-objective optimization is very hard. If we did not sum the result of the multiplication we would be facing a multi-objective optimization, so we reduce it to a scalar. The reduction does not need to be the sum (it could also be the max, the mean, etc.), but we will stick with the sum.

Because the derivative of a sum is the sum of the derivatives, we can move the derivative inside the sum.
Let's calculate the gradient with respect to $x$ and $y$.

$
\begin{align}
f(x, y) & =  \sum(x \cdot y)^2 \\
\frac{\partial f(x, y)}{\partial x} & = \frac{\partial\sum(x \cdot y)^2}{\partial x} \\
& = \sum\frac{\partial(x \cdot y)^2}{\partial x}
\end{align}
$

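For easier comprehension, let's drop the sum. Because we are working with matrices, the chain rule introduces a transpose:

$
\begin{align}
& = \frac{\partial(x \cdot y)^2}{\partial x} \\
& = 2 (x \cdot y) \frac{\partial(x \cdot y)}{\partial x} \\
& = 2 (x \cdot y) \cdot y^T
\end{align}
$

In this example $y$ is symmetric, so $y^T = y$ and the result can also be written as $2x \cdot y \cdot y$.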

Then, with respect to $y$, we have:

$
\begin{align}
& = \frac{\partial(x \cdot y)^2}{\partial y} \\
& = 2 \frac{\partial(x \cdot y)}{\partial y} \cdot x \cdot y \\
& = 2x^T \cdot x \cdot y
\end{align}
$

Remember that we are working with matrices, so the order in which they are multiplied matters.
Let's replace our variables with the known values.

$
\begin{align}
\frac{\partial f(x, y)}{\partial x} & = 2x \cdot y \cdot y \\
    &= 2
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}
*
\begin{pmatrix}
2 & 3 & 4 \\
3 & 4 & 5 \\
4 & 5 & 6
\end{pmatrix}
*
\begin{pmatrix}
2 & 3 & 4 \\
3 & 4 & 5 \\
4 & 5 & 6
\end{pmatrix}
\\
& = 
\begin{pmatrix}
492 & 648 & 804 \\
1176 & 1548 & 1920 \\
1860 & 2448 & 3036
\end{pmatrix}
\\
\frac{\partial f(x, y)}{\partial y} & = 2x^T \cdot x \cdot y \\
&= 2
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}^T
*
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}
*
\begin{pmatrix}
2 & 3 & 4 \\
3 & 4 & 5 \\
4 & 5 & 6
\end{pmatrix}
\\
     &= 2
\begin{pmatrix}
1 & 4 & 7 \\
2 & 5 & 8 \\
3 & 6 & 9
\end{pmatrix}
*
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}
*
\begin{pmatrix}
2 & 3 & 4 \\
3 & 4 & 5 \\
4 & 5 & 6
\end{pmatrix}
\\
& = \begin{pmatrix}
1452 & 1920 & 2388 \\
1734 & 2292 & 2850 \\
2016 & 2664 & 3312
\end{pmatrix}
\end{align}
$

Now let's check with Pytorch if we get the same result.

In [0]:
tensor_1 = torch.tensor([[1, 2, 3], [4, 5, 6],[7,8,9]], requires_grad=True, dtype=torch.float)
tensor_2 = torch.tensor([[2,3,4], [3,4,5], [4,5,6]], requires_grad=True, dtype=torch.float)
tensor_4 = ((tensor_1 @ tensor_2)**2).sum()

tensor_4.backward()
print(f"The gradient for tensor_1 is:\n {tensor_1.grad}")
print(f"The gradient for tensor_2 is:\n {tensor_2.grad}")

The gradient for tensor_1 is:
 tensor([[ 492.,  648.,  804.],
        [1176., 1548., 1920.],
        [1860., 2448., 3036.]])
The gradient for tensor_2 is:
 tensor([[1452., 1920., 2388.],
        [1734., 2292., 2850.],
        [2016., 2664., 3312.]])
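
As an extra sanity check, the closed-form gradients can be compared against autograd on random matrices. For general (not necessarily symmetric) matrices they read $2(x \cdot y) \cdot y^T$ and $2x^T \cdot (x \cdot y)$; with the symmetric $y$ used above, $y^T = y$. This is only a sketch, and the names `x`, `y`, `grad_x_manual` and `grad_y_manual` are illustrative.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 3, requires_grad=True)
y = torch.randn(3, 3, requires_grad=True)  # random, so generally not symmetric

# f(x, y) = sum((x @ y)^2), differentiated by autograd
f = ((x @ y) ** 2).sum()
f.backward()

# The same gradients written out by hand
with torch.no_grad():
    grad_x_manual = 2 * (x @ y) @ y.t()
    grad_y_manual = 2 * x.t() @ (x @ y)

x_matches = torch.allclose(x.grad, grad_x_manual, rtol=1e-4, atol=1e-4)
y_matches = torch.allclose(y.grad, grad_y_manual, rtol=1e-4, atol=1e-4)
print(x_matches, y_matches)
```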


Okay, so now that we have a basic understanding about how we calculate the gradients, we won't be calculating them by hand anymore in the workshop. Instead, we will just use the _automatic differentiation_ provided by Pytorch.

## Stochastic Gradient Descent (SGD)

Gradient Descent is an optimization algorithm for updating the parameters of a _machine learning_ model; Stochastic Gradient Descent (SGD) is the variant that computes each update from a sample (or mini-batch) of the data instead of the whole dataset. There are many other update algorithms (called optimizers), but SGD is one of the simplest and is easy to implement.

For this optimizer we need a learning rate, usually a small floating point number. It is a constant (in plain SGD at least) that controls how abruptly we update the network parameters: the bigger it is, the more the parameters change in each iteration; the smaller it is, the more subtle the change.

In [0]:
# Let's set the learning rate to 0.5
lr = 0.5
# We generate 2 random tensors.
# One 1x2
_input = torch.randn(1,2, dtype=torch.float)
# and the other 2x1
tensor_2 = torch.randn(2,1)
# we need to explicitly tell pytorch that we want this tensor to have its gradients calculated
tensor_2.requires_grad_()

print(f"Our tensor before the update:\n {tensor_2}")

# Multiply the input tensor with our middle tensor
y = _input @ tensor_2
print("The output {}".format(y))

# calculate the gradients
y.backward()

# torch.no_grad() tells pytorch not to track the following operations;
# without it, the in-place update of a tensor that requires gradients raises an error
with torch.no_grad():
    # update the tensor with the learning rate and its gradient
    tensor_2 -= tensor_2.grad * lr
    # set the gradient to zero, we don't want it to accumulate
    tensor_2.grad.zero_()
    
    
print(f"Our tensor after the update:\n {tensor_2}")

Our tensor before the update:
 tensor([[0.6099],
        [0.5049]], requires_grad=True)
The output tensor([[0.4886]], grad_fn=<MmBackward>)
Our tensor after the update:
 tensor([[0.4669],
        [0.1937]], requires_grad=True)
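
A single update does not show much, so here is a small sketch that repeats the same update rule on a toy problem: minimizing $(w - 2)^2$. The function, the starting point and the names `w`, `step_size` and `history` are assumptions chosen for illustration. Since there is no data to sample from, this is plain gradient descent rather than its stochastic variant.

```python
import torch

step_size = 0.1
w = torch.tensor(5.0, requires_grad=True)
history = []

for step in range(50):
    # toy objective with its minimum at w = 2
    loss = (w - 2) ** 2
    history.append(loss.item())
    loss.backward()
    with torch.no_grad():
        # same update rule as above: parameter -= learning rate * gradient
        w -= step_size * w.grad
        w.grad.zero_()

print(f"w after training: {w.item():.4f}")
print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.8f}")
```

With this step size the distance to the minimum shrinks by a constant factor at every iteration; a much larger step size can overshoot and diverge.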


### Exercise: train AND logic gate

In the following exercise you will implement a neural network capable of representing the AND logic gate. This is an illustrative example to get you familiar with Pytorch.

In [0]:
# We defined some functions for you.
# The sigmoid activation function
def sigmoid(z):
    return 1 / (1 + torch.exp(-z))
   
# A loss function. We want to minimize this.
# This measures the distance between the network's output and our desired output.
def MSE_loss(output, target):
    return ((output - target)**2).mean()
    
# Here we select 1000x2 random boolean values
dataset = torch.from_numpy(np.random.choice(a=[0, 1], size=(1000, 2)))
# and convert them to FloatTensors so we can work with them
x_train_and = dataset.type(torch.FloatTensor)
# Here we calculate our desired outputs for each of the inputs.
y_train_and = (dataset[:,0] & dataset[:,1]).type(torch.FloatTensor).unsqueeze(1)

# samples could be useful if you train in batches
samples, _ = x_train_and.shape
# how many times the network will go over the whole dataset
epochs = 100
# if you use batches, the batch size
batch_size = 10
# the learning rate
lr = 0.1

Remember the formula for a single-layer neural network: $$f_{\theta, b}(x) = \theta \cdot x + b$$
$f_{\theta, b}$ is commonly written simply as $f_{\theta}$, where $\theta$ represents all of the network's parameters. Here we write $\theta$ for the network weights and $b$ for the bias. For the function $f$ representing the neural network, we need to find the parameters $\theta$ and $b$ such that the network gives the correct outputs.

In [0]:
# we want to keep track of the loss for each epoch, so we can visualize it later
# append your loss values to this list once per epoch
loss_values = []

# -------- Learn AND ---------
weights = torch.randn(2, 1)
weights.requires_grad_()
bias = torch.randn(1)
bias.requires_grad_()

def model(xb):
    # Define how the neural network works
    return "define me!"

for epoch in range(epochs):
    loss_epoch = []
    for i in range((samples - 1) // batch_size + 1):
        start_i = i * batch_size
        end_i = start_i + batch_size        
        xb = x_train_and[start_i:end_i]
        yb = y_train_and[start_i:end_i]
        
        # something should be done here with our inputs!
        #####################################
        
        loss_epoch.append(loss.item())

        # We need to calculate the gradients and update the parameters!
        ##################
            
    loss_values.append(np.mean(loss_epoch))
    
    if epoch % 10 == 0:
        print(f'Epoch [{epoch}/{epochs}] Loss: {loss_values[-1]:.4f}')
    

# ------- END Learn AND -------
print("Ready")

In [0]:
# Okay! Let's see how the network's loss (or error) changed over time
plt.plot(loss_values)

# Here we ask the network what it thinks these values should be
print(model(torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float)))
# they look good!

## More PyTorch

Iterating over the dataset manually, setting the batch size and slicing the data, manually updating the parameters of the model, implementing the loss and activation functions... PyTorch has us covered here!

First we need to import some more packages

In [0]:
# There is a bunch of optimizers here, aside from SGD
from torch import optim
# Package containing lots of functions to build neural networks
from torch import nn
# To load and manage our datasets. Very useful when we have huge datasets
from torch.utils.data import DataLoader,TensorDataset

First, let's introduce the _TensorDataset_ and the _DataLoader_. The _TensorDataset_ is just an object that stores our data so we don't have to slice it ourselves. The _DataLoader_ allows us to iterate over our data without worrying about batching. It can also load data in parallel, so if your dataset is too big, or you do some type of preprocessing _on the fly_, you can set multiple workers to load it at the same time.

In [0]:
# Create the dataset
train_ds_and = TensorDataset(x_train_and, y_train_and)
# Use the dataset as a data source for the dataloader
# Here we also pass the batch size to allow for mini-batches.
train_dl_and = DataLoader(train_ds_and, batch_size=batch_size, shuffle=True)
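
To see what a _DataLoader_ actually yields, here is a small sketch on toy data. The names `toy_ds`, `toy_dl` and `batch_shapes` are made up for this example, and shuffling is turned off so the batches are easy to follow.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 samples with 2 features each, and one target per sample
features = torch.arange(20, dtype=torch.float).view(10, 2)
targets = torch.arange(10, dtype=torch.float).unsqueeze(1)

toy_ds = TensorDataset(features, targets)
toy_dl = DataLoader(toy_ds, batch_size=4, shuffle=False)

batch_shapes = []
for xb, yb in toy_dl:
    # each iteration yields one mini-batch of inputs and targets
    batch_shapes.append((tuple(xb.shape), tuple(yb.shape)))
    print(xb.shape, yb.shape)
```

Note that the last batch is smaller when the number of samples is not a multiple of the batch size.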

Leaving the network hanging around in our code is really messy and will surely raise some problems later. We can create a class to contain it, and this way add more layers and configurations when needed with not much effort.

In [0]:
# We just need to extend from the nn.Module class and implement the forward method
class Network(nn.Module):
    def __init__(self):
        # initialize the super class
        super().__init__()
        # create a single layer with 2 inputs and 1 output
        self.layer_1 = nn.Linear(2, 1)
        
    def forward(self, xb):
        # apply the input to the layer
        o = self.layer_1(xb)
        # apply an activation function
        o = torch.sigmoid(o)
        # we are done with the forward pass!
        return o
    

Our loss function is also implemented in Pytorch, let's just use it here.

In [0]:
loss_func = nn.MSELoss()

To use SGD, we need to tell PyTorch which parameters we want the optimizer to update. In our case, we want PyTorch to take care of every parameter so we don't have to worry, but if some day you need to, you could pass only some of the parameters here.

In [0]:
# Create a network object. This is just the class we just implemented
model = Network()
# Create the SGD optimizer and pass the network parameters
opt = optim.SGD(model.parameters(), lr=lr)
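
As a sketch of what "passing only some parameters" could look like, the example below hands just the weight of a standalone linear layer to the optimizer, so its bias is never updated. The names `layer` and `opt_subset` are illustrative and separate from the `model` and `opt` defined above.

```python
import torch
from torch import nn, optim

layer = nn.Linear(2, 1)
# named_parameters() lists every trainable tensor with its name
for name, param in layer.named_parameters():
    print(name, tuple(param.shape))

# Only the weight is registered with the optimizer
opt_subset = optim.SGD([layer.weight], lr=0.1)

weight_before = layer.weight.detach().clone()
bias_before = layer.bias.detach().clone()

# One training step on a dummy input
loss = layer(torch.ones(1, 2)).sum()
loss.backward()
opt_subset.step()

print(layer.weight.detach() - weight_before)  # changed by the update
print(layer.bias.detach() - bias_before)      # unchanged
```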

For training it is best to create a function, generally called _fit_, so everything is tidy and we know where to look in case we want to change anything.

In [0]:
# The fit function needs the model we want to train, the optimizer we are using,
# the number of epochs we want it to run, the data we want it to train on,
# and the loss function we want to minimize.
def fit(model, optim, n_epochs, data, loss_func):
    # we will be storing our epoch losses here
    acc_loss = []
    
    # do n_epochs many epochs
    for epoch in range(n_epochs):
        # epoch_loss stores the loss per mini-batch. We will be averaging this later
        epoch_loss = []
        # iterate over the dataset
        for xb, yb in data:
            # make a prediction. Here the network (our model) is calling the *forward* method
            pred = model(xb)
            # the loss measures how close our predictions are to the real values
            loss = loss_func(pred, yb)

            # mini-batch loss
            epoch_loss.append(loss.item())
            # calculate the gradients
            loss.backward()
            # make the SGD do its thing, using the optimizer passed as argument.
            # this will update all the parameters we passed in its constructor
            # just like we manually updated them before
            optim.step()
            # never forget to reset the gradients!
            optim.zero_grad()
            
        # average the mini-batch losses and append the result to the epoch losses
        acc_loss.append(np.mean(epoch_loss))
        
        if epoch % 10 == 0:
            print(f'Epoch [{epoch}/{n_epochs}] Loss: {acc_loss[-1]:.4f}')
        
    return acc_loss

In [0]:
# Here we call the training function.
# The function returns a list with all the historic losses per epoch
# so we can visualize it and check if the model is performing as expected
loss_values = fit(model=model, optim=opt, n_epochs=100, data=train_dl_and, loss_func=loss_func)

# plot the losses
plt.plot(loss_values)
# check if the network really learnt the logic gate
print(model(torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float)))

### Exercise: train the XOR logic gate

In this exercise you will implement a neural network that is able to represent the XOR logic gate. Once you have implemented the network, modify the number of layers, the learning rate and the number of neurons to see how they affect the learning speed of the network, how the accuracy changes over time, etc.

Hint: remember the logic gate tables and where AND and OR differ from XOR.

In [0]:
# data
dataset = torch.from_numpy(np.random.choice(a=[0, 1], size=(1000, 2)))
x_train_xor = dataset.type(torch.FloatTensor)
# xor calculation
y_train_xor = (dataset[:,0] ^ dataset[:,1]).type(torch.FloatTensor).unsqueeze(1)

epochs = 500
batch_size = 10
lr = 0.1

In [0]:
# Add your epoch losses here
loss_values = []


# -------- Learn XOR ---------
train_ds_xor = TensorDataset(x_train_xor, y_train_xor)
train_dl_xor = DataLoader(train_ds_xor, batch_size=batch_size, shuffle=True)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        # define the layers!
        #########################
        
    def forward(self, xb):
        # how the layers are going to be composed?
        ########################
        o = xb
        return o
    
# Remember you need a loss function, an optimizer and to instantiate the network
########################

# complete this function as we saw before
def fit(model, optim, n_epochs, data, loss_func):  
    # you can ignore the logging function, but it helps debug!
    ########################
        
# general fit parameters
fit(model=model, optim=opt, n_epochs=epochs, data=train_dl_xor, loss_func=loss_func)

# ------- END Learn XOR -------
print("Ready")

In [0]:
# Let's see how it went!
plt.plot(loss_values)
# Check the outputs
print(model(torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float)))
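
If a single layer keeps failing on XOR, it may help to see that one hidden layer is enough to represent it. In the sketch below the weights are set by hand purely for illustration (they are not learned, and a trained network will generally find different values). The class name `HandBuiltXor` and the variable names are made up for this example.

```python
import torch
from torch import nn

class HandBuiltXor(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 2)
        self.output = nn.Linear(2, 1)
        # Hand-picked weights (assumption for illustration, not a training result):
        # h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), out = h1 - 2 * h2
        with torch.no_grad():
            self.hidden.weight.copy_(torch.tensor([[1.0, 1.0], [1.0, 1.0]]))
            self.hidden.bias.copy_(torch.tensor([0.0, -1.0]))
            self.output.weight.copy_(torch.tensor([[1.0, -2.0]]))
            self.output.bias.zero_()

    def forward(self, xb):
        return self.output(torch.relu(self.hidden(xb)))

xor_net = HandBuiltXor()
truth_table = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
xor_out = xor_net(truth_table).detach().view(-1)
print(xor_out)
```

The non-linear activation between the two layers is essential: without it, the two linear layers collapse into a single linear one.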

### Exercise: train a polynomial function

In this exercise there are a few more things to take into account. We introduce a data scaler, which takes your data and scales it to a defined range, generally between 0 and 1.

Why do you think we need to scale our data? Try removing those lines and feeding the unscaled data to the network. Does it work? What happens? Can you explain why that is the case?

In [0]:
# import the scaler from sklearn
from sklearn.preprocessing import MinMaxScaler

In [0]:
# Define your polynomial function here
def polinomio(x):
    return 10*x**5 + x + 5

# generate the data points
step = 0.01
x_train_poli_pre = np.arange(-50, 50 + step, step).reshape(-1, 1)
# create the real values
y_train_poli_pre = polinomio(x_train_poli_pre)

# scale the data
scaler_x = MinMaxScaler((0, 1)).fit(x_train_poli_pre)
scaler_y = MinMaxScaler((0, 1)).fit(y_train_poli_pre)

# create tensors to feed to the neural network
x_train_poli = torch.tensor(scaler_x.transform(x_train_poli_pre), dtype=torch.float)
y_train_poli = torch.tensor(scaler_y.transform(y_train_poli_pre), dtype=torch.float)
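
To make the scaling less of a black box, here is what min-max scaling to the range (0, 1) computes, written by hand with NumPy on a few sample values. The array `values` and the other names are illustrative; `MinMaxScaler` performs the same per-column computation and also offers `inverse_transform` to undo it.

```python
import numpy as np

values = np.array([[-50.0], [0.0], [25.0], [50.0]])

# per-column minimum and maximum, as learned by the scaler's fit step
col_min = values.min(axis=0)
col_max = values.max(axis=0)

# transform: map [min, max] linearly onto [0, 1]
scaled = (values - col_min) / (col_max - col_min)
# inverse transform: map back to the original units
restored = scaled * (col_max - col_min) + col_min

print(scaled.ravel())
print(restored.ravel())
```

The inverse step is useful when you want to report the network's predictions in the original units of the polynomial.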

In [0]:
# -------- Learn poli ---------
epochs = 1000
batch_size = 100
lr = 0.1
loss_values = []
train_ds_poli = TensorDataset(x_train_poli, y_train_poli)
train_dl_poli = DataLoader(train_ds_poli, batch_size=batch_size, shuffle=True)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        # Try with different layers
        ########################
        
    def forward(self, xb):
        # how the layers are going to be composed?
        ########################
        o = xb
        return o
    
# Remember you need a loss function, an optimizer and to instantiate the network

# complete this function as we saw before
def fit(model, optim, n_epochs, data, loss_func):   
    # you can ignore the logging function, but it helps debug!
        
# general fit parameters
fit(model=model, optim=opt, n_epochs=epochs, data=train_dl_poli, loss_func=loss_func)

# -------- END Learn poli ---------
print("Ready")

In [0]:
# Let's compare our results with the real ones!
plt.plot(x_train_poli.view(-1).detach().numpy(), y_train_poli.view(-1).detach().numpy())
plt.plot(x_train_poli.view(-1).detach().numpy(), model(x_train_poli).view(-1).detach().numpy())

In [0]:
# And how the training went
plt.plot(loss_values)

## Classification

Up to this point we have only been training on regression tasks. This means that there is a number we want the network to reach.
There is another type of task called _classification_. Here our network does not output a single number, but several of them: one output per class. For example, if we have Cat and Dog pictures and we would like to classify them, we have 2 classes, so the network will have 2 outputs.

We introduce here the one-hot encoding, a very useful technique to represent nominal data. In the above example of Cat and Dog pictures, there are 2 possible classifications for an image: it is a Cat or it is a Dog. We could represent this as a single output value where values lower than 0.5 mean Cat and higher values mean Dog. But what if we had 10 different classes? We could set intervals of 0.1 to represent each of the classes, but this starts to get confusing and in practice does not work very well.
The one-hot encoding solves this. If we have 2 possible classes, we create a 2-value vector [v1, v2]. If our network thinks the picture is of a Cat, we would like v1 to be 1 and v2 to be 0. Likewise, if it thinks it is of a Dog, we want v1 to be 0 and v2 to be 1. This way we can encode as many different classes as we want.
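
As a small illustration of the encoding, the sketch below converts class indices to one-hot vectors and back. It assumes the convention Cat = 0 and Dog = 1, which is chosen here only for the example.

```python
import torch
import torch.nn.functional as F

# Assumed labels for four pictures: 0 = Cat, 1 = Dog
labels = torch.tensor([0, 1, 1, 0])

# One row per sample, one column per class
one_hot = F.one_hot(labels, num_classes=2)
print(one_hot)

# The index of the 1 in each row recovers the original class
recovered = one_hot.argmax(dim=1)
print(recovered)
```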

### Iris dataset

The iris dataset is a well known multivariate dataset that consists of 150 samples from 3 types of iris flowers, with 4 measurements per sample. We will use it to train a neural network to predict which type of flower a datapoint is, based on those measurements.

In [0]:
# First, let's import the dataset from sklearn
from sklearn.datasets import load_iris
# We also want this very useful auxiliary function
# we use it to split our data in training and testing sets
# this way we can check if our model is capable of generalizing
# what we showed it
from sklearn.model_selection import train_test_split

In [0]:
# Load the dataset
iris = load_iris()
# here we have the datapoints
iris_data = iris["data"].astype('float32')
# here we have the type of flower each datapoint is
iris_target = iris["target"]

# split our data into training and testing sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.33, random_state=42)

# we scale our input data so that its values stay in a small range
# we should fit the scaler ONLY on the TRAINING data
# it is cheating if you also use the testing data
scaler_x = MinMaxScaler((0, 1)).fit(X_train_raw)
# scale the training and testing data with what is seen on the training data only
X_train, X_test = scaler_x.transform(X_train_raw), scaler_x.transform(X_test_raw)

# get the tensors to feed to the network
X_train, X_test, y_train, y_test = map(torch.from_numpy, [X_train, X_test, y_train, y_test])

In [0]:
# Same as before, create the network class
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Here we will be using 2 layers a 4x20 and a 20x3.
        # We have 4 inputs and 3 outputs
        self.layer_1 = torch.nn.Linear(4, 20)
        self.layer_2 = torch.nn.Linear(20, 3)
    def forward(self,xb):
        # apply the layers
        o = self.layer_1(xb)
        # relu is an activation function defined as max(0, xb).
        # basically clips all negative values
        o = torch.relu(o)
        o = self.layer_2(o)
        return o

In [0]:
# Same hyperparameters as always
epochs = 1500
batch_size = 100
lr = 0.1
loss_values = []

# the dataset and dataloader
train_ds_iris = TensorDataset(X_train, y_train)
train_dl_iris = DataLoader(train_ds_iris, batch_size=batch_size, shuffle=True)

# For classification problems we use a different loss function called
# Cross Entropy. It is closely related to the negative log likelihood (nll):
# nn.CrossEntropyLoss applies a log-softmax to the network outputs and then the nll.
# The effect is that we optimize for a single class in the output,
# meaning, decrease the values for the classes that are not correct, and increase
# the value for the correct one.
loss_func = nn.CrossEntropyLoss()
model = Network()
# the optimizer is the same
opt = optim.SGD(model.parameters(), lr=lr)
def fit(model, optim, n_epochs, data, loss_func):    
    for epoch in range(n_epochs):
        epoch_loss = []
        for xb, yb in data:
            pred = model(xb)
            loss = loss_func(pred, yb)

            epoch_loss.append(loss.item())
            loss.backward()
            # use the optimizer passed as argument instead of the global one
            optim.step()
            optim.zero_grad()
            
        loss_values.append(np.mean(epoch_loss))
        
        if epoch % 50 == 0:
            print(f'Epoch [{epoch}/{n_epochs}] Loss: {loss_values[-1]:.4f}')
        
fit(model=model, optim=opt, n_epochs=epochs, data=train_dl_iris, loss_func=loss_func)

print("Ready")
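
The relation between Cross Entropy, softmax and the negative log likelihood can be checked numerically. The sketch below uses random scores (`logits`) and made-up targets; all names are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)               # 5 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1, 0])  # class index per sample

# Cross Entropy applied directly to the raw network outputs
ce = nn.CrossEntropyLoss()(logits, targets)

# The same value in two steps: log-softmax, then negative log likelihood
log_probs = F.log_softmax(logits, dim=1)
nll = F.nll_loss(log_probs, targets)

# And written out by hand: minus the mean log-probability of the correct class
manual = -log_probs[torch.arange(5), targets].mean()

print(ce.item(), nll.item(), manual.item())
```

This is also why the network above returns raw scores without a softmax layer: the loss function applies it internally.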

In [0]:
# plot the loss
plt.plot(loss_values)

In [0]:
# This cool function lets us create a good basic report with
# our model's accuracy
from sklearn.metrics import classification_report

In [0]:
# Our network has 3 outputs, but we only care about the highest value.
# The highest value marks the class the network thinks is the correct one.
# Because each output corresponds to one class, the index of the highest value is the
# predicted class. We use the max function to get the indices for each sample
_, y_test_pred = torch.max(model(X_test), dim=1)

# lets generate the report!
print(classification_report(y_test, y_test_pred, target_names=iris["target_names"]))
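
For a quick check without sklearn, the predicted classes and the overall accuracy can be computed directly with tensors. The scores and labels below are made up for illustration.

```python
import torch

# Made-up network outputs for 3 samples and 3 classes
scores = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0],
                       [1.0, 1.5, 0.0]])
true_labels = torch.tensor([0, 2, 0])

# torch.max returns (values, indices); argmax returns only the indices
_, pred_from_max = torch.max(scores, dim=1)
pred_from_argmax = scores.argmax(dim=1)

# fraction of samples whose predicted class matches the true one
accuracy = (pred_from_argmax == true_labels).float().mean().item()
print(pred_from_argmax, accuracy)
```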

### Exercise: train on the MNIST dataset

The MNIST dataset is a classic dataset, almost always used to teach neural networks, as it presents an interesting task to solve.

The MNIST dataset is a set of hand-written digits, from 0 to 9. As you know, everyone writes in a different way, but we humans are, almost all the time, able to tell one digit from another. Here we will teach a neural network to do the same and check how it performs.

The dataset consists of 70,000 samples, each a 28x28 matrix. We will use 60,000 samples to train our network, and the other 10,000 for testing.
Since the digits are in grey scale, a single 28x28 matrix is enough: each value gives the intensity of that pixel. If the images had colours, we would have several channels per image and would need another approach, such as Convolutional Neural Networks.

Before we start, we must note that we do not know how to handle a 28x28 input matrix, we just know how to handle vectors. To work around this we can reshape the 28x28 matrix into a 784-element vector. Now we have everything to start training.

Okay, so let's begin.

In [0]:
# First let's import the dataset with a simple call
from torchvision import datasets
# This package will let us make some modifications to our data
import torchvision

In [0]:
# Download and load the dataset
train_data = datasets.MNIST('data', train=True, download=True)
test_data = datasets.MNIST('data', train=False, download=True)

# The dataset comes in a Byte format, so we convert it to Float
X_train, y_train = train_data.data.type(torch.FloatTensor), train_data.targets
X_test, y_test = test_data.data.type(torch.FloatTensor), test_data.targets

# The dataset comes in a (Samples x rows x columns) manner. We do not know how to
# handle the (rows x columns) input, so we just reshape it to a
# rows*columns long vector. In this case we will have a (Samples x (rows*columns))
# input tensor, being the first dimension the number of datapoints, and the second
# the input dimension.
# We use the handy -1 notation to keep the first dimension untouched, and
# reshape the rest of the matrix to a vector.
X_train, X_test = X_train.view(X_train.shape[0], -1), X_test.view(X_test.shape[0], -1)
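
The reshaping step can be tried on fake data without downloading anything. The sketch below uses random "images" (an assumption for illustration, not MNIST itself) and also divides by 255, one simple way to bring pixel values into the range 0 to 1, in the spirit of the scaling discussed earlier.

```python
import torch

# 8 fake grey-scale images with byte values, standing in for MNIST samples
fake_images = torch.randint(0, 256, (8, 28, 28), dtype=torch.uint8)

# Byte -> Float, then flatten every 28x28 image into a 784-element vector
as_float = fake_images.type(torch.FloatTensor)
flat = as_float.view(as_float.shape[0], -1)
print(flat.shape)

# Optional: scale pixel values from [0, 255] to [0, 1]
scaled = flat / 255.0

# Flattening loses nothing: a vector can be reshaped back into its image
restored = flat[0].view(28, 28)
```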

In [0]:
# Let's take a look at some example points from the dataset, just to see how they look.
fig = plt.figure(figsize=(6, 4))
for i, index in enumerate(np.random.randint(0, X_train.shape[0], size=6)):
    ax = fig.add_subplot(2, 3, i+1, xticks=[], yticks=[])
    ax.imshow(X_train[index].reshape(28, 28), cmap='gray', interpolation='none')
    ax.set_title("Number: {}".format(y_train[index]))

In [0]:
# Time to train. Modify these parameters accordingly
import math
epochs = 20
batch_size = 10000 # best to be a big number to make it faster
lr = 0.1

In [0]:
# Add your epoch losses here
loss_values = []


# -------- Learn MNIST ---------
train_ds_mnist = TensorDataset(X_train, y_train)
test_ds_mnist = TensorDataset(X_test, y_test)
train_dl_mnist = DataLoader(train_ds_mnist, batch_size=batch_size, shuffle=True)
test_dl_mnist = DataLoader(test_ds_mnist, batch_size=batch_size, shuffle=True)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        # define the network!
        # with one layer it already works
        # remember we have a 28x28 input matrix (but we only know how to work with vectors!)
        # and we have 10 possible outputs
        # so... ?
        
    def forward(self, xb):
        # define how the network works
        ###################
        o = xb
        return o
    
    
# Remember you need a loss function, an optimizer and to instantiate the network
# in this case the loss function can't be the MSE (why?), so which could it be?

# complete this function as we saw before
def fit(model, optim, n_epochs, data, loss_func):
    # you can ignore the logging function, but it helps debug!
    pass  # TODO: write the training loop and add each epoch loss to loss_values

# general fit parameters
fit(model=model, optim=opt, n_epochs=epochs, data=train_dl_mnist, loss_func=loss_func)

# -------- END Learn MNIST ---------
print("Ready")
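
If you get stuck, the following is one possible way to fill in the blanks. It is a minimal sketch and not necessarily the solution intended for the workshop. It assumes a single linear layer, `nn.CrossEntropyLoss` as the loss function (a common choice for multi-class classification) and plain SGD. To keep it self-contained it trains on synthetic data: `X_fake`, `y_fake` and `fake_dl` are hypothetical stand-ins for the MNIST tensors and loader.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for MNIST (assumption): 512 random 784-element vectors,
# labelled by a random linear "teacher" so that there is something to learn.
X_fake = torch.randn(512, 784)
y_fake = (X_fake @ torch.randn(784, 10)).argmax(dim=1)
fake_dl = DataLoader(TensorDataset(X_fake, y_fake), batch_size=64, shuffle=True)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        # One linear layer: 784 inputs (the flattened image) -> 10 class scores
        self.linear = nn.Linear(784, 10)

    def forward(self, xb):
        # Return raw scores (logits); CrossEntropyLoss applies log-softmax internally
        return self.linear(xb)

loss_values = []

def fit(model, optim, n_epochs, data, loss_func):
    for epoch in range(n_epochs):
        running_loss = 0.0
        for xb, yb in data:
            loss = loss_func(model(xb), yb)
            optim.zero_grad()
            loss.backward()
            optim.step()
            # Weight each batch loss by its size so the epoch average is per sample
            running_loss += loss.item() * len(xb)
        loss_values.append(running_loss / len(data.dataset))

model = Network()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = nn.CrossEntropyLoss()
fit(model=model, optim=opt, n_epochs=5, data=fake_dl, loss_func=loss_func)
print(loss_values)
```

To use the same pieces on the real data, pass `train_dl_mnist` as `data` and the `epochs` and `lr` values defined earlier. The hyperparameters may need tuning, since the MNIST inputs here range from 0 to 255 rather than being centred around zero.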

In [0]:
# Let's plot our loss function!
plt.plot(loss_values)

# Make the predictions over the test set
# Remember we have not shown these datapoints to the network
_, y_test_pred = torch.max(model(X_test), dim=1)

# Check how our network performs for each of the labels
print(classification_report(y_test, y_test_pred))

We can see that the model is not perfect, as it fails to recognize several samples. The report tells us that the network still has room for improvement, but a success rate of around 90% is already quite good.
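
To make the idea of a success rate concrete, here is a minimal sketch of how an overall accuracy can be computed from network outputs. The `scores` and `labels` tensors are hypothetical toy values, not results from the MNIST model.

```python
import torch

# Hypothetical network outputs (scores) for 4 samples and 3 classes
scores = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0],
                       [1.5, 1.0, 0.0],
                       [0.1, 2.2, 0.4]])
labels = torch.tensor([0, 2, 1, 1])

# torch.max over dim=1 returns (largest score, its index);
# the index is the predicted class
_, predicted = torch.max(scores, dim=1)

# Fraction of samples where the prediction matches the label
accuracy = (predicted == labels).float().mean().item()

print(predicted)  # tensor([0, 2, 0, 1])
print(accuracy)   # 0.75
```

The same two lines applied to `model(X_test)` and `y_test` would give the overall accuracy on the test set.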

Let's take a look at some predictions from our network. The correct label for each sample is shown in parentheses. Keep running this cell until you see some red labels, which mean the network failed to correctly identify the number in the picture. Do you find that number hard to identify? Do you think it is an understandable mistake?

In [0]:
n = 20
fig = plt.figure(figsize=(25, 4))
for i, index in enumerate(np.random.randint(0, X_test.shape[0], size=n)):
    # add_subplot needs integer grid sizes, so use integer division
    ax = fig.add_subplot(2, n // 2, i+1, xticks=[], yticks=[])
    ax.imshow(X_test[index].reshape(28, 28), cmap='gray', interpolation='none')
    ax.set_title(f"pred: {y_test_pred[index]} (real: {y_test[index]})", 
                 color=("green" if y_test_pred[index]==y_test[index] else "red"))

### Now it is time to test it on your own!

All the data in the MNIST dataset is provided. We can see some cool predictions and all but... does it really work?
Let's try something: let's draw some numbers ourselves, upload them to this notebook, and see if the network can identify them correctly!

In [0]:
# Import this package to upload files to the notebook
from google.colab import files

To draw our numbers, we could take a pencil and draw them on a sheet of paper, then take a photo, transfer it to the computer, resize it, and upload it here. But that takes too much time, so let's use an online image editor.

We tested https://www.favicon-generator.org/image-editor/ and it worked pretty well. Here are the instructions to generate your own number:
1. Open https://www.favicon-generator.org/image-editor/
2. Resize the canvas by clicking on Image > Size... and entering a 28 x 28 value.
3. Now the canvas is quite small, so zoom in until you are comfortable.
4. Click the Paint bucket and select the color black. Then click on the canvas to paint it black. We want to paint the whole background black.
5. Now select the Brush option and select the white color. Adjust the size to around 4.
6. Draw a number.
7. Export your number with File > Save as... and save it with the jpg extension.

We will be using an easy and fast method to load our images into our model. For that reason, follow these instructions to generate a folder structure suitable for that method:
1. Create a folder called 'temp'. It can be any name you want, but if you use another name you will have to change it in the following cells.
2. Create subfolders for the numbers you want to upload. For example, if I drew a 5 and an 8, I will create a subfolder called 5 and another called 8.
3. Move the drawn numbers to their corresponding folders.
4. Compress the 'temp' folder into a zip file. On Windows, you can just right click the 'temp' folder > Send to > Compressed (zipped) folder. On macOS, right click the folder > Compress. On Linux, just use the zip command `zip -r temp.zip temp`.
5. Run the next cell to upload the zipped folder and unzip it automatically into the notebook.
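
If you would rather script steps 1 to 4 on your own machine, they can be sketched with Python's standard library as follows. The helper `make_digit_zip` and the file names `my_five.jpg` and `my_eight.jpg` are hypothetical, and the demo uses placeholder files in a temporary directory instead of real drawings.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def make_digit_zip(workdir, drawings):
    """Copy each drawing into temp/<label>/ and zip the temp folder."""
    root = Path(workdir) / "temp"
    for label, image_paths in drawings.items():
        label_dir = root / str(label)
        label_dir.mkdir(parents=True, exist_ok=True)
        for image_path in image_paths:
            shutil.copy(image_path, label_dir / Path(image_path).name)
    # Produces <workdir>/temp.zip whose entries all start with "temp/"
    archive = shutil.make_archive(
        str(Path(workdir) / "temp"), "zip", root_dir=workdir, base_dir="temp"
    )
    return Path(archive)

# Demo with placeholder files standing in for real drawings
workdir = Path(tempfile.mkdtemp())
five = workdir / "my_five.jpg"
eight = workdir / "my_eight.jpg"
five.write_bytes(b"placeholder")
eight.write_bytes(b"placeholder")

zip_path = make_digit_zip(workdir, {5: [five], 8: [eight]})
with zipfile.ZipFile(zip_path) as zf:
    zip_names = sorted(zf.namelist())
print(zip_names)
```

The resulting `temp.zip` has the same layout as the one produced by the manual steps, so it can be uploaded in the next cell in the same way.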

In [0]:
# Navigate to and upload the zipped temp folder
files.upload()
# unzips
!unzip temp.zip 

We use ImageFolder to load the images and apply some transformations automatically. The images are saved in RGB, a 3-channel format, so each position in the matrix has 3 colour values. We don't know how to handle that, so we convert them to grey scale on a single channel. Then, because our original data ranges from 0 to 255 and ToTensor produces values from 0 to 1, we multiply the converted image by 255.
We could also have scaled our original data to the range 0 to 1 instead, but we did not. Do you think that would have improved the network's training?
Finally, because each image is a 28x28 matrix, we reshape it into a single 784-element vector, which is what the reshape(-1) instruction does.

In [0]:
# Load the dataset based on a folder structure
user_made = torchvision.datasets.ImageFolder(
        root="temp/",
        transform=torchvision.transforms.Compose([
            torchvision.transforms.Grayscale(num_output_channels=1),
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Lambda(lambda x: x * 255),
            torchvision.transforms.Lambda(lambda x: x.reshape(-1))
        ]) 
    )
# ImageFolder labels each image with the index of its folder, not the folder name.
# Map that index back to the digit the folder is named after.
user_made.target_transform = lambda _id: int(user_made.classes[_id])

# Create a loader to speed up and simplify the process
user_data = DataLoader(user_made, batch_size=len(user_made.samples), shuffle=False)
# Just get all the data, but in the format we expect
user_x, user_y = next(iter(user_data))

# make predictions on our numbers!
_, y_user_pred = torch.max(model(user_x), dim=1)

Let's check if the network is capable of correctly identifying the numbers we drew!

In [0]:
n = len(y_user_pred)
# add_subplot needs integer grid sizes: at most 10 images per row
n_rows = math.ceil(n / 10)
n_cols = math.ceil(n / n_rows)
fig = plt.figure(figsize=(25, 2 * n_rows))
for i, user_tensor in enumerate(user_x):
    ax = fig.add_subplot(n_rows, n_cols, i+1, xticks=[], yticks=[])
    ax.imshow(user_tensor.reshape(28, 28), cmap='gray', interpolation='none')
    ax.set_title(f"pred: {y_user_pred[i]} (real: {user_y[i]})", 
                 color=("green" if y_user_pred[i]==user_y[i] else "red"))

----

That was it for day 1 of this workshop. If you want more details on what we have seen, here are some resources:
* Pytorch documentation: https://pytorch.org/docs/stable/index.html
* An introduction to Deep Learning with Pytorch: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
* A tutorial on how autograd works: https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95
* More details on backpropagation and gradient calculus: http://cs231n.stanford.edu/handouts/derivatives.pdf
* More on matrix differentiation: https://www.comp.nus.edu.sg/~cs5240/lecture/matrix-differentiation.pdf
* The matrix calculus you need for deep learning: https://explained.ai/matrix-calculus/index.html