# Week 4: Introduction to PyTorch

In [17]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

#### Load the dataset, split into input (X) and output (y) variables

In [18]:
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:,0:8]
y = dataset[:,8]

Before use, this data should be converted to PyTorch tensors. One reason is that PyTorch usually works in 32-bit floating point, while NumPy uses 64-bit floating point by default, and most operations do not allow mixing the two. Converting explicitly avoids implicit conversions that may cause problems. You can also take this chance to correct the shape to what PyTorch expects, e.g., an n×1 matrix rather than an n-vector.

**To convert, create a tensor out of NumPy arrays:**

In [19]:
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
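
The effect of the conversion can be checked on a small array. The following is a minimal sketch, assuming a made-up 4×9 array in place of the CSV data; the names `data`, `X_np`, `y_np`, `X_t`, `y_t`, and `w64` are illustrative only.

```python
import numpy as np
import torch

# Hypothetical stand-in for the CSV data: 4 rows, 8 features + 1 label column
data = np.arange(36, dtype=np.float64).reshape(4, 9)
X_np, y_np = data[:, 0:8], data[:, 8]
print(X_np.dtype)                      # NumPy default dtype: float64

# Convert to 32-bit tensors and make the labels an n×1 matrix
X_t = torch.tensor(X_np, dtype=torch.float32)
y_t = torch.tensor(y_np, dtype=torch.float32).reshape(-1, 1)
print(X_t.dtype, tuple(X_t.shape))
print(y_t.dtype, tuple(y_t.shape))

# Multiplying a float32 tensor by a float64 tensor is typically rejected
w64 = torch.ones(8, 1, dtype=torch.float64)
try:
    X_t @ w64
except RuntimeError as err:
    print("dtype mismatch:", err)
```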

You are now ready to define your neural network model.
A model can be defined as a sequence of layers. One option is to create a Sequential model with the layers listed out; this example instead defines the layers in a class derived from `nn.Module`. Whichever form you use, the first thing to get right is the number of input features in the first layer. In this example, the input dimension is 8, one for each of the eight input variables.

Choosing the other parameters for a layer, or how many layers a model needs, is not an easy question. You may use heuristics to help you design the model, or you can refer to other people’s designs for similar problems. Often, the best neural network structure is found through trial-and-error experimentation. Generally, you need a network large enough to capture the structure of the problem but small enough to make it fast.

In this example, let’s use a fully-connected network structure with three layers. Fully connected layers, or dense layers, are defined using the `Linear` class in PyTorch. Such a layer performs a matrix multiplication of its input with a weight matrix and adds a bias. You specify the number of inputs as the first argument and the number of outputs as the second argument. The number of outputs is sometimes called the number of neurons or number of nodes in the layer. You also need an activation function after the layer. If none is provided, the output of the matrix multiplication goes unchanged to the next step; this is sometimes called linear activation, hence the name of the layer.
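
To make the "matrix multiplication" description concrete, here is a minimal sketch that compares the output of a `Linear` layer with the same computation written out by hand. The random batch `x` and the names `layer`, `out`, and `manual` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 12)      # 8 inputs, 12 outputs ("neurons")
x = torch.randn(5, 8)         # hypothetical batch of 5 rows with 8 features

out = layer(x)
# The same computation by hand: weight has shape (12, 8), bias has shape (12,)
manual = x @ layer.weight.T + layer.bias

print(tuple(out.shape))
print(torch.allclose(out, manual, atol=1e-6))
```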

In this example, you will use the rectified linear unit activation function, referred to as ReLU, on the first two layers and the sigmoid function in the output layer.
A sigmoid on the output layer ensures the output is between 0 and 1, which is easy to interpret as the probability of class 1 or to snap to a hard classification of either class with a cut-off threshold of 0.5. In the past, you might have used sigmoid and tanh activation functions for all layers, but it turns out that sigmoid activation can lead to the problem of vanishing gradients in deep neural networks, and ReLU activation is often found to provide better performance in terms of both speed and accuracy.
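
As a small illustration of the two activations, the sketch below applies them to a handful of made-up values and snaps the sigmoid output to a class with the 0.5 threshold. The tensor `logits` is hypothetical and not taken from the dataset.

```python
import torch

logits = torch.tensor([-4.0, -0.5, 0.0, 0.5, 4.0])  # hypothetical raw layer outputs

probs = torch.sigmoid(logits)       # squashed into the range (0, 1)
labels = (probs > 0.5).float()      # hard classification with a 0.5 cut-off
relu_out = torch.relu(logits)       # negatives become 0, positives pass through

print(probs)
print(labels)
print(relu_out)
```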

You can piece it all together by adding each layer such that:

- The model expects rows of data with 8 variables (the first argument at the first layer set to 8)
- The first hidden layer has 12 neurons, followed by a ReLU activation function
- The second hidden layer has 8 neurons, followed by another ReLU activation function
- The output layer has one neuron, followed by a sigmoid activation function.

In [20]:
class PimaClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(8, 12)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(12, 8)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(8, 1)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x

model = PimaClassifier()
print(model)
        

PimaClassifier(
  (hidden1): Linear(in_features=8, out_features=12, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=12, out_features=8, bias=True)
  (act2): ReLU()
  (output): Linear(in_features=8, out_features=1, bias=True)
  (act_output): Sigmoid()
)


In this approach, the class defines all its layers in the constructor, because the components must be prepared when the model is created, before any input is provided. Note that you also need to call the parent class’s constructor (the line `super().__init__()`) to bootstrap your model. You also need to define a `forward()` function in the class that specifies how an input tensor `x` is turned into the output tensor.
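
For comparison, the same stack of layers can be written with `nn.Sequential`. This is a minimal sketch of that alternative, not a replacement for the class above; the name `seq_model` and the random input `sample` are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Same layer sizes as PimaClassifier, listed in the order they are applied
seq_model = nn.Sequential(
    nn.Linear(8, 12), nn.ReLU(),
    nn.Linear(12, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

sample = torch.rand(3, 8)     # hypothetical batch of 3 rows with 8 features
out = seq_model(sample)
print(tuple(out.shape))       # one output per row
```

The class form takes more lines but lets you put arbitrary logic in `forward()`, while `Sequential` only chains its layers in order.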

Before training, you need a loss function and an optimizer. The loss function measures how far the predictions are from the labels; for this binary classification problem, binary cross entropy (`nn.BCELoss`) is used. The optimizer is the algorithm you use to adjust the model weights progressively to produce a better output. There are many optimizers to choose from, and in this example, Adam is used. This popular variant of gradient descent adapts its step sizes automatically and gives good results in a wide range of problems.

The optimizer usually has some configuration parameters, most notably the learning rate `lr`. But all optimizers need to know what to optimize. Therefore, you pass `model.parameters()`, which is a generator of all parameters from the model you created.
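
To see what `parameters()` yields, the sketch below lists the parameter tensors of a network with the same layer sizes and counts them. The names `demo_model` and `demo_optimizer` are hypothetical, chosen so the sketch does not overwrite the notebook's own variables.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in network with the same layer sizes as PimaClassifier
demo_model = nn.Sequential(
    nn.Linear(8, 12), nn.ReLU(),
    nn.Linear(12, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

# Each Linear layer contributes a weight matrix and a bias vector
for name, p in demo_model.named_parameters():
    print(name, tuple(p.shape))

n_params = sum(p.numel() for p in demo_model.parameters())
print(n_params)   # (8*12 + 12) + (12*8 + 8) + (8*1 + 1) = 221

demo_optimizer = optim.Adam(demo_model.parameters(), lr=0.001)
print(demo_optimizer.param_groups[0]["lr"])
```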

In [21]:
loss_fn = nn.BCELoss() # Binary cross entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
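
As a sanity check on what `nn.BCELoss` computes, here is a minimal sketch comparing it with the binary cross entropy formula written out by hand, using the default mean reduction. The predictions `p` and targets `t` are made-up values.

```python
import torch
import torch.nn as nn

p = torch.tensor([[0.9], [0.2], [0.6]])   # hypothetical predicted probabilities
t = torch.tensor([[1.0], [0.0], [0.0]])   # hypothetical true labels

builtin = nn.BCELoss()(p, t)
# Mean over samples of -[t*log(p) + (1-t)*log(1-p)]
manual = -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()

print(builtin.item(), manual.item())
print(torch.allclose(builtin, manual))
```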

You have defined your model, the loss metric, and the optimizer. The model is now ready to be trained on some data.

Training a neural network model usually proceeds in epochs and batches. These terms describe how data is passed to a model:

- Epoch: One pass of the entire training dataset through the model
- Batch: One or more samples passed to the model, from which the gradient descent algorithm is executed for one iteration

The size of a batch is limited by the system’s memory. Also, the number of computations required is linearly proportional to the size of a batch. The total number of batches over many epochs is how many times you run gradient descent to refine the model. There is a trade-off: you want more iterations of gradient descent so you can produce a better model, but at the same time, you do not want the training to take too long to complete. The number of epochs and the size of a batch can be chosen experimentally by trial and error.
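
The arithmetic behind "total number of batches" can be sketched as follows. The row count of 768 is an assumption about the dataset file (use `len(X)` for the real value), and the names `n_rows`, `rows_per_batch`, and `passes` are illustrative.

```python
import math

n_rows = 768          # assumed number of rows; use len(X) with the real data
rows_per_batch = 10   # batch size
passes = 100          # number of epochs

steps_per_epoch = math.ceil(n_rows / rows_per_batch)
total_steps = steps_per_epoch * passes
last_batch_rows = n_rows - (steps_per_epoch - 1) * rows_per_batch

print(steps_per_epoch)   # 77 gradient descent steps per epoch
print(total_steps)       # 7700 steps over all epochs
print(last_batch_rows)   # 8 rows in the final, smaller batch
```

Doubling the batch size would roughly halve the number of steps per epoch, so the same number of epochs gives the optimizer fewer updates.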

The goal of training a model is to ensure it learns a good enough mapping of input data to output classification. It will not be perfect, and errors are inevitable. Usually, you will see the amount of error decrease over the epochs, but it will eventually level out. This is called model convergence.

In [22]:
n_epochs = 100
batch_size = 10

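for epoch in range(n_epochs):
    for i in range(0, len(X), batch_size):
        # Slice out one batch of inputs and the matching labels
        Xbatch = X[i:i+batch_size]
        ybatch = y[i:i+batch_size]
        # Forward pass and loss for this batch
        y_pred = model(Xbatch)
        loss = loss_fn(y_pred, ybatch)
        # Clear old gradients, backpropagate, then update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Reports the loss of the last batch only, not the epoch average
    print(f'Finished epoch {epoch}, latest loss {loss}')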

Finished epoch 0, latest loss 0.6111011505126953
Finished epoch 1, latest loss 0.5506048798561096
Finished epoch 2, latest loss 0.5479680299758911
Finished epoch 3, latest loss 0.5462202429771423
Finished epoch 4, latest loss 0.5351380109786987
Finished epoch 5, latest loss 0.5223537683486938
Finished epoch 6, latest loss 0.5195610523223877
Finished epoch 7, latest loss 0.5059084892272949
Finished epoch 8, latest loss 0.5092824697494507
Finished epoch 9, latest loss 0.5084242224693298
Finished epoch 10, latest loss 0.5107033848762512
Finished epoch 11, latest loss 0.5091471076011658
Finished epoch 12, latest loss 0.5129774808883667
Finished epoch 13, latest loss 0.5170820951461792
Finished epoch 14, latest loss 0.5171381235122681
Finished epoch 15, latest loss 0.5113753080368042
Finished epoch 16, latest loss 0.5145292282104492
Finished epoch 17, latest loss 0.5092077255249023
Finished epoch 18, latest loss 0.49967071413993835
Finished epoch 19, latest loss 0.504838228225708
Finished e