# Introducing PyTorch

_written by [Gene Kogan](https://www.genekogan.com), updated by Sebastian Quinard_

-----

In this notebook, we introduce [PyTorch](https://pytorch.org/), an open-source framework that implements machine learning methods, particularly deep neural networks, with a focus on computational efficiency. We do not have to deal much with the details of how it does this. Most importantly, PyTorch efficiently implements backpropagation for training neural networks, including on the GPU.

To start, we will re-implement what we did in the last notebook: a neural network to predict the petal width of the flowers in the Iris dataset (the fourth feature column, which the code below uses as the label). In the last notebook, we trained a manually-coded neural network for this, but this time, we'll use PyTorch instead.

Let's load the Iris dataset again.

In [2]:
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
data, labels = iris.data[:,0:3], iris.data[:,3]

First we need to shuffle and pre-process the data. Pre-processing in this case means normalizing the data and converting it to a properly-shaped NumPy array of type `float32`.

In [3]:
num_samples = len(labels)  # size of our dataset
shuffle_order = np.random.permutation(num_samples)
data = data[shuffle_order, :]
labels = labels[shuffle_order]

# normalize data and labels to between 0 and 1 and make sure it's float32
data = data / np.amax(data, axis=0)
data = data.astype('float32')
labels = labels / np.amax(labels, axis=0)
labels = labels.astype('float32')

# print out the data
print("shape of X", data.shape)
print("first 5 rows of X\n", data[0:5, :])
print("first 5 labels\n", labels[0:5])

shape of X (150, 3)
first 5 rows of X
 [[0.62025315 0.5681818  0.65217394]
 [0.75949365 0.77272725 0.65217394]
 [0.721519   0.8636364  0.24637681]
 [0.7721519  0.59090906 0.8115942 ]
 [0.6329114  0.8181818  0.20289855]]
first 5 labels
 [0.68 0.64 0.12 0.56 0.08]


### Overfitting and validation

In our previous guides, we always evaluated the performance of the network on the same data that we trained it on. This is misleading: the network could learn to "cheat" by memorizing the training data so as to get a high score, and then fail to generalize to examples it has not seen.

In machine learning, this is called "overfitting," and there are several things we can do to guard against it. The first is to split our dataset into a "training set," which we train on with gradient descent, and a "test set," which is hidden from the training process. A final evaluation on the test set gives a more honest measure of accuracy: how well the network predicts samples it has never seen.

Let's split the data into a training set and a test set. We'll keep the first 30% of the dataset to use as a test set, and use the rest for training.

In [4]:
import torch  # needed here for torch.from_numpy

# let's rename the data and labels to X, y
X, y = data, labels

test_split = 0.3  # fraction of the dataset held out for testing

n_test = int(test_split * num_samples)

x_train, x_test = X[n_test:, :], X[:n_test, :]
x_train = torch.from_numpy(x_train)
x_test = torch.from_numpy(x_test)
y_train, y_test = y[n_test:], y[:n_test] 
y_train = torch.from_numpy(y_train)
y_test = torch.from_numpy(y_test)
print('%d training samples, %d test samples' % (x_train.shape[0], x_test.shape[0]))

105 training samples, 45 test samples


## Creating the Model

In PyTorch, we define a neural network model by writing a class that inherits from `nn.Module`, which gives our class `Net` all the functionality of `nn.Module`. We set the network up in its `__init__()` method. Note, however, that `__init__()` must first call the constructor of the class we are inheriting from (i.e. `nn.Module`). This alone creates an empty neural network, so to populate it, we add the layers we want. In this case, we add a linear layer, which is "fully-connected," meaning each of its neurons is connected to every neuron in the previous layer. This may seem confusing at first because we have not yet seen neural network layers which are not fully-connected; we will see this in the next chapter when we introduce convolutional networks.

Next comes the `forward` method, which defines how an input passes through the network. It calls the layers defined above in order and applies an activation function where we want one; here, a sigmoid is applied to the output of the hidden layer.

Finally, we will add the output layer, which will be a fully-connected layer whose size is 1 neuron. This neuron will contain our final output.

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(3, 8)  # hidden layer: 3 input features -> 8 neurons
        self.fc2 = nn.Linear(8, 1)  # output layer: 8 hidden neurons -> 1 output
    
    def forward(self, x):
        x = F.sigmoid(self.fc1(x))  # sigmoid activation on the hidden layer
        x = self.fc2(x)
        return x

That may be a lot to take in, but it is worth understanding the excerpt above fully: this structure will be used time and time again to build increasingly complex neural networks.
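
To make "fully-connected" concrete, here is a minimal sketch of what a linear layer computes. The batch `x` is made up for illustration and the layer is a fresh, untrained one with the same shape as `fc1`; the point is only that `nn.Linear` amounts to a matrix multiplication plus a bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random weights are repeatable

fc1 = nn.Linear(3, 8)   # same shape as the hidden layer in Net
x = torch.rand(5, 3)    # a made-up batch of 5 samples with 3 features each

# nn.Linear stores its weight with shape (out_features, in_features),
# so the layer computes x @ W^T + b
manual = x @ fc1.weight.T + fc1.bias
hidden = torch.sigmoid(manual)  # the activation squashes each value into (0, 1)

print(manual.shape)  # torch.Size([5, 8])
print(torch.allclose(manual, fc1(x), atol=1e-6))
```

Every one of the 8 outputs depends on all 3 inputs, which is exactly what the 8x3 weight matrix expresses.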

Next we instantiate a new object based on the class.

In [6]:
net = Net()

We can also get a readout of the current state of the network using `print(net)`:

In [7]:
print(net)

Net(
  (fc1): Linear(in_features=3, out_features=8, bias=True)
  (fc2): Linear(in_features=8, out_features=1, bias=True)
)


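The first layer contributes 32 parameters: 3x8 weights between the input and hidden layers, and 8 biases in the hidden layer. The second layer adds 9 more: 8x1 weights between the hidden and output layers, and 1 bias in the output. So we have 41 parameters in total.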
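
The count can be checked in code. The sketch below uses a stand-in `nn.Sequential` model with the same layer sizes as `Net` (an assumption made so the snippet runs on its own); the same loop works on `net`.

```python
import torch.nn as nn

# Stand-in with the same layer sizes as Net: 3 inputs -> 8 hidden -> 1 output
model = nn.Sequential(nn.Linear(3, 8), nn.Sigmoid(), nn.Linear(8, 1))

# list each parameter tensor with its shape and number of entries
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.numel())

total = sum(p.numel() for p in model.parameters())
print("total parameters:", total)  # 24 + 8 + 8 + 1 = 41
```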

Now we are finished specifying the architecture of the model. Next we need to specify our loss function and optimizer. Let's discuss each of these things.

First, we specify the loss. The standard for regression, as we said before, is sum-squared error (SSE) or mean-squared error (MSE). SSE and MSE are basically the same, since the only difference between them is a scaling factor ($\frac{1}{n}$) which doesn't depend on the weights, so both are minimized by the same weights.
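
As a quick illustration of that scaling factor, `nn.MSELoss` accepts a `reduction` argument that switches between the mean and the sum of the squared errors. The predictions and targets below are made-up numbers, not Iris data.

```python
import torch
import torch.nn as nn

# Made-up predictions and targets, for illustration only
pred = torch.tensor([0.5, 0.2, 0.9, 0.4])
target = torch.tensor([0.6, 0.1, 0.7, 0.4])

mse = nn.MSELoss(reduction='mean')(pred, target)  # the default reduction
sse = nn.MSELoss(reduction='sum')(pred, target)

n = pred.numel()
print(mse.item(), sse.item() / n)  # the two values agree up to rounding
```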

The optimizer is the flavor of gradient descent we want. The most basic optimizer is "stochastic gradient descent," or SGD. In earlier notebooks we mostly used batch gradient descent, which means we compute our gradient over the entire dataset. For reasons which will be more clear when we cover learning algorithms in more detail, this is not usually favored, and we instead calculate the gradient over random subsets of the training data, called mini-batches. The training loop below takes this to the extreme and updates the weights after every single sample.
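
This notebook does not use mini-batches, but as a rough sketch of how they are usually produced in PyTorch, `DataLoader` can shuffle a dataset and hand it out in chunks. The random tensors and the batch size of 16 are assumptions chosen only for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-in data with the same shapes as the training set (105 x 3)
x = torch.rand(105, 3)
y = torch.rand(105)

# batch_size=16 is an arbitrary choice for illustration
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

# one loop over the loader is one epoch; each step yields one mini-batch
batch_sizes = [len(xb) for xb, yb in loader]
print(batch_sizes)  # six batches of 16 and a final batch of 9
```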

Once we've specified our loss function and optimizer, we have everything we need to train; no separate compile step is required here.

In [8]:
from torch import optim
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.MSELoss()

We are finally ready to train. First we must zero our gradients, because PyTorch accumulates gradients across calls to `backward()` by default; without this step, each update would also include the gradients from earlier samples. (This accumulation is occasionally useful, for example when gradients from several passes should be summed before a single update.)
Next we run a forward pass through our neural network. The loss function is then applied to measure the error. After that, we run a backward pass to compute the gradients, and the optimizer uses them to update the weights.

The average loss for each epoch is also printed.
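
To see why the gradients need zeroing, here is a toy sketch that is separate from the network: a two-element "weight" tensor `w` and a made-up input `x`. Calling `backward()` twice without clearing adds the second gradient to the first.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # toy "weights"
x = torch.tensor([3.0, 4.0])

loss = (w * x).sum()
loss.backward()
first = w.grad.clone()         # gradient of sum(w * x) with respect to w is x

loss = (w * x).sum()           # rebuild the graph and backpropagate again
loss.backward()
accumulated = w.grad.clone()   # the new gradient was added to the old one

w.grad.zero_()                 # clear the stored gradient, the job zero_grad() does
loss = (w * x).sum()
loss.backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

Here `accumulated` is twice `first`, while `fresh` matches `first` again once the stored gradient has been cleared.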

In [9]:

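def fullPass(data, labels):
    """Run one pass over the data, updating the weights after every sample."""
    n = data.size(0)
    running_loss = 0.0
    for i in range(n):
        optimizer.zero_grad()     # clear gradients from the previous sample
        outputs = net(data[i])    # forward pass, output shape (1,)
        # reshape the scalar label to (1,) so it matches the output shape
        loss = criterion(outputs, labels[i].reshape(1))
        loss.backward()           # backward pass: compute the gradients
        optimizer.step()          # update the weights

        # accumulate a plain float so the autograd graph is not kept alive
        running_loss += loss.item()

    # print the average loss (epoch is read from the training loop below)
    print('[%d, %5d] loss: %.3f' % (epoch + 1, n, running_loss / n))

net.train()
for epoch in range(400):
    fullPass(x_train, y_train)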


print('Finished Training')




[1,   105] loss: 0.105
[2,   105] loss: 0.088
[3,   105] loss: 0.086
[4,   105] loss: 0.085
[5,   105] loss: 0.084
[6,   105] loss: 0.082
[7,   105] loss: 0.080
[8,   105] loss: 0.079
[9,   105] loss: 0.077
[10,   105] loss: 0.076
[11,   105] loss: 0.074
[12,   105] loss: 0.072
[13,   105] loss: 0.070
[14,   105] loss: 0.068
[15,   105] loss: 0.067
[16,   105] loss: 0.065
[17,   105] loss: 0.063
[18,   105] loss: 0.061
[19,   105] loss: 0.059
[20,   105] loss: 0.057
[21,   105] loss: 0.054
[22,   105] loss: 0.052
[23,   105] loss: 0.050
[24,   105] loss: 0.048
[25,   105] loss: 0.046
[26,   105] loss: 0.044
[27,   105] loss: 0.042
[28,   105] loss: 0.040
[29,   105] loss: 0.038
[30,   105] loss: 0.036
[31,   105] loss: 0.034
[32,   105] loss: 0.033
[33,   105] loss: 0.031
[34,   105] loss: 0.029
[35,   105] loss: 0.028
[36,   105] loss: 0.026
[37,   105] loss: 0.025
[38,   105] loss: 0.023
[39,   105] loss: 0.022
[40,   105] loss: 0.021
[41,   105] loss: 0.020
[42,   105] loss: 0.019
...

As you can see above, the training loss falls steadily from epoch to epoch, and the final test-set MSE, measured below, comes in under 0.01. Note that this loop reports only the training loss; the loss on the test set is measured separately. It's normal for the training loss to be lower than the test loss, since the network's objective is to predict the training data well. But if the training loss is much lower than the test loss, it means we are overfitting and should not expect very good results on unseen data.

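We can evaluate the test set one last time at the end. Calling `net.eval()` switches the network to evaluation mode, which matters for layers such as dropout and batch normalization (this network has neither). Keep in mind that `fullPass` still calls `optimizer.step()`, so this pass also updates the weights on the test samples as it goes.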

In [15]:
net.eval()
fullPass(x_test, y_test)

[400,    45] loss: 0.007
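
A stricter evaluation computes the loss without taking any optimizer step. The sketch below is one way to do that, using an untrained stand-in model and random stand-in data (both assumptions, so that the snippet runs on its own); with the notebook's `net`, `x_test` and `y_test` the pattern is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the trained network and the test set
model = nn.Sequential(nn.Linear(3, 8), nn.Sigmoid(), nn.Linear(8, 1))
x_eval = torch.rand(45, 3)
y_eval = torch.rand(45)
criterion = nn.MSELoss()

before = [p.clone() for p in model.parameters()]  # snapshot of the weights

model.eval()
with torch.no_grad():                  # no gradient tracking during evaluation
    pred = model(x_eval).squeeze(1)    # (45, 1) -> (45,) to match the targets
    loss = criterion(pred, y_eval)

print("evaluation loss: %.4f" % loss.item())

# The weights are untouched because no optimizer step was taken
unchanged = all(torch.equal(a, b) for a, b in zip(before, model.parameters()))
print(unchanged)
```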


In [16]:
with torch.no_grad():  # inference only, so gradients are not needed
    y_pred = net(x_test)

We can manually calculate MSE as a sanity check:

In [17]:
def MSE(y_pred, y_test):
    # y_pred has shape (n, 1); flatten it to (n,) to match y_test
    return torch.mean((y_pred.reshape(-1) - y_test) ** 2).item()

print("MSE is %0.4f" % MSE(y_pred, y_test))

MSE is 0.0062


We can also predict the value of a single unknown example, or a set of them, in the following way:

In [14]:
x_sample = x_test[0].reshape(1, 3)   # shape must be (num_samples, 3), even if num_samples = 1
y_prob = net(x_sample)

print("predicted %0.3f, actual %0.3f" % (y_prob[0][0], y_test[0]))

predicted 0.585, actual 0.680


We've now finished introducing PyTorch for regression. Note it is a far more powerful way of training neural networks than our own. PyTorch's strengths will become even more apparent when we introduce classification in the next lesson, as well as introduce convolutional networks and various other optimization tricks it enables for us.