# Ingredients of all machine learning

1. The model
1. The cost function
1. The optimizer
1. Training loop
1. Evaluation

In [1]:
import math
import numpy as np
import pandas as pd
import torch
import torch.utils.data

## The model

We're building a linear regression model. It takes a feature vector $x$, takes its dot product with a weight vector $W$ to get a single value, and adds a bias term $b$ to obtain the predicted value: $\hat{y} = W x + b$.

In [2]:
class LinearRegressor(torch.nn.Module):

    def __init__(self, num_features, bias=True):
        super().__init__()
        self.model = torch.nn.Linear(num_features, 1, bias=bias)
        # Linear implicitly defines its own variables - weights W and a bias term b
        
    def forward(self, x):
        return self.model(x)  # this computes W*x + b

In [3]:
# initialize the linear regression model
# we'll use the Boston Housing dataset, which has 13 features
num_features = 13
model = LinearRegressor(num_features)

In [4]:
model

LinearRegressor(
  (model): Linear(in_features=13, out_features=1, bias=True)
)

Constructing the model also randomly initializes the weights and bias term:

In [5]:
list(model.parameters())

[Parameter containing:
 tensor([[-0.2593,  0.2445,  0.2290, -0.0083, -0.2726, -0.2167,  0.1122, -0.0568,
          -0.1211, -0.0373,  0.1343, -0.0145, -0.1811]], requires_grad=True),
 Parameter containing:
 tensor([0.1605], requires_grad=True)]
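
As a quick check on what `torch.nn.Linear` computes, the sketch below reproduces its output by hand as `x @ W.T + b`. The batch of four random feature vectors and the seed are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the sketch is reproducible

num_features = 13
linear = torch.nn.Linear(num_features, 1)

# A made-up batch of 4 feature vectors
x = torch.rand(4, num_features)

# Linear stores W with shape (out_features, in_features)
# and b with shape (out_features,)
W, b = linear.weight, linear.bias

with torch.no_grad():
    manual = x @ W.T + b   # shape (4, 1)
    builtin = linear(x)    # shape (4, 1)

print(W.shape, b.shape)
print(torch.allclose(manual, builtin, atol=1e-6))
```

Note that the weight is stored as a `(1, 13)` matrix rather than a flat vector, which is why the parameter printed above has two pairs of brackets.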

## The cost function

This is usually called the "criterion" in PyTorch contexts. For linear regression, this is the mean squared error. For many other applications, such as logistic regression, this is usually the cross-entropy loss.

In [6]:
criterion = torch.nn.MSELoss()
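
To make the criterion concrete, here is a small sketch that computes the mean squared error by hand and compares it with `torch.nn.MSELoss`. The prediction and target values are made up for illustration.

```python
import torch

criterion = torch.nn.MSELoss()

# Made-up predictions and targets, shaped (batch, 1) like the model output
y_pred = torch.tensor([[2.0], [4.0], [6.0]])
y_true = torch.tensor([[1.0], [4.0], [8.0]])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
manual_mse = ((y_pred - y_true) ** 2).mean()
builtin_mse = criterion(y_pred, y_true)

print(manual_mse.item(), builtin_mse.item())  # both are 5/3
```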

## The optimizer

The most basic optimizer is Stochastic Gradient Descent. However, a good default to choose is generally the Adam optimizer, so that's what we'll do:

In [7]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
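
For intuition about what an optimizer step does, the sketch below checks plain SGD against its update rule, `w <- w - lr * grad`. The toy parameter, loss, and learning rate are assumptions for illustration; Adam builds on the same idea but rescales each step using running averages of the gradients.

```python
import torch

# A toy parameter vector; the values and learning rate are arbitrary
w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # the gradient of this loss is 2 * w
loss.backward()

# What the SGD rule predicts: [1 - 0.1 * 2, -2 - 0.1 * (-4)] = [0.8, -1.6]
expected = w.detach() - 0.1 * w.grad

opt.step()
print(w.detach(), expected)
```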

## The training loop

In the training loop, we have to do the following:

1. Zero out the gradients
1. Take in the x's and y's, moving them to the GPU if necessary
1. Run the model forward
1. Calculate the loss
1. Back-propagate the loss to calculate the gradients with respect to the model parameters
1. Update the parameters of the model

The following lines of code show how these work.

In [8]:
# zero out the gradients
optimizer.zero_grad()
optimizer

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)

In [9]:
# Run the model forward on some random data

inputs = torch.rand(1, 13)
target = torch.rand(1, 1)

output = model(inputs)

print("Inputs: ", inputs)
print("Target: ", target)
print("Output: ", output)

Inputs:  tensor([[0.7741, 0.2144, 0.7263, 0.4072, 0.3888, 0.0608, 0.1825, 0.9052, 0.0596,
         0.5792, 0.7549, 0.3875, 0.9979]])
Target:  tensor([[0.2877]])
Output:  tensor([[-0.0888]], grad_fn=<AddmmBackward>)


In [10]:
# Calculate the loss

loss = criterion(output, target)
loss

tensor(0.1418, grad_fn=<MseLossBackward>)

In [11]:
# Back-propagate the loss
loss.backward()

In [12]:
# update the parameters
optimizer.step()

We repeat this for all the examples in the training set, and cycle through the training set some number of times. Each cycle through the training set is called an *epoch*.
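
The reason the loop starts by zeroing the gradients is that `backward()` adds to any gradient already stored on a parameter. A minimal sketch with a single toy parameter (not the model above) shows the accumulation:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Zero the gradient, then run backward again: back to 3
w.grad.zero_()
(3 * w).sum().backward()
after_zero = w.grad.clone()

print(first, accumulated, after_zero)
```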

### Obtaining the training data

Before we actually do our training, we'll need real data instead of some randomized data! A standard pattern is to build `Datasets` and then `DataLoaders`.

In [13]:
df = pd.read_csv("data/01_boston_housing_dataset.csv")

In [14]:
df.head()

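|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |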


The first 13 columns are our features and the last column, `medv`, is the dependent (or target) variable that we are trying to predict: the median value of houses in this area.

We start by defining a `Dataset` that will yield `(x, y)` pairs that the training loop can read in.

In [15]:
df[df.columns[:num_features]].astype(np.float32).to_numpy()[0]

array([6.320e-03, 1.800e+01, 2.310e+00, 0.000e+00, 5.380e-01, 6.575e+00,
       6.520e+01, 4.090e+00, 1.000e+00, 2.960e+02, 1.530e+01, 3.969e+02,
       4.980e+00], dtype=float32)

In [16]:
class BostonHousingDataset(torch.utils.data.Dataset):
    def __init__(self, df):

        # these two are optional
        self.feature_names = list(df.columns[:num_features])
        self.target_name = df.columns[num_features]

        # these two are not, but you can name them anything you like
        # as long as __getitem__ correctly returns the i'th (X, y) pair
        # and __len__ returns the size of the dataset.
        self.X = df[df.columns[:num_features]].astype(np.float32).to_numpy()
        self.y = df['medv'].astype(np.float32).to_numpy()

    def __getitem__(self, index):
        X = self.X[index]
        y = self.y[index]

        return X, np.array([y])

    def __len__(self):
        return len(self.y)

In [17]:
train_ratio = 0.7
num_train_examples = math.floor(len(df) * train_ratio)

train_ds = BostonHousingDataset(df[:num_train_examples])
valid_ds = BostonHousingDataset(df[num_train_examples:])

train_ds[0]

(array([6.320e-03, 1.800e+01, 2.310e+00, 0.000e+00, 5.380e-01, 6.575e+00,
        6.520e+01, 4.090e+00, 1.000e+00, 2.960e+02, 1.530e+01, 3.969e+02,
        4.980e+00], dtype=float32), array([24.], dtype=float32))

In [18]:
print(len(train_ds))
print(len(valid_ds))

354
152
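
The split above takes the first 70% of the rows in file order. If the rows are not randomly ordered, a shuffled split may be more representative. The sketch below uses `torch.utils.data.random_split` on a toy `TensorDataset` with made-up values standing in for the housing data; the seed is arbitrary.

```python
import math

import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the housing data: 506 rows, 13 features (random values)
X = torch.rand(506, 13)
y = torch.rand(506, 1)
full_ds = TensorDataset(X, y)

num_train = math.floor(len(full_ds) * 0.7)
num_valid = len(full_ds) - num_train

# A seeded generator makes the shuffled split reproducible
generator = torch.Generator().manual_seed(0)
train_subset, valid_subset = random_split(
    full_ds, [num_train, num_valid], generator=generator
)

print(len(train_subset), len(valid_subset))  # 354 152
```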


To check that we've set the dataset up correctly, let's try running the model forward and getting a prediction from the dataset.

In [19]:
with torch.no_grad():
    print(model(torch.from_numpy(train_ds[0][0])))

tensor([-4.3728])


In [20]:
# this is the actual y:
train_ds[0][1]

array([24.], dtype=float32)

Note that since we haven't trained anything yet, this is still using the randomly initialized set of weights `W` and bias `b`.

Here's an example of running the model forward on multiple rows in the dataset:

In [21]:
with torch.no_grad():
    print(model(torch.from_numpy(train_ds[:3][0])))

tensor([[-4.3728],
        [-4.7175],
        [-5.9226]])


In [22]:
# this is the actual y vector:
train_ds[:3][1]

array([[24. , 21.6, 34.7]], dtype=float32)
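
Notice that the sliced target has shape `(1, 3)` while the model output for the same rows has shape `(3, 1)`. This comes from wrapping the sliced `y` in `np.array([...])`. It does not affect training, because a `DataLoader` indexes one example at a time, but it matters if you slice the dataset yourself: `MSELoss` would broadcast the two shapes against each other. A small NumPy sketch of the difference, using the three target values above:

```python
import numpy as np

y = np.array([24.0, 21.6, 34.7], dtype=np.float32)

# What __getitem__ returns for a slice: one row with three columns
wrapped = np.array([y[:3]])

# A column vector, one target per row, matching the model output shape
column = y[:3].reshape(-1, 1)

# What __getitem__ returns for a single integer index
single = np.array([y[0]])

print(wrapped.shape, column.shape, single.shape)  # (1, 3) (3, 1) (1,)
```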

Now that we have our dataset set up, we'll go further and create a couple of `DataLoader`s, one for training and one for validation.

In [23]:
train = torch.utils.data.DataLoader(train_ds,
                                    batch_size=10, shuffle=True)
valid = torch.utils.data.DataLoader(valid_ds,
                                    batch_size=10)

In [24]:
# Number of batches in training and validation:
print("Number of training batches:", len(train))
print("Number of validation batches:", len(valid))

Number of training batches: 36
Number of validation batches: 16
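
The batch counts follow from rounding up the dataset size divided by the batch size, so the last batch is smaller than the rest. The sketch below checks this on a toy `TensorDataset` of random values with the same number of rows as the training set.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data with the same size as the training set: 354 rows, 13 features
ds = TensorDataset(torch.rand(354, 13), torch.rand(354, 1))
loader = DataLoader(ds, batch_size=10, shuffle=True)

# Each batch is an (X, y) pair of stacked examples
X_batch, y_batch = next(iter(loader))
print(X_batch.shape, y_batch.shape)  # torch.Size([10, 13]) torch.Size([10, 1])

# The number of batches is ceil(354 / 10) = 36; the last one has 4 examples
sizes = [len(xb) for xb, _ in loader]
print(len(loader), math.ceil(354 / 10), sizes[-1])
```

Passing `drop_last=True` to the `DataLoader` would discard the short final batch instead.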


### Now the actual training loop

In [25]:
NUM_EPOCHS = 1000

# if you have a GPU, it's probably better to do this there
# you'll need to send both your model and the X/y data to the GPU (see below)
# otherwise, this code will execute on CPU
if torch.cuda.is_available():
    model.to('cuda')

model.train()
for i in range(NUM_EPOCHS):
    epoch_losses = []

    for X_batch, y_batch in train:
        optimizer.zero_grad()
        
        if torch.cuda.is_available():
            X_batch = X_batch.to('cuda')
            y_batch = y_batch.to('cuda')
        
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch)
        
        loss.backward()
        optimizer.step()
        
        epoch_losses.append(loss.item())

    epoch_loss = np.sqrt(np.mean(epoch_losses))
    if (i + 1) % 10 == 0:
        print(f'Epoch: {i+1}, RMSE: {epoch_loss:.3f}')

Epoch: 10, RMSE: 8.402
Epoch: 20, RMSE: 7.493
Epoch: 30, RMSE: 6.980
Epoch: 40, RMSE: 6.653
Epoch: 50, RMSE: 6.603
Epoch: 60, RMSE: 6.405
Epoch: 70, RMSE: 6.180
Epoch: 80, RMSE: 6.180
Epoch: 90, RMSE: 6.044
Epoch: 100, RMSE: 5.886
Epoch: 110, RMSE: 5.900
Epoch: 120, RMSE: 5.812
Epoch: 130, RMSE: 5.746
Epoch: 140, RMSE: 5.617
Epoch: 150, RMSE: 5.641
Epoch: 160, RMSE: 5.504
Epoch: 170, RMSE: 5.590
Epoch: 180, RMSE: 5.557
Epoch: 190, RMSE: 5.427
Epoch: 200, RMSE: 5.325
Epoch: 210, RMSE: 5.311
Epoch: 220, RMSE: 5.298
Epoch: 230, RMSE: 5.229
Epoch: 240, RMSE: 5.270
Epoch: 250, RMSE: 5.148
Epoch: 260, RMSE: 5.163
Epoch: 270, RMSE: 5.067
Epoch: 280, RMSE: 5.028
Epoch: 290, RMSE: 4.974
Epoch: 300, RMSE: 4.955
Epoch: 310, RMSE: 4.934
Epoch: 320, RMSE: 4.849
Epoch: 330, RMSE: 4.944
Epoch: 340, RMSE: 4.844
Epoch: 350, RMSE: 4.862
Epoch: 360, RMSE: 4.855
Epoch: 370, RMSE: 4.722
Epoch: 380, RMSE: 4.735
Epoch: 390, RMSE: 4.660
Epoch: 400, RMSE: 4.693
Epoch: 410, RMSE: 4.600
Epoch: 420, RMSE: 4.649
...

Note that this is almost certainly over-fitting to the training data.
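
One detail about the number printed above: it is the square root of the mean of the per-batch MSEs. Because the last batch has 4 examples rather than 10, this only approximates the RMSE over the whole training set. The sketch below shows one way to compute the exact value, by summing squared errors with `MSELoss(reduction="sum")` and dividing by the number of examples. The data and model are random stand-ins, not the trained model above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # arbitrary seed, for reproducibility of the sketch

# Made-up data with the same shape as the training set (354 rows, 13 features)
X = torch.rand(354, 13)
y = torch.rand(354, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=10)

model = torch.nn.Linear(13, 1)
sum_squared_error = torch.nn.MSELoss(reduction="sum")

total_sq_err, num_examples = 0.0, 0
with torch.no_grad():
    for X_batch, y_batch in loader:
        # Accumulate the summed squared error and the example count
        total_sq_err += sum_squared_error(model(X_batch), y_batch).item()
        num_examples += y_batch.numel()

    exact_rmse = (total_sq_err / num_examples) ** 0.5
    # Reference value: RMSE over the whole dataset in a single pass
    reference_rmse = torch.nn.functional.mse_loss(model(X), y).sqrt().item()

print(f"{exact_rmse:.4f} {reference_rmse:.4f}")
```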

## Evaluation

In the case of linear regression, the criterion we optimize for -- mean squared error -- is usually the same as the metric that we actually use to evaluate the system. As we move to logistic regression-like tasks, these will diverge (for example, optimizing cross-entropy while reporting accuracy).

Because our loss criterion and our evaluation metric are the same, the following should seem like old hat. Nevertheless, let's do it:

In [26]:
# if it's on the GPU, bring the model back to the CPU
model.to('cpu')

# set our model to evaluation mode
# for this particular model, this won't actually change anything.
# in other models where the training and eval loop operate differently
# (for example, if you're using dropout), this is absolutely necessary!
model.eval()

LinearRegressor(
  (model): Linear(in_features=13, out_features=1, bias=True)
)
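
To see why the mode matters for other models, here is a small sketch with a standalone `torch.nn.Dropout` layer (not part of the model above). In training mode it zeroes some inputs and rescales the rest; in evaluation mode it passes the input through unchanged. The drop probability and seed are arbitrary.

```python
import torch

torch.manual_seed(0)  # arbitrary seed

dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: roughly half the entries are zeroed,
# and the survivors are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is a no-op
dropout.eval()
eval_out = dropout(x)

print(train_out.unique())
print(torch.equal(eval_out, x))
```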

In [27]:
with torch.no_grad():  # disable gradient tracking; we're only evaluating
    mses = []
    for X_batch, y_batch in valid:
        y_pred = model(X_batch)
        mse = criterion(y_pred, y_batch)
        mses.append(mse.item())  # store a plain float, as in the training loop
    print("RMSE: ", np.sqrt(np.mean(mses)))

RMSE:  20.990253


This validation RMSE (about 21.0) is far higher than the training RMSE (about 4.6 in the last epochs shown above), suggesting that we've overfit to the training set. There are some ways to fix this:

* Adding regularization
* Early stopping

We'll implement these in future notebooks.
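
As a rough preview, and not necessarily how the later notebooks do it, the sketch below shows one common form of each idea. Regularization is added through the `weight_decay` argument of `torch.optim.Adam`, which applies an L2 penalty on the parameters. Early stopping is shown as a hypothetical helper, `early_stopping_epoch`, that scans a list of per-epoch validation losses. The decay value, patience, and loss history are all made up.

```python
import torch

model = torch.nn.Linear(13, 1)

# L2-style regularization via weight_decay (1e-4 is an arbitrary value)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)


def early_stopping_epoch(valid_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improving at first, then getting worse
history = [9.0, 7.5, 7.0, 7.2, 7.1, 7.3, 7.4]
print(early_stopping_epoch(history))  # 6
```

In a real loop you would compute the validation loss at the end of each epoch and break out of training when the stopping condition is met, typically keeping the parameters from the best epoch.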

## Hyperparameters you can modify

The code above sets some hyperparameters that you can try playing with:

* Model:
  * Initialization strategy, e.g. Xavier
* Optimizer:
  * Which optimizer to use
  * Parameters of the optimizer, such as the learning rate
* Training loop:
  * Batch size
  * Number of epochs
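
As one example of changing the initialization strategy, the sketch below applies Xavier (Glorot) uniform initialization to a standalone `torch.nn.Linear` layer using `torch.nn.init.xavier_uniform_`. Zeroing the bias is an extra choice made here for illustration, not something the code above does.

```python
import torch

torch.manual_seed(0)  # arbitrary seed

linear = torch.nn.Linear(13, 1)

# Re-initialize the weights in place; zero the bias (an illustrative choice)
torch.nn.init.xavier_uniform_(linear.weight)
torch.nn.init.zeros_(linear.bias)

# Xavier uniform samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
bound = (6 / (13 + 1)) ** 0.5
print(linear.weight.abs().max().item() <= bound)
print(linear.bias)
```

For the `LinearRegressor` defined earlier, the same calls would be applied to `model.model.weight` and `model.model.bias`.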