**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook, using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from class_utils import error_histogram
import torch.nn as nn
import torch

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/p5q7gzupa2ndw55/sigmoid_regression_data.csv?dl=1", directory="data")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Neural Network Based Regression

In this notebook we are going to show how a simple neural net created using `PyTorch` can be applied to a regression problem. We are going to construct a very simple multi-layer perceptron, train it and visualize the results.

### The Dataset

Let's start by defining our regression problem. We will load our dataset from a CSV file – the data consists of noisy samples drawn from a sigmoid (logistic) curve. Given that we have encountered such data in prior notebooks, we are not going to go over the procedure of loading and preprocessing it and so the code of the following cell is hidden.



In [None]:
#@title -- Data Loading and Preprocessing; X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
df = pd.read_csv("data/sigmoid_regression_data.csv")

# we create a discretized version of the y column
# to allow for stratification
kbins = KBinsDiscretizer(6, encode='ordinal')
y_stratify = kbins.fit_transform(df[['y']])

# we split the dataset into train and test
df_train, df_test = train_test_split(df, stratify=y_stratify,
                                 test_size=0.3, random_state=4)

# we specify the inputs and the outputs
categorical_inputs = []
numeric_inputs = ['x']
output = ['y']

# we create the pipeline
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy='constant', fill_value='MISSING'),
        OneHotEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

# we fit and apply the pipeline on the train set
X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = df_train[output].values

# we apply the same pipeline to the test set,
# taking care to use transform and not fit_transform
X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = df_test[output].values

# we plot the data for visual inspection
plt.scatter(X_train, Y_train, marker='x', label="training data")
plt.scatter(X_test, Y_test, c='r', label="testing data")
plt.xlabel('x')
plt.ylabel('y')
plt.grid(ls='--')
plt.legend()
plt.savefig("output/regression_data.pdf", bbox_inches='tight', pad_inches=0)

In addition to our standard preprocessing, we will also transform the results into datatypes expected by PyTorch, i.e. into PyTorch tensors (similar to `numpy` arrays, but with autodiff support) of 32-bit floats.



In [None]:
X_train = torch.as_tensor(X_train, dtype=torch.float32)
Y_train = torch.as_tensor(Y_train, dtype=torch.float32)
X_test = torch.as_tensor(X_test, dtype=torch.float32)
Y_test = torch.as_tensor(Y_test, dtype=torch.float32)

### Selecting a Device and Transferring our Data

Our neural net can run on several different kinds of devices. By default, everything runs on the processor (CPU), but PyTorch also supports certain kinds of graphics cards (GPUs), which can speed up the computations very significantly. There are also other special devices such as TPUs, FPGAs, etc., but to run your models on those, you will generally need some kind of extension to PyTorch.

Let us now specify what kind of device we want to use: let's say that we want to use a GPU if one is available, and the CPU if not. We can check for GPU availability using `torch.cuda.is_available`. Note that on a multi-GPU computer, you can also select which specific GPU or GPUs you want to use, but that is beyond the scope of this notebook.

Here we are merely going to select `"cuda"` (the GPU, so named after the CUDA framework from Nvidia) if `torch.cuda.is_available()` is true and `"cpu"` otherwise.



In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

When using a certain device to run our model, we also need to transfer our data to that device's memory. This is easily done using the `.to(device)` method provided by PyTorch tensors. To transfer our data to the selected device, we can run:



In [None]:
X_train = X_train.to(device)
Y_train = Y_train.to(device)
X_test = X_test.to(device)
Y_test = Y_test.to(device)

### Creation of the Neural Network and Training

In order to create our neural net, we will inherit from the base class `nn.Module`. All layers with learnable parameters are created in the constructor and assigned as attributes to our network. The way in which the layers connect to each other and compute the output from the input is defined in the `forward` method. A neural net must have a certain fixed number of input and output neurons. The number of inputs will, of course, equal the number of columns in our `X_train`, while the number of outputs will equal the number of columns in our `Y_train`.

You'll recall that in neural networks built for regression, we usually **leave the last layer linear** (without an activation function) so that it can produce unbounded outputs and does not have to learn to invert the effect of a non-linear activation function when its shape is not a good match for the regression task.



In [None]:
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.fc1 = nn.Linear(num_inputs, 10)
        self.fc2 = nn.Linear(10, 10)
        self.fc3 = nn.Linear(10, num_outputs)

    def forward(self, x):
        y = self.fc1(x)
        y = torch.relu(y)
        
        y = self.fc2(y)
        y = torch.relu(y)
        
        y = self.fc3(y)        
        return y

Now we are ready to construct our model. Note that the model also needs to be transferred to our device of choice, which is done in exactly the same way we employed with the data: by calling `.to(device)`.



In [None]:
num_inputs = X_train.shape[1]
num_outputs = Y_train.shape[1]

model = Net(num_inputs, num_outputs)
model = model.to(device)
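
To make the layer sizes more tangible, here is a minimal sketch that counts the learnable parameters of a network with the same layer sizes as our `Net`. It assumes one input and one output column (which is what we would expect from the single `x` and `y` columns used here); the `layers` stand-in built with `nn.Sequential` is only for illustration and is not used elsewhere in the notebook.

```python
import torch
import torch.nn as nn

# an illustrative stand-in with the same layer sizes as Net,
# assuming 1 input column and 1 output column
layers = nn.Sequential(
    nn.Linear(1, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 1),  # the last layer stays linear
)

# each nn.Linear(n_in, n_out) holds n_in * n_out weights plus n_out biases:
# (1*10 + 10) + (10*10 + 10) + (10*1 + 1) = 141
num_params = sum(p.numel() for p in layers.parameters())
print(num_params)

# a batch of 5 rows with 1 column produces 5 rows with 1 column
out = layers(torch.zeros(5, 1))
print(out.shape)
```

The same `sum(p.numel() for p in model.parameters())` expression should work on our `model` as well.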

#### Running the Network

If we did everything correctly, we should now be able to run our data through the model. Let's try that with the first 5 rows from `X_train`.



In [None]:
y = model(X_train[:5, ...])
y

#### Tensors, Gradients, Detaching

You may have noticed the `grad_fn` in the printout of our tensor. As mentioned before, PyTorch tensors have built-in support for autodiff. When you run operations on them, an on-the-fly computational graph is being built, which can then be backpropagated through.

If you are going to do some further operations with your tensors that are not part of the training process, such as logging loss values, doing plotting, etc., it is a good idea to extract the data and get rid of the computational graph before you do anything else. You can do this using `.detach()`; you will see that the `grad_fn` part will be gone when you display the tensor.



In [None]:
y.detach()
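
To see what the computational graph is actually for, here is a small sketch that is independent of our model. The tensor `w` and the function `z = sum(w^2)` are made up purely for illustration.

```python
import torch

# a leaf tensor for which we ask PyTorch to track gradients
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# every operation is recorded in the graph, so the result carries a grad_fn
z = (w ** 2).sum()
print(z.grad_fn is not None)

# backpropagation fills in w.grad with dz/dw = 2 * w
z.backward()
print(w.grad)

# detaching yields a tensor with the same value but no graph attached
z_detached = z.detach()
print(z_detached.grad_fn, z_detached.requires_grad)
```

After `.detach()`, the value of the tensor is unchanged; only the link to the graph is dropped.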

#### Converting to NumPy

To convert your tensor into a `numpy` array, you can call `.numpy()` on it. Since the tensor can have gradient info attached, it is generally a good idea to call `.detach()` first. Furthermore, the tensor can be on a different device, so to be safe, you'll usually also want to call `.cpu()` to transfer it back to the CPU first.

I.e. this is the fool-proof way of converting from PyTorch tensors to numpy arrays:



In [None]:
y_np = y.detach().cpu().numpy()
y_np

Similarly, if your tensor contains a scalar, you can extract it simply by calling `.item()`:



In [None]:
y_scalar = y.mean()
s = y_scalar.detach().cpu().item()
s

#### Running without Gradients

When you are running your model outside of training, you usually won't need the autodiff support and the computational graph. In such cases, what you want to do is turn the computational graph off, since building it involves some computational overhead. To do this, you can put your PyTorch calls under a `torch.no_grad()` context, e.g.:



In [None]:
with torch.no_grad():
    y = model(X_train[:5, ...])

y

Note how the tensor doesn't have `grad_fn` now even though we didn't run `.detach()` on it. This is because the `torch.no_grad()` context prevented the computation graph from being built in the first place.

#### Train Mode vs. Eval Mode

PyTorch has a number of special layers that behave differently during training than they do during inference. For instance, there is the dropout layer, which – during training – keeps randomly turning off some portion of the layer's neurons to help prevent the network from overfitting. During inference this behaviour is, of course, deactivated since one doesn't want to interfere with the quality of predictions.

To support both these cases, then, PyTorch models have two distinct modes:

* **Training Mode:** When training the model, you put it into training mode by calling `model.train()`;
* **Evaluation Mode:** When running inference, you put it into evaluation mode by calling `model.eval()`.


In [None]:
# during training:
model.train()
y = model(X_train[:5, ...])

# during inference:
model.eval()
with torch.no_grad():
    y = model(X_test[:5, ...])
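
Our `Net` does not contain any such special layers, so the two modes produce the same outputs for it. To see the difference, here is a minimal sketch using a standalone `nn.Dropout` layer; the layer `drop` and the input of ones are made up for illustration.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# training mode: each element is zeroed with probability p and the
# survivors are scaled by 1 / (1 - p) = 2 to keep the expected value
drop.train()
y_train_mode = drop(x)
print(y_train_mode.unique())

# evaluation mode: dropout is a no-op, the input passes through unchanged
drop.eval()
y_eval_mode = drop(x)
print(torch.equal(y_eval_mode, x))
```

Calling `.train()` or `.eval()` on a whole model switches all of its sub-layers at once, which is why we call it on `model` rather than on individual layers.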

### The Training Loop

After this, the only thing that remains is to run the training. In PyTorch, this is relatively verbose: we need to construct a loss function, an optimizer and write the entire training loop from scratch. However, this is an approach that allows a lot of flexibility, which is going to be very useful when constructing and training more complex models.

In later examples, we will show how to train on mini-batches and we may even enhance our training loop with more fancy features such as learning rate scheduling, early stopping, loading data on the fly and augmenting it, etc. For now, however, we are going to keep things simple. Since our data is tiny, we are going to do training in full-batch mode, i.e. feed all our training data into the model at once.

#### Constructing the Optimizer

We are going to use `Adam` as our optimizer of choice. When constructing it, we need to specify:

* what parameters it will be optimizing – we specify `model.parameters()`, i.e. parameters of our model;
* what its learning rate is going to be.

#### Constructing the Loss Function

For the loss function, we are going to go with the **mean squared error**, which is a common choice for regression problems. We can construct it simply using PyTorch's `nn.MSELoss`.
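
As a quick sanity check of what `nn.MSELoss` computes with its default settings, here is a small sketch that compares it against the mean of squared differences computed by hand. The `pred` and `target` values are made up for illustration.

```python
import torch
import torch.nn as nn

# made-up predictions and targets, shaped like our (n_samples, 1) outputs
pred = torch.tensor([[1.0], [2.0], [4.0]])
target = torch.tensor([[1.5], [2.0], [2.0]])

# the built-in loss (default reduction is the mean over all elements)
loss_builtin = nn.MSELoss()(pred, target)

# the same quantity by hand: (0.25 + 0 + 4) / 3
loss_manual = ((pred - target) ** 2).mean()

print(loss_builtin.item(), loss_manual.item())
```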

The rest of the code for the training loop is going to be explained through comments in the following code cell.



In [None]:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_train = [] # we will store the training loss here for plotting

# we are going to train for a number of epochs
for epoch in range(1000):
    # we put the model in training mode
    model.train()

    # we run our data through the model
    y = model(X_train)

    # we measure the loss and record it
    loss = criterion(y, Y_train)
    loss_train.append(loss.detach().item())

    # we clear any gradients that have been
    # computed in the previous iteration
    optimizer.zero_grad()

    # we backpropagate the loss
    loss.backward()

    # we update the weights using the optimizer
    optimizer.step()

    # we print a progress report every now and then
    if epoch % 100 == 0:
        print(f"epoch {epoch}, loss: {np.mean(loss_train[-20:])}")

print(f"epoch {epoch}, loss: {np.mean(loss_train[-20:])}")
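
The reason we clear the gradients in every iteration is that `.backward()` adds to any gradients that are already stored instead of overwriting them. Here is a minimal sketch of that behaviour on a single made-up scalar parameter `w`; it resets the gradient by hand, whereas in the training loop `optimizer.zero_grad()` takes care of this for all the parameters of the model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# first backward pass: d(3w)/dw = 3
(3 * w).backward()
grad_first = w.grad.item()

# second backward pass without clearing: the new gradient is added on top
(3 * w).backward()
grad_accumulated = w.grad.item()

# after resetting the stored gradient, we get the plain value again
w.grad.zero_()
(3 * w).backward()
grad_after_reset = w.grad.item()

print(grad_first, grad_accumulated, grad_after_reset)
```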

### Testing

Now that we have trained our model, we are ready to test its performance. We need to remember to put our model into evaluation mode using `model.eval()` first and to run it inside `torch.no_grad()` to skip building the computational graph.

To evaluate, we are going to compute the MSE, the MAE and display our usual histogram of errors on a standardized scale.

#### On Training Data



In [None]:
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_train_cpu = model(X_train).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_train_cpu, y_train_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_train_cpu, y_train_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_train_cpu, y_train_cpu, Y_fit_scaling=Y_train_cpu)

#### On Testing Data



In [None]:
Y_test_cpu = Y_test.cpu()

model.eval()
with torch.no_grad():
    y_test_cpu = model(X_test).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test_cpu, y_test_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test_cpu, y_test_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
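# the scaling reference is moved to the CPU as well,
# in the same way as in the training-data cell
error_histogram(Y_test_cpu, y_test_cpu, Y_fit_scaling=Y_train.cpu())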

These results indicate that the model works quite well – the errors are low on both the train and the test set. Since we are working with 2D data, let's also plot the points in the original space.

We may still observe minor artifacts in some parts of the curve, but the overall shape should be captured reasonably well, provided that our results are good on both the train and the test set.



In [None]:
#@title -- Regression Curve vs. Data -- { display-mode: "form" }
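# .item() extracts plain Python floats, so that the range
# can be built regardless of which device the data lives on
x_min = min(X_train.min().item(), X_test.min().item())
x_max = max(X_train.max().item(), X_test.max().item())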
xx = torch.linspace(x_min, x_max, 250).reshape((-1, 1))

model.eval()
with torch.no_grad():
    yy = model(xx.to(device))
    yy = yy.cpu()

plt.scatter(X_train.cpu(), Y_train.cpu(), marker='x', label="training data")
plt.scatter(X_test.cpu(), Y_test.cpu(), c='r', label="testing data")

plt.plot(xx, yy, label="regression curve", c='k')

plt.xlabel('x')
plt.ylabel('y')
plt.grid(ls='--')
plt.legend()

plt.savefig("output/regression.pdf", bbox_inches="tight", pad_inches=0)