<a href="https://colab.research.google.com/github/hotbread213/createClass/blob/master/day1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fin-ML/IVADO Finance and Insurance Workshop
## Introduction to Machine Learning
Day 1 afternoon tutorial, March 4th 2019

1. Intro to Numpy and Scipy

2. Intro to Scikit-learn and machine learning concepts

  * Supervised learning (both frequentist and Bayesian)
    * Fitting and prediction
    * Generalization error
    * Regularization and priors
    * Hyperparameter tuning and model selection
    * Exercise
  * Unsupervised learning
    * A clustering example
 
3.  Intro to Pytorch

  * A multilayer perceptron


## 1. Intro to Numpy and Scipy

The Python standard library has only limited support for computational mathematics, mostly in `math` and `random`. By far the most popular libraries for filling this gap are `numpy` and `scipy`.

**`numpy`** is a tensor algebra library. Typical operations supported by `numpy` include tensor multiplication, indexing, random sampling, etc. Many of its linear algebra routines are ultimately wrappers around BLAS/LAPACK implementations.

**`scipy`** is a computational mathematics library that complements `numpy`. Typical operations supported by `scipy` include "higher-level" numerical algebra (such as solving linear systems), numerical integration, Fourier transforms, etc.

Those are not specifically _machine learning_ libraries, but they are often used as the foundation for machine learning code, so it is essential (and inescapable) to learn to use them.
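
To make the kinds of operations listed above concrete, here is a minimal sketch on a made-up 2x2 system (the matrix, the vector and the seed are illustrative and have nothing to do with the insurance data). It shows a matrix product, indexing, random sampling with `numpy`, and a linear solve with `scipy.linalg`.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)      # seeded generator, so the run is reproducible

# Tensor (here: matrix-vector) multiplication
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x_true = np.array([1.0, -1.0])
b = A @ x_true                      # b = [1., -2.]

# Indexing: first row, then second column
first_row = A[0, :]
second_column = A[:, 1]

# Random sampling: five draws from a standard normal distribution
samples = rng.normal(size=5)

# "Higher-level" linear algebra from scipy: solve A x = b for x
x_solved = linalg.solve(A, b)

print(b, first_row, second_column, samples.shape, x_solved)
```

The solve recovers `x_true` up to floating-point error, which is a quick way to convince yourself that the two libraries work hand in hand.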

### 1.1 Insurance data

We are now ready to do some data analysis.
As a motivating example throughout the afternoon, we will use an insurance dataset.

In [0]:
!wget -O /content/insurance_data.pickle https://www.dropbox.com/s/8enthlp3q4mxtdx/insurance_data.pickle

In [0]:
import numpy as np
import scipy as sp
import pickle

with open("/content/insurance_data.pickle", 'rb') as file:
  data = pickle.load(file)
  
print(f"The dataset has {data['features'].shape[0]} examples with {data['features'].shape[1]} features\n")

print(f"First five features: {data['feature_names'][:5]}")
print(f"First five features of first observation: {data['features'][0, :5]}")
print(f"Total claims of first observation: {data['total_claims'][0]}$")

Tensor algebra works rather intuitively in numpy. For example, if we wanted to create a binary vector that encodes whether a given policy had a nonzero claim, we could do:

In [0]:
data['has_claim'] = (data['total_claims'] != 0)
nb_of_nonzero_claims = np.sum(data['has_claim'])  # each True counts as 1
print(f"There are {nb_of_nonzero_claims} policies with nonzero claims among the {len(data['has_claim'])}.")

We can also easily index. For example, to access the subset of all policies that have nonzero claims, we can do:

In [0]:
data['features_with_claims'] = data['features'][data['has_claim'], :]
data['nonzero_claims'] = data['total_claims'][data['has_claim']]
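
If boolean masks are new to you, the same two steps can be followed by hand on a tiny example. This is only a sketch: the four claim amounts and the 4x2 feature matrix below are invented for illustration.

```python
import numpy as np

toy_total_claims = np.array([0.0, 1200.0, 0.0, 350.0])   # made-up claim amounts
toy_features = np.arange(8).reshape(4, 2)                 # made-up: 4 policies x 2 features

toy_has_claim = toy_total_claims != 0        # elementwise comparison gives a boolean vector
toy_nb_claims = toy_has_claim.sum()          # each True counts as 1

# A boolean vector used as a row index keeps only the rows where it is True
toy_features_with_claims = toy_features[toy_has_claim, :]

print(toy_has_claim, toy_nb_claims)
print(toy_features_with_claims)
```

Only the second and fourth rows survive, since those are the policies with a nonzero claim.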

## 2. Intro to Machine Learning and Scikit-Learn

We are now ready to do actual machine learning!

**`scikit-learn`** is a general-purpose library with implementations of "classic" machine learning models. It also offers a variety of utilities that can help with preprocessing.

In this tutorial we won't dive into the details of the models we will be using. Instead, we'll discuss and compute key machine learning concepts. They don't depend on the particular model chosen, so we might as well treat the models as black boxes for now, which people often do anyways. The next days will cover in more detail how some of these models work.

### 2.1 Supervised learning

As you saw this morning, in supervised learning you are given feature-response pairs $(x, y)$ and you must learn the mapping $f(x)=y$.

There are two main approaches for learning such a function: the frequentist way and the Bayesian way. In both cases, however, the goal is the same: to use the training examples to predict new, unseen examples as well as possible. That is, the key goal is **generalization**.

#### 2.1.1 Fitting and prediction

To demonstrate these concepts, let's fit a classifier to predict whether a given policy will have a nonzero claim, say on the first 1,000 policies so that training does not take too long. We will use a Bayesian model called a **Gaussian process** to do so.

Let's first create the model and prepare the data.

In [0]:
from sklearn.gaussian_process import GaussianProcessClassifier

nonzero_claim_model = GaussianProcessClassifier()

features = data['features'][:1000, :]
responses = data['has_claim'][:1000]

Let's now fit the model. What kind of accuracy do we get after training?

In [0]:
nonzero_claim_model.fit(features, responses)

accuracy = nonzero_claim_model.score(features, responses)
print(f"I am able to predict with {100*accuracy:.2f}% accuracy if a policy will have a claim.")

Wow!! That's **very** impressive!
Well, great, we're done here! Let's pack up, we've solved the whole insurance business!

...

...

Or did we?
Maybe let's check on the next 1000 policies just to make sure.

In [0]:
next_1000_features = data['features'][1000:2000, :]
next_1000_responses = data['has_claim'][1000:2000]

next_1000_accuracy = nonzero_claim_model.score(next_1000_features, next_1000_responses)
print(f"I am able to predict with {100*next_1000_accuracy:.2f}% accuracy whether one of the next 1000 policies will have a claim.")

Wait, the performance dropped... What happened here?

#### 2.1.2 Generalization error

The thing is, it's easy to do well on the data used for training: you could just memorize the labels, for example.
The tough part is doing well on data _you haven't seen_. And at this task, not all models are created equal. The problem is, it's not easy picking the right model ahead of time because, well, you don't know on what data you will be assessed, yet!

Fortunately, if you have enough data, there is a way to estimate how well you're going to do on a future task: using **train/test splits**. The idea is to take your big dataset and split it randomly into (say) an 80% train set and a 20% test set. You then train your model on the train set, and assess performance on the test set.

Let's try it out for claim prediction. First create a train/test split.

In [0]:
shuffled_policies = np.random.permutation(np.arange(len(data['total_claims'])))

train_set = shuffled_policies[:8000]
test_set = shuffled_policies[-2000:]

train_features = data['features'][train_set, :]
train_responses = data['has_claim'][train_set]

test_features = data['features'][test_set, :]
test_responses = data['has_claim'][test_set]
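
The manual permutation above is easy to follow, but scikit-learn also ships a helper, `train_test_split`, that does the shuffling and slicing in one call. The sketch below uses synthetic stand-ins for the features and responses (the `toy_` arrays and the seeds are assumptions made for illustration), so it can be run without the insurance data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(100, 3))        # synthetic stand-in for data['features']
toy_responses = rng.integers(0, 2, size=100)    # synthetic stand-in for data['has_claim']

# test_size=0.2 asks for an 80% / 20% random split; random_state makes it reproducible
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_features, toy_responses, test_size=0.2, random_state=0)

print(toy_train_x.shape, toy_test_x.shape)
```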

Now let's fit our Gaussian process classifier on the train set. (This will take a minute or two.) What are the train and test accuracies?

In [0]:
nonzero_claim_model.fit(train_features, train_responses)

train_accuracy = nonzero_claim_model.score(train_features, train_responses)
test_accuracy = nonzero_claim_model.score(test_features, test_responses)

print(f"I am able to predict with {100*train_accuracy:.2f}% accuracy if a policy will have a claim on the train set,"
      f" and with {100*test_accuracy:.2f}% accuracy on the test set.")

As you can see, adding more data didn't solve the problem: the problem is the model, not the data.

To explore this idea a bit more, let's try to do something a bit more difficult: we're going to fit a **random forest** to predict the size of the claim. Note that this is a regression problem.

We'll use the subset of the policies that had nonzero claims for this. Let's first create a train/test split.

In [0]:
shuffled_policies = np.random.permutation(np.arange(len(data['nonzero_claims'])))
average_claim = np.mean(np.abs(data['nonzero_claims']))

train_set = shuffled_policies[:800]
test_set = shuffled_policies[800:1000]

train_features = data['features_with_claims'][train_set, :]
train_responses = data['nonzero_claims'][train_set]

test_features = data['features_with_claims'][test_set, :]
test_responses = data['nonzero_claims'][test_set]

Let's now fit our random forest to predict the size of the nonzero total claims. What train and test errors do we obtain?

In [0]:
from sklearn.ensemble import RandomForestRegressor

claim_size_model = RandomForestRegressor(n_estimators=100)
claim_size_model.fit(train_features, train_responses)

train_mae = np.mean(np.abs(claim_size_model.predict(train_features) - train_responses))
test_mae = np.mean(np.abs(claim_size_model.predict(test_features) - test_responses))
print(f"My predictions are off by ±{train_mae:.0f}$ on the train set on average, and by ±{test_mae:.0f}$ on the test set.")

print(f"The average claim is {average_claim:.0f}$, so the relative test error is {100*test_mae/average_claim:.2f}%.")

Okay, well the prediction quality is not amazing, but it's a good start. Is there something we could do to improve things?

#### 2.1.3 Regularization and priors

One easy way to improve the (test) performance of a model that is overfitting is to add regularization (frequentist case) or to use stronger priors (Bayesian case). This tends to reduce the gap between train and test performance, so if the training error is very low, it can reduce the test error. It usually comes at the cost of a higher training error.

Since random forests are frequentist, we should be adding regularization. In the RF case, this can be achieved by, for example, increasing the minimum number of samples needed for a split.

Let's fit such a model.

In [0]:
claim_size_model = RandomForestRegressor(n_estimators=100, min_samples_split=100)
claim_size_model.fit(train_features, train_responses)

train_mae = np.mean(np.abs(claim_size_model.predict(train_features) - train_responses))
test_mae = np.mean(np.abs(claim_size_model.predict(test_features) - test_responses))
print(f"My predictions are off by ±{train_mae:.0f}$ on the train set on average, and by ±{test_mae:.0f}$ on the test set.")

print(f"The average claim is {average_claim:.0f}$, so the relative test error is {100*test_mae/average_claim:.2f}%.")

Great! This has helped the test loss a bit! It really hurt the training error, but we don't care.

#### 2.1.4 Hyperparameter tuning and model selection

While we're at it, how about varying more hyperparameters? Consider for example removing the bootstrapping of samples.

In [0]:
claim_size_model = RandomForestRegressor(n_estimators=100, bootstrap=False, min_samples_split=100)
claim_size_model.fit(train_features, train_responses)

train_mae = np.mean(np.abs(claim_size_model.predict(train_features) - train_responses))
test_mae = np.mean(np.abs(claim_size_model.predict(test_features) - test_responses))
print(f"My predictions are off by ±{train_mae:.0f}$ on the train set on average, and by ±{test_mae:.0f}$ on the test set.")

print(f"The average claim is {average_claim:.0f}$, so the relative test error is {100*test_mae/average_claim:.2f}%.")

Ouch! No, bad idea, that hurt the test error. Okay, how about changing the split criterion to MAE?

In [0]:
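# Note: the MAE criterion is spelled 'absolute_error' in scikit-learn >= 1.0 (it was 'mae' in older versions)
claim_size_model = RandomForestRegressor(n_estimators=100, min_samples_split=100, criterion='absolute_error')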
claim_size_model.fit(train_features, train_responses)

train_mae = np.mean(np.abs(claim_size_model.predict(train_features) - train_responses))
test_mae = np.mean(np.abs(claim_size_model.predict(test_features) - test_responses))
print(f"My predictions are off by ±{train_mae:.0f}$ on the train set on average, and by ±{test_mae:.0f}$ on the test set.")

print(f"The average claim is {average_claim:.0f}$, so the relative test error is {100*test_mae/average_claim:.2f}%.")

Great, that helped the test error! It's a new record performance!

In general, we can start playing with hyperparameters like that to optimize test error. We could also enlarge our search, and try to find the _model_ that minimizes test error. Let's try, say, a **support vector regression** model.

In [0]:
from sklearn.svm import SVR

claim_size_model = SVR(kernel='sigmoid', gamma='scale')
claim_size_model.fit(train_features, train_responses)

train_mae = np.mean(np.abs(claim_size_model.predict(train_features) - train_responses))
test_mae = np.mean(np.abs(claim_size_model.predict(test_features) - test_responses))
print(f"My predictions are off by ±{train_mae:.0f}$ on the train set on average, and by ±{test_mae:.0f}$ on the test set.")

print(f"The average claim is {average_claim:.0f}$, so the relative test error is {100*test_mae/average_claim:.2f}%.")

And we can keep playing this game for a while. There is a catch, however.

You might not have noticed, but we didn't actually use all the data for training and testing. There is still some left. How about, just for fun, checking the performance of our best model on the data we _didn't_ use either for training or testing?

In [0]:
other_set = shuffled_policies[1000:]

other_features = data['features_with_claims'][other_set, :]
other_responses = data['nonzero_claims'][other_set]

other_mae = np.mean(np.abs(claim_size_model.predict(other_features) - other_responses))
print(f"My predictions are off by ±{other_mae:.0f}$ on the other set on average.")

print(f"The average claim is {average_claim:.0f}$, so the relative test error is {100*other_mae/average_claim:.2f}%.")

Wait! It's not as good! What happened?

Well, as we kept optimizing over the test set, a bias crept in. We evaluated several models and threw away the bad ones, keeping only the best one (on the test set). In some sense, information about the test set was eventually used to train the model (albeit in a subtle way). This gave us an over-optimistic picture of how well we can expect to predict on new data.

This being said, we still improved somewhat compared to the initial test performance, so it wasn't all useless. It's very possible that the ultimate model we ended up choosing (the SVR) would really be the best model on average on new data. But we shouldn't trust the (over-optimized) test error as an actual picture of the performance to expect on new data.

In practice, people do what we just did; they're just explicit about it. Namely, it is typical to divide the dataset into a **train/validation/test** split:

- The **train** set is, as before, used for optimization of parameters.
- The **validation** set is what we called the test set: a set of data for optimization of hyperparameters and model selection.
- The **test** set is like our "other" set: it should be seen only once at the end and **never** used to influence any modelling decision (otherwise bias will creep in again). It's only there to give a final picture.

A 70%/15%/15% split (or something around that) is typical.
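
Put together, the workflow looks roughly like the sketch below. It runs on synthetic data (the arrays `X` and `y`, the candidate values of `min_samples_split` and the helper `mae` are all assumptions made for illustration), and the point is only the order of operations: fit on train, choose on validation, and look at the test set once.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                    # synthetic features
y = X[:, 0] + 0.5 * rng.normal(size=400)         # synthetic noisy response

# 70% / 15% / 15% split of a random permutation of the row indices
order = rng.permutation(len(X))
train, validation, test = np.split(order, [280, 340])

def mae(model, rows):
    """Mean absolute error of `model` on the given rows (hypothetical helper)."""
    return np.mean(np.abs(model.predict(X[rows]) - y[rows]))

# Model selection: keep the candidate with the lowest *validation* error
best_model, best_validation_mae = None, np.inf
for min_samples_split in [2, 20, 100]:
    model = RandomForestRegressor(n_estimators=20,
                                  min_samples_split=min_samples_split,
                                  random_state=0)
    model.fit(X[train], y[train])
    validation_mae = mae(model, validation)
    if validation_mae < best_validation_mae:
        best_model, best_validation_mae = model, validation_mae

# The test set is touched exactly once, after every decision has been made
test_mae = mae(best_model, test)
print(f"Validation MAE: {best_validation_mae:.3f}, test MAE: {test_mae:.3f}")
```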

#### 2.1.5 Exercise

To practice a bit, let's try to fit a random forest classifier to predict nonzero claims. You should split your data into 7000 train examples, 1500 validation examples and 1500 test examples. How high can you get in validation accuracy? What's your final test accuracy?

In [0]:
from sklearn.ensemble import RandomForestClassifier

nonzero_claim_model = RandomForestClassifier(n_estimators=100)

features = data['features']
responses = data['has_claim']

# 1. Split your data randomly
# ...

# 2. Train your model - what validation accuracy do you get?
# ...

# 3. Try playing with the model hyperparameters. How high can you get in validation accuracy?
# ...

# 4. Once you are done, check your test accuracy. How high is it?
# N.B. Don't cheat, only look at it at the very end! You only have one shot!

In [0]:
# Possible solution

# 1.
shuffled_policies = np.random.permutation(np.arange(10000))

train_set = shuffled_policies[:7000]
validation_set = shuffled_policies[7000:8500]
test_set = shuffled_policies[8500:]

train_features = data['features'][train_set, :]
train_responses = data['has_claim'][train_set]

validation_features = data['features'][validation_set, :]
validation_responses = data['has_claim'][validation_set]

test_features = data['features'][test_set, :]
test_responses = data['has_claim'][test_set]

In [0]:
# 2.
nonzero_claim_model.fit(train_features, train_responses)

train_accuracy = nonzero_claim_model.score(train_features, train_responses)
validation_accuracy = nonzero_claim_model.score(validation_features, validation_responses)
print(f"Train accuracy: {100*train_accuracy:.2f}%, validation accuracy: {100*validation_accuracy:.2f}%.")

In [0]:
# 3.
# Here you can play for a bit - I'm just going to try two variants
nonzero_claim_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10)
nonzero_claim_model.fit(train_features, train_responses)

train_accuracy = nonzero_claim_model.score(train_features, train_responses)
validation_accuracy = nonzero_claim_model.score(validation_features, validation_responses)
print(f"Variant 1 | Train accuracy: {100*train_accuracy:.2f}%, validation accuracy: {100*validation_accuracy:.2f}%.")

nonzero_claim_model = RandomForestClassifier(n_estimators=100, min_samples_split=5)
nonzero_claim_model.fit(train_features, train_responses)

train_accuracy = nonzero_claim_model.score(train_features, train_responses)
validation_accuracy = nonzero_claim_model.score(validation_features, validation_responses)
print(f"Variant 2 | Train accuracy: {100*train_accuracy:.2f}%, validation accuracy: {100*validation_accuracy:.2f}%.")

In [0]:
# 4.
# I take the model with best validation accuracy
nonzero_claim_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10)
nonzero_claim_model.fit(train_features, train_responses)

# and I evaluate the test accuracy
test_accuracy = nonzero_claim_model.score(test_features, test_responses)
print(f"Test accuracy: {100*test_accuracy:.2f}%.")

### 2.2 Unsupervised learning

To complete our overview of machine learning, let's consider another kind of task that might be performed: some unsupervised learning.

In unsupervised learning, the concepts of "training" and "test" errors are much more subtle, and there is disagreement as to whether the concept of overfitting even makes sense. So we won't be talking about that.

#### 2.2.1 Principal components analysis

Our first task will be to reduce the dimension of the data a bit to better visualize it. We will fit a principal components analysis model on our continuous features.

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True)
pca.fit(data['features'][:, :15])

data['reduced_features'] = pca.transform(data['features'][:, :15])

print(f"Shape before: {data['features'][:, :15].shape}, shape after PCA: {data['reduced_features'].shape}")

Our data has been distilled to two abstract "features". How about visualizing the results?

We will use a library called **matplotlib** for that.

In [0]:
import matplotlib.pyplot as plt

x, y = np.split(data['reduced_features'], indices_or_sections=2, axis=1)

plt.scatter(x, y)
plt.show()

Very interesting! There seem to be two "groups" of policies: those whose second PCA component is around zero, and those whose second PCA component is around 4.

As a next task, let's use **k-means** to try to recover those groups.

In [0]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
cluster_assignments = kmeans.fit_predict(data['reduced_features'])

Great! Let's now try to color the plot points by cluster.

In [0]:
plt.scatter(np.squeeze(x), np.squeeze(y), c=cluster_assignments)
plt.show()

At this point, a natural next step would be to investigate in more depth what the points in each cluster have in common, in order to assign them a potential meaning. Thus, unsupervised learning is often a preliminary step in exploratory data analysis.
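
One simple way to start such an investigation is to compare per-cluster summaries of the features. The sketch below does this on two synthetic, well-separated groups of points (the groups, their locations and the seed are made up for illustration); on the real data you would index `data['features']` with `cluster_assignments` in the same way.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))   # synthetic group around (0, 0)
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))   # synthetic group around (5, 5)
toy_points = np.vstack([group_a, group_b])

toy_assignments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_points)

# Compare the clusters through their size and their average feature values
cluster_sizes = []
for cluster in range(2):
    members = toy_points[toy_assignments == cluster]
    cluster_sizes.append(len(members))
    print(f"Cluster {cluster}: {len(members)} points, feature means {members.mean(axis=0)}")
```

Note that the cluster labels themselves (0 or 1) are arbitrary: only the grouping carries meaning.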

## 3. Intro to Pytorch

We have seen we can do tensor algebra with `numpy`. However, for deep learning, we need more than that! We need 
1. automatic differentiation
2. GPU support

**`pytorch`** is a library that supports that, with additional bells and whistles. There have been attempts to use it for other purposes as well (e.g. as a backend for graphical model inference), but by and large it is usually seen as a deep learning library.


In `pytorch`, tensor algebra operations leave a trace in a computational graph that keeps track of the relationships between tensors. Thus, when running an operation such as `C = A @ B`, `pytorch` will remember that `C` is the result of a tensor multiplication between `A` and `B`. This is useful because you can then ask `pytorch` to compute the gradient of a scalar (single-element) tensor with respect to any original tensor that has `requires_grad` set, using the `backward()` method. Thus, for example, running `C.sum().backward()` will compute the gradients of the sum of all components of `C` with respect to `A` and `B`. The resulting gradients will then be accessible as `A.grad` and `B.grad`. This is a key ingredient in gradient-based parameter optimization.


In [0]:
import torch

A = torch.rand(5, 5)
B = torch.rand(5, 5)

A.requires_grad = True
B.requires_grad = True

C = A @ B
C.sum().backward()
print(f"Gradient of A:\n{A.grad}")
print("")
print(f"Gradient of B:\n{B.grad}")
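
For this particular example the gradients can also be worked out by hand, which gives a useful sanity check. Writing $S = \sum_{i,j} (AB)_{ij} = \sum_{i,j,k} A_{ik} B_{kj}$, we get $\partial S / \partial A_{ik} = \sum_j B_{kj}$ (a row sum of $B$) and $\partial S / \partial B_{kj} = \sum_i A_{ik}$ (a column sum of $A$). The sketch below compares these formulas against what `backward()` computes, using small rectangular matrices chosen only for illustration.

```python
import torch

torch.manual_seed(0)
A = torch.rand(3, 4, requires_grad=True)
B = torch.rand(4, 2, requires_grad=True)

(A @ B).sum().backward()

# dS/dA[i, k] = sum_j B[k, j]: the row sums of B, repeated for every row i of A
expected_A_grad = B.detach().sum(dim=1).expand(3, 4)
# dS/dB[k, j] = sum_i A[i, k]: the column sums of A, repeated for every column j of B
expected_B_grad = A.detach().sum(dim=0).unsqueeze(1).expand(4, 2)

A_matches = torch.allclose(A.grad, expected_A_grad)
B_matches = torch.allclose(B.grad, expected_B_grad)
print(A_matches, B_matches)
```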

Moreover, `pytorch` natively supports doing tensor algebra on a GPU, through NVIDIA libraries such as cuBLAS and cuDNN. For example,

In [0]:
import torch

A = torch.rand(5, 5)
B = torch.rand(5, 5)

A_on_GPU, B_on_GPU = A.cuda(), B.cuda()  # Move the tensors to the GPU

C_on_GPU = A_on_GPU @ B_on_GPU  # This matrix multiplication is done on the GPU!
C_on_GPU

For tensor algebra on large tensors, operations on a GPU are usually  _much_ faster.

In [0]:
import time


A = torch.rand(1000, 1000)
B = torch.rand(1000, 1000)

# 1. CPU
start_time = time.time()
A @ B
end_time = time.time()
CPU_total_time = end_time - start_time
print(f"CPU total time: {1e3*CPU_total_time:.2f} milliseconds")

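# 2. GPU
A_on_GPU, B_on_GPU = A.cuda(), B.cuda()
torch.cuda.synchronize()  # make sure the copies are finished before starting the clock

start_time = time.time()
A_on_GPU @ B_on_GPU
torch.cuda.synchronize()  # GPU operations run asynchronously: wait for the result before stopping the clock
end_time = time.time()
GPU_total_time = end_time - start_time
print(f"GPU total time: {1e3*GPU_total_time:.2f} milliseconds")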

print(f"\nGPU is {CPU_total_time/GPU_total_time:.2f} times faster in this case")

A thing to keep in mind, however, is that moving data between the CPU and the GPU has a non-negligible computational cost.

In [0]:
start_time = time.time()
A.cuda(), B.cuda()
torch.cuda.synchronize()  # wait until both copies have completed
end_time = time.time()
CPU_to_GPU_copy_cost = end_time - start_time
print(f"CPU to GPU copy cost: {1e3*CPU_to_GPU_copy_cost:.2f} milliseconds")

Thus it is usually beneficial to do as many operations as possible on the GPU, to offset the initial copy cost.
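
A common pattern for this is to pick a device once, create (or move) the tensors there, chain all the heavy operations on that device, and only bring a small result back at the end. The sketch below illustrates the idea; the ten chained products and the rescaling by 1000 are arbitrary choices for illustration, and the code falls back to the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

A = torch.rand(1000, 1000, device=device)   # created directly on the chosen device
B = torch.rand(1000, 1000, device=device)

C = A
for _ in range(10):
    C = (C @ B) / 1000.0    # ten chained products on the same device (rescaled to keep values small)

result = C.sum().item()     # only a single scalar travels back to the CPU
print(device, result)
```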

In practice, the actual time improvement will depend on many factors, including how much time is spent doing linear algebra. From anecdotal evidence, I have usually found that training a deep neural net on a GPU led to a **10x speed improvement** vs. training it on a CPU. Your mileage may vary.

### 3.1 A multilayer perceptron

As a concrete example of what can be done with `pytorch`, we will fit a multilayer perceptron on the claim prediction task.

We first start by splitting our data again.

In [0]:
shuffled_policies = np.random.permutation(np.arange(10000))

train_set = shuffled_policies[:7000]
validation_set = shuffled_policies[7000:8500]
test_set = shuffled_policies[8500:]

train_features = data['features'][train_set, :].astype(np.float32)
train_responses = data['has_claim'][train_set].astype(np.int64)
validation_features = data['features'][validation_set, :].astype(np.float32)
validation_responses = data['has_claim'][validation_set].astype(np.int64)
test_features = data['features'][test_set, :].astype(np.float32)
test_responses = data['has_claim'][test_set].astype(np.int64)

These are numpy arrays living on the CPU, so the first thing we need to do is to convert them to `pytorch` tensors and move them to the GPU. (The dataset is small enough to all fit in GPU memory.)

In [0]:
train_features = torch.from_numpy(train_features).cuda()
train_responses = torch.from_numpy(train_responses).cuda()
validation_features = torch.from_numpy(validation_features).cuda()
validation_responses = torch.from_numpy(validation_responses).cuda()
test_features = torch.from_numpy(test_features).cuda()
test_responses = torch.from_numpy(test_responses).cuda()

Nowadays, virtually all neural networks are trained by gradient descent from randomly chosen initial parameters. The distributions from which these initial parameters are drawn assume (or make the most sense when) the inputs are normalized to roughly between -1 and 1. Therefore, for training to behave reasonably, it is almost always recommended to normalize the inputs to a neural network, as we will do here.

In [0]:
center = train_features.mean(dim=0).unsqueeze(0)
scale = train_features.std(dim=0).clamp(min=1).unsqueeze(0)
train_features = (train_features - center) / scale
validation_features = (validation_features - center) / scale
test_features = (test_features - center) / scale

Recall, as you saw this morning with Marzieh, that MLPs are the simplest kind of neural network, with a single hidden layer. We will use, say, 32 units everywhere and ReLU activations.

We could implement the MLP by hand, defining weight and bias matrices,  but `pytorch` comes with implementations of most types of layers found in the literature. Using that language, for example, we can define an MLP as follows.


In [0]:
nb_input_features = train_features.shape[-1]

mlp = torch.nn.Sequential(
        torch.nn.Linear(nb_input_features, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 2),
        torch.nn.Softmax(dim=-1)).cuda()

We can now feed the entirety of our training dataset to the MLP, and it will produce predictions for all 7000 samples.

In [0]:
predictions = mlp(train_features)
print(predictions.shape)
print(predictions[:5, :])

accuracy = (train_responses == predictions.argmax(dim=-1)).float().mean()
print(f"\nAccuracy: {100*accuracy:.1f}%")

We will fit our MLP by gradient descent, a standard approach. This corresponds to repeatedly taking small steps in the direction opposite to the gradient of the loss with respect to the parameters, until the gradient becomes (close to) zero, at which point the process stops. In our case, since we have a classification problem, we will use the "cross-entropy loss".

In [0]:
cross_entropy_loss = -(predictions + 1e-5).log().gather(1, train_responses.unsqueeze(-1)).mean()
print(cross_entropy_loss)
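
The expression above computes $-\frac{1}{N}\sum_{i} \log p_{i, y_i}$, the average negative log-probability that the network assigns to the correct class (the `1e-5` is only there to avoid taking the log of zero). As a sanity check, the sketch below compares the same formula, without the `1e-5` term, to `pytorch`'s built-in `torch.nn.functional.cross_entropy` on made-up scores and labels. Note that the built-in version expects the pre-softmax scores (logits) rather than probabilities.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_logits = torch.randn(6, 2)                   # made-up pre-softmax scores: 6 examples, 2 classes
toy_targets = torch.tensor([0, 1, 1, 0, 1, 0])   # made-up class labels

toy_probabilities = toy_logits.softmax(dim=-1)

# Manual version: pick the probability of the correct class, take -log, then average
manual_loss = -toy_probabilities.log().gather(1, toy_targets.unsqueeze(-1)).mean()

# Built-in version: takes logits and applies the log-softmax internally
builtin_loss = F.cross_entropy(toy_logits, toy_targets)

print(manual_loss.item(), builtin_loss.item())
```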

We can obtain the gradients by calling `backward` on `cross_entropy_loss`.

In [0]:
mlp.zero_grad()
cross_entropy_loss.backward()

for name, parameter in mlp.named_parameters():
  print(f"Gradient of parameter {name}:")
  print(parameter.grad)
  print("")

Now we can apply our gradient update from those gradients:

In [0]:
for parameter in mlp.parameters():
  parameter.data -= 1e-2 * parameter.grad.data

Tadam! We just did our first gradient update! We should now have slightly better training performance:

In [0]:
predictions = mlp(train_features)
print(predictions.shape)
print(predictions[:5, :])

accuracy = (train_responses == predictions.argmax(dim=-1)).float().mean()
print(f"\nAccuracy: {100*accuracy:.1f}%")

If we repeat this over and over again, our parameters will slowly move towards values that minimize this cross-entropy loss (hence increase classification accuracy).

In [0]:
for update in range(25):
  predictions = mlp(train_features)
  cross_entropy_loss = -(predictions + 1e-5).log().gather(1, train_responses.unsqueeze(-1)).mean()
  mlp.zero_grad()
  cross_entropy_loss.backward()
  for parameter in mlp.parameters():
    parameter.data -= 1e-2 * parameter.grad.data
  accuracy = (train_responses == predictions.argmax(dim=-1)).float().mean()
  print(f"Cross entropy loss: {cross_entropy_loss:.4f}, accuracy: {100*accuracy:.1f}%")
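
In everyday `pytorch` code, the manual update `parameter.data -= 1e-2 * parameter.grad.data` is usually delegated to an optimizer object. The sketch below shows the same loop written with `torch.optim.SGD` and `torch.nn.CrossEntropyLoss` on a small synthetic problem that runs on the CPU (the `toy_` data and network are assumptions made for illustration). Since `CrossEntropyLoss` expects logits, this network has no final `Softmax` layer.

```python
import torch

torch.manual_seed(0)
toy_features = torch.randn(200, 4)                    # synthetic inputs
toy_responses = (toy_features[:, 0] > 0).long()       # synthetic labels, decided by the first feature

toy_mlp = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2))                           # outputs logits; the loss applies the softmax
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_mlp.parameters(), lr=1e-2)

losses = []
for update in range(25):
    loss = loss_function(toy_mlp(toy_features), toy_responses)
    optimizer.zero_grad()    # reset the gradients accumulated by the previous update
    loss.backward()          # compute the new gradients
    optimizer.step()         # plain SGD: parameter -= lr * parameter.grad
    losses.append(loss.item())

print(f"First loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Swapping `torch.optim.SGD` for another optimizer (for example `torch.optim.Adam`) then only requires changing one line.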

Neural networks can overfit, so it is usually a good idea to monitor the validation error as we update and stop when further updates worsen the validation error ("early stopping").

For MLPs this is less problematic since they tend to lack capacity, but it's good to double-check nonetheless.

In [0]:
mlp = torch.nn.Sequential(
        torch.nn.Linear(nb_input_features, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 2),
        torch.nn.Softmax(dim=-1)).cuda()

for update in range(50):
  train_predictions = mlp(train_features)
  train_cross_entropy_loss = -(train_predictions + 1e-5).log().gather(1, train_responses.unsqueeze(-1)).mean()
  mlp.zero_grad()
  train_cross_entropy_loss.backward()
  for parameter in mlp.parameters():
    parameter.data -= 1e-2 * parameter.grad.data
  train_accuracy = (train_responses == train_predictions.argmax(dim=-1)).float().mean()
  
  validation_predictions = mlp(validation_features)
  validation_cross_entropy_loss = -(validation_predictions + 1e-5).log().gather(1, validation_responses.unsqueeze(-1)).mean()
  validation_accuracy = (validation_responses == validation_predictions.argmax(dim=-1)).float().mean()
  print(f"Update {update}")
  print(f"  Training   cross entropy loss: {train_cross_entropy_loss:.4f}, accuracy: {100*train_accuracy:.1f}%")
  print(f"  Validation cross entropy loss: {validation_cross_entropy_loss:.4f}, accuracy: {100*validation_accuracy:.1f}%")
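
The loop above only monitors the validation error; it does not actually stop. One possible way to implement early stopping is sketched below, on synthetic data that runs on the CPU. The `patience` of 5 updates, the learning rate and the `toy_` data are arbitrary choices for illustration: the loop keeps a copy of the best parameters seen so far and stops once the validation loss has not improved for `patience` consecutive updates.

```python
import copy
import torch

torch.manual_seed(0)
toy_features = torch.randn(300, 10)                                      # synthetic inputs
toy_responses = (toy_features[:, 0] + 0.5 * torch.randn(300) > 0).long() # synthetic noisy labels
toy_train_x, toy_train_y = toy_features[:200], toy_responses[:200]
toy_val_x, toy_val_y = toy_features[200:], toy_responses[200:]

toy_mlp = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_mlp.parameters(), lr=0.1)

patience = 5                                   # updates we tolerate without improvement
best_validation_loss, best_state = float("inf"), None
updates_without_improvement = 0

for update in range(500):
    loss = loss_function(toy_mlp(toy_train_x), toy_train_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                      # no gradients needed for monitoring
        validation_loss = loss_function(toy_mlp(toy_val_x), toy_val_y).item()

    if validation_loss < best_validation_loss:
        best_validation_loss = validation_loss
        best_state = copy.deepcopy(toy_mlp.state_dict())   # remember the best parameters
        updates_without_improvement = 0
    else:
        updates_without_improvement += 1
        if updates_without_improvement >= patience:
            break                              # validation loss stopped improving

toy_mlp.load_state_dict(best_state)            # roll back to the best parameters seen
with torch.no_grad():
    restored_validation_loss = loss_function(toy_mlp(toy_val_x), toy_val_y).item()
print(f"Stopped after update {update}, best validation loss: {best_validation_loss:.4f}")
```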

And of course we can check the test performance.

In [0]:
test_predictions = mlp(test_features)
test_cross_entropy_loss = -(test_predictions + 1e-5).log().gather(1, test_responses.unsqueeze(-1)).mean()
test_accuracy = (test_responses == test_predictions.argmax(dim=-1)).float().mean()
print(f"Test cross entropy loss: {test_cross_entropy_loss:.4f}, accuracy: {100*test_accuracy:.1f}%")

That's not bad at all! (Compare with previous GP and RF test accuracies.) Of course we could do some hyperparameter tuning here on the validation error, as usual.