**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook, using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from class_utils import error_histogram
from class_utils.pytorch_utils import EarlyStopping
import torch.nn as nn
import torch

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/3jnf3000vwaxtcg/boston_housing.zip?dl=1", directory="data/boston_housing")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## A Real-Estate-Price Regression Model

In this notebook we will apply neural regression to the problem of real estate price prediction. We will make use of the [Boston housing dataset](https://www.kaggle.com/c/boston-housing).

**Note:** The example is purely illustrative. The dataset is well-structured (the data is divided into columns with clear meanings etc.), and would therefore probably be approached with a different method in practice – possibly with some approach based on decision trees. Artificial neural networks and deep learning are usually applied to problems with unstructured data, such as images, audio, text etc.

### Loading and Preprocessing the Dataset

Let us start by displaying the description of the dataset:



In [None]:
with open("data/boston_housing/description.txt", "r") as file:
    print("".join(file.readlines()))

As the next step, we will load the dataset itself from a CSV file:



In [None]:
df = pd.read_csv("data/boston_housing/housing.csv")
df.head()

#### The Splitting of the Dataset

Next we continue by splitting the dataset. In this case, however, the data will not be split into two parts the way we usually split it, but rather into three parts: training, validation and testing data in the ratio of 70 : 5 : 25. The validation data will be used during training for regularization and model selection (the details are below).

Also, when splitting, we stratify by the discretized version of the output column:



In [None]:
kbins = KBinsDiscretizer(10, encode='ordinal')

y_stratify = kbins.fit_transform(df[["medv"]])
df_train_valid, df_test = train_test_split(df, test_size=0.25,
                                     stratify=y_stratify,
                                     random_state=9)

y_stratify = kbins.fit_transform(df_train_valid[["medv"]])
df_train, df_valid = train_test_split(df_train_valid, test_size=0.05/0.75,
                                     stratify=y_stratify,
                                     random_state=9)
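
The arithmetic behind the second split can be checked on synthetic data. The following is a minimal sketch, assuming a made-up continuous target instead of `medv`; the helper `quantile_bins` is a hypothetical name introduced only for this illustration. Since 5 % of the whole dataset corresponds to 0.05 / 0.75 of the remaining 75 %, the resulting fractions should come out close to 70 : 5 : 25.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=0.4, size=(1000, 1))  # synthetic stand-in target
idx = np.arange(len(y))

def quantile_bins(values, n_bins=10):
    """Discretize a continuous target into ordinal quantile bins (hypothetical helper)."""
    kbins = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    return kbins.fit_transform(values).ravel()

# first split: hold out 25 % of all samples for testing
idx_train_valid, idx_test = train_test_split(
    idx, test_size=0.25, stratify=quantile_bins(y), random_state=9)

# second split: 5 % of the whole dataset is 0.05 / 0.75 of the remaining 75 %
idx_train, idx_valid = train_test_split(
    idx_train_valid, test_size=0.05 / 0.75,
    stratify=quantile_bins(y[idx_train_valid]), random_state=9)

fractions = {name: len(part) / len(y) for name, part in
             [("train", idx_train), ("valid", idx_valid), ("test", idx_test)]}
print(fractions)
```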

---
#### Task 1: Data Preprocessing

**Apply our standard preprocessing procedure for neural nets to the data and produce the training set `X_train`, `Y_train`, the validation set `X_valid`, `Y_valid` and the testing set `X_test`, `Y_test` as the result: in the necessary form and cast to the appropriate data type.** 

Remember to reserve `fit_transform` for the training set only, and to use `transform` on the validation set and the test set.

Do not forget to cast the data to PyTorch tensors with appropriate data types. Transfer the tensors to `device`.

---
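
As a reminder of the general pattern, here is a minimal sketch on a toy table. The column names (`rooms`, `river`, `price`), the `toy_*` variables and the helper `to_tensor` are all hypothetical, and the choice of imputers and encoders is an assumption rather than the prescribed solution. It only illustrates fitting on the training part, applying to the other parts, and casting to `float32` tensors on `device`.

```python
import numpy as np
import pandas as pd
import torch
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data with hypothetical column names (not the housing columns)
toy_train = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "river": ["yes", "no", "no", "yes"],
    "price": [10.0, 20.0, 15.0, 30.0],
})
toy_valid = pd.DataFrame({"rooms": [5.0], "river": ["no"], "price": [18.0]})

toy_preproc = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), ["river"]),
    (make_pipeline(SimpleImputer(), StandardScaler()), ["rooms"]),
    sparse_threshold=0,  # always return a dense array
)
toy_out_scaler = StandardScaler()

device = "cuda" if torch.cuda.is_available() else "cpu"

def to_tensor(array):
    """Cast a NumPy array to a float32 tensor on the selected device."""
    return torch.tensor(np.asarray(array), dtype=torch.float32, device=device)

# the statistics are fitted on the training part only ...
X_toy_train = to_tensor(toy_preproc.fit_transform(toy_train))
Y_toy_train = to_tensor(toy_out_scaler.fit_transform(toy_train[["price"]]))
# ... and merely applied to the validation part
X_toy_valid = to_tensor(toy_preproc.transform(toy_valid))
Y_toy_valid = to_tensor(toy_out_scaler.transform(toy_valid[["price"]]))

print(X_toy_train.shape, X_toy_valid.shape, X_toy_train.dtype)
```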


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

categorical_inputs = [          ] # -----

numeric_inputs = [              ] # -----

output = ["medv"]


input_preproc = # ---



# -----


output_preproc = StandardScaler()


# -----



---
### Task 2: Creation of Neural Net and Training

**Create a neural regressor and train it using the train set. Complete the `Net` class below; the training loop that follows then produces a trained `model`, the performance of which we will subsequently be able to test using the test set.**

Hint: For the sizes of your layers, you can pick e.g. the following:

* `num_inputs`;
* 128;
* 64;
* 32;
* `num_outputs`;
---


In [None]:
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()


        # -----
        


In [None]:
num_inputs = X_train.shape[1]
num_outputs = Y_train.shape[1]
model = Net(num_inputs, num_outputs).to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_train = []

for epoch in range(2000):
    model.train()
    y = model(X_train)

    loss = criterion(y, Y_train)
    loss_train.append(loss.detach().item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f"epoch {epoch}, loss: {np.mean(loss_train[-20:]):.3g}")

print(f"epoch {epoch}, loss: {np.mean(loss_train[-20:]):.3g}")

Next we can use the losses recorded in `loss_train` to plot the learning curve.



In [None]:
plt.plot(loss_train)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.grid(ls='--')

Once the training is done, let's also do our standard evaluation on the training set. We should see that the results are rather good and the errors are negligible on the scale of the data.
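
Assuming the targets were standardized with a `StandardScaler` (as `output_preproc` suggests), the MSE and MAE computed on the scaled tensors are in standardized units. The sketch below, which uses synthetic prices and made-up predictions, shows two equivalent ways of expressing the MAE in the original units: inverting the scaling, or multiplying by the standard deviation stored in the scaler.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
prices = rng.uniform(5.0, 50.0, size=(200, 1))   # synthetic targets in original units
scaler = StandardScaler()
Y_scaled = scaler.fit_transform(prices)          # what the network would be trained on
y_scaled = Y_scaled + rng.normal(0.0, 0.1, size=Y_scaled.shape)  # stand-in predictions

mae_scaled = mean_absolute_error(Y_scaled, y_scaled)

# option 1: invert the scaling and measure the error in original units
mae_original = mean_absolute_error(
    scaler.inverse_transform(Y_scaled), scaler.inverse_transform(y_scaled))

# option 2: standardization is affine, so absolute errors simply scale
# by the standard deviation stored in the scaler
mae_rescaled = mae_scaled * scaler.scale_[0]

print(mae_scaled, mae_original, mae_rescaled)
```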



In [None]:
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_train_cpu = model(X_train).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_train_cpu, y_train_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_train_cpu, y_train_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_train_cpu, y_train_cpu, Y_fit_scaling=Y_train_cpu)

#### Testing the Model on the Validation Set

Alright, so the results on the training set are quite satisfactory. But does the model actually generalize well?

Given that this is not the final version of our model (we will introduce other versions below), we will **not yet test the performance using the testing set** (we need to hold that out for the final testing in order to verify generalization), but rather using the **validation set**.



In [None]:
Y_valid_cpu = Y_valid.cpu()
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_valid_cpu = model(X_valid).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_valid_cpu, y_valid_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_valid_cpu, y_valid_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_valid_cpu, y_valid_cpu, Y_fit_scaling=Y_train_cpu)

After evaluating the model on the validation set, you should see that the metrics are not even close to those on the training set. This indicates that there is a significant amount of **overfitting**.

#### Testing on the Validation Set Throughout Training

To get a better idea of where the training went wrong, let's record the validation loss throughout training the same way we do with the training loss and let's plot both.



In [None]:
num_inputs = X_train.shape[1]
num_outputs = Y_train.shape[1]
model = Net(num_inputs, num_outputs).to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss_train = []
loss_valid = []

for epoch in range(2000):
    model.train()
    y = model(X_train)

    loss = criterion(y, Y_train)
    loss_train.append(loss.detach().item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        y = model(X_valid)
        loss = criterion(y, Y_valid)
        loss_valid.append(loss.item())

    if epoch % 100 == 0:
        print(f"epoch {epoch}, train loss: {np.mean(loss_train[-20:]):.3g}, valid loss: {np.mean(loss_valid[-20:]):.3g}")

print(f"epoch {epoch}, train loss: {np.mean(loss_train[-20:]):.3g}, valid loss: {np.mean(loss_valid[-20:]):.3g}")

In [None]:
plt.plot(loss_train, label="train")
plt.plot(loss_valid, label="valid")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.grid(ls='--')
plt.legend()

In [None]:
Y_valid_cpu = Y_valid.cpu()
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_valid_cpu = model(X_valid).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_valid_cpu, y_valid_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_valid_cpu, y_valid_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_valid_cpu, y_valid_cpu, Y_fit_scaling=Y_train_cpu)

Well, the picture we get now is totally different! The loss on the validation set is much worse than on the train set and, if you look carefully, you may see that it even starts to increase after a while – even though the loss on the train set still decreases or stays low.
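
The turning point can also be located numerically instead of by eye. This is a minimal sketch using synthetic curves as stand-ins for `loss_train` and `loss_valid`; the names `loss_train_demo` and `loss_valid_demo` and the curve shapes are made up for illustration.

```python
import numpy as np

epochs = np.arange(2000)
# synthetic stand-ins for the recorded curves: the training loss keeps
# falling, while the validation loss bottoms out and then creeps back up
loss_train_demo = np.exp(-epochs / 200.0)
loss_valid_demo = np.exp(-epochs / 200.0) + 0.2 + 1e-4 * epochs

# epoch at which the validation loss is lowest
best_epoch = int(np.argmin(loss_valid_demo))
# gap between the validation and the training loss at the end of training
gap_at_end = loss_valid_demo[-1] - loss_train_demo[-1]

print(f"validation loss is lowest at epoch {best_epoch}")
print(f"train/valid gap in the last epoch: {gap_at_end:.3f}")
```

With the real lists recorded above, the same two lines (`np.argmin` and the final gap) should apply unchanged.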

### Regularization

How do we resolve the problem presented in the previous section and prevent the network from overfitting? Well, overfitting often occurs because when the model finds it difficult to decrease the loss in a legitimate way, it will start to cheat by memorizing the data.

In order to prevent that kind of problem, we need to use some regularization methods. These are methods designed to help prevent overfitting. The name "regularization" derives from the fact that we want our model to capture the real regularities in the data and not start memorizing data, including the noise.

#### Getting More Data

Getting more data is generally the best way to improve generalization – given enough data, a learning method should be better able to tell regularities apart from noise. It also shouldn't be able to memorize all data, so it is forced to learn the regularity itself.

The problem with getting more data is that it is generally very difficult and expensive. A number of other regularization methods have therefore been developed, with a view to getting as much as possible from the data we already have.

#### Regularization in Standard Machine Learning

In most machine learning methods, regularization is done by reducing the capacity of the model in some way – e.g. by decreasing its size (the degree of a polynomial, the size of a decision tree, ...).

This helps because the model is no longer able to memorize the training set and actually needs to fit the regularities in the data. In artificial neural networks this can be done by decreasing the numbers and sizes of layers.
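
To get a feeling for how much the capacity changes with the numbers and sizes of layers, one can count the trainable parameters. The sketch below is an illustration only: `make_mlp` and `count_parameters` are hypothetical helpers, and 13 inputs and 1 output are assumed values rather than something taken from the preprocessed dataset.

```python
import torch.nn as nn

def make_mlp(sizes):
    """Build a ReLU MLP from a list of layer sizes (input first, output last)."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # no activation after the output layer

def count_parameters(module):
    """Number of trainable parameters (weights and biases)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# 13 inputs and 1 output are illustrative values
large = make_mlp([13, 128, 64, 32, 1])
small = make_mlp([13, 16, 1])

print(count_parameters(large), count_parameters(small))
```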

#### Early Stopping

Another way to decrease the capacity of a neural model is to use a technique known as early stopping. As we saw earlier, one thing that typically happens in the course of training is that even though the loss on the training set keeps decreasing, the loss on the validation set (if used) stops decreasing or even starts growing.

The idea behind early stopping is to simply stop training at that point and restore weights to the point where the validation loss was actually at its minimum. The further advantage of this approach is that it saves some computation.

#### Regularization in Deep Learning

The area of deep learning is a bit of an exception, because regularization is typically not done by restricting the size of the model. Rather, deep learning practitioners make use of:

* Special layers;
* Clever architectures that inject better inductive preferences into the model (i.e. bias it towards the kind of solution that is likely to generalize well);
* Data augmentation (e.g. generating new random variants of existing samples);
* Transfer learning (i.e. pretraining on a larger dataset);
* ...
#### What We Are Going to Use In This Notebook

In this particular notebook, we are going to keep things simple. We are only going to use two simple regularization methods:

* **Early stopping**;
* **Dropout**.

Plus, since the neural network we are using here is a **shallow** one and the dataset is tiny, making the **neural net smaller** may actually also be a good way to regularize – even though you typically wouldn't do that in a deep network trained on millions of samples.

Note again that throughout the entire process of developing our model, **we are using the training set and the validation set**, but **not** the testing set. We are reserving the testing set so that we can evaluate the very final version of our model.

### Early Stopping

Let's start with early stopping then. Since the losses can be a bit noisy, early stopping usually has a "patience" hyperparameter – this specifies how many steps to wait once the loss has stopped decreasing before the training is actually stopped.
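
The role of patience can be illustrated with a few lines of plain Python. This is a minimal sketch: `PatienceStopper` is a hypothetical helper written for this illustration, and the `EarlyStopping` class from `class_utils` may well be implemented differently. The validation losses are made up, and a small dictionary stands in for the model's weights.

```python
import copy

class PatienceStopper:
    """Hypothetical helper: signals a stop after `patience` epochs without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, loss, state):
        """Record one validation loss; return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_state = copy.deepcopy(state)  # snapshot of the best weights
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# a made-up validation curve: it improves at first, then gets worse
valid_losses = [1.0, 0.6, 0.4, 0.45, 0.5, 0.42, 0.6, 0.7]
stopper = PatienceStopper(patience=3)

for epoch, loss in enumerate(valid_losses):
    state = {"epoch": epoch}  # stands in for model.state_dict()
    if stopper.step(loss, state):
        break

print(epoch, stopper.best_loss, stopper.best_state)
```

In this toy run the loss stops improving after epoch 2, so with a patience of 3 the loop ends at epoch 5 and the stored snapshot is the one from epoch 2.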



In [None]:
early_stopping = EarlyStopping(checkpoint_path="output/best_model.pt")

num_inputs = X_train.shape[1]
num_outputs = Y_train.shape[1]
model = Net(num_inputs, num_outputs).to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss_train = []
loss_valid = []

for epoch in range(2000):
    model.train()
    y = model(X_train)

    loss = criterion(y, Y_train)
    loss_train.append(loss.detach().item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        y = model(X_valid)
        loss = criterion(y, Y_valid)
        loss_valid.append(loss.item())
        if early_stopping(loss_valid[-1], model):
            print(f"Stopping the training early because the validation loss has not improved in the last {early_stopping.patience} epochs")
            break

    if epoch % 100 == 0:
        print(f"epoch {epoch}, train loss: {np.mean(loss_train[-20:]):.3g}, valid loss: {np.mean(loss_valid[-20:]):.3g}")

In [None]:
plt.plot(loss_train, label="train")
plt.plot(loss_valid, label="valid")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.grid(ls='--')
plt.legend()

Once the training is done, we load the best model back from the checkpoint and run evaluation. The results may already be a bit better – but it is also possible that more powerful regularization will be required.



In [None]:
model.load_state_dict(torch.load("output/best_model.pt"));

In [None]:
Y_valid_cpu = Y_valid.cpu()
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_valid_cpu = model(X_valid).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_valid_cpu, y_valid_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_valid_cpu, y_valid_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_valid_cpu, y_valid_cpu, Y_fit_scaling=Y_train_cpu)

### Dropout

The other kind of regularization that we are going to explore in this notebook is called **dropout**. This method randomly turns off a portion of the neurons in a layer by zeroing their outputs (during training, not in evaluation mode). In PyTorch this can be done by placing `nn.Dropout` after a layer. Dropout tends to make the network more robust, improving generalization.

The portion of neurons to be turned off is a hyperparameter. If we wanted to use 0.3, we could add dropout in the following way:

```
class Net(nn.Module):
    def __init__(self):

        ...

        self.dropout = nn.Dropout(0.3)

        ...

    def forward(self, x):

        ...

        y = torch.relu(y)
        y = self.dropout(y)

        ...
```
We typically do not insert a dropout layer after the output layer (given that the outputs are read directly from the output layer, if its outputs are zeroed out, this will cause errors that no network – no matter how robust – would be able to prevent).
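
The difference between the training mode and the evaluation mode can be checked directly on a standalone `nn.Dropout` layer. The following is a small sketch with an input of ones chosen purely for illustration. In training mode, PyTorch also rescales the surviving activations by 1 / (1 - p), so that their expected value stays the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.3)
x = torch.ones(1000)

dropout.train()            # training mode: units are zeroed at random
y_train_mode = dropout(x)

dropout.eval()             # evaluation mode: dropout acts as the identity
y_eval_mode = dropout(x)

# fraction of units zeroed in training mode (roughly p)
zeroed_fraction = (y_train_mode == 0).float().mean().item()
# the surviving units are scaled by 1 / (1 - p)
kept_value = y_train_mode[y_train_mode != 0][0].item()

print(f"zeroed in training mode: {zeroed_fraction:.2f}")
print(f"value of a surviving unit: {kept_value:.4f}")
print(f"unchanged in eval mode: {torch.equal(y_eval_mode, x)}")
```

This is also why calling `model.train()` and `model.eval()` at the right places in the training loop matters once the network contains dropout layers.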

#### Dropout and the Model's Capacity

Whenever we use more aggressive forms of regularization, this may reduce the capacity of the model too significantly. When using many `Dropout` layers, it can therefore be necessary to make the model a bit larger than it would ordinarily be.

The interaction between various kinds of regularization can also be nontrivial: e.g. when using dropout, we can expect the validation loss to be a lot more noisy (new sources of stochasticity have been added). If using early stopping as well, it can therefore be necessary to use significantly larger values of `patience`.

---
### Task 3

**Try to insert a few `Dropout` layers into your network. E.g. one `Dropout` layer after each `relu`.** 

**When testing the effectiveness of your regularization only use the validation set. The testing set needs to be held out until the end – we can only use it once!** 

---


In [None]:
class DropoutNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()


        # -----
        


Let's try training again using our new network. To keep things simple and avoid having to tune the `patience` parameter (an excessively low value could make our results worse), we are not going to be using early stopping in this run.



In [None]:
num_inputs = X_train.shape[1]
num_outputs = Y_train.shape[1]
model = DropoutNet(num_inputs, num_outputs).to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss_train = []
loss_valid = []

for epoch in range(2000):
    model.train()
    y = model(X_train)

    loss = criterion(y, Y_train)
    loss_train.append(loss.detach().item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        y = model(X_valid)
        loss = criterion(y, Y_valid)
        loss_valid.append(loss.item())

    if epoch % 100 == 0:
        print(f"epoch {epoch}, train loss: {np.mean(loss_train[-20:]):.3g}, valid loss: {np.mean(loss_valid[-20:]):.3g}")

In [None]:
plt.plot(loss_train, label="train")
plt.plot(loss_valid, label="valid")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.grid(ls='--')
plt.legend()

Note how our validation loss now no longer increases. Note also how the losses are more noisy – this is, of course, because of the noise introduced by dropout.

The training loss and the validation loss now shouldn't differ quite so wildly.



In [None]:
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_train_cpu = model(X_train).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_train_cpu, y_train_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_train_cpu, y_train_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_train_cpu, y_train_cpu, Y_fit_scaling=Y_train_cpu)

In [None]:
Y_valid_cpu = Y_valid.cpu()
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_valid_cpu = model(X_valid).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_valid_cpu, y_valid_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_valid_cpu, y_valid_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_valid_cpu, y_valid_cpu, Y_fit_scaling=Y_train_cpu)

### Results on the Testing Set

Once we have arrived at our final model, we are going to test its generalization on the testing data as well.

Since we did not use early stopping in our final model, we could actually rerun training on train + validation set now before we do testing. That might improve the results of our final test a bit further – you may add the necessary code if you like.
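
One possible shape of that retraining step is sketched below. It is only an illustration under several assumptions: random tensors stand in for the preprocessed data, a small placeholder network stands in for the final architecture, and the number of epochs is an arbitrary fixed budget (there is no validation data left to monitor). All names ending in `_demo`, as well as `X_full`, `Y_full` and `final_model`, are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# random stand-ins for the preprocessed tensors produced in Task 1
X_train_demo, Y_train_demo = torch.randn(70, 13), torch.randn(70, 1)
X_valid_demo, Y_valid_demo = torch.randn(5, 13), torch.randn(5, 1)

# stack the two subsets along the sample dimension
X_full = torch.cat([X_train_demo, X_valid_demo], dim=0)
Y_full = torch.cat([Y_train_demo, Y_valid_demo], dim=0)

# a small placeholder network; in the notebook this would be the final architecture
final_model = nn.Sequential(
    nn.Linear(13, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(final_model.parameters(), lr=1e-3)

final_model.train()
for epoch in range(200):   # fixed budget, since nothing is left to validate on
    loss = criterion(final_model(X_full), Y_full)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(X_full.shape, Y_full.shape, loss.item())
```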



In [None]:
Y_test_cpu = Y_test.cpu()
Y_train_cpu = Y_train.cpu()

model.eval()
with torch.no_grad():
    y_test_cpu = model(X_test).cpu()

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test_cpu, y_test_cpu)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test_cpu, y_test_cpu)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_test_cpu, y_test_cpu, Y_fit_scaling=Y_train_cpu)

### Regression Using Gradient-Boosted Decision Trees

To reinforce the point that neural nets do not bring many advantages when applied to structured data and that better results can usually be achieved by other methods, we will now compare our results with the XGBoost method, which is based on an ensemble of decision trees created using gradient boosting. There is a good chance that the results will be better than what we were able to achieve using a neural net, and the learning will be significantly faster. The real advantages of neural networks generally only become obvious once they are applied to more complex unstructured data such as images, audio, etc.



In [None]:
from xgboost import XGBRegressor
X_train_np = X_train.cpu().numpy()
Y_train_np = Y_train.cpu().numpy()
X_test_np = X_test.cpu().numpy()
Y_test_np = Y_test.cpu().numpy()

model = XGBRegressor()
model.fit(X_train_np, Y_train_np);

In [None]:
y_test = model.predict(X_test_np)

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test_np, y_test)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test_np, y_test)
print("MAE = {}".format(mae))

# we display the error histogram
plt.figure(figsize=(8, 6))
error_histogram(Y_test_np, y_test, Y_fit_scaling=Y_train_np)