**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
from class_utils import error_histogram
import matplotlib.pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from xgboost import XGBRegressor

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/3jnf3000vwaxtcg/boston_housing.zip?dl=1", directory="data/boston_housing")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Gaussian Process Regression: An Example

In this notebook we are going to apply Gaussian process regression to an actual dataset: the Boston housing dataset, where the goal is to predict the median value of homes in an area (the `medv` column) from a set of input attributes. We'll start by loading and preprocessing the data. Since we have done this many times and there is nothing special about the preprocessing we are going to apply, the code of the next cell is hidden.



In [None]:
#@title -- Loading Data into X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
df = pd.read_csv("data/boston_housing/housing.csv")
display(df.head())

# we discretize the target variable and use the result for stratification
kbins = KBinsDiscretizer(10, encode='ordinal')
y_stratify = kbins.fit_transform(df[["medv"]])
df_train, df_test = train_test_split(df, stratify=y_stratify,
                        test_size=0.25, random_state=4)

# we split the columns into categorical and numeric inputs and the output
categorical_inputs = ['chas']
numeric_inputs = ['crim', 'zn', 'indus', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat']
output = ["medv"]

# we create our preprocessing pipeline
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder(categories='auto')),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

output_preproc = StandardScaler()

# we fit the pipeline on the train set and then apply it to both train and test
X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = output_preproc.fit_transform(df_train[output])
X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = output_preproc.transform(df_test[output])

---
### Task 1: Create the Gaussian Process Regressor

**Use the concepts illustrated in the previous notebook to create a Gaussian process regressor with an RBF + white noise kernel and fit it to the training data.** 

---


In [None]:
kernel = # ----
model = # ----

model.fit(X_train, Y_train)
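
As a point of reference, here is a minimal sketch of what such a regressor could look like. It is one possible setup, not necessarily the intended solution. It uses a small synthetic dataset (`X_demo`, `Y_demo`) in place of `X_train` and `Y_train` so that it runs on its own, and the initial kernel hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for the standardized training data (assumption: in the
# notebook, X_train and Y_train from the preprocessing cell would be used).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
Y_demo = np.sin(X_demo[:, [0]]) + 0.1 * rng.normal(size=(60, 1))

# RBF models the smooth underlying function; WhiteKernel models i.i.d.
# observation noise. The initial values are only starting points: fit()
# optimizes them by maximizing the log marginal likelihood.
kernel_demo = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
model_demo = GaussianProcessRegressor(kernel=kernel_demo, random_state=0)
model_demo.fit(X_demo, Y_demo)

# kernel_ holds the kernel with the optimized hyperparameters
print(model_demo.kernel_)
```

Inspecting `kernel_` after fitting shows the length scale and noise level that the optimizer settled on, which is a quick sanity check that the fit did not collapse to a pure-noise explanation.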

### Testing the Model

As our next step we are going to do our standard evaluation on the testing set and we will display the histogram of outputs and errors.



In [None]:
#@title -- Testing -- { display-mode: "form" }
# note: y_test (lowercase) holds the model's predictions; Y_test holds the true targets
y_test = model.predict(X_test)
min_output = np.min(Y_test)
max_output = np.max(Y_test)

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test, y_test)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test, y_test)
print("MAE = {}".format(mae))

plt.figure(figsize=(8, 6))
error_histogram(Y_test, y_test, Y_fit_scaling=Y_train)
# plt.savefig("output/error_output_histogram.pdf", bbox_inches='tight', pad_inches=0)
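
Note that `Y_test` was standardized by `output_preproc`, so the MSE and MAE printed above are expressed in standardized units rather than in the units of `medv`. The sketch below shows how errors could be mapped back to the original units with `StandardScaler.inverse_transform`. The numbers are made up for illustration and `scaler` merely plays the role of `output_preproc`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Hypothetical target values in original units (illustrative numbers only).
y_orig = np.array([[10.0], [20.0], [30.0], [40.0]])
scaler = StandardScaler().fit(y_orig)  # stands in for output_preproc

y_true_std = scaler.transform(y_orig)
y_pred_std = y_true_std + 0.1  # pretend predictions, off by 0.1 standardized units

# Map targets and predictions back to the original units before scoring.
y_true = scaler.inverse_transform(y_true_std)
y_pred = scaler.inverse_transform(y_pred_std.reshape(-1, 1))

mae_std = mean_absolute_error(y_true_std, y_pred_std)
mae_orig = mean_absolute_error(y_true, y_pred)
print("MAE (standardized) = {}".format(mae_std))
print("MAE (original units) = {}".format(mae_orig))
```

Because standardization is a linear map, the MAE in original units is simply the standardized MAE multiplied by the scaler's `scale_`. The MSE scales with the square of that factor.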

### Comparison with a Baseline Model

To get an idea of how good these results are, let's compare them with a method that we already know: XGBoost, which was one of our best models for structured datasets. We will create and fit an `XGBRegressor` with its default hyperparameters and then evaluate its performance on the test set just as we did with the GP regressor.



In [None]:
xgb = XGBRegressor()
xgb.fit(X_train, Y_train)

In [None]:
#@title -- Testing -- { display-mode: "form" }
# note: y_test (lowercase) holds the model's predictions; Y_test holds the true targets
y_test = xgb.predict(X_test)

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test, y_test)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test, y_test)
print("MAE = {}".format(mae))

plt.figure(figsize=(8, 6))
error_histogram(Y_test, y_test, Y_fit_scaling=Y_train)
# plt.savefig("output/error_output_histogram.pdf", bbox_inches='tight', pad_inches=0)

In our case, the GP regressor appears to perform better than XGBoost. However, we should not draw any general conclusions from this. First, we did not tune the hyperparameters of either model. Second, a meaningful comparison of the two methods would require evaluating them on a larger number of different datasets.

In any case, the comparison indicates that the results we achieved using GP regression were not bad.
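
A single train/test split is also a noisy basis for comparison. As a rough sketch of a somewhat more robust procedure, the example below compares two regressors with 5-fold cross-validation. It makes several assumptions: the data are synthetic, scikit-learn's `GradientBoostingRegressor` is used as a stand-in for `XGBRegressor` so that the example does not depend on `xgboost`, and both models keep their default hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: replace with the real inputs/targets).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=80)

candidates = {
    "GP": GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0),
    # stand-in for XGBRegressor, not the same implementation
    "boosting": GradientBoostingRegressor(random_state=0),
}

# Use the same folds for both models so that the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, estimator in candidates.items():
    # scikit-learn reports the negated MAE so that higher is always better
    neg_mae = cross_val_score(estimator, X_demo, y_demo, cv=cv,
                              scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae
    print("{}: MAE = {:.3f} +/- {:.3f}".format(
        name, scores[name].mean(), scores[name].std()))
```

The spread across folds gives a first impression of how much of the difference between the two models could be due to the particular split. It does not replace hyperparameter tuning or evaluation on multiple datasets.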




