# Advanced usage

In this notebook we detail the advanced parameters of the Gaussian process.

In [1]:
from fastai.tabular.all import *
from tabularGP import tabularGP_learner

## Data

We build a regression problem on a subset of the adult dataset:

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv').sample(1000)
procs = [FillMissing, Normalize, Categorify]

In [3]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'age'

In [4]:
data = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=dep_var)

## Number of training points

`nb_training_points` is the number of training points, selected from the training set, used to build the Gaussian process.

More points usually lead to a better fit (although we have sometimes seen a small number of points produce very accurate results), but the computing time grows as `O(n^3)` in the number of training points used, so in practice you cannot use more than a few thousand points.

By default the model uses 4000 points (or fewer if the training set contains fewer than 4000 points).
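
To get a feel for the cubic scaling mentioned above, here is a back-of-the-envelope sketch. The helper names are hypothetical, and the memory estimate assumes a dense `n x n` kernel matrix stored as 32-bit floats, which may not match what the library does internally.

```python
def relative_cost(n_points, baseline_points):
    """Cost ratio under cubic scaling: (n / baseline)^3."""
    return (n_points / baseline_points) ** 3

def kernel_matrix_megabytes(n_points, bytes_per_value=4):
    """Memory of a dense n x n matrix (assumption: 32-bit floats)."""
    return n_points ** 2 * bytes_per_value / 1e6

# Compare a few training set sizes against a 50-point baseline
for n in (50, 500, 4000):
    print(n, relative_cost(n, 50), kernel_matrix_megabytes(n))
```

Under these assumptions, going from 50 to 4000 points multiplies the cubic term by 512000, which is why the number of training points is capped.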

In [5]:
learn = tabularGP_learner(data, nb_training_points=50)
learn.fit_one_cycle(5, lr_max=1e-3)

epoch,train_loss,valid_loss,time
0,6.82555,6.761554,00:01
1,120.942329,16.511131,00:00
2,77.82135,18.561262,00:00
3,56.860687,17.061523,00:00
4,44.438683,16.732986,00:00


## Training points selection

By default, the training points are selected from the training set with the [k-means++ algorithm](https://en.wikipedia.org/wiki/K-means%2B%2B) (unless the full training set is used, in which case no selection is needed).
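
As an illustration of the idea behind k-means++ selection, here is a minimal pure-Python sketch of the seeding step. It is not the library's implementation: the function names, the Euclidean distance, and the toy points are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_select(points, k, seed=0):
    """Return the indices of k points chosen by k-means++ seeding."""
    rng = random.Random(seed)
    # The first point is picked uniformly at random
    chosen = [rng.randrange(len(points))]
    # Squared distance from every point to its nearest chosen point
    nearest = [squared_distance(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        if sum(nearest) == 0:  # only duplicates of chosen points remain
            break
        # Far-away points are more likely to be picked next
        idx = rng.choices(range(len(points)), weights=nearest)[0]
        chosen.append(idx)
        nearest = [min(d, squared_distance(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
selected = kmeanspp_select(points, k=3)
print(selected)
```

Because each new point is drawn with probability proportional to its squared distance from the points already chosen, the selection tends to spread over the input space rather than cluster in dense regions.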

You can instead use randomly selected training points by setting `use_random_training_points` to `True` (this is useful if you want to train several models and keep the best one).

In [6]:
learn = tabularGP_learner(data, use_random_training_points=True)
learn.fit_one_cycle(5, lr_max=1e-3)

epoch,train_loss,valid_loss,time
0,7.585739,7.944963,00:02
1,7.467497,7.64436,00:02
2,7.265495,7.592858,00:02
3,7.119409,7.574205,00:02
4,7.072562,7.583457,00:02
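
The best-of-several idea mentioned above can be sketched with a small generic helper. `best_of` and `fake_train_and_score` are hypothetical names; the stand-in scoring function only returns a placeholder number so that the sketch runs on its own. In practice the callable would build a learner with `use_random_training_points=True`, fit it, and return its validation loss.

```python
import random

def best_of(n_trials, train_and_score):
    """Call train_and_score(trial) n_trials times and keep the lowest score."""
    best_trial, best_score = None, float("inf")
    for trial in range(n_trials):
        score = train_and_score(trial)
        if score < best_score:
            best_trial, best_score = trial, score
    return best_trial, best_score

# Stand-in for a real training run: returns a placeholder score,
# not an actual validation loss.
def fake_train_and_score(trial):
    return random.Random(trial).uniform(7.0, 8.0)

trial, score = best_of(5, fake_train_and_score)
print(trial, score)
```

Since each run draws a different random subset of training points, keeping the run with the lowest validation loss is a cheap way to reduce the variance introduced by the random selection.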


## Fit training points

By setting `fit_training_inputs` and `fit_training_outputs` to `True`, you can fit both the Gaussian process and its training points.

This is a good way to get better performance from a Gaussian process with very few training points, but it might make the model harder to train.
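
To make the idea of fitting training outputs concrete, here is a toy sketch in plain Python, not the library's implementation. It assumes a one-dimensional zero-mean Gaussian process with a fixed squared-exponential kernel and two stored points, and adjusts the stored outputs by gradient descent on the mean squared error. All names and values are illustrative.

```python
import math

def rbf(a, b):
    """Squared-exponential kernel with unit length scale."""
    return math.exp(-0.5 * (a - b) ** 2)

def weights(x, xs, jitter=1e-4):
    """w(x) = k(x)^T (K + jitter*I)^-1 for two stored points."""
    k11 = rbf(xs[0], xs[0]) + jitter
    k22 = rbf(xs[1], xs[1]) + jitter
    k12 = rbf(xs[0], xs[1])
    det = k11 * k22 - k12 * k12
    k0, k1 = rbf(x, xs[0]), rbf(x, xs[1])
    return ((k0 * k22 - k1 * k12) / det, (k1 * k11 - k0 * k12) / det)

def mse(stored_ys, xs, data):
    """Mean squared error of the posterior mean on (x, target) pairs."""
    total = 0.0
    for x, target in data:
        w0, w1 = weights(x, xs)
        total += (w0 * stored_ys[0] + w1 * stored_ys[1] - target) ** 2
    return total / len(data)

def fit_outputs(stored_ys, xs, data, lr=0.1, steps=200):
    """Gradient descent on the stored outputs; the kernel stays fixed."""
    ys = list(stored_ys)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, target in data:
            w = weights(x, xs)
            err = w[0] * ys[0] + w[1] * ys[1] - target
            grad[0] += 2 * err * w[0] / len(data)
            grad[1] += 2 * err * w[1] / len(data)
        ys = [ys[0] - lr * grad[0], ys[1] - lr * grad[1]]
    return ys

xs = [0.0, 2.0]  # stored training inputs (kept fixed here)
data = [(x, math.sin(x)) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
raw_ys = [math.sin(x) for x in xs]  # stored outputs before fitting
fitted_ys = fit_outputs(raw_ys, xs, data)
print(mse(raw_ys, xs, data), mse(fitted_ys, xs, data))
```

With only two stored points, the raw outputs interpolate the data at those points but fit poorly in between. Letting the outputs move lowers the overall error, which is the effect that matters when very few training points are available.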

In [7]:
learn = tabularGP_learner(data, fit_training_inputs=True, fit_training_outputs=True)
learn.fit_one_cycle(5, lr_max=1e-3)

epoch,train_loss,valid_loss,time
0,7.167955,7.468031,00:02
1,6.997549,7.194853,00:02
2,6.84012,7.108954,00:02
3,6.716816,7.081225,00:02
4,6.669833,7.086506,00:02


## Noise

The `noise` parameter models the noise in the output expressed as a fraction of its standard deviation: `0.01` would be 1% of the standard deviation.

This parameter is worth exploring as it has a strong impact on the smoothing of the output.
By default it is set to `1e-2` (it cannot be set to 0 as this would lead to numerical instability).
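
The smoothing effect can be seen on a toy example. The sketch below uses the textbook posterior mean of a zero-mean Gaussian process, `k(x)^T (K + sigma^2 I)^-1 y`, with two training points. Setting `sigma` to `noise` times the standard deviation of the outputs is an assumption based on the description above, and the library's exact formulation may differ.

```python
import math
import statistics

def rbf(a, b):
    """Squared-exponential kernel with unit length scale."""
    return math.exp(-0.5 * (a - b) ** 2)

def posterior_mean(xs, ys, x_star, noise):
    """Posterior mean at x_star for a zero-mean GP with two training points."""
    # Assumption: noise is a fraction of the standard deviation of the outputs
    noise_std = noise * statistics.pstdev(ys)
    k11 = rbf(xs[0], xs[0]) + noise_std ** 2
    k22 = rbf(xs[1], xs[1]) + noise_std ** 2
    k12 = rbf(xs[0], xs[1])
    det = k11 * k22 - k12 * k12
    # alpha = (K + sigma^2 I)^-1 y, using the closed-form 2x2 inverse
    a0 = (k22 * ys[0] - k12 * ys[1]) / det
    a1 = (k11 * ys[1] - k12 * ys[0]) / det
    return rbf(x_star, xs[0]) * a0 + rbf(x_star, xs[1]) * a1

xs, ys = [0.0, 1.0], [1.0, -1.0]
# Prediction at the first training input for increasing noise levels
means = [posterior_mean(xs, ys, 0.0, noise) for noise in (1e-2, 1e-1, 1.0)]
print(means)
```

With a small noise value the prediction at a training input stays close to the observed output, while larger values pull it toward the prior mean of zero, which produces a smoother fit.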

In [8]:
learn = tabularGP_learner(data, noise=1e-1)
learn.fit_one_cycle(5, lr_max=1e-3)

epoch,train_loss,valid_loss,time
0,3.982376,4.019459,00:02
1,3.979748,4.015594,00:02
2,3.977838,4.013403,00:02
3,3.975542,4.01197,00:02
4,3.97405,4.011341,00:02
