# Advanced usage
This notebook details the advanced parameters of the Gaussian process.

In [1]:
from fastai.tabular import *
from tabularGP import tabularGP_learner

## Data

We build a regression problem (predicting `age`) on a random sample of 1000 rows from the Adult dataset:

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv').sample(1000)
procs = [FillMissing, Normalize, Categorify]

In [3]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'age'

In [4]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var, label_cls=FloatList)
                  .databunch())

## Number of training points

`nb_training_points` is the number of training points, selected from the training set, used to build the Gaussian process.

More points usually lead to a better fit (although we have sometimes seen a small number of points produce very accurate results), but the computing time grows as `O(n^3)` in the number of training points used, so in practice you cannot use more than a few thousand points.

By default the model uses 4000 points (or fewer if the training set contains fewer than 4000 points).
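
To get a feel for the cubic growth, the sketch below counts the multiply-add operations of a textbook Cholesky factorization of an `n x n` matrix. It assumes that the cost of the Gaussian process is dominated by factorizing the kernel matrix of the training points, which is the usual bottleneck for exact Gaussian processes. `cholesky_multiply_adds` is a hypothetical helper written for this illustration; it is not part of tabularGP, whose actual solver may differ.

```python
def cholesky_multiply_adds(n):
    """Count the multiply-adds of a textbook Cholesky factorization of an n x n matrix."""
    ops = 0
    for j in range(n):
        ops += j                # dot product of length j for the diagonal entry of column j
        ops += (n - j - 1) * j  # one dot product of length j per entry below the diagonal
    return ops

small = cholesky_multiply_adds(50)    # same size as nb_training_points=50
large = cholesky_multiply_adds(4000)  # same size as the default number of points
ratio = large / small
print(f"{small} vs {large} multiply-adds, a ratio of about {ratio:.0f}")
```

Under these assumptions, going from 50 to 4000 points multiplies the number of points by 80 but the factorization work by roughly `80^3`, which is why the number of training points has to stay in the low thousands.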

In [5]:
learn = tabularGP_learner(data, nb_training_points=50)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,643.128296,11.510853,00:01
1,286.69577,11.210815,00:01
2,985.787109,12.29082,00:01
3,640.537292,12.686558,00:01
4,442.588867,12.700387,00:01


## Training points selection

By default, the training points are selected from the training set with the [k-means++ algorithm](https://en.wikipedia.org/wiki/K-means%2B%2B) (unless the full training set is used, in which case no selection is needed).
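
As a rough illustration of the idea behind k-means++ seeding, the sketch below picks `k` points so that each new point is drawn with probability proportional to its squared distance to the closest point already chosen. This is a minimal pure-Python version on purely continuous toy data; `kmeanspp_select` and `squared_distance` are hypothetical names, and the selection code inside tabularGP may differ (for example in how it handles categorical features).

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_select(points, k, seed=0):
    """Return the indices of k points chosen with k-means++ seeding."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]  # the first point is uniformly random
    # squared distance from every point to its nearest chosen point
    dists = [squared_distance(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        if sum(dists) == 0:
            break  # every remaining point duplicates an already chosen one
        # far-away points are more likely to be picked; chosen points have weight 0
        idx = rng.choices(range(len(points)), weights=dists, k=1)[0]
        chosen.append(idx)
        dists = [min(d, squared_distance(p, points[idx])) for p, d in zip(points, dists)]
    return chosen

# toy data: three small groups of 2D points
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1)]
selected = kmeanspp_select(points, k=3)
print([points[i] for i in selected])
```

Compared with uniform sampling, this weighting tends to spread the selected points across the input space rather than concentrating them in dense regions.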

You can instead select the training points at random by setting `use_random_training_points` to `True` (this can be useful if you want to train several models and keep the best one).

In [7]:
learn = tabularGP_learner(data, use_random_training_points=True)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,7.209882,6.483124,00:02
1,7.045302,6.225573,00:02
2,8.193801,7.906738,00:02
3,8.126514,7.724882,00:02
4,8.014868,7.732232,00:02


## Fit training points

By setting `fit_training_inputs` and `fit_training_outputs` to `True`, you can fit both the Gaussian process and its training points.

This is a good way to get better performance from a Gaussian process with very few training points, but it might negatively affect its ability to train.

In [6]:
learn = tabularGP_learner(data, fit_training_inputs=True, fit_training_outputs=True)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,7.136113,6.360609,00:02
1,6.989767,6.123254,00:02
2,9.356528,7.962548,00:02
3,8.896605,7.808175,00:02
4,8.569292,7.824018,00:02


## Noise

The `noise` parameter models the noise in the output expressed as a fraction of its standard deviation: `0.01` would be 1% of the standard deviation.

This parameter is worth exploring, as it has a strong impact on the smoothing of the output.
By default it is set to `1e-2` (it cannot be set to 0, as that would lead to numerical instability).
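
The sketch below illustrates why a noise of 0 is numerically problematic. It assumes an RBF kernel and assumes that the noise is turned into a variance, `(noise * output_std) ** 2`, which is added to the diagonal of the kernel matrix; this is the usual construction for Gaussian process regression, but it is not taken from the tabularGP source. All function names are hypothetical.

```python
import math

def rbf(x1, x2, lengthscale=1.0):
    """RBF (squared exponential) kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def kernel_matrix(xs, noise, output_std):
    """Kernel matrix of the training inputs with the noise variance on its diagonal."""
    noise_var = (noise * output_std) ** 2  # assumed conversion from fraction to variance
    n = len(xs)
    return [[rbf(xs[i], xs[j]) + (noise_var if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

xs = [0.3, 0.3]  # two training points with identical inputs
det_no_noise = det2(kernel_matrix(xs, noise=0.0, output_std=1.0))
det_with_noise = det2(kernel_matrix(xs, noise=1e-2, output_std=1.0))
print(det_no_noise, det_with_noise)
```

With identical (or nearly identical) training inputs the noiseless matrix is singular, so it cannot be inverted reliably, whereas any positive noise keeps the determinant strictly positive. Larger noise values also let the model fit the training outputs less tightly, which is the smoothing effect mentioned above.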

In [9]:
learn = tabularGP_learner(data, noise=1e-1)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,4.045887,4.036719,00:02
1,4.043304,4.028405,00:02
2,4.03991,4.024992,00:02
3,4.03761,4.021344,00:02
4,4.035222,4.019975,00:02
