# Kernel selection
In this notebook we illustrate the selection of a kernel for a gaussian process.

The kernel is there to modelize the similarity between two points in the input space and, as far as gaussian process are concerned, it can make or break the algorithm.

In [1]:
from fastai.tabular import *
from tabularGP import tabularGP_learner
from tabularGP.kernel import *

## Data

Builds a regression problem on a subset of the adult dataset:

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv').sample(1000)
procs = [FillMissing, Normalize, Categorify]

In [3]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'age'

In [4]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var, label_cls=FloatList)
                  .databunch())

## Tabular kernels

By default, tabularGP uses one kernel type for each continuous features (a [gaussian kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel)) and one kernel type for each categorial features (an [index kernel](https://gpytorch.readthedocs.io/en/latest/kernels.html#indexkernel)).  
Using those kernels we can compute the similarity between the individual coordinates of two points, those similarity are them combined with what we call a tabular kernel.

The simplest kernel is the `WeightedSumKernel` kernel which computes a weighted sum of the feature similarities.  
It is equivalent to a `OR` type of relation: if two points have at least one feature that is similar then they will be considered close in the input space (even if all the other features are very dissimilar).

In [5]:
learn = tabularGP_learner(data, kernel=WeightedSumKernel)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,15.438261,10.809207,00:02
1,14.74088,9.228584,00:02
2,13.860456,10.228907,00:02
3,13.121478,12.005603,00:02
4,12.606131,13.321619,00:02


Then there is the `WeightedProductKernel` kernel which computes a weighted geometric mean (weighted product) of the feature similarities.  
It is equivalent to a `AND` type of relation: all features need to be similar to consider two points similar in the input space.
It is a good kernel to use when features are all continuous and similar (i.e. the `x,y` plane for a function).

In [6]:
learn = tabularGP_learner(data, kernel=WeightedProductKernel)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,3.154345,5.47148,00:02
1,3.125328,5.225213,00:02
2,3.088685,5.099967,00:02
3,3.057838,5.057697,00:02
4,3.036019,5.05503,00:02


The default tabular kernel is a `ProductOfSumsKernel` which modelise a combinaison of the form: $$s = \prod_i{(\sum_j{\beta_j * s_j})^{\alpha_i}}$$
It is equivalent to a `WeightedProductKernel` put on top of a `WeightedSumKernel` kernel.
This kernel is extremely flexible and recommended when you have a mix of continuous and categorial features.

In [7]:
learn = tabularGP_learner(data, kernel=ProductOfSumsKernel)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,7.631459,8.160049,00:02
1,7.445803,7.749929,00:02
2,7.264214,7.647715,00:02
3,7.125649,7.610725,00:02
4,7.03003,7.617285,00:02


It is important to note that the choice of the tabular kernel can have a drastic impact on your loss and that you should probably always test all available kernels to find the one that is most suited to your particular problem.

Note that it is fairly easy to design your own `TabularKernel`, following the examples in the [kernel.py](https://github.com/nestordemeure/tabularGP/blob/master/kernel.py) file (while the `feature importance` property is useful, it is optionnal), in order to better accomodate the particular structure of your problem.

## Feature kernels

In [8]:
from tabularGP.loss_functions import *
from tabularGP import *

There are four continuous kernel provided:

- `ExponentialKernel` which is zero differentiable
- `Matern1Kernel` which is once differentiable
- `Matern2Kernel` which is twice differentiable
- `GaussianKernel` (the default) which is infinitely differentiable

The more differentiable a kernel is and the smoother the modelized function will be.

There are two categorial kernel provided:

- `HammingKernel` which consider different elements of a category as have a similarity of zero
- `IndexKernel` (the default) which consider that different elements can still be similar

While the choice of feature kernel tend to be less impactful, you can manually select them if you build your model yourself:

In [9]:
model = TabularGPModel(data, kernel=WeightedProductKernel, cont_kernel=ExponentialKernel, cat_kernel=HammingKernel)
loss_func = gp_gaussian_marginal_log_likelihood # would have used `gp_is_greater_log_likelihood` for classification
learn = TabularGPLearner(data, model, loss_func=loss_func)
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,1.551779,5.07584,00:00
1,1.519318,5.100527,00:00
2,1.467494,5.112877,00:00
3,1.415412,5.107607,00:00
4,1.376152,5.10739,00:00


It is also fairly easy to provide your own feature kernel to modelize behaviour specific to your data (periodicity, trends, etc).

To learn more about the implementation of kernels adapted to a particular problem, we recommend the chapter two (*Expressing Structure with Kernels*) and three (*Automatic Model Construction*) of the very good [Automatic Model Construction with Gaussian Processes](http://www.cs.toronto.edu/~duvenaud/thesis.pdf).

## Transfer learning

Kernels model the input space, as such they can be reused from an output type to another in order to tranfert domain knowledge and speed up training.

Here is a classification problem using the same input features (different features would lead to a crash as the input space would be different):

In [10]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'salary'

In [11]:
data_classification = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var)
                  .databunch())

We can reuse the kernel from our regression task by passing the learner, model or trained kernel to the `kernel` argument of our builder:

In [12]:
learn_classification = tabularGP_learner(data, kernel=learn)
learn_classification.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,1.551779,4.969957,00:00
1,1.519319,5.023561,00:00
2,1.467495,5.075471,00:00
3,1.415412,5.102517,00:00
4,1.376152,5.107355,00:00


Note that, by default, the kernel is frozen when transfering knowledge. Lets unfreeze it now that the rest of the gaussian process is trained:

In [13]:
learn_classification.unfreeze(kernel=True)
learn_classification.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,1.279576,5.115157,00:00
1,1.236599,5.13907,00:00
2,1.166112,5.157554,00:00
3,1.093386,5.170861,00:00
4,1.037871,5.173307,00:00
