# Basic usage

In this notebook, we illustrate the basic functionality of the tabularGP library (its interface is purposefully similar to [fastai's tabular model](https://docs.fast.ai/tabular.html)).

## Data and dependencies

We load both fastai (for training and data cleaning) and tabularGP (for the model):

In [1]:
from fastai.tabular import *
from tabularGP import tabularGP_learner

We load a random sample of 1000 rows from the `Adult` dataset:

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv').sample(1000)
procs = [FillMissing, Normalize, Categorify]

## Regression

### Problem definition

In [3]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'salary']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'age'

In [4]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var, label_cls=FloatList)
                  .databunch())

### Model

We create a Gaussian process model:

In [5]:
learn = tabularGP_learner(data)

In [6]:
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,6.924572,7.045,00:02
1,6.775909,6.773074,00:02
2,6.621901,6.683449,00:02
3,6.500859,6.637858,00:02
4,6.416223,6.63686,00:03


### Uncertainty

Gaussian processes produce a mean (the usual output) and a standard deviation (which models the uncertainty of the result).
They are stored at indices 0 and 1, respectively, of the last dimension of the tensor output by the model:

In [7]:
_, _, prediction = learn.predict(df.iloc[0])
mean = prediction[..., 0]
standard_deviation = prediction[..., 1]
print(mean.item(), "±", standard_deviation.item())

40.21514892578125 ± 4.060330390930176
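
One common way to read such a mean and standard deviation is as an interval around the prediction. The sketch below is not part of tabularGP: `gaussian_interval` is a hypothetical helper, and it assumes the predictive distribution is Gaussian, so that roughly 95% of the probability mass lies within 1.96 standard deviations of the mean. The numbers are the ones printed above.

```python
def gaussian_interval(mean, std, z=1.96):
    """Return the (lower, upper) bounds of mean ± z * std.

    Hypothetical helper, not part of tabularGP. Assumes a Gaussian
    predictive distribution; z=1.96 corresponds to roughly 95% coverage.
    """
    if std < 0:
        raise ValueError("the standard deviation must be non-negative")
    return mean - z * std, mean + z * std

# mean and standard deviation copied from the output of the previous cell
lower, upper = gaussian_interval(40.21514892578125, 4.060330390930176)
print(f"approximate 95% interval for age: [{lower:.2f}, {upper:.2f}]")
```

With values taken from the model, you would pass `mean.item()` and `standard_deviation.item()` instead of the literals.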


## Classification

### Problem definition

In [8]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt', 'age']
dep_var = 'salary'

In [9]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var)
                  .databunch(bs=63))

### Model

We create a Gaussian process model (notice that nothing is done to indicate that this is a classification problem):

In [10]:
learn = tabularGP_learner(data)

In [11]:
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,time
0,2.641328,2.290921,00:02
1,4.868949,4.248219,00:02
2,4.882902,4.114988,00:02
3,4.791674,4.158504,00:02
4,4.71731,4.212892,00:02


### Uncertainty

In [12]:
from tabularGP.loss_functions import gp_softmax

Classification models also produce a standard deviation but, following the PyTorch convention, the output is a raw logit and not a genuine probability (hence the means might not sum to one):

In [13]:
_, _, prediction = learn.predict(df.iloc[0])
mean = prediction[..., 0]
standard_deviation = prediction[..., 1]
for class_idx in range(mean.size(0)): print("class", class_idx, ":", mean[class_idx].item(), "±", standard_deviation[class_idx].item())

class 0 : 0.8852893710136414 ± 0.06899724155664444
class 1 : 0.11504742503166199 ± 0.06760115176439285


The proper way to get probabilities is to apply `gp_softmax` to your raw output (as you would apply a `softmax` to a traditional classification output):

In [16]:
probabilities = gp_softmax(prediction)[0]
for class_idx in range(probabilities.size(0)): print("class", class_idx, ":", probabilities[class_idx].item())

class 0 : 1.0
class 1 : 0.0
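
For comparison, here is a minimal sketch of the ordinary `softmax` that the text above refers to, written in plain Python. It is not `gp_softmax`: it uses only the means and ignores the standard deviations, so its result is expected to differ from the probabilities printed above. The function name `plain_softmax` is an assumption made for this illustration.

```python
import math

def plain_softmax(logits):
    """Standard softmax over a list of raw logits.

    Illustration only: this is the traditional softmax, not tabularGP's
    gp_softmax, and it does not use the standard deviations.
    """
    # subtracting the maximum avoids overflow in exp() without changing the result
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# class means copied from the raw output printed earlier
means = [0.8852893710136414, 0.11504742503166199]
probabilities = plain_softmax(means)
for class_idx, p in enumerate(probabilities):
    print("class", class_idx, ":", p)
```

The outputs are non-negative and sum to one, which is the property that the raw means lack.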
