# Basic usage

In this notebook, we illustrate the basic functionality of the tabularGP library (its interface is purposefully similar to [fastai's tabular model](https://docs.fast.ai/tutorial.tabular)).

## Data and dependencies

Loads both fastai (for the training and data cleaning) and tabularGP (for the model):

In [1]:
from fastai.tabular.all import *
from tabularGP import tabularGP_learner

Loads a subset of the `Adult` dataset:

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv').sample(1000)
procs = [Categorify, FillMissing, Normalize]

## Regression

### Problem definition

In [3]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'salary']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'age'

In [4]:
data = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=dep_var)

### Model

Creates a Gaussian process model:

In [5]:
learn = tabularGP_learner(data, metrics=rmse)

In [6]:
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,_rmse,time
0,7.593425,7.485846,11.393297,00:03
1,7.174219,7.047753,11.409024,00:02
2,6.869809,6.933912,11.42774,00:02
3,6.700057,6.906411,11.436115,00:02
4,6.603338,6.915399,11.436819,00:02


Notice that here the error increases slightly while the loss decreases. This can happen because the loss and the error are not directly correlated (the loss also takes the calibration of the model into account: its resistance to overfitting and its ability to estimate its uncertainty).
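To make the point above concrete, here is a minimal sketch that assumes the loss behaves like a Gaussian negative log-likelihood (an assumption made for illustration; tabularGP's actual loss may differ). The function name `gaussian_nll` and all numbers are hypothetical. Two predictions with the same error get different losses depending on the standard deviation they report:

```python
import math

def gaussian_nll(y_true, mean, std):
    """Negative log-likelihood of y_true under a Gaussian N(mean, std^2)."""
    var = std ** 2
    return 0.5 * math.log(2 * math.pi * var) + (y_true - mean) ** 2 / (2 * var)

# Illustrative numbers only (not taken from the training run above).
y_true, mean = 30.0, 40.0  # the prediction error is 10 in both cases

# Same mean, hence same error, but different claimed uncertainty.
overconfident = gaussian_nll(y_true, mean, std=2.0)
calibrated = gaussian_nll(y_true, mean, std=10.0)

print(f"overconfident loss: {overconfident:.3f}")
print(f"calibrated loss:    {calibrated:.3f}")
```

Under this assumption, the prediction whose standard deviation matches its actual error receives the lower loss, even though an error metric such as RMSE cannot tell the two apart.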

### Uncertainty

Gaussian processes produce a mean (the usual output) and a standard deviation (modeling the uncertainty of the result).
Here we print the mean. Note that the current version of fastai requires that we use `[1:]` to exclude our target column from the input of `predict`.
We store the standard deviation in a `.stdev` member of the output of our model's `forward` function; it cannot be seen here because it is erased by fastai's `predict` method.

In [8]:
_, _, prediction = learn.predict(df.iloc[0][1:])
print("mean: ", prediction.item())

mean:  40.28834533691406
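Once a standard deviation is available, it can be combined with the mean into an interval. The following is a minimal sketch that assumes a Gaussian predictive distribution. The helper `prediction_interval` and the value of `stdev` are hypothetical: the run above does not expose the standard deviation.

```python
def prediction_interval(mean, stdev, z=1.96):
    """Approximate 95% interval, assuming a Gaussian predictive distribution."""
    return mean - z * stdev, mean + z * stdev

mean = 40.28834533691406  # mean printed by the cell above
stdev = 5.0               # hypothetical value, for illustration only

low, high = prediction_interval(mean, stdev)
print(f"mean: {mean:.2f}, interval: [{low:.2f}, {high:.2f}]")
```

A wider interval signals that the model is less certain about that particular row.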


## Classification

### Problem definition

In [23]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt', 'age']
dep_var = 'salary'

In [24]:
data = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=dep_var, bs=64)

### Model

Creates a Gaussian process model (notice that the only difference with the regression problem is the optional choice of a different metric):

In [25]:
learn = tabularGP_learner(data, metrics=accuracy)

In [26]:
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,99.452339,90.606194,0.855,00:02
1,158.20134,93.847961,0.855,00:02
2,126.4226,92.377396,0.855,00:02
3,114.720161,91.908127,0.855,00:02
4,104.662125,91.968597,0.855,00:02


### Uncertainty

Classification models also have a standard deviation but, following the PyTorch convention, the outputs are raw logits and not genuine probabilities (hence the means might not sum to one):

In [27]:
_, _, prediction = learn.predict(df.iloc[0][:-1])
for class_idx in range(prediction.size(0)): print("class", class_idx, ":", prediction[class_idx].item())

ValueError: Shape of passed values is (1, 12), indices imply (1, 10)
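Raw logits such as those described in this section are commonly mapped to probabilities with a softmax. The sketch below uses hypothetical logit values and plain Python. Whether a softmax over the means is the right reading of tabularGP's outputs is an assumption, and it ignores the standard deviation entirely.

```python
import math

def softmax(logits):
    """Map raw logits to probabilities that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, -0.7]  # hypothetical per-class means, for illustration only
probs = softmax(logits)

for class_idx, p in enumerate(probs):
    print("class", class_idx, ":", p)
```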