# Basic usage

In this notebook, we illustrate the basic functionality of the tabularGP library (note that its API is purposefully similar to [fastai's tabular model](https://docs.fast.ai/tabular.html)).

## Data and dependencies

Loads both fastai (for the training and data cleaning) and tabularGP (for the model):

Loads a subset of the `Adult` dataset:

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv').sample(1000)
procs = [FillMissing, Normalize, Categorify]

## Regression

### Problem definition

In [8]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'salary']
cont_names = ['education-num', 'fnlwgt']
dep_var = 'age'

In [9]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var, label_cls=FloatList)
                  .databunch())

### Model

Creates a Gaussian process model:

In [10]:
learn = tabularGP_learner(data, metrics=rmse)

In [11]:
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,root_mean_squared_error,time
0,8.169525,8.8255,10.893338,00:02
1,7.681621,7.941566,10.887447,00:02
2,7.300935,7.749237,10.902524,00:02
3,7.048365,7.698985,10.911494,00:02
4,6.884355,7.707963,10.912577,00:02


Notice that here the error increases slightly while the loss decreases. This can happen because the loss and the error are not directly tied to each other: the loss also takes into account the calibration of the model, its resistance to overfitting and its ability to estimate its own uncertainty.
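To make this concrete, here is a minimal sketch, assuming the loss behaves like a Gaussian negative log-likelihood. That is an assumption made for illustration only: the exact loss used by tabularGP is not shown in this notebook, and the helpers `root_mean_squared_error` and `gaussian_nll` below are hypothetical, standard-library-only functions. The toy numbers are made up to show a model with a slightly worse error but a better loss because its standard deviation matches its actual errors.

```python
import math

def root_mean_squared_error(targets, means):
    """Error of the mean predictions only; ignores the predicted uncertainty."""
    return math.sqrt(sum((t - m) ** 2 for t, m in zip(targets, means)) / len(targets))

def gaussian_nll(targets, means, stds):
    """Average negative log-likelihood of the targets under N(mean, std^2).

    Assumption: used here as a stand-in for an uncertainty-aware loss.
    """
    total = 0.0
    for t, m, s in zip(targets, means, stds):
        total += 0.5 * math.log(2 * math.pi * s ** 2) + (t - m) ** 2 / (2 * s ** 2)
    return total / len(targets)

# Toy data (made up for illustration).
targets = [30.0, 40.0, 50.0]

# Model A: slightly better means, but overconfident (standard deviation too small).
means_a, stds_a = [35.0, 40.0, 45.0], [1.0, 1.0, 1.0]
# Model B: slightly worse means, but a standard deviation in line with its errors.
means_b, stds_b = [35.5, 40.0, 44.5], [5.0, 5.0, 5.0]

rmse_a = root_mean_squared_error(targets, means_a)
rmse_b = root_mean_squared_error(targets, means_b)
nll_a = gaussian_nll(targets, means_a, stds_a)
nll_b = gaussian_nll(targets, means_b, stds_b)

print("model A: error", rmse_a, "loss", nll_a)
print("model B: error", rmse_b, "loss", nll_b)
```

In this toy setting, model B has the larger error but the smaller loss, which is the same pattern as in the training log above.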

### Uncertainty

Gaussian processes produce a mean (the usual output) and a standard deviation (modeling the uncertainty on the result).
Here they are stored at indices 0 and 1, respectively, of the last dimension of the tensor output by the model:

In [7]:
_, _, prediction = learn.predict(df.iloc[0])
mean = prediction[..., 0]
standard_deviation = prediction[..., 1]
print(mean.item(), "±", standard_deviation.item())

40.21514892578125 ± 4.060330390930176
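A mean and a standard deviation can be turned into an interval. The sketch below assumes that the predictive distribution is Gaussian and that both numbers are expressed in the units of the target; `prediction_interval` is a hypothetical helper built on the standard library, not part of tabularGP. It reuses the two values printed above.

```python
from statistics import NormalDist

def prediction_interval(mean, standard_deviation, coverage=0.95):
    """Central interval containing `coverage` of a Gaussian N(mean, std^2).

    Assumption: the model's predictive distribution is Gaussian.
    """
    predictive = NormalDist(mu=mean, sigma=standard_deviation)
    tail = (1.0 - coverage) / 2.0
    return predictive.inv_cdf(tail), predictive.inv_cdf(1.0 - tail)

# Values copied from the output printed above.
low, high = prediction_interval(40.21514892578125, 4.060330390930176)
print(f"95% interval for the age: [{low:.2f}, {high:.2f}]")
```

A wider interval signals that the model is less sure about that particular row.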


## Classification

### Problem definition

In [3]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['education-num', 'fnlwgt', 'age']
dep_var = 'salary'

In [4]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                  .split_by_rand_pct()
                  .label_from_df(cols=dep_var)
                  .databunch(bs=63))

### Model

Creates a Gaussian process model (notice that the only difference with the regression problem is the optional choice of a different metric):

In [6]:
learn = tabularGP_learner(data, metrics=accuracy)

In [7]:
learn.fit_one_cycle(5, max_lr=1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.985803,3.42232,0.775,00:02
1,2.909588,3.230241,0.78,00:02
2,3.026408,6.555335,0.785,00:02
3,3.826358,6.593029,0.805,00:02
4,4.175246,6.743631,0.805,00:02


### Uncertainty

In [12]:
from tabularGP.loss_functions import gp_softmax

Classification models also output a standard deviation but, following the PyTorch convention, the output is a raw logit rather than a genuine probability (hence the means might not sum to one):

In [13]:
_, _, prediction = learn.predict(df.iloc[0])
mean = prediction[..., 0]
standard_deviation = prediction[..., 1]
for class_idx in range(mean.size(0)):
    print("class", class_idx, ":", mean[class_idx].item(), "±", standard_deviation[class_idx].item())

class 0 : 0.8852893710136414 ± 0.06899724155664444
class 1 : 0.11504742503166199 ± 0.06760115176439285


The proper way to get probabilities is to apply `gp_softmax` to your raw output (as you would apply a `softmax` to a traditional classification output):

In [16]:
probabilities = gp_softmax(prediction)[0]
for class_idx in range(probabilities.size(0)):
    print("class", class_idx, ":", probabilities[class_idx].item())

class 0 : 1.0
class 1 : 0.0
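Probabilities this sharp may look surprising given raw means of roughly 0.89 and 0.12. The internals of `gp_softmax` are not shown in this notebook, so the following is not a reimplementation of it. It is only a sketch, under the assumption that the two class outputs are independent Gaussians, of how taking the small standard deviations into account can lead to a near-certain answer where a plain softmax on the means alone would not.

```python
import math
from statistics import NormalDist

# Raw means and standard deviations copied from the output printed earlier.
means = [0.8852893710136414, 0.11504742503166199]
stds = [0.06899724155664444, 0.06760115176439285]

# A plain softmax on the means ignores the standard deviations entirely.
exps = [math.exp(m) for m in means]
plain_softmax = [e / sum(exps) for e in exps]

# Assumption: independent Gaussian outputs. The difference of two independent
# Gaussians is Gaussian, with the variances added together.
difference = NormalDist(
    mu=means[0] - means[1],
    sigma=math.sqrt(stds[0] ** 2 + stds[1] ** 2),
)
# Probability that the class 0 output is larger than the class 1 output.
prob_class0_largest = 1.0 - difference.cdf(0.0)

print("plain softmax on the means:", plain_softmax)
print("P(class 0 output > class 1 output):", prob_class0_largest)
```

The gap between the two means is several standard deviations wide, so under this assumption class 0 wins almost surely, while the plain softmax on the means gives it only about 0.68.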
