## Example: Fitting tabular data

This warm-up example uses a small dataset, chosen more or less
arbitrarily from the UCI repository, with about 45,000 datapoints. We'll
download the data, do some light preprocessing, and fit a model with an RBF kernel.

In [1]:
import os
import math
import time

import wget
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

from xGPR.xGP_Regression import xGPRegression as xGPReg
from xGPR.data_handling.dataset_builder import build_online_dataset
from xGPR.data_handling.dataset_builder import build_offline_fixed_vector_dataset

In [2]:
fname = wget.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv")
raw_data = pd.read_csv(fname)
os.remove(fname)


In [3]:
raw_data

| | RMSD | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.284 | 13558.30 | 4305.35 | 0.31754 | 162.1730 | 1.872791e+06 | 215.3590 | 4287.87 | 102 | 27.0302 |
| 1 | 6.021 | 6191.96 | 1623.16 | 0.26213 | 53.3894 | 8.034467e+05 | 87.2024 | 3328.91 | 39 | 38.5468 |
| 2 | 9.275 | 7725.98 | 1726.28 | 0.22343 | 67.2887 | 1.075648e+06 | 81.7913 | 2981.04 | 29 | 38.8119 |
| 3 | 15.851 | 8424.58 | 2368.25 | 0.28111 | 67.8325 | 1.210472e+06 | 109.4390 | 3248.22 | 70 | 39.0651 |
| 4 | 7.962 | 7460.84 | 1736.94 | 0.23280 | 52.4123 | 1.021020e+06 | 94.5234 | 2814.42 | 41 | 39.9147 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45725 | 3.762 | 8037.12 | 2777.68 | 0.34560 | 64.3390 | 1.105797e+06 | 112.7460 | 3384.21 | 84 | 36.8036 |
| 45726 | 6.521 | 7978.76 | 2508.57 | 0.31440 | 75.8654 | 1.116725e+06 | 102.2770 | 3974.52 | 54 | 36.0470 |
| 45727 | 10.356 | 7726.65 | 2489.58 | 0.32220 | 70.9903 | 1.076560e+06 | 103.6780 | 3290.46 | 46 | 37.4718 |
| 45728 | 9.791 | 8878.93 | 3055.78 | 0.34416 | 94.0314 | 1.242266e+06 | 115.1950 | 3421.79 | 41 | 35.6045 |




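Note that we can, but do not need to, rescale the
y-values -- xGPR will rescale y-values automatically.
We do standardize the x-values in the next cell, using the mean
and standard deviation of the training set only.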



In [4]:
train_data, test_data = train_test_split(raw_data, test_size = 0.2, random_state=123)

train_y, test_y = train_data["RMSD"].values, test_data["RMSD"].values
train_x, test_x = train_data.iloc[:,1:].values, test_data.iloc[:,1:].values

train_mean, train_std = train_x.mean(axis=0), train_x.std(axis=0)
train_x = (train_x - train_mean[None,:]) / train_std[None,:]

test_x = (test_x - train_mean[None,:]) / train_std[None,:]
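
The same standardization pattern can be tried on synthetic data. This is a minimal sketch, not part of xGPR: the array names are made up for illustration, and the guard against zero-variance columns is an addition that the cell above does not include.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix: 3 columns on very different scales.
features = rng.normal(loc=[0.0, 50.0, 1e6], scale=[1.0, 10.0, 1e5], size=(1000, 3))
train_part, test_part = features[:800], features[800:]

# Statistics come from the training rows only, so nothing leaks from the test rows.
col_mean = train_part.mean(axis=0)
col_std = train_part.std(axis=0)
col_std[col_std == 0] = 1.0  # guard (an addition): constant columns would divide by zero

train_scaled = (train_part - col_mean) / col_std
test_scaled = (test_part - col_mean) / col_std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately standardized, which is expected.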

Next, we'll set the data up for use as a training set by xGPR. If
the data fits in memory, as it does here, we can build an OnlineDataset
directly from the arrays. If the data is too large to fit in memory,
we can instead save it to disk in "chunks", each chunk as a .npy file
with the corresponding y-values as another .npy file, then build an
OfflineDataset. In this case, we'll build both to illustrate.

The chunk_size parameter sets how many datapoints the Dataset
will feed to xGPR at one time during training. It's a
little like a minibatch in deep learning. If you're using a
large number of random features to ensure a highly accurate model,
or if your data has a large number of features per datapoint,
set chunk_size small to avoid excessive memory consumption. This
does not affect the accuracy of the model or of training in any way;
it affects only memory use and, to some extent, speed (larger chunk
sizes are slightly faster).
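
To get a feel for the memory trade-off, here is a back-of-the-envelope sketch. It assumes the dominant buffer is one chunk_size x (number of random features) matrix stored as 8-byte floats; that is an assumption for illustration, not a statement about xGPR's internals, and `chunk_feature_bytes` is a hypothetical helper.

```python
def chunk_feature_bytes(chunk_size, num_rffs, bytes_per_value=8):
    """Bytes for one chunk_size x num_rffs matrix (assumes 8-byte floats by default)."""
    return chunk_size * num_rffs * bytes_per_value

for chunk_size in (500, 2000, 8000):
    mib = chunk_feature_bytes(chunk_size, 8192) / 2**20
    print(f"chunk_size={chunk_size:>5}: {mib:8.1f} MiB")
```

Under these assumptions, a chunk of 2000 datapoints with 8192 random features needs about 125 MiB, and the requirement grows linearly with chunk_size.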

In [5]:
online_train_data = build_online_dataset(train_x, train_y, chunk_size = 2000)

For the OfflineDataset, we'll save the data to .npy files; each file
can contain at most chunk_size datapoints. skip_safety_checks
defaults to False; when it is False, xGPR checks the data to make sure
there are no np.nan or np.inf values, that every y-file has the same
number of datapoints as its corresponding x-file, and so on.
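
To make the kinds of checks described above concrete, here is a minimal sketch of what such validation might look like for a single chunk. `check_chunk` is a hypothetical helper written for this illustration; it is not xGPR's implementation, which may check more or differently.

```python
import numpy as np

def check_chunk(x_chunk, y_chunk, chunk_size):
    """Hypothetical checks on one (x, y) chunk; returns a list of problems found."""
    problems = []
    if x_chunk.shape[0] != y_chunk.shape[0]:
        problems.append("x and y have different numbers of datapoints")
    if x_chunk.shape[0] > chunk_size:
        problems.append("chunk holds more than chunk_size datapoints")
    if not np.isfinite(x_chunk).all() or not np.isfinite(y_chunk).all():
        problems.append("chunk contains np.nan or np.inf")
    return problems

good_x, good_y = np.ones((5, 3)), np.ones(5)
bad_x = good_x.copy()
bad_x[0, 0] = np.nan

print(check_chunk(good_x, good_y, chunk_size=10))      # clean chunk
print(check_chunk(bad_x, good_y[:4], chunk_size=10))   # NaN plus a length mismatch
```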

In [6]:
chunk_size = 2000
xfiles, yfiles = [], []

for i in range(0, math.ceil(train_x.shape[0] / chunk_size)):
    xfiles.append(f"{i}_xblock.npy")
    yfiles.append(f"{i}_yblock.npy")
    start = i * chunk_size
    end = min((i + 1) * chunk_size, train_x.shape[0])
    np.save(xfiles[-1], train_x[start:end,:])
    np.save(yfiles[-1], train_y[start:end])

#For OnlineDatasets, we always use build_online_dataset.
#For OfflineDatasets, we use either build_offline_fixed_vector_dataset
#(for tabular data) or build_offline_sequence_dataset (for sequences
#and graphs).
offline_train_data = build_offline_fixed_vector_dataset(xfiles, yfiles, chunk_size = 2000,
                                                       skip_safety_checks = False)

We'll tune hyperparameters using a Bayesian approach implemented in the
``tune_hyperparams_crude_bayes`` method (``crude_bayes`` for short).
For now, we'll use 2048 random features to tune hyperparameters
and 8192 to fit the final model (fitting scales better to
larger numbers of features, and performance is more sensitive to the number
used to fit than to the number used to tune, although increasing either
boosts performance). For variance, 512 to 2048 random features is
generally fine; this merely affects the accuracy with which uncertainty
on predictions is quantified (more features give better accuracy at greater cost).
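
For readers new to random features, the sketch below shows the textbook random Fourier feature construction for an RBF kernel and how the approximation error shrinks as the number of features grows. This is a generic NumPy illustration under stated assumptions (dense Gaussian weights, a made-up lengthscale, synthetic inputs); xGPR's own feature generation may differ.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """Exact RBF kernel matrix exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_norms = (x ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale ** 2))

def random_fourier_features(x, num_rffs, lengthscale=1.0, seed=0):
    """Map x to num_rffs features whose inner products approximate the RBF kernel."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(x.shape[1], num_rffs)) / lengthscale
    phases = rng.uniform(0.0, 2.0 * np.pi, size=num_rffs)
    return np.sqrt(2.0 / num_rffs) * np.cos(x @ weights + phases)

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 9))          # synthetic inputs with 9 features
exact = rbf_kernel(x, lengthscale=3.0)

errors = {}
for num_rffs in (128, 512, 2048, 8192):
    z = random_fourier_features(x, num_rffs, lengthscale=3.0)
    errors[num_rffs] = np.abs(z @ z.T - exact).mean()
    print(f"{num_rffs:>5} random features: mean abs kernel error {errors[num_rffs]:.4f}")
```

The error falls as features are added, but each doubling buys less than the one before it.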

Note that ``crude_bayes``, like ``crude_grid`` and ``crude_lbfgs``,
uses matrix decompositions and therefore scales very poorly
to large numbers of random features. For < 5000 random features, it's reasonably
fast on GPU (for CPU, even fewer is preferable). We'll show how to fine-tune the
result from this procedure with a larger number of random features shortly.

The ``subsample = 1`` argument is the default; it merely indicates that the whole
dataset should be used. ``subsample = 0.1`` would cause xGPR to randomly sample 10% of
the training data when tuning, ``subsample = 0.5`` would sample 50%, and so on.
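
Conceptually, subsampling amounts to drawing a random fraction of the rows. The sketch below illustrates that idea only; `subsample_rows` is a hypothetical helper, and how xGPR draws its sample internally is not shown in this example.

```python
import numpy as np

def subsample_rows(x, y, subsample, seed=123):
    """Return a random fraction of the rows of x and y, drawn without replacement."""
    if not 0.0 < subsample <= 1.0:
        raise ValueError("subsample must be in (0, 1]")
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(subsample * x.shape[0])))
    idx = rng.choice(x.shape[0], size=n_keep, replace=False)
    return x[idx], y[idx]   # the same indices keep x and y aligned

x = np.arange(1000, dtype=float).reshape(500, 2)
y = np.arange(500, dtype=float)
x_sub, y_sub = subsample_rows(x, y, subsample=0.1)
print(x_sub.shape, y_sub.shape)
```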

In [7]:
uci_model = xGPReg(training_rffs = 2048, fitting_rffs = 8192, variance_rffs = 1024,
                  kernel_choice = "RBF", verbose = True, device = "gpu")

start_time = time.time()
uci_model.tune_hyperparams_crude_bayes(online_train_data, max_bayes_iter = 30, subsample = 1)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")

starting_tuning
Grid point 0 acquired.
Grid point 1 acquired.
Grid point 2 acquired.
Grid point 3 acquired.
Grid point 4 acquired.
Grid point 5 acquired.
Grid point 6 acquired.
Grid point 7 acquired.
Grid point 8 acquired.
Grid point 9 acquired.
New hparams: [-0.1576268]
Additional acquisition 10.
New hparams: [0.3133586]
Additional acquisition 11.
New hparams: [0.4027546]
Additional acquisition 12.
New hparams: [-0.9321185]
Additional acquisition 13.
New hparams: [0.8437973]
Additional acquisition 14.
New hparams: [-3.1247573]
Additional acquisition 15.
New hparams: [-5.603476]
Best score achieved: 38889.132
Best hyperparams: [-0.4226011  0.192718   0.4027546]
Tuning complete.
Wallclock: 77.255704164505


Just for fun, let's repeat this using the offline dataset. This requires
loading data from disk in batches on each iteration.

In [8]:
start_time = time.time()
uci_model.tune_hyperparams_crude_bayes(offline_train_data, max_bayes_iter = 30)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")

starting_tuning
Grid point 0 acquired.
Grid point 1 acquired.
Grid point 2 acquired.
Grid point 3 acquired.
Grid point 4 acquired.
Grid point 5 acquired.
Grid point 6 acquired.
Grid point 7 acquired.
Grid point 8 acquired.
Grid point 9 acquired.
New hparams: [-0.1576268]
Additional acquisition 10.
New hparams: [0.3133586]
Additional acquisition 11.
New hparams: [0.4027546]
Additional acquisition 12.
New hparams: [-0.8735007]
Additional acquisition 13.
New hparams: [0.8437973]
Additional acquisition 14.
New hparams: [0.1634564]
Additional acquisition 15.
New hparams: [-5.603476]
Best score achieved: 38889.132
Best hyperparams: [-0.4226011  0.192718   0.4027546]
Tuning complete.
Wallclock: 31.499005556106567


Finally, let's see what happens if we tune using L-BFGS with multiple restarts.
This is usually slower (sometimes much slower) than ``crude_bayes`` but is a little
more foolproof. You'll generally need to set ``n_restarts`` to a value like 3 or 5,
because L-BFGS is a local optimization strategy and
can get trapped in poor local minima.
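
The reason restarts help can be seen on a toy problem. In the sketch below, plain gradient descent stands in for L-BFGS (both are local methods that settle into the nearest minimum), and the one-dimensional objective is made up for illustration; none of this is xGPR code.

```python
import random

def objective(x):
    """Toy objective: a poor local minimum near 1.13 and a better one near -1.30."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    return 4 * x**3 - 6 * x + 1

def local_minimize(x0, step=0.01, n_steps=2000):
    """Plain gradient descent; like L-BFGS, it only finds a nearby minimum."""
    x = x0
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x

def minimize_with_restarts(n_restarts, seed=0):
    """Run the local optimizer from several random starts and keep the best result."""
    rng = random.Random(seed)
    best_x, best_score = None, float("inf")
    for restart in range(n_restarts):
        x = local_minimize(rng.uniform(-2.0, 2.0))
        if objective(x) < best_score:
            best_x, best_score = x, objective(x)
        print(f"Restart {restart}: x = {x:.3f}, best score so far = {best_score:.3f}")
    return best_x, best_score

best_x, best_score = minimize_with_restarts(n_restarts=5)
```

A single run started to the right of the central hump ends in the poorer minimum; a handful of random restarts makes it likely that at least one run starts in the basin of the better one.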

Note that ``crude_lbfgs`` uses matrix decompositions and therefore scales very poorly
to large numbers of random features. For < 5000 random features, it's reasonably
fast on GPU (for CPU, even fewer is preferable). Like ``crude_bayes``, it's best
used as a way to find a starting point for further optimization.
We'll show how to fine-tune the result from this procedure with a larger number
of random features shortly. Once again, just as for ``crude_bayes``, we can
specify a ``subsample`` parameter if desired.

In [9]:
uci_model = xGPReg(training_rffs = 2048, fitting_rffs = 8192, variance_rffs = 1024,
                  kernel_choice = "RBF", verbose = True, device = "gpu")

start_time = time.time()
uci_model.tune_hyperparams_crude_lbfgs(online_train_data, n_restarts = 3, subsample = 1)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")

starting_tuning
Now beginning L-BFGS minimization.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 0 completed. Best score is 38886.80149118837.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 1 completed. Best score is 38886.80149118837.
Evaluating gr

We can retrieve the resulting hyperparameters and save them somewhere
for future use if needed. ``get_hyperparams`` always returns the
log of the hyperparameters, and if you're passing
hyperparameters to the fitting function, you should pass
the log of the hyperparameters as well.

In [10]:
uci_model.get_hyperparams()

array([-0.41722132,  0.20104138,  0.39762501])
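
One simple way to keep these values for later is a NumPy round trip, sketched below. The file name is hypothetical, and a temporary directory is used only so the example cleans up after itself; the array is the log-hyperparameter vector printed above.

```python
import os
import tempfile

import numpy as np

# Log-hyperparameters as returned by get_hyperparams() above.
log_hparams = np.array([-0.41722132, 0.20104138, 0.39762501])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "uci_hyperparams.npy")  # hypothetical file name
    np.save(path, log_hparams)
    reloaded = np.load(path)

print(reloaded)
```

Because the values are stored exactly as returned (in log form), the reloaded array can be used wherever log-hyperparameters are expected.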

For fitting, we generally build a preconditioner first. There are two
exceptions: if fitting_rffs is small, we can use mode = "exact"
and fit with a single pass over the dataset, and if the dataset
is small, we can fit using mode = "lbfgs".
For larger datasets and numbers of random features,
preconditioned CG is our preferred option (sometimes
stochastic gradient descent can also be competitive).

The ``method`` argument can be either 'srht' or 'srht_2'. 'srht' requires only
one pass over the dataset and no matrix multiplications, so it's
pretty fast. 'srht_2' requires two passes over the dataset and
involves matrix multiplication, so it's slower, but it builds a better
preconditioner: one that usually reduces the number of CG iterations
required to fit by about 20-25% compared with 'srht'.
We recommend using 'srht_2' unless you're training on
CPU, in which case 'srht_2' might be much slower.

In [11]:
start_time = time.time()
preconditioner, ratio = uci_model.build_preconditioner(offline_train_data, max_rank = 512,
                                                      method = "srht_2")
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")

Chunk 0 complete.
Chunk 10 complete.
Chunk 0 complete.
Chunk 10 complete.
Wallclock: 1.8686835765838623


In [12]:
print(ratio)

13.69709540628906


Notice the "ratio" (aka Min eigval / lambda**2). The smaller
this value, the fewer iterations CG will need to fit. See the
"Fitting" section of the docs for some guidance on what is a 
"good-enough" ratio (i.e. a ratio that will result in a fast fit).

The smaller ``tol`` is, the tighter
the fit and the more accurate the model, but there are sharply
diminishing returns on this -- past a certain point, decreasing
``tol`` just increases the number of iterations required to fit
while providing only a very slight benefit. 1e-6 is usually more
than enough for noisy data, and 1e-7 is a good setting for
noise-free data. 1e-8 is usually very expensive overkill; it
may sometimes be useful when the data is virtually noise-free
and you need a really tight fit to minimize error as much as
possible, but keep in mind that it will greatly increase the
number of iterations required to fit and hence the fitting time.
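
Both effects (a good preconditioner cuts the iteration count; a tighter tolerance raises it) can be reproduced on a small synthetic linear system. The sketch below uses a simple diagonal (Jacobi) preconditioner as a stand-in for the SRHT-based one that xGPR builds, and it assumes ``tol`` is a relative residual threshold; both are assumptions made for illustration.

```python
import numpy as np

def conjugate_gradients(a_mat, b, tol, precond_diag=None, max_iter=10_000):
    """Solve a_mat @ x = b; stop when ||residual|| / ||b|| <= tol. Returns (x, n_iter)."""
    inv_diag = np.ones_like(b) if precond_diag is None else 1.0 / precond_diag
    x = np.zeros_like(b)
    resid = b.copy()
    z = inv_diag * resid              # preconditioned residual
    direction = z.copy()
    rz = resid @ z
    b_norm = np.linalg.norm(b)
    for n_iter in range(1, max_iter + 1):
        a_dir = a_mat @ direction
        step = rz / (direction @ a_dir)
        x += step * direction
        resid -= step * a_dir
        if np.linalg.norm(resid) / b_norm <= tol:
            return x, n_iter
        z = inv_diag * resid
        rz_new = resid @ z
        direction = z + (rz_new / rz) * direction
        rz = rz_new
    return x, max_iter

# Synthetic symmetric positive-definite system whose rows and columns are badly scaled.
rng = np.random.default_rng(0)
n = 300
g = rng.normal(size=(n, n))
well_conditioned = np.eye(n) + 0.1 * (g @ g.T) / n
scales = np.sqrt(np.logspace(0, 4, n))
a_mat = scales[:, None] * well_conditioned * scales[None, :]
b = rng.normal(size=n)

iters = {}
for tol in (1e-4, 1e-6, 1e-8):
    _, plain_iters = conjugate_gradients(a_mat, b, tol)
    _, precond_iters = conjugate_gradients(a_mat, b, tol, precond_diag=np.diag(a_mat))
    iters[tol] = (plain_iters, precond_iters)
    print(f"tol={tol:.0e}: {plain_iters} iterations plain, {precond_iters} preconditioned")
```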

In [13]:
start_time = time.time()
uci_model.fit(offline_train_data, preconditioner = preconditioner,
             mode = "cg", tol = 1e-6)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")

starting fitting
Iteration 0
Iteration 5
Iteration 10
Iteration 15
Iteration 20
Iteration 25
Iteration 30
Estimating variance...
Variance estimated.
Fitting complete.
Wallclock: 2.4214823246002197


We can get the uncertainty on predictions by setting get_var = True.
In this case, we don't need it, so we'll skip it. chunk_size ensures
we only process up to chunk_size datapoints at one time to limit
memory consumption.

In [14]:
test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)
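
The idea behind chunked prediction can be sketched in a few lines. Here `toy_predict` is a hypothetical stand-in for a fitted model, and `predict_in_chunks` is written for illustration only; in xGPR the ``chunk_size`` argument of ``predict`` handles this for you.

```python
import numpy as np

def predict_in_chunks(predict_fn, x, chunk_size):
    """Apply predict_fn to at most chunk_size rows at a time and join the results."""
    pieces = [predict_fn(x[start:start + chunk_size])
              for start in range(0, x.shape[0], chunk_size)]
    return np.concatenate(pieces)

weights = np.array([0.5, -1.0, 2.0])

def toy_predict(x_chunk):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return x_chunk @ weights

x = np.random.default_rng(0).normal(size=(2500, 3))
chunked = predict_in_chunks(toy_predict, x, chunk_size=1000)
print(chunked.shape, np.allclose(chunked, toy_predict(x)))
```

Chunking changes peak memory use, not the predictions themselves.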

In [15]:
mae = np.mean( np.abs(test_predictions - test_y))
print(f"MAE: {mae}")

MAE: 2.9641127770639915


Suppose we are unhappy with this result. We could of course consider
a different kernel or modeling approach; alternatively, we can
increase the number of random features for either tuning or
fitting, which will almost invariably improve performance.

Tuning hyperparameters with the ``crude`` methods
is a quick and dirty approach that does not scale well to large numbers of random
features. If we want to increase the number of features used for tuning to
more than 4096 (or to more than 2048 on CPU), we should consider using
approximate marginal likelihood instead. This is slower but scales much
better (the cost grows much more slowly as the number of random features
increases).

Generally increasing the number of random features used for fitting gives a
bigger performance boost than increasing the number for tuning. For fitting,
it can be beneficial to use as many as 32,768 random features, while
for tuning, we seldom see large performance gains for more than 10,000.
Either way, however, increasing the number of random features
yields diminishing returns. Going from 1024 to 2048 gives
a more substantial improvement than going from 2048 to
4096, and so on. If you ever find yourself needing to
go to very high numbers, the model & kernel may not be
a good fit for that particular problem.

First, let's increase the number used to fit and see what happens...

In [16]:
uci_model.fitting_rffs = 32768

start_time = time.time()
preconditioner, ratio = uci_model.build_preconditioner(offline_train_data, max_rank = 512,
                                                      method = "srht_2")
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")

Chunk 0 complete.
Chunk 10 complete.
Chunk 0 complete.
Chunk 10 complete.
Wallclock: 6.944976329803467


In [17]:
start_time = time.time()
uci_model.fit(offline_train_data, preconditioner = preconditioner,
             mode = "cg", tol = 1e-6)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")

starting fitting
Iteration 0
Iteration 5
Iteration 10
Iteration 15
Iteration 20
Iteration 25
Iteration 30
Estimating variance...
Variance estimated.
Fitting complete.
Wallclock: 7.713194131851196


In [18]:
test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)
mae = np.mean( np.abs(test_predictions - test_y))
print(f"MAE: {mae}")

MAE: 2.8691260253432724


As discussed above, we could also retune hyperparameters using a larger number of random features, preferably
using approximate marginal likelihood. There are two strategies for doing this implemented in xGPR: a Bayesian strategy (``uci_model.tune_hyperparams_fine_bayes``) and a direct strategy (``uci_model.tune_hyperparams_fine_direct``, which uses either the "Powell" algorithm or "Nelder-Mead"). These methods have more "knobs" that have to be set properly
to ensure the approximation is calculated correctly; we'll discuss a few of these briefly here, but see the Tuning section of the docs for more. Also see the small molecule example for another illustration on a problem involving a graph convolution kernel.

``fine_direct`` with "Powell" is faster, but for it to work well, the starting point has to be "within sight" of the global optimum, whereas ``fine_bayes`` can find the global optimum as long as it's somewhere in the neighborhood. ``fine_direct`` with ``optim_method="Nelder-Mead"`` is a bit of a wild card that usually takes many more iterations than Powell but often works slightly better. We generally prefer ``fine_direct`` with Powell if we're confident that we've got a good starting point, and ``fine_bayes`` if we're not. We'll illustrate the use of ``fine_direct`` here and ``fine_bayes`` under the small molecule example.
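
To experiment with the two direct-search methods outside xGPR, the sketch below runs SciPy's Powell and Nelder-Mead implementations on a toy surface from a nearby starting point. The objective, its optimum, and the starting point are all made up for illustration, and using SciPy here is an assumption of this sketch rather than a statement about how xGPR is implemented.

```python
import numpy as np
from scipy.optimize import minimize

TARGET = np.array([-0.5, 0.3, 0.6])        # made-up optimum of the toy surface
CURVATURE = np.array([1.0, 10.0, 100.0])   # different sensitivity per hyperparameter

def toy_score(log_hparams):
    """Toy stand-in for an expensive score such as a negative marginal log-likelihood."""
    return float(np.sum(CURVATURE * (log_hparams - TARGET) ** 2))

start = np.array([-0.4, 0.2, 0.4])  # a starting point already "within sight" of the optimum

results = {}
for method in ("Powell", "Nelder-Mead"):
    results[method] = minimize(toy_score, start, method=method, tol=1e-8)
    res = results[method]
    print(f"{method:>11}: {res.nfev} evaluations, x = {np.round(res.x, 3)}")
```

Both methods are derivative-free, so each "evaluation" here corresponds to one full score computation; in real tuning that is the expensive part, which is why the evaluation count matters.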

Tuning with up to 40 iterations (the ``max_iter`` used below) will essentially work out to
fitting the model up to 40 times, so if fitting the model once takes the better part of a minute, you can see how much time this is likely to take. As discussed, this approach is slower -- but also more scalable -- than ``crude_bayes``. Increasing ``max_iter`` will increase the chances of finding the best possible hyperparameters, but may of course take longer.

The same settings that work well for fitting generally work well for ensuring the marginal likelihood approximation used by this function is accurate. Since we used ``tol`` of 1e-6 for fitting, we'll use ``nmll_tol = 1e-6`` here as well. It is important not to use an ``nmll_rank`` that's too small -- 1024 is generally a decent default, but if you need to use a larger value than that to get a fast fit, you should probably use the same value here as well.

In [19]:
uci_model.training_rffs = 8192

start_time = time.time()
start_hparams = np.array([-0.41722132,  0.20104138,  0.39762501])

uci_model.tune_hyperparams_fine_direct(offline_train_data, starting_hyperparams = start_hparams,
                                  optim_method = "Powell",
                                  random_seed = 123, max_iter = 40,
                                  nmll_tol = 1e-6, nmll_rank = 1024)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")

starting_tuning
Now beginning NM minimization.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now building preconditioner...
Now fitting...
NMLL evaluation completed.
Now 

In [20]:
print(uci_model.get_hyperparams())

[-0.49753128  0.27381241  0.6262515 ]


Now let's refit using our new hyperparameters, with 8192 fitting RFFs so we
can compare to our first fit. We'll see that we get a slight improvement
over our initial tuning run (an MAE of about 2.89 versus about 2.96), but
nothing to write home about. (This isn't always
true -- see the small molecule example for a case where fine-tuning provides
a much more substantial boost!) Of course,
by fitting using this new hyperparameter set with 32768 RFFs instead of 8192
we could get some additional improvement. Adding random features brings us
asymptotically closer to what we would get with a (much more expensive) exact GP.

In [21]:
uci_model.fitting_rffs = 8192
preconditioner, ratio = uci_model.build_preconditioner(online_train_data, max_rank = 512,
                                                      method = "srht_2")
uci_model.fit(online_train_data, preconditioner = preconditioner,
             mode = "cg", tol = 1e-6)
test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)
mae = np.mean( np.abs(test_predictions - test_y))
print(mae)

Chunk 0 complete.
Chunk 10 complete.
Chunk 0 complete.
Chunk 10 complete.
starting fitting
Iteration 0
Iteration 5
Iteration 10
Iteration 15
Iteration 20
Iteration 25
Iteration 30
Iteration 35
Iteration 40
Iteration 45
Estimating variance...
Variance estimated.
Fitting complete.
2.8871837322706764


In [22]:
#We can switch the model over to CPU if we want to do inference on CPU (training is best
#done on GPU if possible.)
uci_model.device = "cpu"

In [23]:
#Finally, we'll delete the .npy files we created earlier.
offline_train_data.delete_dataset_files()

That's it for this simple warm-up. Now let's look at some more
interesting examples.