### Active learning on GB1

In this experiment, we test how xGPR would perform if used for an active
learning protein engineering experiment on the GB1 dataset. GB1 is convenient
for this purpose since it includes results for most of the 160,000 possible mutants of
a protein mutated at four sites, so we can run an in silico experiment that
recapitulates what would happen if we used the same approach in the lab.
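
As a quick check on the size of this search space, the sketch below enumerates every combination at four sites, assuming all 20 canonical amino acids are allowed at each site. The names `AMINO_ACIDS` and `all_variants` are illustrative and are not part of the GB1 data files or of xGPR.

```python
from itertools import product

# Assumption: the 20 canonical amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_SITES = 4


def all_variants(alphabet=AMINO_ACIDS, n_sites=N_SITES):
    """Lazily yield every n_sites-letter combination over the alphabet."""
    return ("".join(combo) for combo in product(alphabet, repeat=n_sites))


# 20 choices at each of 4 sites gives 20**4 combinations.
n_variants = sum(1 for _ in all_variants())
print(n_variants)
```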

Here we imagine a scenario where an experimenter randomly selects 96 sequences,
trains a model on these, uses an acquisition function (as is typical in Bayesian
optimization) to pick another 96, experimentally evaluates these, and so on. The goal
is to minimize the number of iterations needed to achieve a "good" result. The fitness
of each sequence here is measured from 0 to 1, where 0 is the worst and 1 the best
possible.
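
The loop described above can be sketched end to end on synthetic data. This is a minimal illustration only: it substitutes a closed-form Bayesian linear regression for xGPR, uses a randomly generated pool rather than GB1, and every name in it (`fit_bayes_linear`, `predict`, `run_active_learning`) is hypothetical.

```python
import numpy as np


def fit_bayes_linear(x, y, noise=0.1, prior_var=1.0):
    """Closed-form Bayesian linear regression (a stand-in for the real model).

    Returns the posterior mean and covariance of the weights.
    """
    d = x.shape[1]
    precision = x.T @ x / noise**2 + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ x.T @ y / noise**2
    return mean, cov


def predict(x, mean, cov):
    """Predictive mean and (noise-free) variance for each row of x."""
    mu = x @ mean
    var = np.einsum("ij,jk,ik->i", x, cov, x)
    return mu, var


def run_active_learning(x_pool, y_pool, batch_size=96, n_rounds=4, seed=0):
    """Random initial batch, then repeated fit / score / pick-top-batch."""
    rng = np.random.default_rng(seed)
    n = x_pool.shape[0]
    labeled = rng.choice(n, size=batch_size, replace=False)
    best_per_round = [y_pool[labeled].max()]
    for _ in range(n_rounds):
        unlabeled = np.setdiff1d(np.arange(n), labeled)
        mean, cov = fit_bayes_linear(x_pool[labeled], y_pool[labeled])
        mu, var = predict(x_pool[unlabeled], mean, cov)
        # Upper confidence bound acquisition: mean plus a multiple of the std.
        ucb = mu + 1.96 * np.sqrt(np.maximum(var, 0.0))
        picked = unlabeled[np.argsort(ucb)[-batch_size:]]
        labeled = np.concatenate([labeled, picked])
        best_per_round.append(y_pool[picked].max())
    return labeled, best_per_round


# Synthetic pool: 8 random features plus a bias column, fitness scaled to [0, 1].
rng = np.random.default_rng(1)
features = rng.normal(size=(5000, 8))
x_pool = np.hstack([features, np.ones((5000, 1))])
raw = features @ rng.normal(size=8)
y_pool = (raw - raw.min()) / (raw.max() - raw.min())

labeled, best_per_round = run_active_learning(x_pool, y_pool)
print(best_per_round)
```

Each round "measures" the selected batch by looking up its value in `y_pool`, which mirrors how the in silico experiment replaces a lab assay with a lookup in the full dataset.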

In [26]:
import os

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
from scipy import stats
import xGPR
from xGPR.xGP_Regression import xGPRegression
from xGPR.data_handling.dataset_builder import build_online_dataset

In [27]:
# Move to the encoded data directory when running from auxiliary_experiments.
if "auxiliary_experiments" in os.getcwd():
    os.chdir(os.path.join("..", "benchmark_evals", "active_learn", "encoded_data"))

# Load the encoded GB1 sequences and their fitness values.
gb1_x = np.load(os.path.join("GB1", "0_block_xvalues.npy")).astype(np.float64)
gb1_y = np.load(os.path.join("GB1", "0_block_yvalues.npy")).astype(np.float64)

In [97]:
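def init_sample(all_x, all_y, random_seed):
    # Shuffle all sequences; the first 384 form the initial training set
    # and the remainder is the pool of untested candidates.
    rng = np.random.default_rng(random_seed)
    ind = rng.permutation(all_x.shape[0])
    ind_train, ind_test = ind[:384], ind[384:]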
    testx, testy = all_x[ind_test,:], all_y[ind_test]
    train_dset = build_online_dataset(all_x[ind_train,:],
                                      all_y[ind_train], chunk_size=2000)
    return train_dset, testx, testy

def acquisition_rank(y_pred, var_pred):
    return scipy.stats.rankdata(y_pred) + scipy.stats.rankdata(-var_pred)

def sample_and_stack(init_trainx, init_trainy, test_x, test_y, model):
    preds, var = model.predict(test_x, get_var = True, chunk_size=2000)
    # Pick the 96 candidates with the largest upper bound. The bound uses
    # var as returned by the model; if that is a variance, a conventional
    # UCB would use np.sqrt(var) instead.
    upper_bound = preds + 1.96 * var
    best_idx = np.argsort(upper_bound)[-96:]
    sampled_y = test_y[best_idx]

    # Move the selected sequences from the candidate pool to the training set.
    train_x = np.vstack([init_trainx, test_x[best_idx,:]])
    train_y = np.concatenate([init_trainy, test_y[best_idx]])
    mask = np.ones(test_x.shape[0], dtype=bool)
    mask[best_idx] = False
    new_test_x = test_x[mask,:]
    new_test_y = test_y[mask]
    new_train_dset = build_online_dataset(train_x, train_y, chunk_size=2000)
    return new_train_dset, new_test_x, new_test_y, sampled_y
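
The selection step in `sample_and_stack` adds 1.96 times the model's `var` output to the predicted mean. For comparison, here is a sketch of a conventional upper confidence bound that assumes the second input is a variance and converts it to a standard deviation first. The helper `ucb_top_k` is hypothetical and is not part of xGPR, and the input values are made up.

```python
import numpy as np


def ucb_top_k(y_pred, var_pred, k=96, z=1.96):
    """Return indices of the k largest mean + z * std scores, best first."""
    # Clip tiny negative variances that can arise from round-off.
    std_pred = np.sqrt(np.clip(var_pred, 0.0, None))
    upper_bound = y_pred + z * std_pred
    k = min(k, upper_bound.shape[0])
    # argpartition finds the top k without sorting the whole pool.
    top = np.argpartition(upper_bound, -k)[-k:]
    return top[np.argsort(upper_bound[top])[::-1]]


# Made-up predictions for four candidates.
y_pred = np.array([0.2, 0.5, 0.4, 0.1])
var_pred = np.array([0.01, 0.0, 0.04, 0.25])
print(ucb_top_k(y_pred, var_pred, k=2))
```

With a pool of roughly 160,000 candidates, partial sorting avoids ordering the entire array when only 96 indices are needed, although a full `np.argsort` is also fast at this scale.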

In [112]:
tdset, tx, ty = init_sample(gb1_x, gb1_y, 123)

mod = xGPRegression(training_rffs = 512, fitting_rffs = 8192, device = "gpu",
                    variance_rffs = 1024, kernel_choice = "RBF",
                    kernel_specific_params = {"split_points":[21,42,63]},
                    verbose = True)

_ = mod.tune_hyperparams_crude_lbfgs(tdset, max_iter=50, n_restarts=1)
#_ = mod.tune_hyperparams_crude_bayes(tdset, max_bayes_iter=30)
preconditioner, ratio = mod.build_preconditioner(tdset,
                                         max_rank = 256, method = 'srht')
print("Ratio: %s"%ratio)
mod.fit(tdset, preconditioner = preconditioner, 
        mode = "cg", tol=1e-6)

starting_tuning
Now beginning L-BFGS minimization.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 0 completed. Best score is 533.6090861702294.
Tuning complete.
Chunk 0 complete.
Ratio: 0.08050471840240853
starting fitting
Iteration 0
Iteration 5
Now performing variance calculations...
Fitting complete.


In [113]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)

In [114]:
_ = mod.tune_hyperparams_crude_lbfgs(tdset, max_iter=50, n_restarts=3)
#_ = mod.tune_hyperparams_crude_bayes(tdset, max_bayes_iter=30)
preconditioner, ratio = mod.build_preconditioner(tdset,
                                         max_rank = 256, method = 'srht')
print("Ratio: %s"%ratio)
mod.fit(tdset, preconditioner = preconditioner, 
        mode = "cg", tol=1e-6)

starting_tuning
Now beginning L-BFGS minimization.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 0 completed. Best score is 475.49245402881894.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient.

In [115]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)
_ = mod.tune_hyperparams_crude_lbfgs(tdset, max_iter=50, n_restarts=3)
#_ = mod.tune_hyperparams_crude_bayes(tdset, max_bayes_iter=30)
preconditioner, ratio = mod.build_preconditioner(tdset,
                                         max_rank = 256, method = 'srht')
print("Ratio: %s"%ratio)
mod.fit(tdset, preconditioner = preconditioner, 
        mode = "cg", tol=1e-6)

starting_tuning
Now beginning L-BFGS minimization.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 0 completed. Best score is 547.8332197100323.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 1 completed. Best score is 547.8332197100323.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gr

In [116]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)

In [118]:
np.max(sampled_y)

0.8622110018596997

In [119]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)
_ = mod.tune_hyperparams_crude_lbfgs(tdset, max_iter=50, n_restarts=3)
#_ = mod.tune_hyperparams_crude_bayes(tdset, max_bayes_iter=30)
preconditioner, ratio = mod.build_preconditioner(tdset,
                                         max_rank = 256, method = 'srht')
print("Ratio: %s"%ratio)
mod.fit(tdset, preconditioner = preconditioner, 
        mode = "cg", tol=1e-6)

starting_tuning
Now beginning L-BFGS minimization.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 0 completed. Best score is 689.0670149940382.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient..

In [120]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)

In [122]:
np.max(sampled_y)

0.8311300290585193

In [123]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)
_ = mod.tune_hyperparams_crude_lbfgs(tdset, max_iter=50, n_restarts=3)
#_ = mod.tune_hyperparams_crude_bayes(tdset, max_bayes_iter=30)
preconditioner, ratio = mod.build_preconditioner(tdset,
                                         max_rank = 256, method = 'srht')
print("Ratio: %s"%ratio)
mod.fit(tdset, preconditioner = preconditioner, 
        mode = "cg", tol=1e-6)

starting_tuning
Now beginning L-BFGS minimization.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Restart 0 completed. Best score is 870.0028169045173.
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient...
Evaluating gradient..

In [124]:
tdset, tx, ty, sampled_y = sample_and_stack(tdset.xdata_, tdset.ydata_, tx, ty, mod)

In [126]:
np.mean(sampled_y)

0.4212812537250494
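
The cells above inspect `sampled_y` one round at a time. One way to compare rounds on equal footing is to record the same statistics for every batch. The sketch below assumes each round's `sampled_y` has been appended to a list; the helper `summarize_rounds` is hypothetical, and the example values are made up for illustration and are not GB1 results.

```python
import numpy as np


def summarize_rounds(batches):
    """Per-round max and mean of the sampled fitness, plus the running best."""
    running_best = -np.inf
    rows = []
    for i, batch in enumerate(batches):
        batch = np.asarray(batch, dtype=float)
        running_best = max(running_best, float(batch.max()))
        rows.append({
            "round": i,
            "batch_max": float(batch.max()),
            "batch_mean": float(batch.mean()),
            "best_so_far": running_best,
        })
    return rows


# Made-up fitness values for three small batches (not real GB1 data).
example_batches = [[0.1, 0.3], [0.2, 0.8], [0.5, 0.6]]
rows = summarize_rounds(example_batches)
for row in rows:
    print(row)
```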