## Example: Sequence Data

Next, let's consider the AAV dataset (designed vs. mutant split)
from the FLIP benchmark suite. For this dataset, we train on 200,000
amino acid sequences of length 57 and try to predict the fitness
of the sequences in a pre-specified test set. Dallago et al. report that a standard
1d-CNN trained on this data achieves a Spearman's r of 0.75, while
a 750-million parameter pretrained model that took 50 GPU-days
to train achieves a Spearman's r of 0.79.
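
Spearman's r scores how well the predicted ranking agrees with the measured ranking, so any monotonic rescaling of the predictions leaves it unchanged. Here is a minimal sketch on made-up numbers (the arrays and the helper `spearman_no_ties` are illustrative and not part of FLIP or xGPR; `scipy.stats.spearmanr` additionally handles ties):

```python
import numpy as np

def spearman_no_ties(a, b):
    """Spearman's r for inputs without ties: Pearson correlation of the ranks."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Toy values for illustration only; these are not from the AAV dataset.
measured = np.array([0.1, 0.4, 0.35, 0.8, 1.2])
predicted = np.array([0.0, 0.3, 0.5, 0.9, 2.0])

rho = spearman_no_ties(predicted, measured)

# A monotonic transform of the predictions leaves the ranks, and therefore
# Spearman's r, unchanged.
rho_transformed = spearman_no_ties(np.exp(predicted), measured)
print(rho, rho_transformed)
```

Because only the ordering matters, this metric rewards a model that ranks sequences correctly even if its predicted fitness values are on the wrong scale.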

We'll evaluate a convolution kernel and show that we can easily
match or outperform the deep learning baselines without too
much effort.

This was originally run using xGPR v0.3.

In [1]:
import os
import shutil
import subprocess
import math
import time
import zipfile

import pandas as pd
import numpy as np

from xGPR import xGPRegression as xGPReg
from xGPR import build_regression_dataset



In [2]:
#This may take a minute...
subprocess.run(["git", "clone", "https://github.com/J-SNACKKB/FLIP"])

shutil.move(os.path.join("FLIP", "splits", "aav", "full_data.csv.zip"), "full_data.csv.zip")
fname = "full_data.csv.zip"

with zipfile.ZipFile(fname, "r") as zip_ref:
    zip_ref.extractall()

os.remove("full_data.csv.zip")


shutil.rmtree("FLIP")

Cloning into 'FLIP'...


In [3]:
raw_data = pd.read_csv("full_data.csv")
os.remove("full_data.csv")



In [4]:
raw_data["input_seq"] = [f.upper().replace("*", "") for f in raw_data["mutated_region"].tolist()]

We'll use simple one-hot encoding for the sequences. This may take a minute to set up. Notice that when
encoding the sequences we record the length of each sequence, so that the zero-padding added
to the end of each sequence can be masked out when fitting the model. This is new in xGPR 0.3. If you
want the zero-padding included for some reason, you can just set all sequence lengths to be the same.
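
To make the padding and the recorded lengths concrete, here is a small sketch that encodes two made-up sequences. The alphabet and maximum length mirror the encoder below; the helper `encode` and the example sequences are hypothetical and only for illustration.

```python
import numpy as np

# Same alphabet order and maximum length as the notebook's encoder.
AAS = "ACDEFGHIKLMNPQRSTVWY-"
MAX_LEN = 57

def encode(seq):
    """One-hot encode one sequence, zero-padded on the right to MAX_LEN."""
    x = np.zeros((MAX_LEN, len(AAS)), dtype=np.uint8)
    x[np.arange(len(seq)), [AAS.index(aa) for aa in seq]] = 1
    return x

seqs = ["ACDK", "MKVLAAG"]                       # made-up sequences
x = np.stack([encode(s) for s in seqs])          # shape (2, 57, 21)
seqlen = np.array([len(s) for s in seqs], dtype=np.int32)

# Rows at or beyond each sequence's length are padding and are all zero.
for xi, n in zip(x, seqlen):
    assert xi[:n].sum() == n and xi[n:].sum() == 0
print(x.shape, seqlen)
```

The `seqlen` array is what tells the model where the real sequence ends and the padding begins.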

In [5]:
def one_hot_encode(input_seq_list, y_values, chunk_size, ftype = "train"):
    # Amino acid alphabet; the position in this list is the one-hot index.
    aas = ["A", "C", "D", "E", "F", "G", "H", "I",
           "K", "L", "M", "N", "P", "Q", "R", "S", "T",
           "V", "W", "Y", "-"]
    output_x, output_y, output_seqlen = [], [], []
    xfiles, yfiles, seqlen_files = [], [], []
    fcounter = 0

    for seq, y_value in zip(input_seq_list, y_values):
        # Each sequence becomes a (1, 57, 21) array, zero-padded at the end.
        encoded_x = np.zeros((1,57,21), dtype = np.uint8)
        for i, letter in enumerate(seq):
            encoded_x[0, i, aas.index(letter)] = 1

        output_x.append(encoded_x)
        output_y.append(y_value)
        # Record the unpadded length so the padding can be masked out later.
        output_seqlen.append(len(seq))

        # Once a full chunk has accumulated, save it to disk and start a new one.
        if len(output_x) >= chunk_size:
            xfiles.append(f"{fcounter}_{ftype}_xblock.npy")
            yfiles.append(f"{fcounter}_{ftype}_yblock.npy")
            seqlen_files.append(f"{fcounter}_{ftype}_seqlen.npy")
            np.save(xfiles[-1], np.vstack(output_x))
            np.save(yfiles[-1], np.asarray(output_y))
            np.save(seqlen_files[-1], np.array(output_seqlen).astype(np.int32))
            fcounter += 1
            output_x, output_y, output_seqlen = [], [], []
            print(f"Encoded file {fcounter}")
    # Note: sequences left over after the last full chunk are not saved.
    return xfiles, yfiles, seqlen_files

In [6]:
train_data = raw_data[raw_data["des_mut_split"]=="train"]
test_data = raw_data[raw_data["des_mut_split"]=="test"]


train_x_files, train_y_files, train_seqlen_files = one_hot_encode(train_data["input_seq"].tolist(),
                                              train_data["score"].tolist(), 2000, "train")
test_x_files, test_y_files, test_seqlen_files = one_hot_encode(test_data["input_seq"].tolist(),
                                            test_data["score"].tolist(), 2000, "test")

Encoded file 1
Encoded file 2
Encoded file 3
Encoded file 4
Encoded file 5
Encoded file 6
Encoded file 7
Encoded file 8
Encoded file 9
Encoded file 10
Encoded file 11
Encoded file 12
Encoded file 13
Encoded file 14
Encoded file 15
Encoded file 16
Encoded file 17
Encoded file 18
Encoded file 19
Encoded file 20
Encoded file 21
Encoded file 22
Encoded file 23
Encoded file 24
Encoded file 25
Encoded file 26
Encoded file 27
Encoded file 28
Encoded file 29
Encoded file 30
Encoded file 31
Encoded file 32
Encoded file 33
Encoded file 34
Encoded file 35
Encoded file 36
Encoded file 37
Encoded file 38
Encoded file 39
Encoded file 40
Encoded file 41
Encoded file 42
Encoded file 43
Encoded file 44
Encoded file 45
Encoded file 46
Encoded file 47
Encoded file 48
Encoded file 49
Encoded file 50
Encoded file 51
Encoded file 52
Encoded file 53
Encoded file 54
Encoded file 55
Encoded file 56
Encoded file 57
Encoded file 58
Encoded file 59
Encoded file 60
Encoded file 61
Encoded file 62
Encoded file 63
...

Notice that we pass the list of seqlen_files into the dataset builder. This is required if working with
3d arrays / convolution kernels. If you are working with 2d arrays / fixed-length vector kernels,
the default for the third argument (``None``) is appropriate.
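
Before building the dataset, it can be worth checking that the saved chunks are consistent with one another. The helper below is a hypothetical sanity check written with plain NumPy (it is not part of xGPR). It assumes each x file holds a 3d array and that sequence lengths must lie between 1 and the padded length. The demonstration runs on a small synthetic chunk.

```python
import os
import tempfile
import numpy as np

def check_chunks(x_files, y_files, seqlen_files):
    """Check that each saved chunk is a 3d array with matching y and seqlen."""
    total = 0
    for xf, yf, sf in zip(x_files, y_files, seqlen_files):
        x, y, s = np.load(xf), np.load(yf), np.load(sf)
        assert x.ndim == 3, f"{xf}: expected (N, length, features)"
        assert x.shape[0] == y.shape[0] == s.shape[0], f"{xf}: row mismatch"
        assert s.min() >= 1 and s.max() <= x.shape[1], f"{sf}: bad length"
        total += x.shape[0]
    return total

# Demonstration on a small synthetic chunk written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("x.npy", "y.npy", "s.npy")]
    np.save(paths[0], np.zeros((5, 57, 21), dtype=np.uint8))
    np.save(paths[1], np.zeros(5))
    np.save(paths[2], np.full(5, 57, dtype=np.int32))
    n_checked = check_chunks(paths[:1], paths[1:2], paths[2:])
print(n_checked)
```

In this notebook the same check could be run as `check_chunks(train_x_files, train_y_files, train_seqlen_files)`.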

In [7]:
training_dset = build_regression_dataset(train_x_files, train_y_files, train_seqlen_files, chunk_size = 2000)

Here we'll use the Conv1dRBF kernel, a kernel for sequences. Convolution kernels are usually slower than RBF / Matern, especially if the sequence is long. We'll run a quick and dirty tuning experiment using 1024 random features, then fine-tune this using a larger number of random features just as we did for the tabular dataset.

Many kernels in xGPR have kernel-specific settings. For Conv1dRBF, we can set two key options: sequence averaging, which is one of 'none', 'sqrt' or 'full', and the width of the convolution to use. Just as with a convolutional network, the width of the convolution filters can affect performance. One way to choose a good setting is to run hyperparameter tuning (e.g. with ``crude_bayes`` or ``crude_grid``) using a small number of RFFs (e.g. 1024 - 2048) for several different settings of "conv_width" and compare the resulting marginal likelihood scores. Because the score is a negative marginal log likelihood, the smallest score achieved likely corresponds to the best value for "conv_width".
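
The selection procedure described above can be sketched as a small loop. Both `pick_best_setting` and `toy_score` are hypothetical names introduced here. The toy scorer is a stand-in with an arbitrary minimum so the sketch runs without a model, and it says nothing about which width is actually best for this dataset.

```python
def pick_best_setting(candidates, score_fn):
    """Return (best_candidate, all_scores), where lower scores are better."""
    scores = {c: score_fn(c) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores

# Stand-in scorer so the sketch runs on its own. In practice score_fn would
# build a model with kernel_settings={"conv_width": width, ...}, tune it with
# a small number of RFFs and return the best NMLL found.
def toy_score(width):
    return (width - 11) ** 2 + 100.0

best_width, scores = pick_best_setting([5, 9, 11, 13, 17], toy_score)
print(best_width, scores)
```

Since each candidate requires a separate tuning run, keeping the number of RFFs small for this comparison keeps the total cost manageable.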

In [8]:
aav_model = xGPReg(num_rffs = 1024, variance_rffs = 512,
                  kernel_choice = "Conv1dRBF",
                   kernel_settings = {"conv_width":11, "averaging":'none'},
                   verbose = True, device = "gpu")

start_time = time.time()
hparams, niter, best_score = aav_model.tune_hyperparams_crude(training_dset)
end_time = time.time()

print(f"Best estimated negative marginal log likelihood: {best_score}")
print(f"Wallclock: {end_time - start_time}")

Grid point 0 acquired.
Grid point 1 acquired.
Grid point 2 acquired.
Grid point 3 acquired.
Grid point 4 acquired.
Grid point 5 acquired.
Grid point 6 acquired.
Grid point 7 acquired.
Grid point 8 acquired.
Grid point 9 acquired.
New hparams: [-2.1136193]
Additional acquisition 10.
New hparams: [-1.7113183]
Additional acquisition 11.
New hparams: [-1.601644]
Best score achieved: 129179.362
Best hyperparams: [-3.0363038 -1.601644 ]
Best estimated negative marginal log likelihood: 129179.362
Wallclock: 24.44625163078308


We now have a rough estimate of our hyperparameters, acquired using a sketchy kernel approximation
(num_rffs=1024) and a crude tuning procedure. Let's fine-tune this a little. We could use
the built-in tuning routine in xGPR the way we did for the tabular data, or we could use
Optuna (or some other library), or we could do a simple gridsearch. For illustrative
purposes here, we'll use Optuna with num_rffs=4096 (a somewhat better kernel
approximation) and see what that looks like. We'll search the region around the
hyperparameters obtained from ``tune_hyperparams_crude``. To run this
next piece, you'll need to have Optuna installed. Optuna is one of our
favorite approaches and is often able to do a little better than other methods.
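
For comparison, the "simple gridsearch" option mentioned above might look like the sketch below. The function `grid_search` and the quadratic `toy_nmll` are hypothetical stand-ins so that the code runs without a fitted model. The grid width and number of points are arbitrary choices, not xGPR defaults.

```python
import itertools
import numpy as np

def grid_search(nmll_fn, center, half_width=1.0, points_per_axis=5):
    """Evaluate nmll_fn on a regular grid centered on `center`; return the best."""
    axes = [np.linspace(c - half_width, c + half_width, points_per_axis)
            for c in center]
    best_hparams, best_score = None, np.inf
    for combo in itertools.product(*axes):
        hparams = np.array(combo)
        score = nmll_fn(hparams)
        if score < best_score:
            best_hparams, best_score = hparams, score
    return best_hparams, best_score

# Toy quadratic stand-in for the NMLL so the sketch runs without a model.
def toy_nmll(hparams):
    return float(np.sum((hparams - np.array([-2.0, -1.0])) ** 2))

crude = np.array([-3.0, -1.6])   # rounded from the crude tuning output above
best, score = grid_search(toy_nmll, crude)
print(best, score)
```

With the model from this notebook, `nmll_fn` could be `lambda h: aav_model.exact_nmll(h, training_dset)`. A 5 x 5 grid costs 25 NMLL evaluations, which is comparable to the number of Optuna trials used below.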

In [9]:
import optuna
from optuna.samplers import TPESampler

def objective(trial):
    lambda_ = trial.suggest_float("lambda_", -2., 0.)
    sigma = trial.suggest_float("sigma", -3., -1.)
    hparams = np.array([lambda_, sigma])
    nmll = aav_model.exact_nmll(hparams, training_dset)
    return nmll

In [10]:
aav_model.num_rffs = 4096

sampler = TPESampler(seed=123)
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=35)

[I 2024-02-28 14:51:37,848] A new study created in memory with name: no-name-07d4d74d-1f68-40c2-90f3-add8c8746d1b
[I 2024-02-28 14:51:52,970] Trial 0 finished with value: 191010.51515248106 and parameters: {'lambda_': -0.6070616288042767, 'sigma': -2.4277213300992413}. Best is trial 0 with value: 191010.51515248106.


Evaluated NMLL.


[I 2024-02-28 14:52:08,106] Trial 1 finished with value: 139098.10635091155 and parameters: {'lambda_': -1.5462970928715938, 'sigma': -1.8973704618342175}. Best is trial 1 with value: 139098.10635091155.


Evaluated NMLL.


[I 2024-02-28 14:52:23,307] Trial 2 finished with value: 179035.2427503361 and parameters: {'lambda_': -0.5610620604288739, 'sigma': -2.153787079751078}. Best is trial 1 with value: 139098.10635091155.


Evaluated NMLL.


[I 2024-02-28 14:52:38,505] Trial 3 finished with value: 172288.1657109192 and parameters: {'lambda_': -0.038471603230769036, 'sigma': -1.6303405228302734}. Best is trial 1 with value: 139098.10635091155.


Evaluated NMLL.


[I 2024-02-28 14:52:53,704] Trial 4 finished with value: 165436.6996476398 and parameters: {'lambda_': -1.0381361970312781, 'sigma': -2.215764963611699}. Best is trial 1 with value: 139098.10635091155.


Evaluated NMLL.


[I 2024-02-28 14:53:08,901] Trial 5 finished with value: 131532.96759474633 and parameters: {'lambda_': -1.3136439676982612, 'sigma': -1.5419005852319168}. Best is trial 5 with value: 131532.96759474633.


Evaluated NMLL.


[I 2024-02-28 14:53:24,094] Trial 6 finished with value: 192336.93498199942 and parameters: {'lambda_': -1.1228555106407512, 'sigma': -2.8806442067808633}. Best is trial 5 with value: 131532.96759474633.


Evaluated NMLL.


[I 2024-02-28 14:53:39,317] Trial 7 finished with value: 133291.688314279 and parameters: {'lambda_': -1.2039114893391372, 'sigma': -1.5240091885359286}. Best is trial 5 with value: 131532.96759474633.


Evaluated NMLL.


[I 2024-02-28 14:53:54,640] Trial 8 finished with value: 166185.98465624175 and parameters: {'lambda_': -1.635016539093, 'sigma': -2.649096487705015}. Best is trial 5 with value: 131532.96759474633.


Evaluated NMLL.


[I 2024-02-28 14:54:09,971] Trial 9 finished with value: 156150.76989813204 and parameters: {'lambda_': -0.9368972523163233, 'sigma': -1.9363448258062679}. Best is trial 5 with value: 131532.96759474633.


Evaluated NMLL.


[I 2024-02-28 14:54:25,305] Trial 10 finished with value: 111334.22656963504 and parameters: {'lambda_': -1.8041913070310849, 'sigma': -1.0083487851495223}. Best is trial 10 with value: 111334.22656963504.


Evaluated NMLL.


[I 2024-02-28 14:54:40,641] Trial 11 finished with value: 110739.69821716385 and parameters: {'lambda_': -1.9316015294012514, 'sigma': -1.0461089979443123}. Best is trial 11 with value: 110739.69821716385.


Evaluated NMLL.


[I 2024-02-28 14:54:55,978] Trial 12 finished with value: 110615.42786483007 and parameters: {'lambda_': -1.9697942932881722, 'sigma': -1.0625029186361101}. Best is trial 12 with value: 110615.42786483007.


Evaluated NMLL.


[I 2024-02-28 14:55:11,314] Trial 13 finished with value: 110289.24437218352 and parameters: {'lambda_': -1.9923237115324908, 'sigma': -1.0312083464477573}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:55:26,657] Trial 14 finished with value: 112657.04601012528 and parameters: {'lambda_': -1.9741907548353694, 'sigma': -1.238809634853463}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:55:41,991] Trial 15 finished with value: 118612.16505182625 and parameters: {'lambda_': -1.6626657406640857, 'sigma': -1.3325845613948655}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:55:57,322] Trial 16 finished with value: 113355.71398185895 and parameters: {'lambda_': -1.979911115241622, 'sigma': -1.282551755845012}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:56:12,657] Trial 17 finished with value: 115027.21113692805 and parameters: {'lambda_': -1.44070011461748, 'sigma': -1.016021201953487}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:56:27,992] Trial 18 finished with value: 130971.75932766096 and parameters: {'lambda_': -1.7605244043847585, 'sigma': -1.7898449770498086}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:56:43,341] Trial 19 finished with value: 123053.40935089342 and parameters: {'lambda_': -1.4663024401332163, 'sigma': -1.3737507499679595}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:56:58,678] Trial 20 finished with value: 113608.10209868877 and parameters: {'lambda_': -1.7729705022760882, 'sigma': -1.1697605744816058}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:57:14,012] Trial 21 finished with value: 110342.09086018638 and parameters: {'lambda_': -1.958363634417783, 'sigma': -1.004823531585883}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:57:29,351] Trial 22 finished with value: 111344.37591478857 and parameters: {'lambda_': -1.9926447281221582, 'sigma': -1.156675028124563}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:57:44,685] Trial 23 finished with value: 118983.5770222494 and parameters: {'lambda_': -1.7506340815240233, 'sigma': -1.3972777365917783}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:58:00,017] Trial 24 finished with value: 111980.00307648914 and parameters: {'lambda_': -1.8425516828455992, 'sigma': -1.1049320833171015}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:58:15,354] Trial 25 finished with value: 117286.81287115451 and parameters: {'lambda_': -1.5510336616392377, 'sigma': -1.2107364825415443}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:58:30,697] Trial 26 finished with value: 119212.55923930157 and parameters: {'lambda_': -1.8446486815766967, 'sigma': -1.4581632870322487}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:58:46,040] Trial 27 finished with value: 128302.00065426654 and parameters: {'lambda_': -1.691677142394271, 'sigma': -1.6698962260217383}. Best is trial 13 with value: 110289.24437218352.


Evaluated NMLL.


[I 2024-02-28 14:59:01,377] Trial 28 finished with value: 110135.62559608728 and parameters: {'lambda_': -1.999704295150311, 'sigma': -1.0054636279089055}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


[I 2024-02-28 14:59:16,710] Trial 29 finished with value: 114448.50938787832 and parameters: {'lambda_': -1.8708026583766026, 'sigma': -1.2755007084944523}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


[I 2024-02-28 14:59:32,044] Trial 30 finished with value: 115188.77677291981 and parameters: {'lambda_': -1.6377650041263696, 'sigma': -1.1681786788801491}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


[I 2024-02-28 14:59:47,379] Trial 31 finished with value: 110272.25703744689 and parameters: {'lambda_': -1.9921321214425995, 'sigma': -1.0276982897258669}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


[I 2024-02-28 15:00:02,713] Trial 32 finished with value: 116888.90020478022 and parameters: {'lambda_': -1.8255404598855574, 'sigma': -1.3584664560199293}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


[I 2024-02-28 15:00:18,044] Trial 33 finished with value: 111917.0952403296 and parameters: {'lambda_': -1.8770955212827354, 'sigma': -1.1245007589287543}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


[I 2024-02-28 15:00:33,380] Trial 34 finished with value: 110201.41865947528 and parameters: {'lambda_': -1.9928388001115445, 'sigma': -1.0133404395427514}. Best is trial 28 with value: 110135.62559608728.


Evaluated NMLL.


Set the model hyperparameters to the best ones found by Optuna.

In [11]:
study.best_params

{'lambda_': -1.999704295150311, 'sigma': -1.0054636279089055}

In [12]:
aav_model.set_hyperparams(np.array([-1.9997, -1.005464]), training_dset)

Now we'll fit the model using 8192 RFFs. We like to use a more accurate kernel approximation when fitting than when tuning for two reasons. First, tuning is more expensive because the model has to be fit multiple times when tuning hyperparameters. Second, model performance usually
increases faster by increasing the number of RFFs used for fitting than for tuning. (Using 16,384 RFFs here for fitting further
increases test set performance, as you'd expect.)
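
To see why more random features give a more accurate kernel approximation, here is a minimal sketch of random Fourier features for a plain RBF kernel on synthetic data. This is the generic textbook construction, not xGPR's Conv1dRBF implementation. The function name, the data and the value of `sigma` are assumptions made for illustration.

```python
import numpy as np

def rff_features(x, num_rffs, sigma, rng):
    """Random Fourier features whose inner products approximate an RBF kernel."""
    w = rng.normal(scale=sigma, size=(x.shape[1], num_rffs // 2))
    proj = x @ w
    return np.hstack([np.cos(proj), np.sin(proj)]) * np.sqrt(2.0 / num_rffs)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))           # synthetic inputs
sigma = 0.3

# Exact RBF kernel exp(-sigma^2 * ||a - b||^2 / 2) for comparison.
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
exact = np.exp(-0.5 * sigma**2 * sq_dists)

# Mean absolute error of the approximate kernel for increasing feature counts.
errors = {}
for num_rffs in (64, 1024, 8192):
    z = rff_features(x, num_rffs, sigma, rng)
    errors[num_rffs] = np.abs(z @ z.T - exact).mean()
print(errors)
```

The approximation error shrinks roughly with the square root of the number of features, which is why each doubling of RFFs buys a smaller improvement than the last.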

On GPU, for fitting, ``mode=exact`` works well up to 8,192 RFFs or so, while ``mode=cg``, although
slower for small numbers of RFFs, is more scalable. On this dataset, using 8,192 RFFs, "exact" takes about 70 seconds on our GPU.
We'll use cg here just for illustrative purposes. Notice that fitting with default settings takes about 45 iterations with
CG. We can speed this up by changing the defaults (see the Advanced Tutorials for more on how to do this).

``tol`` determines how tight the fit is. 1e-6 (default) is usually fine. Decreasing this number will improve performance,
but with rapidly diminishing returns, and will make fitting take longer. For noise-free data or to get a small additional boost in
performance, use 1e-7. 1e-8 is (nearly always) overkill.
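
The trade-off between ``tol`` and fitting time can be illustrated with a bare-bones conjugate gradient solver on a synthetic ridge regression problem. This sketch assumes ``tol`` acts as a threshold on the relative residual. It is not xGPR's solver, which also uses a preconditioner (the "Using rank: 512" line in the output below), and the data here are random.

```python
import numpy as np

def conjugate_gradients(a, b, tol=1e-6, max_iter=500):
    """Solve a @ x = b for symmetric positive-definite a; stop when the
    relative residual ||b - a @ x|| / ||b|| drops below tol."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs_old = r @ r
    b_norm = np.linalg.norm(b)
    for n_iter in range(1, max_iter + 1):
        ap = a @ p
        alpha = rs_old / (p @ ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        if np.sqrt(rs_new) / b_norm < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, n_iter

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 50))          # stand-in for random features
y = rng.normal(size=500)
a = z.T @ z + 1.0 * np.eye(50)          # ridge-regularized normal equations
b = z.T @ y

# Tighter tolerances need at least as many iterations.
for tol in (1e-4, 1e-6, 1e-8):
    weights, n_iter = conjugate_gradients(a, b, tol=tol)
    print(tol, n_iter)
```

Each tightening of the tolerance costs extra iterations, while the resulting change in the weights (and hence in predictions) becomes progressively smaller.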

In [13]:
aav_model.num_rffs = 8192
start_time = time.time()
aav_model.fit(training_dset, mode = 'cg', tol = 1e-6)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")

starting fitting
Chunk 0 complete.
Chunk 10 complete.
Chunk 20 complete.
Chunk 30 complete.
Chunk 40 complete.
Chunk 50 complete.
Chunk 60 complete.
Chunk 70 complete.
Chunk 80 complete.
Chunk 90 complete.
Using rank: 512
Chunk 0 complete.
Chunk 10 complete.
Chunk 20 complete.
Chunk 30 complete.
Chunk 40 complete.
Chunk 50 complete.
Chunk 60 complete.
Chunk 70 complete.
Chunk 80 complete.
Chunk 90 complete.
Iteration 0
Iteration 5
Iteration 10
Iteration 15
Iteration 20
Now performing variance calculations...
Fitting complete.
Wallclock: 142.21557712554932


In [15]:
start_time = time.time()
all_preds, ground_truth = [], []
for xfile, yfile, sfile in zip(test_x_files, test_y_files, test_seqlen_files):
    x, y, s = np.load(xfile), np.load(yfile), np.load(sfile)
    ground_truth.append(y)
    preds = aav_model.predict(x, s, get_var = False)
    all_preds.append(preds)
    
all_preds, ground_truth = np.concatenate(all_preds), np.concatenate(ground_truth)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")

Wallclock: 2.0209038257598877


In [16]:
from scipy.stats import spearmanr

spearmanr(all_preds, ground_truth)

SignificanceResult(statistic=0.7587165981515888, pvalue=0.0)

A Spearman's r of about 0.76 matches the performance of the 1d-CNN reported by Dallago et al.
for this dataset and is similar to the performance of a fine-tuned LLM (Spearman's r 0.79).
As discussed above, we can get further slight improvements in performance
just by tweaking this model. We can do even better by using a more informative
representation of the protein sequences. In our original paper we achieved a Spearman's r
of about 0.8 on this dataset, outperforming fine-tuned LLMs (and costing significantly less to train
than a fine-tuned LLM).
Whether small gains in performance from further "tweaking" or more informative representations are worthwhile
obviously depends on your application.

In [17]:
for testx, testy, tests in zip(test_x_files, test_y_files, test_seqlen_files):
    os.remove(testx)
    os.remove(testy)
    os.remove(tests)

In [18]:
for xfile, yfile, sfile in zip(train_x_files, train_y_files, train_seqlen_files):
    os.remove(xfile)
    os.remove(yfile)
    os.remove(sfile)