# GP Regression on Protein Sequences: Subsequence String Kernel #

An example notebook for string kernel-based GP regression on a dataset of protein sequences using the subsequence string kernel (SSK) model of [1, 2]. For the bag-of-amino-acids representation of the protein sequence (analogous to the bag-of-SMILES model for molecules), see the 'protein fitness prediction - bag of amino acids notebook'. The protein dataset consists of 151 sequences with a 'fitness' function (target label) given by the melting point in degrees Celsius. The dataset is collated from values reported in references [3, 4, 5]. Each sequence has length 290, so using a GPU with the SSK kernel is recommended.

In contrast to the bag of amino acids notebook, we do not report results on 20 random train/test splits because this would be too computationally intensive for the SSK kernel.
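
To make the kernel concrete, the sketch below computes the single-decay subsequence kernel of [1] by brute force for short strings. It is a toy under stated assumptions, not the GAUCHE `SubsequenceStringKernel`, which uses separate match and gap decays and a far more efficient evaluation. The function names and the decay value are hypothetical. Each length-`n` subsequence is weighted by `decay ** span`, where the span is the length of the window the subsequence occupies in the string.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(s, n, decay):
    """Map each length-n subsequence of s to the sum of decay**span over its occurrences."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # window length covered by this occurrence
        feats[u] += decay ** span
    return feats


def ssk(s, t, n, decay):
    """Unnormalised kernel: inner product of the two feature maps."""
    fs = subsequence_features(s, n, decay)
    ft = subsequence_features(t, n, decay)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


def normalised_ssk(s, t, n, decay):
    """Kernel rescaled so that every string has unit similarity with itself."""
    return ssk(s, t, n, decay) / (ssk(s, s, n, decay) * ssk(t, t, n, decay)) ** 0.5


# "cat" and "car" share only the subsequence "ca" for n = 2
print(ssk("cat", "car", 2, 0.5))             # equals 0.5 ** 4
print(normalised_ssk("cat", "car", 2, 0.5))  # approximately 0.444, i.e. 1 / (2 + 0.5 ** 2)
```

This brute-force version enumerates every index combination, so it is only usable on very short strings. For sequences of length 290 a dynamic-programming evaluation, as in the kernel used in this notebook, is required.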


In [1]:
"""Imports"""

# Suppress library warnings
import warnings
warnings.filterwarnings("ignore")

from botorch import fit_gpytorch_model
from botorch.models import SingleTaskGP
from botorch.models.transforms import Normalize, Standardize
from botorch.models.fully_bayesian import MIN_INFERRED_NOISE_LEVEL
from gpytorch.constraints import GreaterThan
from gpytorch.kernels import ScaleKernel
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood
from gpytorch.priors import GammaPrior
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
import torch

from gauche.dataloader.data_utils import transform_data
from gauche.kernels.string_kernels.sskkernel import pad, encode_string, build_one_hot, SubsequenceStringKernel

In [2]:
"""CPU/GPU"""

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tkwargs = {"dtype": torch.float, "device": device}
print(tkwargs)

{'dtype': torch.float32, 'device': device(type='cpu')}


## The Petase Dataset

The dataset consists of a set of petase protein sequences with amino acid chains of length 290. An example sequence is given below:

```
MNFPRASRLMQAAVLGGLMAVSAAATAQTNPYARGPPPTAASLEASAGPFTVRSFTVSRPSGYGAGTVYYPTNAGGTVGAIAIVPGYTARQSSIKWWGPRLASHGFVVITIDTNSTLDQPSSRSSQQMAALRQVASLNGTSSSPIYGKVDTARMGVMGWSMGGGGSLISAANNPSLKAAAPQAPWDSSTNFSSVTVPTLIFACENDSIAPVNSSALPIYDSMSRNAKQFLEINGGSHSCANSGNSNQALIGKKGVAWMKRFMDNDTRYSTFACENPNSTRVSDFRTANCS
```

For such long sequences the SSK kernel can struggle computationally, so a "bag of amino acids" model is provided for comparison in the companion notebook.

In [3]:
"""Regression experiments parameters, number of random splits and split size"""

n_trials = 20  # unused below: only a single split is evaluated for the SSK
test_set_size = 0.2

In [4]:
"""Load the petase dataset"""

import sys
sys.path.append('..')

df = pd.read_csv('../gauche/datasets/proteins/petase_151_mutants.csv')
x = df['sequence'].to_list()
y = df['fitness'].to_numpy().reshape(-1, 1)
print(f'len(sequences) {len(x)} | len(targets) {len(y)}')

len(sequences) 151 | len(targets) 151


In [5]:
"""Compute the required sequence properties for modelling with the SSK kernel GP."""

maxlen = np.max([len(seq) for seq in x])
# get alphabet of characters used in candidate set (to init SSK)
alphabet = list({l for word in x for l in word})
print(f'alphabet \n {alphabet} \n length of alphabet {len(alphabet)}')
print(f'maxlen {maxlen}')

alphabet 
 ['F', 'N', 'E', 'Y', 'A', 'K', 'S', 'P', 'M', 'G', 'R', 'I', 'H', 'Q', 'C', 'W', 'L', 'D', 'V', 'T'] 
 length of alphabet 20
maxlen 290


## GP Regression on the Petase Dataset

First we define the GP model for protein sequences.
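
For intuition about what the fitted model computes, the sketch below evaluates the exact GP posterior from a precomputed Gram matrix with NumPy. It assumes a zero prior mean, a fixed noise variance, and targets that are already standardized. The function name `gp_posterior_from_gram` and the toy matrices are hypothetical; in this notebook BoTorch and GPyTorch handle these steps internally.

```python
import numpy as np


def gp_posterior_from_gram(K_train, K_cross, k_test_diag, y_train, noise=1e-2):
    """Exact GP posterior mean and variance from kernel matrices.

    K_train: (n, n) kernel between training inputs
    K_cross: (n, m) kernel between training and test inputs
    k_test_diag: (m,) prior variances of the test inputs
    y_train: (n,) training targets (zero prior mean assumed)
    """
    n = K_train.shape[0]
    # Cholesky factor of the noisy training covariance
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    # alpha = (K + noise * I)^{-1} y via two solves against the factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_cross.T @ alpha
    # Variance reduction term: ||L^{-1} k_*||^2 for each test point
    v = np.linalg.solve(L, K_cross)
    var = k_test_diag - np.sum(v ** 2, axis=0)
    return mean, var


# Toy check: predicting at the training points with tiny noise recovers the targets
K_toy = np.array([[1.0, 0.5], [0.5, 1.0]])
y_toy = np.array([1.0, -1.0])
mean_toy, var_toy = gp_posterior_from_gram(K_toy, K_toy, np.diag(K_toy), y_toy, noise=1e-8)
print(mean_toy, var_toy)
```

Any valid kernel can supply the Gram matrices here; with a string kernel the entries are similarities between pairs of sequences rather than functions of Euclidean distance.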

In [6]:
"""Process the inputs x to the string kernel GPs"""

# Compute one-hot encodings and an integer index for the given amino acid alphabet
embds, index = build_one_hot(alphabet)
embds = embds.to(**tkwargs)

# Process the string inputs to the SSK model
x = torch.cat([pad(encode_string(seq, index), maxlen).unsqueeze(0) for seq in x], dim=0)
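
The cell above turns each sequence into a fixed-length row of integer codes. The pure-Python sketch below illustrates that idea on a toy alphabet. The helper names (`build_toy_index`, `encode_toy`, `pad_toy`) are hypothetical, and the convention of reserving 0 for padding is an assumption that may differ from the GAUCHE helpers `build_one_hot`, `encode_string`, and `pad`.

```python
def build_toy_index(alphabet):
    """Map each character to a positive integer; 0 is reserved for padding (assumption)."""
    return {ch: i + 1 for i, ch in enumerate(alphabet)}


def encode_toy(seq, index):
    """Replace every character by its integer code."""
    return [index[ch] for ch in seq]


def pad_toy(codes, length, pad_value=0):
    """Right-pad a list of codes to a fixed length."""
    return codes + [pad_value] * (length - len(codes))


toy_alphabet = sorted(set("MNFPRA"))  # ['A', 'F', 'M', 'N', 'P', 'R']
toy_index = build_toy_index(toy_alphabet)
toy_encoded = [pad_toy(encode_toy(s, toy_index), 6) for s in ["MNFPRA", "MNF"]]
print(toy_encoded)  # [[3, 4, 2, 5, 6, 1], [3, 4, 2, 0, 0, 0]]
```

In the petase dataset every sequence already has length 290, so padding leaves the encoded sequences unchanged; it matters only for datasets with variable-length sequences.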

In [7]:
"""Compute the train/test split."""

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=test_set_size, random_state=0)
X_train = X_train.to(**tkwargs)
X_test = X_test.to(**tkwargs)
y_train = torch.tensor(y_train, **tkwargs)
y_test = torch.tensor(y_test, **tkwargs)

In [8]:
"""Initialize and fit the models"""

# Likelihood function
likelihood = GaussianLikelihood(
    noise_prior=GammaPrior(torch.tensor(0.9, **tkwargs), torch.tensor(10.0, **tkwargs)),
    noise_constraint=GreaterThan(MIN_INFERRED_NOISE_LEVEL),
)

# Covariance function
covar_module = ScaleKernel(SubsequenceStringKernel(embds, index, alphabet, maxlen, **tkwargs))


ssk_gp_model = SingleTaskGP(
    train_X=X_train,
    train_Y=y_train,
    outcome_transform=Standardize(1),
    likelihood=likelihood,
    covar_module=covar_module,
)

mll = ExactMarginalLogLikelihood(model=ssk_gp_model, likelihood=ssk_gp_model.likelihood)
# Ideally we would also optimize the hyperparameters of the string kernel.
# However, the batch (GPU) version of the kernel uses a lot of GPU memory,
# while the standard non-batch version is relatively slow to evaluate.
# Nevertheless, the kernel is very robust to the choice of hyperparameters,
# so they are held fixed here.
mll.model.covar_module.base_kernel.raw_order_coefs.requires_grad = False
mll.model.covar_module.base_kernel.raw_match_decay.requires_grad = False
mll.model.covar_module.base_kernel.raw_gap_decay.requires_grad = False

fit_gpytorch_model(mll)

ExactMarginalLogLikelihood(
  (likelihood): GaussianLikelihood(
    (noise_covar): HomoskedasticNoise(
      (noise_prior): GammaPrior()
      (raw_noise_constraint): GreaterThan(1.000E-04)
    )
  )
  (model): SingleTaskGP(
    (likelihood): GaussianLikelihood(
      (noise_covar): HomoskedasticNoise(
        (noise_prior): GammaPrior()
        (raw_noise_constraint): GreaterThan(1.000E-04)
      )
    )
    (mean_module): ConstantMean()
    (covar_module): ScaleKernel(
      (base_kernel): SubsequenceStringKernel(
        (raw_gap_decay_constraint): Interval(0.000E+00, 1.000E+00)
        (raw_match_decay_constraint): Interval(0.000E+00, 1.000E+00)
        (raw_order_coefs_constraint): Interval(0.000E+00, 1.000E+00)
      )
      (raw_outputscale_constraint): Positive()
    )
    (outcome_transform): Standardize()
  )
)
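
The `Standardize(1)` outcome transform rescales the training targets to zero mean and unit variance before fitting, and the model's posterior is mapped back to the original units. A minimal pure-Python sketch of that idea follows, assuming the sample standard deviation is used. The helper names and the melting-point values are hypothetical.

```python
from statistics import mean, stdev


def standardize_targets(y):
    """Return standardized targets together with the mean and sample std used."""
    m, s = mean(y), stdev(y)
    return [(v - m) / s for v in y], m, s


def unstandardize_targets(z, m, s):
    """Map standardized values back to the original units."""
    return [v * s + m for v in z]


toy_melting_points = [50.0, 55.0, 60.0]  # hypothetical values in degrees Celsius
z, m, s = standardize_targets(toy_melting_points)
print(z)                               # [-1.0, 0.0, 1.0]
print(unstandardize_targets(z, m, s))  # [50.0, 55.0, 60.0]
```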

In [9]:
"""Evaluate the trained model."""

posterior = ssk_gp_model.posterior(X_test)
posterior_mean = posterior.mean.cpu().detach()
posterior_std = torch.sqrt(posterior.variance.cpu().detach())

# Move the targets to the CPU as a NumPy array; scikit-learn expects (y_true, y_pred)
y_true = y_test.cpu().numpy()
r2 = r2_score(y_true, posterior_mean.numpy())
print(mean_absolute_error(y_true, posterior_mean.numpy()))

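
For reference, the two metrics used above can be written out directly. The sketch below is a plain-Python restatement of the standard definitions of $R^2$ and mean absolute error. The function names and the toy values are hypothetical and are not results from this dataset.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae_manual(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [1.0, 2.0, 3.0]
toy_pred = [1.0, 2.0, 4.0]
print(r2_manual(toy_true, toy_pred))   # 0.5
print(mae_manual(toy_true, toy_pred))  # one third
```

Note that $R^2$ is not symmetric in its arguments, so the true targets must be passed first, whereas the mean absolute error is unaffected by argument order.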

In [None]:
"""Plot the R^2"""

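fig, ax = plt.subplots(figsize=(8, 6))

# Flatten targets and predictions to 1-D NumPy arrays for plotting and line fitting
y_true = y_test.cpu().numpy().ravel()
y_pred = posterior_mean.numpy().ravel()

ax.scatter(y_true, y_pred)
ax.set_title(f'Test set $R^2 = {r2:.2f}$')
# Least-squares line of best fit through the predictions
ax.plot(np.unique(y_true), np.poly1d(np.polyfit(y_true, y_pred, 1))(np.unique(y_true)), color='k', linewidth=0.4)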


## References

[1] Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N. and Watkins, C., 2002. [Text classification using string kernels](https://jmlr.csail.mit.edu/papers/volume2/lodhi02a/lodhi02a.pdf). The Journal of Machine Learning Research, pp.419-444.

[2] Cancedda, N., Gaussier, E., Goutte, C. and Renders, J.M., 2003. [Word sequence kernels](https://www.jmlr.org/papers/volume3/cancedda03a/cancedda03a.pdf). The Journal of Machine Learning Research, pp.1059-1082.

[3] Cui, Y., Chen, Y., Liu, X., Dong, S., Tian, Y.E., Qiao, Y., Mitra, R., Han, J., Li, C., Han, X. and Liu, W., 2021. [Computational redesign of a PETase for plastic biodegradation under ambient condition by the GRAPE strategy](https://pubs.acs.org/doi/abs/10.1021/acscatal.0c05126). ACS Catalysis, 11(3), pp.1340-1350.

[4] Liu, B., He, L., Wang, L., Li, T., Li, C., Liu, H., Luo, Y. and Bao, R., 2018. [Protein crystallography and site‐direct mutagenesis analysis of the poly (ethylene terephthalate) hydrolase PETase from Ideonella sakaiensis](https://chemistry-europe.onlinelibrary.wiley.com/doi/abs/10.1002/cbic.201800097). ChemBioChem, 19(14), pp.1471-1475.

[5] Joo, S., Cho, I.J., Seo, H., Son, H.F., Sagong, H.Y., Shin, T.J., Choi, S.Y., Lee, S.Y. and Kim, K.J., 2018. [Structural insight into molecular mechanism of poly (ethylene terephthalate) degradation](https://www.nature.com/articles/s41467-018-02881-1). Nature communications, 9(1), p.382.