# ProteusAI

In this notebook you will learn how to use ProteusAI (PAI) to automate many of the steps we carried out by hand on the previous days. We will focus on computing representations, visualizing representation spaces, training models and model ensembles, and evaluating model quality.

We will also preview other PAI functionality for handling additional input data types such as structures: how to visualize them, how to design novel sequences given a target structure, and how to use protein language models (pLMs) to predict substitution probabilities.

## Loading tabular data and computing representations
First we have to import proteusAI. I always import it as pai so I don't have to write the full name every time.

In [None]:
import proteusAI as pai
import matplotlib.pyplot as plt
%matplotlib inline

We load our source data, the enzyme dataset, using the Library class.

In [None]:
enzyme_data = "Nitric_Oxide_Dioxygenase.csv"

lib = pai.Library(
    source=enzyme_data, 
    seqs_col="Sequence", 
    names_col="Description", 
    y_col="Data",
    y_type="num"
)

We can compute and plot BLOSUM62 representations in a single line.

In [None]:
fig, ax, df = lib.plot("blosum62")
plt.show()
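To give an intuition for what a BLOSUM62 representation is, the sketch below encodes each residue as its row of substitution scores and concatenates the rows into one vector. It uses only a three-residue excerpt (A, R, N) of the BLOSUM62 matrix, and the helper encode is hypothetical, not part of PAI. How PAI builds and pads its vectors internally may differ.

```python
# Excerpt of the BLOSUM62 matrix for the residues A, R and N only.
# The full matrix covers all 20 amino acids; this subset is for illustration.
BLOSUM62_EXCERPT = {
    "A": [4, -1, -2],
    "R": [-1, 5, 0],
    "N": [-2, 0, 6],
}

def encode(seq):
    """Hypothetical helper: concatenate the score rows of all residues."""
    vector = []
    for residue in seq:
        vector.extend(BLOSUM62_EXCERPT[residue])
    return vector

toy_vector = encode("ARNA")
print(len(toy_vector))  # 4 residues x 3 scores = 12 values
print(toy_vector)
```

Residues that substitute for each other frequently have similar score rows, so similar sequences end up close together in this representation space.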

## Protein language models in PAI

Using protein language models in PAI is equally easy. We can simply call the Library.compute() method and specify the model we want to use. In this case we are going to use ESM-2, a state-of-the-art protein language model. To speed things up, however, we are going to use the 8 million parameter model (the smallest), which is not as powerful as the 650 million parameter model or the billion-parameter models.

In [None]:
lib.compute("esm2_8M")
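A pLM such as ESM-2 produces one embedding vector per residue. A common way to obtain a single fixed-length vector per sequence is to average over the residues (mean pooling). Whether Library.compute() pools in exactly this way is an assumption; the numbers below are made up and mean_pool is a hypothetical helper.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one sequence-level vector."""
    n_residues = len(residue_embeddings)
    n_dims = len(residue_embeddings[0])
    return [
        sum(vec[d] for vec in residue_embeddings) / n_residues
        for d in range(n_dims)
    ]

# Toy embeddings: 3 residues, 4 dimensions each (made-up numbers).
toy_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]
pooled = mean_pool(toy_embeddings)
print(pooled)  # one 4-dimensional vector for the whole sequence
```

The pooled vector has the same length regardless of sequence length, which is what makes it usable as input for standard machine learning models.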

Again, we can visualize the representation space by calling the Library.plot() method.

In [None]:
fig, ax, df = lib.plot("esm2_8M")
plt.show()

## Training machine learning models

To train machine learning models, similar to the first day, we can simply create a Model object that takes our Library as input and call the Model.train() method. The training output is captured in a dictionary that we can inspect or use as a data source to create a new Library object. This could be useful if, for example, we wanted to plot the representation space using the predicted y values.

In [None]:
model = pai.Model(
    library=lib, 
    model_type="rf", 
    x="blosum62",
    k_folds=10
)
out = model.train()
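The k_folds=10 argument requests k-fold cross-validation. As a rough sketch of the general idea (not PAI's implementation, which may shuffle, stratify or hold out a separate test set differently), the data is split into k parts and each part serves once as the validation set. k_fold_indices is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled indices over k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=25, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Every sample is used for validation exactly once, so the averaged validation score is less dependent on one lucky or unlucky split.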

## Model statistics

Training the model prints some model statistics, but we can go deeper. While training, PAI automatically computes statistics that can be used to inspect the model's performance. It computes R-squared values, Pearson correlation coefficients and Kendall-Tau correlation coefficients, and performs a conformal prediction analysis to estimate how noisy your model and experiments are. Let's print some of those values and show a model diagnostics plot.

In [None]:
print("Validation R-squared:\t", model.test_r2)
print("Validation Pearson:\t", model.test_pearson)
print("Validation Kendall-Tau:\t", model.test_ken_tau)
print("Model calibration:\t", model.calibration)

model.true_vs_predicted(model.y_test, model.y_test_pred)

We see that the Pearson and Kendall-Tau correlation coefficients also come with p-values that indicate the statistical significance of these metrics.
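For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch of the textbook definitions on made-up values, not PAI's code: the function names are hypothetical, the Kendall-Tau variant shown is the simple tau-a without tie correction, and no p-values are computed.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def pearson(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# Made-up true and predicted values.
y_true_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred_toy = [1.5, 1.8, 3.2, 3.9, 4.6]
print(r_squared(y_true_toy, y_pred_toy))
print(pearson(y_true_toy, y_pred_toy))
print(kendall_tau(y_true_toy, y_pred_toy))
```

R-squared measures how close the predictions are to the true values, Pearson measures linear association, and Kendall-Tau only looks at whether the ranking is preserved, which is often what matters when picking the best variants.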

**Task**: Plot the model uncertainty values against the true model error, i.e. the residuals between the predicted and the true validation y values. Do we observe that the model uncertainty correlates with the observed error?

Tip to access values:
- true validation y values = model.y_val
- predicted validation y values = model.y_val_pred

## Bayesian optimization in PAI

Once we have trained a model in PAI, we can use it to predict the activity of novel sequences, or directly to propose the next experiments we could run. Below we use the upper confidence bound (UCB) acquisition function, aim to maximize our y value, evaluate 10000 sequences per run, and are quite exploitative (explore=0.1).

In [None]:
out = model.search(
    acq_fn="ucb",
    optim_problem="max",
    max_eval=10000,
    explore=0.1
)

In [None]:
out
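The upper confidence bound scores each candidate by its predicted value plus a weighted term for its predicted uncertainty. The sketch below shows that idea on made-up predictions. How PAI maps the explore argument onto this weight is an assumption here, and ucb and best_candidate are hypothetical helpers.

```python
def ucb(mean, std, explore):
    """Upper confidence bound: predicted value plus weighted uncertainty."""
    return mean + explore * std

# Made-up predictions for three candidate sequences: (mean, std).
candidates = {
    "variant_a": (1.00, 0.10),  # high prediction, low uncertainty
    "variant_b": (0.90, 0.50),  # slightly lower prediction, high uncertainty
    "variant_c": (0.50, 0.20),
}

def best_candidate(explore):
    """Return the candidate name with the highest UCB score."""
    return max(candidates, key=lambda name: ucb(*candidates[name], explore))

print(best_candidate(explore=0.1))  # small weight favours exploitation
print(best_candidate(explore=1.0))  # large weight favours exploration
```

With a small weight the confidently good variant wins; with a large weight the uncertain variant is preferred because it might turn out to be better than predicted.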

**Optional Task**: Come up with a method to balance multiple objectives at once. This is going to be difficult. As a tip, read about what a Pareto front is, and remember that you can train multiple models on the same data.
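As a starting point for the tip, the sketch below filters a set of made-up two-objective scores down to its Pareto front, assuming both objectives should be maximized. The helper names are hypothetical, and this is only one building block, not a full solution to the task.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and
    strictly better in at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]

# Made-up (objective_1, objective_2) predictions for five variants.
toy_points = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (2.0, 2.0), (0.5, 4.5)]
front = pareto_front(toy_points)
print(front)
```

Points on the front represent different trade-offs: improving one objective is only possible by giving up some of the other.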

# Protein Class

The Protein class can be used to load protein sequences and protein structural information. This is particularly useful if we want to engineer a protein but don't know where to make the first mutations, or if we want to engineer the protein based on its structure (for example to increase its thermostability). The Protein class can also be used for data visualization.

## Loading a protein structure and visualizing it

In [None]:
prot = pai.Protein(source="1zb6.pdb")

In [None]:
prot.view_struc()

PAI also offers methods to extract interfaces, such as protein-ligand interfaces. These interfaces can be important information in structure-based design workflows.

In [None]:
interface = prot.get_contacts(target="ligand", dist=7)
highlight = {"A":interface}

prot.view_struc(
    highlight=highlight,
    #sticks=interface
)
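Conceptually, a contact search with a distance cutoff keeps every residue that has at least one atom within the cutoff of any ligand atom. The sketch below shows this on made-up coordinates. The data layout and residues_in_contact are hypothetical, and both the unit of dist=7 (presumably ångström) and the exact atom selection used by get_contacts are assumptions.

```python
import math

def residues_in_contact(residue_atoms, ligand_atoms, cutoff):
    """Return residue numbers with any atom within `cutoff` of a ligand atom."""
    contacts = []
    for res_number, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            contacts.append(res_number)
    return sorted(contacts)

# Made-up coordinates: residue number -> list of (x, y, z) atom positions.
toy_residues = {
    10: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    11: [(20.0, 0.0, 0.0)],
    12: [(0.0, 6.0, 0.0)],
}
toy_ligand = [(0.0, 0.0, 3.0)]

toy_interface = residues_in_contact(toy_residues, toy_ligand, cutoff=7.0)
print(toy_interface)
```

A larger cutoff returns a wider shell of residues around the ligand, a smaller one only the residues in direct contact.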

## Structure-based design

In the following we are going to use an inverse folding algorithm (ESM-IF) to sample sequences conditioned on our input structure. It has been shown that sequences sampled with inverse folding algorithms often show increased thermostability and expression levels. In some cases, even properties like catalytic activity can be increased.

In [None]:
out = prot.esm_if(
    num_samples=10,
    target_chain="A",
    fixed=interface
)

In [None]:
out