# Sequence and structure II: modern approaches

## Contact prediction from "direct interactions"

In the last lecture, we saw that structural constraints on proteins can be reflected in correlated amino acid sequence variation. In particular, pairs of sites in contact in the three-dimensional structure of the protein can (sometimes) be distinguished by amino acid covariation at those sites.

However, we also saw that other factors can lead to correlations between amino acids at pairs of sites. One reason for this is [phylogeny](https://doi.org/10.1073/pnas.1711913115). Another is indirect interactions. In general, if pairs of sites (A, B) and (B, C) interact, we will observe correlations not just between (A, B) and (B, C), but also (A, C). It can be challenging to tell which interactions are direct and which ones are indirect.

[Lapedes et al, 1999](https://www.jstor.org/stable/4356049) and, independently, [Weigt et al, PNAS 2009](https://doi.org/10.1073/pnas.0805923106) attempted to solve the indirect interaction problem by inferring the parameters of a generative sequence model including pairwise interactions that best reproduce correlations in the data. In this model, the probability of an amino acid sequence $(a_1, a_2, \ldots, a_L)$ is 

$$
P(a_1, a_2, \ldots, a_L) \propto \exp\left(\sum_i h_i(a_i) + \sum_{i, j}J_{ij}(a_i, a_j)\right)\,.
$$

Here the *fields* $h_i(a_i)$ describe the preference for amino acid $a_i$ at site $i$. In an ensemble of sequences sampled from this distribution, the *couplings* $J_{ij}(a_i, a_j)$ will induce correlations between amino acid variation at sites $i$ and $j$. Other sites that interact with $i$ or $j$ will also become correlated with them through indirect interactions. If we can infer the couplings from data, they may represent direct interactions better than the correlations do.
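
To make the idea of indirect correlations concrete, here is a minimal sketch of a toy version of this model. The setup is an assumption chosen for illustration: three sites (A, B, C) with two states each, zero fields, and couplings of strength 1 that reward matching states for the pairs (A, B) and (B, C) only. Because the model is tiny, we can enumerate every sequence and compute the covariances exactly.

```python
import itertools
import numpy as np

# Toy model (all values here are illustrative assumptions):
# three sites A, B, C, each with two possible states (0 or 1)
n_sites, n_states = 3, 2

# Couplings reward matching states for the pairs (A, B) and (B, C) only.
# There is no direct coupling between A and C, and all fields h_i are zero.
J = {(0, 1): 1.0, (1, 2): 1.0}

def energy(seq):
    """Sum of couplings over coupled pairs whose states match (fields omitted)."""
    return sum(strength for (i, j), strength in J.items() if seq[i] == seq[j])

# Enumerate all 2^3 sequences and normalize exp(energy) to get probabilities
seqs = list(itertools.product(range(n_states), repeat=n_sites))
weights = np.array([np.exp(energy(s)) for s in seqs])
probs = weights / weights.sum()

# Exact covariance matrix of the site variables under the model
x = np.array(seqs, dtype=float)
mean = probs @ x
toy_cov = (x * probs[:, None]).T @ x - np.outer(mean, mean)

print('Cov(A, B) = %.3f' % toy_cov[0, 1])
print('Cov(B, C) = %.3f' % toy_cov[1, 2])
print('Cov(A, C) = %.3f (nonzero despite no A-C coupling)' % toy_cov[0, 2])
```

In this toy model the A-C covariance is smaller than the A-B and B-C covariances, but it is not zero, even though A and C never interact directly.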


### A simple, approximate solution for inferring direct interactions

In general, finding the maximum likelihood parameters $h_i(a_i)$ and $J_{ij}(a_i, a_j)$ for a particular data set is computationally intensive.

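However, we can make a simple approximation. If the sequence variables $(a_1, a_2, \ldots, a_L)$ were real-valued rather than categorical, then a model of this form, whose exponent is quadratic in the variables, would be a Gaussian distribution. For a Gaussian distribution, the "interactions" $J_{ij}(a_i, a_j)$ are given (up to a sign and a constant factor, neither of which affects the ranking of site pairs used below) by the elements of the inverse of the covariance matrix for amino acid frequencies across sites,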

$$
C_{ij}(a_i, a_j) = p_{ij}(a_i, a_j) - p_i(a_i) p_j(a_j)\,,
$$

where $p_i(a_i)$ is the fraction of sequences with amino acid $a_i$ at site $i$, and $p_{ij}(a_i, a_j)$ is the fraction of sequences with amino acids $a_i$, $a_j$ at sites $i$, $j$. This is, of course, kind of a wild approximation, but it works reasonably well in practice (for inferring interactions -- the resulting generative model is *terrible*).
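
As a small sanity check of the Gaussian picture, the sketch below uses a hypothetical inverse covariance (precision) matrix for three real-valued variables in which A and C each interact with B but not with each other. The numbers are made up for illustration. The covariance matrix shows an indirect A-C correlation, while inverting it recovers the absence of a direct A-C interaction.

```python
import numpy as np

# Hypothetical precision (inverse covariance) matrix for three real-valued
# variables A, B, C: A and C each interact with B, but not with each other
precision = np.array([[ 2.0, -1.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [ 0.0, -1.0,  2.0]])

# The covariance matrix of the corresponding Gaussian is the matrix inverse
toy_cov = np.linalg.inv(precision)
print('Cov(A, C) = %.3f' % toy_cov[0, 2])   # indirect correlation

# Inverting the covariance recovers the direct interactions, including
# the absent A-C interaction (zero up to floating point error)
recovered = np.linalg.inv(toy_cov)
print('Inverse covariance (A, C) entry = %.3f' % recovered[0, 2])
```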


### Example: revisiting structure in trypsin inhibitor proteins

To show the association between "direct interactions" and structure, let's return to the example of the trypsin inhibitor proteins ([protein family PF00014](https://www.ebi.ac.uk/interpro/entry/pfam/PF00014) in Pfam). Like last time, we'll load in the [PDB file](https://www.rcsb.org/structure/5PTI) and compute the pairwise distances between sites. Then we'll attempt to find pairs of sites that are in contact.

In [None]:
# If we're running on Google colab, we need to install BioPython and 
# clone the repository
try:
    import google.colab
    !pip install biopython
    !git clone -l -s https://github.com/johnbarton/misc-lectures.git misc
    %cd misc
except ImportError:
    pass

In [None]:
# First, let's import the libraries that we'll need
from Bio import PDB              # BioPython for PDB files
import matplotlib.pyplot as plt  # matplotlib for plots
import numpy as np               # numpy for vectors/matrices and math
import figs                      # a helper file for making figures

In [None]:
# This is a data structure that stores information from PDB files
structure = PDB.PDBParser().get_structure('PF00014', 'data/pdb5pti.ent') 

# The structure is a bit complicated, but we just need a few elements
# The code below extracts a list of amino acids (and their positions, etc)
# from the PDB file
model = structure[0]
chain = model['A']
residues = [r for r in chain if PDB.is_aa(r)]

In [None]:
# The sequence alignment we'll be using omits the first three sites
# and the last two sites of the structure, so we should remove those
residues = residues[3:-2]
L = len(residues)

print('The sequence is %d amino acids long' % L)

In [None]:
# Now we'll compute the LxL distance matrix for the remaining sites
d_mat = np.zeros((L, L))
for i in range(L):
    for j in range(i+1, L):
        # Getting the distance is easy: "subtracting" one atom from another
        # returns the distance between them in Angstroms
        # Here we use the alpha carbon (CA) atoms as a reference for distance
        distance = residues[i]['CA'] - residues[j]['CA']
        d_mat[i, j] = distance
        d_mat[j, i] = distance

### Interactions from correlations

After defining residues that are less than 8 Angstroms apart as "in contact," we'll again check the accuracy of correlations in identifying nearby residues.
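
The contact definition can be illustrated on a toy example. The sketch below assumes four hypothetical alpha carbon coordinates placed along a line (these are not taken from the PDB file) and uses numpy broadcasting to build the distance matrix and list the pairs closer than the cutoff.

```python
import numpy as np

# Hypothetical alpha carbon coordinates (in Angstroms) for four residues
toy_coords = np.array([[ 0.0, 0.0, 0.0],
                       [ 3.8, 0.0, 0.0],
                       [ 7.6, 0.0, 0.0],
                       [20.0, 0.0, 0.0]])

# Pairwise Euclidean distances via broadcasting
diffs = toy_coords[:, None, :] - toy_coords[None, :, :]
toy_d_mat = np.sqrt((diffs**2).sum(axis=-1))

# Pairs (i, j) with i < j whose distance is below the cutoff
toy_cutoff = 8
i_idx, j_idx = np.triu_indices(len(toy_coords), k=1)
close = toy_d_mat[i_idx, j_idx] < toy_cutoff
toy_contacts = [(int(i), int(j)) for i, j in zip(i_idx[close], j_idx[close])]
print(toy_contacts)
```

Note that residues adjacent in the sequence, whose alpha carbons are roughly 3.8 Angstroms apart, always satisfy an 8 Angstrom cutoff, so some "contacts" simply reflect proximity along the chain.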

In [None]:
# Before we make a plot, we need to define which residues are in contact
contact_cutoff = 8
contacts = []
for i in range(L):
    for j in range(i+1, L):
        if d_mat[i, j]<contact_cutoff:
            contacts.append((i, j))

In [None]:
# Define the amino acid list, which we use to map sequences to numerical values
aas = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
       'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', '-']

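# Now we'll read in the sequences -- lines that start with '>' are labels,
# and the rest are the sequences (assumed here to be one sequence per line)
# Blank lines are skipped, and the file is closed when we're done
with open('data/PF00014-alignment.fasta') as f:
    alignment = [list(line.split()[0]) for line in f if line.strip() and line[0]!='>']
alignment = np.array(alignment)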

# Let's make sure that it looks right
shape = alignment.shape
print('Found %d sequences of length %d' % (shape[0], shape[1]))
print('The first sequence is %s' % ''.join(alignment[0]))

In [None]:
# Transform the sequences into one-hot encoded vectors (#AAs * #sites in length)
num_aas = len(aas)
num_seqs = shape[0]
num_sites = shape[1]

alignment_onehot = np.zeros((num_seqs, num_aas*num_sites))
for i in range(num_seqs):
    for j in range(num_sites):
        alignment_onehot[i, j*num_aas + aas.index(alignment[i, j])] = 1

In [None]:
# Use numpy to compute covariances between amino acids at different sites
cov = np.cov(alignment_onehot.T)

# Score every pair of sites by the amino acid covariance between them
# (the top-scoring pairs will be our predicted contacts)
pred_pairs = []
pred_vals = []
for i in range(num_sites):
    for j in range(i+1, num_sites):
        pred_pairs.append((i, j))
        # Take the submatrix of covariances between AAs at sites i and j,
        # square them, and take the sum
        pred_vals.append(np.sum(cov[num_aas*i:num_aas*(i+1), num_aas*j:num_aas*(j+1)]**2))
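
The score for each pair of sites is the sum of squared entries in the corresponding block of the matrix (the squared Frobenius norm of that block). As a hedged aside, the same quantity can be computed for all pairs at once by reshaping the matrix. The sketch below uses a small random symmetric matrix as a stand-in for `cov`, with made-up dimensions, and checks one pair against the explicit block sum.

```python
import numpy as np

# Toy dimensions and a random symmetric matrix standing in for `cov`
# (illustrative values only, not real alignment data)
toy_sites, toy_aas = 4, 3
rng = np.random.default_rng(0)
m = rng.normal(size=(toy_sites*toy_aas, toy_sites*toy_aas))
toy_cov = m + m.T

# View the matrix as (site i, amino acid a, site j, amino acid b), then
# sum squared entries over both amino acid axes
blocks = toy_cov.reshape(toy_sites, toy_aas, toy_sites, toy_aas)
scores = (blocks**2).sum(axis=(1, 3))

# Check against the explicit block sum for one pair of sites
i, j = 0, 2
loop_score = np.sum(toy_cov[toy_aas*i:toy_aas*(i+1), toy_aas*j:toy_aas*(j+1)]**2)
print(np.isclose(scores[i, j], loop_score))
```

The entries with `i < j` can be pulled out with `np.triu_indices(toy_sites, k=1)`, which visits pairs in the same order as the nested loops.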

In [None]:
# Now, we'll take the top 50 predictions and sort pairs of sites into 
# true positives (correctly predicted contact), false positives (incorrect),
# and other contact sites -- to make it easier for later we'll write a function

def get_contact_predictions(contacts, pred_vals, pred_pairs, n_predictions=50):
    pred_sort = np.argsort(pred_vals)[::-1]
    
    true_positive = []
    false_positive = []
    for i in range(n_predictions):
        pair = pred_pairs[pred_sort[i]]
        if pair in contacts:
            true_positive.append(pair)
        else:
            false_positive.append(pair)
    
    other_contacts = []
    for pair in contacts:
        if (pair not in true_positive) and (pair not in false_positive):
            other_contacts.append(pair)

    return true_positive, false_positive, other_contacts

In [None]:
# How good are predictions from correlations?
true_positive, false_positive, other_contacts = get_contact_predictions(contacts, pred_vals, pred_pairs)
print('%d true positives, %d false positives in top %d predictions' 
      % (len(true_positive), len(false_positive), len(true_positive)+len(false_positive)))

In [None]:
# Here is the contact prediction plot based on correlations
figs.contact_plot_PF00014(other_contacts, true_positive, false_positive)
plt.show()

### Predicting contacts from couplings with the Gaussian approximation

Above, we saw that a simple way to find approximate couplings $J_{ij}(a_i, a_j)$ for the model 

$$
P(a_1, a_2, \ldots, a_L) \propto \exp\left(\sum_i h_i(a_i) + \sum_{i, j}J_{ij}(a_i, a_j)\right)
$$

is to assume that the amino acid variables are real-valued. In this case, the distribution is Gaussian, and the interactions are given (up to a sign and a constant factor) by the elements of the inverse of the covariance matrix of amino acid frequencies.

We've already computed the covariance matrix in the previous section. We can simply take its inverse to test how well the (approximate) couplings $J_{ij}(a_i, a_j)$ predict contact sites.
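
One practical wrinkle is that the covariance matrix of one-hot encoded data cannot be inverted directly. For every sequence, the one-hot entries at each site sum to one, so the covariance matrix has at least one zero eigenvalue per site. The sketch below demonstrates this on a random toy "alignment" (the sizes are arbitrary assumptions, not properties of PF00014) and shows that adding a small multiple of the identity, as in the regularization used in this notebook, makes the matrix invertible.

```python
import numpy as np

# Random toy "alignment": 200 sequences, 5 sites, 4 possible states
# (illustrative values only)
rng = np.random.default_rng(0)
toy_seqs, toy_sites, toy_states = 200, 5, 4
toy_alignment = rng.integers(toy_states, size=(toy_seqs, toy_sites))

# One-hot encode: each site contributes toy_states columns that sum to 1
toy_onehot = np.zeros((toy_seqs, toy_sites*toy_states))
for s in range(toy_seqs):
    for j in range(toy_sites):
        toy_onehot[s, j*toy_states + toy_alignment[s, j]] = 1

toy_cov = np.cov(toy_onehot.T)
dim = toy_sites*toy_states
toy_rank = np.linalg.matrix_rank(toy_cov)
print('Dimension %d, rank %d' % (dim, toy_rank))

# Adding a small multiple of the identity makes every eigenvalue positive
toy_reg = toy_cov + 0.05*np.identity(dim)
min_eig = np.linalg.eigvalsh(toy_reg).min()
print('Smallest eigenvalue after regularization: %.3f' % min_eig)
```

The size of the regularization term is a tunable choice. Larger values make the inverse more stable but pull the result further from the unregularized inverse covariance.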

In [None]:
# Compute the inverse of the covariance matrix (reusing cov from above), with
# small regularization so that the matrix is invertible
icov = np.linalg.inv(cov + 0.05*np.identity(num_aas * num_sites))

# Predict contact sites based on the large entries of icov
pred_pairs = []
pred_vals = []
for i in range(num_sites):
    for j in range(i+1, num_sites):
        pred_pairs.append((i, j))
        # Take the submatrix of inverse covariances (approximate couplings)
        # between AAs at sites i and j, square them, and take the sum
        pred_vals.append(np.sum(icov[num_aas*i:num_aas*(i+1), num_aas*j:num_aas*(j+1)]**2))

In [None]:
# How do these compare to predictions from correlations?
true_positive, false_positive, other_contacts = get_contact_predictions(contacts, pred_vals, pred_pairs)
print('%d true positives, %d false positives in top %d predictions' 
      % (len(true_positive), len(false_positive), len(true_positive)+len(false_positive)))

In [None]:
# That's a lot better. Here is the contact plot:
figs.contact_plot_PF00014(other_contacts, true_positive, false_positive)
plt.show()