<a href="https://colab.research.google.com/github/katarinagresova/GraSR/blob/master/Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%load_ext autoreload
%autoreload 2

# Introduction

The goal of this notebook is to use GraSR ([Xia, Chunqiu, et al., 2022](https://scholar.google.com/scholar_url?url=https://journals.plos.org/ploscompbiol/article%3Fid%3D10.1371/journal.pcbi.1009986&hl=en&sa=T&oi=gsb-gga&ct=res&cd=0&d=12783685357761893768&ei=wJIOZdCBNYvymgH-l7Yo&scisig=AFWwaeYc4sKxYUXjnu-0pjDznxrk)) to generate a structural embedding of a protein.

**Input:** protein structure in .mmcif format  
**Output:** embedding vector with 400 features

The notebook follows the flow of the `get_descriptors()` function in `encode.py`: prepare the data, load the model, and compute the embeddings.

# Setup

In this notebook, we use my fork of the original repository, which carries the adjustments we need. For now, the only change is support for the `.mmcif` format alongside the `.pdb` format. The change is in the `get_ca_coordinate()` function in `encode.py`.

In [2]:
!git clone https://github.com/katarinagresova/GraSR.git
%cd GraSR

Cloning into 'GraSR'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 74 (delta 11), reused 0 (delta 0), pack-reused 49
Receiving objects: 100% (74/74), 66.75 MiB | 32.10 MiB/s, done.
Resolving deltas: 100% (21/21), done.
/content/GraSR


Requirements stated in the repo are:

```
biopython==1.78
numpy==1.19.5
torch==1.1.0
```



In [3]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 33.5 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.81
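
The `.mmcif` support mentioned in the Setup section amounts to choosing a parser by file extension. The sketch below shows one way this could look with Biopython. It is not the code from the fork's `get_ca_coordinate()`, and the helper names `structure_format` and `read_ca_coordinates` are hypothetical.

```python
from pathlib import Path


def structure_format(path):
    """Map a file extension to a structure format name (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix in {".cif", ".mmcif"}:
        return "mmcif"
    if suffix in {".pdb", ".ent"}:
        return "pdb"
    raise ValueError(f"Unsupported structure file: {path}")


def read_ca_coordinates(path):
    """Return C-alpha coordinates of the first model as a list of [x, y, z]."""
    # Imported lazily so that structure_format() works without Biopython.
    from Bio.PDB import MMCIFParser, PDBParser

    if structure_format(path) == "mmcif":
        parser = MMCIFParser(QUIET=True)
    else:
        parser = PDBParser(QUIET=True)
    structure = parser.get_structure("protein", path)
    first_model = next(iter(structure))

    coordinates = []
    for chain in first_model:
        for residue in chain:
            # Skip residues that have no atom named CA (e.g. waters).
            if "CA" in residue:
                coordinates.append(residue["CA"].coord.tolist())
    return coordinates
```

Dispatching on the extension keeps the rest of the pipeline unchanged, because both parsers return the same structure hierarchy of models, chains, residues, and atoms.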


# Prepare data

Download an arbitrary predicted structure from AlphaFold DB (here, UniProt accession O15552).

In [4]:
!wget https://alphafold.ebi.ac.uk/files/AF-O15552-F1-model_v4.cif

--2023-09-23 12:13:39--  https://alphafold.ebi.ac.uk/files/AF-O15552-F1-model_v4.cif
Resolving alphafold.ebi.ac.uk (alphafold.ebi.ac.uk)... 34.149.152.8
Connecting to alphafold.ebi.ac.uk (alphafold.ebi.ac.uk)|34.149.152.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘AF-O15552-F1-model_v4.cif’

AF-O15552-F1-model_     [ <=>                ] 300.99K  --.-KB/s    in 0.004s  

2023-09-23 12:13:40 (70.1 MB/s) - ‘AF-O15552-F1-model_v4.cif’ saved [308216]



`get_raw_feature_tensor()` accepts a list of files as input, so it is ready for batch processing.

In [5]:
from encode import get_raw_feature_tensor

x, ld, am = get_raw_feature_tensor(["AF-O15552-F1-model_v4.cif"])
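
For batch processing, the list of files can be built from a directory instead of being typed by hand. A minimal sketch, assuming all structures sit directly in one folder; the helper `collect_structure_files` and the `structures/` folder name are hypothetical and not part of GraSR.

```python
from pathlib import Path


def collect_structure_files(directory, extensions=(".cif", ".pdb")):
    """Return sorted paths of structure files found directly in `directory`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Usage in this notebook (assumes a hypothetical "structures/" folder):
# x, ld, am = get_raw_feature_tensor(collect_structure_files("structures"))
```

Sorting the paths keeps the order of the rows in the resulting tensors reproducible between runs.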

# Prepare model

The `saved_models/` folder contains 5 models. This example uses only the first one. The original implementation in the `get_descriptors()` function, however, can work with a list of models, in which case the final embedding is the average of the individual embeddings.
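
The averaging step can be illustrated on its own. The sketch below is not the code from `get_descriptors()`; it only shows an element-wise mean over per-model embedding vectors, using plain Python lists so that it runs without extra dependencies. The name `average_embeddings` is hypothetical.

```python
from statistics import fmean


def average_embeddings(embeddings):
    """Element-wise mean of several equally long embedding vectors."""
    if not embeddings:
        raise ValueError("Need at least one embedding.")
    length = len(embeddings[0])
    if any(len(vector) != length for vector in embeddings):
        raise ValueError("All embeddings must have the same length.")
    # zip(*embeddings) groups the i-th feature of every model together.
    return [fmean(feature) for feature in zip(*embeddings)]


# Two toy 3-feature "embeddings" standing in for the outputs of two models.
print(average_embeddings([[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]))  # [1.0, 2.0, 3.0]
```

With NumPy arrays, `np.mean(np.stack(embeddings), axis=0)` computes the same result.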

In [6]:
from encode import load_model

model = load_model("saved_models/grasr_fold0.pkl")
model.eval()

MoCo(
  (encoder_q): Encoder(
    (mlp1): Sequential(
      (0): Conv2d(1, 64, kernel_size=(1, 32), stride=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.01, inplace=True)
    )
    (bilstm): LSTM(64, 64, batch_first=True, bidirectional=True)
    (mlp2): Sequential(
      (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.01, inplace=True)
    )
    (gcl): GraphConvLayer(
      (nonlinear): Sequential(
        (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (Leaky_Relu): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (gcrb_1): GraphConvResBlock(
      (linear): Sequential(
        (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
  

# Compute embeddings

In [10]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = x.to(device)
# `ld` holds the sequence lengths and has to stay on the CPU; see https://github.com/pytorch/pytorch/issues/43227
ld = ld.to(torch.device('cpu'))
am = am.to(device)

In [11]:
%timeit model((x, x, ld, ld, am, am), True)

14.3 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# Move the result to the CPU first: .numpy() fails on CUDA tensors.
y = model((x, x, ld, ld, am, am), True).detach().cpu().numpy()

In [None]:
import pandas as pd

pd.DataFrame(y[0]).describe()

|       | 0         |
|-------|-----------|
| count | 400.0     |
| mean  | -0.00038  |
| std   | 0.050061  |
| min   | -0.195637 |
| 25%   | -0.032454 |
| 50%   | 0.000997  |
| 75%   | 0.029713  |
| max   | 0.174196  |
