<a href="https://colab.research.google.com/github/hbp5181/Linear-Model-uisng-homolog-survey-data/blob/main/future_learning(sequence_to_numbers).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inverse folding with ESM-IF1

The ESM-IF1 inverse folding model is built for predicting protein sequences from their backbone atom coordinates. We provide examples here 1) to sample sequence designs for a given structure and 2) to score sequences for a given structure.

Trained with 12M protein structures predicted by AlphaFold2, the ESM-IF1 model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer, and achieves 51% native sequence recovery on structurally held-out backbones. The model is also trained with span masking to tolerate missing backbone coordinates and therefore can predict sequences for partially masked structures.
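Native sequence recovery, the metric quoted above, is commonly computed as the fraction of positions where the designed residue equals the native one. A minimal sketch under that assumption follows; `sequence_recovery` is a hypothetical helper, not part of the `esm` package, and the two sequences are made up.

```python
def sequence_recovery(native_seq: str, designed_seq: str) -> float:
    """Fraction of positions where the designed residue matches the native residue."""
    if len(native_seq) != len(designed_seq):
        raise ValueError("sequences must have the same length")
    if not native_seq:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native_seq, designed_seq))
    return matches / len(native_seq)


# Toy example (made-up sequences): 3 of 4 positions agree
print(sequence_recovery("ACDE", "ACDF"))  # 0.75
```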

See [GitHub README](https://github.com/facebookresearch/esm/tree/main/examples/inverse_folding) for the complete user guide, and see our [bioRxiv pre-print](https://doi.org/10.1101/2022.04.10.487779) for more details.

## Environment setup (colab)
This step might take up to 10 minutes the first time.

If you are using a local Jupyter environment, skip the Colab setup below and instead create a conda environment from the command line on first use:
```
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip
pip install biotite
pip install git+https://github.com/facebookresearch/esm.git
```

Afterwards, run `conda activate inverse` to activate this environment before starting `jupyter notebook`.

Below is the setup for Colab notebooks.

We recommend using a GPU runtime on Colab (Menu bar -> Runtime -> Change runtime type -> Hardware accelerator -> GPU).
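The Colab cell that follows selects PyTorch Geometric wheels by building an index URL from the installed torch and CUDA versions. The sketch below shows that string construction on hard-coded example versions. The version strings are illustrative and `wheel_index_url` is a hypothetical helper. It also assumes that CPU-only builds, where `torch.version.cuda` is `None`, map to the `cpu` tag.

```python
from typing import Optional


def wheel_index_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the PyG wheel index URL for a given torch and CUDA version."""
    # Drop the local build suffix, e.g. "2.1.0+cu121" -> "2.1.0"
    torch_tag = torch_version.split('+')[0]
    # Assumption: CPU-only builds report no CUDA version and use the "cpu" tag
    cuda_tag = 'cpu' if cuda_version is None else 'cu' + cuda_version.replace('.', '')
    return f"https://data.pyg.org/whl/torch-{torch_tag}+{cuda_tag}.html"


print(wheel_index_url("2.1.0+cu121", "12.1"))
print(wheel_index_url("2.1.0+cpu", None))
```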

In [2]:
# Colab environment setup

# Install the correct version of Pytorch Geometric.
import torch
import os

def format_pytorch_version(version):
  return version.split('+')[0]

TORCH_version = torch.__version__
TORCH = format_pytorch_version(TORCH_version)

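def format_cuda_version(version):
  # torch.version.cuda is None on CPU-only builds; PyG publishes those wheels under "cpu"
  return 'cpu' if version is None else 'cu' + version.replace('.', '')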

CUDA_version = torch.version.cuda
CUDA = format_cuda_version(CUDA_version)

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-cluster -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-spline-conv -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-geometric

# Install esm
!pip install -q git+https://github.com/facebookresearch/esm.git

# Install biotite
!pip install -q biotite

### Verify that pytorch-geometric is correctly installed

If the notebook crashes at the import, the installed versions of torch_geometric and torch_sparse are likely incompatible with the torch version.
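To see which versions are installed before an import crashes the kernel, one option is to query package metadata first. This is a sketch using only the standard library. `report_packages` is a hypothetical helper, and the distribution names passed in are assumed to match the pip names used in the setup cell.

```python
from importlib import metadata


def report_packages(dist_names):
    """Map each distribution name to its installed version, or None if not installed."""
    versions = {}
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Reading metadata does not import the packages, so it cannot trigger the crash itself
for name, version in report_packages(["torch", "torch-geometric", "torch-sparse"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Comparing the reported torch version with the version in the wheel index URL is a quick way to spot a mismatch.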

In [3]:
## Verify that pytorch-geometric is correctly installed
import torch_geometric
import torch_sparse
from torch_geometric.nn import MessagePassing

## Load model
This step takes a few minutes while the model downloads.

**UPDATE**: It is important to set the model in eval mode through `model = model.eval()` to disable random dropout for optimal performance.

In [4]:
import esm
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model = model.eval()

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt" to /root/.cache/torch/hub/checkpoints/esm_if1_gvp4_t16_142M_UR50.pt


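## Extract encoder output as structure representation
The encoder output may also be used as a representation for the structure.

For a set of input coordinates with L amino acids, the ESM-IF1 encoder output has shape L x 512.

The cell below does not call the ESM-IF1 encoder. It runs the `esm-extract` command on FASTA files with the ESM-2 model `esm2_t33_650M_UR50D`, and `--repr_layers 33 --include mean` saves one mean-pooled layer-33 vector per sequence (1280 values for this model) as a `.pt` file.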
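A per-residue representation of shape L x D can be reduced to a single fixed-size vector by averaging over the L positions, which is what a mean representation is. Here is a toy sketch in plain Python, with a made-up 3 x 2 matrix standing in for real model output (`mean_pool` is a hypothetical helper):

```python
def mean_pool(per_residue):
    """Average an L x D list of lists over the L positions, giving a length-D vector."""
    if not per_residue:
        raise ValueError("need at least one residue")
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[d] for row in per_residue) / length for d in range(dim)]


toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # L = 3 residues, D = 2 features
print(mean_pool(toy))  # [3.0, 4.0]
```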

In [7]:
# Arguments: model name, input FASTA file, output directory.
# Each output name below is a directory of .pt files, despite the .fasta suffix.
! esm-extract esm2_t33_650M_UR50D /content/RBD_aa_aligned.fasta \
  coordoutputRBD.fasta --repr_layers 33 --include mean
! esm-extract esm2_t33_650M_UR50D /content/ACE2_aa_modified.fasta \
  coordoutputACE2.fasta --repr_layers 33 --include mean
! esm-extract esm2_t33_650M_UR50D /content/ancestors_unique.fasta \
  coordoutputRBDa.fasta --repr_layers 33 --include mean



Transferred model to GPU
Read /content/RBD_aa_aligned.fasta with 52 sequences
Processing 1 of 3 batches (20 sequences)
Processing 2 of 3 batches (20 sequences)
Processing 3 of 3 batches (12 sequences)
Transferred model to GPU
Read /content/ACE2_aa_modified.fasta with 62 sequences
Processing 1 of 13 batches (5 sequences)
Processing 2 of 13 batches (5 sequences)
Processing 3 of 13 batches (5 sequences)
Processing 4 of 13 batches (5 sequences)
Processing 5 of 13 batches (5 sequences)
Processing 6 of 13 batches (5 sequences)
Processing 7 of 13 batches (5 sequences)
Processing 8 of 13 batches (5 sequences)
Processing 9 of 13 batches (5 sequences)
Processing 10 of 13 batches (5 sequences)
Processing 11 of 13 batches (5 sequences)
Processing 12 of 13 batches (5 sequences)
Processing 13 of 13 batches (2 sequences)
Transferred model to GPU
Read /content/ancestors_unique.fasta with 34 sequences
Processing 1 of 2 batches (20 sequences)
Processing 2 of 2 batches (14 sequences)
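The batch sizes in the log above (20 sequences per batch for one file, 5 for another) are consistent with batching under a token budget rather than a fixed sequence count. The sketch below is a simplified illustration of that idea, not the library's code. `batch_by_token_budget`, the 4096-token budget, the one extra token per sequence, and the sequence lengths are all assumptions chosen for illustration.

```python
def batch_by_token_budget(lengths, toks_per_batch=4096, extra_toks_per_seq=1):
    """Group sequence indices so that (longest padded length) * (batch size) fits the budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, buf, max_len = [], [], 0
    for i in order:
        size = lengths[i] + extra_toks_per_seq
        # Start a new batch if adding this sequence would exceed the padded token budget
        if buf and max(size, max_len) * (len(buf) + 1) > toks_per_batch:
            batches.append(buf)
            buf, max_len = [], 0
        max_len = max(max_len, size)
        buf.append(i)
    if buf:
        batches.append(buf)
    return batches


# Hypothetical lengths: 52 sequences of 200 residues each
print([len(b) for b in batch_by_token_budget([200] * 52)])  # [20, 20, 12]
```

Under these assumptions, longer sequences leave room for fewer of them per batch, which would explain why one file is processed in many small batches.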


In [8]:
# Specify the folders containing the .pt files
folder_paths = ['/content/coordoutputRBD.fasta', '/content/coordoutputACE2.fasta', '/content/coordoutputRBDa.fasta']

# Collect the paths of all .pt files across the folders into one flat list
pt_files = [os.path.join(folder, f) for folder in folder_paths for f in os.listdir(folder) if f.endswith('.pt')]

# Iterate over each .pt file
for file_path in pt_files:
    # Each file holds a dictionary of saved embeddings (not a model); load it onto the CPU
    model_dict = torch.load(file_path, map_location=torch.device('cpu'))
    # Print every value: the sequence label, then the {layer: mean embedding} dictionary
    for key, value in model_dict.items():
        print(value)


RaTG13_MN996532
{33: tensor([ 0.0063, -0.0141, -0.0844,  ..., -0.0330, -0.0599, -0.0377])}
RmYN02_EPI_ISL_412977
{33: tensor([ 0.0337, -0.0104, -0.0443,  ...,  0.0044, -0.0968, -0.0463])}
Rs4084_KY417144
{33: tensor([ 0.0156, -0.0255, -0.0592,  ..., -0.0306, -0.0504, -0.0517])}
WIV1_KF367457
{33: tensor([ 0.0231, -0.0028, -0.0825,  ..., -0.0678, -0.0402, -0.0350])}
Rs4247_KY417148
{33: tensor([ 0.0245, -0.0184, -0.0386,  ..., -0.0501, -0.0798, -0.0673])}
Rf1_DQ412042
{33: tensor([ 0.0473, -0.0279, -0.0498,  ...,  0.0203, -0.1102, -0.0508])}
SARS-CoV-1_SZ13_PC03_AY304487
{33: tensor([ 0.0095, -0.0220, -0.0646,  ..., -0.0386, -0.0439, -0.0491])}
ZC45_MG772933
{33: tensor([ 0.0231, -0.0156, -0.0568,  ...,  0.0006, -0.0960, -0.0460])}
As6526_KY417142
{33: tensor([ 0.0552, -0.0108, -0.0494,  ..., -0.0112, -0.0860, -0.0490])}
SARS-CoV-2_MN908947
{33: tensor([ 0.0224, -0.0117, -0.0591,  ..., -0.0322, -0.0758, -0.0261])}
279-2005_DQ648857
{33: tensor([ 0.0680, -0.0047, -0.0513,  ..., -0.0156, 

In [15]:
# Specify the folders containing the .pt files
folder_paths = ['/content/coordoutputRBD.fasta','/content/coordoutputACE2.fasta',
'/content/coordoutputRBDa.fasta']

formatted_dict = {}

# Iterate over each folder
for folder_path in folder_paths:
    # List all files in the folder with .pt extension
    pt_files = [f for f in os.listdir(folder_path) if f.endswith('.pt')]

    # Iterate over each .pt file in the current folder
    for file_name in pt_files:
        # Construct the full path to the file
        file_path = os.path.join(folder_path, file_name)

        # Load the saved embedding dictionary (not a model) onto the CPU
        model_dict = torch.load(file_path, map_location=torch.device('cpu'))

        # Extract label and tensor values
        label = model_dict['label']
        tensor_values = model_dict['mean_representations'][33].numpy()

        # Convert tensor values to a list
        tensor_values_list = tensor_values.tolist()

        # Create a dictionary for the current file
        file_dict = {label: tensor_values_list}

        # Update the formatted_dict
        formatted_dict.update(file_dict)


# Print a short summary; printing the full dictionary exceeds the notebook's IOPub data rate limit
print(f"Loaded {len(formatted_dict)} embeddings")



In [17]:
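import pandas as pd

csv_file_path = '/content/future learn data.csv'

# The CSV is read without a header row
df = pd.read_csv(csv_file_path, header=None)
name_value_mapping = formatted_dict

# Replace every cell that matches a label with its embedding (a list of floats);
# cells without a match, and missing values, are left unchanged.
# Note: pandas >= 2.1 deprecates DataFrame.applymap in favor of DataFrame.map.
df = df.applymap(lambda x: name_value_mapping.get(x, x) if pd.notna(x) else x)

# Save the modified DataFrame back to a CSV file; each list is written as a bracketed string
df.to_csv('/content/future learn data numnum.csv', index=False)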

In [18]:
import csv

input_csv_file = "/content/future learn data numnum.csv"
output_csv_file = "/content/future learn data num.csv"

# Strip the leading '[' and trailing ']' that the list-to-string conversion left around each embedding
with open(input_csv_file, 'r', newline='') as csv_in, open(output_csv_file, 'w', newline='') as csv_out:
    reader = csv.reader(csv_in)
    writer = csv.writer(csv_out)

    for row in reader:
        cleaned_row = [value.strip('[]') for value in row]
        writer.writerow(cleaned_row)
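
The last two cells can be traced on toy data: labels are swapped for embedding lists, the lists become bracketed strings when written as text, and the brackets are then stripped. Below is a plain-Python sketch with a made-up mapping and row (the real cells use pandas and files on disk):

```python
mapping = {"seqA": [0.1, 0.2], "seqB": [0.3, 0.4]}  # made-up labels and 2-value embeddings
row = ["seqA", "seqB", "1.5"]  # "1.5" has no entry in the mapping

# Step 1: replace labels with their embeddings, leaving unmatched cells unchanged
replaced = [mapping.get(cell, cell) for cell in row]

# Step 2: writing to CSV turns each list into text such as "[0.1, 0.2]"
as_text = [str(cell) for cell in replaced]

# Step 3: strip the surrounding brackets
cleaned = [cell.strip("[]") for cell in as_text]

print(cleaned)  # ['0.1, 0.2', '0.3, 0.4', '1.5']
```

Each embedding ends up as one comma-separated string inside a single CSV cell, so downstream code has to split it again to recover the numbers.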
