<a href="https://colab.research.google.com/github/nohasamir89/noha_project/blob/main/google_colabs/Embed_sequences_using_TM_Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notes:
1. To use TM-Vec and DeepBlast, you need to install TM-Vec, DeepBlast, and the Hugging Face transformers library.
2. You will also need to download the ProtT5-XL-UniRef50 encoder (the large language model that TM-Vec and DeepBlast use), the trained TM-Vec model, and the trained DeepBlast model. The ProtT5-XL-UniRef50 encoder is very large (~11.3 GB), so unless your GPU has more memory than the model needs, you may have to use a CPU runtime on Google Colab.
3. This notebook demonstrates how TM-Vec can be used to embed protein sequences.


<h3>Embedding protein sequences using a trained TM-Vec model</h3>

**1. Install the relevant libraries including tm-vec, the huggingface transformers library, and faiss**

In [1]:
%pip install -q git+https://github.com/tymor22/tm-vec.git
%pip install -q SentencePiece transformers
%pip install faiss-cpu


<b>2. Load the relevant libraries</b>

In [2]:
import torch
from transformers import T5EncoderModel, T5Tokenizer
import re
import gc
import numpy as np
import pandas as pd
from torch.utils.data import Dataset
import faiss
from tm_vec.embed_structure_model import trans_basic_block, trans_basic_block_Config
from tm_vec.tm_vec_utils import featurize_prottrans, embed_tm_vec, encode
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

<b>3. Load the ProtT5-XL-UniRef50 tokenizer and model</b>

In [3]:
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
gc.collect()
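
The ProtT5 tokenizer expects each sequence as space-separated residues, and the Rostlab model card maps the rare residues U, Z, O, and B to X before tokenizing. The sketch below only illustrates that input format. The helper name `prepare_for_prott5` is hypothetical, and TM-Vec's own `featurize_prottrans` and `encode` take raw sequence strings, so you should not need to call anything like this yourself.

```python
import re

def prepare_for_prott5(sequence: str) -> str:
    """Hypothetical helper: format one protein sequence for the ProtT5 tokenizer."""
    # Map rare or ambiguous residues (U, Z, O, B) to the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    # ProtT5 treats every residue as its own token, so separate them with spaces.
    return " ".join(cleaned)

print(prepare_for_prott5("MKTUZA"))  # M K T X X A
```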


<b>3. Move the model to your GPU if one is available, and switch the model to inference mode</b>

In [7]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
model = model.to(device)
model = model.eval()

cpu


<b>4. Download a trained TM-Vec model and its configuration file</b>

In [8]:
!wget -q https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
!wget -q https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
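
Because `wget -q` prints nothing when a download fails, it can help to confirm that both files exist before trying to load them. The following is a minimal sketch that assumes the files were saved to `/content` (the default Colab working directory). The helper `missing_files` is a hypothetical name, not part of TM-Vec.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths that are not regular files or that are empty."""
    missing = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(str(p))
    return missing

# Assumed download locations on Colab; adjust if your working directory differs.
expected = [
    "/content/tm_vec_cath_model.ckpt",
    "/content/tm_vec_cath_model_params.json",
]
print("Missing downloads:", missing_files(expected))
```

An empty list means both files are in place. Any path that is printed should be downloaded again before continuing.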

<b>5. Load a trained TM-Vec model</b>

In [9]:
# TM-Vec model paths (wget saves into the working directory, /content on Colab)
tm_vec_model_cpnt = "/content/tm_vec_cath_model.ckpt"
tm_vec_model_config_path = "/content/tm_vec_cath_model_params.json"

# Load the TM-Vec configuration, then restore the model from its checkpoint
tm_vec_model_config = trans_basic_block_Config.from_json(tm_vec_model_config_path)
model_deep = trans_basic_block.load_from_checkpoint(tm_vec_model_cpnt, config=tm_vec_model_config)
model_deep = model_deep.to(device)
model_deep = model_deep.eval()



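If this cell raises `FileNotFoundError: [Errno 2] No such file or directory: '/content/tm_vec_cath_model_params.json'`, the configuration file from step 4 is not in `/content`. Re-run the download cell and confirm that both files exist before loading the model.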


<b>6. Load or paste some sequences that you would like to embed</b>

In [None]:
!pip install biopython


In [None]:

from Bio import SeqIO
import pandas as pd

fasta_path = "/content/cath-domain-seqs-S35.fa.txt"
records = list(SeqIO.parse(fasta_path, "fasta"))

# Extract domain ID (the part like '1oaiA00') from each header.
# Example header: 'cath|4_3_0|1oaiA00/1-162' --> domain_id = '1oaiA00'
def parse_domain_id(fasta_id):
    # Fasta ID might look like: cath|4_3_0|1oaiA00/1-162
    # Split by '|' to get ['cath','4_3_0','1oaiA00/1-162']
    parts = fasta_id.split('|')
    if len(parts) == 3:
        domain_part = parts[-1]   # e.g. '1oaiA00/1-162'
        # Then split by '/' to remove the residue range, leaving '1oaiA00'
        domain_id = domain_part.split('/')[0]
        return domain_id
    else:
        # If the format differs, just return the entire header
        return fasta_id

domain_ids = [parse_domain_id(rec.id) for rec in records]
sequences = [str(rec.seq) for rec in records]

df_fasta = pd.DataFrame({
    'DomainID': domain_ids,
    'Sequence': sequences
})

print("FASTA DataFrame (first few rows):")
print(df_fasta.head())





<b>7. Embed your sequences using TM-Vec</b>



In [None]:
from Bio import SeqIO
import pandas as pd

# Parse multi-FASTA file
fasta_path = "/content/cath-domain-seqs-S35.fa.txt"
records = list(SeqIO.parse(fasta_path, "fasta"))

def parse_domain_id(fasta_id):
    # Example FASTA ID: 'cath|4_3_0|1oaiA00/1-162'
    parts = fasta_id.split('|')
    if len(parts) == 3:
        domain_part = parts[-1]   # e.g. '1oaiA00/1-162'
        domain_id = domain_part.split('/')[0]
        return domain_id
    else:
        return fasta_id

domain_ids = [parse_domain_id(rec.id) for rec in records]
df_fasta = pd.DataFrame({
    'DomainID': domain_ids,
    'Sequence': [str(rec.seq) for rec in records]
})

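# Load the metadata file.
# CATH list files normally begin with '#' comment lines; comment="#" skips them
# (assumption: this file follows that layout).
metadata_path = "/content/cath-domain-list-S35.txt"
df_metadata = pd.read_csv(metadata_path, sep=r"\s+", header=None, comment="#")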

# Rename columns: domain ID in col 1, class in col 2, architecture in col 3
df_metadata.columns = [
    "DomainID",      # col 1
    "Class",         # col 2
    "Architecture",  # col 3
] + [f"col{i}" for i in range(4, df_metadata.shape[1]+1)]

# Merge the parsed FASTA with metadata
df_merged = pd.merge(df_fasta, df_metadata[['DomainID','Class','Architecture']],
                     on='DomainID', how='inner')

print("Merged data sample:")
print(df_merged.head())
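
An inner merge silently drops any domain that appears in only one of the two files, so it is worth checking how many IDs fail to match. The sketch below uses plain Python sets. The helper name `unmatched_ids` is hypothetical, and the toy IDs `1abcA01` and `2xyzB00` are made up for illustration.

```python
def unmatched_ids(fasta_ids, metadata_ids):
    """Return the IDs found on only one side of the join, as two sorted lists."""
    fasta_set, meta_set = set(fasta_ids), set(metadata_ids)
    only_fasta = sorted(fasta_set - meta_set)  # sequences without metadata
    only_meta = sorted(meta_set - fasta_set)   # metadata rows without a sequence
    return only_fasta, only_meta

# Toy IDs for illustration only.
only_fasta, only_meta = unmatched_ids(["1oaiA00", "1abcA01"], ["1oaiA00", "2xyzB00"])
print(only_fasta)  # ['1abcA01']
print(only_meta)   # ['2xyzB00']
```

In the notebook, the same check could be run as `unmatched_ids(df_fasta['DomainID'], df_metadata['DomainID'])`.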



In [None]:
# Sample 50 sequences from the merged DataFrame
df_sample = df_merged.sample(50, random_state=42)

# Convert the 'Sequence' column to a list
sequences = df_sample['Sequence'].tolist()

# Embed the sequences with TM-Vec's encode(), imported from tm_vec.tm_vec_utils in step 2
encoded_sequences = encode(sequences, model_deep, model, tokenizer, device)

print("encoded_sequences shape:", encoded_sequences.shape)
# Expected shape: (50, embedding_dim), one embedding vector per sequence


In [None]:
from sklearn.manifold import TSNE
import numpy as np

# If encode() returns a torch tensor, convert it to numpy if needed:
if not isinstance(encoded_sequences, np.ndarray):
    encoded_sequences = encoded_sequences.detach().cpu().numpy()

sequence_tsne = TSNE(
    n_components=2,
    learning_rate='auto',
    init='random',
    random_state=42
).fit_transform(encoded_sequences)

# Create a DataFrame for the t-SNE output
sequence_tsne_df = pd.DataFrame(sequence_tsne, columns=["Dim1", "Dim2"])
sequence_tsne_df['Class'] = df_sample['Class'].values
sequence_tsne_df['Architecture'] = df_sample['Architecture'].values

print("t-SNE DataFrame:")
print(sequence_tsne_df.head())
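
scikit-learn's `TSNE` requires `perplexity` to be smaller than the number of samples, and its default is 30. That works for the 50 sequences sampled here but fails for samples of 30 or fewer. The sketch below picks a valid value for any sample size. The helper name `safe_perplexity` and the `n_samples - 1` cap are illustrative choices, not a recommendation from TM-Vec or scikit-learn.

```python
def safe_perplexity(n_samples: int, preferred: float = 30.0) -> float:
    """Return a perplexity strictly below n_samples, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return float(min(preferred, n_samples - 1))

print(safe_perplexity(50))  # 30.0
print(safe_perplexity(10))  # 9.0
```

The result can be passed as `TSNE(perplexity=safe_perplexity(len(encoded_sequences)), ...)`.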


In [None]:
plt.figure(figsize=(6, 5))
sns.scatterplot(data=sequence_tsne_df, x='Dim1', y='Dim2', hue='Class')
plt.title("t-SNE of encoded sequences (colored by Class)")
plt.show()
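
The `Class` column holds numeric CATH class codes, which make for an opaque plot legend. The sketch below maps them to names, assuming the column follows the standard CATH class numbering. The helper name `class_label` is hypothetical.

```python
# Standard CATH class numbering (assumed to match the metadata file).
CATH_CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
    6: "Special",
}

def class_label(code) -> str:
    """Hypothetical helper: map a CATH class code to a readable name."""
    return CATH_CLASS_NAMES.get(int(code), f"Unknown ({code})")

print([class_label(c) for c in (1, 3, 9)])
# ['Mainly Alpha', 'Alpha Beta', 'Unknown (9)']
```

Applying it before plotting, for example with `sequence_tsne_df['Class'].map(class_label)`, would give the legend descriptive entries.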


In [None]:
plt.figure(figsize=(6, 5))
sns.scatterplot(data=sequence_tsne_df, x='Dim1', y='Dim2', hue='Architecture')
plt.title("t-SNE of encoded sequences (colored by Architecture)")
plt.show()


In [None]:
df_merged = pd.merge(df_fasta, df_metadata[['DomainID','Class','Architecture']],
                     on='DomainID', how='inner')

print("Merged DataFrame (first few rows):")
print(df_merged.head())
