## Important Notes:
To compute ProtT5 embeddings on Colab, you may need to upgrade to Colab Pro.
1. ProtT5-XL-UniRef50 has both an encoder and a decoder; for feature extraction we load and use only the encoder.
2. Loading only the encoder roughly halves both the inference time and the GPU memory requirements.
3. To use the ProtT5-XL-UniRef50 encoder, you must install the latest Hugging Face transformers version from its GitHub repo.
4. If you are interested in both the encoder and the decoder, use TFT5Model rather than TFT5EncoderModel.
5. The model can also run on a CPU if no GPU is available (about 20 GB of memory is needed).

<h3>Extracting protein sequence features using the pretrained ProtT5-XL-UniRef50 model</h3>


**1. Load the necessary libraries, including Hugging Face transformers**

In [1]:
!pip install -q SentencePiece transformers

In [71]:
import torch
from transformers import TFT5EncoderModel, T5Tokenizer, T5EncoderModel
import numpy as np
import re
import gc

<b>2. Load the vocabulary and ProtT5-XL-UniRef50 Model</b>

In [72]:
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False )

In [21]:
model = TFT5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50", from_pt=True)             


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFT5EncoderModel: ['decoder.block.5.layer.0.SelfAttention.v.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.0.SelfAttention.o.weight', 'decoder.embed_tokens.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.6.layer.0.layer_norm.weight', 'decoder.block.18.layer.0.SelfAttention.q.weight', 'decoder.block.6.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.0.SelfAttention.q.weight', 'decoder.block.14.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.1.EncDecAttention.o.weight', 'decoder.block.20.layer.1.EncDecAttention.v.weight', 'decoder.block.20.layer.2.DenseReluDense.wi.weight', 'decoder.block.21.layer.1.layer_norm.weight', 'decoder.block.22.layer.1.EncDecAttention.o.weight', 'decoder.block.9.layer.2.layer_norm.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.1.EncDecAttention.k.weight', 'decoder.b

In [7]:
gc.collect()

9

<b>3. Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to (X)</b>

In [81]:
sequences_Example = "MGEVVRLTNSSTGGPVFVYVKDGKIIRMTPMDFDDAVDAPSWKIEARGKTFTPPRKTSIAPYTAGFKSMIYSDLRIPYPMKRKSFDPNGERNPQLRGAGLSKQDPWSDYERISWDEATDIVVAEINRIKHAYGPSAILSTPSSHHMWGNVGYRHSTYFRFMNMMGFTYADHNPDSWEGWHWGGMHMWGFSWRLGNPEQYDLLEDGLKHAEMIVFWSSDPETNSGIYAGFESNIRRQWLKDLGVDFVFIDPHMNHTARLVADKWFSPKIGTDHALSFAIAYTWLKEDSYDKEYVAANAHGFEEWADYVLGKTDGTPKTCEWAEEESGVPACEIRALARQWAKKNTYLAAGGLGGWGGACRASHGIEWARGMIALATMQGMGKPGSNMWSTTQGVPLDYEFYFPGYAEGGISGDCENSAAGFKFAWRMFDGKTTFPSPSNLNTSAGQHIPRLKIPECIMGGKFQWSGKGFAGGDISHQLHQYEYPAPGYSKIKMFWKYGGPHLGTMTATNRYAKMYTHDSLEFVVSQSIWFEGEVPFADIILPACTNFERWDISEFANCSGYIPDNYQLCNHRVISLQAKCIEPVGESMSDYEIYRLFAKKLNIEEMFSEGKDELAWCEQYFNATDMPKYMTWDEFFKKGYFVVPDNPNRKKTVALRWFAEGREKDTPDWGPRLNNQVCRKGLQTTTGKVEFIATSLKNFEEQGYIDEHRPSMHTYVPAWESQKHSPLAVKYPLGMLSPHPRFSMHTMGDGKNSYMNYIKDHRVEVDGYKYWIMRVNSIDAEARGIKNGDLIRAYNDRGSVILAAQVTECLQPGTVHSYESCAVYDPLGTAGKSADRGGCINILTPDRYISKYACGMANNTALVEIEKWDGDKYEIY"
# ProtT5 expects the residues of each sequence to be separated by single spaces
seq_split = [" ".join(sequences_Example)]

In [82]:
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in seq_split]
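The two preprocessing steps above (separating residues with spaces and replacing rare amino acids) can be folded into one helper when several sequences need to be prepared. The following is only a minimal sketch: the function name `prepare_sequence` and the toy input strings are hypothetical and are not part of the notebook or of the transformers library.

```python
import re

def prepare_sequence(sequence: str) -> str:
    """Map rare amino acids (U, Z, O, B) to X and separate residues with spaces."""
    cleaned = re.sub(r"[UZOB]", "X", sequence)
    return " ".join(cleaned)

# Toy inputs for illustration only, not real proteins
prepared = [prepare_sequence(s) for s in ["MGEV", "AUZB"]]
print(prepared)
```

The resulting list has the same shape as `sequences_Example` after this step, so it could be passed to the tokenizer in the same way.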

<b>4. Tokenize and encode the sequences, and load them onto the GPU if possible</b>

In [83]:
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True, return_tensors="tf")

In [84]:
input_ids = ids['input_ids']
attention_mask = ids['attention_mask']

<b>5. Extract the sequences' features and load them onto the CPU if needed</b>

In [85]:
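# Pass the attention mask so that padded positions are ignored when a batch mixes sequence lengths
embedding = model(input_ids, attention_mask=attention_mask)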

In [86]:
embedding = np.asarray(embedding.last_hidden_state)

In [87]:
attention_mask = np.asarray(attention_mask)

<b>6. Remove padding (\<pad\>) and special tokens (\</s\>) that are added by the ProtT5-XL-UniRef50 model</b>

In [88]:
features = [] 
for seq_num in range(len(embedding)):
    seq_len = (attention_mask[seq_num] == 1).sum()
    seq_emd = embedding[seq_num][:seq_len-1]
    features.append(seq_emd)
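To see why slicing up to `seq_len - 1` removes both the closing `</s>` token and any padding, the same loop can be run on small synthetic arrays. This is a sketch under stated assumptions: the arrays below are made-up stand-ins for the model output (hidden size 3 instead of 1024), and padding is assumed to sit on the right of each sequence.

```python
import numpy as np

# Toy stand-in for the model output: 2 sequences, padded length 5, hidden size 3
embedding = np.arange(2 * 5 * 3, dtype=np.float32).reshape(2, 5, 3)

# 1 marks real tokens (residues plus the closing </s>), 0 marks <pad>.
# First sequence: 4 residues + </s>; second: 2 residues + </s> + 2 <pad>.
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])

features = []
for seq_emb, mask in zip(embedding, attention_mask):
    seq_len = int(mask.sum())               # real tokens, including </s>
    features.append(seq_emb[:seq_len - 1])  # keep residues only

print([f.shape for f in features])
```

Each entry of `features` then has one row per residue, so the two toy sequences end up with different lengths.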

In [92]:
print(np.array(features).shape)
np.save('./ProtT5.npy', np.array(features))

(1, 875, 1024)
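With a single sequence, `np.array(features)` is a regular 3-D array. When several sequences of different lengths are processed, the per-sequence arrays no longer share a shape, so one option is to store them individually in a single `.npz` archive. The sketch below assumes toy feature arrays, a hypothetical file name, and hypothetical keys of the form `seq_0`, `seq_1`; none of these come from the notebook.

```python
import os
import tempfile
import numpy as np

# Toy per-residue features for two sequences of different lengths (hidden size 3)
features = [np.ones((4, 3), dtype=np.float32), np.zeros((2, 3), dtype=np.float32)]

# Write to a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "ProtT5_features.npz")
    # One named entry per sequence
    np.savez(out_path, **{f"seq_{i}": feat for i, feat in enumerate(features)})

    # Read the arrays back while the file still exists
    with np.load(out_path) as archive:
        restored = [archive[f"seq_{i}"] for i in range(len(features))]

print([r.shape for r in restored])
```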
