What to do with the local_representations and global_representations #6

rdenise opened this issue Oct 27, 2021 · 6 comments
rdenise commented Oct 27, 2021

Hello everyone,

After using the model, I have two arrays: local_representations and global_representations.

# After parsing the sequences from the FASTA file into 'seqs' and choosing 'seq_len' (e.g. 512) and 'batch_size' (e.g. 32)

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(X, batch_size=batch_size)
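
Inspecting the outputs, this is what I expect to see (a quick check; the exact widths depend on the model):

# local_representations: one feature vector per residue position.
# global_representations: one feature vector per whole sequence.
print(local_representations.shape)   # expecting (n_seqs, seq_len, d_local)
print(global_representations.shape)  # expecting (n_seqs, d_global)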

But now I don't know how to get the GO annotations for my sequences.

All the best

@nadavbra
Owner

Hi @rdenise,

global_representations is an 8943-dimensional vector, where each index corresponds to a specific GO annotation. Unfortunately, through a truly negligent act on our side, we have lost the file mapping these indices to the corresponding GO annotations (so we can't tell which GO annotation each of these probabilities corresponds to...)
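
Even without that file you can still rank the annotation indices per sequence, for whatever that's worth (a minimal sketch, assuming the raw model's global output of one probability per annotation index rather than the concatenated hidden-layer outputs):

import numpy as np

top_k = 10
for i in range(len(seqs)):
    # Highest-scoring annotation indices for sequence i (the GO term names
    # behind these indices are what was lost).
    top_indices = np.argsort(global_representations[i])[::-1][:top_k]
    print(i, top_indices)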

We will probably fix it at some point in the future by simply rerunning the entire pretraining process from scratch (this time without losing any files...). Unfortunately, I cannot guarantee when this will happen.

Notably, the ProteinBERT code itself does work properly, so if you run the entire pretraining from scratch (which should take about one month on a single GPU to reach the same performance as the published model), you should end up with a model that provides GO annotations.

I truly apologize for this...

@yelou2022

yelou2022 commented Sep 20, 2023

Hello @nadavbra,

I have read all the relevant GO annotation discussions on the issues page, but I still have a question I'd like an exact answer to. The current model requires a protein sequence as input, while GO annotations are optional. If I only input protein sequence data, will the resulting global and local representations contain information the model predicted for GO annotations? Or do both representations only extract information from the sequence, with the model not predicting GO annotations from the sequence and adding them to the feature representation?

Flow: seq → local, global

# From ProteinBERT's input encoder: encode_X tokenizes the sequences and
# pairs them with an all-zero annotation array.
import numpy as np

def encode_X(self, seqs, seq_len):
    return [
        tokenize_seqs(seqs, seq_len),
        np.zeros((len(seqs), self.n_annotations), dtype=np.int8)
    ]

encoded_x = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(encoded_x, batch_size=batch_size)

The encode_X method returns the encoded sequences plus a zero array (representing empty GO annotation input).

I read in other answers that ProteinBERT can predict the corresponding GO annotation information from sequence data, which has left me a bit confused.

My goal is to treat the ProteinBERT model as an encoder: input a protein sequence, then obtain the corresponding local and global feature representations for downstream tasks.
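
Concretely, this is roughly how I intend to use them (a sketch; downstream_labels and the classifier choice are placeholders):

from sklearn.linear_model import LogisticRegression

# Use the per-sequence global representations as fixed features for a
# downstream classifier; downstream_labels is a hypothetical array with
# one label per sequence.
clf = LogisticRegression(max_iter=1000)
clf.fit(global_representations, downstream_labels)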

Sincerely,
yelou

@nadavbra
Owner

Hi @yelou2022,

The global features predicted by ProteinBERT are GO annotations, whether or not it receives them as input. If it gets some GO annotations as input, it is easier to predict the other GO annotations and you should expect more accurate predictions, but it will do its best to predict the GO annotations from the sequence even if you don't provide any annotations as input.
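
To make the two modes concrete (a sketch assuming the encode_X output format quoted above; the annotation indices flagged here are hypothetical placeholders):

tokens, annotations = input_encoder.encode_X(seqs, seq_len)

# Mode 1: sequence only -- the annotation array stays all-zero, as encode_X returns it.
local_seq_only, global_seq_only = model.predict([tokens, annotations], batch_size=batch_size)

# Mode 2: flag some GO annotations you already know as input.
annotations_with_go = annotations.copy()
annotations_with_go[:, [42, 100]] = 1  # hypothetical annotation indices
local_with_go, global_with_go = model.predict([tokens, annotations_with_go], batch_size=batch_size)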

I hope this clarifies things.

@yelou2022

Understood, thank you.

@srikrishnan-b

Hi @nadavbra,

The global representation vector I am getting is not of size 8943 as mentioned above; it is a vector of size 15599. Have there been any changes since this was discussed in this thread?

@nadavbra
Owner

nadavbra commented Feb 7, 2024

@srikrishnan-b I suppose that's because it includes the hidden layers via get_model_with_hidden_layers_as_outputs.
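
You can check this by comparing the two variants (a sketch, reusing pretrained_model_generator and seq_len from the snippets above):

# Raw model: the global output has one value per annotation (the 8943 discussed above).
raw_model = pretrained_model_generator.create_model(seq_len)
# Wrapper: concatenates hidden-layer activations, so its global output is wider.
wrapped_model = get_model_with_hidden_layers_as_outputs(raw_model)
print(raw_model.output_shape)
print(wrapped_model.output_shape)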
