What to do with the local_representations and global_representations #6

rdenise opened this issue Oct 27, 2021 · 6 comments
rdenise commented Oct 27, 2021

Hello everyone,

After using the model, I have two arrays: local_representations and global_representations.

# After parsing the sequences from the FASTA file into 'seqs' and choosing 'seq_len' (e.g. 512) and 'batch_size' (e.g. 32)

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(X, batch_size=batch_size)
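
Inspecting the outputs, this is what I expect to see (a quick check; the exact widths depend on the model):

# local_representations: one feature vector per residue position.
# global_representations: one feature vector per whole sequence.
print(local_representations.shape)   # expecting (n_seqs, seq_len, d_local)
print(global_representations.shape)  # expecting (n_seqs, d_global)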

But now I don't know how to get the GO annotations for my sequences.

All the best

@nadavbra
Owner

Hi @rdenise,

global_representations is an 8943-dimensional vector, where each index corresponds to a specific GO annotation. Unfortunately, through a truly negligent act on our side, we have lost the file mapping these indices to the corresponding GO annotations (so we can't tell which GO annotation each of these probabilities corresponds to...)
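
Even without that file you can still rank the annotation indices per sequence, for whatever that's worth (a minimal sketch, assuming the raw model's global output of one probability per annotation index rather than the concatenated hidden-layer outputs):

import numpy as np

top_k = 10
for i in range(len(seqs)):
    # Highest-scoring annotation indices for sequence i (the GO term names
    # behind these indices are what was lost).
    top_indices = np.argsort(global_representations[i])[::-1][:top_k]
    print(i, top_indices)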

We will probably fix it at some point in the future by simply rerunning the entire pretraining process from scratch (this time without losing any files...). Unfortunately, I cannot guarantee when this will happen.

Notably, the ProteinBERT code itself does work properly, so if you run the entire pretraining from scratch (which should take about one month on a single GPU to reach the same performance as the published model), you should end up with a model that provides GO annotations.

I truly apologize for this...

@yelou2022

yelou2022 commented Sep 20, 2023

Hello @nadavbra,

I have read all the relevant GO annotation discussions on the issues page, but I still have a question I'd like an exact answer to. The current model requires a protein sequence as input, while GO annotations are optional. If I only input protein sequence data, will the resulting global and local representations contain information the model predicted for GO annotations? Or do both representations only extract information from the sequence, with the model not predicting GO annotations from the sequence and adding them to the feature representation?

Flow: seq → local, global

# From ProteinBERT's input encoder: encode_X tokenizes the sequences and
# pairs them with an all-zero annotation array.
import numpy as np

def encode_X(self, seqs, seq_len):
    return [
        tokenize_seqs(seqs, seq_len),
        np.zeros((len(seqs), self.n_annotations), dtype=np.int8)
    ]

encoded_x = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(encoded_x, batch_size=batch_size)

The encode_X method returns the encoded sequences plus a zero array (representing empty GO annotation input).

I read in other answers that ProteinBERT can predict the corresponding GO annotation information from sequence data, which has left me a bit confused.

My goal is to treat the ProteinBERT model as an encoder: input a protein sequence, then obtain the corresponding local and global feature representations for downstream tasks.
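
Concretely, this is roughly how I intend to use them (a sketch; downstream_labels and the classifier choice are placeholders):

from sklearn.linear_model import LogisticRegression

# Use the per-sequence global representations as fixed features for a
# downstream classifier; downstream_labels is a hypothetical array with
# one label per sequence.
clf = LogisticRegression(max_iter=1000)
clf.fit(global_representations, downstream_labels)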

Sincerely,
yelou

@nadavbra
Owner

Hi @yelou2022,

The global features predicted by ProteinBERT are GO annotations, whether or not it receives them as input. If it gets some GO annotations as input, it is easier to predict the other GO annotations and you should expect more accurate predictions, but it will do its best to predict the GO annotations from the sequence even if you don't provide any annotations as input.
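
To make the two modes concrete (a sketch assuming the encode_X output format quoted above; the annotation indices flagged here are hypothetical placeholders):

tokens, annotations = input_encoder.encode_X(seqs, seq_len)

# Mode 1: sequence only -- the annotation array stays all-zero, as encode_X returns it.
local_seq_only, global_seq_only = model.predict([tokens, annotations], batch_size=batch_size)

# Mode 2: flag some GO annotations you already know as input.
annotations_with_go = annotations.copy()
annotations_with_go[:, [42, 100]] = 1  # hypothetical annotation indices
local_with_go, global_with_go = model.predict([tokens, annotations_with_go], batch_size=batch_size)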

I hope this clarifies things.

@yelou2022

Understood, thank you.

@srikrishnan-b

Hi @nadavbra,

The global representation vector I am getting is not of size 8943 as mentioned above; it is a vector of size 15599. Have there been any changes since this was discussed in this thread?

@nadavbra
Owner

nadavbra commented Feb 7, 2024

@srikrishnan-b I suppose that's because it includes the hidden layers via get_model_with_hidden_layers_as_outputs.
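
You can check this by comparing the two variants (a sketch, reusing pretrained_model_generator and seq_len from the snippets above):

# Raw model: the global output has one value per annotation (the 8943 discussed above).
raw_model = pretrained_model_generator.create_model(seq_len)
# Wrapper: concatenates hidden-layer activations, so its global output is wider.
wrapped_model = get_model_with_hidden_layers_as_outputs(raw_model)
print(raw_model.output_shape)
print(wrapped_model.output_shape)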
