incorrect m3a format in alphafold #34

Closed
aai97 opened this issue Feb 28, 2020 · 3 comments

@aai97
aai97 commented Feb 28, 2020

When describing the deletion_probability feature for alphafold, you state that you use the a3m format from hhblits.
To quote the hhblits GitHub page on the a3m format:

residues emitted by Match states of the HMM are in upper case, residues emitted by Insert states are in lower case and deletions are written -.

In the a3m format, the sequences of the MSA are not necessarily of equal length. Deletions are denoted by "-", whereas lowercase letters denote insertions and cause the disparities in sequence length:

A B C        Original sequence
A - E        Sequence where residue 2 was deleted and residue 3 was substituted
A d B C      Sequence where a residue d was inserted between residues 1 and 2.
             Note that now the residue B no longer aligns with that of the original sequence.

This means your description of the deletion_probability feature does not hold up: not only should we count "-" rather than lowercase letters if we are looking for deletions, but aligning the residues by column also makes no sense in the raw a3m format, since the lengths don't match.

Assuming that the name deletion_probability is not a misnomer, one has to instead remove all lowercase letters from the a3m MSA and then count the number of "-" per column to obtain the probability of a deletion of a particular residue in the MSA.
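
A minimal sketch of that procedure, assuming the MSA is already available as a list of a3m rows with the header lines stripped. The function name `gap_fraction_per_column` is hypothetical and is not part of alphafold or hhblits:

```python
import string

# Translation table that deletes every lowercase ASCII letter,
# i.e. the residues emitted by Insert states in a3m.
_STRIP_INSERTS = str.maketrans("", "", string.ascii_lowercase)


def gap_fraction_per_column(a3m_rows):
    """Return, for each match column, the fraction of rows with '-' there.

    Hypothetical helper for illustration only.
    """
    if not a3m_rows:
        return []
    # Removing insertions leaves only match columns, so all rows line up.
    aligned = [row.translate(_STRIP_INSERTS) for row in a3m_rows]
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("rows differ in length after removing insertions")
    return [
        sum(row[i] == "-" for row in aligned) / len(aligned)
        for i in range(width)
    ]


# The three sequences from the example above, written without spaces.
rows = ["ABC", "A-E", "AdBC"]
print(gap_fraction_per_column(rows))
```

For the three example sequences, only the second column contains a "-", in one row out of three.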

Is my reasoning here correct, or am I missing something important?

@huhlim
huhlim commented Mar 2, 2020

I agree with your opinion. The description was strange to me for the same reason.

@Augustin-Zidek
Collaborator

Hi, thanks for the feedback. You are right, the description of the deletion_probability feature wasn't correct. I fixed it and added a code snippet showing how we compute the deletion_probability feature to make it clearer.

On the insertion vs deletion comment: I agree that our naming is misleading -- we call them 'deleted' residues because they have to be deleted in order for the sequence to align to the query.
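
To illustrate that naming, here is a rough sketch of counting lowercase residues per query position. It is not the snippet added to the repository. The function name `insertion_fraction_per_column` is hypothetical, and attributing each run of lowercase letters to the match column that follows it is an assumption made purely for illustration:

```python
def insertion_fraction_per_column(a3m_rows):
    """Fraction of rows that have lowercase residues just before each match column.

    Hypothetical helper. ASSUMPTION: a run of lowercase letters is attributed
    to the next match column; lowercase letters after the last match column
    are ignored.
    """
    if not a3m_rows:
        return []
    flags_per_row = []
    for row in a3m_rows:
        flags = []
        pending = False  # True while lowercase residues precede the next match column
        for ch in row:
            if ch.islower():
                pending = True
            else:
                # Uppercase residue or '-': this is a match column.
                flags.append(pending)
                pending = False
        flags_per_row.append(flags)
    width = len(flags_per_row[0])
    if any(len(flags) != width for flags in flags_per_row):
        raise ValueError("rows have different numbers of match columns")
    return [
        sum(flags[i] for flags in flags_per_row) / len(flags_per_row)
        for i in range(width)
    ]


# Same example rows as in the original report, written without spaces.
print(insertion_fraction_per_column(["ABC", "A-E", "AdBC"]))
```

Under this convention the lowercase d in the third row is counted against the second query position, since it has to be removed for B to line up with the query.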

@aai97
Author

aai97 commented Mar 4, 2020

Thank you for the update.
