residues emitted by Match states of the HMM are in upper case, residues emitted by Insert states are in lower case and deletions are written -.
In the a3m format, the sequences of the MSA are not necessarily of equal length, and deletions are denoted by "-", whereas lowercase letters denote insertions and cause the disparities in sequence length:
A B C Original sequence
A - E Sequence where residue 2 was deleted and residue 3 was substituted
A d B C Sequence where a residue d was inserted between residues 1 and 2
Note that the residue B now no longer aligns with that of the original sequence.
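To make the length mismatch concrete, here is a minimal sketch using the three example rows above, written without spaces. The helper name `strip_insertions` is made up for illustration and is not taken from hhblits or any other library.

```python
def strip_insertions(row: str) -> str:
    """Drop lowercase (insert-state) residues from one a3m row.

    Uppercase letters and "-" are kept, so only match columns remain.
    """
    return "".join(ch for ch in row if not ch.islower())


# The three example rows from above, written without spaces.
rows = ["ABC", "A-E", "AdBC"]

# Raw a3m rows differ in length because of the inserted "d".
raw_lengths = [len(row) for row in rows]

# After dropping lowercase letters, every row has the same length again.
aligned = [strip_insertions(row) for row in rows]
aligned_lengths = [len(row) for row in aligned]
```

With the insertion removed, the third row reads `ABC` again, so B sits in the same column as in the original sequence.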
This means your description of the deletion_probability feature makes no sense. Not only should we count "-" rather than lowercase letters if we are looking for deletions, but comparing residues column by column is also not meaningful in the raw a3m format, since the row lengths don't match.
Assuming that the name deletion_probability is not a misnomer, one instead has to remove all lowercase letters from the a3m MSA and then count the number of "-" per column to obtain the probability of a deletion of a particular residue in the MSA.
Is my reasoning here correct, or am I missing something important?
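The procedure I have in mind can be sketched as follows. This is only my reading of what a per-column deletion probability would mean, not the repository's implementation, and the function name `gap_fraction_per_column` is hypothetical.

```python
from typing import List


def gap_fraction_per_column(rows: List[str]) -> List[float]:
    """Fraction of sequences that have "-" in each match column.

    Assumes `rows` are a3m-style sequences (headers already removed).
    Lowercase insertions are dropped first so that columns line up.
    """
    if not rows:
        raise ValueError("need at least one sequence")

    # Step 1: remove lowercase letters so all rows have equal length.
    aligned = ["".join(ch for ch in row if not ch.islower()) for row in rows]

    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("rows still differ in length after removing insertions")

    # Step 2: count "-" per column and normalise by the number of sequences.
    return [
        sum(row[i] == "-" for row in aligned) / len(aligned)
        for i in range(width)
    ]


gap_fractions = gap_fraction_per_column(["ABC", "A-E", "AdBC"])
```

For the three example rows, only the second column contains a "-", in one of the three sequences.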
Hi, thanks for the feedback. You are right, the description of the deletion_probability feature wasn't correct. I fixed it and added a code snippet showing how we compute the deletion_probability feature, to make it clearer.
On the insertion vs deletion comment: I agree that our naming is misleading -- we call them 'deleted' residues because they have to be deleted in order for the sequence to align to the query.
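For readers following along, the naming described above corresponds roughly to the sketch below. It is an illustration under stated assumptions and not the snippet added to the repository: the function names are hypothetical, and attaching each run of lowercase residues to the match column that follows it is an assumed convention.

```python
from typing import List


def lowercase_counts(row: str) -> List[int]:
    """Per match column, count the lowercase residues directly before it.

    These are the residues that would have to be removed ("deleted")
    for the row to align to the query. Trailing lowercase residues
    after the last match column are ignored in this sketch.
    """
    counts = []
    pending = 0
    for ch in row:
        if ch.islower():
            pending += 1
        else:
            counts.append(pending)
            pending = 0
    return counts


def fraction_with_lowercase(rows: List[str]) -> List[float]:
    """Fraction of sequences with at least one lowercase residue per column."""
    per_row = [lowercase_counts(row) for row in rows]
    width = len(per_row[0])
    if any(len(counts) != width for counts in per_row):
        raise ValueError("rows do not share the same number of match columns")
    return [
        sum(counts[i] > 0 for counts in per_row) / len(per_row)
        for i in range(width)
    ]


example_counts = lowercase_counts("AdBC")
example_fractions = fraction_with_lowercase(["ABC", "A-E", "AdBC"])
```

Under this reading the feature measures how often sequences carry extra residues relative to the query, which is a different quantity from the per-column frequency of "-".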
When describing the deletion_probability feature for AlphaFold, you specify that you use the a3m format from hhblits.
To quote hhblits' GitHub on the a3m format:
In the a3m format, the sequences of the MSA are not necessarily of equal length, and deletions are denoted by "-", whereas lowercase letters denote insertions and cause the disparities in sequence length.