Dealing with templates that have missing residues #6

Closed
luhong88 opened this issue Jun 7, 2023 · 3 comments

Comments

luhong88 commented Jun 7, 2023

What is the "proper" way to use a template that has missing residues? Say I have a template with 20 missing residues from an unresolved loop region, and I don't need pLDDT for that region. I tried giving AF2Rank the template as is, together with a sequence in which the gap is represented by "X" tokens, but AF2Rank seems to process the template as a contiguous chain and so raises an error about a sequence length mismatch. In this case, am I supposed to give AF2Rank a sequence with no gap tokens? Alternatively, I know I can treat the template as a dimer and use an alphafold_multimer model, but I'm not sure that scoring a monomeric protein with a multimer model is the best approach.

Owner

jproney commented Jun 10, 2023

Hello! Thanks for the question. The way to handle this scenario is to create a template with the full sequence length, but use the template_all_atom_masks feature to indicate which residues are fully or partially missing from the template. I believe the existing code should be able to handle this scenario, although clearly something is going wrong in this case. The following logic in the score_decoy function is meant to accommodate missing residues so long as the residues that do exist match the target sequence:

  decoy_seq_in = "".join([residue_constants.restypes[x] for x in decoy_prot.aatype]) # the sequence in the decoy PDB file

  mismatch = False
  if decoy_seq_in == target_seq:
    assert jnp.all(decoy_prot.residue_index - 1 == np.arange(len(target_seq)))
  else: # case when template is missing some residues
    if args.verbose:
      print("Sequence mismatch: {}".format(name))
    mismatch=True

    assert "".join(target_seq[i-1] for i in decoy_prot.residue_index) == decoy_seq_in 

Is it one of these assertions that is failing? If so, there could potentially be an issue with the numbering of the residues in the PDB file, which should match their position in the sequence regardless of whether some residues are missing. Any more specific information you can provide on the issue would be helpful. Thanks!
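To make the numbering requirement concrete, here is a minimal standalone sketch of the same check in plain Python. The function name check_template and the toy sequences are hypothetical and not part of AF2Rank; the sketch only assumes that residue numbers are 1-based positions in the full target sequence, as in the snippet above.

```python
def check_template(target_seq, template_seq, residue_index):
    """Validate a gapped template against the full target sequence.

    residue_index holds the 1-based PDB residue number of each resolved residue.
    Returns one boolean per target position: True where the template has a residue.
    (Hypothetical helper for illustration; not an AF2Rank function.)
    """
    if len(template_seq) != len(residue_index):
        raise ValueError("need exactly one residue number per template residue")
    if any(i < 1 or i > len(target_seq) for i in residue_index):
        raise ValueError("residue number falls outside the target sequence")

    # Same idea as the second assertion above: look up each numbered position
    # in the target and require that it spells out the template sequence.
    implied = "".join(target_seq[i - 1] for i in residue_index)
    if implied != template_seq:
        raise ValueError("template residues do not match the target at their numbered positions")

    present = [False] * len(target_seq)
    for i in residue_index:
        present[i - 1] = True
    return present


target = "MKTAYIAKQR"
template = "MKTAKQR"                # residues 4-6 (A, Y, I) are unresolved
numbering = [1, 2, 3, 7, 8, 9, 10]  # numbers skip the gap instead of running 1..7
mask = check_template(target, template, numbering)
```

With contiguous numbering (1 through 7) the same template fails the check, because positions 4 to 7 of the target read "AYIA" rather than "AKQR". That is the kind of failure to expect when a PDB with a gap has been renumbered sequentially.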

Author

luhong88 commented Jun 21, 2023

Okay, I guess there's a higher-level question here. So far I've been running af2rank through a local installation of the colab notebook code (https://colab.research.google.com/github/sokrypton/ColabDesign/blob/main/af/examples/AF2Rank.ipynb#scrollTo=UCUZxJdbBjZt). A quick glance at test_templates.py suggests that it is not quite the same as what's provided with the colab notebook. Should I be switching over to using test_templates.py?

Edit: after reading through the code more carefully, it seems to me that the main differences between test_templates.py and the colab notebook are:

  • test_templates.py directly calls the alphafold module, while the colab notebook goes through colabdesign.
  • test_templates.py cannot take in multimers, while the notebook can.
  • test_templates.py can take in templates with missing residues and will produce a predicted structure with the missing residues filled in through af, but the notebook ignores missing residues.

But in the end, both test_templates.py and the notebook will create fake CB coordinates for glycines, and both have the option to mask template sequence and sidechain atom coordinates. Is this accurate?
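For readers unfamiliar with the "fake CB" step, the sketch below shows one widely used way to place a virtual CB atom from the backbone N, CA and C coordinates using idealized geometry. Whether test_templates.py or the notebook uses exactly these coefficients is an assumption; the function name virtual_cb and the toy coordinates are hypothetical.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def virtual_cb(n, ca, c):
    """Place an idealized CB from backbone N, CA, C coordinates (in angstroms).

    Assumption: commonly used idealized-geometry coefficients, not necessarily
    the exact routine used by AF2Rank or ColabDesign.
    """
    b = _sub(ca, n)         # N -> CA bond vector
    c_vec = _sub(c, ca)     # CA -> C bond vector
    a = _cross(b, c_vec)    # normal to the N-CA-C plane
    return tuple(
        -0.58273431 * a_i + 0.56802827 * b_i - 0.54067466 * c_i + ca_i
        for a_i, b_i, c_i, ca_i in zip(a, b, c_vec, ca)
    )


# Toy glycine backbone: CA at the origin, N-CA = 1.458, CA-C = 1.525,
# N-CA-C angle = 111 degrees.
angle = math.radians(111.0)
n_xyz = (1.458, 0.0, 0.0)
ca_xyz = (0.0, 0.0, 0.0)
c_xyz = (1.525 * math.cos(angle), 1.525 * math.sin(angle), 0.0)

cb_xyz = virtual_cb(n_xyz, ca_xyz, c_xyz)
cb_distance = math.dist(ca_xyz, cb_xyz)  # should be close to a typical CA-CB bond length
```

Because the CB position depends only on the backbone, it can be computed for glycine just as for any other residue, which is what makes a uniform CB-based template representation possible.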

Owner

jproney commented Jul 3, 2023

I think your assessment is accurate, although the notebook can handle templates with missing residues with a few minor modifications! I've put together the following notebook to handle missing residues:

https://colab.research.google.com/drive/1lFg0zem4-dm70JdZEhXJuigZb0NFW3fH?usp=sharing

The notebook contains an example of ranking a template with deleted residues, which should give some insight into how to address this issue. In this new notebook, calling af.predict(pdb=pdb_path, seq=seq) ensures that seq is used as the target sequence. The residues in the template are then correctly indexed and masked within the native sequence, so long as the PDB residues are numbered correctly. See the example in the notebook for reference, but essentially the number associated with each residue in the PDB needs to be its index within the full sequence, so a PDB with unresolved residues will skip some numbers.
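Before passing a gapped PDB to the notebook, it can help to confirm that its numbering really does follow the full sequence. The following is a rough standalone sketch, not part of the notebook: ca_residues, missing_positions and ca_line are hypothetical helpers, and the parser assumes standard fixed-column ATOM records, a single chain, the 20 standard residues, and no insertion codes.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def ca_residues(pdb_text):
    """Return (residue_number, one_letter_code) for every CA ATOM record."""
    residues = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # Columns 18-20 hold the residue name, columns 23-26 the residue number.
            # Non-standard residue names would raise KeyError here.
            residues.append((int(line[22:26]), THREE_TO_ONE[line[17:20]]))
    return residues


def missing_positions(pdb_text, seq):
    """Check PDB numbering against seq; return the 1-based positions with no residue."""
    present = set()
    for number, aa in ca_residues(pdb_text):
        if not 1 <= number <= len(seq) or seq[number - 1] != aa:
            raise ValueError(f"residue {aa}{number} does not match the full sequence")
        present.add(number)
    return [i for i in range(1, len(seq) + 1) if i not in present]


def ca_line(serial, resname, number):
    """Format a minimal CA ATOM record in fixed PDB columns (dummy coordinates)."""
    return "ATOM  {:>5d}  CA  {:>3s} A{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, resname, number, 0.0, 0.0, 0.0
    )


seq = "GASKL"
# Correct: the leucine keeps number 5 even though residues 3 and 4 are unresolved.
good_pdb = "\n".join([ca_line(1, "GLY", 1), ca_line(2, "ALA", 2), ca_line(3, "LEU", 5)])
# Incorrect: the leucine was renumbered to 3, where the full sequence has a serine.
bad_pdb = "\n".join([ca_line(1, "GLY", 1), ca_line(2, "ALA", 2), ca_line(3, "LEU", 3)])

gaps = missing_positions(good_pdb, seq)
```

Here gaps lists the unresolved positions, while calling missing_positions(bad_pdb, seq) raises a ValueError because the sequentially renumbered leucine no longer lines up with the full sequence.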

@sokrypton if all seems well could you integrate this update to the notebook into the ColabFold repo?

jproney closed this as completed Sep 18, 2023