Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neutralise systems with formal charges from the PDB #301

Merged
merged 24 commits into from
Sep 27, 2024

Conversation

Yoshanuikabundi
Copy link
Member

@Yoshanuikabundi Yoshanuikabundi commented Jul 30, 2024

This PR addresses #299 by:

  • Refactoring code that looks up the CCD into a single private method _downloadCCDDefinition() that parses and caches the definition.
  • Adding the private method _downloadFormalCharges() which gets the formal charges of the atoms in a residue from the CCD
  • Using _downloadFormalCharges() in _createForceField(water=True) to provide charges only when the residue is missing from the base force field and the (non-leaving) atom names in the CCD exactly match the atom names in the PDB
  • Adding a test for the above behavior

The refactor avoids re-downloading a CCD entry up to 3 times (once for protonation, once for missing heavy atoms, and a third time for formal charge) and reduces duplicate code. However, it is not essential for the other functionality and can be removed or split into its own PR if desired.

This solution works around the issues discussed in #299 where CCD residues tend to hard-code a single protonation state by only using the CCD formal charge when the force field does not provide the residue.

My reasoning is that there are two cases: first, when the force field includes a residue, it is capable of identifying the protonation state and gives the correct net charge and so it can only introduce regressions to interfere with this. In the second case, when a residue is introduced to the force field by _createForceField(water=True), the atom's partial charge is currently hard-coded to 0 and so using more information to sometimes set that to better values is appropriate.

We can break down the effect of this change (modulo bugs) like so:

  1. If an atom and residue is in amber14-all.xml, behavior is unchanged. Otherwise:
  2. If an atom and residue is in the CCD and the PDB file is consistent with that definition and the atom is charged, this change corrects solvent neutralisation.
  3. If a residue is in the CCD, and the PDB file re-uses identical residue and atom names for the entire residue, and that residue is neutral in the PDB but has a formal charge in the CCD, a regression is introduced. I expect this is very rare.
  4. Otherwise, behavior is unchanged.

I think 2. should be much more common than 3., so this is a step in the right direction. In particular, in the very common case that PDBFixer is used to protonate a residue (with addMissingHydrogens()), this change guarantees that the protonation state will be consistent with the assumed charge - either both will be taken from the CCD, or the assumed charge will be inferred from the protonation state by the force field.

I think something like an offsetCharge argument to addSolvent() and addMembrane() which would allow a user to correct any remaining errors in neutralising charge would be great, but this requires changes to Modeller and so is not included in this PR.

Copy link
Member

@peastman peastman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in getting back to you. Your implementation looks good, and as you say, it's probably better than not doing it.

I'm still a little concerned about residues that can appear in multiple forms. Take a look at the CCD definition of HIS, for example. (I know it's a standard residue, and therefore this code won't be used for it. I'm just using it as an example of what we can probably expect to find in nonstandard residues too.) It can be positive, negative, or neutral depending on which hydrogens are present. The CCD entry describes the positively charged, doubly protonated form. But that isn't the most common form at pH 7.

On the positive side, if PDBFixer just added the hydrogens based on the CCD entry, that's the form it will use. So this does ensure that if PDBFixer adds both hydrogens and solvent, they'll be consistent with each other.

The other thing I'm a bit unsure about is the handling of leaving atoms. You require that no leaving atoms be present. If any of them are, it won't match. Would it be better to make it match whether or not the leaving atoms are present?

Comment on lines 158 to 163
smilesCol = descriptorsData.getAttributeIndex("descriptor")
smiles = None
for row in descriptorsData.getRowList():
if row[typeCol] in ["SMILES", "SMILES_CANONICAL"]:
smiles = row[smilesCol]
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You record the SMILES string but never use it anywhere. What is it for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was using it in a previous iteration. I'll remove it.

@@ -1448,20 +1567,34 @@ def _createForceField(self, newTopology, water):
template = app.ForceField._TemplateData(resName)
forcefield._templates[resName] = template
indexInResidue = {}
# If we can't find formal charges in the CCD, make everything uncharged
formalCharges = defaultdict(lambda: 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or more simply defaultdict(int)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, fixed

Comment on lines 36 to 37
# Capping with GLY because ACE/NME currently breaks addMissingHydrogens
# Adding a cap keeps the protein compact and the addSolvent call quick
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what way does it break it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was running into an issue where if I unconditionally capped the ends of a chain, chains that didn't have missing terminal loops would wind up with unprotonated caps. I mentioned it briefly in #299. I ended up attributing this to a bug in my own code and fixed it by only capping terminals with missing loops, and I've transferred that code over to this PR just now, but I can dig into it further if you like.

Comment on lines 466 to 468
checkCache : bool
If ``False``, attempt to re-download the CCD entry regardless of
what is in the cache. Defaults to ``True``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for including this flag? You never specify it when calling this method, and it's a private internal method not meant to be called by users, so the value is never False.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another leftover from an earlier iteration and I've removed it.

@Yoshanuikabundi
Copy link
Member Author

Yoshanuikabundi commented Aug 8, 2024

Sorry for the delay in getting back to you. Your implementation looks good, and as you say, it's probably better than not doing it.

Yay!

I'm still a little concerned about residues that can appear in multiple forms. Take a look at the CCD definition of HIS, for example. (I know it's a standard residue, and therefore this code won't be used for it. I'm just using it as an example of what we can probably expect to find in nonstandard residues too.) It can be positive, negative, or neutral depending on which hydrogens are present. The CCD entry describes the positively charged, doubly protonated form. But that isn't the most common form at pH 7.

An example of this is the CCD definition of CIT, citric acid. At pH 7, citric acid is most commonly triply deprotonated, but the CCD entry is neutral. I agree it would be nice to handle this case more carefully, both here and in the protonation code.

I can imagine some simple algorithm along the lines of "add 1 formal charge to any atom that has an extra hydrogen bonded to it relative to the CCD, and remove 1 formal charge from any atom that is missing a hydrogen" which might improve the behavior in this PR for "nonstandard" protonation states. This would have pathological behavior on a PDB file that is missing hydrogens though - carbanions aplenty! And if _createForcefield(water=True) is used for any process that commonly happens before protonation this could be very bad. So we'd want some further checks, like only applying this algorithm to certain elements or skipping it if the entire residue, molecule, chain, or file is missing hydrogens. But it's difficult to see how to differentiate in general between PDB files where a proton is missing and PDB files representing chemicals that lack that proton. For instance, we can contrive a case like fluoroformic acid, FCOOH, in which the deprotonated form FCOO- is indistinguishable in a OpenMM topology from the protonated form without hydrogens. We might be able to use the PDB charge column for this?

In general, I don't think this should hold up this PR, because the PR will only introduce regressions if the current behavior was correct by luck. I think detecting alternative forms is a substantial and orthogonal design problem.

EDIT: And actually, even then there's no regression, as the set of atom names will differ so the formal charges won't be applied.

On the positive side, if PDBFixer just added the hydrogens based on the CCD entry, that's the form it will use. So this does ensure that if PDBFixer adds both hydrogens and solvent, they'll be consistent with each other.

This'll need to be factored in if the hydrogen adding system becomes pH aware for CCD sourced residues - the same mechanism will want to be applied here.

The other thing I'm a bit unsure about is the handling of leaving atoms. You require that no leaving atoms be present. If any of them are, it won't match. Would it be better to make it match whether or not the leaving atoms are present?

Yeah I think I misunderstood what the point of leaving atoms were - they're for defining polymer linkages, but I thought they were about alternate protonation states. I'll fix this!

@peastman
Copy link
Member

peastman commented Aug 8, 2024

The test case is failing:

>       assert soluteCharge == numCl - numNa
E       assert -21 == (45 - 67)

@peastman
Copy link
Member

peastman commented Aug 8, 2024

I think you're right: this won't always get the right answer, but it's more likely to get the right answer than the current code, so it's still an improvement.

I think it would be worth also allowing the user to pass a ForceField to addSolvent(), which would be used instead of creating one based on templates. It's hardly any extra code, and it would allow users with custom residues to make sure they're getting the right charge based on the actual force field they'll be using to simulate it. If you prefer though, we can do that in a separate PR.

@peastman
Copy link
Member

Just checking in on this so it doesn't get forgotten. The test case is still failing.

@Yoshanuikabundi
Copy link
Member Author

Sorry Peter, I've been distracted! The failing test was a timeout from code I didn't touch downloading PDB files from the PDB. Hopefully it'll succeed this time - I think this one's good to go.

pdbfixer/pdbfixer.py Outdated Show resolved Hide resolved
@peastman
Copy link
Member

Just checking in again. :) See the one comment above, where if it can't download one definition, it won't return any others either. I think that line needs to be changed?

@Yoshanuikabundi
Copy link
Member Author

Sorry I've been off sick - I agree, and have made that change. Thanks for being so patient!

@peastman
Copy link
Member

Thanks, that looks good. I'm sorry you've been sick!

@peastman peastman merged commit 2744212 into openmm:master Sep 27, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants