
What to do about pdb files when residue index goes above 9999 #3501

Closed
BillSwope opened this issue Mar 4, 2022 · 14 comments · Fixed by #3522

Comments

@BillSwope

BillSwope commented Mar 4, 2022

I am not sure whether this is a bug report or a feature request, but here goes.

Now that we are using pdb files to hold systems of protein+ligands+water+ions, the number of water molecules is often greater than 9999. Common practice is to give each water molecule a unique residue id in the pdb file, but the field is only 4 characters wide in the format. I have seen pdb files produced by Maestro switch after residue id 9999 to a base-36 numbering system, using characters 0-9 and A-Z. The next index after 9999 is A000, then A001, and so on. This may work within the suite of Schrodinger programs, but openmm chokes on this. I noticed, however, that openmm does seem to accept hexadecimal values in this field, so I can convert the Maestro pdb files to work with openmm.

The problem is that when I WRITE out a pdb file from within openmm, it does not convert to hexadecimal. In fact, after residue id 9999, it stays in decimal mode but wraps back to 0 and starts over again with residue id 0, 1, etc. This leaves multiple water molecules with the same residue id, and/or an ion and a water molecule with the same residue id. Consequently, the pdb file that was written out by openmm can't then be read back into openmm.

There are probably a bunch of ways to handle this. You could increment the chain id when you start the sequence over again, or something like that. Is there a way to handle this so that the pdb file openmm writes out after reading one in is consistent with what it just read? Right now, it seems broken if you can't read in a pdb file that was written out by the same code.
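
To make the convention concrete, here is a minimal sketch of the Maestro-style numbering described above (an assumed reading of the behavior, not Schrodinger's documented format; the function names are illustrative):

```python
# Sketch of Maestro-style base-36 residue ids: plain decimal through 9999,
# then the 4-character field jumps to A000 and counts up in base 36.
DIGITS36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_resid(index):
    """Render a residue index into the 4-character resSeq field."""
    if index <= 9999:
        return str(index)
    # Jump ahead so the first post-9999 label is A000 (== 466560 decimal);
    # every later label then contains a letter and is easy to detect.
    value = index - 10000 + int("A000", 36)
    field = ""
    while value:
        value, r = divmod(value, 36)
        field = DIGITS36[r] + field
    if len(field) > 4:
        raise ValueError("residue index overflows the 4-character field")
    return field

def decode_resid(field):
    """Parse the resSeq field; any letter signals base-36 mode."""
    field = field.strip()
    return int(field) if field.isdigit() else int(field, 36)
```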

@peastman
Member

peastman commented Mar 4, 2022

This is one of the really annoying things about PDB format. Here's what the spec says about it.

A model cannot have more than 99,999 atoms. Where the entry does not contain an ensemble of models, then the entry cannot have more than 99,999 atoms. Entries that go beyond this atom limit must be split into multiple entries, each containing no more than the limits specified above.

But in practice, almost no one ever splits their models like that. Instead they try to cram larger models in using a variety of nonstandard methods. The approach used by VMD is to switch over to hex when it passes 99,999. OpenMM supports that method for reading but not writing. A more complicated one is the "hybrid-36" encoding. It uses decimal up to 99,999, then switches over to base 36 with uppercase letters, and then switches to base 36 with lowercase letters.
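
For concreteness, here is a rough sketch of the hybrid-36 idea for the 5-character atom serial field (this follows the published hybrid-36 scheme from cctbx; treat it as a sketch rather than a reference implementation):

```python
# Hybrid-36 for a 5-character field: decimal through 99999, then uppercase
# base 36 starting at A0000, then lowercase base 36 starting at a0000.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def _base36(value, width, digits):
    out = ""
    while value:
        value, r = divmod(value, 36)
        out = digits[r] + out
    return out.rjust(width, digits[0])

def hy36encode(value, width=5):
    if 0 <= value < 10**width:
        return str(value).rjust(width)              # plain decimal
    value -= 10**width
    block = 26 * 36**(width - 1)                    # size of each letter block
    if 0 <= value < block:                          # "A0000" .. "ZZZZZ"
        return _base36(value + 10 * 36**(width - 1), width, DIGITS)
    value -= block
    if 0 <= value < block:                          # "a0000" .. "zzzzz"
        return _base36(value + 10 * 36**(width - 1), width, DIGITS.lower())
    raise ValueError("value out of range for hybrid-36")

print(hy36encode(100000))  # -> "A0000"
```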

The best solution, of course, is that we should all be switching away from PDB format and using PDBx/mmCIF instead! It doesn't have these problems.

But I'd also be open to using the VMD style hex system when writing files. It's nonstandard, but so is what we do right now. And since we already support it when reading files, it's at least consistent.

@BillSwope
Author

I like the hex idea, Peter. Going to a different PDB format is not available to me right now, since I am working with output from Schrodinger's Maestro and they don't support PDBx/mmCIF yet. If there were a robust conversion tool, I would use that. I agree that the new standard is the right solution long term.

One thing: my specific comment was not about the atom indices but about the residue id, which is more constrained, since it is in a 4-character field (max is 9999).

Defaulting to hex after hitting that limit (for either the atom index or residue index fields) would work well enough, in that it would allow us to write out PDB files that we can subsequently read back into openmm.

@peastman
Member

peastman commented Mar 4, 2022

I'm reopening this and marking it as a feature request. Let's plan to support the VMD style hex numbering for both atom and residue ids.

@peastman reopened this Mar 4, 2022
@swails
Contributor

swails commented Mar 4, 2022

VMD's approach can be really frustrating, and I personally hate it. It kicks the can down the road, and it is really only useful if you want to keep track of atom numbers (rather than just assigning something trivial like a simple series from 1 to the number of atoms). If you blow past the hex limit (not unusual, especially for residue counts), I think you get useless ****s, which are neither unique nor numerical nor anything else.

It makes writing parsers needlessly complex: you can't use a simple int() cast on an integer field, and you need to add yet more state when parsing to identify when the sequence has "switched" to hex. The approach I hate the least is to simply cycle the atom counts. It's easy to parse (int() works fine) and easy to print (idx % 10000; see the sketch below).

I don't know of any program except VMD that puts letters in index numbers.
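
For what it's worth, a minimal sketch of the cycling scheme described above (the helper name is illustrative; the widths follow the PDB fixed-column format):

```python
# Cycling: the printed id wraps modulo the field capacity, so plain int()
# parsing always works, at the cost of duplicate ids in large systems.
def cycled_field(index, width):
    return str(index % 10**width).rjust(width)

print(cycled_field(10000, 4))    # 4-char residue field -> "   0"
print(cycled_field(100001, 5))   # 5-char atom serial   -> "    1"
```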

@peastman
Member

peastman commented Mar 4, 2022

Switching to hex does have one clear advantage: it doesn't break CONECT records. Cycling means you get two atoms with the same ID, and then you don't know which one the bond is for. This situation comes up regularly, like in #3343 and #3393.

But I agree, these are all annoying hacks. Schrodinger really needs to get their act together. PDBx/mmCIF has been around for ten years now. It's crazy that they still don't support it.

@jchodera
Member

jchodera commented Mar 4, 2022

@peastman : What would you think of supporting popular "flavors" of PDB reading/writing to enable maximum (but optional) compatibility with the large ecosystem of files produced by Schrodinger, VMD, and Amber? We could have three flavors:

  • default : stick to the PDB spec, raising an exception if the file falls outside it
  • base16 : read/write hexadecimal (base 16) to support AMBER, VMD, etc.
  • hybrid36 : read/write base 36 to support Schrodinger

This has the advantage of being strict by default but allowing folks to interoperate with this large ecosystem with a single optional argument.
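
A sketch of what such a switch might look like (the parse_resid and hy36decode helpers and the flavor values are hypothetical, not an existing OpenMM API; the hybrid-36 arithmetic follows the published cctbx scheme):

```python
def hy36decode(field, width=4):
    # Assumed hybrid-36 decoder for a 4-character residue field.
    if field[0].isdigit():
        return int(field)
    if field[0].isupper():                       # uppercase block starts at 10**width
        return int(field, 36) - 10 * 36**(width - 1) + 10**width
    return (int(field.upper(), 36) - 10 * 36**(width - 1)
            + 10**width + 26 * 36**(width - 1))  # lowercase block follows it

def parse_resid(field, flavor="default"):
    # Hypothetical per-file "flavor" switch, sketched as a standalone helper.
    field = field.strip()
    if flavor == "default":
        if not field.isdigit():
            raise ValueError(
                f"non-numeric residue id {field!r}; the file does not follow "
                "the PDB spec -- try flavor='base16' or flavor='hybrid36'")
        return int(field)
    if flavor == "base16":
        return int(field, 16)
    if flavor == "hybrid36":
        return hy36decode(field)
    raise ValueError(f"unknown flavor {flavor!r}")
```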

@peastman
Member

peastman commented Mar 4, 2022

The more things we try to support, the harder it gets. Right now we check for the presence of a letter, and if we find one, we parse that ID and all subsequent ones as base 16. But if we try to support both hex and hybrid-36, we don't know whether it's base 16 or base 36. And what if it's a lower case letter? For hex, upper case and lower case are equivalent, but for hybrid-36 they mean different things.

You can handle all these things with enough work. But the more complexity you add, the more bugs you're likely to have. Looking through the code, I realized that even our current implementation of reading hex doesn't quite match what VMD does. We simply interpret it as a hex number and use that as the ID. But VMD subtracts 555361 since all numbers up to 99999 are interpreted as decimal. Should I change that to conform to VMD? But perhaps there are other programs out there that simply use hex without the offset?
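
As an illustration, the stateful behavior described above could be sketched like this (a simplified stand-in, not OpenMM's actual reader code):

```python
# Once any id contains a letter, that id and all later ones are parsed
# as base 16, with no VMD-style decimal offset applied.
def parse_ids(fields):
    in_hex = False
    for field in fields:
        field = field.strip()
        if not in_hex and not field.isdigit():
            in_hex = True
        yield int(field, 16 if in_hex else 10)

print(list(parse_ids(["99998", "99999", "186A0", "186A1"])))
# -> [99998, 99999, 100000, 100001]
```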

with a single optional argument

I don't think providing an argument is a viable approach. We have to assume that most people don't know which numbering system their files use.

@swails
Contributor

swails commented Mar 4, 2022

I tried going down this route with ParmEd. I ended up with a multi-thousand-line unmaintainable disaster that I ripped out because I couldn't fix a bug without breaking something.

@jchodera
Member

jchodera commented Mar 5, 2022

I tried going down this route with ParmEd. I ended up with a multi-thousand-line unmaintainable disaster that I ripped out because I couldn't fix a bug without breaking something.

@swails: Do you mean you went down the route of auto-detection/handling?

@peastman: My proposal is much simpler than what you describe: We implement a few simple behaviors for how we interpret and render indices:

  • None (default): If we see non-numeric entries, we raise an exception with a helpful message explaining that the indices behave in a way not allowed by the spec, and that the user should figure out how these indices behave and provide the appropriate optional argument
  • maestro: Follow the Schrodinger convention
  • vmd: Follow the VMD convention

There are just a few conventions we need to support here to provide significant functionality.

Consider the alternative: We're asking people to create their own converters to transform into a PDBx/mm-CIF format. This is a nightmare.

@peastman
Member

peastman commented Mar 5, 2022

You're assuming people know what numbering system their file uses. Very often they won't. You're also assuming there are three well-defined standards and no others, but that isn't a safe assumption either. For example, it's not clear to me that Schrodinger actually uses hybrid-36. I can't find any information saying that, so it's possible it uses a different system (also based on base 36). And then in the ChimeraX documentation I found an option for writing files with yet another numbering system, which they refer to as "amber", in which it steals an extra column beyond what the spec says it should use.

Consider the alternative: We're asking people to create their own converters to transform into a PDBx/mm-CIF format. This is a nightmare.

Who proposed that?

@swails
Contributor

swails commented Mar 5, 2022

@swails: Do you mean you went down the route of auto-detection/handling?

Yes. Requiring parsing calls to specify a scheme requires that the person calling the parse routine knows what kind of numbering they have. Even if you assume most users will know what numbering their PDB files have (color me skeptical), anybody who writes a utility to try to present a simple interface will be forced to pass that complexity on to their users or deal with it themselves.

I found an option for writing files with yet another numbering system which they refer to as "amber", in which it steals an extra column beyond what the spec says it should use.

CHARMM does this, but only residue numbers have an unused column that can be co-opted like this.

@jchodera
Member

jchodera commented Mar 5, 2022

I'm looking to the 80/20 rule here: how can we get 80% of the way there with 20% of the effort?

It's going to be much easier for someone to try a couple of the popular options we provide than to write their own converter to PDBx/mm-CIF. We can provide helpful error messages pointing to where they should look for help if the default reader/writer runs into non-digits or runs out of digits.

I'm not sure what you're envisioning their alternatives are. Lobby Schrodinger to support PDBx/mm-CIF in an upcoming release?

We don't need to cover everything, just some widely used cases, without blowing up complexity with auto-detection.

@BillSwope
Author

Just a couple of comments.

What Schrodinger does is this: after resid 9999 is reached, the next one is A000. Note that there is a gap in the numbering, since A000 is actually 466560 (decimal). The advantage of this is that all subsequent resid fields contain at least one letter, so the switch is easily detectable by isnumeric() returning false. You can do the same thing with hexadecimal if you go to 999A after 9999. This makes the decoding a bit easier: you don't have to track where the change happened to know the mode has switched. (But you do have to know whether you are in base-16 or base-36 mode.) See the sketch below.
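
A minimal sketch of that detection trick (the base argument reflects the one thing the caller still has to know):

```python
# The gap guarantees every post-9999 field contains a letter, so no
# parser state is needed; only the base (16 vs 36) must be supplied.
def decode_field(field, base=36):
    field = field.strip()
    return int(field) if field.isdigit() else int(field, base)

decode_field("9999")            # -> 9999 (plain decimal)
decode_field("A000", base=36)   # -> 466560, the gap described above
decode_field("999A", base=16)   # -> 39322
```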

I agree with Peter, that restarting the resid count at 0 after hitting 9999 has caused me problems when, for example, an ion and a water molecule end up with the same resid.

Right now we have a hybrid approach in that we can READ the VMD-style pdb files, but we don't WRITE them. This means we can't read back and use a pdb file we may have just written out. So, read and write to the standard, or not, but don't do it for one and not the other.

I think a useful approach is John's idea. The default should be to crash with an obnoxious traceback whenever you try to write out a pdb with >9999 residues, and to crash on read if you encounter a non-numeric character in the resid field. Then maybe a nasty message about the user not following the standard, followed by useful information directing them to look at their pdb file, determine the scheme in use, set the appropriate flag, and try again. This educates people that they are in nonstandard (potentially unsupported in the future) territory, and it might result in some pressure from Schrodinger's paying users on them to support the newer standard.

@peastman
Member

peastman commented Mar 7, 2022

If code that used to work starts throwing exceptions, users are going to start screaming at me. At least, John certainly will. :)

I can see the argument that if we can't write a fully compliant PDB file, the default behavior should be to throw an exception. In fact, in most cases that's what I'd be arguing for. But in this case, it's a part of the spec that nearly everyone ignores. All programs routinely write PDB files with more than 9999 residues, and they never do it by splitting it into multiple files, which is what the spec says you're supposed to do. And a lot of people don't even realize this is a limitation of the spec, much less understand the different nonstandard behaviors used by different programs. If we force them to add an argument choosing which nonstandard behavior they want, that will mostly annoy and confuse them.

That's why I'd prefer to just make it use the VMD style automatically. Most programs can already handle it, since VMD is so widely used. It's less likely to produce wrong results than the (also nonstandard) behavior we currently have. And it's consistent with the way we already read files.
