Residue number strange behavior #201

Open
MauriceKarrenbrock opened this issue Apr 16, 2020 · 2 comments

Comments

@MauriceKarrenbrock

Hello,

I found some strange behavior when repairing the 6w02 structure from the wwPDB (this happens with both the PDB and the mmCIF file; the example below uses the PDB file).

When I repair the 6w02 protein, the first lines of the repaired PDB look like this:

ATOM      1  N   SER A9998     -22.468 -28.218   2.850  1.00  0.00           N
...

ATOM      7  N   ASN A9999     -20.179 -26.195   2.240  1.00  0.00           N
...

ATOM     15  N   ALA A   0     -17.786 -23.244   1.425  1.00  0.00           N
...

ATOM     20  N   GLY A   1     -14.267 -21.382   1.160  1.00  0.00           N

As you can see, the first residue is numbered 9998, which then leads to a residue 0 (zero), and this happens for both chain A and chain B. Since the original PDB starts from residue 4, this doesn't make much sense.

Having these two residue 0 entries (one in chain A, one in chain B) causes big problems when handling the protein structure with tools like Biopython.
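A quick way to spot this kind of numbering in a repaired file is to look for places where the residue number goes backwards within a chain. The sketch below uses only the standard library and reads the chain ID and residue number from their fixed columns in ATOM/HETATM records. The function names are hypothetical, insertion codes are ignored, and the sample lines are the ones quoted above.

```python
def residue_numbers(pdb_lines):
    """Yield (chain_id, residue_number) once per residue, in file order."""
    previous = None
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Chain ID is column 22, residue number is columns 23-26 (1-based).
        key = (line[21], int(line[22:26]))
        if key != previous:
            yield key
            previous = key


def find_backward_jumps(pdb_lines):
    """Return (chain, before, after) wherever numbering decreases within a chain."""
    jumps = []
    last_seen = {}
    for chain, number in residue_numbers(pdb_lines):
        if chain in last_seen and number < last_seen[chain]:
            jumps.append((chain, last_seen[chain], number))
        last_seen[chain] = number
    return jumps


sample = [
    "ATOM      1  N   SER A9998     -22.468 -28.218   2.850  1.00  0.00           N",
    "ATOM      7  N   ASN A9999     -20.179 -26.195   2.240  1.00  0.00           N",
    "ATOM     15  N   ALA A   0     -17.786 -23.244   1.425  1.00  0.00           N",
    "ATOM     20  N   GLY A   1     -14.267 -21.382   1.160  1.00  0.00           N",
]

print(find_backward_jumps(sample))  # [('A', 9999, 0)]
```

To check a whole file, pass the open file object instead of `sample`; the function only needs an iterable of lines.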

import pdbfixer
import simtk.openmm.app

# File names must be quoted strings.
input_file_name = 'pdb6w02.pdb'
output_file_name = 'output6w02.pdb'

with open(input_file_name, 'r') as f:
    fixer = pdbfixer.PDBFixer(pdbfile=f)

    # Find and rebuild missing residues, nonstandard residues, and missing atoms.
    fixer.findMissingResidues()
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()

# Write the repaired structure, keeping the chain and residue IDs from the topology.
with open(output_file_name, 'w') as f:
    simtk.openmm.app.PDBFile.writeFile(fixer.topology, fixer.positions, f, keepIds=True)

Both the input and output files are attached as .txt files.

Thank you very much and have a nice day

output6w02.txt
pdb6w02.txt

@peastman
Member

This problem is intrinsic to the PDB format. It only gives four columns for the residue ID, which means a strictly compliant PDB file can never have more than 10,000 residues. It also only gives five columns for the atom ID, so you're limited to 100,000 atoms, and one column for the chain ID, which limits you to 26 chains (since chain IDs are supposed to be upper case letters).
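The limits quoted above follow directly from the field widths. As a minimal sketch (the helper names are hypothetical, and the widths are the ones stated in this comment), a fixed-width decimal field of `w` columns can hold `10 ** w` distinct non-negative values:

```python
# Field widths as stated in the comment above (columns per field).
RESIDUE_ID_WIDTH = 4
ATOM_ID_WIDTH = 5


def decimal_capacity(width):
    """Number of distinct non-negative integers that fit in `width` columns."""
    return 10 ** width


def fits(value, width):
    """True if the decimal text of `value` fits in a field of `width` columns."""
    return len(str(value)) <= width


print(decimal_capacity(RESIDUE_ID_WIDTH))  # 10000
print(decimal_capacity(ATOM_ID_WIDTH))     # 100000
print(fits(9999, RESIDUE_ID_WIDTH))        # True
print(fits(10000, RESIDUE_ID_WIDTH))       # False
```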

Of course, people frequently try to write larger systems to PDB files, so a variety of non-compliant hacks get used to deal with that. Wrapping the IDs back to 0 is one of the more common ones.

The real solution, though, is to write to a PDBx/mmCIF file instead. It's the successor to the PDB format, and it fixes these problems and many others. Just change PDBFile to PDBxFile.

@MauriceKarrenbrock
Author

I see, but as I said, even though I didn't include it in the example, exactly the same thing happens when using the mmCIF/PDBx file. This means the problem is format independent.
In any case it would still make no sense, since the 6w02 protein has only a few hundred residues, and the residues labeled 9998 and 9999 are the first and second ones, not the last ones.

This problem happened only with this specific protein, so I guess it might be a "pathological" case, but it could reveal some kind of sneaky bug.
Since 6w02 is a protein of the SARS-CoV-2 virus, many other researchers could benefit from understanding why pdbfixer behaves like this.
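One arithmetic observation is consistent with the numbers above, though it is a guess and not a confirmed diagnosis. Suppose residues added in front of residue 4 are numbered by counting backwards, and the result is then wrapped modulo 10000. Nothing in this thread confirms that pdbfixer does this internally, and the count of six added residues is inferred from the output, not stated anywhere. Under those assumptions, Python's modulo operator maps -2 and -1 to 9998 and 9999:

```python
FIRST_ORIGINAL_RESIDUE = 4  # from the issue: the deposited chain starts at residue 4
ADDED_RESIDUES = 6          # assumption, inferred from the sequence 9998, 9999, 0, 1, ...


def wrapped_ids(first_original, n_added, modulus=10000):
    """Number n_added residues backwards from first_original, wrapping by modulus.

    Hypothetical helper: it illustrates the arithmetic only and is not taken
    from the pdbfixer source.
    """
    start = first_original - n_added
    # In Python, the result of % takes the sign of the divisor,
    # so -2 % 10000 is 9998 rather than -2.
    return [(start + i) % modulus for i in range(n_added)]


print(wrapped_ids(FIRST_ORIGINAL_RESIDUE, ADDED_RESIDUES))
# [9998, 9999, 0, 1, 2, 3]
```

If this is what happens, it would explain why 9998 and 9999 appear at the start of the chain instead of at the end.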

Here is the .cif file with the exact same problem:
6w02_test_output.txt

Thank you very much for your time and have a nice day.
