# Structure checking tutorial

This tutorial walks through a complete checking analysis of a single structure.
Use .revert_changes() at any time to recover the original structure.

Structure checking is a key step before setting up a protein system for simulations. 
A number of common issues found in structures from the Protein Data Bank may compromise the success of the simulation, or may indicate that longer equilibration procedures are needed.

The biobb_structure_checking modules make it possible to:
- Do basic manipulations on structures (selection of models, chains, alternative locations)
- Detect and fix amide assignments and wrong chirality
- Detect and fix protein backbone issues (missing fragments and atoms, capping)
- Detect and fix missing side-chain atoms
- Add hydrogen atoms according to several criteria
- Detect and classify clashes
- Detect possible SS bonds

biobb_structure_checking modules can also be used from the command line through biobb_structure_checking/bin/check_structure.
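When the command-line tool is driven from a script, it can help to assemble the call as an argument list first. The sketch below only builds that list and does not run anything. The `check_structure` name and the `-i`/`-o` flags are the ones used in the command-line call near the end of this tutorial; the helper name `build_check_structure_cmd` is hypothetical and not part of the package.

```python
import shlex


def build_check_structure_cmd(input_path, output_path, command, options=""):
    """Return the argument list for one check_structure call (hypothetical helper)."""
    cmd = ["check_structure", "-i", input_path, "-o", output_path, command]
    # shlex.split keeps quoted option values together as single arguments
    cmd.extend(shlex.split(options))
    return cmd


cmd = build_check_structure_cmd("6m0j.pdb", "6m0j_fixed.pdb", "ligands", "--remove All")
print(cmd)
```

A list built this way can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.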


In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Installation

#### Basic imports and initialization

In [4]:
import biobb_structure_checking
import biobb_structure_checking.constants as cts
from biobb_structure_checking.structure_checking import StructureChecking
base_dir_path=biobb_structure_checking.__path__[0]
args = cts.set_defaults(base_dir_path,{'notebook':True})


## General help

In [23]:
with open(args['commands_help_path']) as help_file:
    print(help_file.read())
#TODO: prepare a specific help method
# print_help(command)


BioBB's check_structure.py performs MDWeb structure checking set as a command line
utility.

commands:     Help on available commands
command_list: Run all tests from conf file or command line list
checkall:     Perform all checks without fixes
load:         Stores structure on local cache and provides basic statistics

1. System Configuration
sequences 
    Print canonical and structure sequences in FASTA format
models [--select model_num] [--superimpose] [--save_split]
    Detect/Select Models
    --superimpose Superimposes currently selected models
    --save_split Split models as separated output files. 
chains [--select chain_ids | molecule_type]
    Detect/Select Chains
inscodes 
    Detects residues with insertion codes. No fix provided (yet)
altloc [--select occupancy| alt_id | list of res_id:alt_id]
    Detect/Select Alternative Locations
metals [--remove All | None | Met_ids_list | Residue_list]
    Detect/Remove Metals
ligands [--remove All | None | Res_type_list | Residue_

Set the input (PDB or local file; pdb or mmCIF formats allowed) and the output (local file, pdb format).  
Use pdb:pdbid to download the structure from the PDB (RCSB).

In [5]:
base_path = '/home/ilia/Escritorio/Bioinformatics/1_Second_year/BioPhysics/Seminars/Project/PDB/'
args['input_structure_path'] = base_path + '6m0j.pdb'
args['output_format'] = 'pdb'
args['output_structure_path'] = base_path + '6m0j_fixed.pdb'
args['output_structure_path_charges'] = base_path + '6m0j_fixed.pdbqt'
args['keep_canonical'] = False
args['debug'] = False
args['verbose'] = False
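The cell above hard-codes an absolute path from one machine. A more portable variant, sketched below, derives the same entries from a working directory with pathlib. The dictionary keys are the ones used in this tutorial; the helper name `build_io_args` and the `PDB` directory name are assumptions made for illustration.

```python
from pathlib import Path


def build_io_args(work_dir, pdb_id):
    """Return the input/output entries of the args dict (hypothetical helper)."""
    work_dir = Path(work_dir)
    return {
        "input_structure_path": str(work_dir / f"{pdb_id}.pdb"),
        "output_format": "pdb",
        "output_structure_path": str(work_dir / f"{pdb_id}_fixed.pdb"),
        "output_structure_path_charges": str(work_dir / f"{pdb_id}_fixed.pdbqt"),
    }


io_args = build_io_args("PDB", "6m0j")
print(io_args["input_structure_path"])
```

The result could then be merged into the existing settings with `args.update(io_args)`.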

Initializing checking engine, loading structure and showing statistics

In [6]:

st_c = StructureChecking(base_dir_path, args)


Structure /home/ilia/Escritorio/Bioinformatics/1_Second_year/BioPhysics/Seminars/Project/PDB/6m0j.pdb loaded
 Title: crystal structure of sars-cov-2 spike receptor-binding domain bound with ace2
 Experimental method: x-ray diffraction
 Keywords: spike, receptor binding, viral protein-hydrolase complex
 Resolution (A): N.A.

 Num. models: 1
 Num. chains: 2 (A: Protein, E: Protein)
 Num. residues:  878
 Num. residues with ins. codes:  0
 Num. residues with H atoms: 0
 Num. HETATM residues:  87
 Num. ligands or modified residues:  7
 Num. water mol.:  80
 Num. atoms:  6558
Metal/Ion residues found
 ZN A901
 CL A902
Small mol ligands found
NAG A903
NAG A904
NAG A905
NAG A906
NAG E601



#### models
Checks for the presence of models in the structure.  
MD simulations require a single structure. However, some structures (e.g. biounits) are defined as a series of models, and in that case all of them are usually required.  
Use models('--select N') to select model number N for further analysis.
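For intuition about what a model count means: in a PDB-format file each model is opened by a MODEL record, and a file without any MODEL record holds a single model. The sketch below counts them in plain text. It is an illustration only, not how the library does it; `count_models` is a hypothetical name, the sample records are made up, and mmCIF input is not handled.

```python
def count_models(pdb_text):
    """Count MODEL records; a file without any is treated as one model."""
    n_models = sum(1 for line in pdb_text.splitlines() if line.startswith("MODEL"))
    return max(n_models, 1)


# Made-up minimal inputs for illustration
single_model = "ATOM      1  N   SER A  19\nEND\n"
two_models = "MODEL        1\nENDMDL\nMODEL        2\nENDMDL\n"
print(count_models(single_model), count_models(two_models))
```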

In [7]:
st_c.models()

Running models.
1 Model(s) detected
Single model found


#### chains
Checks for chains (also reported by print_stats) and allows selecting one or more.  
MD simulations are usually performed with complete structures. However, the input structure may contain several copies of the system, or additional chains such as peptides or nucleic acids that may be removed.  
Use chains('X,Y') to select chain(s) X and Y to proceed.
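As a rough picture of chain detection, the chain identifier of a PDB-format ATOM or HETATM record sits in column 22. The sketch below collects the identifiers in order of first appearance. Both function names are hypothetical, the records are made up, and the library itself works on a parsed structure rather than on raw text.

```python
def atom_line(serial, name, res_name, chain, res_num):
    """Build the first 26 columns of a PDB ATOM record (enough for this sketch)."""
    return f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_num:>4}"


def chain_ids(pdb_text):
    """Return chain identifiers in order of first appearance."""
    seen = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain = line[21]  # column 22 in 1-based PDB numbering
            if chain not in seen:
                seen.append(chain)
    return seen


# Made-up records for two chains
pdb_text = "\n".join([
    atom_line(1, "N", "SER", "A", 19),
    atom_line(2, "CA", "SER", "A", 19),
    atom_line(3, "N", "THR", "E", 333),
])
print(chain_ids(pdb_text))
```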

In [8]:
st_c.chains()


Running chains.
2 Chain(s) detected
 A: Protein
 E: Protein


#### altloc
Checks for the presence of residues with alternative locations. Atoms with alternative coordinates and their occupancy are reported.  
MD simulations require a single position for each atom.  
Use altloc('occupancy | alt_ids | list of res:id') to select the alternative.
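Selection by occupancy can be pictured as keeping, for every atom, the alternative label with the highest occupancy. The sketch below does that on a plain dictionary. The function name is hypothetical, and the tie-breaking rule (first label in alphabetical order) is an assumption of this sketch; the library may resolve ties differently.

```python
def select_by_occupancy(alt_locations):
    """Pick, for each atom, the alternative label with the highest occupancy.

    alt_locations maps atom name -> {alt label: occupancy}.
    Ties go to the alphabetically first label (assumption of this sketch).
    """
    chosen = {}
    for atom, occupancies in alt_locations.items():
        chosen[atom] = min(occupancies, key=lambda label: (-occupancies[label], label))
    return chosen


# Equal occupancies, as reported for the residues in this structure
equal = {"CA": {"A": 0.50, "B": 0.50}, "CB": {"A": 0.50, "B": 0.50}}
# Made-up unequal case
unequal = {"CG": {"A": 0.30, "B": 0.70}}
print(select_by_occupancy(equal))
print(select_by_occupancy(unequal))
```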


In [9]:
st_c.altloc()

Running altloc.
Detected 2 residues with alternative location labels
HIS A228
  CA   A (0.50) B (0.50)
  CB   A (0.50) B (0.50)
  CG   A (0.50) B (0.50)
  ND1  A (0.50) B (0.50)
  CD2  A (0.50) B (0.50)
  CE1  A (0.50) B (0.50)
  NE2  A (0.50) B (0.50)
GLN E493
  CA   A (0.50) B (0.50)
  CB   A (0.50) B (0.50)
  CG   A (0.50) B (0.50)
  CD   A (0.50) B (0.50)
  OE1  A (0.50) B (0.50)
  NE2  A (0.50) B (0.50)


We need to choose one of the alternative forms for each residue

In [10]:
st_c.altloc('occupancy')

Running altloc. Options: occupancy
Detected 2 residues with alternative location labels
HIS A228
  CA   A (0.50) B (0.50)
  CB   A (0.50) B (0.50)
  CG   A (0.50) B (0.50)
  ND1  A (0.50) B (0.50)
  CD2  A (0.50) B (0.50)
  CE1  A (0.50) B (0.50)
  NE2  A (0.50) B (0.50)
GLN E493
  CA   A (0.50) B (0.50)
  CB   A (0.50) B (0.50)
  CG   A (0.50) B (0.50)
  CD   A (0.50) B (0.50)
  OE1  A (0.50) B (0.50)
  NE2  A (0.50) B (0.50)
Selecting location occupancy


In [9]:
st_c.altloc()

Running altloc.
No residues with alternative location labels detected


#### metals
Detects HETATM residues that are metal ions and allows removing them selectively.  
To remove them use metals('All | None | metal_type list | residue list')

In [11]:
st_c.metals()

Running metals.
1 Metal ions found
  ZN A901.ZN 


#### ligands
Detects HETATM residues (excluding water molecules) and allows removing them selectively.  
To remove them use ligands('All | None | Residue List (by id, by num)')
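In the outputs of this tutorial, every non-water HETATM residue counts as a ligand, and metal ions are a subset of those. The sketch below reproduces that grouping from residue names alone. The name sets are short illustrative lists chosen for this example, and `hetatm_groups` is a hypothetical helper, not the library's classifier.

```python
WATER_NAMES = {"HOH", "WAT"}                 # illustrative, not exhaustive
METAL_NAMES = {"ZN", "MG", "FE", "MN", "CU"}  # illustrative, not exhaustive


def hetatm_groups(res_names):
    """Group HETATM residue names into water, ligands and metals."""
    groups = {"water": [], "ligands": [], "metals": []}
    for name in res_names:
        if name in WATER_NAMES:
            groups["water"].append(name)
            continue
        groups["ligands"].append(name)  # every non-water HETATM residue
        if name in METAL_NAMES:
            groups["metals"].append(name)
    return groups


groups = hetatm_groups(["ZN", "CL", "NAG", "NAG", "NAG", "NAG", "NAG", "HOH"])
print(len(groups["ligands"]), groups["metals"], len(groups["water"]))
```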


In [12]:
st_c.ligands()

Running ligands.
7 Ligands detected
  ZN A901
  CL A902
 NAG A903
 NAG A904
 NAG A905
 NAG A906
 NAG E601


In [13]:
st_c.ligands('All')

Running ligands. Options: All
7 Ligands detected
  ZN A901
  CL A902
 NAG A903
 NAG A904
 NAG A905
 NAG A906
 NAG E601
Ligands removed All (7)


In [14]:
st_c.ligands()

Running ligands.
No ligands found


#### rem_hydrogen
Detects and removes hydrogen atoms.  
MD setup can be done with the original H atoms; however, removing them is safer because it avoids problems with non-standard labelling.  
To remove them use rem_hydrogen('yes')


In [15]:
st_c.rem_hydrogen()

Running rem_hydrogen.
No residues with Hydrogen atoms found


#### water
Detects water molecules and allows removing them.  
Crystallographic water molecules may be relevant for keeping the structure; however, in most cases only some of them are required. These can be added later using other methods (titration) or manually.

To remove water molecules use water('yes')


In [16]:
st_c.water()

Running water.
80 Water molecules detected


In [17]:
st_c.water("yes")

Running water. Options: yes
80 Water molecules detected
80 Water molecules removed


#### amide
Amide terminal atoms in Asn and Gln residues can be labelled incorrectly.  
amide suggests possible fixes by checking the surrounding environment.

To fix use amide('All | None | residue_list')

Note that the inversion of amide atoms may trigger additional contacts.
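Inverting an amide amounts to exchanging the positions of its terminal O and N atoms (OD1/ND2 in Asn, OE1/NE2 in Gln). The sketch below shows that exchange on a dictionary of coordinates. The coordinates are made up and `flip_amide` is a hypothetical helper, not the library's routine.

```python
AMIDE_PAIRS = {"ASN": ("OD1", "ND2"), "GLN": ("OE1", "NE2")}


def flip_amide(res_name, coords):
    """Return a copy of coords with the amide O and N positions exchanged."""
    o_name, n_name = AMIDE_PAIRS[res_name]
    flipped = dict(coords)
    flipped[o_name], flipped[n_name] = coords[n_name], coords[o_name]
    return flipped


# Made-up coordinates for one Asn side chain
asn = {"CG": (0.0, 0.0, 0.0), "OD1": (1.2, 0.0, 0.0), "ND2": (-0.7, 1.1, 0.0)}
flipped = flip_amide("ASN", asn)
print(flipped["OD1"], flipped["ND2"])
```

Applying the flip twice restores the original labelling, which is why fixing the same residue a second time undoes the first fix.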

In [18]:
st_c.amide()

Running amide.
7 unusual contact(s) involving amide atoms found
 LYS A31.NZ   GLN E493.NE2    2.926 A
 GLN A42.NE2  GLN E498.NE2    2.927 A
 ASN A103.OD1 ASN A194.OD1    2.807 A
 ASN A134.OD1 GLU A140.OE2    2.785 A
 ASN A134.ND2 ASN A137.N      3.082 A
 GLU A150.O   ASN A154.OD1    2.895 A
 ARG E357.NH1 ASN E394.ND2    2.963 A


Fix all amide residues and recheck

In [19]:
st_c.amide('all')

Running amide. Options: all
7 unusual contact(s) involving amide atoms found
 LYS A31.NZ   GLN E493.NE2    2.926 A
 GLN A42.NE2  GLN E498.NE2    2.927 A
 ASN A103.OD1 ASN A194.OD1    2.807 A
 ASN A134.OD1 GLU A140.OE2    2.785 A
 ASN A134.ND2 ASN A137.N      3.082 A
 GLU A150.O   ASN A154.OD1    2.895 A
 ARG E357.NH1 ASN E394.ND2    2.963 A
Amide residues fixed all (8)
Rechecking
4 unusual contact(s) involving amide atoms found
 GLN A42.OE1  GLN E498.OE1    2.927 A
 ASN A103.ND2 ASN A194.ND2    2.807 A
 ARG E357.NH1 ASN E394.ND2    3.022 A
 ASN E394.OD1 GLU E516.OE2    2.850 A


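Comparing both checks, it becomes clear that GLN A42, GLN E498, ASN A103 and ASN A194 get new contacts because both residues of each pair were flipped, and that ASN E394 is worse because it now has two contacts. The next cells flip A42 and A103 again (one residue from each pair) and then E394.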
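The comparison between the two checks can be made explicit by reducing each contact to its pair of residues and using set operations. The contact lists below are copied from the two amide outputs above; the helper names are hypothetical.

```python
def residue_pairs(contacts):
    """Reduce each atom-level contact to an unordered pair of residues."""
    return {frozenset(atom.split(".")[0] for atom in pair) for pair in contacts}


def as_sorted(pairs):
    """Turn a set of frozensets into a sorted list of tuples for printing."""
    return sorted(tuple(sorted(pair)) for pair in pairs)


before = residue_pairs([
    ("LYS A31.NZ", "GLN E493.NE2"),
    ("GLN A42.NE2", "GLN E498.NE2"),
    ("ASN A103.OD1", "ASN A194.OD1"),
    ("ASN A134.OD1", "GLU A140.OE2"),
    ("ASN A134.ND2", "ASN A137.N"),
    ("GLU A150.O", "ASN A154.OD1"),
    ("ARG E357.NH1", "ASN E394.ND2"),
])
after = residue_pairs([
    ("GLN A42.OE1", "GLN E498.OE1"),
    ("ASN A103.ND2", "ASN A194.ND2"),
    ("ARG E357.NH1", "ASN E394.ND2"),
    ("ASN E394.OD1", "GLU E516.OE2"),
])

print("resolved:", as_sorted(before - after))
print("persisting:", as_sorted(before & after))
print("new:", as_sorted(after - before))
```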

In [20]:
st_c.amide('A42,A103')

Running amide. Options: A42,A103
4 unusual contact(s) involving amide atoms found
 GLN A42.OE1  GLN E498.OE1    2.927 A
 ASN A103.ND2 ASN A194.ND2    2.807 A
 ARG E357.NH1 ASN E394.ND2    3.022 A
 ASN E394.OD1 GLU E516.OE2    2.850 A
Amide residues fixed A42,A103 (2)
Rechecking
2 unusual contact(s) involving amide atoms found
 ARG E357.NH1 ASN E394.ND2    3.022 A
 ASN E394.OD1 GLU E516.OE2    2.850 A


In [21]:
st_c.amide('E394')

Running amide. Options: E394
2 unusual contact(s) involving amide atoms found
 ARG E357.NH1 ASN E394.ND2    3.022 A
 ASN E394.OD1 GLU E516.OE2    2.850 A
Amide residues fixed E394 (1)
Rechecking
1 unusual contact(s) involving amide atoms found
 ARG E357.NH1 ASN E394.ND2    2.963 A


#### chiral
Side chains of Thr and Ile are chiral; incorrect atom labelling leads to the wrong chirality.  
To fix use chiral('All | None | residue_list')
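One common way to test the handedness of a chiral centre is the sign of a signed volume (a scalar triple product) built from three of its substituents, for example CA, OG1 and CG2 around CB in Thr. Exchanging two atom labels changes the sign. The sketch below demonstrates this with made-up coordinates; it does not claim which sign corresponds to the correct chirality, and it is not necessarily the criterion the library applies.

```python
def signed_volume(center, a, b, c):
    """Scalar triple product (a - center) . ((b - center) x (c - center))."""
    ax, ay, az = (a[i] - center[i] for i in range(3))
    bx, by, bz = (b[i] - center[i] for i in range(3))
    cx, cy, cz = (c[i] - center[i] for i in range(3))
    return (ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx))


# Made-up coordinates: a centre with three substituents along the axes
center = (0.0, 0.0, 0.0)
sub1, sub2, sub3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

original = signed_volume(center, sub1, sub2, sub3)
swapped = signed_volume(center, sub1, sub3, sub2)  # two labels exchanged
print(original, swapped)
```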

In [22]:
st_c.chiral()

Running chiral.
No residues with incorrect side-chain chirality found


#### backbone
Detects and fixes several problems with the backbone.  
Use any of:
- --fix_atoms All|None|Residue List
- --fix_chain All|None|Break list
- --add_caps All|None|Terms|Breaks|Residue list
- --no_recheck
- --no_check_clashes
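The missing-atom part of the backbone check can be pictured as comparing the atoms present in a residue with the expected backbone atoms, where a C-terminal residue is also expected to carry OXT. The sketch below is an illustration under that assumption; `missing_backbone_atoms` is a hypothetical helper and the atom sets are made up.

```python
BACKBONE_ATOMS = ("N", "CA", "C", "O")


def missing_backbone_atoms(atom_names, is_c_terminal=False):
    """List the expected backbone atoms that are absent from a residue."""
    expected = BACKBONE_ATOMS + (("OXT",) if is_c_terminal else ())
    return [name for name in expected if name not in atom_names]


# Made-up atom sets: a C-terminal Asp without OXT, and a Gly lacking its O
asp_atoms = {"N", "CA", "C", "O", "CB", "CG", "OD1", "OD2"}
gly_atoms = {"N", "CA", "C"}
print(missing_backbone_atoms(asp_atoms, is_c_terminal=True))
print(missing_backbone_atoms(gly_atoms))
```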


In [23]:
st_c.backbone()

Running backbone.
2 Residues with missing backbone atoms found
 ASP A615   OXT
 GLY E526   OXT
No backbone breaks
No unexpected backbone links


In [24]:
st_c.backbone('--fix_atoms All --fix_chain none --add_caps none')

Running backbone. Options: --fix_atoms All --fix_chain none --add_caps none
2 Residues with missing backbone atoms found
 ASP A615   OXT
 GLY E526   OXT
No backbone breaks
No unexpected backbone links
Capping terminal ends
True terminal residues:  A19,A615,E333,E526
No caps added
Fixing missing backbone atoms
Adding missing backbone atoms
ASP A615
  Adding new atom OXT
GLY E526
  Adding new atom OXT
Fixed 2 backbone atom(s)
Checking for steric clashes
No severe clashes detected
No apolar clashes detected
No polar_acceptor clashes detected
No polar_donor clashes detected
No positive clashes detected
No negative clashes detected


#### fixside
Detects and rebuilds missing protein side chains.   
To fix use fixside('All | None | residue_list')

In [24]:
st_c.fixside()

Running fixside.
No residues with missing or unknown side chain atoms found


#### getss
Detects possible -S-S- bonds based on distance criteria.
A proper simulation requires those bonds to be correctly set. Use getss('All | None | residueList') to mark them.
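A distance criterion for disulfide bonds can be sketched as a pairwise scan over the SG atoms of cysteine residues. The S-S bond length in a disulfide is about 2.05 Å; the 2.5 Å cutoff, the coordinates and the function name below are assumptions of this sketch, not the values or code used by the library.

```python
import math
from itertools import combinations


def possible_ss_bonds(sg_atoms, max_dist=2.5):
    """Return (res_a, res_b, distance) for SG pairs closer than max_dist.

    sg_atoms maps a residue id to the (x, y, z) position of its SG atom.
    max_dist is an assumed cutoff in angstroms.
    """
    bonds = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_atoms.items()), 2):
        dist = math.dist(xyz_a, xyz_b)
        if dist <= max_dist:
            bonds.append((res_a, res_b, round(dist, 3)))
    return bonds


# Made-up SG coordinates: two bonded cysteines and one far away
sg_atoms = {"CYS 1": (0.0, 0.0, 0.0), "CYS 2": (2.05, 0.0, 0.0), "CYS 3": (10.0, 0.0, 0.0)}
print(possible_ss_bonds(sg_atoms))
```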

In [25]:
st_c.getss()

Running getss.
7 Possible SS Bonds detected
 CYS A133.SG  CYS A141.SG     4.237
 CYS A344.SG  CYS A361.SG     4.159
 CYS A530.SG  CYS A542.SG     4.095
 CYS E336.SG  CYS E361.SG     4.152
 CYS E379.SG  CYS E432.SG     4.177
 CYS E391.SG  CYS E525.SG     4.191
 CYS E480.SG  CYS E488.SG     4.269


In [26]:
st_c.getss('all')

Running getss. Options: all
7 Possible SS Bonds detected
 CYS A133.SG  CYS A141.SG     4.237
 CYS A344.SG  CYS A361.SG     4.159
 CYS A530.SG  CYS A542.SG     4.095
 CYS E336.SG  CYS E361.SG     4.152
 CYS E379.SG  CYS E432.SG     4.177
 CYS E391.SG  CYS E525.SG     4.191
 CYS E480.SG  CYS E488.SG     4.269


#### add_hydrogen
Adds hydrogen atoms. Available modes:
- Auto: std changes at pH 7.0. His->Hie
- pH: set pH value
- list: explicit list as [*:]HisXXHid
- Interactive[_his]: prompts for all selectable residues

Fixes missing side chain atoms unless --no_fix_side is set.  
Existing hydrogen atoms are removed before adding new ones unless --keep_h is set.
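A simple way to reason about which ionizable side chains carry a charge at a given pH is to compare the pH with a model pKa: below the pKa the group is protonated, above it the group is deprotonated. The sketch below applies that rule. The pKa values are approximate textbook numbers (real values shift with the local environment), `side_chain_charge` is a hypothetical helper, and this is not presented as the algorithm the library uses.

```python
# Approximate model pKa values (assumption of this sketch)
MODEL_PKA = {"ASP": 3.9, "GLU": 4.3, "HIS": 6.0, "CYS": 8.3, "LYS": 10.5, "ARG": 12.5}
# Acidic groups are neutral when protonated and negative otherwise
ACIDS = {"ASP", "GLU", "CYS"}


def side_chain_charge(res_name, ph):
    """Return the formal side-chain charge predicted by the model pKa."""
    protonated = ph < MODEL_PKA[res_name]
    if res_name in ACIDS:
        return 0 if protonated else -1
    # Basic groups are positive when protonated and neutral otherwise
    return 1 if protonated else 0


charges = {name: side_chain_charge(name, 7.0) for name in MODEL_PKA}
print(charges)
```

With these values, histidine comes out neutral at pH 7.0, which is consistent with the His->Hie default described above.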

In [28]:
st_c.add_hydrogen()

Running add_hydrogen.
226 Residues requiring selection on adding H atoms
 CYS A261,A498
 ASP A30,A38,A67,A111,A136,A157,A198,A201,A206,A213,A216,A225,A269,A292,A295,A299,A303,A335,A350,A355,A367,A368,A382,A427,A431,A471,A494,A499,A509,A543,A597,A609,A615,E364,E389,E398,E405,E420,E427,E428,E442,E467
 GLU A22,A23,A35,A37,A56,A57,A75,A87,A110,A140,A145,A150,A160,A166,A171,A181,A182,A189,A197,A208,A224,A227,A231,A232,A238,A310,A312,A329,A375,A398,A402,A406,A430,A433,A435,A457,A467,A479,A483,A489,A495,A527,A536,A549,A564,A571,A589,E340,E406,E465,E471,E484,E516
 HIS A34,A195,A228,A239,A241,A265,A345,A373,A374,A378,A401,A417,A493,A505,A535,A540,E519
 LYS A26,A31,A68,A74,A94,A112,A114,A131,A174,A187,A234,A247,A288,A309,A313,A341,A353,A363,A416,A419,A441,A458,A465,A470,A475,A476,A481,A534,A541,A553,A562,A577,A596,A600,E356,E378,E386,E417,E424,E444,E458,E462
 ARG A115,A161,A169,A177,A192,A204,A219,A245,A273,A306,A357,A393,A460,A482,A514,A518,A559,A582,E346,E355,E357,E403,E408,E454,E457,E466,E509

In [29]:
st_c.add_hydrogen('auto')

Running add_hydrogen. Options: auto
226 Residues requiring selection on adding H atoms
 CYS A261,A498
 ASP A30,A38,A67,A111,A136,A157,A198,A201,A206,A213,A216,A225,A269,A292,A295,A299,A303,A335,A350,A355,A367,A368,A382,A427,A431,A471,A494,A499,A509,A543,A597,A609,A615,E364,E389,E398,E405,E420,E427,E428,E442,E467
 GLU A22,A23,A35,A37,A56,A57,A75,A87,A110,A140,A145,A150,A160,A166,A171,A181,A182,A189,A197,A208,A224,A227,A231,A232,A238,A310,A312,A329,A375,A398,A402,A406,A430,A433,A435,A457,A467,A479,A483,A489,A495,A527,A536,A549,A564,A571,A589,E340,E406,E465,E471,E484,E516
 HIS A34,A195,A228,A239,A241,A265,A345,A373,A374,A378,A401,A417,A493,A505,A535,A540,E519
 LYS A26,A31,A68,A74,A94,A112,A114,A131,A174,A187,A234,A247,A288,A309,A313,A341,A353,A363,A416,A419,A441,A458,A465,A470,A475,A476,A481,A534,A541,A553,A562,A577,A596,A600,E356,E378,E386,E417,E424,E444,E458,E462
 ARG A115,A161,A169,A177,A192,A204,A219,A245,A273,A306,A357,A393,A460,A482,A514,A518,A559,A582,E346,E355,E357,E403,E408,E454,

#### clashes
Detects steric clashes based on distance criteria.  
Contacts are classified as:
* Severe: atoms that are too close, usually indicating superimposed structures or badly modelled regions. These should be fixed.
* Apolar: van der Waals collisions. Usually fixed during the simulation.
* Polar and ionic: usually indicate wrong side-chain conformations. Usually fixed during the simulation.
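The classification can be sketched as a function of the two atom classes and their distance. The category names follow the output of clashes; the cutoff distances, the atom class labels and the rule that any pair involving an apolar atom is an apolar clash are assumptions of this sketch and may differ from the library's settings.

```python
# Assumed cutoffs in angstroms
CUTOFFS = {"severe": 1.0, "apolar": 2.9, "polar_acceptor": 3.1,
           "polar_donor": 3.1, "positive": 3.5, "negative": 3.5}
# Category assigned when both atoms belong to the same class
SAME_CLASS = {"acceptor": "polar_acceptor", "donor": "polar_donor",
              "positive": "positive", "negative": "negative"}


def classify_clash(class_a, class_b, dist):
    """Return the clash category for an atom pair, or None if it is no clash."""
    if dist < CUTOFFS["severe"]:
        return "severe"
    if "apolar" in (class_a, class_b):
        kind = "apolar"
    elif class_a == class_b and class_a in SAME_CLASS:
        kind = SAME_CLASS[class_a]
    else:
        return None  # e.g. a donor-acceptor pair, a normal hydrogen bond
    return kind if dist < CUTOFFS[kind] else None


print(classify_clash("apolar", "acceptor", 2.860))
print(classify_clash("acceptor", "acceptor", 2.881))
print(classify_clash("donor", "donor", 3.022))
print(classify_clash("donor", "acceptor", 2.800))
```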


In [30]:
st_c.clashes()

Running clashes.
No severe clashes detected
9 Steric apolar clashes detected
 HIE A34.CD2  TYR E453.OH     2.860 A
 ASN A53.CG   NAG A906.C1     2.486 A
 ASN A90.CG   NAG A904.C1     2.431 A
 ASN A121.O   THR A125.CG2    2.890 A
 ASN A322.CG  NAG A905.C1     2.433 A
 LEU A333.C   MET A360.O      2.881 A
 ASN A546.CG  NAG A903.C1     2.455 A
 ASN E343.CG  NAG E601.C1     2.478 A
 TYR E380.O   THR E430.C      2.758 A
6 Steric polar_acceptor clashes detected
 MET A152.O   GLY A268.O      3.063 A
 LEU A333.O   MET A360.O      2.881 A
 TYR E351.O   ASP E467.O      3.074 A
 TYR E380.O   THR E430.O      2.728 A
 ASN E394.OD1 GLU E516.OE2    2.850 A
 GLY E485.O   CYX E488.O      3.046 A
1 Steric polar_donor clashes detected
 ARG E357.NH1 ASN E394.ND2    3.022 A
No positive clashes detected
No negative clashes detected


All checks can also be run at once, without applying fixes, using a single method.

In [31]:
st_c.checkall()

Running models.
1 Model(s) detected
Single model found
Running chains.
2 Chain(s) detected
 A: Protein
 E: Protein
Running inscodes.
No residues with insertion codes found
Running altloc.
No residues with alternative location labels detected
Running rem_hydrogen.
791 Residues containing H atoms detected
Running add_hydrogen.
209 Residues requiring selection on adding H atoms
 CYS A261,A498
 ASP A30,A38,A67,A111,A136,A157,A198,A201,A206,A213,A216,A225,A269,A292,A295,A299,A303,A335,A350,A355,A367,A368,A382,A427,A431,A471,A494,A499,A509,A543,A597,A609,A615,E364,E389,E398,E405,E420,E427,E428,E442,E467
 GLU A22,A23,A35,A37,A56,A57,A75,A87,A110,A140,A145,A150,A160,A166,A171,A181,A182,A189,A197,A208,A224,A227,A231,A232,A238,A310,A312,A329,A375,A398,A402,A406,A430,A433,A435,A457,A467,A479,A483,A489,A495,A527,A536,A549,A564,A571,A589,E340,E406,E465,E471,E484,E516
 LYS A26,A31,A68,A74,A94,A112,A114,A131,A174,A187,A234,A247,A288,A309,A313,A341,A353,A363,A416,A419,A441,A458,A465,A470,A475,A476,A48

In [32]:
st_c._save_structure(args['output_structure_path'])

'/home/ilia/Escritorio/Bioinformatics/1_Second_year/BioPhysics/Seminars/Project/Scripts/6m0j_fixed.pdb'

In [86]:
st_c.rem_hydrogen('yes')

Running rem_hydrogen. Options: yes
791 Residues containing H atoms detected
Hydrogen atoms removed from 791 residues


In [72]:
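#st_c.add_hydrogen('--add_charges --add_mode auto')
# Alternative: call the command-line tool.
# Note: as the usage message in the output shows, --add_charges expects an argument,
# so this call stops with an error. os.system returns 512, which on POSIX systems
# encodes exit status 2.
import os
os.system('check_structure -i ' + args['output_structure_path'] + ' -o ' + args['output_structure_path_charges'] + ' add_hydrogen --add_charges --add_mode auto')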

=                   BioBB structure checking utility v3.10.1                   =
=            P. Andrio, A. Hospital, G. Bayarri, J.L. Gelpi 2018-22            =

Structure /home/ilia/Escritorio/Bioinformatics/1_Second_year/BioPhysics/Seminars/Project/6m0j_fixed.pdb loaded
 Title: 
 Experimental method: unknown
 Resolution (A): N.A.

 Num. models: 1
 Num. chains: 2 (A: Protein, E: Protein)
 Num. residues:  871
 Num. residues with ins. codes:  0
 Num. residues with H atoms: 791 (total 6102 H atoms)
 Num. HETATM residues:  80
 Num. ligands or modified residues:  0
 Num. water mol.:  80
 Num. atoms:  12590

Running add_hydrogen. Options: --add_charges --add_mode auto
240 Residues requiring selection on adding H atoms
 CYS A133,A141,A261,A344,A361,A498,A530,A542,E336,E361,E379,E391,E432,E480,E488,E525
 ASP A30,A38,A67,A111,A136,A157,A198,A201,A206,A213,A216,A225,A269,A292,A295,A299,A303,A335,A350,A355,A367,A368,A382,A427,A431,A471,A494,A499,A509,A543,A597,A609,A615,E364,E389,E398,E405,E420

usage: add_hydrogen [-h] [--add_mode ADD_MODE] [--pH PH] [--list LIST]
                    [--no_fix_side] [--keep_h] [--add_charges ADD_CHARGES]
add_hydrogen: error: argument --add_charges: expected one argument


512

In [35]:
#st_c._save_structure(args['output_structure_path_charges'])

'/home/gelpi/DEVEL/BioPhysics/wdir/6m0j_fixed.pdbqt'

In [36]:
#st_c.revert_changes()