# Package for PDB Processing. Module I: regular expressions and parsing

## Data formats in PDB files

<style>
p.big {
    line-height: 3;
    text-align: left;
}
</style>

DATA TYPE            | <p class="big">   DESCRIPTION</p>
---------------------|-------------------------------------------------------------
AChar                | <p class="big">   An alphabetic character (A-Z, a-z).</p>
Atom                 | <p class="big">   Atom name.</p>
Character            | <p class="big">   Any non-control character in the  ASCII character set or a space.</p>
Continuation         | <p class="big">   A two-character field that is either blank (for the first record of a set) or contains a two  digit number right-justified and blank-filled which counts continuation records starting with 2. The continuation number must be followed by a blank.</p>
Date                 | <p class="big">   A 9 character string in the form DD-MMM-YY  where DD is the day of the month, zero-filled on the left  (e.g., 04); MMM is the common English 3-letter  abbreviation of the month; and YY is the last two digits of the year.  This must represent a valid date.</p>
IDcode               | <p class="big">   A PDB identification code which  consists of 4 characters, the first of which is a digit  in the range 0 - 9; the remaining 3 are  alpha-numeric, and letters are upper case only. Entries with a 0 as  the first character do not contain coordinate data.</p>
Integer              | <p class="big">   Right-justified blank-filled integer  value.</p>
Token                | <p class="big">   A sequence of non-space characters  followed by a colon and a space.</p>
List                 | <p class="big">   A String that is composed of text  separated with commas.</p>
LString              | <p class="big">   A literal string of characters. All spacing  is significant and must be preserved.</p>
LString(n)           | <p class="big">   An LString with exactly n characters.</p>
Real(n,m)            | <p class="big">   Real (floating point) number in the  FORTRAN format Fn.m.</p>
Record name          | <p class="big">   The name of the record: 6 characters,  left-justified and blank-filled.</p>
Residue name         | <p class="big">   One of the standard amino acids or nucleic acids, as listed below, or the non-standard group designation as defined in the HET dictionary. Field is right-justified.</p>
SList                | <p class="big">   A String that is composed of text  separated with semi-colons.</p>
Specification        | <p class="big">   A String composed of a token and its  associated value separated by a colon.</p>
Specification List   | <p class="big">   A sequence of Specifications, separated by semi-colons.</p>
String               | <p class="big">   A sequence of characters. These  characters may have arbitrary spacing, but should be  interpreted as directed below.</p>
String(n)            | <p class="big">   A String with exactly n characters.</p>
SymOP                | <p class="big">   An integer field of 4 to 6 digits, right-justified, of the form nnnMMM where nnn is the symmetry operator number and MMM is the translation vector.</p>
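
Several of the data types above translate almost directly into regular expressions. The sketch below is illustrative only: the names `IDCODE`, `DATE`, `TOKEN`, `is_idcode` and `is_date` are hypothetical, the month abbreviation is assumed to be upper case, and the `DATE` pattern checks only the shape of the string, not whether it is a valid calendar date.

```python
import re

# Hypothetical patterns derived from the table above; the names are illustrative.
# IDcode: a digit followed by three upper-case alphanumeric characters.
IDCODE = re.compile(r"[0-9][0-9A-Z]{3}")
# Date: DD-MMM-YY (shape only; assumes an upper-case month abbreviation).
DATE = re.compile(r"\d{2}-[A-Z]{3}-\d{2}")
# Token: a run of non-space characters followed by a colon and a space.
TOKEN = re.compile(r"\S+: ")


def is_idcode(text):
    """Return True if text has the shape of a PDB ID code."""
    return IDCODE.fullmatch(text) is not None


def is_date(text):
    """Return True if text has the shape DD-MMM-YY."""
    return DATE.fullmatch(text) is not None


print(is_idcode("4OO8"))                  # True
print(is_idcode("4oo8"))                  # False: letters must be upper case
print(is_date("04-FEB-14"))               # True
print(TOKEN.match("MOL_ID: 1").group())   # 'MOL_ID: '
```

Using `fullmatch` rather than `match` keeps a longer string such as `4OO8X` from passing as an ID code.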



Required imports

In [1]:
import re
import requests

Function to fetch PDB content from the RCSB database or from a local file

In [2]:
def get_pdb(id=None, path=None):
    """
    Input: PDB ID or path to a .pdb file.

    Output: string with the content of the PDB file.
    """
    if id is not None:
        # Download the entry from RCSB; the timeout avoids waiting forever on a stalled connection.
        r = requests.get('https://files.rcsb.org/view/{}.pdb'.format(id), timeout=30)
        if r.status_code == 200:
            return r.content.decode()
        raise ValueError("PDB ID not found or invalid.")
    if path is not None:
        try:
            with open(path, 'r') as pdb_file:
                return pdb_file.read()
        except FileNotFoundError:
            raise ValueError("Invalid path to PDB file.")
    # Neither argument was supplied.
    raise ValueError("Either a PDB ID or a path to a PDB file is required.")

## DBREF entry processing

DBREF records cross-reference each sequence segment in the PDB file with the corresponding entry in a sequence database.

COLUMNS   |<p align="left">  DATA TYPE  </p>|<p align="left"> FIELD        </p>|<p align="left">    DEFINITION</p>
----------|---------------|----------------|-----------------------------------------
1 -  6    |<p align="left"> Record name </p>|<p align="left"><font color="red">DBREF</font></p>|
8 - 11    |<p align="left"> IDcode      </p>|<p align="left">idCode        </p>|<p align="left">    ID code of this entry.</p>
13        |<p align="left">  Character  </p>|<p align="left"> chainID      </p>|<p align="left">    Chain  identifier.</p>
15 - 18   |<p align="left">  Integer    </p>|<p align="left"> seqBegin     </p>|<p align="left">    Initial sequence number of the PDB sequence segment.</p>
19        |<p align="left">  AChar      </p>|<p align="left"> insertBegin  </p>|<p align="left">    Initial  insertion code of the PDB  sequence segment.</p>
21 - 24   |<p align="left">  Integer    </p>|<p align="left"> seqEnd       </p>|<p align="left">    Ending sequence number of the PDB  sequence segment.</p>
25        |<p align="left">  AChar      </p>|<p align="left"> insertEnd    </p>|<p align="left">    Ending insertion code of the PDB  sequence segment.</p>
27 - 32   |<p align="left">  LString    </p>|<p align="left"> database     </p>|<p align="left">    Sequence database name.</p>
34 - 41   |<p align="left">  LString    </p>|<p align="left"> dbAccession  </p>|<p align="left">    Sequence database accession code.</p>
43 - 54   |<p align="left">  LString    </p>|<p align="left"> dbIdCode     </p>|<p align="left">    Sequence  database identification code.</p>
56 - 60   |<p align="left">  Integer    </p>|<p align="left"> dbseqBegin   </p>|<p align="left">    Initial sequence number of the database segment.</p>
61        |<p align="left">  AChar      </p>|<p align="left"> idbnsBeg     </p>|<p align="left">    Insertion code of initial residue of the segment, if PDB is the reference.</p>
63 - 67   |<p align="left">  Integer    </p>|<p align="left"> dbseqEnd     </p>|<p align="left">    Ending sequence number of the database segment.</p>
68        |<p align="left">  AChar      </p>|<p align="left"> dbinsEnd     </p>|<p align="left">    Insertion code of the ending residue of the segment, if PDB is the reference.</p>
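
Because every field has fixed columns, a record can also be read by slicing, without a regular expression. The sketch below is one possible approach, not the method used in this notebook: `DBREF_FIELDS` and `parse_fixed` are hypothetical names, the field names are borrowed from the regex groups in the next cell, and the sample line is assembled in code so that each value lands in its documented columns.

```python
# (name, first column, last column), 1-based and inclusive, as in the table above.
DBREF_FIELDS = [
    ("record_name", 1, 6), ("pdb_id", 8, 11), ("chain", 13, 13),
    ("start", 15, 18), ("s_icode", 19, 19), ("end", 21, 24), ("e_icode", 25, 25),
    ("database", 27, 32), ("db_accession", 34, 41), ("db_id", 43, 54),
    ("db_start", 56, 60), ("db_s_icode", 61, 61),
    ("db_end", 63, 67), ("db_e_icode", 68, 68),
]


def parse_fixed(line, fields):
    """Slice a fixed-width record into a dict of stripped strings."""
    line = line.ljust(80)  # pad short lines so every slice is in range
    return {name: line[first - 1:last].strip() for name, first, last in fields}


# Hypothetical sample record, built field by field to respect the column layout.
sample = ("{:<6} {:<4} {:1} {:>4}{:1} {:>4}{:1} {:<6} {:<8} {:<12} {:>5}{:1} {:>5}{:1}"
          .format("DBREF", "4OO8", "A", 1, "", 1368, "",
                  "UNP", "Q99ZW2", "CAS9_STRP1", 1, "", 1368, ""))

record = parse_fixed(sample, DBREF_FIELDS)
print(record["database"], record["db_accession"], record["db_id"])
```

Slicing is stricter than a whitespace-based regex: it still works when adjacent fields touch, but it fails on files that do not respect the column layout.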

In [3]:
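dbref = re.compile(r"""(?P<record_name>dbref)\s+        # columns  1 -  6
                   (?P<pdb_id>\d\w{3})\s+               # columns  8 - 11
                   (?P<chain>[a-z])\s+                  # column  13
                   (?P<start>-?\d{1,4})                 # columns 15 - 18
                   (?P<s_icode>[a-z])?\s+               # column  19
                   (?P<end>-?\d{1,4})                   # columns 21 - 24
                   (?P<e_icode>[a-z])?\s+               # column  25
                   (?P<database>\b.{0,6}\b)\s+          # columns 27 - 32
                   (?P<db_accession>\b.{0,8}\b)\s+      # columns 34 - 41
                   (?P<db_id>\b.{0,12}\b)\s+            # columns 43 - 54
                   (?P<db_start>-?\d{1,5})              # columns 56 - 60 (five wide)
                   (?P<db_s_icode>[a-z])?\s+            # column  61
                   (?P<db_end>-?\d{1,5})                # columns 63 - 67 (five wide)
                   (?P<db_e_icode>[a-z])?\s*$           # column  68; trailing blanks optional
                   """, re.I|re.X|re.M)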
dbref_columns = re.findall(r"(?<=\(\?P<)\w+(?=>)", dbref.pattern, re.S)
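
The group names can also be read from the compiled pattern itself, and each match can be turned into a dictionary keyed by those names. The sketch below uses a deliberately small, hypothetical pattern and sample text; `groupindex`, `finditer` and `groupdict` are standard parts of Python's `re` module and work the same way on any pattern with named groups.

```python
import re

# A small, hypothetical pattern with three named groups.
pattern = re.compile(
    r"(?P<record_name>DBREF)\s+(?P<pdb_id>\d\w{3})\s+(?P<chain>[A-Z])"
)

# groupindex maps each group name to its group number;
# sorting by that number gives the names in pattern order.
columns = sorted(pattern.groupindex, key=pattern.groupindex.get)
print(columns)  # ['record_name', 'pdb_id', 'chain']

# finditer + groupdict yields one dict per record, keyed by group name.
text = "DBREF  4OO8 A\nDBREF  4OO8 D\n"
rows = [m.groupdict() for m in pattern.finditer(text)]
print(rows[1]["chain"])  # D
```

A list of such dictionaries can be passed to `pd.DataFrame` directly, so the column names do not need to be supplied separately.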

In [4]:
%%time
pdb = get_pdb('4oo8')

CPU times: user 48.4 ms, sys: 16.1 ms, total: 64.5 ms
Wall time: 11.5 s


In [43]:
dbref.findall(pdb)[0]

('DBREF',
 '4OO8',
 'A',
 '1',
 '',
 '1368',
 '',
 'UNP',
 'Q99ZW2',
 'CAS9_STRP1',
 '1',
 '',
 '1368',
 '')

In [179]:
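# HELIX records taken from the downloaded entry.
pdbx = '\n'.join(re.findall(r"^HELIX.+$", pdb, re.M))
# Sample HELIX records with two comment fields filled in, used to test the optional comment group.
pdbz = """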
HELIX    1   1 ALA A   59  ASP A   94  1          I am :the comment.      36    
HELIX    2   2 SER A   96  GLU A  102  1                                   7    
HELIX    3   3 ASN A  121  TYR A  132  1           Me too!                12    
HELIX    4   4 THR A  134  SER A  145  1                                  12    
HELIX    5   5 ASP A  150  PHE A  164  1                                  15    
HELIX    6   6 ASP A  180  PHE A  196  1                                  17    
HELIX    7   7 ASP A  207  SER A  213  1                                   7    
HELIX    8   8 SER A  217  GLN A  228  1                                  12    
HELIX    9   9 GLY A  236  GLY A  247  1                                  12    
HELIX   10  10 THR A  270  GLY A  283  1                                  14    
HELIX   11  11 TYR A  286  LEU A  306  1                                  21    
HELIX   12  12 ALA A  315  LEU A  343  1                                  29    
HELIX   13  13 LYS A  346  PHE A  352  1                                   7    
HELIX   14  14 GLY A  358  ASP A  364  1                                   7    
HELIX   15  15 SER A  368  MET A  383  1                                  16    
HELIX   16  16 THR A  386  ARG A  395  1                                  10         
"""


In [18]:
%%time
import pandas as pd
d = pd.DataFrame(dbref.findall(pdb), columns=dbref_columns)

CPU times: user 71.9 ms, sys: 5.91 ms, total: 77.8 ms
Wall time: 77.4 ms


In [19]:
d

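|   | record_name | pdb_id | chain | start | s_icode | end | e_icode | database | db_accession | db_id | db_start | db_s_icode | db_end | db_e_icode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DBREF | 4OO8 | A | 1 | | 1368 | | UNP | Q99ZW2 | CAS9_STRP1 | 1 | | 1368 | |
| 1 | DBREF | 4OO8 | D | 1 | | 1368 | | UNP | Q99ZW2 | CAS9_STRP1 | 1 | | 1368 | |
| 2 | DBREF | 4OO8 | B | 1 | | 98 | | PDB | 4OO8 | 4OO8 | 1 | | 98 | |
| 3 | DBREF | 4OO8 | E | 1 | | 98 | | PDB | 4OO8 | 4OO8 | 1 | | 98 | |
| 4 | DBREF | 4OO8 | C | -2 | | 20 | | PDB | 4OO8 | 4OO8 | -2 | | 20 | |
| 5 | DBREF | 4OO8 | F | -2 | | 20 | | PDB | 4OO8 | 4OO8 | -2 | | 20 | |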


---

## HELIX entry processing

HELIX records describe the helices found in the structure. Helices of all types are recorded, but most are alpha helices.

COLUMNS     | <p align="left">   DATA  TYPE   </p> | <p align="left"> FIELD      </p> | <p align="left"> DEFINITION</p>
---|---|---|---
 1 -  6     | <p align="left">   Record name  </p> | <p align="left"> HELIX   </p>    |
 8 - 10     | <p align="left">   Integer      </p> | <p align="left"> serNum     </p> | <p align="left"> Serial number of the helix. This starts at 1  and increases incrementally.</p>
12 - 14     | <p align="left">   LString(3)   </p> | <p align="left"> helixID    </p> | <p align="left"> Helix  identifier. In addition to a serial number, each helix is given an  alphanumeric character helix identifier.</p>
16 - 18     | <p align="left">   Residue name </p> | <p align="left"> initResName</p> | <p align="left"> Name of the initial residue.</p>
20          | <p align="left">   Character    </p> | <p align="left"> initChainID</p> | <p align="left"> Chain identifier for the chain containing this  helix.</p>
22 - 25     | <p align="left">   Integer      </p> | <p align="left"> initSeqNum </p> | <p align="left"> Sequence number of the initial residue.</p>
26          | <p align="left">   AChar        </p> | <p align="left"> initICode  </p> | <p align="left"> Insertion code of the initial residue.</p>
28 - 30     | <p align="left">   Residue  name</p> | <p align="left"> endResName </p> | <p align="left"> Name of the terminal residue of the helix.</p>
32          | <p align="left">   Character    </p> | <p align="left"> endChainID </p> | <p align="left"> Chain identifier for the chain containing this  helix.</p>
34 - 37     | <p align="left">   Integer      </p> | <p align="left"> endSeqNum  </p> | <p align="left"> Sequence number of the terminal residue.</p>
38          | <p align="left">   AChar        </p> | <p align="left"> endICode   </p> | <p align="left"> Insertion code of the terminal residue.</p>
39 - 40     | <p align="left">   Integer      </p> | <p align="left"> helixClass </p> | <p align="left"> Helix class (see below).</p>
41 - 70     | <p align="left">   String       </p> | <p align="left"> comment    </p> | <p align="left"> Comment about this helix.</p>
72 - 76     | <p align="left">   Integer      </p> | <p align="left"> length     </p> | <p align="left"> Length of this helix.</p>

In [180]:
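helix = re.compile(r"""^
            (?P<record_name>helix)\s+       # columns  1 -  6
            (?P<ser_num>\d+)\s+             # columns  8 - 10
            (?P<id>\w{0,2}\w)\s+            # columns 12 - 14
            (?P<start_res>\w{0,2}\w)\s+     # columns 16 - 18
            (?P<chain>\w)\s+                # column  20
            (?P<start>-?\d+)                # columns 22 - 25
            (?P<s_icode>[a-z])?\s+          # column  26
            (?P<end_res>\w{0,2}\w)\s+       # columns 28 - 30
            (?P=chain)\s+                   # column  32; must equal the initial chain ID
            (?P<end>-?\d+)                  # columns 34 - 37
            (?P<e_icode>[a-z])?\s*          # column  38
            (?P<helix_class>\d{1,2})\s*     # columns 39 - 40
            (?P<comment>\S.{0,28}\S)?\s+    # columns 41 - 70
            (?P<length>\d+)                 # columns 72 - 76
            \s*$                            # trailing blanks optional
            """, re.I|re.X|re.M)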
helix_columns = re.findall(r"(?<=\(\?P<)\w+(?=>)", helix.pattern, re.S)
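
The `(?P=chain)` element in the pattern above is a named backreference: it matches only the text already captured by the `chain` group, so a record whose end chain differs from its start chain is not matched at all. A minimal sketch of that behaviour, using a hypothetical two-field pattern rather than the full HELIX expression:

```python
import re

# Hypothetical pattern: a chain ID, a number, then the same chain ID and another number.
same_chain = re.compile(r"(?P<chain>\w)\s+\d+\s+(?P=chain)\s+\d+")

print(bool(same_chain.fullmatch("A 59 A 94")))  # True: both chain IDs are A
print(bool(same_chain.fullmatch("A 59 B 94")))  # False: the second chain ID differs

# A backreference is not a capturing group, so it adds no extra column.
print(list(same_chain.groupindex))  # ['chain']
```

This is why the end chain ID does not appear as a separate column in the resulting table.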

In [181]:
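df = pd.DataFrame(helix.findall(pdbz), columns=helix_columns)
df = df.astype({c: int for c in ('ser_num', 'start', 'end', 'helix_class', 'length')})  # cast numeric columns only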

In [183]:
df.head()

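|   | record_name | ser_num | id | start_res | chain | start | s_icode | end_res | end | e_icode | helix_class | comment | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HELIX | 1 | 1 | ALA | A | 59 | | ASP | 94 | | 1 | I am :the comment. | 36 |
| 1 | HELIX | 2 | 2 | SER | A | 96 | | GLU | 102 | | 1 | | 7 |
| 2 | HELIX | 3 | 3 | ASN | A | 121 | | TYR | 132 | | 1 | | 12 |
| 3 | HELIX | 4 | 4 | THR | A | 134 | | SER | 145 | | 1 | | 12 |
| 4 | HELIX | 5 | 5 | ASP | A | 150 | | PHE | 164 | | 1 | | 15 |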
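
A quick sanity check on the parsed values is to compare the `length` field with the residue span. The sketch below assumes that the length equals `end - start + 1`, which holds for the five rows shown above but may not hold when insertion codes or numbering gaps are present; `span` is a hypothetical helper.

```python
# (start, end, length) copied from the five rows shown above.
helices = [(59, 94, 36), (96, 102, 7), (121, 132, 12), (134, 145, 12), (150, 164, 15)]


def span(start, end):
    """Residues from start to end, inclusive (assumes continuous numbering)."""
    return end - start + 1


# Collect any rows whose declared length disagrees with the residue span.
mismatches = [(s, e, n) for s, e, n in helices if span(s, e) != n]
print(mismatches)  # []
```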


---

## SHEET entry processing
SHEET records are similar to HELIX records: they contain information about sheet structure.
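
As a starting point, the leading SHEET fields can be read by column slicing. This is only a sketch: the column layout in `SHEET_FIELDS` is an assumption taken from the wwPDB format description, not from this notebook, so check it against the specification before relying on it. The field names, the `parse_sheet` helper and the sample record are hypothetical.

```python
# Assumed layout of the first SHEET fields (1-based, inclusive columns).
SHEET_FIELDS = [
    ("record_name", 1, 6), ("strand", 8, 10), ("sheet_id", 12, 14),
    ("num_strands", 15, 16), ("start_res", 18, 20), ("start_chain", 22, 22),
    ("start", 23, 26), ("s_icode", 27, 27), ("end_res", 29, 31),
    ("end_chain", 33, 33), ("end", 34, 37), ("e_icode", 38, 38),
    ("sense", 39, 40),
]


def parse_sheet(line):
    """Slice the leading fields of a SHEET record into a dict of stripped strings."""
    line = line.ljust(80)  # pad short lines so every slice is in range
    return {name: line[first - 1:last].strip() for name, first, last in SHEET_FIELDS}


# Hypothetical sample record, built field by field to respect the assumed layout.
sample = ("{:<6} {:>3} {:>3}{:>2} {:>3} {:1}{:>4}{:1} {:>3} {:1}{:>4}{:1}{:>2}"
          .format("SHEET", 1, "A", 5, "THR", "A", 107, "", "ARG", "A", 110, "", 0))

strand = parse_sheet(sample)
print(strand["sheet_id"], strand["start"], strand["end"], strand["sense"])
```

Unlike HELIX, adjacent SHEET fields can touch (for example the sheet identifier and the strand count), which is why this sketch slices by column instead of splitting on whitespace.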

