<a href="https://colab.research.google.com/github/pjmartel/python-for-scientists/blob/master/Programmatic_Access_PDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Programmatic access to PDB using Python

The Protein Data Bank (PDB) is an online database and repository of all known structures of biological macromolecules, maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). The website offers multiple ways of searching the structures, including by name, size, date of deposition, experimental method, etc. Once a set of structures is selected, they can be downloaded as coordinate files, to be visualized or used in structural modelling methods. The PDB website is available at the following URL: https://www.rcsb.org.

The site has a very nicely designed user interface, but there are times when we would like to access the information in a *programmatic* way, that is, by retrieving the information with program code rather than by clicking pull-downs or filling in forms on a web page. The latter is particularly impractical if you are dealing with very large amounts of data or massively repetitive operations (nobody deserves having to click 10000 times on a web page!). There are several ways in which you can access the information available in the PDB, with different methods being more or less suited depending on the type of information you require.

### Using the PDB REST API

The PDB RESTful (or REST for short) API is an interface that allows querying the databank using special URLs that encode our request. As an example, let's say we want to get information about the structure with id code "4hhb". All we have to do is enter the following URL in our web browser: https://www.rcsb.org/pdb/rest/describePDB?structureId=4hhb . Try it now and see what happens. The browser should display an *XML file*, a kind of universal format for encoding structured information. While very easy to parse with a computer program, this format is clearly not the best way to present data for human consumption: the information needs some pre-processing. On the other hand, the idea here is not that you, as the user, will be typing the REST URL in the browser; that should instead be done by a program!
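
A request URL of this kind is just a base address, a service name and a query string. As a minimal sketch, the hypothetical helper `build_rest_url` (not part of any PDB library) assembles such URLs with `urllib.parse.urlencode`, which also escapes special characters. The base address is the one used in this tutorial; the helper only builds the string and does not check whether the server answers at that address.

```python
from urllib.parse import urlencode

# Base address of the REST services as used in this tutorial.
# Assumption: this helper only builds the string; it never contacts the server.
BASE_URL = "https://www.rcsb.org/pdb/rest"


def build_rest_url(service, **params):
    """Return the REST URL for a service name and its query parameters."""
    return "{}/{}?{}".format(BASE_URL, service, urlencode(params))


print(build_rest_url("describePDB", structureId="4hhb"))
```

Building the query string from keyword arguments keeps the code readable when a request takes more than one parameter.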

So, there are two things we want to do here:

1. Programmatically query the web server to obtain the XML file
2. Programmatically parse the XML file and present the information in a more human-readable form

#### 1. Access general information about a PDB entry

First we import the `request` module from the `urllib` package. This module is used to retrieve information from the PDB server:

In [0]:
import urllib.request

The returned data will be in the form of an **XML file**. In order to read it, we need to import the `ElementTree` module from the `xml` library:

In [0]:
import xml.etree.ElementTree 

Select a specific PDB entry by providing its PDB id:

In [0]:
pdb_code = "1beo" # change this to your desired PDB id

Build the request URL (as a string) and pass it as an argument to the `urlopen` function, which will return the server response. This return value is assigned to `fp`:

In [0]:
fp = urllib.request.urlopen("https://www.rcsb.org/pdb/rest/describePDB?structureId="+pdb_code)

In [0]:
print(fp)

<http.client.HTTPResponse object at 0x7f652bfc0390>


`fp` is an object storing the response given by the HTTP server to the client. That response is an XML file that needs to be read (*parsed*) to extract the information therein:

In [0]:
e = xml.etree.ElementTree.parse(fp).getroot()

In [0]:
print(e)

<Element 'PDBdescription' at 0x7f652bfbb368>


The variable `e` holds the top (or root) element of the response, called `PDBdescription`. Please open a new tab in your browser and go to the following URL: https://www.rcsb.org/pdb/rest/describePDB?structureId=1beo . The contents of the XML response will be shown. That is what your program is analyzing!
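
To see what the parser does without contacting the server, the sketch below parses a small XML string with `ElementTree.fromstring`. The sample string is hand-written for illustration: it assumes a `PDBdescription` root with a single `PDB` child and carries only a few attributes, so it is not the full server response.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (assumption): a cut-down imitation of the server response
sample_xml = """
<PDBdescription>
  <PDB structureId="1BEO" title="BETA-CRYPTOGEIN" expMethod="X-RAY DIFFRACTION" resolution="2.20"/>
</PDBdescription>
"""

root = ET.fromstring(sample_xml.strip())  # parse from a string instead of a file object
print(root.tag)                           # name of the root element
for child in root:                        # iterate over the direct children of the root
    print(child.tag, child.attrib)        # tag name and its attributes (a dict)
```

`ET.parse` reads from a file or file-like object (such as the HTTP response), while `ET.fromstring` reads from text that is already in memory and returns the root element directly.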

Let's find all the information in the tag "PDB":

In [0]:
PDB = e.findall("PDB")[0]  # findall returns a list with a single element, we need the [0] to grab that element

The `items` method returns all the attributes of the PDB tag as a list of (name, value) pairs. Let's check that this is true:

In [0]:
print(PDB.items())

[('structureId', '1BEO'), ('title', 'BETA-CRYPTOGEIN'), ('pubmedId', '8994969'), ('expMethod', 'X-RAY DIFFRACTION'), ('resolution', '2.20'), ('keywords', 'FUNGAL TOXIC ELICITOR'), ('nr_entities', '1'), ('nr_residues', '98'), ('nr_atoms', '717'), ('deposition_date', '1996-08-02'), ('release_date', '1997-05-15'), ('last_modification_date', '2011-07-13'), ('structure_authors', 'Boissy, G., De La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.'), ('citation_authors', 'Boissy, G., de La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.'), ('status', 'CURRENT')]


Now we will **loop** through that list of *tuples*, print their contents and store them in a `dict` variable called `myDict`:

In [0]:
myDict = {}
for key, value in PDB.items():  # each item is a (name, value) tuple
    print(key+" : "+value)
    myDict[key] = value

structureId : 1BEO
title : BETA-CRYPTOGEIN
pubmedId : 8994969
expMethod : X-RAY DIFFRACTION
resolution : 2.20
keywords : FUNGAL TOXIC ELICITOR
nr_entities : 1
nr_residues : 98
nr_atoms : 717
deposition_date : 1996-08-02
release_date : 1997-05-15
last_modification_date : 2011-07-13
structure_authors : Boissy, G., De La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.
citation_authors : Boissy, G., de La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.
status : CURRENT


In [0]:
myDict  # myDict contains all the information

{'citation_authors': 'Boissy, G., de La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.',
 'deposition_date': '1996-08-02',
 'expMethod': 'X-RAY DIFFRACTION',
 'keywords': 'FUNGAL TOXIC ELICITOR',
 'last_modification_date': '2011-07-13',
 'nr_atoms': '717',
 'nr_entities': '1',
 'nr_residues': '98',
 'pubmedId': '8994969',
 'release_date': '1997-05-15',
 'resolution': '2.20',
 'status': 'CURRENT',
 'structureId': '1BEO',
 'structure_authors': 'Boissy, G., De La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.',
 'title': 'BETA-CRYPTOGEIN'}

We can use myDict like so:

In [0]:
myDict['title']

'BETA-CRYPTOGEIN'
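
Note that every value in `myDict` is a string, because XML attributes are plain text. Below is a minimal sketch of converting a few of them to numbers. The dictionary is retyped from the output above so the example runs on its own, and the choice of which keys are numeric is an assumption based on that output.

```python
# Values retyped from the output above so the example is self-contained
my_dict = {"structureId": "1BEO", "resolution": "2.20",
           "nr_residues": "98", "nr_atoms": "717"}

# Assumed numeric fields and the type each one should be converted to
converters = {"resolution": float, "nr_residues": int, "nr_atoms": int}

# Apply the matching converter, leaving all other values as strings
typed = {key: converters.get(key, str)(value) for key, value in my_dict.items()}
print(typed)
```

After the conversion, the values can be compared or used in arithmetic, for example to filter entries by resolution.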

Another way to get the same information:

In [0]:
for a in PDB.keys():
    print(a,":",PDB.get(a))

structureId : 1BEO
title : BETA-CRYPTOGEIN
pubmedId : 8994969
expMethod : X-RAY DIFFRACTION
resolution : 2.20
keywords : FUNGAL TOXIC ELICITOR
nr_entities : 1
nr_residues : 98
nr_atoms : 717
deposition_date : 1996-08-02
release_date : 1997-05-15
last_modification_date : 2011-07-13
structure_authors : Boissy, G., De La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.
citation_authors : Boissy, G., de La Fortelle, E., Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S.
status : CURRENT


#### 2. Describe all molecular entities in a PDB entry

This can be done with the following type of URL: https://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb
(in this case for the structure with id=4hhb)

**HANDS-ON:** Inspect the XML output of the URL above. Based on the previous example, write the necessary code to get the XML and parse it into human-readable output.
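
As a possible starting point, one approach that does not depend on knowing the tag names in advance is to walk the element tree recursively. The function `tree_lines` and the sample XML below are made up for illustration; the real `describeMol` response has its own tags and attributes, which you will need to inspect yourself.

```python
import xml.etree.ElementTree as ET


def tree_lines(element, depth=0):
    """Return one indented line per element: its tag followed by its attributes."""
    attrs = ", ".join("{}={}".format(k, v) for k, v in element.attrib.items())
    lines = ["  " * depth + element.tag + (" : " + attrs if attrs else "")]
    for child in element:  # recurse into nested elements, one level deeper
        lines.extend(tree_lines(child, depth + 1))
    return lines


# Made-up XML, only to demonstrate the traversal
sample = ET.fromstring('<a><b id="1"><c name="x"/></b><b id="2"/></a>')
print("\n".join(tree_lines(sample)))
```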

#### 3. Getting representative chains for a given % id cluster

In [0]:
import urllib.request   # import the module for reading data from http server

In [0]:
import xml.etree.ElementTree   # import library to parse XML files

In [0]:
cluster = "100" # Cluster % identity (100,95,90,70,50,40,30)

In [0]:
urllib.request.urlopen?

In [0]:
fp = urllib.request.urlopen("https://www.rcsb.org/pdb/rest/representatives?cluster="+cluster,timeout=3600)

In [0]:
e = xml.etree.ElementTree.parse(fp).getroot()

In [0]:
chain_codes = []

In [0]:
for atype in e.findall('pdbChain'):
    chain_codes.append(atype.get('name'))

In [0]:
len(chain_codes)

71459

In [0]:
chain_codes[0:10]

['6B2K.A',
 '1K5N.B',
 '1GTF.A',
 '2VB1.A',
 '5NVG.A',
 '2HS1.A',
 '5E7W.A',
 '5E7W.B',
 '5Y2S.A',
 '4B1Y.B']
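
Each code in `chain_codes` joins a PDB id and a chain identifier with a dot. Here is a small sketch of grouping chains by PDB entry, using the ten codes printed above as input (the variable names are chosen for illustration only):

```python
from collections import defaultdict

# The first ten chain codes, retyped from the output above
codes = ['6B2K.A', '1K5N.B', '1GTF.A', '2VB1.A', '5NVG.A',
         '2HS1.A', '5E7W.A', '5E7W.B', '5Y2S.A', '4B1Y.B']

chains_by_entry = defaultdict(list)
for code in codes:
    pdb_id, chain_id = code.split(".", 1)  # split at the first dot only
    chains_by_entry[pdb_id].append(chain_id)

print(len(chains_by_entry))     # number of distinct PDB entries
print(chains_by_entry["5E7W"])  # chains listed for one entry
```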

#### 4. Getting all chains in the same % id cluster of a given chain

Entries in the PDB are *clustered* according to sequence similarity, each cluster containing proteins whose sequences have a percent identity no smaller than the cluster level. For instance, the 70% clusters contain proteins whose sequences have 70% identity or greater. The 100% clusters contain PDB entries whose sequences are 100% identical.
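
To make the notion of percent identity concrete, here is a toy calculation for two sequences that are already aligned and have the same length. This is only an illustration: the function name is hypothetical, the two sequences are invented, and this tutorial does not describe how the PDB actually computes its clusters (alignment method, handling of gaps and length differences).

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))  # count identical positions
    return 100.0 * matches / len(seq_a)


# Two invented 10-residue sequences differing only in the last position
print(percent_identity("ACDEFGHIKL", "ACDEFGHIKV"))
```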

In [0]:
import urllib.request   # import the module for reading data from http server

In [0]:
import xml.etree.ElementTree as ET  # import library to parse XML files

In [0]:
nmax = None # maximum number of entries to return

In [0]:
cluster = "100" # Cluster % identity

In [0]:
pdbFile = "2a8f"  # PDB id

In [0]:
chain = "A"  # PDB chain

In [0]:
fp = urllib.request.urlopen(
    "https://www.rcsb.org/pdb/rest/sequenceCluster?cluster="
    + cluster + "&structureId=" + pdbFile + "." + chain)

In [0]:
e = ET.parse(fp).getroot()

In [0]:
chain_codes = []

In [0]:
for atype in e.findall('pdbChain')[:nmax]:
    pdbName = atype.get('name')
    pdbRank = atype.get('rank')
    pp = urllib.request.urlopen(
        "https://www.rcsb.org/pdb/rest/describePDB?structureId="+pdbName[:4])
    root = ET.parse(pp).getroot()
    #for curPDB in root.findall("PDB")
    curPDB = root.find("PDB")
    structureId = curPDB.get("structureId")
    resolution = curPDB.get("resolution")
    release_date = curPDB.get("release_date")[:4]
    nr_entities = curPDB.get("nr_entities")
    nr_residues = curPDB.get("nr_residues")
    title = curPDB.get("title")[:90]
    if resolution is None:
        resolution = -1  # no resolution reported (e.g. NMR structures)
    # release_date was already cut down to the year above
    print("{:1} {:6} {:4} {:1} {:>4} {:5} {} ".format(
        pdbRank, pdbName, resolution, nr_entities, nr_residues, release_date, title))
    #print(pdbRank, pdbName,resolution, nr_entities, nr_residues, release_date[:4],title)


1 2AIB.A 1.10 1  196 2006  beta-cinnamomin in complex with ergosterol 
1 2AIB.B 1.10 1  196 2006  beta-cinnamomin in complex with ergosterol 
2 2A8F.A 1.35 1  196 2006  beta-cinnamomin after sterol removal 
2 2A8F.B 1.35 1  196 2006  beta-cinnamomin after sterol removal 
3 1LRI.A 1.45 1   98 2002  BETA-CRYPTOGEIN-CHOLESTEROL COMPLEX 
4 1LJP.A 1.80 1  196 2002  Crystal Structure of beta-Cinnamomin Elicitin 
4 1LJP.B 1.80 1  196 2002  Crystal Structure of beta-Cinnamomin Elicitin 
5 1BEO.A 2.20 1   98 1997  BETA-CRYPTOGEIN 
6 1BEG.A   -1 1   98 1997  STRUCTURE OF FUNGAL ELICITOR, NMR, 18 STRUCTURES 
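
The column layout of this table comes from the format string in the loop. As a small sketch of how the width and alignment codes behave, the example below formats one row with values taken from the output above and one row with a placeholder title (the variable names are for illustration only):

```python
row_fmt = "{:1} {:6} {:4} {:1} {:>4} {:5} {} "

# Strings are left-aligned by default and numbers right-aligned;
# '>' forces right alignment (used here for the residue count).
line_xray = row_fmt.format("5", "1BEO.A", "2.20", "1", "98", "1997", "BETA-CRYPTOGEIN")
line_nmr = row_fmt.format("6", "1BEG.A", -1, "1", "98", "1997", "placeholder title")
print(line_xray)
print(line_nmr)
```

Because the missing resolution is stored as the integer `-1` rather than a string, it is right-aligned within its four-character column.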


### Using the Biopython Library

In [0]:
!pip install biopython

Collecting biopython
  Downloading https://files.pythonhosted.org/packages/28/15/8ac646ff24cfa2588b4d5e5ea51e8d13f3d35806bd9498fbf40ef79026fd/biopython-1.73-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
    100% |████████████████████████████████| 2.2MB 13.4MB/s
Installing collected packages: biopython
Successfully installed biopython-1.73


In [0]:
from Bio.PDB import PDBParser, PDBList

In [0]:
pdbl = PDBList()

In [0]:
idList = pdbl.get_all_entries()

Retrieving index file. Takes about 27 MB.


In [0]:
print("The current number of PDB files is:",len(idList))

The current number of PDB files is: 150861


In [0]:
pdbl.retrieve_pdb_file("4ekz",file_format="pdb")

Downloading PDB structure '4ekz'...


'/content/ek/pdb4ekz.ent'

In [0]:
parser = PDBParser()

In [0]:
structure = parser.get_structure('4ekz', 'ek/pdb4ekz.ent')

In [0]:
import nglview as nv  # nglview must be installed first (`!pip install nglview`)
view = nv.show_biopython(structure)

In [0]:
view.background = "black"
view

### Viewing PDB files with the NGL viewer

In [0]:
!pip install nglview

In [0]:
import nglview as nv

In [0]:
view = nv.NGLWidget()

In [0]:
c = view.add_component('rcsb://2vb1.pdb')

In [0]:
view