# Using PDB's Python API

The value of computational skills is especially apparent when you find a way to automate boring and monotonous tasks, like gathering data from a database such as [the Protein Data Bank](https://www.rcsb.org/). This is known as **data mining**, and conveniently, a dedicated library exists for exactly this purpose: `pypdb`.



In [3]:
#import requests
import numpy as np
import pandas as pd
# from pypdb import *
import pypdb
from time import time

### Scenario:

You read a publication about a protein you're interested in and want to find the researchers' entry on PDB to learn more about that protein, say, which ligands it binds. To do this by hand, you copy the PMID or DOI (try it for [this example](https://www.sciencedirect.com/science/article/pii/S000629521500132X?via%3Dihub)), paste it into the search bar, and look up what you need. This will probably take you around 30 seconds. Now imagine having to do that 1200 times over. This is why we want to automate.

We start with a list of IDs in the data directory. 

In [4]:
path = 'data/PMIDS.txt'

#### Exercise 1. Read the data into a string variable called `pmids`.

In [5]:
# %load solutions/ex3_1.py
with open(path, 'r') as f:
    pmids = f.read()

In [6]:
print(pmids[:25])

PMID: 25772737
PMID: 2490


#### Exercise 2. Currently, the data is just a single long string. Obtain a list of the IDs.

In [7]:
# load solutions/ex3_1.py
pmids = pmids.splitlines()

In [8]:
pmids[:10]

['PMID: 25772737',
 'PMID: 24900475',
 'PMID: 26249349',
 'PMID: 29510947',
 'PMID: 26872970',
 'PMID: 21358046',
 'PMID: 20298671',
 'PMID: 29348071',
 'PMID: 29608391',
 'PMID: 26806381']

If you scroll through the list, you will notice that some of the PMIDs are followed by trailing whitespace. Make a new list that contains only the numbers, without the trailing whitespace.

In [9]:
# %load solutions/ex3_2.py
pmids = [pmid[6:14] for pmid in pmids]
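
The slice `[6:14]` assumes that every PMID has exactly eight digits. A slightly more forgiving variant, sketched here on made-up sample lines (the helper name `extract_pmid` is ours and is not part of the exercise solutions), splits at the colon and strips whatever whitespace surrounds the number:

```python
def extract_pmid(line):
    """Return the text after 'PMID:' with surrounding whitespace removed."""
    # partition splits at the first colon; the part after it holds the number
    _, _, number = line.partition(':')
    return number.strip()

# Hypothetical sample lines, including trailing spaces and a shorter ID
sample_lines = ['PMID: 25772737', 'PMID: 24900475   ', 'PMID: 123456 ']
sample_ids = [extract_pmid(line) for line in sample_lines]
print(sample_ids)
```

For the eight-digit IDs in this dataset, both approaches give the same result; the difference only shows up for shorter or longer IDs.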

In [10]:
#from pypdb import * # this statement will import all functions and classes from pypdb

In [11]:
fst = pmids[0]
fst

'25772737'

## Do a search on PDB

When you get started with a new library, you would commonly just follow the examples from the official [docs](http://www.wgilpin.com/pypdb_docs/html/). You don't need to remember the exact procedures by heart; you can always go back to the docs. Here, we run the query as follows:

In [12]:
search_dict = pypdb.make_query(fst)
hits = pypdb.do_search(search_dict)

In [13]:
hits

['4X9P']

This gives us a list of matching PDB IDs. We can get a summary of an entry:

In [14]:
pypdb.get_all_info(hits[0])

{'polymer': {'@entityNr': '1',
  '@length': '343',
  '@type': 'protein',
  '@weight': '38985.5',
  'chain': {'@id': 'A'},
  'Taxonomy': {'@name': 'Bos taurus', '@id': '9913'},
  'synonym': {'@name': 'Annexin II,Annexin-2,Calpactin I heavy chain,Calpactin-1 heavy chain,Chromobindin-8,Lipocortin II,Placental anticoagulant protein IV,PAP-IV,Protein I,p36'},
  'macroMolecule': {'@name': 'Annexin A2', 'accession': {'@id': 'P04272'}},
  'polymerDescription': {'@description': 'Annexin A2'}},
 'id': '4X9P'}

We can also get some metadata:

In [15]:
pypdb.describe_pdb(hits[0])

{'structureId': '4X9P',
 'title': 'Crystal structure of bovine Annexin A2',
 'pubmedId': '25772737',
 'expMethod': 'X-RAY DIFFRACTION',
 'resolution': '2.01',
 'keywords': 'CALCIUM BINDING PROTEIN',
 'nr_entities': '1',
 'nr_residues': '343',
 'nr_atoms': '2544',
 'deposition_date': '2014-12-11',
 'release_date': '2015-01-14',
 'last_modification_date': '2015-04-29',
 'structure_authors': 'Shumilin, I.A., Hollas, H., Vedeler, A., Kretsinger, R.H.',
 'citation_authors': 'Raddum, A.M., Hollas, H., Shumilin, I.A., Henklein, P., Kretsinger, R., Fossen, T., Vedeler, A.',
 'status': 'CURRENT'}

However, we are interested in the ligands, so we use the specialized function for that: `get_ligands`.

In [16]:
ligands = pypdb.get_ligands(hits[0])

In [17]:
ligands

{'ligandInfo': {'ligand': [{'@structureId': '4X9P',
    '@chemicalID': 'CA',
    '@type': 'non-polymer',
    '@molecularWeight': '40.078',
    'chemicalName': 'CALCIUM ION',
    'formula': 'Ca 2',
    'InChIKey': 'BHPQYMZQTOCNFJ-UHFFFAOYSA-N',
    'InChI': 'InChI=1S/Ca/q+2',
    'smiles': '[Ca+2]'},
   {'@structureId': '4X9P',
    '@chemicalID': 'CL',
    '@type': 'non-polymer',
    '@molecularWeight': '35.453',
    'chemicalName': 'CHLORIDE ION',
    'formula': 'Cl -1',
    'InChI': 'InChI=1S/ClH/h1H/p-1',
    'InChIKey': 'VEXZGXHMUGYJMC-UHFFFAOYSA-M',
    'smiles': '[Cl-]'}]},
 'id': '4X9P'}

#### Exercise 3. How many ligands does this protein have?

In [18]:
# %load solutions/ex3_3a.py

# len(ligands) <--- you would expect this to work, but if you look at ligands.keys() you see the data is structured a little strangely

print(ligands.keys())

# the right way turns out to be:
len(ligands['ligandInfo']['ligand'])

# the best way to find this is by gradually inspecting the data structure, calling `type` when necessary, etc.

dict_keys(['ligandInfo', 'id'])


2
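
The "gradual inspection" mentioned in the solution can also be done with a small recursive helper. The following is only a sketch: the function name `describe` and the trimmed-down `example` dict are ours, chosen to mimic the shape of `ligands` above without needing a network call.

```python
def describe(obj, indent=0):
    """Return lines describing the nesting of dicts and lists in obj."""
    pad = '  ' * indent
    if isinstance(obj, dict):
        lines = []
        for key, value in obj.items():
            lines.append(f"{pad}{key!r}: {type(value).__name__}")
            # Recurse only into containers; leaf values are already described
            if isinstance(value, (dict, list)):
                lines.extend(describe(value, indent + 1))
        return lines
    if isinstance(obj, list):
        lines = [f"{pad}list of {len(obj)} item(s)"]
        if obj:
            # Assumption: items share a shape, so the first one is representative
            lines.extend(describe(obj[0], indent + 1))
        return lines
    return [f"{pad}{type(obj).__name__}"]

# Trimmed-down stand-in for the `ligands` dict shown above
example = {
    'ligandInfo': {'ligand': [{'chemicalName': 'CALCIUM ION'},
                              {'chemicalName': 'CHLORIDE ION'}]},
    'id': '4X9P',
}
print('\n'.join(describe(example)))
```

The printout makes it visible at a glance that the list of ligands sits two levels down, under `'ligandInfo'` and then `'ligand'`.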

#### 3.b. What are those ligands?
Hint: to make it easier, look simultaneously at the entry on the [webpage](https://www.rcsb.org/structure/4X9P).  

In [19]:
# %load solutions/ex3_3b.py
for lig in ligands['ligandInfo']['ligand']:
    print(lig['chemicalName'])

CALCIUM ION
CHLORIDE ION


In [20]:
pypdb.describe_chemical('CA')

{'describeHet': {'ligandInfo': {'ligand': {'@chemicalID': 'CA',
    '@type': 'non-polymer',
    '@molecularWeight': '40.078',
    'chemicalName': 'CALCIUM ION',
    'formula': 'Ca 2',
    'InChIKey': 'BHPQYMZQTOCNFJ-UHFFFAOYSA-N',
    'InChI': 'InChI=1S/Ca/q+2',
    'smiles': '[Ca+2]'}}}}
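
Note that in this output `'ligand'` holds a single dict, whereas `get_ligands` returned a list of dicts for 4X9P. It seems plausible, though it is not verified here, that `get_ligands` also returns a bare dict when an entry has only one ligand. A small defensive helper could normalize both shapes; the name `as_list` and the `single` record below are hypothetical and not part of `pypdb`.

```python
def as_list(value):
    """Wrap a single item in a list; leave lists untouched; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

# Stand-in for the list shape returned by get_ligands for 4X9P
many = {'ligandInfo': {'ligand': [{'chemicalName': 'CALCIUM ION'},
                                  {'chemicalName': 'CHLORIDE ION'}]}}
# Hypothetical single-ligand record, assuming the dict shape seen above
single = {'ligandInfo': {'ligand': {'chemicalName': 'CALCIUM ION'}}}

for record in (many, single):
    names = [lig['chemicalName'] for lig in as_list(record['ligandInfo']['ligand'])]
    print(names)
```

With this wrapper, the same list comprehension works whether the entry has one ligand or several.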

#### Exercise 4. Put everything into a single function that does it all in one call

Hint: you can make one function per sub-task.

In [25]:
def get_pdb_ids(pmid):
    search_dict = pypdb.make_query(pmid)
    hits = pypdb.do_search(search_dict)
    return hits

# sometimes we get multiple hits (or none!). For now, we are only concerned with the first one.

def get_ligands(pdb_id):
    ligs = pypdb.get_ligands(pdb_id)
    return [lig['chemicalName'] for lig in ligs['ligandInfo']['ligand']]

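def whole_pipeline(pmid):
    hits = get_pdb_ids(pmid)
    # no matching PDB entry for this PMID
    if len(hits) < 1:
        return []
    try:
        ligands = get_ligands(hits[0])
    except Exception:
        # the lookup failed or the entry has no ligand list; treat as "no ligands"
        return []
    print(ligands, '\n')
    return ligands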

In [26]:
for pmid in pmids[:4]:
    whole_pipeline(pmid)

['CALCIUM ION', 'CHLORIDE ION'] 

['{3-[(5-methyl-2-phenyl-1,3-oxazol-4-yl)methoxy]phenyl}methanol', '1,2-ETHANEDIOL'] 

['ISOQUINOLIN-5-AMINE', 'DIMETHYL SULFOXIDE', 'ethane-1,2-diol', 'HISTIDINE', 'SULFATE ION'] 


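
To run the pipeline over the full list of IDs, it helps to keep the results and to note which IDs failed, rather than only printing. Below is a minimal sketch, assuming a lookup function with the same call shape as `whole_pipeline`. The names `run_batch` and `fake_lookup` are ours; `fake_lookup` is a hypothetical stand-in so that the example runs without network access.

```python
from time import time

def run_batch(pmids, lookup):
    """Apply lookup to each PMID; return (results, failures, elapsed seconds)."""
    results, failures = {}, []
    start = time()
    for pmid in pmids:
        try:
            results[pmid] = lookup(pmid)
        except Exception as err:
            # Record the failure instead of stopping the whole batch
            failures.append((pmid, repr(err)))
    return results, failures, time() - start

# Hypothetical stand-in for a network-backed lookup such as whole_pipeline
def fake_lookup(pmid):
    table = {'25772737': ['CALCIUM ION', 'CHLORIDE ION']}
    return table[pmid]  # raises KeyError for unknown PMIDs

results, failures, elapsed = run_batch(['25772737', '00000000'], fake_lookup)
print(results)
print(failures)
```

The returned `results` dict can be inspected directly or turned into a table later, and `elapsed` gives a rough idea of how long the full list of 1200 IDs would take.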

In [5]:
#make_query 	Structure a search request into a dict() 
#do_search 	Perform a search for PDB IDs 
#get_all 	Get all active PDB IDs 
#describe_pdb 	All metadata about PDB entry 
#get_all_info 	All information deposited in PDB entry 
#get_pdb_file 	Retrieve .pdb/.cif/.xml file for PDB entry 
#get_blast 	BLAST search results for PDB entry 
#find_papers 	Find papers associated with keyword 
#find_authors 	Find authors associated with keyword 

In [7]:
#get_pdbs(pmids[0])