# Using PDB's Python API

Computational skills pay off most visibly when they let you automate tedious, repetitive tasks, such as gathering data from a database like [the Protein Data Bank](https://www.rcsb.org/) (PDB). Collecting data this way is a form of **data mining**, and conveniently there is a dedicated library for it: `pypdb`.



In [1]:
import numpy as np
import pandas as pd
import pypdb
from time import time

### Scenario:

You read a publication about a protein you're interested in and want to look up the researchers' PDB entry to learn something about that protein, say, which ligands it binds. To do this by hand, you copy the PMID or DOI (try it with [this example](https://www.sciencedirect.com/science/article/pii/S000629521500132X?via%3Dihub)), paste it into the PDB search bar, and find what you're looking for. That probably takes around 30 seconds. Now imagine doing it 1200 times over: that is roughly 10 hours of copying and pasting. This is why we want to automate.

We start with a list of PMIDs stored in a text file in the data directory.

In [2]:
path = 'data/PMIDS.txt'

#### Exercise 1. Read the data into a string variable called `pmids`.

In [3]:
# %load solutions/ex3_1.py
with open(path, 'r') as f:
    pmids = f.read()

In [4]:
print(pmids[:25])

PMID: 25772737
PMID: 2490


#### Exercise 2. Currently, the data is just a single long string. Obtain a list of the IDs (i.e. a list of strings).

In [5]:
# load solutions/ex3_1.py
pmids = pmids.splitlines()

In [6]:
pmids[:10]

['PMID: 25772737',
 'PMID: 24900475',
 'PMID: 26249349',
 'PMID: 29510947',
 'PMID: 26872970',
 'PMID: 21358046',
 'PMID: 20298671',
 'PMID: 29348071',
 'PMID: 29608391',
 'PMID: 26806381']

In [7]:
pmids

['PMID: 25772737',
 'PMID: 24900475',
 'PMID: 26249349',
 'PMID: 29510947',
 'PMID: 26872970',
 'PMID: 21358046',
 'PMID: 20298671',
 'PMID: 29348071',
 'PMID: 29608391',
 'PMID: 26806381',
 'PMID: 27708405',
 'PMID: 30060461',
 'PMID: 28495381',
 'PMID: 20158266',
 'PMID: 24847974',
 'PMID: 19827080',
 'PMID: 30784877',
 'PMID: 20818756',
 'PMID: 25690569',
 'PMID: 24870364',
 'PMID: 31362921',
 'PMID: 25666181',
 'PMID: 27645997',
 'PMID: 29910922',
 'PMID: 21192659',
 'PMID: 21972967',
 'PMID: 21763501',
 'PMID: 25636146',
 'PMID: 27726357',
 'PMID: 20711197',
 'PMID: 28437106',
 'PMID: 28940929',
 'PMID: 31494535',
 'PMID: 24900229',
 'PMID: 26637553',
 'PMID: 21091447',
 'PMID: 29877794',
 'PMID: 23072895',
 'PMID: 27438815',
 'PMID: 30160524',
 'PMID: 22890978',
 'PMID: 27676184',
 'PMID: 31973307',
 'PMID: 20797617',
 'PMID: 20043700',
 'PMID: 25556635',
 'PMID: 24704437',
 'PMID: 24505392',
 'PMID: 29364664',
 'PMID: 24677424',
 'PMID: 23517028',
 'PMID: 22963052',
 'PMID: 2564

If you scroll down, you will notice that some of the PMIDs are followed by trailing spaces. Make a new list that contains only the numbers, without the `PMID: ` prefix or any trailing whitespace.

In [8]:
# %load solutions/ex3_2.py
pmids = [pmid[6:14] for pmid in pmids]
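
The slice `pmid[6:14]` relies on every PMID being exactly eight digits long. A slightly more defensive variant, sketched below on a few hand-written sample lines (not the real file), splits on the colon and strips whitespace instead. The helper name `clean_pmid` is made up for this illustration.

```python
def clean_pmid(line):
    """Return only the numeric part of a line such as 'PMID: 25772737  '."""
    # Take everything after the first colon and remove surrounding whitespace
    return line.split(':', 1)[1].strip()

# Hand-written sample lines for illustration; note the trailing spaces
sample = ['PMID: 25772737', 'PMID: 24900475   ', 'PMID: 1234567 ']
cleaned = [clean_pmid(line) for line in sample]
print(cleaned)
```

This version also copes with IDs that are shorter or longer than eight digits.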

In [9]:
#from pypdb import * # this statement will import all functions and classes from pypdb

## Do a search on PDB

When you get started with a new library, you will usually just follow the examples in the official [docs](http://www.wgilpin.com/pypdb_docs/html/). You don't need to memorize the exact procedures; you can always go back to the docs. Here, we run the query as follows:

In [11]:
search_dict = pypdb.make_query(pmids[0])
hits = pypdb.do_search(search_dict)

In [12]:
hits

['4X9P']

This gives us a list of matching PDB IDs. We can get a summary of an entry:

In [13]:
pypdb.get_all_info(hits[0])

{'polymer': {'@entityNr': '1',
  '@length': '343',
  '@type': 'protein',
  '@weight': '38985.5',
  'chain': {'@id': 'A'},
  'Taxonomy': {'@name': 'Bos taurus', '@id': '9913'},
  'synonym': {'@name': 'Annexin II,Annexin-2,Calpactin I heavy chain,Calpactin-1 heavy chain,Chromobindin-8,Lipocortin II,Placental anticoagulant protein IV,PAP-IV,Protein I,p36'},
  'macroMolecule': {'@name': 'Annexin A2', 'accession': {'@id': 'P04272'}},
  'polymerDescription': {'@description': 'Annexin A2'}},
 'id': '4X9P'}

and some metadata:

In [14]:
pypdb.describe_pdb(hits[0])

{'structureId': '4X9P',
 'title': 'Crystal structure of bovine Annexin A2',
 'pubmedId': '25772737',
 'expMethod': 'X-RAY DIFFRACTION',
 'resolution': '2.01',
 'keywords': 'CALCIUM BINDING PROTEIN',
 'nr_entities': '1',
 'nr_residues': '343',
 'nr_atoms': '2544',
 'deposition_date': '2014-12-11',
 'release_date': '2015-01-14',
 'last_modification_date': '2015-04-29',
 'structure_authors': 'Shumilin, I.A., Hollas, H., Vedeler, A., Kretsinger, R.H.',
 'citation_authors': 'Raddum, A.M., Hollas, H., Shumilin, I.A., Henklein, P., Kretsinger, R., Fossen, T., Vedeler, A.',
 'status': 'CURRENT'}

But what we are really interested in is the ligands, so we use the specialized function for finding them, `get_ligands`:

In [15]:
ligands = pypdb.get_ligands(hits[0])

In [16]:
ligands

{'ligandInfo': {'ligand': [{'@structureId': '4X9P',
    '@chemicalID': 'CA',
    '@type': 'non-polymer',
    '@molecularWeight': '40.078',
    'chemicalName': 'CALCIUM ION',
    'formula': 'Ca 2',
    'InChIKey': 'BHPQYMZQTOCNFJ-UHFFFAOYSA-N',
    'InChI': 'InChI=1S/Ca/q+2',
    'smiles': '[Ca+2]'},
   {'@structureId': '4X9P',
    '@chemicalID': 'CL',
    '@type': 'non-polymer',
    '@molecularWeight': '35.453',
    'chemicalName': 'CHLORIDE ION',
    'formula': 'Cl -1',
    'InChI': 'InChI=1S/ClH/h1H/p-1',
    'InChIKey': 'VEXZGXHMUGYJMC-UHFFFAOYSA-M',
    'smiles': '[Cl-]'}]},
 'id': '4X9P'}

#### Exercise 3. How many ligands does this protein have?

In [18]:
# %load solutions/ex3_3a.py

# len(ligands) <--- you would expect this to work, but if you look at ligands.keys() you see the data is structured a little strangely

print(ligands.keys())

# the right way turns out to be:
len(ligands['ligandInfo']['ligand'])


dict_keys(['ligandInfo', 'id'])


2

#### 3.b. What are those ligands?
Hint: to make it easier, look simultaneously at the entry on the [webpage](https://www.rcsb.org/structure/4X9P).  

In [17]:
# %load solutions/ex3_3b.py
for lig in ligands['ligandInfo']['ligand']:
    print(lig['chemicalName'])
    
# you could also have used @chemicalID or formula instead of chemicalName

CALCIUM ION
CHLORIDE ION
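
The same nested structure can be turned into a small lookup table. The sketch below works on a trimmed, hand-copied version of the `ligands` output shown above, so it runs without contacting PDB; the variable names `ligands_example` and `names_by_id` are illustrative only.

```python
# Trimmed copy of the output of pypdb.get_ligands('4X9P') shown above
ligands_example = {
    'ligandInfo': {'ligand': [
        {'@chemicalID': 'CA', 'chemicalName': 'CALCIUM ION', 'formula': 'Ca 2'},
        {'@chemicalID': 'CL', 'chemicalName': 'CHLORIDE ION', 'formula': 'Cl -1'},
    ]},
    'id': '4X9P',
}

# Map each ligand's chemical ID to its full name
names_by_id = {
    lig['@chemicalID']: lig['chemicalName']
    for lig in ligands_example['ligandInfo']['ligand']
}
print(names_by_id)
```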


In [21]:
pypdb.describe_chemical('CA')

{'describeHet': {'ligandInfo': {'ligand': {'@chemicalID': 'CA',
    '@type': 'non-polymer',
    '@molecularWeight': '40.078',
    'chemicalName': 'CALCIUM ION',
    'formula': 'Ca 2',
    'InChIKey': 'BHPQYMZQTOCNFJ-UHFFFAOYSA-N',
    'InChI': 'InChI=1S/Ca/q+2',
    'smiles': '[Ca+2]'}}}}
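
Note that in this output the value under `'ligand'` is a single dict, whereas `get_ligands` returned a list of dicts above. If `get_ligands` behaves the same way for an entry with exactly one ligand (an assumption, not verified here), looping over it directly would iterate over the dict's keys rather than over ligands. A small hypothetical helper, here called `as_list`, could guard against that:

```python
def as_list(value):
    """Wrap a single item in a list; return lists unchanged."""
    return value if isinstance(value, list) else [value]

# Hand-written examples mimicking the two shapes seen in the outputs above
one_ligand = {'@chemicalID': 'CA', 'chemicalName': 'CALCIUM ION'}
two_ligands = [one_ligand, {'@chemicalID': 'CL', 'chemicalName': 'CHLORIDE ION'}]

# Both shapes can now be handled by the same loop
ids_one = [lig['@chemicalID'] for lig in as_list(one_ligand)]
ids_two = [lig['@chemicalID'] for lig in as_list(two_ligands)]
print(ids_one, ids_two)
```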

#### Exercise 4. Put everything into a single function `whole_pipeline` that lets you extract the ligands in a single function call.

Hint: you can make one function per sub-task.

In [28]:
# %load solutions/ex3_4.py

def get_pdb_ids(pmid):
    search_dict = pypdb.make_query(pmid)
    hits = pypdb.do_search(search_dict)
    return hits

# sometimes we get multiple hits (or none!). For now, we are only concerned with the first one.

def get_ligands(pdb_id):
    ligs = pypdb.get_ligands(pdb_id)
    return [lig['@chemicalID'] for lig in ligs['ligandInfo']['ligand']]

def whole_pipeline(pmid):
    hits = get_pdb_ids(pmid)
    if len(hits) < 1:
        return []  # no PDB entry found for this PMID
    try:
        ligands = get_ligands(hits[0])
    except Exception:
        return []  # ligand data missing or not in the expected format
    print(ligands, '\n')
    return ligands


# Explanation:
# the try and except blocks let you first attempt (try) to execute some code.
# If that code raises an error, execution jumps to the code in the except block.
# If no error occurs, the try block simply runs to completion and the except block is skipped.
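
To see this control flow in isolation, here is a minimal sketch that does not touch PDB. The function `first_ligand_id` and its inputs are invented for the illustration.

```python
def first_ligand_id(entry):
    """Return the first ligand ID in `entry`, or None if the lookup fails."""
    try:
        # Raises KeyError if a key is missing, IndexError if the list is empty
        return entry['ligandInfo']['ligand'][0]['@chemicalID']
    except (KeyError, IndexError):
        # Only reached when the try block raised one of the listed errors
        return None

good = {'ligandInfo': {'ligand': [{'@chemicalID': 'CA'}]}}
empty = {'ligandInfo': {'ligand': []}}
missing = {'id': '4X9P'}

print(first_ligand_id(good), first_ligand_id(empty), first_ligand_id(missing))
```

Catching specific exception types, rather than everything, keeps unrelated errors such as typos visible.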

Depending on exactly how you solved Exercise 4, we can now call this pipeline in a loop over the PMIDs and store each result in a dictionary that we can export later.

In [29]:
result = {} #initialize empty dict

for pmid in pmids[:20]:
    ligands = whole_pipeline(pmid)
    result[pmid] = ligands

['CA', 'CL'] 

['BPK', 'EDO'] 

['5IQ', 'DMS', 'EDO', 'HIS', 'SO4'] 

['5UH', 'ZN'] 

['SEP', 'XIX'] 

['CL', 'EDO', 'KCX', 'TI7'] 

['ADP', 'DS0', 'GOL', 'MG'] 

['ADN', 'GOL', 'TRS'] 

['13P', 'NAD', 'SO4'] 

['54P', 'ACT'] 

['A4D', 'EOH', 'GOL', 'SO4'] 

['FCA', 'FCB'] 

['CEM', 'SO4', 'TEM'] 
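
Since `time` was imported at the top, it can be used to measure how long such a loop takes. The sketch below substitutes a hypothetical stand-in function, `fake_pipeline`, for `whole_pipeline` so that it runs without network access; the measured duration is therefore not representative of real queries.

```python
from time import time

def fake_pipeline(pmid):
    """Stand-in for whole_pipeline: returns a dummy ligand list."""
    return ['CA', 'CL']

# First three IDs from the list above
sample_pmids = ['25772737', '24900475', '26249349']

timed_result = {}
start = time()  # seconds since the epoch, as a float
for pmid in sample_pmids:
    timed_result[pmid] = fake_pipeline(pmid)
elapsed = time() - start

print(f'{len(timed_result)} PMIDs processed in {elapsed:.4f} s')
```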



Pandas can be used for display purposes:

In [35]:
df = pd.DataFrame(data=result.items(), columns=['PMID','ligands'])
df

|   | PMID | ligands |
|---|---|---|
| 0 | 25772737 | [CA, CL] |
| 1 | 24900475 | [BPK, EDO] |
| 2 | 26249349 | [5IQ, DMS, EDO, HIS, SO4] |
| 3 | 29510947 | [] |
| 4 | 26872970 | [5UH, ZN] |
| 5 | 21358046 | [SEP, XIX] |
| 6 | 20298671 | [] |
| 7 | 29348071 | [CL, EDO, KCX, TI7] |
| 8 | 29608391 | [ADP, DS0, GOL, MG] |
| 9 | 26806381 | [] |


Now we can export this into a new file.

In [36]:
df.to_csv('result.csv')
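
One thing to keep in mind is that CSV has no notion of a list, so each ligand list is written out as text. A possible alternative, sketched here with the standard-library `csv` module and an in-memory buffer instead of a file, is to join the IDs with a separator of your choosing (`;` here is an arbitrary choice) so they are easy to split again after reading. The rows are copied from the table above.

```python
import csv
import io

# Three rows copied from the table above
example_result = {
    '25772737': ['CA', 'CL'],
    '24900475': ['BPK', 'EDO'],
    '29510947': [],
}

buffer = io.StringIO()  # stands in for an open file
writer = csv.writer(buffer)
writer.writerow(['PMID', 'ligands'])
for pmid, ligs in example_result.items():
    # Store the ligand IDs as one ';'-separated string per row
    writer.writerow([pmid, ';'.join(ligs)])

# Read the data back and split the ligand column into lists again
buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = {
    row['PMID']: row['ligands'].split(';') if row['ligands'] else []
    for row in rows
}
print(restored)
```

The joined strings could equally be stored in the DataFrame column before calling `to_csv`.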