<a href="https://colab.research.google.com/github/khanfs/ComputationalBiology-xGenomics/blob/main/PDB_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Protein Data Bank APIs**
The PyPDB and Biotite packages provide Python interfaces to the Protein Data Bank (PDB). This notebook shows how to install them and gives examples of querying the PDB: Biotite to filter structures by given criteria, and PyPDB to download metadata and structure files.
### **Research Collaboratory for Structural Bioinformatics PDB**
The [RCSB Protein Data Bank](https://www.rcsb.org/) is a structural biology database of 3D structures of proteins, nucleic acids and [complex assemblies](https://www.ebi.ac.uk/QuickGO/term/GO:0034622). The structural data come from experiments: *X-ray crystallography*, the de facto method for decades, *nuclear magnetic resonance (NMR)*, and *cryo-electron microscopy (cryo-EM)*. Cryo-EM is rapidly gaining ground on X-ray crystallography, which is over a century old. The PDB contains over **160,000 protein structures**. Each entry contains:

1. 3D coordinates of the atoms, and the bonds connecting them, for proteins, ligands, cofactors, water molecules and ions
2. 3D visualisations of the protein structures, with ligand interactions if available
3. meta-information on the structural data, e.g. the PDB ID, the authors, the deposition date and the structure determination method used
4. structural quality metrics such as the resolution, which reflects the quality of the collected data and is given in Å (ångström); the lower the value, the higher the quality of the structure
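
The method and resolution metadata listed above can be used as search criteria. The following is a minimal sketch, using only the standard library, of a query body for the RCSB Search API. The function name `build_resolution_query` is hypothetical, and the attribute names `exptl.method` and `rcsb_entry_info.resolution_combined` are assumed to match the structure attribute list on the RCSB site. The query is only built and printed, not sent.

```python
import json

# Endpoint of the RCSB Search API (v2); the query below is only built, not sent.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_resolution_query(max_resolution, method="X-RAY DIFFRACTION"):
    """Build a query for entries solved by `method` at or below `max_resolution` (in Å)."""
    by_method = {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "exptl.method",  # assumed attribute name
            "operator": "exact_match",
            "value": method,
        },
    }
    by_resolution = {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_entry_info.resolution_combined",  # assumed attribute name
            "operator": "less_or_equal",
            "value": max_resolution,
        },
    }
    # Both conditions must hold, so they are combined in an "and" group
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [by_method, by_resolution],
        },
        "return_type": "entry",
    }


query = build_resolution_query(2.0)
print(json.dumps(query, indent=2))
```

Sending this JSON to `SEARCH_URL` in a POST request would be the next step; the packages introduced below wrap that kind of request.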

**Resources:**
* [PDB Meta-information list of fields for chemicals](https://search.rcsb.org/chemical-search-attributes.html)
* [PDB Meta-information list of fields for structures](https://search.rcsb.org/structure-search-attributes.html) 

**References:**
* Rose, P. W., Prlić, A., Altunkaya, A., Bi, C., Bradley, A. R., Christie, C. H., Costanzo, L. D., Duarte, J. M., Dutta, S., Feng, Z., Green, R. K., Goodsell, D. S., Hudson, B., Kalro, T., Lowe, R., Peisach, E., Randle, C., Rose, A. S., Shao, C., Tao, Y. P., … Burley, S. K. (2017). The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic acids research, 45(D1), D271–D281. [https://doi.org/10.1093/nar/gkw1000](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210513/)
* Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic acids research, 28(1), 235–242. [https://doi.org/10.1093/nar/28.1.235](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102472/)

In [1]:
# Install Conda environment
! pip install -q condacolab

In [2]:
import condacolab
condacolab.install()

✨🍰✨ Everything looks OK!


In [33]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# **1. PyPDB**
PyPDB is a Python interface to the PDB. It can search the PDB by keyword, author, publication date and experimental method, as well as by specific sequence motifs. For individual PDB IDs, the retrievable information ranges from metadata (deposition date, authors, etc.) to the full `.pdb` data file stored with an entry. Standard information such as gene ontology, ligand and protein family information can also be accessed, and PyPDB can perform BLAST searches (Gilpin, 2016). In short, the package can be used not only to query for PDB IDs but also to download the associated metadata and structure files.

**Resources:**

* [Primary PyPDB functions](https://academic.oup.com/view-large/35641249)
* [Functions for searching the RCSB PDB for lists of PDB IDs](https://github.com/williamgilpin/pypdb/blob/master/pypdb/pypdb.py)

**References:**

*  William Gilpin, PyPDB: a Python API for the Protein Data Bank, Bioinformatics, Volume 32, Issue 1, 1 January 2016, Pages 159–160, [https://doi.org/10.1093/bioinformatics/btv543](https://academic.oup.com/bioinformatics/article/32/1/159/1743800) 

In [4]:
# Install PyPDB 
! pip install pypdb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
! pytest

platform linux -- Python 3.7.13, pytest-3.6.4, py-1.11.0, pluggy-0.7.1
rootdir: /content, inifile:
plugins: typeguard-2.7.1
collected 0 items



## **PyPDB Functions**

There are generally 3 types of functions:

* Functions that perform searches and return lists of PDB IDs
* Functions that get information about specific PDB IDs
* Other general-purpose lookup functions
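
As an illustration of the second kind of function, the sketch below summarises an entry-level metadata dictionary. It assumes the dictionary follows the RCSB Data API entry schema (keys such as `struct`, `exptl`, `rcsb_accession_info` and `rcsb_entry_info`); check the keys returned by your PyPDB version before relying on them. The helper name `summarise_entry` and the example dictionary are hypothetical and written by hand, not real PDB data.

```python
def summarise_entry(info):
    """Pick a few commonly used fields out of an entry-level metadata dictionary."""
    entry_info = info.get("rcsb_entry_info", {})
    # resolution_combined is assumed to be a list; entries without a resolution may omit it
    resolutions = entry_info.get("resolution_combined") or [None]
    return {
        "pdb_id": info.get("rcsb_id"),
        "title": info.get("struct", {}).get("title"),
        "methods": [experiment.get("method") for experiment in info.get("exptl", [])],
        "deposited": info.get("rcsb_accession_info", {}).get("deposit_date"),
        "resolution": resolutions[0],
    }


# Hypothetical, hand-written stand-in for a metadata dictionary (not real PDB data)
example_info = {
    "rcsb_id": "XXXX",
    "struct": {"title": "Placeholder title"},
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "rcsb_accession_info": {"deposit_date": "2000-01-01T00:00:00Z"},
    "rcsb_entry_info": {"resolution_combined": [1.8]},
}
summary = summarise_entry(example_info)
print(summary)
```

Using `.get()` throughout means a missing key yields `None` or an empty list instead of raising an error, which is convenient when looping over many entries.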

In [6]:
# (*) imports every public name from the pypdb package
from pypdb import *

### 1. Get a list of PDBs for a specific search term

In [7]:
found_pdbs = Query('ACE2').search()
print(found_pdbs)

['7U0N', '7JVO', '6M1D', '6M18', '1R4L', '1R42', '7RPV', '6M17', '6ACD', '6ACC', '7DK3', '7V61', '7KMB', '7DF4', '6VW1', '7DF3', '6ACK', '6ACJ', '6ACG', '7WRV', '7C8D', '7A93', '7KNI', '7KNH', '7KNE', '7KNB', '7KMZ', '7KMS', '3D0I', '3D0H', '3D0G', '2G75', '7XO5', '7WPC', '7WPB', '7WPA', '7VX5', '7VX4', '7V8A', '7V89', '7V88', '7T9K', '7KJ4', '7KJ3', '7KJ2', '7DMU', '7A98', '7A97', '7A96', '7A95', '7A94', '7A92', '7A91', '6M0J', '6LZG', '7W9I', '7WP9', '7KJ5', '7JWY', '6XM5', '6XM4', '6XM3', '6XM0', '6XLU', '7WPF', '7WPE', '7WPD', '7XOC', '7XOB', '7XOA', '7XO9', '7XO8', '7XO7', '7XO6', '7XO4', '7VXM', '7VXK', '7VXF', '7VXD', '7VXA', '7VX9', '7MJN', '7MJM', '7EFR', '7EFP', '7WEV', '7VXI', '7VXE', '7VX1', '7F5R', '7W9C', '7W9B', '7W99', '7W98', '7E7E', '7XOD', '7T9J', '7E3J', '7W94', '7W92', '2DD8', '7VXC', '7VXB', '7V86', '7V85', '7V83', '7V82', '7V81', '7V7Z', '7TEZ', '7TEW', '7T9L', '7SY8', '7SY6', '7SY4', '7SY2', '7SY0', '7SXY', '7EDJ', '7DX8', '7DX6', '3SCL', '3SCK', '7DDP', '7DDO',

### 2. Search by PubMed ID Number


In [8]:
found_pdbs = Query('34132541', 'PubmedIdQuery').search()
print(found_pdbs)

None




### 3. Search by source organism using NCBI TaxId


In [9]:
found_pdbs = Query('6239', 'TreeEntityQuery').search()  # NCBI TaxID for C. elegans
print(found_pdbs[:5])

['1D4X', '1DYW', '1E3B', '1E8K', '1EMS']


# **2. Biotite**
[Biotite](https://www.biotite-python.org/) is a comprehensive open-source framework for computational molecular biology. It analyses sequence and structure data and handles most of the data internally, without additional software. It covers the complete bioinformatics workflow, from fetching, reading and writing the relevant files to efficient and intuitive analysis and manipulation of the data (Kunzmann & Hamacher, 2018). Biotite can read and write the FASTA format, so FASTA files can be used to load and save nucleotide and protein sequences. Biotite makes heavy use of NumPy, and some of its code is written in [Cython](https://cython.org/), resulting in performance comparable to native C code.

**Resources:**

* [API Reference](https://www.biotite-python.org/apidoc/index.html)

**References:**

* Kunzmann, P., Hamacher, K. Biotite: a unifying open source computational biology framework in Python. BMC Bioinformatics 19, 346 (2018). [https://doi.org/10.1186/s12859-018-2367-z](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2367-z#citeas)

In [10]:
# Install Biotite
! pip install biotite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## **Biotite Functions**

**Biotite is divided into four subpackages**: `sequence` and `structure` provide tools for handling sequences and biomolecular structures, respectively; `database` fetches files from biological databases; and `application` offers interfaces to external software.

1. **Sequence subpackage:** contains functionality for working with sequence information of any kind. The package contains by default sequence types for nucleotides and proteins.

2. **Structure subpackage:** enables handling of 3D structures of biomolecules. Simplified, a structure is represented by a list of atoms and their properties, based on ndarray objects. Optionally, this representation can be enriched with chemical bond information. Biotite supports different structure formats, including the ones provided by the RCSB and Gromacs trajectory formats. The subpackage offers a wide range of functions for atom filtering, coordinate transformations, angle and bond measurements, accessible surface area calculation, structure superimposition and more.

3. **Application subpackage:** provides interfaces to external software, in case Biotite’s integrated functionality is not sufficient for your tasks. These interfaces range from locally installed software (e.g. MSA software) to web services (e.g. BLAST). The interfaces are seamless: you do not have to write input files or read output files. You pass in Python objects and get Python objects back, much as with ordinary Python functions.

4. **Database subpackage:** searches and downloads data from biological databases, including arguably the most important ones: the RCSB PDB and the NCBI Entrez database.


In [11]:
# Import Biotite PDB API
import biotite.database.rcsb as rcsb

**tempfile:** a module in the Python standard library for creating temporary files and directories. Here it is used only for `gettempdir()`, which returns the path of the system's temporary directory, a convenient place to store downloaded files.

In [45]:
import tempfile
from tempfile import gettempdir
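
A minimal sketch of what these `tempfile` helpers do, using only the standard library. The file name `scratch.txt` is hypothetical. `gettempdir()` reports an existing directory, whereas `TemporaryDirectory()` creates a new one and deletes it again when the `with` block ends.

```python
import os
import tempfile

# gettempdir() only reports the system-wide temporary directory; it creates nothing
print(tempfile.gettempdir())

# TemporaryDirectory() creates a fresh directory and removes it when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    scratch_path = os.path.join(tmp_dir, "scratch.txt")  # hypothetical file name
    with open(scratch_path, "w") as handle:
        handle.write("temporary data")
    existed_inside = os.path.exists(scratch_path)

removed_afterwards = not os.path.exists(tmp_dir)
print(existed_inside, removed_afterwards)
```

Files written to `gettempdir()` are not cleaned up automatically, so downloads stored there persist until the system or the user removes them.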

### 1. Database Subpackage
The `biotite.database` subpackage provides interfaces to the RCSB PDB and the NCBI Entrez database.

Pass the PDB ID, the file format and the target directory to the `fetch()` function. It returns the path to the downloaded file, so the file can then be loaded with the other Biotite subpackages.

In [50]:
# Example: download protein structure of the miniprotein TC5b (PDB: 1L2Y) into a temporary directory
file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
print(file_path)

/tmp/1l2y.pdb
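
Because the result is an ordinary file path, the file can also be inspected with plain Python. The sketch below reads the experimental method and resolution from the header of a PDB-format file, assuming the usual `EXPDTA` and `REMARK   2 RESOLUTION.` record layout. The helper `read_header_metadata` is hypothetical, and the demonstration uses a hand-written two-line header rather than a real entry, so it runs without a download.

```python
import os
import tempfile


def read_header_metadata(pdb_path):
    """Return the experimental method and resolution text found in a PDB-format file."""
    metadata = {"method": None, "resolution": None}
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith("EXPDTA"):
                # The technique starts at column 11 of the record
                metadata["method"] = line[10:].strip()
            elif line.startswith("REMARK   2 RESOLUTION."):
                # Either "<value> ANGSTROMS." or "NOT APPLICABLE."
                metadata["resolution"] = line[len("REMARK   2 RESOLUTION."):].strip()
            elif line.startswith(("ATOM", "HETATM")):
                break  # the header is over once the coordinates start
    return metadata


# Illustrative header written by hand (not a real PDB entry)
demo_path = os.path.join(tempfile.gettempdir(), "demo_header.pdb")
with open(demo_path, "w") as handle:
    handle.write("EXPDTA    X-RAY DIFFRACTION\n")
    handle.write("REMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n")

demo_metadata = read_header_metadata(demo_path)
print(demo_metadata)
os.remove(demo_path)  # clean up the demonstration file
```

With a real download, the same helper would be called on the path returned by `fetch()`, for example `read_header_metadata(file_path)`.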


To download multiple files, pass a list of PDB IDs; `fetch()` then returns a list of file paths.

[mmCIF](https://www.ebi.ac.uk/pdbe/docs/documentation/mmcif.html) is a flexible and extensible tag-value format for representing macromolecular structural data.

In [51]:
# Download files in the more modern mmCIF format
file_paths = rcsb.fetch(["1l2y", "1aki"], "cif", gettempdir())
print(file_paths)

['/tmp/1l2y.cif', '/tmp/1aki.cif']


By default, `fetch()` checks whether the file already exists in the target directory and downloads it only if it does not. To download the file regardless, set `overwrite=True`.

In [52]:
# Download file in the fast and small binary MMTF format
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir(), overwrite=True)
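
The download-only-if-missing behaviour described above can be sketched in a few lines. This is an illustration of the idea under stated assumptions, not Biotite's actual implementation: `fetch_if_missing` and `fake_download` are hypothetical names, and the fake download writes placeholder text instead of contacting the PDB.

```python
import os
import tempfile


def fetch_if_missing(file_path, download, overwrite=False):
    """Call download(file_path) only if the file is absent or overwrite is True."""
    if overwrite or not os.path.exists(file_path):
        download(file_path)
        return True   # a download happened
    return False      # the existing file was reused


def fake_download(file_path):
    # Hypothetical stand-in for a real network download
    with open(file_path, "w") as handle:
        handle.write("placeholder structure data")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "example.cif")
    first = fetch_if_missing(target, fake_download)                   # file missing
    second = fetch_if_missing(target, fake_download)                  # file present
    forced = fetch_if_missing(target, fake_download, overwrite=True)  # forced refresh

print(first, second, forced)
```

Skipping existing files saves time and bandwidth on repeated runs, while `overwrite=True` is useful when an entry may have been updated since the last download.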