# DrugBank - Fetch Raw Data

**Date:** 26/01/23

**Done by:** Gustavo H. M. Sousa

This notebook describes how we gather FDA-approved small-molecule drugs from [DrugBank](https://go.drugbank.com/).

**Basic description:** DrugBank is a richly annotated resource that combines detailed drug data with comprehensive drug target and drug action information. Since its first release in 2006, DrugBank has been widely used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. 

After we emailed the DrugBank collaborators, they sent us a key to access and download the raw data. For now, the goal is to fetch all FDA-approved small-molecule drugs.

In [1]:
import pandas as pd

In [2]:
# Loading small molecules list:
small_molecules_list = pd.read_csv('data/MANUAL/MANUAL_drug_bank_small_molecules.csv')

# Loading approved drugs
approved_drugs = pd.read_csv('data/MANUAL/MANUAL_structures_approved_drug_bank.csv')

In [3]:
print(f'Small molecules list variables and size: {small_molecules_list.columns}, {small_molecules_list.shape[0]}\n')
print(f'Approved drugs variables: {approved_drugs.columns} and size: {approved_drugs.shape[0]}')

Small molecules list variables and size: Index(['DrugBank ID', 'Name', 'Drug Type'], dtype='object'), 11912

Approved drugs variables: Index(['DrugBank ID', 'Name', 'CAS Number', 'Drug Groups', 'InChIKey', 'InChI',
       'SMILES', 'Formula', 'KEGG Compound ID', 'KEGG Drug ID',
       'PubChem Compound ID', 'PubChem Substance ID', 'ChEBI ID', 'ChEMBL ID',
       'HET ID', 'ChemSpider ID', 'BindingDB ID'],
      dtype='object') and size: 2715


Most of these columns are not needed here, so we drop the following ones to make the table more readable:

['CAS Number', 'InChIKey', 'InChI', 'Formula', 'KEGG Compound ID', 'KEGG Drug ID', 'PubChem Compound ID', 'PubChem Substance ID', 'ChEBI ID', 'HET ID', 'ChemSpider ID', 'BindingDB ID']

In [4]:
approved_drugs.drop(['CAS Number', 'InChIKey', 'InChI', 'Formula', 'KEGG Compound ID', 'KEGG Drug ID', 'PubChem Compound ID', 'PubChem Substance ID', 'ChEBI ID', 'HET ID', 'ChemSpider ID', 'BindingDB ID'], axis=1, inplace=True)

approved_drugs

Unnamed: 0,DrugBank ID,Name,Drug Groups,SMILES,ChEMBL ID
0,DB00006,Bivalirudin,approved; investigational,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,CHEMBL2103749
1,DB00007,Leuprolide,approved; investigational,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,CHEMBL1201199
2,DB00014,Goserelin,approved,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,CHEMBL1201247
3,DB00027,Gramicidin D,approved,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,CHEMBL557217
4,DB00035,Desmopressin,approved,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,CHEMBL1429
...,...,...,...,...,...
2710,DB16627,Melphalan flufenamide,approved; withdrawn,CCOC(=O)[C@H](CC1=CC=C(F)C=C1)NC(=O)[C@@H](N)C...,CHEMBL4303060
2711,DB16628,Fosdenopterin,approved,[H][C@@]12COP(O)(=O)O[C@]1([H])C(O)(O)[C@]1([H...,CHEMBL2338675
2712,DB16629,Serdexmethylphenidate,approved,[H][C@@]1(CCCCN1C(=O)OC[N+]1=CC=CC(=C1)C(=O)N[...,
2713,DB16703,Belumosudil,approved; investigational,CC(C)NC(=O)COC1=CC=CC(=C1)C1=NC2=C(C=CC=C2)C(N...,CHEMBL2005186
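
As an alternative to the `drop` call above, the same table can be obtained by listing the columns to keep, which may be easier to read when only a few columns survive. The sketch below is only an illustration: `toy_approved`, `columns_to_keep` and `toy_trimmed` are names introduced here, and the toy rows use placeholder SMILES strings rather than the real DrugBank structures.

```python
import pandas as pd

# Toy stand-in for the approved-drugs table (illustrative rows only;
# the SMILES values are placeholders, not the real structures).
toy_approved = pd.DataFrame({
    'DrugBank ID': ['DB00006', 'DB00014'],
    'Name': ['Bivalirudin', 'Goserelin'],
    'CAS Number': [None, None],
    'Drug Groups': ['approved; investigational', 'approved'],
    'SMILES': ['PLACEHOLDER_1', 'PLACEHOLDER_2'],
    'ChEMBL ID': ['CHEMBL2103749', 'CHEMBL1201247'],
})

# Select the columns we want instead of dropping the ones we do not want.
columns_to_keep = ['DrugBank ID', 'Name', 'Drug Groups', 'SMILES', 'ChEMBL ID']

# .copy() gives an independent DataFrame, so later edits do not touch the original.
toy_trimmed = toy_approved[columns_to_keep].copy()

print(list(toy_trimmed.columns))
```

Unlike `drop(..., inplace=True)`, this leaves the original table untouched, so the cell can be re-run without raising an error about missing columns.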


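Now we can join the approved drugs with the small-molecule list on `DrugBank ID`, which adds the `Drug Type` of each drug. Note that a left join keeps every row of `approved_drugs`; it does not by itself restrict the table to small molecules: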

In [10]:
approved_small_molecules = pd.merge(approved_drugs, small_molecules_list[['DrugBank ID','Drug Type']], how='left', on='DrugBank ID')

print(approved_small_molecules.shape[0])
print(approved_small_molecules['Drug Groups'].unique())

approved_small_molecules

2715
['approved; investigational' 'approved'
 'approved; investigational; vet_approved'
 'approved; investigational; withdrawn'
 'approved; investigational; nutraceutical' 'approved; nutraceutical'
 'approved; investigational; nutraceutical; vet_approved'
 'approved; nutraceutical; vet_approved'
 'approved; nutraceutical; withdrawn' 'approved; illicit; investigational'
 'approved; illicit; withdrawn' 'approved; illicit'
 'approved; experimental' 'approved; vet_approved' 'approved; withdrawn'
 'approved; vet_approved; withdrawn'
 'approved; illicit; investigational; withdrawn'
 'approved; illicit; vet_approved'
 'approved; illicit; investigational; vet_approved'
 'approved; investigational; vet_approved; withdrawn'
 'approved; experimental; investigational; withdrawn'
 'approved; experimental; investigational'
 'approved; experimental; vet_approved']


Unnamed: 0,DrugBank ID,Name,Drug Groups,SMILES,ChEMBL ID,Drug Type
0,DB00006,Bivalirudin,approved; investigational,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,CHEMBL2103749,SmallMoleculeDrug
1,DB00007,Leuprolide,approved; investigational,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,CHEMBL1201199,SmallMoleculeDrug
2,DB00014,Goserelin,approved,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,CHEMBL1201247,SmallMoleculeDrug
3,DB00027,Gramicidin D,approved,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,CHEMBL557217,SmallMoleculeDrug
4,DB00035,Desmopressin,approved,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,CHEMBL1429,SmallMoleculeDrug
...,...,...,...,...,...,...
2710,DB16627,Melphalan flufenamide,approved; withdrawn,CCOC(=O)[C@H](CC1=CC=C(F)C=C1)NC(=O)[C@@H](N)C...,CHEMBL4303060,SmallMoleculeDrug
2711,DB16628,Fosdenopterin,approved,[H][C@@]12COP(O)(=O)O[C@]1([H])C(O)(O)[C@]1([H...,CHEMBL2338675,SmallMoleculeDrug
2712,DB16629,Serdexmethylphenidate,approved,[H][C@@]1(CCCCN1C(=O)OC[N+]1=CC=CC(=C1)C(=O)N[...,,SmallMoleculeDrug
2713,DB16703,Belumosudil,approved; investigational,CC(C)NC(=O)COC1=CC=CC(=C1)C1=NC2=C(C=CC=C2)C(N...,CHEMBL2005186,SmallMoleculeDrug
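
Since the merge above uses `how='left'`, any approved drug missing from the small-molecule list would still be kept, with an empty `Drug Type`. A minimal sketch of how this could be checked is shown below, assuming toy data: the ID `DB99999` and the name `Hypothetical drug` are made up for illustration, and `toy_approved`, `toy_small`, `merged`, `unmatched` and `only_small` are names introduced here. The `validate` and `indicator` arguments are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the two tables. DB99999 is a made-up ID that is
# deliberately absent from the small-molecule list.
toy_approved = pd.DataFrame({
    'DrugBank ID': ['DB00006', 'DB00014', 'DB99999'],
    'Name': ['Bivalirudin', 'Goserelin', 'Hypothetical drug'],
})
toy_small = pd.DataFrame({
    'DrugBank ID': ['DB00006', 'DB00014'],
    'Drug Type': ['SmallMoleculeDrug', 'SmallMoleculeDrug'],
})

merged = pd.merge(
    toy_approved,
    toy_small,
    how='left',
    on='DrugBank ID',
    validate='many_to_one',  # raises if an ID is duplicated in toy_small
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Rows found only in the approved table, i.e. without a Drug Type.
unmatched = merged[merged['_merge'] == 'left_only']
print(f'Approved drugs without a match: {unmatched.shape[0]}')

# Keep only rows explicitly labelled as small molecules.
only_small = merged[merged['Drug Type'] == 'SmallMoleculeDrug'].drop(columns='_merge')
print(f'Approved small molecules: {only_small.shape[0]}')
```

If `unmatched` is empty on the real data, the left join and an inner join give the same rows; otherwise the explicit filter on `Drug Type` is what restricts the table to small molecules.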


We remove every drug that was withdrawn, as well as every nutraceutical, so that our analysis is a little more robust.

In [20]:
# Removing withdrawn molecules
approved_small_molecules = approved_small_molecules[~approved_small_molecules['Drug Groups'].str.contains('withdrawn')].reset_index(drop=True)

# Removing nutraceuticals
approved_small_molecules = approved_small_molecules[~approved_small_molecules['Drug Groups'].str.contains('nutraceutical')].reset_index(drop=True)

print(f'We are left with {approved_small_molecules.shape[0]} structures')

We are left with 2482 structures
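
The filter above relies on substring matching over the `Drug Groups` text. A possible variant, sketched below on toy rows, splits the field on `;` and compares whole labels, which avoids accidental matches if a group name ever contains another one as a substring. This is an assumption-laden illustration, not the method used in this notebook: `toy`, `excluded_groups` and `has_excluded_group` are names introduced here, and `DB00000` is a made-up ID.

```python
import pandas as pd

# Toy rows mimicking the 'Drug Groups' format (DB00000 is a made-up ID).
toy = pd.DataFrame({
    'DrugBank ID': ['DB00006', 'DB00014', 'DB16627', 'DB00000'],
    'Drug Groups': [
        'approved; investigational',
        'approved',
        'approved; withdrawn',
        'approved; nutraceutical',
    ],
})

# Group labels that disqualify a drug from the final set.
excluded_groups = {'withdrawn', 'nutraceutical'}


def has_excluded_group(groups: str) -> bool:
    """Return True if any ';'-separated label is in excluded_groups."""
    tokens = {token.strip() for token in groups.split(';')}
    return not tokens.isdisjoint(excluded_groups)


# Boolean mask: True for rows that carry at least one excluded label.
mask = toy['Drug Groups'].apply(has_excluded_group)
filtered = toy[~mask].reset_index(drop=True)

print(f'We are left with {filtered.shape[0]} structures')
```

Both exclusions are applied in a single pass here, so only one `reset_index` is needed.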


In [21]:
approved_small_molecules

Unnamed: 0,DrugBank ID,Name,Drug Groups,SMILES,ChEMBL ID,Drug Type
0,DB00006,Bivalirudin,approved; investigational,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,CHEMBL2103749,SmallMoleculeDrug
1,DB00007,Leuprolide,approved; investigational,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,CHEMBL1201199,SmallMoleculeDrug
2,DB00014,Goserelin,approved,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,CHEMBL1201247,SmallMoleculeDrug
3,DB00027,Gramicidin D,approved,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,CHEMBL557217,SmallMoleculeDrug
4,DB00035,Desmopressin,approved,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,CHEMBL1429,SmallMoleculeDrug
...,...,...,...,...,...,...
2477,DB16625,Prasterone enantate,approved,[H][C@@]12CCC(=O)[C@@]1(C)CC[C@@]1([H])[C@@]2(...,,SmallMoleculeDrug
2478,DB16628,Fosdenopterin,approved,[H][C@@]12COP(O)(=O)O[C@]1([H])C(O)(O)[C@]1([H...,CHEMBL2338675,SmallMoleculeDrug
2479,DB16629,Serdexmethylphenidate,approved,[H][C@@]1(CCCCN1C(=O)OC[N+]1=CC=CC(=C1)C(=O)N[...,,SmallMoleculeDrug
2480,DB16703,Belumosudil,approved; investigational,CC(C)NC(=O)COC1=CC=CC(=C1)C1=NC2=C(C=CC=C2)C(N...,CHEMBL2005186,SmallMoleculeDrug
