# Data Collection 2: DrugBank.com

##### Disclaimer
Use of DrugBank.com data is for research purposes only.

## Overview
This notebook contains the code used to scrape drug information from the [DrugBank.com](https://www.drugbank.com/) website.

To build my own dataset, I looked up each drug from the Yellow Card reports on DrugBank.com and collected additional information about it.

As before, I first import the necessary libraries.

In [1]:
# import necessary packages
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
from time import sleep
import random
import pandas as pd
from tqdm import tqdm
import string

## Obtain list of drugs
I first load the `yellow_card_links.csv` file created from the previous notebook `1a_yellow_card` to obtain the list of drugs in the Yellow Card reports.

I then create a new dataframe, with just the drug names and Yellow Card IDs.

In [41]:
# import yellow card links csv to get drug name list
yc = pd.read_csv('yellow_card_links.csv', dtype = object)

# create dataframe with an empty drug column
drugbank = pd.DataFrame(columns = ['drug'])
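# convert drug names to lowercase and remove literal [ ] brackets
# (regex=False so that '[' and ']' are never interpreted as regex syntax)
drugbank['drug'] = yc['drug_name'].str.lower().str.replace('[', '', regex=False).str.replace(']', '', regex=False)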
# copy over the yc_id column
drugbank['yc_id'] = yc['yc_id']
drugbank.head()

|   | drug | yc_id |
|---|------|-------|
| 0 | abacavir | 40046536 |
| 1 | abatacept | 561378321 |
| 2 | abciximab | 231911819 |
| 3 | abemaciclib | 369408139 |
| 4 | abiraterone | 968368347 |
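
As a small illustration of the name clean-up step, the sketch below applies the same lowercase-and-strip-brackets rule to plain strings. The helper name `normalise_drug_name` and the two sample inputs are hypothetical and only serve to show the effect; the notebook itself does this with pandas string methods.

```python
def normalise_drug_name(name):
    """Lowercase a drug name and drop literal square brackets."""
    return name.lower().replace('[', '').replace(']', '')


# hypothetical raw names, chosen only to show the two transformations
examples = ['Abacavir', '[Abatacept]']
print([normalise_drug_name(n) for n in examples])
```

Treating the brackets as literal characters matters because `[` and `]` have a special meaning in regular expressions.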


## Use Selenium to scrape DrugBank.com
### Set Up Selenium
Let's first set up the necessary settings for Selenium Web Driver:

In [3]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# set up webdriver for Selenium
options = webdriver.ChromeOptions()
options.add_argument('--enable-javascript')
driver = webdriver.Chrome(executable_path='/Applications/chromedriver', options=options)

# go to drugbank website
driver.get('https://go.drugbank.com/drugs/DB00316')

### Obtain DrugBank IDs
Now let's start scraping. To limit the impact of errors, this was done in chunks of up to 500 drugs at a time. First, I obtain the unique DrugBank ID for each drug.

In [30]:
# get list of drugs to loop through
drugs = drugbank['drug']

# set up an empty list to save scraped information
db_ids = []

# loop through each drug
for drug in tqdm(drugs):
    
    try:
        # click on search box
        driver.find_element_by_xpath('/html/body/header/nav[2]/div[1]/form/div[2]').click()
        # enter search keyword
        searchbox = driver.find_element_by_xpath('//*[@id="query"]')
        # clear search box
        searchbox.clear()
        # enter search word
        searchbox.send_keys(drug)
        # click enter
        searchbox.send_keys(Keys.ENTER)
        
        # wait 5 seconds
        sleep(5)
        
        try:
            # method 1 - get DrugBank number using xpath
            num = driver.find_element_by_xpath('/html/body/main/div/div/div[2]/div[2]/dl[1]/dd[2]').text
            db_ids.append(num)
        
        except Exception:
            # method 2 - get DrugBank number using a CSS selector
            # (a class-name lookup cannot match several space-separated classes)
            num = driver.find_element_by_css_selector('.col-xl-4.col-md-9.col-sm-8').text
            db_ids.append(num)
            
        # sleep for a few seconds
        sleep(random.randint(5, 8))
    
    
    except Exception:
        # if not successfully scraped, add a null value
        db_ids.append(np.nan)

100%|██████████| 421/421 [1:21:27<00:00, 11.61s/it]
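
The chunked approach described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code that was actually run: `chunked`, `scrape_in_chunks`, and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for the Selenium search so that the sketch runs without a browser.

```python
import math


def chunked(items, size=500):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def scrape_in_chunks(drugs, lookup, size=500):
    """Call `lookup` on every drug, chunk by chunk, recording NaN on failure."""
    results = []
    for chunk in chunked(drugs, size):
        for drug in chunk:
            try:
                results.append(lookup(drug))
            except Exception:
                # keep one entry per drug so the list stays aligned with the input
                results.append(math.nan)
        # a real run could save `results` to disk here before the next chunk
    return results


def fake_lookup(drug):
    """Hypothetical stand-in for the browser search; fails for one name."""
    if drug == 'unknown herb':
        raise LookupError(drug)
    return f'ID-{drug}'


ids = scrape_in_chunks(['abacavir', 'unknown herb', 'abatacept'], fake_lookup, size=2)
print(ids)
```

Appending a placeholder on failure keeps the result list the same length as the list of drugs, which is what allows it to be attached to the dataframe as a column afterwards.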


In [42]:
# check length of list
len(db_ids)

2339

In [45]:
# add to drugbank data frame
drugbank['db_id'] = db_ids
# show first 5 rows
drugbank.head()

|   | drug | yc_id | db_id |
|---|------|-------|-------|
| 0 | abacavir | 40046536 | DB01048 |
| 1 | abatacept | 561378321 | DB01281 |
| 2 | abciximab | 231911819 | DB00054 |
| 3 | abemaciclib | 369408139 | DB12001 |
| 4 | abiraterone | 968368347 | DB05812 |


In [46]:
# brief overview
drugbank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drug    2339 non-null   object
 1   yc_id   2339 non-null   object
 2   db_id   1981 non-null   object
dtypes: object(3)
memory usage: 54.9+ KB


Entries without a `db_id` are reports involving plant extracts, experimental compounds, and other products for which no clinical data is available. As I am focusing only on medicinal drugs, these are excluded from the capstone.

### Obtain information on drug targets, pathways, and categories
Next, I use `pd.read_html` to read all the tables on each drug's page.
Since the number of tables differs from drug to drug, I loop through each dataframe and check its column names.
Due to time constraints, only the drug targets (`targets`), pathways (`pathways`), and drug categories (`drug_cat`) are scraped.

> Note: `pathways` and `drug_cat` were not used in the end, as fewer than a third of the drugs had the information available.
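
Each drug page is addressed by appending its DrugBank ID to a base URL. Because a missing ID is stored as `np.nan`, and `f'{np.nan}'` formats as `nan`, rows without an ID would produce a URL ending in `/drugs/nan`. The sketch below shows one possible guard; `drug_url` and `BASE_URL` are hypothetical names, and returning `None` for a missing ID is an assumption about how a caller might choose to skip the request.

```python
import math

BASE_URL = 'https://go.drugbank.com/drugs/'


def drug_url(db_id):
    """Return the page URL for a DrugBank ID, or None when the ID is missing."""
    missing = db_id is None or (isinstance(db_id, float) and math.isnan(db_id))
    if missing:
        return None
    return f'{BASE_URL}{db_id}'


print(drug_url('DB01048'))
print(drug_url(float('nan')))
```

A caller could then append a null value for that drug directly instead of sending a request that cannot succeed.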

In [43]:
# set up HTTP headers for the requests calls below
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
encoding = "gzip, deflate, br"
lang = "en-GB,en;q=0.9"
headers = {'accept': accept, 'accept-encoding': encoding,
           'accept-language': lang, 'user-agent': user_agent}

In [100]:
# create empty lists to store scraped information
targets = []
pathways = []
drug_cats = []

# loop through each DrugBank ID scraped above
for db_id in tqdm(drugbank['db_id']):

    # add the drugbank id to the url to access each drug's page
    url = f'https://go.drugbank.com/drugs/{db_id}'
    r = requests.get(url, headers=headers)
    
    # get ahfs code (drug category)
    try:
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all("ul", class_ = "list-unstyled table-list")
        drug_cats.append(results[3].text)
    except:
        drug_cats.append(np.nan)
      
    
    # read all the tables on the webpage
    try:    
        # get dataframes
        dfs = pd.read_html(r.text)
    
    # error message that there was no table for the specific url
    except:
        print('no tables')
        targets.append(np.nan)
        pathways.append(np.nan)
    
    else:
        target = 0
        pathway = 0
        
        # loop through each dataframe stored in dfs
        for df in dfs:
            
            # extract targets
            if df.columns[0] == 'Target':
                target = 1
                targets.append([x[1:] for x in df['Target'].values])

            # extract pathways
            elif df.columns[0] == 'Pathway':
                pathway = 1
                pathways.append(list(df['Pathway'].values))

            else:
                pass
            
        # once all the dfs are looped through for each drug,
        # if no target table is found, append nan value
        if target == 0:
            targets.append(np.nan)
            
        # same for pathway table
        if pathway == 0:
            pathways.append(np.nan)
    
        # sleep
        sleep(random.randint(3, 6))

  1%|          | 28/2340 [02:27<2:39:18,  4.13s/it]

no tables


100%|██████████| 2340/2340 [2:49:31<00:00,  4.35s/it]


In [104]:
# check lengths of lists storing scraped information
len(targets), len(pathways), len(drug_cats)

(2340, 2340, 2340)
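
The equal lengths above depend on every drug contributing exactly one entry to each list. One way to make that property explicit is to return a single value per page, as in the sketch below. It is an illustration under stated assumptions: `first_matching_column` is a hypothetical helper, and the tables are modelled as plain dicts (column name to values) so that the example runs without pandas. With real dataframes, `list(df)` and `df[name]` play the same roles.

```python
import math


def first_matching_column(tables, name):
    """Return the values of the first table whose first column is `name`, else NaN."""
    for table in tables:
        columns = list(table)  # column names, in order
        if columns and columns[0] == name:
            return list(table[name])
    return math.nan


# hypothetical page content: one pathway table and one target table
page_tables = [
    {'Pathway': ['Example Action Pathway'], 'Category': ['Drug action']},
    {'Target': ['Example protein A', 'Example protein B'], 'Actions': ['inhibitor', 'binder']},
]

print(first_matching_column(page_tables, 'Target'))
print(first_matching_column(page_tables, 'Pathway'))
print(first_matching_column([], 'Target'))
```

Because the helper returns once per call, a page with several matching tables still yields a single list entry for that drug.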

## Create DrugBank dataframe
Now we create a dataframe from the information scraped from the website with the code above.

In [106]:
# create dataframe from scraped info
drugbank['target'] = targets
drugbank['pathways'] = pathways
drugbank['drug_cat'] = drug_cats
drugbank['db_id'] = db_ids
drugbank.head()

|   | drug | yc_id | db_id | target | pathways | drug_cat |
|---|------|-------|-------|--------|----------|----------|
| 0 | abacavir | 40046536 | DB01048 | [Reverse transcriptase/RNaseH, HLA class I his... | [Abacavir Action Pathway] | 08:18.08.20 — Nucleoside and Nucleotide Revers... |
| 1 | abatacept | 561378321 | DB01281 | [T-lymphocyte activation antigen CD80, T-lymph... |  | Bristol-Myers Squibb Co.\nCelltrion Inc.\nE.R.... |
| 2 | abciximab | 231911819 | DB00054 | [Integrin beta-3, Integrin alpha-IIb, Low affi... | [Abciximab Action Pathway] | 92:00.00 — Miscellaneous Therapeutic Agents |
| 3 | abemaciclib | 369408139 | DB12001 | [Cyclin-dependent kinase 4, Cyclin-dependent k... |  | 10:00.00 — Antineoplastic Agents |
| 4 | abiraterone | 968368347 | DB05812 | [Steroid 17-alpha-hydroxylase/17,20 lyase] |  | 10:00.00 — Antineoplastic Agents |


In [107]:
drugbank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   drug      2340 non-null   object
 1   yc_id     2340 non-null   object
 2   db_id     1987 non-null   object
 3   target    1565 non-null   object
 4   pathways  402 non-null    object
 5   drug_cat  1072 non-null   object
dtypes: object(6)
memory usage: 109.8+ KB


## Data Cleaning

### Initial Overview
Now we clean the dataframe that we have just created.

In [108]:
# check for null values
drugbank.isnull().sum()

drug           0
yc_id          0
db_id        353
target       775
pathways    1938
drug_cat    1268
dtype: int64

In [109]:
# check shape of dataframe
drugbank.shape

(2340, 6)

### Manually fill in gaps
Since there were not many drugs in the `not_found` list, I manually searched for their DrugBank IDs in case my Selenium code had missed them on the first attempt.

In [110]:
# list of drugs not found while scraping
not_found = list(drugbank[drugbank['db_id'].isnull()]['drug'].values)
print(len(not_found))
# show first 20 values
not_found[:20]

353


['adenophora',
 'agrimonia',
 'agropyron repens',
 'alathon polyethylene resin',
 'alchemilla vulgaris',
 'alexitol',
 'allium sativum',
 'alpha-1 proteinase inhibitor',
 'alphadolone acetate',
 'alphaxalone',
 'althaea',
 'ambucetamide',
 'ambutonium',
 'aminoxidase',
 'amiphenazole',
 'anamirta cocculus',
 'anethum graveolens',
 'anti-d (rho) immunoglobulin',
 'anti-inhibitor coagulant complex, dried',
 'antilymphocyte immunoglobulins']

In [89]:
# for instance, I manually added the DB ids
manual_ids = {
    'sulphasalazine': 'DB00795',
    'sulphamethoxypyridazine': 'DB13773',
    'sulphamethoxydiazine': 'DB00576',
    'sulphamethizole': 'DB12829',
    'sulphadimidine': 'DB01582',
    'sodium thiosulphate': 'DB09499',
    'sodium picosulphate': 'DB09268',
    'senna': 'DB11365',
    'thiacetazone': 'DB12829',
    'sacubitril/valsartan': 'DB09292',
    'polymyxin': 'DB00781',
    'lithium': 'DB14507',
}
# note: DB12829 is given to both sulphamethizole and thiacetazone here,
# so it shows up in the duplicate check below
for name, db_id in manual_ids.items():
    # boolean mask selects every row whose drug name matches
    drugbank.loc[drugbank['drug'] == name, 'db_id'] = db_id
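
Hand-typed IDs are easy to mistype. Every DrugBank ID in this notebook has the form `DB` followed by five digits, so a quick format check can catch slips before they reach the dataframe. The snippet below is only a sketch on made-up values; `manual_entries` and `bad` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# hypothetical hand-typed IDs; the third one is deliberately malformed
manual_entries = pd.Series(['DB00795', 'DB13773', 'db0576', None])

# keep only filled-in values, then test each against the DB + 5 digits pattern
filled = manual_entries.dropna()
is_valid = filled.str.match(r'^DB\d{5}$')

# IDs that fail the format check and need another look
bad = filled[~is_valid].tolist()
print(bad)
```

This only checks the shape of the ID. It cannot tell whether a well-formed ID belongs to the right drug, which is what the duplicate check in the next step helps with.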

Here I looked at DrugBank IDs that occurred more than once, and amended the errors.

In [91]:
# check for drugbank ids that appeared more than once
dup_ids = list(drugbank.db_id.value_counts()[drugbank.db_id.value_counts() > 1].index)
dup_ids

['DB00066',
 'DB01053',
 'DB00052',
 'DB09449',
 'DB00067',
 'DB00863',
 'DB04398',
 'DB00783',
 'DB00085',
 'DB12829',
 'DB12667',
 'DB09146']

In [101]:
# show first 10
drugbank[drugbank.db_id.isin(dup_ids)].sort_values('db_id').head(10)

Unnamed: 0,drug,yc_id,db_id,target,pathways,drug_cat
2001,somatropin,314288027,DB00052,"[Growth hormone receptor, Prolactin receptor]",,Genentech inc\nSerono laboratories inc\nCangen...
988,growth hormone,996119734,DB00052,"[Growth hormone receptor, Prolactin receptor]",,Genentech inc\nSerono laboratories inc\nCangen...
900,follitropin delta,58677892,DB00066,[Follicle-stimulating hormone receptor],,Organon usa inc\nEmd serono inc
899,follitropin beta,127709748,DB00066,[Follicle-stimulating hormone receptor],,Organon usa inc\nEmd serono inc
897,follicle stimulating hormone,322144909,DB00066,[Follicle-stimulating hormone receptor],,Organon usa inc\nEmd serono inc
898,follitropin alfa,1040768484,DB00066,[Follicle-stimulating hormone receptor],,Organon usa inc\nEmd serono inc
153,argipressin,538552928,DB00067,"[Vasopressin V2 receptor, Vasopressin V1a rece...",,Parke davis div warner lambert co
2270,vasopressin,121577179,DB00067,"[Vasopressin V2 receptor, Vasopressin V1a rece...",,Parke davis div warner lambert co
1593,pancrelipase,1001595052,DB00085,"[Dietary fat, Dietary protein, Dietary starch]",,56:16.00 — Digestants
1592,pancreatin,240328336,DB00085,"[Dietary fat, Dietary protein, Dietary starch]",,56:16.00 — Digestants
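
An alternative way to inspect shared IDs is to group the drug names under each duplicated ID, which makes it easier to tell genuine synonyms (such as vasopressin and argipressin above) from scraping errors. This is a sketch on a toy frame, not the notebook's data; `toy`, `dups` and `shared` are illustrative names.

```python
import pandas as pd

# toy frame standing in for `drugbank` (values are illustrative only)
toy = pd.DataFrame({
    'drug': ['vasopressin', 'argipressin', 'abacavir', 'unknown herb'],
    'db_id': ['DB00067', 'DB00067', 'DB01048', None],
})

# exclude missing IDs first so rows without an ID are not treated as duplicates
has_id = toy['db_id'].notna()

# keep=False marks every member of a duplicated group, not just the repeats
dups = toy[has_id & toy.duplicated(subset='db_id', keep=False)]

# one entry per shared ID, listing the drug names that map to it
shared = dups.groupby('db_id')['drug'].apply(list)
print(shared)
```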


In [78]:
# more manual corrections
drugbank.loc[379, 'db_id'] = 'DB01416'
drugbank.loc[371, 'db_id'] = 'DB01140'
drugbank.loc[372, 'db_id'] = 'DB00567'
drugbank.loc[373, 'db_id'] = 'DB00535'
drugbank.loc[374, 'db_id'] = 'DB01413'
drugbank.loc[369, 'db_id'] = np.nan
drugbank.loc[376, 'db_id'] = 'DB00493'
drugbank.loc[383, 'db_id'] = 'DB01415'
drugbank.loc[375, 'db_id'] = 'DB00671'
drugbank.loc[382, 'db_id'] = 'DB00438'
drugbank.loc[381, 'db_id'] = 'DB06590'
drugbank.loc[380, 'db_id'] = 'DB13499'
drugbank.loc[389, 'db_id'] = 'DB00482'
drugbank.loc[378, 'db_id'] = 'DB13682'
drugbank.loc[377, 'db_id'] = 'DB01331'
drugbank.loc[364, 'db_id'] = 'DB10516'
drugbank.loc[388, 'db_id'] = 'DB01112'
drugbank.loc[390, 'db_id'] = 'DB04846'
drugbank.loc[391, 'db_id'] = np.nan
drugbank.loc[392, 'db_id'] = 'DB14707'
drugbank.loc[363, 'db_id'] = 'DB01136'
drugbank.loc[394, 'db_id'] = 'DB09008'
drugbank.loc[395, 'db_id'] = 'DB00456'
drugbank.loc[396, 'db_id'] = 'DB01326'
drugbank.loc[398, 'db_id'] = 'DB09063'
drugbank.loc[387, 'db_id'] = 'DB01212'
drugbank.loc[399, 'db_id'] = 'DB00439'
drugbank.loc[400, 'db_id'] = 'DB13173'
drugbank.loc[401, 'db_id'] = 'DB08904'
drugbank.loc[362, 'db_id'] = np.nan
drugbank.loc[361, 'db_id'] = np.nan
drugbank.loc[360, 'db_id'] = np.nan
drugbank.loc[359, 'db_id'] = 'DB00521'
drugbank.loc[397, 'db_id'] = 'DB01333'
drugbank.loc[386, 'db_id'] = 'DB09050'
drugbank.loc[393, 'db_id'] = np.nan
drugbank.loc[384, 'db_id'] = 'DB01332'
drugbank.loc[365, 'db_id'] = 'DB00520'
drugbank.loc[366, 'db_id'] = np.nan
drugbank.loc[367, 'db_id'] = 'DB14298'
drugbank.loc[368, 'db_id'] = np.nan
drugbank.loc[385, 'db_id'] = 'DB04918'
drugbank.loc[370, 'db_id'] = 'DB00833'

In [95]:
# amend trometamol
# (row and column are passed together; chained indexing such as
#  drugbank.loc[2231]['db_id'] = ... may write to a temporary copy)
drugbank.loc[2231, 'db_id'] = 'DB03754'
# .at stores the whole list in a single cell of the object column
drugbank.at[2231, 'target'] = ['Amyloid beta A4 protein']
drugbank.loc[2231, 'pathways'] = np.nan
drugbank.loc[2231, 'drug_cat'] = '40:08.00 — Alkalinizing Agents'

# amend gallium
drugbank.loc[934, 'db_id'] = 'DB15494'
drugbank.at[934, 'target'] = ['Somatostatin receptor type 2', 'Somatostatin receptor type 5',
                              'Somatostatin receptor type 3', 'Somatostatin receptor type 1']
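
Writing a list into a single dataframe cell is easy to get wrong in pandas. The sketch below uses a toy frame with placeholder values and assumes the column has `object` dtype, as the `target` column does here. It shows scalar assignment with `.loc[row, column]` and list assignment with `.at[row, column]`.

```python
import numpy as np
import pandas as pd

# toy frame; object-dtype columns can hold Python lists
toy = pd.DataFrame({
    'drug': ['trometamol', 'gallium'],
    'db_id': [np.nan, np.nan],
    'target': [np.nan, np.nan],
}, dtype=object)

# scalar value: pass row and column labels in one .loc call
toy.loc[0, 'db_id'] = 'DB03754'

# list value: .at addresses exactly one cell, so the list is stored whole
toy.at[0, 'target'] = ['Amyloid beta A4 protein']
toy.at[1, 'target'] = ['Somatostatin receptor type 2',
                       'Somatostatin receptor type 5']

print(toy)
```

Passing the row and column in a single indexer means the write goes to the dataframe itself rather than to an intermediate row object.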

I also noticed that information on iodine was missing and manually added that in.

In [98]:
# check if iodine is in the dataframe
drugbank[drugbank.drug.str.contains('iodine')]

Unnamed: 0,drug,yc_id,db_id,target,pathways,drug_cat
307,cadexomer-iodine,743544955,,,,
1114,iodobenzylguanidine m- (iodine-131),501603764,,,,
1765,povidone-iodine,850028924,DB06812,,,


In [99]:
# add in information for iodine

# create variable t for iodine's targets (note: t is not passed to the row appended below)
t = ['Arachidonate 5-lipoxygenase',
'Prostaglandin G/H synthase 2',
'Prostaglandin G/H synthase 1',
'Peroxisome proliferator-activated receptor gamma',
'Inhibitor of nuclear factor kappa-B kinase subunit',
'Inhibitor of nuclear factor kappa-B kinase subunit',
'Cystine/glutamate transporter',
'Acetyl-CoA acetyltransferase, mitochondrial',
'Thromboxane-A synthase',
'Phospholipase A2']

# append iodine row and save to dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
iodine_row = pd.DataFrame([
    {'drug': 'iodine',
     'yc_id': '000743544955',
     'target': ['Sodium/iodide cotransporter', 'Microbial proteins'],
     'drug_cat': ['92:01.00 — Herbs and Natural Products', '88:29.00 — Minerals', '84:04.92 — Miscellaneous Local Anti-infectives'],
     'db_id': 'DB05382'}])
drugbank = pd.concat([drugbank, iodine_row], ignore_index = True)

After making the manual corrections above, I re-ran the web-scraping code in part 1.3.3 and moved on to the next step.

### Data cleaning
#### Pathways column
I dropped the pathways column, as over 80% of the drugs (1938 of 2340) had no pathway information.

In [111]:
# drop pathways as not enough data
drugbank.drop(columns = ['pathways'], inplace = True)
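
The share of missing values per column can be read off directly rather than worked out from the null counts. Below is a minimal sketch on a toy frame; the 80% cut-off mirrors the reasoning above, while `toy`, `threshold` and `too_sparse` are illustrative names rather than part of the original notebook.

```python
import pandas as pd

# toy frame: `pathways` is missing in 5 of 6 rows, `db_id` in 1 of 6
toy = pd.DataFrame({
    'db_id': ['DB01048', 'DB01281', None, 'DB12001', 'DB05812', 'DB00054'],
    'pathways': [['Abacavir Action Pathway'], None, None, None, None, None],
})

# the mean of a boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean()

# hypothetical cut-off: flag columns with more than 80% missing
threshold = 0.8
too_sparse = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(too_sparse)
```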

#### Drug target column
I filled null values with a string of `'nan'`, so that it will be a category on its own when I dummify this column at a later stage.

In [118]:
# fill missing targets in one .loc call (avoids chained assignment)
drugbank.loc[drugbank['target'].isnull(), 'target'] = 'nan'

#### Drug category column
I filled null values with a string of `'nan'`, so that it will be a category on its own when I dummify this column at a later stage.

In [112]:
# fill missing categories in one .loc call (avoids chained assignment)
drugbank.loc[drugbank['drug_cat'].isnull(), 'drug_cat'] = 'nan'

This column also needs some cleaning, as each list of categories is saved as one long string with the delimiter `'\n'` included, for instance:

In [113]:
drugbank.loc[22, 'drug_cat']

'84:04.06 — Antivirals\n08:18.32 — Nucleosides and Nucleotides'

In [114]:
# split on \n
drugbank['drug_cat'] = drugbank['drug_cat'].apply(lambda x: x.split('\n'))
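
Once each cell holds a list, the column can later be turned into dummy variables. The notebook does not show that step here, so the following is only one possible sketch, assuming `explode` followed by `get_dummies`. The toy category labels are simplified placeholders.

```python
import pandas as pd

# toy frame with list-valued categories; ['nan'] stands for a missing entry
toy = pd.DataFrame({
    'drug': ['drug a', 'drug b', 'drug c'],
    'drug_cat': [['84:04.06 Antivirals', '08:18.32 Nucleosides and Nucleotides'],
                 ['10:00.00 Antineoplastic Agents'],
                 ['nan']],
})

# explode gives one row per (drug, category) pair, repeating the original index
exploded = toy['drug_cat'].explode()

# one-hot encode, then collapse back to one row per original index
dummies = pd.get_dummies(exploded).groupby(level=0).max()

print(dummies)
```

Because missing values were filled with the string `'nan'`, they end up as their own dummy column instead of being dropped.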

#### Drop rows without db_id
I then decided to retain only rows that have a DrugBank ID. In the run shown below, this leaves 1987 rows out of the initial 2340.

In [120]:
drugbank = drugbank[~drugbank['db_id'].isnull()].copy()
drugbank.reset_index(drop = True, inplace = True)
drugbank.shape

(1987, 5)

In [123]:
drugbank.head()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
0,abacavir,40046536,DB01048,"[Reverse transcriptase/RNaseH, HLA class I his...",[08:18.08.20 — Nucleoside and Nucleotide Rever...
1,abatacept,561378321,DB01281,"[T-lymphocyte activation antigen CD80, T-lymph...","[Bristol-Myers Squibb Co., Celltrion Inc., E.R..."
2,abciximab,231911819,DB00054,"[Integrin beta-3, Integrin alpha-IIb, Low affi...",[92:00.00 — Miscellaneous Therapeutic Agents]
3,abemaciclib,369408139,DB12001,"[Cyclin-dependent kinase 4, Cyclin-dependent k...",[10:00.00 — Antineoplastic Agents]
4,abiraterone,968368347,DB05812,"[Steroid 17-alpha-hydroxylase/17,20 lyase]",[10:00.00 — Antineoplastic Agents]


In [121]:
drugbank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987 entries, 0 to 1986
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   drug      1987 non-null   object
 1   yc_id     1987 non-null   object
 2   db_id     1987 non-null   object
 3   target    1987 non-null   object
 4   drug_cat  1987 non-null   object
dtypes: object(5)
memory usage: 77.7+ KB


In [122]:
## save dataframe
# drugbank.to_csv('drugbank_final.csv', index = False, header = True)

# Next step:
Now let's move on to the last notebook on data collection: `1c_reviews`