# Data Collection 1: Yellow Card Scheme

##### Disclaimer
> The MHRA and CHM encourage the use of data from the Yellow Card Scheme in research and for publication, but wish to ensure that the limitations of interpretation of the data are made clear.
If you propose to publish information based on Yellow Card data or Interactive Drug Analysis Profiles, the MHRA is most willing to provide advice on how the Yellow Card information might be best used and presented. The MHRA is also willing to provide feedback on manuscripts prior to publication. Please write to the Director, Vigilance and Risk Management of Medicines Division by email.

## Overview
This notebook contains the code used to download side effect reports from the Yellow Card Scheme Interactive Drug Analysis Profiles (iDAP).

In [1]:
# import necessary packages
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
from time import sleep
import random
import pandas as pd
from tqdm import tqdm
import string

# Download files using Selenium

## Inspect website elements
In order to efficiently download the zip files from Yellow Card iDAP, let's inspect the website elements. This can be done by right-clicking on the item you wish to look at, and selecting the `Inspect` option.
1. Element from yellow card list of drugs:
`<a href="dap.html?drug=./UK_EXTERNAL/NONCOMBINED/UK_NON_000692998184.zip&amp;agency=MHRA" target="_blank">Zafirlukast</a>`

2. Element for downloading csv file:
`<a href="data/./UK_EXTERNAL/NONCOMBINED/UK_NON_000692998184.zip">click here</a>`

3. Link for downloading csv files:
`https://info.mhra.gov.uk/drug-analysis-profiles/data/./UK_EXTERNAL/NONCOMBINED/UK_NON_000692998184.zip`


Note that the number following `UK_NON_` is the same in all three places for a given drug. I call this number the drug's Yellow Card ID.
The download links for the drugs' zip files are also identical apart from this number, i.e. the Yellow Card ID.

The plan to automate this download process is to:
1. Obtain each drug's unique Yellow Card ID (the 12-digit number following `UK_NON_`)
2. Substitute each Yellow Card ID into the download link to fetch all the zip files.
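
The two steps can be sketched in plain Python as follows. This is a minimal illustration that assumes every link follows the format of the example elements above; the helper names `extract_yc_id` and `build_download_url` are hypothetical and are not used elsewhere in this notebook.

```python
import re

# Base of the download link, copied from example 3 above
BASE_URL = "https://info.mhra.gov.uk/drug-analysis-profiles/data/./UK_EXTERNAL/NONCOMBINED/"


def extract_yc_id(href):
    """Return the digits following UK_NON_ in a link, or None if there are none."""
    match = re.search(r"UK_NON_(\d+)\.zip", href)
    return match.group(1) if match else None


def build_download_url(yc_id):
    """Substitute a Yellow Card ID into the zip download link."""
    return f"{BASE_URL}UK_NON_{yc_id}.zip"


# href taken from example 1 above (with &amp; decoded to &)
href = "dap.html?drug=./UK_EXTERNAL/NONCOMBINED/UK_NON_000692998184.zip&agency=MHRA"
yc_id = extract_yc_id(href)
print(yc_id)
print(build_download_url(yc_id))
```

Keeping the ID as a string rather than an integer preserves any leading zeros, which the download link needs.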

## Set Up Selenium
Selenium is a great tool for scraping dynamic web content that BeautifulSoup cannot handle on its own. It drives a real browser, so it can type text, click buttons, and issue other mouse or keyboard commands.

Let's first import the libraries that we need and set up Selenium:
> Note:
> - This set-up is for Chrome web browser
> - Change `executable_path` to the absolute path of your downloaded `chromedriver` executable

In [219]:
# import necessary packages and functions
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# set up web driver for Selenium
options = webdriver.ChromeOptions()
options.add_argument('--enable-javascript')
# note: `executable_path` is the Selenium 3 style; Selenium 4 expects a `Service` object instead
driver = webdriver.Chrome(executable_path='/Applications/chromedriver', options=options)

True

## Click through each letter and sub-range
Before downloading the files, I need to reach each drug's link. To do that, I go through each letter from A to Z on the webpage, then for each letter (e.g. A) I click each sub-range (e.g. Aa-Ad), and finally locate each drug name (e.g. Abacavir).

To do this, I use the `WebDriver` set up earlier to visit the Yellow Card web page. On inspection, the content I need sits inside an `iframe` named `"e"`, so I use `WebDriverWait` to wait for that frame and switch into it before retrieving any information.

In [None]:
# go to the yellow card url
driver.get('https://yellowcard.mhra.gov.uk/iDAP/')

# wait until the iframe "e" is available before commencing web scrape
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it("e"))

Once I'm in the correct frame, I locate the elements I want the web driver to click.

In [None]:
try:
    # create empty lists to store drug names and respective links
    drug_names = []
    links = []
    
    # create a list of letters to mimic A to Z on the webpage
    alphabets = list(string.ascii_uppercase)
    
    # loop through each alphabet
    for i, alphabet in enumerate(alphabets):
        
        # sleep for a few seconds
        sleep(random.randint(3, 5))
        
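        # click onto each alphabet section
        # (find_element(By..., ...) replaces the find_element_by_* helpers, which newer Selenium releases no longer provide)
        driver.find_element(By.XPATH, f'//*[@id="top_index_level"]/div[{i + 1}]').click()
        
        # get range of subsets under each alphabet
        index = driver.find_element(By.XPATH, f'//*[@id="top_index_level_{alphabet}"]')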
        words = index.text
        # e.g. Aa-Ad will be captured by [x:x+5]
        ranges = [words[x : x + 5] for x in range(0, len(words), 5)]
        
        # for each subsection of each alphabet
        for j in tqdm(range(len(ranges))):
            
            # sleep for a few seconds
            sleep(random.randint(3, 5))
                
            # click onto each subsection
            driver.find_element(By.XPATH, f'//*[@id="top_index_level_{alphabet}"]/div[{j + 1}]').click()
            body = driver.find_element(By.TAG_NAME, "body")
            
            # the first 2 elements are not relevant, so we exclude them with [2:]
            tmp = body.text.split('\n')[2:]
            # add drug names to list
            drug_names.extend(tmp)
        
            
            # loop through each drug name
            for drug in tmp:
                # get href link for each drug
                link = driver.find_element(By.LINK_TEXT, drug).get_attribute('href')
                # add link to list of links
                links.append(link)
                
                # sleep
                sleep(random.randint(3, 5))
except Exception as e:
    # print the underlying error if the url or any of the above steps did not work
    print(f'error: {e}')
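
The fixed-width slice `words[x : x + 5]` in the loop above works only if the index text is an unbroken run of five-character labels such as `Aa-Ad`. A more tolerant alternative is sketched below. It assumes every label has the shape capital, lowercase, hyphen, capital, lowercase; the helper name `split_ranges` and the sample string are hypothetical, since the notebook does not show the real index text.

```python
import re


def split_ranges(index_text):
    """Return sub-range labels such as 'Aa-Ad' found anywhere in the index text."""
    return re.findall(r"[A-Z][a-z]-[A-Z][a-z]", index_text)


# Made-up sample mixing newline and space separators
sample = "Aa-Ad\nAe-Al Am-An"
print(split_ranges(sample))
```

Because the pattern ignores whatever sits between labels, it gives the same result whether the page separates them with newlines, spaces, or nothing at all.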

## Create a dataframe

Now that I have the information I need, let's put it into a DataFrame.

In [4]:
# create a dataframe of each drug and respective urls
yc = pd.DataFrame({'drug_name': drug_names, 'link': links})

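# from the urls, extract the yellow card ids (the digits between UK_NON_ and .zip)
# the dot is escaped so it matches a literal '.', and expand=False returns a Series
yc['yc_id'] = yc['link'].str.extract(r'UK_NON_(\d+)\.zip', expand=False)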
yc.head()

## save csv
# yc.to_csv('yellow_card_links.csv', header = True, index = False)

|   | drug_name | link | yc_id |
|---|---|---|---|
| 0 | Abacavir | https://info.mhra.gov.uk/drug-analysis-profile... | 40046536 |
| 1 | Abatacept | https://info.mhra.gov.uk/drug-analysis-profile... | 561378321 |
| 2 | Abciximab | https://info.mhra.gov.uk/drug-analysis-profile... | 231911819 |
| 3 | Abemaciclib | https://info.mhra.gov.uk/drug-analysis-profile... | 369408139 |
| 4 | Abiraterone | https://info.mhra.gov.uk/drug-analysis-profile... | 968368347 |


In [None]:
## code used to reload the saved csv of the yellow card dataframe
# yc = pd.read_csv('/Users/JocelynHo/Desktop/GA Capstone/yellow_card_links.csv', dtype = object)
# yc.head()
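
The example ID near the top of this notebook (`000692998184`) starts with zeros, and those are lost if the column is ever parsed as a number. The sketch below restores them. It assumes every Yellow Card ID is 12 digits long, which is inferred from that single example and not verified for all drugs; `normalise_yc_id` is a hypothetical helper.

```python
def normalise_yc_id(raw, width=12):
    """Return the ID as a zero-padded string of `width` digits (width is an assumption)."""
    text = str(raw).strip()
    if not text.isdigit():
        raise ValueError(f"not a numeric Yellow Card ID: {raw!r}")
    return text.zfill(width)


print(normalise_yc_id(692998184))       # an integer that lost its zeros
print(normalise_yc_id("000692998184"))  # already padded, returned unchanged
```

If it is needed, it could be applied to the whole column with `yc['yc_id'].map(normalise_yc_id)`.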

In [215]:
# make sure there are no duplicates
yc.duplicated().sum()

0

In [6]:
# overview of dataframe
yc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   drug_name  2339 non-null   object
 1   link       2339 non-null   object
 2   yc_id      2339 non-null   object
dtypes: object(3)
memory usage: 54.9+ KB


## Automate the download process
I now loop through each of the Yellow Card IDs extracted from the links, and download the zip file for each drug.

In [221]:
# loop through each id to download zip files from Yellow Card
for x in tqdm(yc['yc_id']):
    driver.get(f'https://info.mhra.gov.uk/drug-analysis-profiles/data/./UK_EXTERNAL/NONCOMBINED/UK_NON_{x}.zip')

100%|██████████| 2339/2339 [13:55<00:00,  2.80it/s]
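
Because the browser saves the files in the background, it can be worth confirming that every expected archive actually arrived before unzipping. The sketch below assumes the files keep the `UK_NON_<id>.zip` name from the download link; `missing_archives` is a hypothetical helper, and the demo runs in a temporary folder in place of the real downloads folder.

```python
import tempfile
from pathlib import Path


def missing_archives(yc_ids, download_dir):
    """Return the IDs whose UK_NON_<id>.zip file is not present in download_dir."""
    folder = Path(download_dir)
    return [yc_id for yc_id in yc_ids if not (folder / f"UK_NON_{yc_id}.zip").is_file()]


# Demo: a throwaway folder holding only one of the two expected archives
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "UK_NON_000692998184.zip").touch()
    print(missing_archives(["000692998184", "000000000001"], tmp))
```

Any IDs it returns could be passed through the download loop again.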


Then, looping through each zip file, I unzip the files to extract the respective `.csv` files.

In [223]:
# extract csvs from zip
import zipfile

for x in tqdm(yc['yc_id']):
    with zipfile.ZipFile(f'/Users/JocelynHo/Downloads/UK_NON_{x}.zip', 'r') as zip_ref:
        zip_ref.extractall('/Users/JocelynHo/Desktop/GA Capstone/yellow_card_raw_csvs')

100%|██████████| 2339/2339 [00:18<00:00, 126.68it/s]
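
The loop above stops at the first missing or corrupt archive. A more forgiving variant is sketched below: it records the failures and carries on. `extract_all` is a hypothetical helper, and the demo builds its own small archives in a temporary folder so that it does not depend on the real downloads.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_all(zip_paths, out_dir):
    """Extract each archive into out_dir and return the paths that could not be read."""
    failed = []
    for path in zip_paths:
        try:
            with zipfile.ZipFile(path, "r") as zip_ref:
                zip_ref.extractall(out_dir)
        except (FileNotFoundError, zipfile.BadZipFile):
            failed.append(Path(path))
    return failed


# Demo: one valid archive, one corrupt file, one path that does not exist
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    good = tmp / "UK_NON_good.zip"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")
    bad = tmp / "UK_NON_bad.zip"
    bad.write_bytes(b"not a zip file")
    failed = extract_all([good, bad, tmp / "UK_NON_absent.zip"], tmp / "csvs")
    print([p.name for p in failed])
```

Returning the failed paths makes it easy to re-download only those files.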


# Next Steps:
Now let's move on to the next Jupyter Notebook: `1b_drugbank`.