## Web Scraping Walkthrough - Curating Healthcare Professionals' Information with Python and Selenium
___
Link to Medium article: 

**Motivation**  
Analytics is not possible without data, and web scraping is one of the many tools out there for us to curate data. The concept of web scraping has always fascinated me, and I felt it would be fun to practice my coding chops while exploring the public information of registered healthcare professionals in Singapore.

I also noticed that most existing web scraping tutorials tend to be overly brief, so a more detailed walkthrough should be helpful.

The Ministry of Health (MOH) Professional Registration System website provides public access to the general information of healthcare professionals registered in Singapore. As with most search sites, the records in the system are not available for viewing in their entirety. That in fact creates a good exercise for web scraping practice.

There are different sites for the different categories of healthcare practitioners (e.g. doctors, pharmacists, nurses, dentists etc). Being a pharmacist myself, it was natural for me to first test things out on the Pharmacists dataset. 
___

**Objective**  
Curate a list of healthcare professionals registered in Singapore, along with their respective public professional information.  
___

**What is Web Scraping**  
So let's start from the definition of web scraping. Web scraping refers to the process of extracting content and data from a website. This information is collected and then exported into a format that is more useful for the user, typically in a spreadsheet. This process can certainly be done manually, but the amount of effort involved can make it impractical to do so (e.g. costly, slow, human error etc). This is where automated web scraping comes into play, delivering efficiency and accuracy in data extraction.  
___

**DISCLAIMER**  
This tutorial is purely for educational purposes. The information accessed is based on **publicly available content** provided on the MOH site. The content extracted has not been (and will not be) reproduced, republished, uploaded, posted, transmitted or otherwise distributed in any way. Please refer to the Terms of Service of the MOH website for more details - https://www.healthprofessionals.gov.sg/spc/terms-of-use
___

### Table of Contents
- [Section 1 - Setup](#section1) 
- [Section 2 - Initial Exploration](#section2)  
- [Section 3 - Running Main Script](#section3)  
- [Section 4 - References](#section4)  
 ___

<a name="section1"></a>
## Section 1 - Setup

Before starting any web scraping, let's make sure that the essential dependencies and drivers are correctly installed and imported. I will briefly explain the three key common components involved in web scraping.

**(1) BeautifulSoup**  

BeautifulSoup is a Python library that is used to extract data out of HTML and XML files. You will notice that many web scraping tutorials mention this library, so all scrapers-to-be will need to have an understanding of it. This Wiki page gives a good summary of BeautifulSoup - https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)

We will be utilizing BeautifulSoup4 (BS) just for the initial exploration section, purely to briefly demonstrate how it can be used. Here is the command to get the BS installation going.

```
pip install beautifulsoup4
```

**(2) Selenium Setup**  

Selenium is a powerful open-source tool for controlling web browsers and performing browser automation. Essentially, this means you can use it to automate repetitive actions on a website, such as clicking buttons, filling forms, extracting text, and scrolling the page. This host of Selenium functions allows for step-by-step interaction with a webpage and assessment of how the browser responds to each change. 

Selenium can be installed on pip with this command:

```
pip install selenium
```

More information on Selenium can be found here: https://selenium-python.readthedocs.io/

**(3) Chrome WebDriver**

The Selenium API (described above) requires the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. WebDriver provides a platform and language-neutral wire protocol as a way for out-of-process programs to remotely instruct the behavior of web browsers. For this project, I utilized the Chrome WebDriver since Google Chrome is one of the most popular browsers around.

- Do first check the Chrome version you are currently running, by visiting https://www.whatismybrowser.com/detect/what-version-of-chrome-do-i-have.  


- Once you have confirmed your Chrome version, you can visit https://chromedriver.chromium.org/downloads to download the appropriate WebDriver for Chrome (which comes in an .exe file). For example, my Chrome version is version 86, and so I downloaded ChromeDriver 86.0.4240.22. Do note that the chromedriver.exe file in this repo is that of version 86.0.4240.22 as well.  


- Make sure that the chromedriver.exe application is in the same folder as your script or notebook.
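
As a quick sanity check on the version pairing described above, here is a minimal sketch that compares a Chrome version string with a ChromeDriver version string. The function name `versions_compatible` is hypothetical, and the rule of matching the first three version components is an assumption based on the ChromeDriver version selection guidance (pass `parts=1` to compare only the major version, as in the "version 86" example above).

```python
def versions_compatible(chrome_version: str, driver_version: str, parts: int = 3) -> bool:
    """Return True if the first `parts` dot-separated components match."""
    chrome_prefix = chrome_version.strip().split(".")[:parts]
    driver_prefix = driver_version.strip().split(".")[:parts]
    return chrome_prefix == driver_prefix


# The Chrome version strings below are examples only
print(versions_compatible("86.0.4240.75", "86.0.4240.22"))
print(versions_compatible("87.0.4280.66", "86.0.4240.22"))
```

If the check fails, download the ChromeDriver release that corresponds to your installed Chrome version before moving on.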
___

With the above installations complete, it is time to import the dependencies.

In [1]:
from bs4 import BeautifulSoup
import urllib.request # Import the submodule explicitly so that urllib.request.urlopen is available
import re
import time
import pandas as pd
import os

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

___
<a name="section2"></a>
## Section 2 - Initial Exploration

Although different websites will invariably have different functionalities and design, they still do contain numerous commonalities within the structures since there is always HTML/CSS involved. 

Thus, the first step (before any coding) is to understand the website structure. That means you should first browse the site and click the various buttons and links to see how they work and where they bring you. It is only upon having a clear understanding of the site flow that you are able to build any form of script to automate the scraping process. For the purpose of this exercise, we want to do the following: 

**Step 1**: Click the *'Search'* button (*without* keying any input in the two search bar fields above) to load all the results

<img src="./images/homepage_searchbar.png" width="70%" align = "left">

___

**Step 2**: Click the *'View More Details'* link to access the professional's public details, and you will be redirected to the dedicated details page for that individual.

<img src="./images/view_more_details.png" width="70%" align = "left">

___

**Step 3**: On the details page, we can see the text fields which we can scrape. Once done, click the 'Back to Search Results' link at the bottom to head back to the Search Results page, and repeat this with the next record on that page

<img src="./images/detailed_info.png" width="70%" align = "left">

**Step 4**: Once you are done with the page, click the next number on the pagination row to go to the next page of records. For example, if you are already on page 1, then click '2' to go to the next page.

<img src="./images/search_results_next_page.png" width="88%" align = "left">

One important thing to note is that the next set of pages in the pagination section (located at bottom left of the page) will only appear upon clicking a number closer to the last number on the existing row. For example, to go to page 14, you will need to first click '10' of (Page 1 2 3 4 5 6 7 8 9 10 ..), after which the next pagination row will appear, which is (Page .. 5 6 7 8 9 10 11 12 13 14 ..).

This is actually a source of quite a bit of pain, which you will witness later on.
___

After understanding how the target website works, the next step is to explore the underlying HTML structure of the page. Let's put BeautifulSoup into action.

In [2]:
# Define target website
home_page = "https://prs.moh.gov.sg/prs/internet/profSearch/main.action?hpe=SPC"

# Use urllib to open home page, and BeautifulSoup to parse HTML
home_page_content = urllib.request.urlopen(home_page)
home_page_html = BeautifulSoup(home_page_content, 'html.parser')
home_page_html


<html>
<head>
<title>Professional Registration System</title>
<script src="https://assets.wogaa.sg/scripts/wogaa.js"></script>
<meta content="IE=9" http-equiv="X-UA-Compatible" id="metaCompatible"/><!-- added for PRS-7074 on Apr23,2014  -->
<link href="/prs/css/site.css" rel="stylesheet" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="noindex, nofollow" name="robots"/>
</head>
<script language="Javascript1.1" type="text/javascript">
    var today = new Date();
    var expired = new Date(today.getTime() - 48 * 60 * 60 * 1000); // less 2 days
    var bikky = document.cookie;
	var isInSameFrame = true;

    function deleteCookie(attribute) {
        document.cookie = attribute + "=null; path=/; expires="
                + expired.toGMTString();
        bikky = document.cookie;
    }

    function dode() {
        deleteCookie("AA_JSessionInfo_Cookie_IP");
        deleteCookie("AA_JSessionInfo_Cookie");
    }
</script>
<style> html{dis

*Note: Having basic HTML knowledge will help you greatly in digesting the chunk of HTML code above.*  

Looking at the HTML returned above, it seems odd that we are unable to find the 'Search' button. What I did next was to head back to the page in the browser to inspect it. To do that, you can press the F12 key (or Ctrl + Shift + I), and the following screen will appear. The HTML is fully displayed under the Elements tab, and you can expand the nested elements to have a detailed look at the entire layout structure.

<img src="./images/inspect_page.png" width="88%" align = "left">
___

It is only upon this further inspection that I discovered there is a need to go deeper into this frame named `msg_main`:

```
<frame name="msg_main" noresize="" scrolling="auto" src="/prs/internet/profSearch/showSearchSummaryByName.action"/>
```

<img src="./images/deeper_meme.png" width="50%" align = "left">

This is because this particular HTML frame contains the 'Search' button element we are looking to activate. With that in mind, it's now time to call upon Selenium and Chrome WebDriver to do a test run to access this deeper frame.
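
If you prefer to spot frames programmatically rather than by eye, a small parser can list every frame declared in a page. This is a minimal sketch using only the standard library; the class name `FrameLister` and the trimmed-down sample frameset are illustrative assumptions, not the site's full markup.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Collect (name, src) pairs for every <frame> or <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attr_map = dict(attrs)
            self.frames.append((attr_map.get("name"), attr_map.get("src")))


# Hypothetical, trimmed-down frameset for illustration only
sample_html = """
<frameset rows="*">
  <frame name="msg_main" noresize="" scrolling="auto"
         src="/prs/internet/profSearch/showSearchSummaryByName.action"/>
</frameset>
"""

lister = FrameLister()
lister.feed(sample_html)
print(lister.frames)
```

Each frame name found this way is a candidate for `switch_to.frame()`.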

In [3]:
# Setting up options for WebDriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')

# Initiate WebDriver
driver = webdriver.Chrome(options=options)

# Direct WebDriver to the target home page
driver.get(home_page)
driver.switch_to.frame(driver.find_element_by_name('msg_main'))

The above code will initiate the WebDriver, and you will see a new Chrome window appearing on your screen. In that window, you are able to see that Chrome recognizes an automation test is about to begin. Remember not to close this window, as WebDriver will be using this window to perform the automated tasks.

<img src="./images/automated_test.png" width="60%" align = "left">

Following that, by using the `switch_to.frame()` and `find_element_by_name` methods, we can then access the HTML code within the frame we are targeting. These two methods are part of a whole host of Selenium functions for us to leverage, and you can visit the Selenium documentation page to explore what is available. For example, to find out how to locate elements on a website, you can check out this specific page: https://selenium-python.readthedocs.io/locating-elements.html. Do note that the `find_element_by_*` helpers belong to the Selenium 3 API used in this walkthrough; they were removed in Selenium 4.3, where the equivalent call is `driver.find_element(By.NAME, 'msg_main')`. I will be showcasing more of these Selenium functions later.

After accessing the frame, the next step is to utilize BeautifulSoup to extract the HTML from the current page the WebDriver is at. Remember that our objective is still to find out where the 'Search' button is, so that we can understand how to activate it.

In [4]:
# Parse HTML of current page source
frame_html = BeautifulSoup(driver.page_source, 'html.parser')
frame_html

<html lang="en" style="height:100%" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Professionals Search</title>
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no-cache" http-equiv="Cache-Control"/>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="0" http-equiv="Expires"/>
<meta content="noindex, nofollow" name="robots"/>
<script async="" src="https://assets.wogaa.sg/snowplow/2.14.0/sp.js"></script><script src="https://assets.wogaa.sg/scripts/wogaa.js"></script><script async="" src="https://assets.wogaa.sg/scripts/wogaa.js?url=https%3A%2F%2Fprs.moh.gov.sg%2Fprs%2Finternet%2FprofSearch%2FshowSearchSummaryByName.action"></script>
<link href="/prs/css/internet/spc/header.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/prs/css/internet/public-header.css" media="screen" rel="stylesheet" type="text/css"/>
<!--[if lt IE 8]><link rel="stylesheet" type="text

Lo and behold, after scrolling through the HTML text above, we can finally see the 'Search' button, and it is coded with: 

```
<input type="button" value="Search" name="btnSearch" onclick="resubmit();">
```

Let us use a simple Selenium call to click the 'Search' button we have found (using `find_element_by_xpath`), and then parse the HTML of the resulting page through BeautifulSoup again. 

In [5]:
# Locate and click the Search button
search_button = driver.find_element_by_xpath("//input[@name='btnSearch']")
search_button.click()

You will notice that the WebDriver-controlled Chrome window has loaded the first page of search results.

In [9]:
# Parse and display HTML of results page (after Search button is clicked)
results_html = BeautifulSoup(driver.page_source, 'html.parser')
results_html

<html lang="en" style="height:100%" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Professionals Search</title>
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no-cache" http-equiv="Cache-Control"/>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="0" http-equiv="Expires"/>
<meta content="noindex, nofollow" name="robots"/>
<script async="" src="https://assets.wogaa.sg/snowplow/2.14.0/sp.js"></script><script src="https://assets.wogaa.sg/scripts/wogaa.js"></script><script async="" src="https://assets.wogaa.sg/scripts/wogaa.js?url=https%3A%2F%2Fprs.moh.gov.sg%2Fprs%2Finternet%2FprofSearch%2FgetSearchSummaryByName.action"></script>
<link href="/prs/css/internet/spc/header.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/prs/css/internet/public-header.css" media="screen" rel="stylesheet" type="text/css"/>
<!--[if lt IE 8]><link rel="stylesheet" type="text/

Upon extracting the HTML, we can see that the records have been loaded, and the first 10 results are embedded in the HTML we just extracted. More importantly, we are able to identify the next link to click in order to access the details of each record, i.e. View More Details:

```
<a href="#" onclick="viewMoreDetails('PXXXXXX'); return false;">View more details</a>
```

Since every page loads 10 records, we need a way to collate the list of 10 IDs of the professionals listed on the page. This is the first step in creating a for loop so that we can loop over each record later on. To do this, we make use of a regular expression (regex), since every ID has the same format, i.e. the letter P, followed by 5 digits, and ending with another letter (A-Z).

In [10]:
# Get list of PRN IDs on current page
results_html_text = results_html.get_text() # Convert HTML to text so that we can run regex

id_list = re.findall("P[0-9]{5}[A-Z]{1}", results_html_text)
id_list

['P02392B',
 'P03376F',
 'P03995J',
 'P03945D',
 'P04296Z',
 'P03956Z',
 'P01062F',
 'P04294C',
 'P04307I',
 'P03890C']
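
When the regex is run on raw HTML (for example `driver.page_source`) instead of the extracted text, each ID can appear more than once, because it also sits inside the `viewMoreDetails(...)` link. The sketch below shows two ways to handle that, under the assumption that the markup looks like the hypothetical snippet shown; the IDs are made up.

```python
import re

# Hypothetical snippet shaped like the results page; the IDs are made up
sample_html = """
<td>P12345A</td>
<a href="#" onclick="viewMoreDetails('P12345A'); return false;">View more details</a>
<td>P67890B</td>
<a href="#" onclick="viewMoreDetails('P67890B'); return false;">View more details</a>
"""

# Option 1: capture only the ID passed to viewMoreDetails(...)
pattern = re.compile(r"viewMoreDetails\('(P[0-9]{5}[A-Z])'\)")
ids_from_links = pattern.findall(sample_html)

# Option 2: match every occurrence, then de-duplicate while keeping the order
all_ids = re.findall(r"P[0-9]{5}[A-Z]", sample_html)
unique_ids = list(dict.fromkeys(all_ids))

print(ids_from_links)
print(unique_ids)
```

Option 1 ties each ID to a link that can actually be clicked, whereas Option 2 keeps the original pattern and simply removes repeats.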

As before, the key is locating the elements to click or extract. The following steps rely on knowing how the elements are named within the HTML structure. For example, I know that the name of the professional is within the `table-head` div class, so I run this code to extract the name:

```
hcp_name =  driver.find_element_by_xpath("//div[@class='table-head']").text
```
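
To get a feel for how such attribute-based XPath expressions select elements, you can try them offline. This is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath; the sample fragment is hypothetical and deliberately well-formed (real pages often are not valid XML, which is one reason we let the browser and Selenium do this work).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the details page
sample = """
<body>
  <div class="table-head">EXAMPLE NAME</div>
  <table>
    <tr><td class="no-border table-data">P12345A</td></tr>
    <tr><td class="no-border table-data">Full Registration</td></tr>
    <tr><td class="other">ignored</td></tr>
  </table>
</body>
"""

root = ET.fromstring(sample)

# Same attribute-predicate idea as the Selenium XPath, in ElementTree's XPath subset
name = root.find(".//div[@class='table-head']").text
fields = [td.text for td in root.findall(".//td[@class='no-border table-data']")]

print(name)
print(fields)
```

Note that `[@class='...']` compares the full attribute value, so the `td` with class `other` is not selected.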

From the individual details page, we can then view the various fields and text we can extract.

In [16]:
# Test one record
id_single_list = id_list[:1] # Slice (rather than index) so that we still have a list to loop over
id_single_list

['P02392B']

In [17]:
# Extract fields and HTML of the record
for index, hcp_id in enumerate(id_single_list):
    
    # Click the 'View More Details' link for the specific professional ID
    # This is coded as such because every View More Details anchor link is appended with the HCP ID
    # i.e. <a href="#" onclick="viewMoreDetails('P12345A'); return false;">View more details</a>
    driver.find_element_by_xpath(f"//a[contains(@onclick,'{hcp_id}')]").click()
    
    hcp_html = BeautifulSoup(driver.page_source, 'html.parser')
    hcp_name =  driver.find_element_by_xpath("//div[@class='table-head']").text
    all_fields = driver.find_elements_by_xpath("//td[@class='no-border table-data']") # Using find_elements (plural) since there are multiple elements

In [20]:
hcp_html

<html lang="en" style="height:100%" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Professionals Search</title>
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no-cache" http-equiv="Cache-Control"/>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="0" http-equiv="Expires"/>
<meta content="noindex, nofollow" name="robots"/>
<script async="" src="https://assets.wogaa.sg/snowplow/2.14.0/sp.js"></script><script src="https://assets.wogaa.sg/scripts/wogaa.js"></script><script async="" src="https://assets.wogaa.sg/scripts/wogaa.js?url=https%3A%2F%2Fprs.moh.gov.sg%2Fprs%2Finternet%2FprofSearch%2FgetSearchDetails.action"></script>
<link href="/prs/css/internet/spc/header.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/prs/css/internet/public-header.css" media="screen" rel="stylesheet" type="text/css"/>
<!--[if lt IE 8]><link rel="stylesheet" type="text/css" h

We have successfully gone all the way in to expose the HTML of the public details of a single healthcare professional. Now that the initial exploration is complete, let's get to the real action of automated web scraping!

___

<a name="section3"></a>
## Section 3 - Running Main Script

With the same dependencies installed and imported earlier, we can proceed to build a script to automate the tedious web scraping process for us. I previously mentioned that there are various healthcare professional bodies on the MOH webpage, and they are represented by different abbreviations, for example SDC for the Singapore Dental Council and SMC for the Singapore Medical Council. For the Pharmacists group I am looking at, the abbreviation is SPC.

We prepare our WebDriver once again to access the target HTML frame, along with several WebDriver options to make sure our WebDriver runs smoothly. In particular, we are introducing wait times and sleep times so that there can be appropriate pauses during the scraping process.
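
To make the idea of a wait time concrete, here is a simplified model of what an explicit wait does conceptually: it keeps polling a condition until the condition succeeds or the timeout expires. This is only an illustrative sketch (the names `wait_until` and `element_is_present` are hypothetical), not Selenium's actual implementation of `WebDriverWait`.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only "appears" on the third check
attempts = {"count": 0}


def element_is_present():
    attempts["count"] += 1
    return "search button" if attempts["count"] >= 3 else None


found = wait_until(element_is_present, timeout=5, poll_interval=0.01)
print(found, attempts["count"])
```

In contrast, a fixed `time.sleep()` always pauses for the full duration, whether or not the page is already ready.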

### (i) WebDriver Initiation

In [25]:
# Define healthcare professional (HCP) body 
hcp_body = 'SPC'

# Set wait times
waittime = 20
sleeptime = 2

# Initiate web driver
try:
    driver.quit() # End any existing WebDriver session (closes its browser windows)
except Exception:
    pass

# Access professional registration system (PRS) for specific healthcare professional body
home_page = f"https://prs.moh.gov.sg/prs/internet/profSearch/main.action?hpe={hcp_body}"

# Set webdriver options
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--ignore-certificate-errors')

# Initiate webdriver
driver = webdriver.Chrome(options=options) 

# Get driver to retrieve homepage
driver.get(home_page)

# Switch to frame which contains HTML for Search section
driver.switch_to.frame(driver.find_element_by_name('msg_main'))

# Click Search button (named btnSearch) to load all results
WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, "//input[@name='btnSearch']"))).click()

# Sleep a short while for page loading to be fully completed
time.sleep(sleeptime)

### (ii) Setting up key functions

Because there is a series of actions to take in order to navigate through the different records across different Search results pages, it is best to set up functions to better organize all these steps.

#### (1) Setup master list CSV to store all the records
We first generate a CSV file to store the records which we will be appending later. The column names are based on the earlier exploration of the web page, where we identified the various fields we can extract from each public record.

In [26]:
# Set file name (based on your own preference)
file_name = 'master_list.csv'

# Check if file already exists
if os.path.isfile(f'./{file_name}'):
    print(f'Filename {file_name} already exists')
else:
    # Set names of fields we want to extract
    column_names = ['name','reg_number','reg_date','reg_end_date','reg_type','practice_status','cert_start_date',
                    'cert_end_date','qualification','practice_place_name','practice_place_address','practice_place_phone']
    
    # Generate template dataframe
    df_template = pd.DataFrame(columns = column_names)
    
    # Generate csv file from the dataframe template
    df_template.to_csv(f'{file_name}', header=True)
    
    print('Created new master list file')

Filename master_list_test.csv already exists
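
The same "create the file once, then keep appending rows" pattern can also be expressed without pandas. The sketch below swaps in the standard library's `csv` module purely for illustration; the helper name `append_record`, the shortened column list, and the sample values are all assumptions.

```python
import csv
import os
import tempfile

COLUMNS = ["name", "reg_number", "reg_date"]  # shortened list for illustration


def append_record(path, record):
    """Append one record, writing the header row only if the file is new or empty."""
    is_new = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)


# Demo in a temporary folder so that no real master list is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_master_list.csv")
    append_record(demo_path, {"name": "EXAMPLE NAME", "reg_number": "P12345A", "reg_date": "01/01/2020"})
    append_record(demo_path, {"name": "ANOTHER NAME", "reg_number": "P67890B", "reg_date": "02/02/2020"})
    with open(demo_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(rows)
```

Appending row by row means that an interrupted scrape still leaves every completed record on disk.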


#### (2) Get current page number
We need a function to track which page of the Search results the WebDriver-controlled browser is currently on

In [23]:
def get_current_page():
    """
    Get current Search results page number the WebDriver is on
    """
    # Wait until the selected-page label is located, then read its text
    current_page_elem = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, "//label[@class='pagination_selected_page']"))).text
    
    # Get page number as integer
    current_page_num = int(current_page_elem)
    
    return current_page_num

#### (3) Get absolute last page number
It is important to know how many Search results pages we have in total, so that we can create a page loop accordingly. For example, at the time of this exercise, the last page number is 340 (i.e. there are 340 Search results pages)

In [28]:
def get_absolute_last_page():
    """
    Get last page number of all Search results records
    """
    # Find all elements with pagination class (since it contains page numbers)
    WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, "//a[@class='pagination']")))
    all_pages = driver.find_elements_by_xpath("//a[@class='pagination']")

    # Get final element, which corresponds to 'Last' hyperlink (which will go to last page number)
    last_elem = all_pages[-1].get_attribute('href')

    # Keep only the number of last page
    last_page_num = int(re.sub("[^0-9]", "", last_elem))
    
    return last_page_num
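
One caveat with stripping every non-digit character is that any other digits in the attribute would be glued onto the page number. A slightly more defensive variant takes only the final run of digits. This is a minimal sketch; the helper name `last_number_in` and the sample `href` values are hypothetical, and the real attribute format on the site may differ.

```python
import re


def last_number_in(text):
    """Return the final run of digits in `text` as an int, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None


# Hypothetical href values; the real attribute format on the site may differ
print(last_number_in("javascript:gotoPage(340)"))
print(last_number_in("https://example.com/v2/results?page=340"))
print(last_number_in("javascript:void"))
```

With the second sample value, removing all non-digits would yield 2340 rather than 340, which is the failure mode this variant avoids (assuming the page number always comes last).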

#### (4) Extract data from detailed information page of a single record
While we are on the details page of a particular healthcare professional (after clicking the 'View More Details' link), we need a way to extract all the text from the respective fields available for public viewing. The information for each record is consolidated into a Python dictionary.

Do note that we are no longer using BeautifulSoup here (since BS was used just for demo purposes earlier). We are now depending solely on Selenium to do the parsing and text extraction for us.

In [29]:
def gen_hcp_dict():
    """
    Extract text from details fields and consolidate into dictionary
    """
    WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, "//div[@class='table-head']")))
    hcp_name =  driver.find_element_by_xpath("//div[@class='table-head']").text    
    all_fields = driver.find_elements_by_xpath("//td[@class='no-border table-data']") # Using find elementS since there are multiple elements      
    hcp_data = []
    
    # Extract text for every field within the table
    for field in all_fields:
        hcp_data.append(field.text)

    hcp_dict = {}
    
    # Assign extracted data into respective key-value pairs in dictionary
    hcp_dict['name'] = hcp_name
    hcp_dict['reg_number'] = hcp_data[0]
    # hcp_data[1] is just a blank space, so it can be ignored
    hcp_dict['reg_date'] = hcp_data[2]
    hcp_dict['reg_end_date'] = hcp_data[3]
    hcp_dict['reg_type'] = hcp_data[4]
    hcp_dict['practice_status'] = hcp_data[5]
    hcp_dict['cert_start_date'] = hcp_data[6]
    hcp_dict['cert_end_date'] = hcp_data[7]
    hcp_dict['qualification'] = hcp_data[8]
    hcp_dict['practice_place_name'] = hcp_data[9]
    hcp_dict['practice_place_address'] = hcp_data[10]
    hcp_dict['practice_place_phone'] = hcp_data[11]
    
    return hcp_dict
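
The position-based mapping above can also be written as a lookup list, which makes the "skip index 1" rule explicit and fails loudly if a page returns fewer cells than expected. This is a sketch only; `FIELD_NAMES`, `build_record`, and the sample cell values are illustrative assumptions, and it takes plain strings rather than Selenium elements.

```python
FIELD_NAMES = [
    "reg_number", None, "reg_date", "reg_end_date", "reg_type", "practice_status",
    "cert_start_date", "cert_end_date", "qualification",
    "practice_place_name", "practice_place_address", "practice_place_phone",
]  # None marks the blank cell at index 1


def build_record(name, cell_texts):
    """Pair cell texts with field names by position, skipping unnamed cells."""
    if len(cell_texts) < len(FIELD_NAMES):
        raise ValueError(f"Expected at least {len(FIELD_NAMES)} cells, got {len(cell_texts)}")
    record = {"name": name}
    for key, value in zip(FIELD_NAMES, cell_texts):
        if key is not None:
            record[key] = value
    return record


# Made-up cell values for illustration
cells = [f"value_{i}" for i in range(12)]
record = build_record("EXAMPLE NAME", cells)
print(record["reg_number"], record["reg_date"], record["practice_place_phone"])
```

The length check matters because indexing a shorter list directly would otherwise stop the whole scrape with an `IndexError` on that record.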

#### (5) Get current pagination range
There is a pagination range available on each page of the Search results, and it is usually limited to around 10 page numbers. This means that you are only able to directly access a page that falls within the range currently visible to you. Thus, we need a function to determine the current pagination range that is accessible. Below is a screenshot of the pagination range I am referring to:

<img src="./images/pagination_range.png" width="60%" align = "left">

In [30]:
def get_current_pagination_range():
    """
    Get range of pages visible in pagination range on current page
    """
    WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, "//a[@class='pagination']")))
    
    # Get all elements related to pagination
    all_pages = driver.find_elements_by_xpath("//a[@class='pagination']")
    
    # Set the implicit wait timeout to 1 second (this configures element lookups; it is not a pause)
    driver.implicitly_wait(1)
    
    # Keep only the numeric labels of the pagination range (skips links such as 'Last')
    pagination_range_on_page = []
    for elem in all_pages:
        if elem.text.isnumeric():
            pagination_range_on_page.append(int(elem.text))

    return pagination_range_on_page

#### (6) Click last pagination number on current page
With the pagination range known, we need a function to click on the last number displayed in this range. This is for us to proceed further into the pages beyond the given range. For example, to reveal pages beyond page 10, we need to click the number 10 from the range (Page 1 2 3 4 5 6 7 8 9 10 ..)

In [31]:
def click_last_pagination_num(pagination_range):
    """
    Click last page number on current visible pagination range
    """
    last_pagination_num = pagination_range[-1] 
    
    # Set the implicit wait timeout for element lookups (note that this does not pause execution)
    driver.implicitly_wait(1)
    
    # Click page number once element is located
    WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, f'{last_pagination_num}'))).click() 
    driver.implicitly_wait(1)

#### (7) Click first pagination number on current page
Since there are hundreds of pages, it does not make sense to always start from page 1. This is because it will require an extremely tedious amount of effort to click on the last visible page number of each pagination range to reach the later half of the Search results. 

The solution to this is that for page numbers greater than the halfway mark (e.g. pages beyond 100 if there is a total of 200 pages), our program will instead start from the last page, and go back in reverse to reach those latter pages. This strategy of starting from the last page to reach these latter pages is much more efficient compared to always starting from the first page.

To make that happen, we need a function to click the first number of the existing pagination range. For example, if our current pagination range is (Page .. 151 152 153 154 155 156 157 158 ..), and we want to go to page 147, the program will be clicking the page number 151 so as to reveal the preceding set of pagination range.

In [32]:
def click_first_pagination_num(pagination_range):
    """
    Click first page number on current visible pagination range
    """
    first_pagination_num = pagination_range[0] 
    
    # Set the implicit wait timeout for element lookups (note that this does not pause execution)
    driver.implicitly_wait(1)
    
    # Click page number once element is located
    WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, f'{first_pagination_num}'))).click() 
    driver.implicitly_wait(1)

#### (8) Locate and access target page
Let's put the earlier functions into action. We now need to find a way to get our WebDriver to go to each page to scrape all the relevant information for each of the 10 professionals listed.

A massive pain comes from the fact that when you click the *'Back to Search Results'* link while on the details page of a particular professional, it brings you all the way back to Page 1 of the Search results instead of the current page. This means that if you were looking at the details page of Mr Chan on Page 10 and you click *'Back to Search Results'*, you will be brought back to Page 1 INSTEAD of the page 10 you were on.

As such, we need a function that can repeatedly click the appropriate numbers of the pagination range to return to the target page we were scraping EVERY TIME after the *'Back to Search Results'* link is clicked. The following code sets up the automation to do just that, with the strategic use of conditional statements and loops.

In [33]:
def locate_target_page(target_page):
    """
    Navigate to target page to proceed with scraping
    """
    last_page_num = get_absolute_last_page()
    
    # Find midway point of all search results page
    midway_point = last_page_num/2

    # If target page is in first half, start clicking from the start
    if target_page < midway_point: 
        current_page_num = get_current_page()
    
        if current_page_num == target_page:
            pass
        else:            
            pagination_range = get_current_pagination_range()

            while target_page not in pagination_range:
                driver.implicitly_wait(1)
                
                # If target page is not in pagination range, keep clicking last pagination number to go further down the list
                click_last_pagination_num(pagination_range) 
                current_page_num = get_current_page()
                pagination_range = get_current_pagination_range()
                driver.implicitly_wait(1)
            else:
                WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, f"{target_page}"))).click() # Once target page is in pagination range, go to the target page

    else: # If target page is in later half of list, go to Last page and move in reverse (This saves a lot of time)
        WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, 'Last'))).click()  # Go to last page
        time.sleep(sleeptime)
        current_page_num = get_current_page()

        if current_page_num == target_page:
            pass
        else:           
            pagination_range = get_current_pagination_range()

            while target_page not in pagination_range:
                driver.implicitly_wait(2)
                click_first_pagination_num(pagination_range)
                current_page_num = get_current_page()
                pagination_range = get_current_pagination_range()
            else:
                WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, f"{target_page}"))).click() # Once target page is in pagination range, go to the target page
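
To see why starting from the *'Last'* link helps, here is a minimal sketch of the midpoint decision made in `locate_target_page`. The helper names and the `pages_per_jump` value are assumptions for illustration only; the real number of pages skipped per pagination click depends on how the site renders its pagination range.

```python
import math


def choose_direction(target_page, last_page):
    """Mirror the midpoint test: first half goes forward, later half goes backward."""
    return "forward" if target_page < last_page / 2 else "backward"


def estimated_jumps(target_page, last_page, pages_per_jump=10):
    """Rough click count, ASSUMING each pagination click moves `pages_per_jump` pages."""
    if choose_direction(target_page, last_page) == "forward":
        distance = target_page - 1          # pages away from page 1
    else:
        distance = last_page - target_page  # pages away from the last page
    return math.ceil(distance / pages_per_jump)


# Hypothetical example: 300 result pages, returning to page 290
print(choose_direction(290, 300))   # backward
print(estimated_jumps(290, 300))    # 1
print(math.ceil((290 - 1) / 10))    # 29 (jumps if we always started from page 1)
```

Under that assumption, a target page near the end of the list costs one jump instead of roughly 29, which is the saving the reverse branch is after.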

#### (9) Create script that automates the navigation through the website portal
And now we set up the last function, which will bring everything together.

In [34]:
def full_scrape(target_page):
    """
    Perform scraping for all pages in loaded search results
    """
    last_page_num = get_absolute_last_page()
    driver.implicitly_wait(1)
    
    while target_page != last_page_num:
        locate_target_page(target_page)
        print('Starting with target page ' + str(target_page))
        
        # Retrieve HTML from search page
        target_page_html = driver.find_element(By.XPATH, "//body").get_attribute('outerHTML')  # find_element(By...) works in both Selenium 3 and 4
        driver.implicitly_wait(1)
        
        # Find list of all IDs on that page, and keep the unique IDs
        all_ids = re.findall("P[0-9]{5}[A-Z]{1}", target_page_html)
        id_list = list(dict.fromkeys(all_ids))

        for index, hcp_id in enumerate(id_list): # Tracking the healthcare professional (HCP)'s ID
            # Click 'View More Details' link to access details page for that professional with the specific ID
            WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, f"//a[contains(@onclick,'{hcp_id}')]"))).click()
            
            # Scrape relevant data from details page into a dictionary
            hcp_dict = gen_hcp_dict()
            
            # Convert dict to pandas dataframe (Need to set an index since we are passing scalar values)
            df_hcp_dict = pd.DataFrame(hcp_dict, index=[0])
            
            # Append df row directly to existing master list csv
            df_hcp_dict.to_csv(f'{file_name}', mode='a', header=False)
            
            # Print row that was just scraped (To track progress)
            print(f'Scraped row {index+1} of page {target_page}')
                  
            # Update (+1) target page after successfully scraping the last record on the page
            if index == len(id_list) - 1:
                print(f'Completed scraping for page {target_page}')
                target_page += 1
                print('Updated target page ' + str(target_page))
            else:
                pass

            # Head back to homepage by clicking 'Back to Search Results' link
            WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, 'Back to Search Results'))).click() 
            
            # Go to latest updated target page
            locate_target_page(target_page)
    
    # Define steps when program reaches last page (essentially similar to the steps above)
    else:
        locate_target_page(target_page)
        print('Working on last page')
        target_page_html = driver.find_element(By.XPATH, "//body").get_attribute('outerHTML')
        driver.implicitly_wait(1)
        all_ids = re.findall("P[0-9]{5}[A-Z]{1}", target_page_html)
        id_list = list(dict.fromkeys(all_ids))

        for index, hcp_id in enumerate(id_list): 
            WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.XPATH, f"//a[contains(@onclick,'{hcp_id}')]"))).click()           
            hcp_dict = gen_hcp_dict()
            df_hcp_dict = pd.DataFrame(hcp_dict, index=[0])
            df_hcp_dict.to_csv(f'{file_name}', mode='a', header=False)
            print(f'Scraped row {index+1} of page {target_page}')
                  
            if index == len(id_list)-1:
                print(f'Completed scraping for page {target_page}')
                print('Mission Complete')
            else:
                pass
            
            WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.LINK_TEXT, 'Back to Search Results'))).click() 
            locate_target_page(target_page)
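
The ID extraction step inside `full_scrape` can be tried on its own. Below is a small sketch using the same regular expression and the same order-preserving de-duplication via `dict.fromkeys`. The HTML fragment, the IDs in it, the `viewMoreDetails` handler name and the helper name `extract_unique_ids` are all made up for illustration; the real page markup will differ.

```python
import re

# Hypothetical HTML fragment: each ID appears more than once on the page
sample_html = """
<a onclick="viewMoreDetails('P01234A')">View More Details</a>
<span>P01234A</span>
<a onclick="viewMoreDetails('P05678B')">View More Details</a>
<span>P05678B</span>
"""

# Same pattern as in full_scrape: 'P', five digits, one uppercase letter
ID_PATTERN = re.compile(r"P[0-9]{5}[A-Z]")


def extract_unique_ids(html):
    """Return IDs in order of first appearance, without duplicates."""
    all_ids = ID_PATTERN.findall(html)
    # dict keys keep insertion order, so this de-duplicates while preserving order
    return list(dict.fromkeys(all_ids))


print(extract_unique_ids(sample_html))  # ['P01234A', 'P05678B']
```

Keeping the original order matters here because the loop clicks the records one by one in the order they appear on the page.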

### (iii) Kickstart automated web scraping
It's been a relatively intense process so far to set up all the functions correctly. Expect plenty of trial and error, so be prepared to spend some time and effort getting the automation right. To put things in perspective, this short-term hassle is still far more efficient than doing the scraping manually.

Now that we are done with the hardest part of the coding, it is time to execute all our functions and witness the fruits of our labour.

In [36]:
# Start with selected target page. target_page = 1 by default (to start from page 1)
target_page = 1

# Display time started
print(time.strftime("%H:%M:%S", time.localtime()))

# Run main web scraping function
full_scrape(target_page)

# Display time concluded
print(time.strftime("%H:%M:%S", time.localtime()))

16:02:11


**NOTE**: You will notice many wait times peppered across the script. Scraping fails if a page has not finished loading, so these waits provide some buffer time for page loading. More info on waits can be found here: https://selenium-python.readthedocs.io/waits.html
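
To make the idea of an explicit wait concrete, here is a simplified, Selenium-free sketch of the polling behaviour: check a condition repeatedly until it holds or a timeout expires. The `wait_until` helper and the simulated `page_is_loaded` check are hypothetical stand-ins, not Selenium's actual implementation. Note also that `driver.implicitly_wait(n)` only sets how long element lookups may take before failing; it does not pause the script the way `time.sleep` does.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page that only becomes "ready" on the third check
attempts = {"count": 0}


def page_is_loaded():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_is_loaded, timeout=2, poll_interval=0.01)
print(attempts["count"])  # 3
```

The benefit over a fixed sleep is that the script continues as soon as the condition is met, and fails loudly when it never is.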

Despite that, page loading may still fail from time to time. If it does, note which page was last scraped successfully (from the output of the cell above), update `target_page` in that cell accordingly, and rerun the automated scraping, starting again from the WebDriver initiation section.
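
If you would rather not read through the output by eye, the resume point can be recovered from the progress messages. This is a minimal sketch that assumes the printed output has been copied into a string; the helper name `resume_page` and the sample log are hypothetical. It returns the page of the last scraped row, since that page may be only partially done, and any rows scraped twice are removed during data cleaning.

```python
import re


def resume_page(log_text, default=1):
    """Return the page number of the last 'Scraped row X of page Y' message."""
    pages = re.findall(r"Scraped row \d+ of page (\d+)", log_text)
    return int(pages[-1]) if pages else default


# Hypothetical output captured before the scraper stopped mid-page
sample_log = """Starting with target page 41
Scraped row 10 of page 41
Completed scraping for page 41
Updated target page 42
Starting with target page 42
Scraped row 1 of page 42
Scraped row 2 of page 42
"""

target_page = resume_page(sample_log)
print(target_page)  # 42
```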

### (iv) Data Cleaning of Master List

Once scraping is completed, we can tidy up the master list CSV file to which our data has been progressively appended. This is important because there will likely be duplicate rows if you encountered scraping failures along the way and had to re-run scraping from break points.

In [107]:
# Import master list file
df_master = pd.read_csv(f'{file_name}')

# Display master list columns
df_master.columns

Index(['Unnamed: 0', 'name', 'reg_number', 'reg_date', 'reg_end_date',
       'reg_type', 'practice_status', 'cert_start_date', 'cert_end_date',
       'qualification', 'practice_place_name', 'practice_place_address',
       'practice_place_phone'],
      dtype='object')

In [108]:
# Drop leftmost index column (Unnecessary column)
df_master.drop(columns=df_master.columns[0], inplace=True)

# Remove duplicates
df_master.drop_duplicates(keep='first', inplace=True)

In [112]:
# Export final curated list
df_master.to_excel('master_list_final.xlsx', index = False)
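
The cleaning steps can be checked on a tiny in-memory example before running them on the full file. The rows below are made-up values, not real records; they mimic a CSV whose first column is the index written by `to_csv` on every append, which is why pandas labels it `Unnamed: 0` on import.

```python
import io

import pandas as pd

# Hypothetical CSV content: blank first header, index 0 on every appended row
raw_csv = io.StringIO(
    ",name,reg_number\n"
    "0,Alice Tan,P00001A\n"
    "0,Bob Lim,P00002B\n"
    "0,Bob Lim,P00002B\n"  # duplicate created by re-running from a break point
)

df = pd.read_csv(raw_csv)
print(list(df.columns))  # ['Unnamed: 0', 'name', 'reg_number']

# Drop the leftover index column, then remove exact duplicate rows
df = df.drop(columns=df.columns[0]).drop_duplicates(keep="first").reset_index(drop=True)
print(len(df))  # 2
```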

### (v) Conclusion

And that's it! You have successfully performed web scraping on a website that is relatively difficult to navigate. The concepts carry over to other types of websites, so you are well placed to try web scraping in other kinds of projects. If you encounter specific errors or problems, head back to trusty resources like StackOverflow and the Selenium documentation pages.

Once the data has been curated, you can proceed to perform fun analysis on the dataset, e.g. finding out the proportions of professionals who graduated from various countries.

Please feel free to reach out to me at https://www.linkedin.com/in/kennethleungty/ should you have any suggestions or queries.

Happy web scraping!

___

<a name="section4"></a>
## Section 4 - References  

- https://selenium-python.readthedocs.io/locating-elements.html
- https://stackoverflow.com/questions/49171370/python-selenium-how-can-click-on-onclick-elements
- https://stackoverflow.com/questions/23924008/get-the-text-from-multiple-elements-with-the-same-class-in-selenium-for-python
- https://selenium-python.readthedocs.io/waits.html
- https://stackoverflow.com/questions/28778142/selenium-webdriver-give-nosuchframeexception
- https://medium.com/dunder-data/jupyter-to-medium-initial-post-ecd140d339f0