# Scraping IBM W3 Bluepages

This Notebook is an exercise to scrape IBM Bluepage site for employee hierarchy data. This is purely an exercise for learning Python web scraping capability. The result should not be share with anyone outside IBM.


## Requirements
The requirement is to start anywhere in the IBM Bluepage (`BASE_URL`), from the bluepage for any employee), go over all employees reporting to this person, and recursively all their reportees, up to a given levels (`MAX_LEVEL_down`) down. 

The results will be saved in a CSV file, with following columns:
  1. Serial ID
  2. Employee full name
  3. Employee email address
  4. Employee location
  5. Manager's ID
  6. Number of direct reports
  

## Packages used
A common web scraping application uses requests and BeautifulSoup packages. That works well for source file in HTML. If a web page is dynamically rendered on client side using JavaScript, such as IBM Bluepage, you won't be able to use the same approach, as the requests package will only get you the raw source in JavaScript, not the final HTML content.

To solve this challenge, we need to use additional help. There are many options out there, such as dryscrape. But I was only able to get it work on my Wondows 10 computer using selenium. You need to install Python selenium pacjkge first: `pip install selenium`. But there is another step required to get the Chrome driver installed manually. 

Here are the instructions to setup the Chrome engine for selenium. First, download the ChromeDriver from this [web site](https://sites.google.com/a/chromium.org/chromedriver/downloads). In my case, I [downloaded the win32 version](https://chromedriver.storage.googleapis.com/index.html?path=90.0.4430.24/) to my Windows machine, unzipped it to a folder, and add it to my system `PATH` environment variable. That is all I had to do.

Now we are ready to go. Will start by importing all the required packages:

In [160]:
import requests
import re
import csv
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
from collections import deque

Setup some basic configurations for the scraping, such as starting point, and the depth to traverse. The `max_level_down` below means we will go at most 2 levels below the base_URL, as base_url represents level 0, one level down is considered to be level 1, and so on.

The following code block contains all the configurations which can be applied to the scraping process. Don't change any other code unless you want to make some processing logic change, or bug fixes.

In [53]:
# start with Arvind's page, change this value if want to start from someone else
base_url = 'https://w3.ibm.com/bluepages/profile.html?uid=067308897'

# How many levels down from above page
max_level_down = 2

# Set start level to default value 0
start_level = 0

We will use a queue for traversing the tree of the employees. Each entry in the queue is a tuple with employ page URL and their level. Start the queue with one record, with value of `(base_url,0)`. We will append all reportees to the queue for each record we process, until we hit the `max_level`.

Now we are ready to do the scraping. Will start with a function to scrape one page, and get the user information out from this page, and add the reportee info to the queue. The input to the function is simply a record in the `process_queue`, and the return is tuple of all the required information from this page.

In [20]:
def scrapeOnePage(pageInfo):
    url = pageInfo[0]
    level = pageInfo[1]
    driver = webdriver.Chrome()
    soup = getSoupForUrl(driver, url)
    # print(soup)
    info = scrapeUserInfo(soup)
    
    # close the chrome page for the current URL
    driver.quit()  

In [21]:
def getSoupForUrl(driver, url):
    driver.get(url)
    
    # the following line is unique to the bluepage. If scraping another web site, 
    # you need to find something matches your site design.
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "profile-w3-section"))) 
    #waits 10 seconds until element is located. Can have other wait conditions  such as visibility_of_element_located or text_to_be_present_in_element
    
    html = driver.page_source
    soup = bs(html, "lxml")
    return soup

In [153]:
def scrapeUserInfo(soup):
    name = getName(soup)
    serial = getSerial(soup)
    title = getTitle(soup)
    email = getEmail(soup, serial)
    address = getLocation(soup)
    reports = getNumReports(soup)
    manager = getManagerId(soup)
    return [serial, name, title, email, address, reports, manager]

The above are main processing functions. The following are functions for each data fields to be extracted.

In [67]:
def getName(soup):
    name = soup.select('.name-link.loader-anim.field-text')[0].text
    return name

In [68]:
def getEmail(soup, serial):
    idValue = '#copy-icon-header-'+serial
    email = soup.select(idValue)[0].text
    return email

In [69]:
def getSerial(soup):
    serial = soup.select('#serial-number')[0].text
    values = serial.split(':')
    serial = values[1].strip()
    return serial

In [70]:
def getLocation(soup):
    address = soup.select('.location-link-metrics')[0].text
    return address

In [71]:
def getTitle(soup):
    title = soup.select('.card-h3.ellipsis.hyphenate.loader-anim.active')[1].span.text
    return title

In [72]:
def getNumReports(soup):
    numReports = soup.select('.team-full-link')
    n = 0
    if len(numReports) > 0:
        raw = numReports[0].text
        n = extractValueInParenthesis(raw)
    return n

In [93]:
def getManagerId(soup):
    res = soup.select('.report-name.pd-left-65')
    mid = 'null'
    if len(res) > 0 :
        mid = res[0].a['data-teammember-uid']
    return mid

In [150]:
def getDirectReportIds(soup):
    persons = soup.select("#anchorReports")
    ids = []
    if persons:
        for p in persons[0].find_all('li', {'class':'person'}):
            # name = p.find('span', {'class':'person-name'}).text
            pid = p.div['data-teammember-uid']
            ids.append(pid)
    return ids

Some utility functions:

In [73]:
def extractValueInParenthesis(s):
    result = re.search(r"\(([A-Za-z0-9_]+)\)", s)
    return result.group(1)

In [None]:
def getOutputFileName():
    now = datetime.now()
    timestring = now.strftime("%Y%m%d-%H%M%S")
    return 'bluepage-'+timestring+'.csv'

With all functions defined, we can now start the processing. 

In [23]:
# this is the main function for the scraping

# initialize the processing queue. Each record in the queue is a tuple with URL and Level.
process_queue = deque([(base_url, start_level)])

# setup the CSV export file
filename = getOutputFileName()
output_file = open(filename,'w',newline='')
csv_writer = csv.writer(output_file,delimiter=',')
csv_writer.writerow(['Serial','Name','Title', 'email', 'Address', '#reports', 'Manager'])

# Start the total record count:
total_count = 0

while len(process_queue) > 0:
    record = process_queue.popleft()
    # process user info for the current record, save data to CSV
    # the same function will also add any child records to the processing queue
    scrapeOnePage(record)
    
# close the csv file
output_file.close()
print('Scraped a total number of {} records to file {}'.format(total_count, filename))


[<a class="location-link-metrics" href="http://w3.ibm.com/workplaces/location.html?campusid=HQ1ARM1" role="link" tabindex="2" target="_blank">Armonk, NY , United States</a>, <a class="location-link-metrics" href="http://w3.ibm.com/workplaces/location.html?campusid=HQ1ARM1" role="link" tabindex="2" target="_blank">Armonk, NY , United States</a>]


The following are test scripts.

In [23]:
scrapeOnePage(info)

[]


In [33]:
res = requests.get(base_url)
soup = bs4.BeautifulSoup(res.text,"lxml")
soup.select('title')[0].text

'BluePages'

In [147]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome()
# url = 'https://w3.ibm.com/bluepages/profile.html?uid=1G4482897'  # ming
url = 'https://w3.ibm.com/bluepages/profile.html?uid=067308897'  # Arvind 
# url = 'https://w3.ibm.com/bluepages/profile.html?uid=934567897'  # Rob
driver.get(url)
currentLevel = 6

element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'profile-w3-section'))) 

#if currentLevel > 0:
#    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "anchorNonPersonReports"))) 
#else:
#    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "anchorReports"))) 
#waits 10 seconds until element is located. Can have other wait conditions  such as visibility_of_element_located or text_to_be_present_in_element

html = driver.page_source
soup = bs(html, "lxml")


In [152]:
name = getName(soup)
serial = getSerial(soup)
title = getTitle(soup)
email = getEmail(soup, serial)
address = getLocation(soup)
reports = getNumReports(soup)
manager = getManagerId(soup)
[serial, name, title, email, address, reports, manager]

['067308897',
 'Arvind Krishna',
 'Chairman and Chief Executive Officer',
 'arvindk@us.ibm.com',
 'Armonk, NY , United States',
 '16',
 'null']

In [26]:
address = soup.select('.location-link-metrics')[0].text
print(address)

Armonk, NY , United States


In [35]:
name = soup.select('.name-link.loader-anim.field-text')[0].text
print(name)

Arvind Krishna


In [41]:
title = soup.select('.card-h3.ellipsis.hyphenate.loader-anim.active')[1].span.text
title

'Chairman and Chief Executive Officer'

In [46]:
serial = soup.select('#serial-number')[0].text
values = serial.split(':')
serial = values[1].strip()
print(serial)

067308897


In [61]:
numReports = soup.select('.team-full-link')
if len(numReports) > 0:
    raw = numReports[0].text
    print(raw)
    n = extractValueInParenthesis(raw)
    print(n)

Reports (16)
<re.Match object; span=(8, 12), match='(16)'>
16


In [65]:
n = getNumReports(soup)
print(n)

16


In [148]:
# test direct reports
persons = soup.select("#anchorReports")
ids = []
if persons:
    for p in persons[0].find_all('li', {'class':'person'}):
        name = p.find('span', {'class':'person-name'}).text
        pid = p.div['data-teammember-uid']
        ids.append(pid)
        print('Name: {},  id: {}'.format(name, pid))
ids

Name: Jonathan Adashek,  id: 4J6316897
Name: Howard Boville,  id: 4J8426897
Name: Michelle Browdy,  id: 5D0691897
Name: Gary Cohn,  id: 5J2912897
Name: Mark Foster,  id: 6G1701897
Name: Darío Gil ,  id: 784626897
Name: James Kavanaugh,  id: 914589897
Name: John Kelly,  id: 086767897
Name: Nickle LaMoreaux,  id: 3A1811897
Name: Carla Piñeyro Sublett,  id: 5J3109897
Name: Tom Rosamilia,  id: 579070897
Name: Martin Schroeter,  id: 216989897
Name: Martin Schroeter,  id: 6J2643897
Name: Bridget Van Kralingen,  id: 9A7664897
Name: Jim Whitehurst,  id: 3J7003897
Name: Kate Woolley,  id: 4J8390897


['4J6316897',
 '4J8426897',
 '5D0691897',
 '5J2912897',
 '6G1701897',
 '784626897',
 '914589897',
 '086767897',
 '3A1811897',
 '5J3109897',
 '579070897',
 '216989897',
 '6J2643897',
 '9A7664897',
 '3J7003897',
 '4J8390897']

In [103]:
# test manager ID
res = soup.select('.report-name.pd-left-65')
mid = 'null'
res
if len(res) > 0 :
    mid = res[0].a['data-teammember-uid']
print(mid)

3J7003897


In [146]:
driver.quit()

In [162]:
# test CSV export
filename = getOutputFileName()
output_file = open(filename,'w',newline='')
csv_writer = csv.writer(output_file,delimiter=',')
csv_writer.writerow(['Serial','Name','Title', 'email', 'Address', '#reports', 'Manager'])
uinfo = scrapeUserInfo(soup)
csv_writer.writerow(uinfo)
output_file.close()


In [161]:
def getOutputFileName():
    now = datetime.now()
    timestring = now.strftime("%Y%m%d-%H%M%S")
    return 'bluepage-'+timestring+'.csv'