# Scraping UW CS faculty homepages

## Pre-Step: System setup 

Before we start, make sure to install the required libraries:
    
    pip install bs4
    pip install selenium

Since UW's website renders some of its HTML content with JavaScript, we'll use Selenium to scrape the dynamically loaded content. For this, you also need to download a webdriver for a Selenium-supported browser.

For example, for Chrome, download the appropriate webdriver from http://chromedriver.chromium.org/downloads, unzip it, and save it in the current directory.

|University Name | University of Washington|
|----------------|------------------------|
|Department Name | Computer Science |
|Faculty Home Page | https://www.cs.washington.edu/people/faculty |

In [1]:
from bs4 import BeautifulSoup
from selenium import webdriver 
from selenium.webdriver.chrome.options import Options
import re 
import urllib

In [2]:
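#create a webdriver object and set options for headless browsing
#(Selenium 4 style; Selenium 3 used webdriver.Chrome('./chromedriver',options=options))
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(service=Service('./chromedriver'),options=options)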

## Step 1: Design Helper Functions

If you visit UW's CS Faculty Directory Listing at https://www.cs.washington.edu/people/faculty, you'll notice that it lists all the faculty.

A faculty profile page can be reached in two ways:
1. Clicking on a faculty member's photo -> leads to the official faculty page.
2. Clicking on a faculty member's name -> leads to the personal homepage, if one is listed.

We will scrape details from the personal homepage when one exists. Otherwise, we will scrape them from the official faculty page.

Before we start scraping, we'll define some helper functions.

In [23]:
#uses webdriver object to execute javascript code and get dynamically loaded webcontent
def get_js_soup(url,browser):
    browser.get(url)
    res_html = browser.execute_script('return document.body.innerHTML')
    soup = BeautifulSoup(res_html,'html.parser') #beautiful soup object to be used for parsing html content
    return soup

#tidies extracted text 
def process_bio(bio):
    bio = bio.encode('ascii',errors='ignore').decode('utf-8')       #removes non-ascii characters
    bio = re.sub(r'\s+',' ',bio)       #replaces repeated whitespace characters with a single space
    bio = bio.strip() #trims whitespace at the beginning and end
    return bio

''' More tidying
Text extracted from an HTML page may sometimes contain JavaScript code and style elements.
This function removes script and style tags from the HTML so that the extracted text does not contain them.
'''
def remove_script(soup):
    for script in soup(["script", "style"]):
        script.decompose()
    return soup
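
To make the effect of `process_bio` concrete, here is a minimal standalone sketch that applies the same three steps to a made-up string (the sample text is hypothetical, not scraped data). Note that non-ASCII characters are dropped, not transliterated, so accented names lose letters.

```python
import re

def process_bio(bio):
    # same steps as the helper above: drop non-ASCII characters,
    # collapse runs of whitespace, then trim both ends
    bio = bio.encode('ascii', errors='ignore').decode('utf-8')
    bio = re.sub(r'\s+', ' ', bio)
    return bio.strip()

# hypothetical sample input with accents, tabs, and newlines
raw = '  Jos\u00e9   Garc\u00eda\n\tProfessor of CS  '
cleaned = process_bio(raw)
print(cleaned)  # Jos Garca Professor of CS
```

Because the cleaned text never contains a newline, each bio can later be stored on a single line of a text file.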

## Step 2: Scrape Faculty Listing

Now, let's start scraping.

First, let's get the links to all faculty profile pages by scraping the directory listing. Chrome developer tools (F12) make it easier to find the links within the HTML content and layout. For the target URL:

1. Each faculty member is shown in a <div\> with class "row directory-row"
2. The faculty profile fields are under a <div\> with class "directory-name"
3. The link can be found in the "href" attribute of an <a\> tag


![faculty_urls](img/uw_faculty_urls.PNG)



Now we can specify exactly what needs to be extracted from the directory listing page:

In [24]:
#extracts all Faculty Profile page urls from the Directory Listing Page
def scrape_dir_page(dir_url,browser):
    print ('-'*20,'Scraping directory page','-'*20)
    faculty_links = []
    faculty_base_url = 'https://www.cs.washington.edu'
    #execute js on webpage to load faculty listings on webpage and get ready to parse the loaded HTML 
    soup = get_js_soup(dir_url,browser)     
    #iterate over every <div> of class 'row directory-row' (one per faculty member)
    for faculty_div in soup.find_all('div', class_='row directory-row'):
        official_link_holder = faculty_div.find('div',class_='col-sm-2 directory-photo-container')
        #url returned is relative, so we need to add base url
        official_link = faculty_base_url + official_link_holder.find('a')['href'] 
        
        home_page_holder = faculty_div.find('div',class_='directory-name')
        home_page_link = home_page_holder.find('a')['href']
        if not home_page_link.startswith(('http://', 'https://')):
            home_page_link = faculty_base_url + home_page_link
        
        links = {}
        links["official"] = official_link.strip('/')
        links["homepage"] = home_page_link.strip('/').replace('http://www.cs.washington.edu/people/faculty/','https://www.cs.washington.edu/people/faculty/')
        
        faculty_links.append(links)
        
    print ('-'*20,'Found {} faculty profile urls'.format(len(faculty_links)),'-'*20)
    return faculty_links
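
The function above builds absolute URLs by string concatenation plus a `startswith` check. As an alternative (not what the notebook does), the standard library's `urllib.parse.urljoin` can resolve relative and absolute links in one call. The sketch below is a minimal illustration; `normalize_link` and `BASE_URL` are hypothetical names introduced here.

```python
from urllib.parse import urljoin

# assumed base, same value as faculty_base_url in scrape_dir_page
BASE_URL = 'https://www.cs.washington.edu'

def normalize_link(href, base_url=BASE_URL):
    """Resolve href against base_url and drop any trailing slash."""
    # urljoin leaves absolute URLs untouched and prefixes relative ones
    return urljoin(base_url, href.strip()).rstrip('/')

print(normalize_link('/people/faculty/althoff'))
print(normalize_link('http://www.timalthoff.com/'))
```

This sketch does not rewrite `http://` directory links to `https://`; that extra step in `scrape_dir_page` would still be needed.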

In [5]:
#url of directory listings of CS faculty in UW
dir_url = 'https://www.cs.washington.edu/people/faculty' 
faculty_links = scrape_dir_page(dir_url,browser)

-------------------- Scraping directory page --------------------
-------------------- Found 83 faculty profile urls --------------------


In [13]:
print(faculty_links[0])
print(faculty_links[15])
print(faculty_links)

{'official': 'https://www.cs.washington.edu/people/faculty/althoff', 'homepage': 'http://www.timalthoff.com'}
{'official': 'https://www.cs.washington.edu/people/faculty/etzioni', 'homepage': 'http://homes.cs.washington.edu/~etzioni'}
[{'official': 'https://www.cs.washington.edu/people/faculty/althoff', 'homepage': 'http://www.timalthoff.com'}, {'official': 'https://www.cs.washington.edu/people/faculty/anderson', 'homepage': 'https://www.cs.washington.edu/people/faculty/anderson'}, {'official': 'https://www.cs.washington.edu/people/faculty/rea', 'homepage': 'http://homes.cs.washington.edu/~rea'}, {'official': 'https://www.cs.washington.edu/people/faculty/tom', 'homepage': 'https://www.cs.washington.edu/people/faculty/tom'}, {'official': 'https://www.cs.washington.edu/people/faculty/magda', 'homepage': 'https://www.cs.washington.edu/people/faculty/magda'}, {'official': 'https://www.cs.washington.edu/people/faculty/beame', 'homepage': 'https://www.cs.washington.edu/people/faculty/beame'},

## Step 3: Scrape Faculty Official Page & Homepage

The script above returns both the official page and the personal homepage for each faculty member. The two URLs are identical when the directory lists no separate personal homepage; in that case, we treat the official faculty page as the homepage.


To pick the best profile page for each faculty member, the design is:

1. Scrape the official page, e.g. https://www.cs.washington.edu/people/faculty/althoff

    > Get text from the contact <div\> with class "row directory-row contact-block"
    
    > Get text from the main <section\> with class "block block-system clearfix"
    
 ![faculty_urls](img/uw_faculty_detail.PNG)
2. Scrape the personal homepage if one is provided and it differs from the official page, e.g. http://www.timalthoff.com

The rule for choosing the best profile page is:
1. If the official page and the homepage are the same, use that single URL as the source of the biography.
2. If a different URL is provided as the personal homepage, use the personal homepage only if
 > the text scraped from it is longer than the text scraped from the official page.
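
The selection rule above can be written as a small pure function, which makes it easy to check without a browser. This is a minimal sketch: `pick_bio` and the sample strings are hypothetical, introduced only for illustration.

```python
def pick_bio(official_url, bio_official, homepage_url, bio_homepage):
    """Return the (url, bio) pair chosen by the rule described above."""
    # prefer the personal homepage only when it is a different URL
    # and yields strictly more text than the official page
    if homepage_url != official_url and len(bio_homepage) > len(bio_official):
        return homepage_url, bio_homepage
    return official_url, bio_official

# hypothetical data: the homepage text is longer, so it wins
url, bio = pick_bio(
    'https://example.edu/faculty/x', 'short official bio',
    'https://x.example.org', 'a much longer biography from the personal homepage',
)
print(url)
```

Text length is only a rough proxy for quality: a homepage dominated by navigation text could still win over a concise official page.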
    




In [25]:
def scrape_faculty_page(links,browser):
    homepage = links['homepage']
    official = links['official']
    
    bio = ''
    bio_official = ''
    bio_homepage = ''
    bio_url = ''
    
    bio_soup = remove_script(get_js_soup(official,browser)) 
        
    #we're only interested in some parts of the profile page namely the address
    #and information listed under the Overview, Research, Publication, etc
    contact_div = bio_soup.find('div',class_='row directory-row contact-block')
    detail_section = bio_soup.find('section',class_='block block-system clearfix')
    if contact_div is None or detail_section is None:
        #if the official page redirects or uses a custom layout, fall back to scraping all of its text
        bio_official = bio_soup.get_text(separator=' ')
    else:
        bio_official += contact_div.get_text(separator=' ') + ' '
        bio_official += detail_section.get_text(separator=' ')
    bio_official = process_bio(bio_official)
        
    #if a homepage is provided and differs from the official page,
    #try to scrape its content.
    if homepage != official:
        bio_soup = remove_script(get_js_soup(homepage,browser)) 
        
        #get all the text from the homepage, since there's no easy way to filter out noise such as navigation bars
        bio_homepage += process_bio(bio_soup.get_text(separator=' ')) 
        
    # by default, we will use the bio from the official page
    # however, if the faculty member has a personal homepage, we use whichever page has more content
    if homepage != official and len(bio_homepage) > len(bio_official):
        bio = bio_homepage
        bio_url = homepage
    else:
        bio = bio_official
        bio_url = official
        
    return bio_url,bio

In [26]:
#Scrape all faculty homepages using profile page urls
bio_urls, bios = [],[]
tot_urls = len(faculty_links)
for i,link in enumerate(faculty_links):
    print ('-'*20,'Scraping faculty url {}/{}'.format(i+1,tot_urls),'-'*20)
    bio_url,bio = scrape_faculty_page(link,browser)
    bio_urls.append(bio_url)
    bios.append(bio)

-------------------- Scraping faculty url 1/83 --------------------
-------------------- Scraping faculty url 2/83 --------------------
-------------------- Scraping faculty url 3/83 --------------------
-------------------- Scraping faculty url 4/83 --------------------
-------------------- Scraping faculty url 5/83 --------------------
-------------------- Scraping faculty url 6/83 --------------------
-------------------- Scraping faculty url 7/83 --------------------
-------------------- Scraping faculty url 8/83 --------------------
-------------------- Scraping faculty url 9/83 --------------------
-------------------- Scraping faculty url 10/83 --------------------
-------------------- Scraping faculty url 11/83 --------------------
-------------------- Scraping faculty url 12/83 --------------------
-------------------- Scraping faculty url 13/83 --------------------
-------------------- Scraping faculty url 14/83 --------------------
-------------------- Scraping faculty url 1
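
With this many pages, a single page that fails to load would abort the whole loop. One possible safeguard, sketched below under stated assumptions, is to wrap each call in `try`/`except` and record failures for a later retry. `scrape_all` and `fake_scrape` are hypothetical names; `fake_scrape` stands in for `scrape_faculty_page` so the sketch runs without a browser.

```python
def scrape_all(faculty_links, scrape_fn):
    """Apply scrape_fn to each links dict, collecting results and failures."""
    results, failures = [], []
    for i, links in enumerate(faculty_links):
        try:
            results.append(scrape_fn(links))
        except Exception as exc:
            # keep going; remember which entry failed and why
            failures.append((i, links, repr(exc)))
    return results, failures

def fake_scrape(links):
    # stand-in scraper: fails for URLs ending in 'broken'
    if links['official'].endswith('broken'):
        raise RuntimeError('page did not load')
    return links['official'], 'bio text'

demo_links = [
    {'official': 'https://example.edu/ok', 'homepage': 'https://example.edu/ok'},
    {'official': 'https://example.edu/broken', 'homepage': 'https://example.edu/broken'},
]
results, failures = scrape_all(demo_links, fake_scrape)
print(len(results), len(failures))
```

In the notebook, this could be called as `scrape_all(faculty_links, lambda links: scrape_faculty_page(links, browser))`.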

## Post-Step: Save Results
Finally, write the URLs and extracted bios to text files.

In [20]:
def write_lst(lst,file_):
    with open(file_,'w') as f:
        for item in lst:
            f.write(item)
            f.write('\n')

In [28]:
bio_urls_file = 'bio_urls.txt'
bios_file = 'bios.txt'
write_lst(bio_urls,bio_urls_file)
write_lst(bios,bios_file)
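
Since the two files hold one entry per line in the same order, they can be read back and paired line by line. The following is a minimal round-trip sketch; `read_lst` and the sample data are hypothetical, and it assumes no entry contains a newline (which holds for bios cleaned by `process_bio`).

```python
import os
import tempfile

def write_lst(lst, file_):
    # same behavior as the helper above: one item per line
    with open(file_, 'w') as f:
        for item in lst:
            f.write(item)
            f.write('\n')

def read_lst(file_):
    # hypothetical counterpart: read items back, dropping the newline
    with open(file_) as f:
        return [line.rstrip('\n') for line in f]

sample_urls = ['https://example.edu/a', 'https://example.edu/b']
sample_bios = ['First bio on one line', 'Second bio on one line']

with tempfile.TemporaryDirectory() as tmp:
    urls_path = os.path.join(tmp, 'bio_urls.txt')
    bios_path = os.path.join(tmp, 'bios.txt')
    write_lst(sample_urls, urls_path)
    write_lst(sample_bios, bios_path)
    # line i of one file corresponds to line i of the other
    pairs = list(zip(read_lst(urls_path), read_lst(bios_path)))

print(pairs[0])
```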