# Assignment 1: Web Scraping

## Objective

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Web scraping is therefore an essential skill for every data scientist. In this assignment, you will learn the following:


* How to download HTML pages from a website
* How to extract relevant content from an HTML page

Furthermore, you will gain a deeper understanding of the data science lifecycle.

**Requirements:**

1. Please use [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) rather than spark.DataFrame to manipulate data.

2. Please use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) rather than [lxml](http://lxml.de/) to parse an HTML page and extract data from the page.

3. Please follow the Python code style guide, [PEP 8](https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

## Preliminary

If this is your first time writing a web scraper, you will need some background on the topic. I found this to be a good resource: [Tutorial: Web Scraping and BeautifulSoup](https://realpython.com/beautiful-soup-web-scraper-python/).

Please let me know if you find a better resource. I'll share it with the other students.

## Overview

Imagine you are a data scientist working at HKUST(GZ). Your job is to extract insights from HKUST(GZ) data to answer questions. 

In this assignment, you will complete two tasks. Recall the high-level data science lifecycle from Lecture 1. While working on the assignment, keep in mind what data you have collected and what questions you are trying to answer.

## Task 1: HKUST(GZ) Information Hub Faculty Members

Sometimes you don't know what questions to ask. No worries. Start collecting data first. 

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://facultyprofiles.hkust-gz.edu.cn/](https://facultyprofiles.hkust-gz.edu.cn/).




### (a) Crawl Web Page

A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("infhfaculty.html").

### My answer:

In [1]:
# install relevant packages
!pip install selenium 
!pip install webdriver_manager
!pip install beautifulsoup4



In [2]:
# import relevant / sufficient packages 
from bs4 import BeautifulSoup
import codecs
import re
import requests
import copy
import csv
import time
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select

In [3]:
# getting data from the required webpage
response = requests.get('https://facultyprofiles.hkust-gz.edu.cn/')

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Save the HTML content to a file
    with open("infhfaculty.html", "w", encoding="utf-8") as html_file:
        html_file.write(str(soup))
    
    print("HTML content saved as 'infhfaculty.html'")

else:
    print("Failed to retrieve the web page. Status code:", response.status_code)


HTML content saved as 'infhfaculty.html'


In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=1,user-scalable=no" name="viewport"/>
  <link href="/favicon.ico" rel="icon"/>
  <title>
   Faculty Profile | HKUST(GZ)
  </title>
  <link href="/css/chunk-68bc8d83.24b819a5.css" rel="prefetch"/>
  <link href="/css/chunk-75f8e0d6.59edb571.css" rel="prefetch"/>
  <link href="/css/chunk-876f85d2.de927588.css" rel="prefetch"/>
  <link href="/js/chunk-17674ada.8e51a3a0.js" rel="prefetch"/>
  <link href="/js/chunk-68bc8d83.99ebc600.js" rel="prefetch"/>
  <link href="/js/chunk-75f8e0d6.6e100b20.js" rel="prefetch"/>
  <link href="/js/chunk-876f85d2.869022b3.js" rel="prefetch"/>
  <link as="style" href="/css/app.d4c609ae.css" rel="preload"/>
  <link as="style" href="/css/chunk-vendors.70e18e9e.css" rel="preload"/>
  <link as="script" href="/js/app.83fc59d3.js" rel="preload"/>
  <link as="scrip

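Beautiful Soup is designed for parsing and navigating HTML and XML documents; it does not execute JavaScript. The HTML downloaded above consists mostly of script and stylesheet links, with the faculty data filled in by JavaScript in the browser, so here we use Selenium with a WebDriver to load the page and interact with the rendered content.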

In [5]:
# Use the Edge WebDriver (Selenium 4 style: pass the driver path via a Service object)
from selenium.webdriver.edge.service import Service as EdgeService

driver = webdriver.Edge(service=EdgeService(
    executable_path='C:/Users/cindy/OneDrive - HKUST (Guangzhou)/DSC/msedgedriver.exe'))
driver.get('https://facultyprofiles.hkust-gz.edu.cn/')

# Click 'Info Hub' and wait for the faculty list to render
driver.find_element(By.XPATH, '//*[@id="app"]/section/section/div/ul[1]/li[3]').click()
time.sleep(5)

# Parse the rendered page with BeautifulSoup
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")
# Find the table that holds the faculty list
table = soup.find('table', class_='el-table__body')
driver.quit()

### (b) Extract Structured Data

Please write code to extract relevant content (name, rank, area, profile, homepage, ...) from "infhfaculty.html" and save it as a CSV file named "faculty_table.csv".

### My answer: 
According to the CSV template provided, the result should look like this:

| Name       | Rank               | Area                                                                         | Profile                                                                                     | Homepage            | Email                  | Office    |
|------------|--------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------|------------------------|-----------|
| Yuyu LUO  | Assistant Professor | Intelligent Visualization and Visual Analytics, Orchestrating Data Analytics Pipelines, Data-centric Artificial Intelligence, Data Management for Data Science | https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo | https://luoyuyu.vip/ | yuyuluo@hkust-gz.edu.cn | E2 L6 615 |

Here, we will build one.
To get each faculty member's research areas, personal homepage link, and office location, we have to open their personal profile page and extract the data from there.
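
One detail worth keeping in mind when saving such a table: the Area value itself contains commas, so it has to be quoted in the CSV file. The sketch below is only an illustration, using two columns from the template row above and an in-memory buffer instead of a real file. It suggests that Python's `csv` module takes care of the quoting on write and undoes it on read.

```python
import csv
import io

# Two columns from the template row; the Area value contains commas.
row = {
    "Name": "Yuyu LUO",
    "Area": "Data-centric Artificial Intelligence, Data Management for Data Science",
}

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Name", "Area"])
writer.writeheader()
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values, commas included.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

The same holds for `csv.DictWriter` writing to a file opened with `newline=''`.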

In [68]:
# Initialize lists to store data
names = []
ranks = []
areas = []
profiles = []
homepages = []
emails = []
offices = []

In [69]:
# Launch Edge using the local msedgedriver binary (Selenium 4 style)
from selenium.webdriver.edge.service import Service as EdgeService

driver = webdriver.Edge(service=EdgeService(
    executable_path='C:/Users/cindy/OneDrive - HKUST (Guangzhou)/DSC/msedgedriver.exe'))
# Load the faculty listing page
driver.get('https://facultyprofiles.hkust-gz.edu.cn/')

# Click 'Info Hub' to list the Information Hub faculty
driver.find_element(By.XPATH, '//*[@id="app"]/section/section/div/ul[1]/li[3]').click()
time.sleep(2)
# 'links' holds the 'More' buttons, one per faculty member
links = driver.find_elements(By.CSS_SELECTOR, 'button.el-button.el-button--text.more-btn')

# Visit each personal page by clicking its 'More' button
for link in links:
    try:
        # Click 'More' and switch the driver to the new tab
        link.click()
        time.sleep(2)
        driver.switch_to.window(driver.window_handles[1])

        # Click 'Research Interests', then parse the rendered page
        time.sleep(3)
        driver.find_element(By.XPATH, "//div[contains(text(), 'RESEARCH INTEREST')]").click()
        time.sleep(3)
        soup2 = BeautifulSoup(driver.page_source, "html.parser")

        # Get name
        name = soup2.find('h2', {'class': 'english-name'}).get_text(strip=True)

        # Get rank
        rank = soup2.find('p', {'class': 'positions-class'}).get_text(strip=True)

        # Get area: join the text of every <p class="content"> inside "overview-div"
        area_div = soup2.find('div', {'class': 'overview-div'})
        if area_div:
            area_elements = area_div.find_all('p', {'class': 'content'})
            area = ', '.join(element.get_text(strip=True) for element in area_elements)
        else:
            area = ''

        # Get profile (the URL of this page)
        profile = driver.current_url
        # Get personal homepage (empty if the link is missing)
        homepage_element = soup2.find('a', string='Personal Web')
        homepage = homepage_element.get('href', '') if homepage_element else ''

        # Get email from the 'mailto:' link (empty if the link is missing)
        mailto_link = soup2.find('a', class_='icon-text',
                                 href=lambda x: x and x.startswith('mailto:'))
        email = mailto_link['href'] if mailto_link else ''

        # Get office location
        office = soup2.find('p', {'class': 'icon-text'}).get_text(strip=True)
        print(', '.join([name, rank, area, profile, homepage, email, office]))

        # Append each field to its list
        names.append(name)
        ranks.append(rank)
        areas.append(area)
        profiles.append(profile)
        homepages.append(homepage)
        emails.append(email)
        offices.append(office)

    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        # Close the profile tab if one is open, then return to the listing tab
        if len(driver.window_handles) > 1:
            driver.switch_to.window(driver.window_handles[1])
            driver.close()
        driver.switch_to.window(driver.window_handles[0])

# Close the WebDriver
driver.quit()



Lei CHEN, Dean, Data-driven machine learning, Crowdsourcing-based data processing, Uncertain and probabilistic databases, Web information management, Multimedia systems, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/CHEN-Lei/leichen, https://facultyprofiles.ust.hk/profiles.php?profile=lei-chen-leichen, mailto:leichen@hkust-gz.edu.cn, E3 L5 511
Pan HUI, Chair Professor, Mobile computing, Computer networking, Data analytics, Human-computer interaction, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/HUI-Pan/panhui, https://panhui.people.ust.hk, mailto:panhui@hkust-gz.edu.cn, E1 L6 605
Vincent Kin Nang LAU, Chair Professor, Stochastic Optimization and Analysis for wireless systems, Massive MIMO Systems, Sparse Recovery, Bayesian Inferencing, Mission-Critical IoT, PHY Caching for Wireless Networks, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LAU-KinNang/eeknlau, https://eeknlau.home.ece.ust.hk/HKUST-Office-HomePage/HKUST_Home.html, mailto:eek

Sean Sihong XIE, Associate Professor, Responsible machine learning on graphs, Data-Centric AI, Misinformation detection, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/XIE-Sihong/sihongxie, , mailto:sihongxie@hkust-gz.edu.cn, E4 L3 306
Tengfei CHANG, Assistant Professor, Low-Power Wireless Mesh Networks, Indoor Wireless Localization, Sensor Fusion, Swarm Robotics/Intelligence, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/CHANG-Tengfei/tengfeichang, , mailto:tengfeichang@hkust-gz.edu.cn, W2(C8) L6 606
Huangxun CHEN, Assistant Professor, Internet of Things, Cyber-physical/AI Security, Intelligent Network OAM, Human-centered Interaction System, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/CHEN-Huangxun/huangxunchen, https://www.chenhuangxun.com/, mailto:huangxunchen@hkust-gz.edu.cn, E1 L6 606
Yingcong CHEN, Assistant Professor, Computer Vision, Machine Learning, https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/CHEN-Yingcong/yin

Wei ZENG, Assistant Professor, , https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/ZENG-Wei/weizeng, http://zeng-wei.com/, mailto:weizeng@hkust-gz.edu.cn, W1 L6 601
Theodoros PAPATHEODOROU, Associate Thrust Head, , https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/PAPATHEODOROU-Theodoros/theodoros, , mailto:theodoros@hkust-gz.edu.cn, E2 L3 308
Rui HU, Lecturer I, Media Arts (applied computer graphics in art, simulation, game, and virtual reality in art; film and animation; installation art), Philosophy (time, causation, process, simulation), https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/HU-Rui/ruihu, https://hurui.ooo/, mailto:ruihu@hkust-gz.edu.cn, E1 L4 406
Jake Junjie ZHANG, Lecturer I, , https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/ZHANG-Junjie/jakezhang, https://www.jakeanime.com/, mailto:jakezhang@hkust-gz.edu.cn, E2 L5 506
Meihui ZHANG, Visiting Professor, , https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/ZHA

In [74]:
# Write the collected data to the CSV file required by the task
with open('faculty_table.csv', 'w', newline='', encoding="utf-8") as csvfile:
    fieldnames = ['Name', 'Rank', 'Area', 'Profile', 'Homepage', 'Email', 'Office']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()

    for i in range(len(names)):
        # Remove the 'mailto:' prefix from the email
        email_without_prefix = emails[i].replace('mailto:', '')
        # Write one row per faculty member
        writer.writerow({
            'Name': names[i],
            'Rank': ranks[i],
            'Area': areas[i],
            'Profile': profiles[i],
            'Homepage': homepages[i],
            'Email': email_without_prefix,
            'Office': offices[i]
        })


In [75]:
# Check the first few rows of the CSV file
df = pd.read_csv('faculty_table.csv')
df.head()

|   | Name | Rank | Area | Profile | Homepage | Email | Office |
|---|------|------|------|---------|----------|-------|--------|
| 0 | Lei CHEN | Dean | Data-driven machine learning, Crowdsourcing-ba... | https://facultyprofiles.hkust-gz.edu.cn/facult... | https://facultyprofiles.ust.hk/profiles.php?pr... | leichen@hkust-gz.edu.cn | E3 L5 511 |
| 1 | Pan HUI | Chair Professor | Mobile computing, Computer networking, Data an... | https://facultyprofiles.hkust-gz.edu.cn/facult... | https://panhui.people.ust.hk | panhui@hkust-gz.edu.cn | E1 L6 605 |
| 2 | Vincent Kin Nang LAU | Chair Professor | Stochastic Optimization and Analysis for wirel... | https://facultyprofiles.hkust-gz.edu.cn/facult... | https://eeknlau.home.ece.ust.hk/HKUST-Office-H... | eeknlau@ust.hk | CWB Room 2416 |
| 3 | Irene Man Chi Lo | Chair Professor | Nanotechnology for environmental application, ... | https://facultyprofiles.hkust-gz.edu.cn/facult... | http://cemclo.people.ust.hk/ | cemclo@ust.hk | CWB Campus, Room 3570 |
| 4 | Lionel Ming-Shuan NI | President | Big data, High-performance computing, Internet... | https://facultyprofiles.hkust-gz.edu.cn/facult... | https://president.hkust-gz.edu.cn/ | ni@hkust-gz.edu.cn | C1 E L7 |


Please see the attachment for details.

### (c) Interesting Finding

Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you a sense of how to use Exploratory Data Analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn about it soon in this course.


First, please install [dataprep](http://dataprep.ai).
Then, run the cell below. 
It shows a bar chart for every column. What interesting findings can you get from these visualizations? 

### My answer: 

Recent MarkupSafe releases removed `soft_unicode`, which breaks dataprep, so I tried downgrading MarkupSafe to 2.0.1. However, that version is incompatible with Python 3.11 or above, so the cells below are left commented out.

In [None]:
'''# MarkupSafe has to be downgraded because newer releases removed unicode helpers.
# The downgraded version is incompatible with Python 3.11 or above.
!pip install MarkupSafe==2.0.1
!pip install dataprep
'''

In [None]:
'''from dataprep.eda import plot

df = pd.read_csv("faculty_table.csv")
plot(df)'''

Below are some examples:

**Finding 1:** The number of Assistant Professors (~76) is more than 5x the number of Associate Professors (10).

**Questions:** Why did this happen? Is it common in CS schools around the world? Will the gap grow or shrink in five years? What actions can be taken to enlarge or shrink the gap?


**Finding 2:** The Homepage has 22% missing values. 

**Questions:** Why are there so many missing values? Is it because many faculty members do not have their own homepages, or because they have not added them to the school page? What actions can be taken to prevent this from happening in the future?
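
Findings like the two above can also be checked without a plotting library. The following is a minimal sketch using only the standard library; the four rows are made-up placeholders (not real faculty data), and in practice the rows would come from the CSV file produced in Task 1(b).

```python
import csv
import io
from collections import Counter

# Made-up sample rows for illustration only (not real faculty data).
sample_csv = """Name,Rank,Homepage
A,Assistant Professor,https://example.com/a
B,Assistant Professor,
C,Associate Professor,https://example.com/c
D,Assistant Professor,https://example.com/d
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many faculty members hold each rank.
rank_counts = Counter(row["Rank"] for row in rows)

# Fraction of rows whose Homepage field is empty.
missing_homepage = sum(1 for row in rows if not row["Homepage"].strip())
missing_ratio = missing_homepage / len(rows)

print(rank_counts.most_common())
print(f"Homepage missing: {missing_ratio:.0%}")
```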

## Task 2: Age Follows Normal Distribution?

In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is `Does HKUST(GZ) Info Hub faculty age follow a normal distribution?`

To estimate the age of a faculty member, you can collect the year in which s/he graduates from a university (`gradyear`) and then estimate `age` using the following equation:

$$age \approx 2023+23 - gradyear$$

For example, if one graduates from a university in 1990, then the age is estimated as 2023+23-1990 = 56. 
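
The estimate can be written as a small helper. This is only a sketch of the equation above; the function name and the parameter names `current_year` and `age_at_graduation` are chosen here for illustration and are not part of the assignment.

```python
def estimate_age(gradyear, current_year=2023, age_at_graduation=23):
    """Estimate age, assuming graduation at `age_at_graduation` years old."""
    return current_year + age_at_graduation - gradyear


print(estimate_age(1990))  # 56, matching the worked example
```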



### (a) Crawl Web Page

You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Yuyu LUO graduated from Tsinghua University in 2023 at [https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo](https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo). 


Please write code to download the profile pages of the Info Hub faculty members and save each page as a text file.
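
When saving many pages, it can help if each file name identifies the faculty member it belongs to. One possible approach, sketched below, derives the file name from the last two path segments of the profile URL. The helper name `profile_filename` is hypothetical, and the sketch assumes all profile URLs follow the pattern of the example above.

```python
from urllib.parse import urlparse


def profile_filename(profile_url):
    """Build a file name from the last two path segments of a profile URL."""
    segments = [s for s in urlparse(profile_url).path.split("/") if s]
    return "_".join(segments[-2:]) + ".txt"


url = "https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo"
print(profile_filename(url))  # LUO-Yuyu_yuyuluo.txt
```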

### My answer:

In [7]:
# call Edgedriver
driver = webdriver.Edge(executable_path='C:/Users/cindy/OneDrive - HKUST (Guangzhou)/DSC/msedgedriver.exe')

# go to desired webpage
try: 
    driver.get('https://facultyprofiles.hkust-gz.edu.cn/')
    #go to info hub
    driver.find_element_by_xpath('//*[@id="app"]/section/section/div/ul[1]/li[3]').click()
    time.sleep(3)
    links = driver.find_elements_by_css_selector('button.el-button.el-button--text.more-btn')

    for i,link in enumerate(links,start=1): 
        #scroll to the n-th link, it may be out of the initially visible area
        link.click()  
        time.sleep(1)
        all_handles = driver.window_handles
        driver.switch_to.window(all_handles[1])
        new_page_source = driver.page_source
        #find corresponding table for info
        with open(f'profile_{i}.txt', 'w', encoding='utf-8') as file:
            file.write(new_page_source)
        time.sleep(1)
        driver.close()
        # Switch back to the first window or tab (index 0)
        driver.switch_to.window(driver.window_handles[0])

except Exception as e:
    print(f'An error has occurred: {str(e)}')

finally:
    # Close the WebDriver
    driver.quit()



### (b) Extract Structured Data

Please write code to extract the earliest graduation year (e.g., 2023 for Dr. Yuyu LUO) from each profile page, and create a csv file like [faculty_grad_year.csv](./faculty_grad_year.csv). 
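
Once the education text of a profile has been extracted, picking the earliest year is a small text-processing step. The sketch below is one way to do it with a regular expression. The sample string is hypothetical, and it assumes years are written as four digits starting with 19 or 20; real profile pages may be formatted differently.

```python
import re

# Four-digit years starting with 19 or 20, not embedded in longer numbers.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def earliest_year(text):
    """Return the smallest four-digit year found in text, or None."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return min(years) if years else None


# Hypothetical education text for illustration only.
sample = "PhD in Computer Science, 2012; BEng in Computer Science, 2006"
print(earliest_year(sample))  # 2006
```

Returning `None` when no year is found makes missing data explicit instead of silently producing an empty or malformed value.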

### My answer:

In [50]:
# locate the path
directory = 'C:/Users/cindy/OneDrive - HKUST (Guangzhou)/DSC/A1'

In [59]:

# Initialize an empty list to store the data
data = []

# Iterate over each saved profile page in the directory
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        # Parse the saved profile page
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            soup = BeautifulSoup(file, 'html.parser')

        # Get the faculty member's name
        name = soup.find('h2', {'class': 'english-name'}).get_text(strip=True)

        # Find the second <p> tag within the first degree-detail <div>
        degree_detail_div = soup.find('div', {'class': 'degree-detail', 'data-v-bbc02c4e': ''})
        second_p_tag = degree_detail_div.find_all('p')[1]

        # Extract its text and keep only the digits (the year)
        graduation = second_p_tag.get_text(strip=True)
        year = re.sub(r'\D', '', graduation)
        data.append({'name': name, 'gradyear': year})

csv_filename = 'faculty_grad_year.csv'

# Write the data to a CSV file
with open(csv_filename, mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['name', 'gradyear']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    # Write the header row
    writer.writeheader()

    # Write the data rows
    writer.writerows(data)

print(f'Data has been written to {csv_filename}')


Data has been written to faculty_grad_year.csv


Please see the attachment for details.

### (c) Interesting Finding

### My answer: 


In [None]:
# Create a DataFrame from the data collected in Task 2(b)
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
csv_filename = 'faculty_grad_year.csv'
df.to_csv(csv_filename, index=False)

print(f'Data saved to {csv_filename}')

Similar to Task 1(c), you don't need to do anything here. Just look at different visualizations w.r.t. age and give yourself an answer to the question: `Does HKUST(GZ) Info Hub faculty age follow a normal distribution?`

In [None]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2023+23-df["gradyear"]

plot(df, "age")
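
As a rough numerical complement to the plots, skewness and excess kurtosis can be computed with the standard library alone; both are close to 0 for normally distributed data. This is a sketch under stated assumptions, not a formal normality test: the ages below are made up for illustration, and the helper name is hypothetical.

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population moments."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    n = len(values)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3
    return skew, kurt


# Made-up ages for illustration only (not real faculty data).
ages = [29, 31, 33, 35, 35, 37, 39, 41, 48, 56]
skew, kurt = skewness_and_excess_kurtosis(ages)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A clearly positive skewness would indicate a long tail of older faculty members, which is one way the distribution could depart from normal.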

## Submission

Complete the code in this notebook, and submit it to the Canvas assignment `Assignment 1`.