Crawler for Department of Computer Science (Type 3)
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Metadata of CS Courses for an Academic Year
- All courses offered in 2022-23 are listed in HTML table format
- We use BeautifulSoup for the crawler

We crawl the following fields:
1. Year
2. Type (Core/Elective)
3. Course Code
4. Course Title
5. Term
6. Staff (can have multiple names)
7. Moodle link
8. Course description link
9. Staff link
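
The fields above come from table cells that hold a piece of text and, often, one or more links (the Staff cell can name several people). As a minimal sketch of that extraction, the snippet below parses a made-up table with BeautifulSoup. The markup in `SAMPLE_HTML`, the `example.org` URLs, and the helper name `extract_cells` are all illustrative assumptions; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real course listing may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Course Code</th><th>Course Title</th><th>Term</th><th>Staff</th></tr>
  <tr>
    <td><a href="https://example.org/moodle/1">COMP0000</a></td>
    <td><a href="https://example.org/course/comp0000.html">Sample Course</a></td>
    <td>1</td>
    <td><a href="https://example.org/staff/a">Dr. A</a>, <a href="https://example.org/staff/b">Dr. B</a></td>
  </tr>
</table>
"""

def extract_cells(html):
    """Return, for each data row, a list of (text, [hrefs]) pairs, one per <td>."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = []
        for td in tr.find_all('td'):
            text = td.get_text().strip()
            # collect every link in the cell, since a cell may hold several
            hrefs = [a['href'] for a in td.find_all('a', href=True)]
            cells.append((text, hrefs))
        if cells:  # header rows contain only <th>, so they yield no cells
            rows.append(cells)
    return rows

sample_rows = extract_cells(SAMPLE_HTML)
```

Keeping the links as a list per cell means a Staff cell with two names keeps both staff links instead of only the first.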

In [3]:
from bs4 import BeautifulSoup
from time import sleep
import requests

**[Helper Function] Retrieve HTML of the metatable**

In [121]:
def get_meta_table_html(academic_year):
    # define url to crawl based on academic year
    META_URL = 'https://www.cs.hku.hk/index.php/programmes/course-offered?acadYear=' + academic_year

    # fetch the page and parse it with BeautifulSoup
    meta_page = requests.get(META_URL)
    meta_soup = BeautifulSoup(meta_page.text, 'html.parser')

    # get the second table on the page (which is the one we want to crawl)
    meta_table = meta_soup.find_all('table')[1]

    return meta_table
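
The helper builds its URL by string concatenation, which only works when `academic_year` is already a URL-safe string. A possible alternative, sketched here with the standard library, is to let `urlencode` build the query string. The function name `build_meta_url` is hypothetical, and `'2022'` is only an assumed format for the `acadYear` value.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.cs.hku.hk/index.php/programmes/course-offered'

def build_meta_url(academic_year):
    """Build the listing URL, percent-encoding the academic year value."""
    # urlencode converts the value to str, so an int such as 2022 also works
    return BASE_URL + '?' + urlencode({'acadYear': academic_year})

url = build_meta_url('2022')  # '2022' is an assumed value format
```

With this approach, passing an integer no longer raises a `TypeError`, and any reserved characters in the value are escaped.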

**[Function] Retrieve metatable dataframe of all courses offered in an academic year**

In [122]:
def get_meta_df(academic_year):
    # call helper function to get metatable HTML
    meta_table = get_meta_table_html(academic_year)

    # get all the rows in this table
    meta_trs = meta_table.find_all('tr')

    # get data in each row
    rows = []
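    for tr in meta_trs:
        row = []
        # look only at the cells of this row, not the whitespace between them
        for td in tr.find_all(['td', 'th'], recursive=False):
            if td.text == '\n':
                continue
            row.append(td.text)
            # record the cell's first hyperlink, if it has one
            anchor = td.find('a')
            if anchor is not None and anchor.get('href') is not None:
                row.append(anchor.get('href'))

        # only keep rows with more than 4 values (cell texts plus links)
        if len(row) > 4:
            rows.append(row)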

    # define column names
    column_names = ['Course Code', 'Moodle Link', 'Course Title', 'Course Link', 'Term', 'Staff', 'Staff Link']
    
    # convert matrix into a dataframe
    meta_df = pd.DataFrame(rows, columns=column_names)

    # TODO: decide whether the staff link should be kept
    # for now, drop the staff link
    meta_df.drop(columns=['Staff Link'], inplace=True)

    return meta_df
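
Because each row is a flat list of texts and links, a cell without a link makes every later value shift one column to the left. One way to guard against that, sketched here in plain Python, is to map each cell onto named columns. The names `FIELDS`, `LINK_COLUMNS`, and `row_to_record` are hypothetical, and the cell order is assumed from the column names used in `get_meta_df`.

```python
# Order of the cells in one table row; assumed from the column names in get_meta_df.
FIELDS = ['Course Code', 'Course Title', 'Term', 'Staff']
# Fields whose cell is expected to carry a hyperlink, and the column that stores it.
LINK_COLUMNS = {
    'Course Code': 'Moodle Link',
    'Course Title': 'Course Link',
    'Staff': 'Staff Link',
}

def row_to_record(cells):
    """Map a list of (text, link_or_None) pairs onto named columns."""
    record = {}
    for field, (text, link) in zip(FIELDS, cells):
        record[field] = text
        if field in LINK_COLUMNS:
            # stays None when the cell has no link, so columns never shift
            record[LINK_COLUMNS[field]] = link
    return record

example = row_to_record([
    ('COMP0000', None),  # made-up row with no Moodle link
    ('Sample Course', 'https://example.org/course'),
    ('1', None),
    ('Dr. A', 'https://example.org/staff/a'),
])
```

A list of such records can be passed straight to `pd.DataFrame`, which takes its column names from the dictionary keys.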

# 2. Detailed Course Info Page
**Learning Objectives**
1. List of objectives
2. Table with course-programme outcome mapping

**Syllabus**
1. Calendar entry
2. Detailed description (table form)
3. Assessment

**Helper functions**

In [141]:
# retrieve a specific table HTML (using table index) on a specific course page
def get_specific_table_html_for_course(COURSE_URL, info_type):
    # first retrieve all table HTMLs on the course info page
    course_page = requests.get(COURSE_URL)
    course_soup = BeautifulSoup(course_page.text, 'html.parser')
    course_tables = course_soup.find_all('table')

    # TODO: need to identify class first and then search for the tables
    # until then, info_type is ignored and all tables on the page are returned
    return course_tables


In [142]:
get_specific_table_html_for_course('https://www.cs.hku.hk/index.php/programmes/course-offered?infile=2022/comp1117.html', 'basic')

## 2.1 Basic course information
1. Course name
2. Instructor(s)
3. Number of credits
4. Recommended learning hours (?)
5. Pre-requisite(s)
6. Co-requisite(s)
7. Mutually exclusive with: ENGG1111 or ENGG1330
8. Remarks
9. Moodle course link
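
If the basic information turns out to be a two-column table with a label in the first cell and a value in the second, it could be read into a dictionary roughly as follows. This layout is an assumption, not something confirmed from the course page; `SAMPLE_INFO_HTML` and `parse_label_value_table` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real course info page may be structured differently.
SAMPLE_INFO_HTML = """
<table>
  <tr><td>Course name</td><td>Sample Course</td></tr>
  <tr><td>Number of credits</td><td>6</td></tr>
  <tr><td>Pre-requisite(s)</td><td></td></tr>
</table>
"""

def parse_label_value_table(html):
    """Turn rows of (label, value) cells into a dict; empty values become None."""
    soup = BeautifulSoup(html, 'html.parser')
    info = {}
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) != 2:
            continue  # skip rows that do not follow the assumed layout
        label = tds[0].get_text(strip=True)
        value = tds[1].get_text(strip=True)
        info[label] = value or None
    return info

basic_info = parse_label_value_table(SAMPLE_INFO_HTML)
```

Storing empty cells as `None` keeps fields such as pre-requisites distinguishable from fields that are missing from the page altogether.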