Crawler for JUPAS Students (Type 2)
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

---
# 1. Metadata of HKU Jupas Programmes
- All programmes offered are listed in HTML table format
- We use Beautiful Soup for the crawler

We crawl the following fields:
1. Institutition/Scheme
2. JUPAS Catalogue No. (+ link)
3. Funding Category
4. Programme Short Name
5. Programme Full Title

In [2]:
from bs4 import BeautifulSoup
from time import sleep
import requests

## Helper functions

In [10]:
# helper function
# retrieve HTML of the metatable
# return: STRING (original html of the metatable)
def get_meta_table_html():
    META_URL = 'https://www.jupas.edu.hk/en/programme/hku/'

    # set up beautiful soup configurations
    meta_page = requests.get(META_URL)
    meta_soup = BeautifulSoup(meta_page.text, 'html.parser')

    # get the table with all the programmes
    meta_table = meta_soup.find('table')

    return meta_table

In [63]:
# helper function
# retrieve metatable dataframe of all HKU-JUPAS programmes offered
# return: DATAFRAME

def get_meta_df():
    # call helper function to get metatable HTML
    meta_table = get_meta_table_html()

    # get all the rows in this table
    meta_trs = meta_table.find_all('tr')

    # get data in each row
    rows = []
    for tr in meta_trs:
        row = []
        for td in tr:
            try:
                if td.text != ' ':
                    row.append(td.text)
                    link = td.find('a').get('href')
                    if link != None:
                        row.append(link)
            except:
                continue

        if len(row) == 5: # header row
            row.insert(2, 'JUPAS Catalogue Link')
        
        rows.append(row) # append to matrix
    
    # convert matrix into a dataframe
    df_meta = pd.DataFrame(rows[1:], columns=rows[0])

    return df_meta
        

## Start scraping

In [65]:
get_meta_df().head()

Unnamed: 0,Institution / Scheme,JUPAS Catalogue No.,JUPAS Catalogue Link,Funding Category,Programme Short Name,Programme Full Title
0,HKU,JS6004,/en/programme/hku/JS6004,UGC-funded,BA(AS),Bachelor of Arts in Architectural Studieså»ºç¯...
1,HKU,JS6016,/en/programme/hku/JS6016,UGC-funded,BSC(SURV),Bachelor of Science in Surveyingçå­¸å£«(æ¸¬é...
2,HKU,JS6028,/en/programme/hku/JS6028,UGC-funded,BA(LS),Bachelor of Arts in Landscape Studiesåå¢å­¸...
3,HKU,JS6042,/en/programme/hku/JS6042,UGC-funded,BA(US),Bachelor of Arts in Urban Studiesæå­¸å£«(å...
4,HKU,JS6054,/en/programme/hku/JS6054,UGC-funded,BA,Bachelor of Artsæå­¸å£«


---
# 2. Detailed JUPAS Programme Info Page
We crawl the following:
1. JUPAS catalogue no.
2. Programme name
3. Programme short name
4. Short Description
5. Programme website
6. Programme entrance requirements
7. General entrance requirements
8. First year tuition fee
9. Corresponding HKU department
10. HKU department email
--

10. Study level
11. Duration of study
12. First year intake
13. Interview arrangements
14. Funding category

## Helper functions

In [106]:
# helper function
# retrieve the basic info of a specific programme page
# return: DATAFRAME

def get_specific_program_basic_info(PROGRAM_URL):
    # first retrieve HTML for the program
    program_page = requests.get(PROGRAM_URL)
    program_soup = BeautifulSoup(program_page.text, 'html.parser')
    program_html = program_soup.find('main')

    # define column names
    col_names = [
        'JUPAS Catalogue No.', 
        'Programme Name', 
        'Programme Short Name',
        'Short Description',
        'Programme Website',
        'Programme Entrance Requirements',
        'General Entrance Requirements',
        'First Year Tuition Fee',
        'Corresponding HKU department'
    ]

    # define row
    row = []

    # 1. JUPAS catalogue no.
    field = program_html.find('p', {'class': 'programCode_block'}).text
    row.append(field)

    # 2. Programme name
    field = program_html.find('span', {'class': 'before_label'}).text
    row.append(field)

    # 3. Short description
    field = program_html.find('h2', {'class': 'program_title_shortname'}).text
    row.append(field)

    # 4. Short description
    field = program_html.find('div', {'class': 'ckec'}).text.replace('Â\xa0', ' ')
    row.append(field)

    # 5. Programme website
    field = program_html.find('div', {'class': 'ckec link_break'}).text.replace(' ', '')
    row.append(field)

    # 6. Programme entrance requirements
    # TODO

    # 7. General entrance requirements
    # TODO

    # 8. First year tuition fee
    field = program_html.find('p', {'class': 'linetext'}).text
    row.append(field)

    # 9. Corresponding HKU department
    enquiries_html = program_html.find('div', {'class': 'enquiries_section'})
    field = enquiries_html.find('p', {'class': 'enquiries_block-top'}).text
    row.append(field)

    # 10. HKU department email
    fields = enquiries_html.find('a')
    for field in fields:
        row.append(field.text)

    print (row)


In [108]:
get_specific_program_basic_info('https://www.jupas.edu.hk/en/programme/hku/JS6004/')

['JS6004', 'Bachelor of Arts in Architectural Studies', 'BA(AS)', ' The Bachelor of Arts in Architectural Studies BA(AS) is a four-year, full-time programme that introduces students to architecture as a cultural discipline.  It is the first degree needed to qualify as an architect, and provides an excellent general education that gives students the confidence to pursue related careers both within the design professions and further afield.  The course of study is centered on research based learning in the design studio and is structured to foster a sense of community, develop deep knowledge of the discipline, and encourage creativity.  Design studios look at diverse contemporary issues relevant to Hong Kong and the world and are structured by thematic platforms that introduce students to a range of methodologies and approaches to the territorial, social, technological, urban and environmental issues that comprise architecture as a cultural discipline. The BA(AS) program offers numerous 