# Introduction

This notebook showcases a method for collecting the entire dataset with Selenium, along with a clumsy initial way of storing it.

# Scraping

## Onboarding and Requirements

We need the following packages to properly scrape the data stored at cape.ucsd.edu. This part of the code could not have been written without the earlier code from Seascape (https://github.com/dcao/seascape) and Smartercape (https://github.com/andportnoy/smartercapes.com).

In [1]:
import pandas as pd
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

## Functions that run the code

For the function below to work correctly, Firefox must be installed along with its webdriver. The function returns a pandas dataframe containing the general information for every course that has been CAPEd. If you only need the general information from CAPE, it is advised that you assign the result of this function to a variable and write the dataframe to a CSV for further analysis.

In [2]:
def get_raw_cape_dataframe():
    
    # launch browser using Selenium, need to have Firefox installed
    print('Opening a browser window...')
    s=Service(GeckoDriverManager().install())
    driver = webdriver.Firefox(service=s)
    print('Browser window open, loading the page...')

    # get the page that lists all the data, first try
    driver.get('https://cape.ucsd.edu/responses/Results.aspx')
    print('Please enter credentials...')

    # wait until SSO credentials are entered
    wait = WebDriverWait(driver, 60)
    element = wait.until(expected_conditions.title_contains('Course And Professor Evaluations (CAPE)'))

    # get the page that lists all the data
    # (%2C is an encoded comma; it returns all the data since every professor name contains one)
    driver.get('https://cape.ucsd.edu/responses/Results.aspx?Name=%2C')

    # read in the dataset from the html file
    df = pd.read_html(driver.page_source)[0]
    print('Dataset parsed, closing browser window.')

    # destroy driver instance
    driver.quit()

    return df   
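
The `?Name=%2C` query used above is just a percent-encoded comma. As a small sketch of how that URL can be built rather than typed by hand, the snippet below uses the standard library; `build_results_url` is a hypothetical helper introduced only for illustration and is not part of the notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://cape.ucsd.edu/responses/Results.aspx"


def build_results_url(name_filter: str) -> str:
    """Return the results URL with name_filter percent-encoded in the Name parameter."""
    # hypothetical helper: urlencode percent-encodes the comma for us
    return BASE_URL + "?" + urlencode({"Name": name_filter})


encoded_comma = quote(",")
all_results_url = build_results_url(",")
print(encoded_comma)
print(all_results_url)
```

Searching for a bare comma works as a catch-all only because, as the comment in the function notes, every professor name on the site contains one.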

The function below returns a BeautifulSoup object, using the same method.

In [3]:
def get_raw_html():
    # launch browser using Selenium, need to have Firefox installed
    print('Opening a browser window...')
    s=Service(GeckoDriverManager().install())
    driver = webdriver.Firefox(service=s)
    print('Browser window open, loading the page...')

    # get the page that lists all the data, first try
    driver.get('https://cape.ucsd.edu/responses/Results.aspx')
    print('Please enter credentials...')

    # wait until SSO credentials are entered
    wait = WebDriverWait(driver, 60)
    element = wait.until(expected_conditions.title_contains('Course And Professor Evaluations (CAPE)'))
    print('The driver is now loading the page, please wait patiently as it may take longer than\n' +
    'a few minutes to complete due to the large size of the CAPE dataset')

    # get the page that lists all the data
    # (%2C is an encoded comma; it returns all the data since every professor name contains one)
    driver.get('https://cape.ucsd.edu/responses/Results.aspx?Name=%2C')
    print('Finished loading, currently parsing the soup object...')
    
    # creating soup object
    soup = BeautifulSoup(driver.page_source)
    print('Dataset parsed, closing browser window.')

    # destroy driver instance
    driver.quit()

    return soup

To get the full details of each CAPE, we need the link to every course page so that the pages can be scraped one by one. The function below collects all the links that the general results page provides.

In [4]:
def get_courseID(soup):
    bs4obj = soup.find_all("a", {"href" : True, "id": re.compile("ContentPlaceHolder1_gvCAPEs_hlViewReport_.*")})
    return_lst = []
    for query in bs4obj:
        return_lst.append(query.get("href"))
    return return_lst
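
To see what this link extraction does without opening a browser, here is a minimal stand-in that uses only the standard library `html.parser` instead of BeautifulSoup. The sample HTML and its section ids are made up for illustration, and `ReportLinkParser` is a hypothetical name; only the id pattern comes from the function above.

```python
import re
from html.parser import HTMLParser

# same id pattern that get_courseID passes to BeautifulSoup
REPORT_ID = re.compile(r"ContentPlaceHolder1_gvCAPEs_hlViewReport_.*")


class ReportLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose id matches REPORT_ID."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href and REPORT_ID.search(attr.get("id") or ""):
            self.links.append(href)


# made-up sample rows; the section ids are not real
SAMPLE_HTML = """
<a id="ContentPlaceHolder1_gvCAPEs_hlViewReport_0" href="CAPEReport.aspx?sectionid=111111">A</a>
<a id="ContentPlaceHolder1_gvCAPEs_hlViewReport_1" href="../scripts/detailedStats.asp?SectionId=222222">B</a>
<a id="somethingElse" href="Other.aspx">C</a>
"""

parser = ReportLinkParser()
parser.feed(SAMPLE_HTML)
report_links = parser.links
print(report_links)
```

The links come back relative, which is why the scraping loop later prepends `https://cape.ucsd.edu/responses/` to each one.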

After getting all the course links, the pages must be requested one by one to pull the data. This function takes a page URL and returns a BeautifulSoup object for that page.

In [5]:
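def get_bs4(page, driver):
    # driver is an already logged-in Selenium webdriver; it is passed in
    # because no global driver exists in this notebook
    driver.get(page)
    soup = BeautifulSoup(driver.page_source)
    return soup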

This function takes a soup object and returns a nested dictionary: each top-level value is itself a dictionary, the values of that dictionary are also dictionaries, and the values of those innermost dictionaries are lists.

In [6]:
def question_collector(bs4obj):
    question_text = bs4obj.find_all("span", {"id": re.compile("ContentPlaceHolder1_dlQuestionnaire_lblQuestionText_.*")})
    dct_key = []
    general_info_dct = dict(zip(["Course", "Instructor", "Qtr/Yr", "Enrollment", "CAPEs Returned"], 
                                           [[i.text] for i in bs4obj.find_all(
                                               "span", {"id": re.compile("ContentPlaceHolder1_lbl.*"), 
                                                        "style": "font-weight:bold;"})]))
    general_info_dct["Question"] = "General Info"
    return_dct = {"General Info": {"General Info": general_info_dct}}
    counter = 0
    for qt in question_text:
        if qt.text.split()[0].isupper() and len(qt.text.split()[0]) > 1:
            dct_key.append(qt.text)
            return_dct[qt.text] = {}
            counter += 1
        else:
            if len(bs4obj.find_all("span", {"id": re.compile("ContentPlaceHolder1_dlQuestionnaire_rptChoiceTexts_" 
                                                                + str(counter)+"_.*")})) != 0:
                col = dict(zip([i.text for i in bs4obj.find_all("span", {"id": re.compile("ContentPlaceHolder1_dlQuestionnaire_rptChoiceTexts_" 
                                                                    + str(counter)+"_.*")})],
                                [[i.get_text(separator=" ").split(" ")[0]] for i in bs4obj.find_all("span", {"id": re.compile("ContentPlaceHolder1_dlQuestionnaire_rptChoices_" 
                                                                    + str(counter)+"_.*")})]))
                col_copy = col.copy()
                col_copy.update({"Question": qt.text})
                return_dct[dct_key[-1]].update({qt.text: col_copy})
                counter += 1
            else:
                col = dict(zip(col.keys(), [[i.get_text(separator=" ").split(" ")[0]] for i in bs4obj.find_all("span", {"id": re.compile("ContentPlaceHolder1_dlQuestionnaire_rptChoices_" 
                                                                    + str(counter)+"_.*")})]))
                col_copy = col.copy()
                col_copy.update({"Question": qt.text})
                return_dct[dct_key[-1]].update({qt.text: col_copy})
                counter += 1
    return return_dct
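
The shape of the returned dictionary is easier to follow with a concrete example. The sketch below hand-builds a tiny dictionary with the same three levels (section, then question, then column) and walks it; every value, section name and question text in it is made up, and `flatten_report` is a hypothetical helper, not part of the notebook.

```python
# hand-built stand-in for the output of question_collector; all values are made up
SAMPLE_REPORT = {
    "General Info": {
        "General Info": {
            "Course": ["EXAMPLE 1"],
            "Instructor": ["Doe, Jane"],
            "Question": "General Info",
        }
    },
    "EXAMPLE SECTION": {
        "Hypothetical question text": {
            "Agree": ["10"],
            "Disagree": ["2"],
            "Question": "Hypothetical question text",
        }
    },
}


def flatten_report(report):
    """Turn the section -> question -> column nesting into a list of flat rows."""
    rows = []
    for section, questions in report.items():
        for question, columns in questions.items():
            row = {"Section": section}
            for column, value in columns.items():
                # most values are single-item lists; "Question" is a plain string
                row[column] = value[0] if isinstance(value, list) else value
            rows.append(row)
    return rows


flat_rows = flatten_report(SAMPLE_REPORT)
for row in flat_rows:
    print(row)
```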

This function flattens the multi-level dictionary into a single dataframe. The resulting dataframe has many null values, because the sole goal of this function is to create a single dataframe that is easy to store.

In [7]:
def question_collector_to_pd(bs4obj):
    dict_collect = question_collector(bs4obj)
    return_dct = {}
    for key, value in dict_collect.items():
        for in_key, in_value in value.items():
            if len(value[in_key]) != 1:
                return_dct[in_key] = pd.DataFrame(data = in_value)
                return_dct[in_key].insert(0, "course", 
                                          bs4obj.find("span", {"id": "ContentPlaceHolder1_lblCourseDescription"})
                                          .text.split("(")[0].strip())
                return_dct[in_key].insert(0, "term", 
                                          bs4obj.find("span", {"id": "ContentPlaceHolder1_lblTermCode"}).text)
    df_return = pd.DataFrame([])
    for values in return_dct.values():
        df_return = pd.concat([df_return, values])
    df_return = df_return.reset_index()
    return df_return
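
The null values come from concatenating small dataframes whose columns do not line up. The sketch below reproduces that effect with two made-up single-row frames; the column names and values are illustrative only, and it uses `reset_index(drop=True)` for brevity where the function above keeps the old index as a column.

```python
import pandas as pd

# two made-up frames with different columns, like the per-question frames above
general = pd.DataFrame({"Course": ["EXAMPLE 1"], "Question": ["General Info"]})
question = pd.DataFrame(
    {"Agree": ["10"], "Disagree": ["2"], "Question": ["Hypothetical question"]}
)

# concat aligns on column names and fills every missing cell with NaN
combined = pd.concat([general, question]).reset_index(drop=True)
null_count = int(combined.isna().sum().sum())

print(combined)
print("null cells:", null_count)
```

Each row only fills the columns of its own question type, so the more answer layouts a page has, the sparser the combined frame becomes.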

# The final function

In [8]:
def get_raw_html_noquit():
    # launch browser using Selenium, need to have Firefox installed
    print('Opening a browser window...')
    s=Service(GeckoDriverManager().install())
    driver = webdriver.Firefox(service=s)
    print('Browser window open, loading the page...')

    # get the page that lists all the data, first try
    driver.get('https://cape.ucsd.edu/responses/Results.aspx')
    print('Please enter credentials...')

    # wait until SSO credentials are entered
    wait = WebDriverWait(driver, 60)
    element = wait.until(expected_conditions.title_contains('Course And Professor Evaluations (CAPE)'))
    print('Now loading...\nPlease wait patiently as it may take longer than ' +
    'a few minutes to complete due to the large size of the CAPE dataset')

    # get the page that lists all the data
    # (%2C is an encoded comma; it returns all the data since every professor name contains one)
    driver.get('https://cape.ucsd.edu/responses/Results.aspx?Name=%2C')
    print('Finished loading, currently parsing the soup object...')
    
    # creating soup object
    soup = BeautifulSoup(driver.page_source)
    print('Soup parsed.')

    return soup, driver

In [50]:
def scrape_func():
    soup, driver = get_raw_html_noquit()
    print("Getting a list of course IDs for scraping the individual pages...")
    course_lst = get_courseID(soup)
    print('This browser window will now keep changing until the last cape record has been recorded...')
    return_pd = pd.DataFrame([])
    counter = 37 # numbers the CSV chunk files that are output
    start_index = course_lst.index("CAPEReport.aspx?sectionid=779432") + 1
    course_link = "https://cape.ucsd.edu/responses/" + course_lst[start_index]
    # get each individual course page
    driver.get(course_link)
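    # course is the index of the page currently loaded in the browser and
    # course + 1 is fetched next, so stop one short of the end of the list
    for course in range(start_index, len(course_lst) - 1):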
        # creating soup object
        soup = BeautifulSoup(driver.page_source)
        # skip pages that show an error message, since some links on the CAPE page are broken
        error_span = soup.find("span", {"class": "errormessage"})
        if error_span is not None and error_span.text != "":
            print("Error message found")
            print(course_link)
            course_link = "https://cape.ucsd.edu/responses/" + course_lst[course + 1]
            driver.get(course_link)
            continue
        try:
            # Getting the actual grade received distribution
            df = pd.read_html(driver.page_source)
            # CAPE used an older page format before FA12, so stop at the first old-format link
            if course_lst[course + 1] == "../scripts/detailedStats.asp?SectionId=749583":
                break
            # the section id in the link may occasionally contain letters as well as digits
            course_link = "https://cape.ucsd.edu/responses/" + course_lst[course + 1]
            # get each individual course page
            try:
                driver.get(course_link)
            except Exception:
                # retry once before giving up
                try:
                    driver.get(course_link)
                except Exception:
                    print("Unfortunately, an error occurred; now outputting the remaining data to a csv...")
                    print(course_link)
                    # course_link carries the full URL prefix, so report the list index directly
                    print(course + 1)
                    # output the collected data to a file named realdata.csv
                    return_pd.to_csv("realdata.csv")
                    print(course_link)
                    break
            
            # return an individual dataframe object for one page
            return_pd = pd.concat([return_pd, question_collector_to_pd(soup)])
            if len(df) > 2:
                df = df[2]
                df["term"] = return_pd["term"].iloc[-1]
                df["course"] = return_pd["course"].iloc[-1]
                df["Question"] = "Grade received"
                return_pd = pd.concat([return_pd, df])
            if return_pd.shape[0] > 50000:
                print("We've scrapped the " + str(counter) + "th 50000 row in our data, ouputing a csv...")
                return_pd.to_csv("realdata" + str(counter) + ".csv")
                return_pd = pd.DataFrame([])
                print("The " + str(counter) + "th 50000 row has been successfully output as " + "realdata" + str(counter) + ".csv")
                print("Last Entry is:")
                print(course_link)
                counter += 1
        except Exception:
            print("Unfortunately, an error occurred; now outputting the remaining data to a csv...")
            print(course_link)
            print(course)
            # output the collected data to a file named realdata.csv
            return_pd.to_csv("realdata.csv")
            break
    
    # destroy driver instance
    driver.quit()
    print("Congratulations! The last record has been recorded, now outputting the data to a csv...")
    print(course_link)
    # output the collected data to a file named realdata.csv
    return_pd.to_csv("realdata.csv")
    print('The remaining data has been output to a file named "realdata.csv". Goodbye!')
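
The storage strategy inside `scrape_func` (collect rows, write a numbered CSV once enough have piled up, then start over) can be isolated from the scraping itself. The sketch below shows that pattern with the standard library `csv` module instead of pandas. `write_in_chunks`, its parameters and the sample rows are all hypothetical, and it flushes at exactly `chunk_size` rows, whereas the loop above flushes once the frame exceeds 50000 rows.

```python
import csv
import os
import tempfile


def write_in_chunks(rows, fieldnames, out_dir, chunk_size, prefix="realdata"):
    """Write rows to numbered CSV files of at most chunk_size rows; return the paths."""
    written = []
    buffer = []

    def flush():
        # the next file number is simply how many files exist so far
        path = os.path.join(out_dir, f"{prefix}{len(written)}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(buffer)
        written.append(path)
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= chunk_size:
            flush()
    if buffer:
        # write whatever is left over after the last full chunk
        flush()
    return written


# made-up rows standing in for scraped records
sample_rows = [{"term": "T", "course": f"EXAMPLE {i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    chunk_paths = write_in_chunks(sample_rows, ["term", "course"], tmp_dir, chunk_size=2)
    chunk_names = [os.path.basename(path) for path in chunk_paths]
    with open(chunk_paths[-1], newline="") as handle:
        last_chunk_rows = list(csv.DictReader(handle))

print(chunk_names)
```

Writing in chunks keeps memory use bounded and means a crash late in a long run loses at most one chunk of data.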

In [51]:
def lst_tester():
    soup, driver = get_raw_html_noquit()
    print("Getting a list of course IDs for scraping the individual pages...")
    course_lst = get_courseID(soup)
    return course_lst

In [52]:
cookies = scrape_func()



Current firefox version is 100.0
Get LATEST geckodriver version for 100.0 firefox


Opening a browser window...


Driver [/Users/edwint/.wdm/drivers/geckodriver/macos/v0.31.0/geckodriver] found in cache


Browser window open, loading the page...
Please enter credentials...
Now loading...
Please wait patiently as it may take longer than a few minutes to complete due to the large size of the CAPE dataset
Finished loading, currently parsing the soup object...
Soup parsed.
Getting a list of course IDs for scraping the individual pages...
This browser window will now keep changing until the last cape record has been recorded...
We've scrapped the 37th 50000 row in our data, ouputing a csv...
The 37th 50000 row has been successfully output as realdata37.csv
Last Entry is:
https://cape.ucsd.edu/responses/CAPEReport.aspx?sectionid=772394
We've scrapped the 38th 50000 row in our data, ouputing a csv...
The 38th 50000 row has been successfully output as realdata38.csv
Last Entry is:
https://cape.ucsd.edu/responses/CAPEReport.aspx?sectionid=765020
We've scrapped the 39th 50000 row in our data, ouputing a csv...
The 39th 50000 row has been successfully output as realdata39.csv
Last Entry is:
https:

In [32]:
import numpy as np
# assumption: lst_check is meant to hold the list of course links returned by lst_tester()
lst_check = lst_tester()
np_lstarr = np.asarray(lst_check)
split_arr_check = np.char.split(np_lstarr, sep = "=")

In [None]:
split_arr_check

In [None]:
np_lstarr[-1]

In [None]:
"936287A".strip("[A-z]+")

In [None]:
# strip removes a set of characters rather than a prefix, so it can also eat
# trailing letters of the id; take the text after the last "=" instead
lst_series = pd.Series(np_lstarr).str.split("=").str[-1]
lst_series[lst_series.str.contains("E")]
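
The `"936287A".strip("[A-z]+")` experiment in the cell above is worth unpacking, because `str.strip` does not take a regular expression: its argument is a set of characters to remove from both ends. The sketch below contrasts that with `re.sub` and with splitting on `=`. The link strings reuse the id from that cell, and `section_id` is a hypothetical helper.

```python
import re

# strip treats "[A-z]+" as the characters '[', 'A', '-', 'z', ']', '+', not as a regex
stripped_a = "936287A".strip("[A-z]+")  # 'A' is in the set, so it is removed
stripped_b = "936287B".strip("[A-z]+")  # 'B' is not in the set, so nothing changes

# a real regex removes any trailing letters
regex_b = re.sub(r"[A-Za-z]+$", "", "936287B")

# stripping the prefix as a character set also eats the trailing 'A' of the id
link = "CAPEReport.aspx?sectionid=936287A"
lossy_id = link.strip("CAPEReport.aspx?sectionid=")


def section_id(report_link: str) -> str:
    """Return everything after the last '=' in a report link (hypothetical helper)."""
    return report_link.split("=")[-1]


safe_id = section_id(link)
print(stripped_a, stripped_b, regex_b, lossy_id, safe_id)
```

Splitting on `=` keeps any letters in the id intact, which matters when the goal is to find ids that contain a letter.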