Web Form Autofiller for University Major Recommendation (Type 1)
---

**Overall logic**
1. We use a web crawler to automatically fill in an online questionnaire that asks a student a series of personality- and academics-related questions, answering each question at random.
2. During the autofill process, we record the answers the crawler selected and the majors the questionnaire recommends for them. We repeat this process many times to build our own dataset.
3. Using our collected dataset, we train our own major recommendation model.
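
Before a model can be trained in step 3, the recorded yes/no answers need to become numeric features. The following is a minimal sketch, assuming a plain 0/1 encoding. `encode_answers` and the `trial` dictionary are hypothetical names introduced here for illustration and are not part of the notebook's own code.

```python
def encode_answers(answers):
    """Map 'yes'/'no' answers to 1/0 so a model can consume them as features."""
    mapping = {'yes': 1, 'no': 0}
    return [mapping[answer] for answer in answers]

# One recorded trial: the selected answers plus one recommended major.
# (Shortened to 4 answers; the major name is a placeholder, not real quiz output.)
trial = {'answers': ['yes', 'no', 'no', 'yes'], 'major': 'Example Major'}

features = encode_answers(trial['answers'])
label = trial['major']
print(features, label)
```

Other encodings would work just as well. This one simply keeps one feature per question.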

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Autofiller of Online Questionnaire
**Online questionnaire source(s)**

["What's My Major Quiz" by Loyola University Chicago](https://www.luc.edu/undergrad/academiclife/whatsmymajorquiz/#)
- 40 short yes/no questions, giving $2^{40}$ possible answer combinations
- Recommends multiple majors
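
To put the size of that answer space in perspective, the arithmetic can be checked directly. This is a small sketch. `NUM_QUESTIONS` and `NUM_TRIALS` are names introduced here for illustration, with 200 matching the number of autofill rounds run at the end of this notebook.

```python
# Size of the answer space for a quiz made of independent yes/no questions.
NUM_QUESTIONS = 40   # from the quiz description above
NUM_TRIALS = 200     # number of autofill rounds (illustrative)

total_combinations = 2 ** NUM_QUESTIONS
coverage = NUM_TRIALS / total_combinations   # upper bound: assumes no repeated answer sheets

print(f"{total_combinations:,} possible answer sheets")
print(f"{NUM_TRIALS} trials cover at most {coverage:.2e} of them")
```

Random sampling therefore only ever sees a tiny fraction of the possible answer sheets, which is why the collected data is used to train a model rather than as a lookup table.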

## Setup

In [79]:
# import selenium related packages
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [33]:
# define webdriver paths
CHROME_DRIVER_PATH = './drivers/chromedriver'

In [34]:
# define urls to be crawled
LUC_QUIZ_URL = 'https://www.luc.edu/undergrad/academiclife/whatsmymajorquiz/#'

## Helper functions

In [198]:
# helper function
# launch chrome browser (headless)
# return: selenium Chrome webdriver instance
def launch_chrome_browser(driver_path):
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # run without opening a visible window
    browser = webdriver.Chrome(driver_path, options=chrome_options)
    return browser

In [199]:
# helper function
# click the chosen radio button ('yes' or 'no') on the online questionnaire
import random

def luc_click(browser, chosen_option):
    # locate the radio button by its id ('yes' or 'no');
    # the find_element_by_* helpers were removed in newer Selenium releases
    btn = browser.find_element('id', chosen_option)
    btn.click() # click the radio button on the webpage

In [200]:
# helper function
# randomly pick 'yes' or 'no' for each of the 40 questions
# return: LIST
def luc_generate_options(browser):
    options = ['yes', 'no']
    chosen_options = random.choices(options, k=40)
    return chosen_options
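
Because the answers are drawn at random, seeding decides whether a run can be repeated. The following is a minimal sketch of the same idea with a local generator instead of the global `random` state. `generate_options` is a hypothetical stand-in introduced here, not a replacement for `luc_generate_options`.

```python
import random

def generate_options(seed, k=40):
    """Return k random 'yes'/'no' answers, reproducible for a given seed."""
    rng = random.Random(seed)   # local generator; the global random state is untouched
    return rng.choices(['yes', 'no'], k=k)

first = generate_options(seed=0)
again = generate_options(seed=0)   # same seed, same answer sheet
other = generate_options(seed=1)   # different seed, different answer sheet

print(first == again)
print(first[:5], other[:5])
```

Using one seed per round, as the trial loop later in this notebook does, makes every round's answer sheet reproducible.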

In [201]:
# helper function
# step through the quiz once (with random answers) and collect the text of all 40 questions
# return: DATAFRAME with one column per question and no rows
def luc_get_all_questions(browser):
    questions = []
    chosen_options = luc_generate_options(browser)
    for chosen_option in chosen_options:
        # read the question text, then click an answer
        q = browser.find_element('xpath', "//div[contains(@class, 'question')]").text
        questions.append(q)

        luc_click(browser, chosen_option)
        sleep(1.5)  # give the page time to update
    # use the questions as the column headers of an otherwise empty dataframe
    df_questions = pd.DataFrame(questions, columns=['Question']).set_index('Question')
    df_questions = df_questions.transpose()
    return df_questions

In [202]:
# helper function
# get final major suggestions
# return: LIST of link elements (read each major's name via .text)
def luc_get_major_suggestions(browser):
    # match list items whose class contains selected-3 through selected-6
    xpath = ("//li[contains(@class, 'selected-6') or contains(@class, 'selected-5') "
             "or contains(@class, 'selected-4') or contains(@class, 'selected-3')]//a")
    suggestions = browser.find_elements('xpath', xpath)
    return suggestions

In [203]:
# helper function
# fill in the quiz once and build the training rows for that trial
# 40 columns of yes/no + 1 last column of major recommendation
# one row per recommended major
# return: DATAFRAME
def luc_append_training_row(browser):
    chosen_options = luc_generate_options(browser)

    # click 40 times
    for chosen_option in chosen_options:
        luc_click(browser, chosen_option)
        sleep(1.2)
    
    # get the list of major suggestions
    suggestions = luc_get_major_suggestions(browser)

    # create a separate row for each of the suggestions
    # (DataFrame.append was removed in pandas 2.0, so build the rows first)
    rows = [chosen_options + [suggestion.text] for suggestion in suggestions]
    df = pd.DataFrame(rows)
    
    return df
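
Since the quiz recommends several majors per answer sheet, each trial is stored as one row per recommended major, all sharing the same 40 answers. The following is a minimal sketch of that expansion in plain Python. `expand_rows` is a hypothetical helper, and the answers and major names are made-up placeholders rather than real quiz output.

```python
def expand_rows(chosen_options, suggestions):
    """Pair one answer sheet with each suggested major, one row per major."""
    return [list(chosen_options) + [major] for major in suggestions]

answers = ['yes', 'no', 'yes']       # shortened to 3 answers for readability
majors = ['Biology', 'Chemistry']    # placeholder suggestions

rows = expand_rows(answers, majors)
for row in rows:
    print(row)
```

A list of rows like this can be passed directly to `pd.DataFrame`. If the quiz returns no suggestions, the result is simply an empty list.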

In [204]:
# helper function
# click 'START OVER'
def luc_start_over(browser):
    btn = browser.find_element('xpath', "//a[@id='reset']")
    btn.click()

In [205]:
# helper function
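# run k rounds of filling in the form and collect the rows from every round
# return: DATAFRAME
def luc_generate_trials(browser, k):
    frames = []
    for i in range(k):
        random.seed(i)  # one seed per round keeps each answer sheet reproducible
        frames.append(luc_append_training_row(browser))
        luc_start_over(browser)
    # DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
    return pd.concat(frames) if frames else pd.DataFrame()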

## Generate actual data from autofilling

**Get all 40 questions** (only need to run once)

In [195]:
# launch chrome browser
browser = launch_chrome_browser(CHROME_DRIVER_PATH)

# open the quiz page
browser.get(LUC_QUIZ_URL)
sleep(2)

# get all 40 questions and store them in a dataframe df
df_luc_questions = luc_get_all_questions(browser)

# close the browser and end the webdriver session
browser.quit()

**Generate data**

In [208]:
# launch chrome browser
browser = launch_chrome_browser(CHROME_DRIVER_PATH)

# open the quiz page
browser.get(LUC_QUIZ_URL)
sleep(2)

# generate the data for 200 rounds of autofilling
df_luc = luc_generate_trials(browser, 200)

# close the browser and end the webdriver session
browser.quit()