Web Form Autofiller for University Major Recommendation (Type 1)
---

**Overall logic**
1. We use a web crawler to automatically fill in an online questionnaire made up of personality- and academics-related questions, answering each question at random. During each run, we record the answers the crawler selected and the major recommendations the site returned. We repeat this process many times to build our own dataset.
2. Using our collected dataset, we train our own major recommendation model.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Autofiller of Online Questionnaire
**Online questionnaire source(s)**

["What's My Major Quiz" by Loyola University Chicago](https://www.luc.edu/undergrad/academiclife/whatsmymajorquiz/#)
- 40 short yes/no questions, giving $2^{40}$ possible answer combinations
- Recommends multiple majors
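
To put the $2^{40}$ figure in perspective, the short calculation below compares the size of the answer space with a crawl of 600 random rounds, the count used in the data generation step of this notebook. It is a back-of-the-envelope sketch, and the variable names are illustrative only.

```python
# Size of the answer space for 40 independent yes/no questions
n_questions = 40
n_combinations = 2 ** n_questions

# Number of autofill rounds, taken from the data generation step
n_trials = 600

# Fraction of all possible answer sheets that the crawl can visit at most
coverage = n_trials / n_combinations
print(f"{n_combinations:,} possible answer sheets")
print(f"{n_trials} rounds cover at most {coverage:.2e} of the space")
```

Any model trained on the collected data therefore sees only a tiny sample of the possible answer sheets.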

## Setup

In [10]:
# import selenium related packages
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [11]:
# define webdriver paths
CHROME_DRIVER_PATH = './drivers/chromedriver'

In [12]:
# define urls to be crawled
LUC_QUIZ_URL = 'https://www.luc.edu/undergrad/academiclife/whatsmymajorquiz/#'

## Helper functions

In [13]:
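# helper function
# launch chrome browser (headless)
# return: webdriver.Chrome instance
from selenium.webdriver.chrome.service import Service

def launch_chrome_browser(chrome_driver_path):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    # Selenium 4 takes the driver path through a Service object
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=chrome_options)
    return browser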

In [14]:
# helper function
# click one radio button ('yes' or 'no') for the current question
import random

def luc_click(browser, chosen_option):
    # locate the radio button by its id ('id' is the locator string behind By.ID)
    btn = browser.find_element('id', chosen_option)
    btn.click() # click the radio button on the webpage

In [15]:
# helper function
# generate 40 random 'yes'/'no' answers, one per question
# return: LIST
def luc_generate_options(browser):
    options = ['yes', 'no']
    chosen_options = random.choices(options, k=40)
    return chosen_options

In [16]:
# helper function
# get all 40 questions and store them in a dataframe first
# return: DATAFRAME
def luc_get_all_questions(browser):
    questions = []
    chosen_options = luc_generate_options(browser)
    for chosen_option in chosen_options:
        q = browser.find_element('xpath', "//div[contains(@class, 'question')]").text
        questions.append(q)

        luc_click(browser, chosen_option)
        sleep(1.5)
    df_questions = pd.DataFrame(questions, columns=['Question']).set_index('Question')
    df_questions = df_questions.transpose()
    return df_questions

In [17]:
# helper function
# get final major suggestions
# return: LIST of web elements (read each name with .text)
def luc_get_major_suggestions(browser):
    suggestions = browser.find_elements('xpath', "//li[contains(@class, 'selected-6') or contains(@class, 'selected-5') or contains(@class, 'selected-4') or contains(@class, 'selected-3')]//a")
    return suggestions

In [18]:
# helper function
# get rows of training data
# 40 columns of yes/no + 1 last column of major recommendation
# return: DATAFRAME
def luc_append_training_row(browser):
    chosen_options = luc_generate_options(browser)

    # click 40 times
    for chosen_option in chosen_options:
        luc_click(browser, chosen_option)
        sleep(1.2)
    
    # get the list of major suggestions
    suggestions = luc_get_major_suggestions(browser)

    # build one row per suggestion: the 40 answers followed by the major name
    rows = [chosen_options + [suggestion.text] for suggestion in suggestions]
    df = pd.DataFrame(rows)
    
    return df
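
The fixed `sleep(1.2)` pauses above assume the page always updates within that time. One alternative is to poll for a condition until a timeout expires. The helper below is a minimal, library-independent sketch: `wait_until` and its parameters are hypothetical names and not part of Selenium, which ships its own `WebDriverWait` class for the same purpose. The "page" here is a stand-in counter so the example runs without a browser.

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)

# Stand-in for a page that only becomes ready on the third check
state = {"checks": 0}

def page_ready():
    state["checks"] += 1
    return state["checks"] >= 3

wait_until(page_ready, timeout=2.0, interval=0.01)
print(state["checks"])
```

With a live browser, the predicate could be something like `lambda: luc_get_major_suggestions(browser)`, since an empty result list is falsy until the recommendations render.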

In [19]:
# helper function
# click 'START OVER'
def luc_start_over(browser):
    btn = browser.find_element('xpath', "//a[@id='reset']")
    btn.click()

In [37]:
# helper function
# define the logic within k rounds of filling the form
# return: DATAFRAME
def luc_generate_trials(browser, k):
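    frames = []
    for i in range(k):
        random.seed(i+200)  # one fixed seed per round keeps the answers reproducible
        frames.append(luc_append_training_row(browser))
        luc_start_over(browser)
    # concatenate once at the end instead of growing a DataFrame round by round
    df = pd.concat(frames) if frames else pd.DataFrame()
    return df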
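
Reseeding with `i+200` at the start of each round makes every round's answer sheet reproducible. The sketch below mirrors that seeding logic without a browser; `answers_for_round` is a hypothetical helper introduced only for illustration.

```python
import random

def answers_for_round(i, n_questions=40):
    # Same seeding scheme as the loop above: one fixed seed per round
    random.seed(i + 200)
    return random.choices(['yes', 'no'], k=n_questions)

first = answers_for_round(0)
again = answers_for_round(0)

print(first == again)  # same seed, so the same answer sheet
print(len(first))
```

Rerunning the crawl with the same `k` therefore revisits the same answer sheets; changing the seed offset would produce new ones.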

## Generate actual data from autofilling

**Get all 40 questions** (only need to run once)

In [21]:
# launch chrome browser
browser = launch_chrome_browser(CHROME_DRIVER_PATH)

# open the quiz page
browser.get(LUC_QUIZ_URL)
sleep(2)

# get all 40 questions and store them in a dataframe df
df_luc_questions = luc_get_all_questions(browser)

# close the browser
browser.close()

In [23]:
# save questions list as csv
df_luc_questions.to_csv('df_luc_questions.csv')

**Generate data**

In [38]:
# launch chrome browser
browser = launch_chrome_browser(CHROME_DRIVER_PATH)

# open the quiz page
browser.get(LUC_QUIZ_URL)
sleep(2)

# generate the data for 600 rounds of autofilling
df_luc = luc_generate_trials(browser, 600)

# close the browser
browser.close()

In [223]:
# save the csv
df_luc.to_csv('datasets/df_luc.csv')

# 2. Training using generated data

In [9]:
# read data
df_luc = pd.read_csv('datasets/df_luc.csv', index_col=0)

In [77]:
df_luc_questions = df_luc_questions.transpose().reset_index()

In [118]:
df_luc_questions
# keep one row per unique set of 40 answers (column '40' holds the recommended major)
df = df_luc.drop_duplicates(subset=df_luc.columns.difference(['40']))
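
Because the quiz can recommend several majors for one answer sheet, the raw data can hold several rows with identical answers and different labels, and `drop_duplicates` keeps only the first of them. To see what is discarded, the rows can instead be grouped so that each answer sheet keeps its full list of majors. The toy frame below is made-up data with 3 answer columns instead of 40, and the major names are placeholders.

```python
import pandas as pd

# Toy stand-in for df_luc: answer columns '0'..'2', label column '3'
toy = pd.DataFrame(
    [['yes', 'no', 'yes', 'Biology'],
     ['yes', 'no', 'yes', 'Chemistry'],
     ['no', 'no', 'yes', 'History']],
    columns=['0', '1', '2', '3'],
)
answer_cols = ['0', '1', '2']

# Single-label view: first recommendation per answer sheet
single = toy.drop_duplicates(subset=answer_cols)

# Multi-label view: all recommendations per answer sheet
multi = toy.groupby(answer_cols, sort=False)['3'].agg(list).reset_index()

print(single['3'].tolist())
print(multi['3'].tolist())
```

The single-label view is what the classifiers below are trained on, so a prediction counts as wrong even when it matches one of the other majors the quiz recommended for the same answers.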

In [119]:
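# encode the 40 answers as 1 ('yes') / 0 ('no')
X = df.iloc[:, :40].eq('yes').mul(1)
# factorize returns (integer codes, unique labels); the codes are the targets
y_factorized = pd.factorize(df.iloc[:, 40])
y = y_factorized[0]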
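
`pd.factorize` returns a pair: an array of integer codes and the unique labels in order of first appearance. Indexing the unique labels with the codes (or with a model's predicted codes) recovers the major names. A small sketch with made-up labels:

```python
import pandas as pd

labels = pd.Series(['Biology', 'History', 'Biology', 'Physics'])
codes, uniques = pd.factorize(labels)

print(codes.tolist())   # integer class ids, one per row
print(list(uniques))    # label for each class id

# Map codes back to label names
decoded = list(uniques[codes])
print(decoded == labels.tolist())
```

In the notebook, `y_factorized[1]` plays the role of `uniques`, so `y_factorized[1][clf.predict(X_test)]` would translate predictions from any fitted classifier `clf` back into major names.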

In [120]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7600, shuffle=True)

In [130]:
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier(random_state=7600)
clf_dt.fit(X_train, y_train)
clf_dt.score(X_test, y_test)

0.06875

In [129]:
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
clf_gnb.fit(X_train, y_train)
clf_gnb.score(X_test, y_test)

0.18125

In [131]:
from sklearn.svm import SVC
clf_svc = SVC(random_state=8017)
clf_svc.fit(X_train, y_train)
clf_svc.score(X_test, y_test)

0.14375

In [132]:
from sklearn.linear_model import LogisticRegression
clf_logistic = LogisticRegression(random_state=8017)
clf_logistic.fit(X_train, y_train)
clf_logistic.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.11875
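
Raw accuracies are easier to judge against a trivial baseline. The sketch below computes the accuracy obtained by always predicting the most frequent training class, using NumPy only. The label arrays are made up for illustration; in the notebook, `y_train` and `y_test` would take their place. `majority_baseline_accuracy` is a hypothetical helper name.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent class in y_train."""
    classes, counts = np.unique(y_train, return_counts=True)
    majority = classes[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

# Made-up labels: class 2 is the most frequent in training
y_train_demo = np.array([0, 2, 2, 1, 2, 0])
y_test_demo = np.array([2, 0, 2, 1])

print(majority_baseline_accuracy(y_train_demo, y_test_demo))
```

A classifier that scores close to this baseline has learned little beyond the class frequencies.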

In [135]:
y_factorized

(array([ 0,  1,  2,  3,  4,  3,  5,  6,  7,  0,  8,  3,  3,  9,  5,  9, 10,
         8,  1,  5,  2, 11, 12,  4,  8,  2,  3, 13, 14, 11,  8, 15,  8, 10,
         7, 11,  8,  5,  3,  2,  7, 14,  0,  5,  8, 15,  8,  7, 10, 10,  0,
        16,  3,  8,  1, 10, 17,  7,  1,  4,  8,  2,  8, 17,  7,  3, 13,  2,
        15, 13, 18,  2,  2, 18,  3, 15,  8,  8, 10,  8, 13,  5, 18, 17, 17,
         7,  2, 14, 14, 14,  3,  0, 15,  0,  1, 14,  0,  5,  9,  0,  2,  0,
        10,  5, 14,  7, 15, 19, 10,  0,  6,  0,  3,  8,  9,  8,  4, 17, 12,
         2,  5,  0,  2,  0, 10,  2,  7,  4,  5, 12, 19,  7,  8,  2,  8,  2,
         0,  2,  8,  3,  2,  7,  2,  3,  7,  0,  3,  5, 14,  8,  8, 10, 15,
         0,  4, 14,  2,  6,  0,  9,  0, 16,  3,  5, 11,  0,  7,  2,  3,  2,
         0,  8,  2, 18,  2,  7, 15,  3,  0,  9, 17, 14,  8,  8,  7,  2,  6,
         7, 17,  7,  4,  7, 18,  2,  2,  0, 19,  1,  0,  6, 10, 15,  0,  8,
         7,  0,  9,  8,  9,  7,  8,  0, 20, 20,  3, 16,  0,  5,  4, 15,  2,
         0, 