<a href="https://colab.research.google.com/github/jeffreyong15/Counsel.NLP/blob/main/Baseline%20Experiment/Data%20Collection/Data_Collection_Jeffrey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Synthetic Academic Advising Dataset

In [None]:
import pandas as pd
import random

In [None]:
# Define categories and templates for prompts and responses
data_templates = {
    "Prerequisites": [
        ("What are the prerequisites for {course}?", "You need to complete {course1} and {course2}."),
        ("Can I take {course} without {course1}?", "No, you need to complete {course1} first."),
        ("Are there prerequisites for {course}?", "Yes, you need to complete {course1} and {course2}.")
    ],
    "Graduation Requirements": [
        ("How many credits do I need to graduate?", "You need a total of {credits} credits."),
        ("What are the core requirements for graduation?", "You must complete core courses in Math, Science, and English."),
        ("Do I need elective credits to graduate?", "Yes, you need at least {elective_credits} elective credits.")
    ],
    "Academic Support": [
        ("Where can I find tutoring services?", "Tutoring services are available at the Academic Resource Center."),
        ("Is there a study group for {course}?", "Yes, check the bulletin board for study group information for {course}."),
        ("How can I get help with assignments?", "You can get help from tutors and your course TA.")
    ],
    "Course Scheduling": [
        ("When is {course} offered?", "{course} is offered every {semester}."),
        ("Are summer courses available?", "Yes, summer courses are available for selected subjects."),
        ("How do I register for next semester?", "You can register through the online portal starting in October.")
    ],
    "Changing Major": [
        ("How can I change my major?", "Meet with an academic advisor to discuss changing your major."),
        ("What are the steps to change my major?", "Fill out a change of major form and get approval from your advisor."),
        ("Can I switch to a double major?", "Yes, you can discuss this option with your advisor.")
    ],
    "Academic Policies": [
        ("What is the grading scale?", "The grading scale is A, B, C, D, and F."),
        ("What happens if I fail a course?", "You should meet with your advisor to discuss options."),
        ("Can I retake a course for a better grade?", "Yes, you can retake a course, and the new grade will replace the old one.")
    ],
    "Senior Project Requirements": [
        ("When should I take the Senior Project course?", "The Senior Project course should be taken in your final semester."),
        ("What is required for the Senior project?", "The Senior project requires a comprehensive research or practical project."),
        ("Is there a prerequisite for the Senior Project course?", "Yes, you need to complete all core courses before the Senior Project course.")
    ]
}

In [None]:
# Generate random values
def random_course_code():
    return f"CS{random.randint(100, 499)}"

def random_credits():
    return random.choice([120, 130, 140])

def random_semester():
    return random.choice(["Fall", "Spring", "Fall and Spring", "Summer"])
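
The helpers above draw from the module-level `random` state, so each run of the notebook produces a different dataset. The following is a minimal sketch of one way to make the draws repeatable, assuming a dedicated `random.Random` instance is acceptable. The `rng` parameter, the `sample_codes` helper, and the seed value 42 are illustrative and not part of the original notebook.

```python
import random

def random_course_code(rng):
    # Same format as the notebook's helper, but drawn from an explicit generator
    return f"CS{rng.randint(100, 499)}"

def sample_codes(seed, n=5):
    # A fresh generator seeded with the same value yields the same sequence
    rng = random.Random(seed)
    return [random_course_code(rng) for _ in range(n)]

first_run = sample_codes(42)
second_run = sample_codes(42)
print(first_run == second_run)  # True
```

Calling `random.seed(...)` once at the top of the notebook would likely achieve the same effect without changing the helper signatures.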

In [None]:
num_samples = 10000
rows = []

for _ in range(num_samples):
    category = random.choice(list(data_templates.keys()))
    query_template, response_template = random.choice(data_templates[category])

    # Draw three distinct course codes so a course is never listed as its own prerequisite
    course, course1, course2 = (f"CS{n}" for n in random.sample(range(100, 500), 3))
    credits = random_credits()
    elective_credits = random.choice([20, 30, 40])
    semester = random_semester()

    query = query_template.format(
        course=course,
        course1=course1,
        course2=course2,
        credits=credits,
        elective_credits=elective_credits,
        semester=semester
    )
    response = response_template.format(
        course=course,
        course1=course1,
        course2=course2,
        credits=credits,
        elective_credits=elective_credits,
        semester=semester
    )

    rows.append((query, response, category))

df = pd.DataFrame(rows, columns=["Prompt", "Response", "Category"])

output_path = "academic_advising_data.csv"
df.to_csv(output_path, index=False)

In [None]:
output_path

'academic_advising_data.csv'
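
The generation loop passes every placeholder value to every template, including templates that use only some of them or none at all. This works because `str.format` ignores keyword arguments that the template does not reference. A small sketch with made-up values (the course codes below are illustrative only):

```python
template_with_fields = "You need to complete {course1} and {course2}."
template_without_fields = "The grading scale is A, B, C, D, and F."

# Illustrative values; the notebook draws these at random
values = {
    "course": "CS101",
    "course1": "CS146",
    "course2": "CS151",
    "credits": 120,
    "elective_credits": 20,
    "semester": "Fall",
}

filled = template_with_fields.format(**values)
unchanged = template_without_fields.format(**values)

print(filled)     # You need to complete CS146 and CS151.
print(unchanged)  # The grading scale is A, B, C, D, and F.
```

Because templates without placeholders always produce the same text, a 10,000-row sample will contain many exact duplicates; applying `df.drop_duplicates()` before saving may be worth considering.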

## Real Student Academic Dataset

#### Install Required Libraries

In [1]:
!echo | sudo add-apt-repository ppa:saiarcot895/chromium-beta
!sudo apt remove chromium-browser
!sudo snap remove chromium
!sudo apt install chromium-browser -qq
# Install Chromium (the open-source browser that Chrome is built on). The Chromium WebDriver, which lets Selenium control Chromium, is installed in the next cell.


In [2]:
!pip3 install selenium --quiet
!apt-get update
!apt install chromium-chromedriver -qq
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
# Selenium needs a browser driver (here, chromedriver) to communicate with the browser. This cell installs it from the chromium-chromedriver package and copies it to /usr/bin so it is on the PATH.

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net

In [3]:
!pip install selenium
!apt-get update
!apt-get install -y chromium-chromedriver

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.lis

#### Import Library

In [99]:
import time
import sys
import warnings
import json
import glob
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
from urllib.parse import urljoin
from bs4.element import NavigableString, Tag

#### Load the chrome webdriver

In [45]:
# Use the chromedriver binary installed by the chromium-chromedriver package above.
# Chrome runs headless here, so no browser window opens.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# ChromeService tells Selenium where the chromedriver executable lives
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # You can change the log path as needed
)
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)

#### Data Scraping and Collection

For now, I scrape only the CMPE courses from the SJSU catalog. I will add other course types in the future.
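
As a sketch of how other subjects might be added later, the catalog URL that `extract_all_course_details` hard-codes can be built from a subject prefix and a page number. The helper name `build_catalog_url` is hypothetical, and it is an assumption that other prefixes (for example `CS`) are served with the same query parameters and page layout as CMPE.

```python
from urllib.parse import urlencode

CATALOG_BASE = "https://catalog.sjsu.edu/content.php"

def build_catalog_url(prefix, page):
    # Query parameters copied from the URL used in extract_all_course_details.
    # urlencode percent-encodes the square brackets as %5B and %5D.
    params = {
        "catoid": 15,
        "navoid": 5382,
        "filter[27]": prefix,
        "filter[cpage]": page,
        "filter[exact_match]": 1,
        "filter[item_type]": 3,
        "filter[only_active]": 1,
    }
    return f"{CATALOG_BASE}?{urlencode(params)}"

print(build_catalog_url("CMPE", 1))
```

For `("CMPE", 1)` this reproduces the first-page URL that the scraper requests.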

In [133]:
# Extract course description
def extract_description(course_table):
    description = "Description not found"
    hr_tag = course_table.find('hr')

    if hr_tag:
        description_parts = []
        content_div = course_table.find('div', {'class': None, 'style': None})

        if content_div:
            current = content_div.find('hr').next_sibling

            while current and (
                (isinstance(current, NavigableString) and not current.strip()) or
                (isinstance(current, Tag) and current.name == 'em' and 'unit' in current.text.lower())
            ):
                current = current.next_sibling


            while current:
                if isinstance(current, NavigableString):
                    text = current.strip()
                    if text:
                        description_parts.append(text)
                elif isinstance(current, Tag):
                    if current.name == 'br':
                        next_text = getattr(current.next_sibling, 'text', '') if current.next_sibling else ''
                        if any(keyword in str(next_text) for keyword in ['Lecture', 'Prerequisite', 'Corequisite']):
                            break
                    elif current.name != 'em' and current.name != 'strong':
                        if current.text and 'unit' not in current.text.lower():
                            description_parts.append(current.text.strip())

                current = current.next_sibling

            if description_parts:
                description = ' '.join(description_parts).strip()

    return description

# Extract course units
def extract_units(course_table):
    hr_tag = course_table.find('hr')
    if hr_tag:
        unit_ems = hr_tag.find_next_siblings('em', limit=2)
        if len(unit_ems) >= 2:
            return f"{unit_ems[0].text.strip()} {unit_ems[1].text.strip()}"
    return 'Units not found'

# Extract class structure (lecture/lab hours)
def extract_class_structure(course_table):
    lecture_lab = course_table.find('em', string=lambda x: x and ('hour' in x.lower() or 'lab' in x.lower()))
    return lecture_lab.text.strip() if lecture_lab else 'Class structure not found'

# Extract prerequisites and/or corequisites
def extract_prerequisites(course_table):
    prereq_text = []
    for strong_tag in course_table.find_all('strong'):
        if any(req in strong_tag.text for req in ['Prerequisite', 'Corequisite']):
            req_text = ''  # Initialize an empty string to store the requirements
            next_elem = strong_tag.next_sibling
            while next_elem and not (isinstance(next_elem, type(strong_tag)) and next_elem.name == 'strong'):
                if isinstance(next_elem, str):
                    req_text += next_elem.strip() + ' '
                elif hasattr(next_elem, 'text'):
                    req_text += next_elem.text.strip() + ' '
                next_elem = next_elem.next_sibling
            prereq_text.append(req_text.strip())
    # Only return the requirement details without the "Prerequisite" or "Corequisite" label
    return ' | '.join(prereq_text).replace('Prerequisite:', '').replace('Corequisite:', '').strip() if prereq_text else 'No prerequisites/corequisites listed'

# Extract grading information
def extract_grading(course_table):
    grading_tag = course_table.find('strong', string=lambda x: x and 'Grading' in x)
    if grading_tag:
        grading_text = ''  # Initialize an empty string to store the grading info
        next_elem = grading_tag.next_sibling
        while next_elem and not (isinstance(next_elem, type(grading_tag)) and next_elem.name == 'strong'):
            if isinstance(next_elem, str):
                grading_text += next_elem.strip() + ' '
            next_elem = next_elem.next_sibling
        return grading_text.strip()  # Return only the grading information without the label
    return 'Grading info not found'

def expand_all_course_links(driver):
    course_links = driver.find_elements(By.XPATH, "//td[@class='width']/a[contains(@onclick, 'showCourse')]")

    for course_link in course_links:
        try:
            # Click each course link to expand
            ActionChains(driver).move_to_element(course_link).click().perform()
            time.sleep(2)
        except Exception as e:
            print(f"Error clicking course link: {e}")
            continue

def extract_course_details(driver):
    course_details = []

    expand_all_course_links(driver)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    course_tables = soup.find_all('table', class_='td_dark')

    for course_table in course_tables:
        details = {
            'title': course_table.find('h3').text.strip() if course_table.find('h3') else 'Title not found',
            'units': extract_units(course_table),
            'description': extract_description(course_table),
            'class_structure': extract_class_structure(course_table),
            'prerequisites/corequisite': extract_prerequisites(course_table),
            'grading': extract_grading(course_table)
        }

        course_details.append(details)

    return course_details

def extract_all_course_details(driver, pages=(1, 2)):
    all_course_details = []

    # Loop over the Page(s) available
    for page in pages:
        # Dynamically adjust the URL for the page
        url = f"https://catalog.sjsu.edu/content.php?catoid=15&navoid=5382&filter%5B27%5D=CMPE&filter%5Bcpage%5D={page}&filter%5Bexact_match%5D=1&filter%5Bitem_type%5D=3&filter%5Bonly_active%5D=1"

        driver.get(url)
        time.sleep(5)

        # Extract course details from the current page
        course_details = extract_course_details(driver)
        all_course_details.extend(course_details)

    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.width', 1000)
    df = pd.DataFrame(all_course_details)

    df['id'] = range(1, len(df) + 1)
    # Reorder columns to move 'id' to the front
    df = df[['id'] + [col for col in df.columns if col != 'id']]

    return df

# def extract_all_course_details(driver, course_filters=['CMPE', 'CS'], pages=[1, 2]):
#     all_course_details = []

#     # Loop over each course filter (e.g., CMPE, CS)
#     for course_filter in course_filters:
#         # Loop over the Page(s) available for the current course filter
#         for page in pages:
#             # Dynamically adjust the URL for the page and course filter
#             url = f"https://catalog.sjsu.edu/content.php?catoid=15&navoid=5382&filter%5B27%5D={course_filter}&filter%5Bcpage%5D={page}&filter%5Bexact_match%5D=1&filter%5Bitem_type%5D=3&filter%5Bonly_active%5D=1"

#             driver.get(url)
#             time.sleep(5)

#             # Check if the current page exists by checking for page navigation or a specific element
#             if "Page" in driver.page_source:
#                 soup = BeautifulSoup(driver.page_source, 'html.parser')
#                 page_links = soup.find('tr', string=lambda x: x and 'Page' in x)
#                 if page_links:
#                     available_pages = len(page_links.find_all('a'))
#                     print(f"Available pages for {course_filter}: {available_pages}")
#                     # If only one page, adjust the 'pages' list to have only [1]
#                     if available_pages == 1:
#                         pages = [1]
#                         break

#             # Extract course details from the current page
#             course_details = extract_course_details(driver)
#             all_course_details.extend(course_details)

#     pd.set_option('display.max_colwidth', None)
#     pd.set_option('display.width', 1000)
#     df = pd.DataFrame(all_course_details)

#     # Add 'id' column at the first position
#     df['id'] = range(1, len(df) + 1)
#     df = df[['id'] + [col for col in df.columns if col != 'id']]  # Reorder columns to move 'id' to the front

#     return df

def save_to_json(df, filename="CMPE_courses_dataset.json"):
    course_json = df.to_dict(orient='records')

    with open(filename, 'w') as f:
        json.dump(course_json, f, indent=4)

    print(f"Data saved to {filename}")

In [131]:
CMPE_df = extract_all_course_details(driver, pages=[1, 2])
CMPE_df_display = CMPE_df.drop(columns='id')
CMPE_df_display

Unnamed: 0,title,units,description,class_structure,prerequisites/corequisite,grading
0,CMPE 30 - Programming Concepts and Methodology,3 unit(s),"Introduction to programming; overview of computer organization and introduction to software engineering. Topics include methodologies for program design, development, style, testing, and documentation; algorithms, control structures, functions, and elementary data structures.",Lecture 2 hours/lab 3 hours.,MATH 30 or MATH 30X or equivalent.,Letter Graded
1,CMPE 50 - Object-Oriented Concepts and Methodology,3 unit(s),"Application of object-oriented software engineering techniques to the design and development of larger programs; data abstraction, structures, classes and associated algorithms.",Misc/Lab: Lecture 2 hours/lab 3 hours.,CMPE 30 with a minimum grade of “C-“. Computer Engineering and Software Engineering Majors only.,Letter Graded
2,CMPE 92 - International Program Studies,1-6 unit(s),"Study Abroad and Away transfer credit course. Study Abroad and Away provides students the opportunity to study abroad on long term programs (Exchange Programs, CSU International Programs, and International Student Exchange Programs) and short-term programs (Faculty-Led Programs and Summer School Abroad Programs) for academic credit, offering Alternative Break Programs for cultural immersion, and designing other globally focused opportunities. This course is designated as a placeholder course for Study Abroad and Away programs. Mixed Grading",Class structure not found,No prerequisites/corequisites listed,Mixed Grading
3,CMPE 102 - Assembly Language Programming,3 unit(s),"Assembly programming; assembly-C interface; CPU and memory organization; addressing modes; arithmetic, logic and branch instructions; arrays, pointers, subroutines, stack and procedure calls; software interrupts; multiplication, division and floating point arithmetic. CMPE 50 or CS 46B (with grade of “C-” or better) Sophomore or upper division standing. Allowed Declared Majors: Computer Engineering, Software Engineering. Letter Graded Close Close",Class structure not found,"CMPE 50 or CS 46B (with grade of “C-” or better) Sophomore or upper division standing. Allowed Declared Majors: Computer Engineering, Software Engineering.",Letter Graded
4,CMPE 110 - Electronics for Computing Systems,3 unit(s),"RC, RL and RLC circuit analysis, diodes and diode circuits, MOSFET and bipolar transistor I-V characteristics and circuits, CMOS logic circuits, CMOS-TTL interface, sensors and signal conditioning circuits using operational amplifiers, A/D and D/A converters, electromechanical device control.",Lecture 2 hours/lab 3 hours.,"EE 97 , EE 98 , MATH 33LA or MATH 39 + MATH 33A , all with a grade of “C” or better. Allowed Declared Majors: Computer Engineering and Software Engineering.",Letter Graded.
...,...,...,...,...,...,...
99,CMPE 297 - Special Topics in Computer/Software Engineering,3 unit(s),Special topics to augment regularly-scheduled graduate courses. May be taken up to three times in different topic areas.,Class structure not found,Instructor consent.,Letter Graded
100,CMPE 298 - Special Problems,1-6 unit(s),Advanced individual work in computer engineering. Instructor consent. Not available to Open University Students. Mandatory Credit/No Credit/RP,Class structure not found,Instructor consent. Not available to Open University Students.,Mandatory Credit/No Credit/RP
101,CMPE 298I - Computer/Software Engineering Internship,1-6 unit(s),"Field work for computer and software engineering graduate students. A report is required at the end of the semester addressing the goals set at the start of the assignment. Completed 6 units degree core plus six additional degree program units, classified status, in good standing and graduate advisor consent Mandatory Credit/No Credit/RP",Class structure not found,"Completed 6 units degree core plus six additional degree program units, classified status, in good standing and graduate advisor consent",Mandatory Credit/No Credit/RP
102,CMPE 299A - Master Thesis I,3 unit(s),"The first part of a thesis culminating the work for the master’s degree in the specialization. Classified status, good standing, completion of at least 15 units of graduation degree credit, two core courses, and at least one specialization course; and graduate director consent. Credit/No Credit/RP.",Class structure not found,"Classified status, good standing, completion of at least 15 units of graduation degree credit, two core courses, and at least one specialization course; and graduate director consent.",Credit/No Credit/RP.
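
In the output above, some descriptions (for example CMPE 102 and CMPE 298) still end with the prerequisite text, the grading text, and in one case leftover "Close" link text. A minimal post-processing sketch, assuming this residue always appears at the end of the description in that order; `clean_description` is a hypothetical helper and not part of the original scraper. The sample record is the CMPE 298 row from the output above.

```python
def clean_description(record):
    text = record["description"].strip()
    # Drop trailing "Close" link text left over from the expanded course panel
    while text.endswith("Close"):
        text = text[: -len("Close")].rstrip()
    # Strip the grading text first, then the prerequisite text, when they trail the description
    for key in ("grading", "prerequisites/corequisite"):
        value = record[key].strip()
        if value and text.endswith(value):
            text = text[: -len(value)].rstrip()
    return text

record = {
    "title": "CMPE 298 - Special Problems",
    "description": (
        "Advanced individual work in computer engineering. Instructor consent. "
        "Not available to Open University Students. Mandatory Credit/No Credit/RP"
    ),
    "prerequisites/corequisite": "Instructor consent. Not available to Open University Students.",
    "grading": "Mandatory Credit/No Credit/RP",
}

cleaned = clean_description(record)
print(cleaned)  # Advanced individual work in computer engineering.
```

Rows whose fields hold a placeholder such as `Grading info not found` are left unchanged, because the placeholder never matches the end of the description.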


In [132]:
save_to_json(CMPE_df, filename="CMPE_courses_dataset.json")

Data saved to CMPE_courses_dataset.json
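
The scraped text contains curly quotation marks (for example in the CMPE 50 prerequisites). With its default settings, `json.dump` writes these as `\uXXXX` escapes, which is valid JSON but harder to read in the file. The sketch below shows an alternative that keeps the characters as-is by passing `ensure_ascii=False` and opening the file as UTF-8. The temporary file name is hypothetical, and the record is the CMPE 50 row from the output above.

```python
import json
import os
import tempfile

records = [{
    "title": "CMPE 50 - Object-Oriented Concepts and Methodology",
    "prerequisites/corequisite": (
        "CMPE 30 with a minimum grade of “C-“. "
        "Computer Engineering and Software Engineering Majors only."
    ),
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "courses_preview.json")  # hypothetical file name
    # ensure_ascii=False writes non-ASCII characters directly instead of \uXXXX escapes
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

reloaded = json.loads(raw_text)
print("“C-“" in raw_text)   # True
print(reloaded == records)  # True
```

Either form loads back to the same Python objects, so this choice affects only how readable the saved file is.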
