<a href="https://colab.research.google.com/github/jeffreyong15/Counsel.NLP/blob/main/Baseline%20Experiment/Data%20Collection/Data_Collection_Jeffrey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Synthetic Academic Advising Dataset

In [None]:
import pandas as pd
import random

In [None]:
# Define categories and templates for prompts and responses
data_templates = {
    "Prerequisites": [
        ("What are the prerequisites for {course}?", "You need to complete {course1} and {course2}."),
        ("Can I take {course} without {course1}?", "No, you need to complete {course1} first."),
        ("Are there prerequisites for {course}?", "Yes, you need to complete {course1} and {course2}.")
    ],
    "Graduation Requirements": [
        ("How many credits do I need to graduate?", "You need a total of {credits} credits."),
        ("What are the core requirements for graduation?", "You must complete core courses in Math, Science, and English."),
        ("Do I need elective credits to graduate?", "Yes, you need at least {elective_credits} elective credits.")
    ],
    "Academic Support": [
        ("Where can I find tutoring services?", "Tutoring services are available at the Academic Resource Center."),
        ("Is there a study group for {course}?", "Yes, check the bulletin board for study group information for {course}."),
        ("How can I get help with assignments?", "You can get help from tutors and your course TA.")
    ],
    "Course Scheduling": [
        ("When is {course} offered?", "{course} is offered every {semester}."),
        ("Are summer courses available?", "Yes, summer courses are available for selected subjects."),
        ("How do I register for next semester?", "You can register through the online portal starting in October.")
    ],
    "Changing Major": [
        ("How can I change my major?", "Meet with an academic advisor to discuss changing your major."),
        ("What are the steps to change my major?", "Fill out a change of major form and get approval from your advisor."),
        ("Can I switch to a double major?", "Yes, you can discuss this option with your advisor.")
    ],
    "Academic Policies": [
        ("What is the grading scale?", "The grading scale is A, B, C, D, and F."),
        ("What happens if I fail a course?", "You should meet with your advisor to discuss options."),
        ("Can I retake a course for a better grade?", "Yes, you can retake a course, and the new grade will replace the old one.")
    ],
    "Senior Project Requirements": [
        ("When should I take the Senior Project course?", "The Senior Project course should be taken in your final semester."),
        ("What is required for the Senior project?", "The Senior project requires a comprehensive research or practical project."),
        ("Is there a prerequisite for the Senior Project course?", "Yes, you need to complete all core courses before the Senior Project course.")
    ]
}

In [None]:
# Generate random values
def random_course_code():
    return f"CS{random.randint(100, 499)}"

def random_credits():
    return random.choice([120, 130, 140])

def random_semester():
    return random.choice(["Fall", "Spring", "Fall and Spring", "Summer"])

In [None]:
num_samples = 10000
rows = []

for _ in range(num_samples):
    category = random.choice(list(data_templates.keys()))
    query_template, response_template = random.choice(data_templates[category])

    course = random_course_code()
    course1 = random_course_code()
    course2 = random_course_code()
    credits = random_credits()
    elective_credits = random.choice([20, 30, 40])
    semester = random_semester()

    query = query_template.format(
        course=course,
        course1=course1,
        course2=course2,
        credits=credits,
        elective_credits=elective_credits,
        semester=semester
    )
    response = response_template.format(
        course=course,
        course1=course1,
        course2=course2,
        credits=credits,
        elective_credits=elective_credits,
        semester=semester
    )

    rows.append((query, response, category))

df = pd.DataFrame(rows, columns=["Prompt", "Response", "Category"])

output_path = "academic_advising_data.csv"
df.to_csv(output_path, index=False)

In [None]:
output_path

'academic_advising_data.csv'
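
The generator above draws from the global `random` state without a seed, so each run produces a different dataset. It also draws `course`, `course1`, and `course2` independently, so a prompt can occasionally name a course as its own prerequisite. The following is a minimal sketch of one way to address both, assuming a fixed seed is acceptable. `SEED` and `distinct_course_codes` are hypothetical names, not part of the notebook.

```python
import random

SEED = 42  # hypothetical value; any fixed integer makes runs repeatable


def distinct_course_codes(rng, k=3):
    """Return k different course codes in the same CS100-CS499 range."""
    numbers = rng.sample(range(100, 500), k)  # sampling without replacement
    return [f"CS{n}" for n in numbers]


rng = random.Random(SEED)  # private generator; leaves the global state alone
course, course1, course2 = distinct_course_codes(rng)

# str.format ignores unused keyword arguments, as in the loop above
template = "Can I take {course} without {course1}?"
print(template.format(course=course, course1=course1, course2=course2))
```

Using a dedicated `random.Random` instance keeps the dataset reproducible even if other cells also call the `random` module.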

## Real Student Academic Dataset

#### Install Required Libraries

In [1]:
!echo | sudo add-apt-repository ppa:saiarcot895/chromium-beta
!sudo apt remove chromium-browser
!sudo snap remove chromium
!sudo apt install chromium-browser -qq
# Installs Chromium, the open-source browser that Chrome is based on. The Chromium WebDriver, which lets Selenium control the browser, is installed in the next cell.


In [2]:
!pip3 install selenium --quiet
!apt-get update
!apt install chromium-chromedriver -qq
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
# Selenium needs a browser driver (here, chromedriver) to communicate with the browser. The chromium-chromedriver package provides it, and copying the binary to /usr/bin puts it on the PATH.

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net

In [3]:
!pip install selenium
!apt-get update
!apt-get install -y chromium-chromedriver

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.lis

#### Import Libraries

In [4]:
import time
import sys
import warnings
import json
import glob
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
from urllib.parse import urljoin
from bs4.element import NavigableString, Tag

#### Load the Chrome WebDriver

In [13]:
# chromedriver was installed by the chromium-chromedriver package in the cells above.
# Chromium runs headless here, so no browser window opens.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# ChromeService tells Selenium where to find the chromedriver binary
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # Discards driver logs; change the path to keep them
)
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)

#### Data Scraping and Collection Function

In [26]:
# Extract course description
def extract_description(course_table):
    description = "Description not found"
    hr_tag = course_table.find('hr')

    if hr_tag:
        description_parts = []
        content_div = course_table.find('div', {'class': None, 'style': None})

        if content_div:
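            # Guard against a content div that has no <hr> of its own
            inner_hr = content_div.find('hr')
            current = inner_hr.next_sibling if inner_hr else None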

            # Skip irrelevant siblings
            while current and (
                (isinstance(current, NavigableString) and not current.strip()) or
                (isinstance(current, Tag) and current.name == 'em' and 'unit' in current.text.lower())
            ):
                current = current.next_sibling

            # Collect description until encountering a stopping keyword
            stop_keywords = ['Lecture', 'Prerequisite(s)', 'Corequisite(s)', 'Grading', 'Note(s)']

            while current:
                if isinstance(current, NavigableString):
                    text = current.strip()
                    if text:
                        description_parts.append(text)
                elif isinstance(current, Tag):
                    if current.name == 'br':
                        next_sibling = current.next_sibling
                        while isinstance(next_sibling, NavigableString) and not next_sibling.strip():
                            next_sibling = next_sibling.next_sibling
                        if isinstance(next_sibling, Tag) and next_sibling.name == 'strong':
                            if any(keyword in next_sibling.text for keyword in stop_keywords):
                                break
                    elif current.name == 'strong' and any(keyword in current.text for keyword in stop_keywords):
                        break
                    elif current.name not in ['em', 'strong']:
                        if current.text and 'unit' not in current.text.lower():
                            description_parts.append(current.text.strip())

                current = current.next_sibling

            if description_parts:
                description = ' '.join(description_parts).strip()

    return description

# Extract course units
def extract_units(course_table):
    hr_tag = course_table.find('hr')
    if hr_tag:
        unit_ems = hr_tag.find_next_siblings('em', limit=2)
        if len(unit_ems) >= 2:
            return f"{unit_ems[0].text.strip()} {unit_ems[1].text.strip()}"
    return 'Units not found'

# Extract class structure (lecture/lab hours)
def extract_class_structure(course_table):
    lecture_lab = course_table.find('em', string=lambda x: x and ('hour' in x.lower() or 'lab' in x.lower()))
    return lecture_lab.text.strip() if lecture_lab else 'Class structure not found'

# Extract only prerequisites
def extract_prerequisites(course_table):
    prerequisites = []

    for strong_tag in course_table.find_all('strong'):
        text = strong_tag.get_text(strip=True)
        if "Prerequisite(s)" in text:
            next_elem = strong_tag.next_sibling
            while next_elem and not (isinstance(next_elem, type(strong_tag)) and next_elem.name == 'strong'):
                if isinstance(next_elem, str):
                    prerequisites.append(next_elem.strip())
                elif hasattr(next_elem, 'get_text'):
                    prerequisites.append(next_elem.get_text(strip=True))
                next_elem = next_elem.next_sibling

    return " ".join(prerequisites).replace(" .", "").strip() if prerequisites else "No prerequisites listed"

# Extract only corequisites
def extract_corequisites(course_table):
    corequisites = []

    for strong_tag in course_table.find_all('strong'):
        text = strong_tag.get_text(strip=True)
        if "Corequisite(s)" in text and "Pre/Corequisite(s)" not in text:
            next_elem = strong_tag.next_sibling
            while next_elem and not (isinstance(next_elem, type(strong_tag)) and next_elem.name == 'strong'):
                if isinstance(next_elem, str):
                    corequisites.append(next_elem.strip())
                elif hasattr(next_elem, 'get_text'):
                    corequisites.append(next_elem.get_text(strip=True))
                next_elem = next_elem.next_sibling

    return " ".join(corequisites).replace(" .", "").strip() if corequisites else "No corequisites listed"

# Extract Pre/Corequisite(s)
def extract_pre_corequisites(course_table):
    pre_corequisites = []

    for strong_tag in course_table.find_all('strong'):
        text = strong_tag.get_text(strip=True)
        if "Pre/Corequisite(s)" in text:
            next_elem = strong_tag.next_sibling
            while next_elem and not (isinstance(next_elem, type(strong_tag)) and next_elem.name == 'strong'):
                if isinstance(next_elem, str):
                    pre_corequisites.append(next_elem.strip())
                elif hasattr(next_elem, 'get_text'):
                    pre_corequisites.append(next_elem.get_text(strip=True))
                next_elem = next_elem.next_sibling

    # Convert list to string, replace non-breaking spaces, and remove unwanted artifacts
    cleaned_text = " ".join(pre_corequisites).replace("\xa0", " ").strip()

    return cleaned_text if cleaned_text else "No pre/corequisites listed"

# Extract grading information
def extract_grading(course_table):
    grading_tag = course_table.find('strong', string=lambda x: x and 'Grading' in x)
    if grading_tag:
        next_elem = grading_tag.next_sibling
        while next_elem and isinstance(next_elem, NavigableString):
            grading_text = next_elem.strip()
            if grading_text:
                return grading_text  # Return only the grading information without the label
            next_elem = next_elem.next_sibling
    return 'Grading info not found'

# Extract Note(s) information
def extract_notes(course_table):
    notes = []

    for strong_tag in course_table.find_all('strong'):
        text = strong_tag.get_text(strip=True)
        if "Note(s)" in text:
            next_elem = strong_tag.next_sibling
            while next_elem:
                # Stop extracting if a <div> element is encountered (prevents unwanted "Close" text)
                if isinstance(next_elem, type(strong_tag)) and next_elem.name == 'div':
                    break
                if isinstance(next_elem, str):
                    notes.append(next_elem.strip())
                elif hasattr(next_elem, 'get_text'):
                    notes.append(next_elem.get_text(strip=True))
                next_elem = next_elem.next_sibling

    return " ".join(notes).replace(" .", "").strip() if notes else "No notes listed"

def expand_all_course_links(driver):
    course_links = driver.find_elements(By.XPATH, "//td[@class='width']/a[contains(@onclick, 'showCourse')]")

    for course_link in course_links:
        try:
            # Click each course link to expand
            ActionChains(driver).move_to_element(course_link).click().perform()
            time.sleep(2)
        except Exception as e:
            print(f"Error clicking course link: {e}")
            continue

def get_total_pages(soup):
    try:
        all_tds = soup.find_all('td')
        page_td = None
        for td in all_tds:
            if 'Page:' in td.get_text():
                page_td = td
                break

        if page_td:
            page_links = page_td.find_all('a')
            if page_links:
                last_page = max(int(link.text.strip()) for link in page_links if link.text.strip().isdigit())
                return last_page
    except Exception as e:
        print(f"Error in page detection: {str(e)}")
    return 1

def count_course_pages(driver, course_filters):
    page_counts = {}

    for course_filter in course_filters:
        # Construct the URL
        url = (f"https://catalog.sjsu.edu/content.php?"
               f"catoid=15&navoid=5382&filter%5B27%5D={course_filter}"
               f"&filter%5Bexact_match%5D=1&filter%5Bitem_type%5D=3"
               f"&filter%5Bonly_active%5D=1")

        try:
            driver.get(url)
            time.sleep(3)

            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')

            total_pages = get_total_pages(soup)
            page_counts[course_filter] = total_pages
            print(f"Found {total_pages} pages for {course_filter}")

        except Exception as e:
            print(f"Error processing {course_filter}: {str(e)}")
            page_counts[course_filter] = 1

    return page_counts

def extract_course_details(driver):
    course_details = []

    expand_all_course_links(driver)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    course_tables = soup.find_all('table', class_='td_dark')

    for course_table in course_tables:
        raw_title = course_table.find('h3').text.strip() if course_table.find('h3') else 'Title not found'
        clean_title = " ".join(raw_title.split())  # Removes weird spaces/non-breaking spaces
        details = {
            'title': clean_title,
            'units': extract_units(course_table),
            'description': extract_description(course_table),
            'class_structure': extract_class_structure(course_table),
            'prerequisite(s)': extract_prerequisites(course_table),
            'corequisite(s)': extract_corequisites(course_table),
            'pre/corequisite(s)': extract_pre_corequisites(course_table),
            'grading': extract_grading(course_table),
            'note(s)': extract_notes(course_table)
        }

        course_details.append(details)

    return course_details

def extract_courses_details(driver, course_filters):  # '-1' for all courses
    all_course_details = []

    # Get the total pages for each course filter
    page_counts = count_course_pages(driver, course_filters)

    for course_filter in course_filters:
        # Get the total pages for the current course filter
        total_pages = page_counts.get(course_filter, 1)

        for page in range(1, total_pages + 1):
            url = (f"https://catalog.sjsu.edu/content.php?catoid=15&navoid=5382&filter%5B27%5D={course_filter}"
                   f"&filter%5Bcpage%5D={page}&filter%5Bexact_match%5D=1&filter%5Bitem_type%5D=3&filter%5Bonly_active%5D=1")

            driver.get(url)
            time.sleep(3)

            course_details = extract_course_details(driver)
            all_course_details.extend(course_details)

    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.width', 1000)
    df = pd.DataFrame(all_course_details)

    df['id'] = range(1, len(df) + 1)
    df = df[['id'] + [col for col in df.columns if col != 'id']]  # Reorder columns to move 'id' to the front

    return df

def extract_all_course_details(driver, course_filters=('-1',), start_page=1, end_page=10, save_interval=10):
    all_course_details = []
    page_counts = count_course_pages(driver, course_filters)

    for course_filter in course_filters:
        total_pages = page_counts.get(course_filter, 1)

        # Clamp to the pages available for this filter without overwriting
        # end_page, so later filters still use the requested range
        last_page = min(end_page, total_pages)

        for page in range(start_page, last_page + 1):
            print(f"Scraping page {page} now...")

            url = (f"https://catalog.sjsu.edu/content.php?catoid=15&navoid=5382&filter%5B27%5D={course_filter}"
                   f"&filter%5Bcpage%5D={page}&filter%5Bexact_match%5D=1&filter%5Bitem_type%5D=3&filter%5Bonly_active%5D=1")

            driver.get(url)
            time.sleep(3)

            course_details = extract_course_details(driver)
            all_course_details.extend(course_details)

            # Save progress after every 'save_interval' pages and on the last page
            if (page - start_page + 1) % save_interval == 0 or page == last_page:
                partial_df = pd.DataFrame(all_course_details)
                partial_df['id'] = range(1, len(partial_df) + 1)
                partial_df = partial_df[['id'] + [col for col in partial_df.columns if col != 'id']]
                # Write the checkpoint to disk (the filename pattern is an assumption)
                save_to_json(partial_df, filename=f"SJSU_courses_partial_{start_page}_{page}.json")

    # Display settings
    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.width', 1000)

    # Create and format the final DataFrame
    df = pd.DataFrame(all_course_details)
    df['id'] = range(1, len(df) + 1)
    df = df[['id'] + [col for col in df.columns if col != 'id']]  # Reorder columns to move 'id' to the front

    return df

def save_to_json(df, filename="SJSU_courses_dataset.json"):
    course_json = df.to_dict(orient='records')

    with open(filename, 'w') as f:
        json.dump(course_json, f, indent=4)

    print(f"Data saved to {filename}")
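
The catalog URLs in `count_course_pages`, `extract_courses_details`, and `extract_all_course_details` are assembled by hand in three places. One way to centralize this is the standard library's `urllib.parse.urlencode`, which percent-encodes the bracketed `filter[...]` keys. In the sketch below, `build_catalog_url` and `BASE_URL` are hypothetical names; the parameter values are copied from the f-strings above.

```python
from urllib.parse import urlencode

BASE_URL = "https://catalog.sjsu.edu/content.php"


def build_catalog_url(course_filter, page=None):
    """Build a catalog search URL; omit `page` for the first results page."""
    params = {
        "catoid": 15,
        "navoid": 5382,
        "filter[27]": course_filter,  # course prefix, or '-1' for all courses
    }
    if page is not None:
        params["filter[cpage]"] = page
    params.update({
        "filter[exact_match]": 1,
        "filter[item_type]": 3,
        "filter[only_active]": 1,
    })
    # urlencode turns '[' and ']' into %5B and %5D, matching the f-strings
    return f"{BASE_URL}?{urlencode(params)}"


print(build_catalog_url("AMS", page=1))
```

Keeping the query parameters in one dictionary means a change to `catoid` or `navoid` for a new catalog year would only need to be made once.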

#### SJSU Catalog Dataset

In [23]:
courses_df = extract_courses_details(driver, course_filters=['AMS'])
courses_df_display = courses_df.drop(columns='id')
courses_df_display

Found 1 pages for AMS


Unnamed: 0,title,units,description,class_structure,prerequisite(s),corequisite(s),pre/corequisite(s),note(s),grading
0,AMS 1A - American Cultures to 1877,6 unit(s),"American culture examined through political, literary, artistic, economic and social development. American values, ideas and institutions from popular culture as well as traditional sources.",Lecture 3 hours/lecture 3 hours,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Entire sequence satisfies GE Areas C1+C2+D+ American Institutions (US123),Letter Graded.
1,AMS 1B - American Cultures 1877 to Present,6 unit(s),"American culture examined through political, literary, artistic, economic and social development. American values, ideas and institutions from popular culture as well as traditional sources.",Lecture 3 hours/lecture 3 hours.,AMS 1A,No corequisites listed,No pre/corequisites listed,Entire sequence satisfies GE Areas C1+C2+D+ American Institutions (US123),Letter Graded.
2,AMS 10 - Stories that Make America,3 unit(s),"Introduces students to the political and historical origins of the U.S., as well as the ways these origins have been mythologized and reimagined in literature and social constructions of public memory. Along with studying primary source documents and archives within their historical context, students learn to analyze literature, popular culture, and public discourse to better understand the uses and misuses of historical memory. Focuses on a 100+ year span of history.",Class structure not found,No prerequisites listed,No corequisites listed,"Completion of, or co‐registration in, ENGL 1A is required.",No notes listed,Letter Graded
3,AMS 11 - Visions of Democracy,3 unit(s),"Foregrounds social movements as a way to understand the impact of the U.S. government and the State of California on their residents. Focus on how both the U.S. and California governments work, how they are interrelated, and how groups of Americans have responded to their experiences of injustice and the failures of democracy through art, literature, social action, and politics. When taken with AMS 10, completes the American Institutions requirements (US1, US2, and US3). Satisfies D: Social Sciences + US2: US Constitution + US3: California Government.Grading: Letter graded.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,No notes listed,Letter graded.
4,AMS 12 - Intro to US Popular Culture: Serious Fun!,3 unit(s),"Focuses on the relationships among power, representation, audience, historical context, and genre in American popular culture through the study of aesthetics, representation, visual technology, and public discourse. Develops a foundational understanding of the histories, practices, and aesthetic strategies that have shaped popular culture in the U.S., and explores the intersections of gender, race, class, sexuality, colonialism, im/migration, and community identity that shape its production and consumption, as well as its potential as a platform for social change.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,No notes listed,Letter graded.
5,AMS 92 - International Program Studies,1-12 unit(s),"Study Abroad and Away transfer credit course. Study Abroad and Away provides students the opportunity to study abroad on long term programs (Exchange Programs, CSU International Programs, and International Student Exchange Programs) and short-term programs (Faculty-Led Programs and Summer School Abroad Programs) for academic credit, offering Alternative Break Programs for cultural immersion, and designing other globally focused opportunities. This course is designated as a placeholder course for Study Abroad and Away programs.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,No notes listed,Mixed Grading
6,AMS 100W - Writing in the Humanities & Interdisciplinary Arts,3 unit(s),"Advanced workshop in composition and reading for the critical and comparative study of the humanities and interdisciplinary arts. Students explore and practice the thinking and writing skills that help us communicate insight, critique, interpretation, description, and analysis of arts, literature, history, and culture, including expository writing and library research. Readings and objects of study include a range of topics and at least two different art forms.",Class structure not found,A3 or equivalent second-semester composition course (with a grade of C- or better); completion of core GE; and upper-division standing. Or Graduate or Postbaccalaureate level.,No corequisites listed,No pre/corequisites listed,Must be passed with C or better to satisfy the CSU Graduation Writing Assessment Requirement (GWAR). Cross-listed with HUM 100W / RELS 100W Humanities is responsible for scheduling.,Letter Graded
7,AMS 129 - The U.S. in a Global Context,3 unit(s),"The United States has been interconnected with and part of a larger Western hemisphere, as well as part of global political, economic, ecological, and cultural networks, from the times of European colonization to the present. This course explores the ways that Indigenous American and U.S. cultures, peoples, and institutions have been globally interconnected. To explore these connections and links, students will analyze a range of texts, artifacts, works of art, etc., from around the world. Topics may include foreign policy, colonization & decolonization, immigration, mercantilism and global capitalism, consumerism, propaganda, mass culture, warfare, environmental colonialism, etc.",Class structure not found,Upper division standing,No corequisites listed,No pre/corequisites listed,No notes listed,Letter Graded
8,AMS 139 - Animals and Society,3 unit(s),"Introduction to the practices of animal observation and the critical methods of the interdisciplinary field of Human-Animal Studies, paying particular attention to intersections among behavior, ecology, space, and critical theories of race, class, gender, colonialism, and power. To engage more deeply with the scientific methods and cultural theories that underpin this course, students engage in field observations and read texts from ethology (the study of animal behavior), evolutionary biology, ecological theory, history, philosophy, literature, and cultural studies.",Class structure not found,"Completion of Core General Education and upper division standing are prerequisites to all SJSU studies courses. Completion of, or co-registration in, 100W is strongly recommended.",No corequisites listed,No pre/corequisites listed,No notes listed,Letter Graded
9,AMS 159 - Nature and World Cultures,3 unit(s),The influence of industrialization and globalization on earth and the environment as seen through culture.,Class structure not found,"Completion of Core General Education, and upper-division standing are prerequisites to all SJSU studies courses. Completion of, or co-registration in, 100W is strongly recommended.",No corequisites listed,No pre/corequisites listed,No notes listed,Letter Graded


In [25]:
save_to_json(courses_df, filename="SJSU_courses_dataset.json")

Data saved to SJSU_courses_dataset.json
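
To confirm that a saved file round-trips cleanly, the records can be read back with `json.load` and checked for the expected keys. The sketch below uses one hand-copied record and a temporary file rather than the real dataset. `EXPECTED_KEYS` and `load_and_check` are hypothetical names, and the sample description is shortened for illustration.

```python
import json
import os
import tempfile

# Column names produced by extract_course_details, plus the added 'id'
EXPECTED_KEYS = {
    "id", "title", "units", "description", "class_structure",
    "prerequisite(s)", "corequisite(s)", "pre/corequisite(s)",
    "grading", "note(s)",
}


def load_and_check(filename):
    """Load course records and raise if any record lacks an expected key."""
    with open(filename, encoding="utf-8") as f:
        records = json.load(f)
    for record in records:
        missing = EXPECTED_KEYS - set(record)
        if missing:
            raise ValueError(f"Record {record.get('id')} is missing {sorted(missing)}")
    return records


sample = [{
    "id": 1,
    "title": "AMS 1B - American Cultures 1877 to Present",
    "units": "6 unit(s)",
    "description": "American culture examined through political, literary, "
                   "artistic, economic and social development.",
    "class_structure": "Lecture 3 hours/lecture 3 hours.",
    "prerequisite(s)": "AMS 1A",
    "corequisite(s)": "No corequisites listed",
    "pre/corequisite(s)": "No pre/corequisites listed",
    "grading": "Letter Graded.",
    "note(s)": "Entire sequence satisfies GE Areas C1+C2+D+ American Institutions (US123)",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_courses.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)
    loaded = load_and_check(path)

print(f"Loaded {len(loaded)} record(s)")
```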


In [27]:
# Run for pages 1 to 10
courses_df1_10 = extract_all_course_details(driver, start_page=1, end_page=10)
courses_df1_10_display = courses_df1_10.drop(columns='id')
courses_df1_10_display

Found 54 pages for -1
Scraping page 1 now...
Scraping page 2 now...
Scraping page 3 now...
Scraping page 4 now...
Scraping page 5 now...
Scraping page 6 now...
Scraping page 7 now...
Scraping page 8 now...
Scraping page 9 now...
Scraping page 10 now...


Unnamed: 0,title,units,description,class_structure,prerequisite(s),corequisite(s),pre/corequisite(s),grading,note(s)
0,KIN 1 - Adapted Physical Activities,1 unit(s),"Structured individualized physical activities to enhance physical/motor fitness and develop an active, health-oriented lifestyle for students unable to participate in the general activity program.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,Movement Area 2 Fitness
1,KIN 2A - Beginning Swimming,1 unit(s),This course is designed for the non-swimmer and beginning swimmer. It is assumed that all students enrolled in the class have had little or no experience in learning the basic skills of swimming. The course is designed to instruct the student in the basic skills necessary to enable him/her to swim safely in deep water. There are no prerequisites for the course.,Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,Movement Area 4 Individual/Dual
2,KIN 2B - Intermediate Swimming,1 unit(s),This course is designed to meet the needs of students who have satisfactorily completed the skills involved in beginning swimming.,Class structure not found,Beginning level or its equivalent.,No corequisites listed,No pre/corequisites listed,Letter Graded,Movement Area 4 Individual/Dual
3,KIN 2C - Advanced Swimming,1 unit(s),This course is designed to refine and extend the development of advanced skills in swimming.,Class structure not found,Intermediate level or its equivalent.,No corequisites listed,No pre/corequisites listed,Letter Graded,Movement Area 4 Individual/Dual
4,KIN 3 - Water Polo,1 unit(s),"Fundamental skills, techniques, strategies, rules, and knowledge necessary to safely and correctly play water polo.",Class structure not found,Beginning level swimming proficiency.,No corequisites listed,No pre/corequisites listed,Letter Graded,Movement Area 5 Team
...,...,...,...,...,...,...,...,...,...
995,BUS 298I - Applied Business Experience Internship,1 unit(s),For the student with a specific internship providing a quality experience that reinforces the curriculum and involves meaningful work. The student must submit a one-page formal proposal to the graduate program director. A final report is required. The internship must qualify as Curricular Practical Training (CPT) for international students.,Class structure not found,Approved advancement to candidacy.,No corequisites listed,No pre/corequisites listed,Mandatory Credit/No Credit/RP,No notes listed
996,BUS 299 - Master’s Thesis,1-4 unit(s),Master’s Thesis Plan A.,Class structure not found,Approval of the instructor and advancement to candidacy. Not available to Open University Students,No corequisites listed,No pre/corequisites listed,Mandatory Credit/No Credit/RP,No notes listed
997,"HSPM 1 - Travel to Learn, Learn to Travel",3 unit(s),"Course examines the relations among tourists, locals, and the tourism industry and how the global tourism industry facilitates diverse travelers’ travel experience from beginning to end. Focus on the industry’s history, growth, development, impacts, trends, technology and career opportunity.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded.,No notes listed
998,HSPM 11 - Restaurant Entrepreneurship,3 unit(s),"The comprehensive process of conceptualizing, planning, starting, and managing a restaurant business. The topics cover business planning, operations, menu planning, staffing, marketing, and customer service.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded.,No notes listed


In [28]:
# Run for pages 11 to 21
courses_df11_21 = extract_all_course_details(driver, start_page=11, end_page=21)
courses_df11_21_display = courses_df11_21.drop(columns='id')
courses_df11_21_display

Found 54 pages for -1
Scraping page 11 now...
Scraping page 12 now...
Scraping page 13 now...
Scraping page 14 now...
Scraping page 15 now...
Scraping page 16 now...
Scraping page 17 now...
Scraping page 18 now...
Scraping page 19 now...
Scraping page 20 now...
Scraping page 21 now...


Unnamed: 0,title,units,description,class_structure,prerequisite(s),corequisite(s),pre/corequisite(s),grading,note(s)
0,HSPM 20 - Sanitation and Environmental Issues in the Hospitality Industry,2 unit(s),"Sanitation in food service, hotel and travel/tourism industries; study of pathogenic organisms and food handling procedures. Occupational health, safety and environmental control in the hospitality industry.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1,HSPM 22 - Catering and Beverage Management,3 unit(s),"Planning and executing catering and buffet functions. Evaluation of alcoholic and non-alcoholic beverages regarding purchasing, storage, preparation, merchandising and regulations.",Misc/Lab: Lecture 2 hours /lab 3 hours.,NUFS 20 or instructor consent.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
2,HSPM 65 - Professional Seminar in Hospitality Mgmt (To Be Inactivated after Fall 2022),1 unit(s),"Designed for students who have declared a major in the Hospitality, Tourism and Event Management degree. The topics selected will facilitate the student’s entry into the academic program and the profession of hospitality management.",Class structure not found,No prerequisites listed,HSPM 1,No pre/corequisites listed,Letter Graded,"This course is part of the teach out plan for the Hospitality, Tourism, and Event Management BS. It will be offered in Fall 2021 and Fall 2022. The course will subsequently be discontinued."
3,HSPM 86 - Special Events Management in Hospitality,3 unit(s),"Hands-on experience in the operation, coordination, and management of special events as they relate to hospitality and tourism. Students develop management skills and experience in planning and execution of a major event.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
4,HSPM 92 - International Program Studies,1-12 unit(s),"Study Abroad and Away transfer credit course. Study Abroad and Away provides students the opportunity to study abroad on long term programs (Exchange Programs, CSU International Programs, and International Student Exchange Programs) and short-term programs (Faculty-Led Programs and Summer School Abroad Programs) for academic credit, offering Alternative Break Programs for cultural immersion, and designing other globally focused opportunities. This course is designated as a placeholder course for Study Abroad and Away programs.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Mixed Grading,No notes listed
...,...,...,...,...,...,...,...,...,...
1094,EDEL 108D - Curriculum: Mathematics,3 unit(s),Elementary school mathematics curriculum and methodology relationships between instructional materials and how children construct knowledge; the role of technology and issues that bear on the teaching of school mathematics. May be repeated for different subtitle.,Class structure not found,Upper division standing.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1095,EDEL 108E - Teaching Reading in Linguistically and Culturally Diverse Classrooms,3 unit(s),"Assessing and teaching reading in diverse classrooms in grades K-8. Integrates research, theory and practice. Requires some classroom observation and working in schools with K-8 students.",Class structure not found,"LING 108 , ENGL 103 or LING 107",No corequisites listed,No pre/corequisites listed,Letter Graded,Should be taken in final semester of undergraduate program.
1096,EDEL 143A - Beginning Student-Teaching (Phase I),1-6 unit(s),"Role of state and local government in education. Clinical observation of classroom, school and district organization. Emphasis on lesson planning.",Class structure not found,EDTE 162 or EDTE 262,No corequisites listed,No pre/corequisites listed,Credit/No Credit,No notes listed
1097,EDEL 143B - Advanced Student-Teaching (Phase II),1-10 unit(s),Practicum in public school classrooms at two grade levels for student teaching experience; includes field and campus seminar. Supervision by College of Education faculty.,Class structure not found,EDEL 143A,No corequisites listed,No pre/corequisites listed,Credit/No Credit,No notes listed


In [29]:
# Run for pages 22 to 32
courses_df22_32 = extract_all_course_details(driver, start_page=22, end_page=32)
courses_df22_32_display = courses_df22_32.drop(columns='id')
courses_df22_32_display

Found 54 pages for -1
Scraping page 22 now...
Scraping page 23 now...
Scraping page 24 now...
Scraping page 25 now...
Scraping page 26 now...
Scraping page 27 now...
Scraping page 28 now...
Scraping page 29 now...
Scraping page 30 now...
Scraping page 31 now...
Scraping page 32 now...


Unnamed: 0,title,units,description,class_structure,prerequisite(s),corequisite(s),pre/corequisite(s),grading,note(s)
0,EDEL 192 - International Program Studies,1-6 unit(s),"Study Abroad and Away transfer credit course. Study Abroad and Away provides students the opportunity to study abroad on long term programs (Exchange Programs, CSU International Programs, and International Student Exchange Programs) and short-term programs (Faculty-Led Programs and Summer School Abroad Programs) for academic credit, offering Alternative Break Programs for cultural immersion, and designing other globally focused opportunities. This course is designated as a placeholder course for Study Abroad and Away programs.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Mixed Grading,No notes listed
1,ENGL 1A - First Year Writing,3 unit(s),"English 1A is an introductory course that prepares students to join scholarly conversations across the university. Students develop reading skills, rhetorical sophistication, and writing styles that give form and coherence to complex ideas for various audiences, using a variety of genres.",Class structure not found,Completion of Reflection on College Writing,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
2,ENGL 1AF - First-Year Writing: Stretch English I,3 unit(s),"Stretch I is the first semester of a year-long ENGL 1A that prepares students to join scholarly conversations across the university. Students develop reading skills, rhetorical sophistication, and writing styles that give form and coherence to complex ideas for various audiences, using various genres.",Class structure not found,Completion of Reflection on College Writing.,No corequisites listed,No pre/corequisites listed,Credit/No Credit,No notes listed
3,ENGL 1AS - First-Year Writing: Stretch English II,3 unit(s),"Stretch II is the second semester of a year-long ENGL 1A that prepares students to join scholarly conversations across the university. Students develop reading skills, rhetorical sophistication, and writing styles that give form and coherence to complex ideas for various audiences, using various genres.",Class structure not found,ENGL 1AF Stretch English I,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
4,ENGL 1B - Argument and Analysis,3 unit(s),"English 1B is a writing course that focuses on argumentation and analysis. Through the study of literary, rhetorical, and professional texts, students will develop the habits of mind required to write argumentative and critical essays. Particular emphasis is placed on writing and reading processes. Students will have repeated practice in prewriting, drafting, revising, and editing, and repeated practice in reading closely in a variety of forms, styles, structures, and modes.",Class structure not found,ENGL 1A or ENGL 1AS with a C- or better.,No corequisites listed,No pre/corequisites listed,Letter Graded,ENGL 1B is not open to students who successfully completed ENGL 2
...,...,...,...,...,...,...,...,...,...
1095,LSTP 85B - Fieldwork in Humanities B,1 unit(s),"Part of a 3-course series (LSTP 85A, B, C) that provides prospective K-8 teachers to California’s public school classrooms with initial exposure to the elementary classroom setting. LSTP 85B provides a dual focus on diversity in the classroom and the development of a teaching philosophy. Readings and field experience familiarize students with the various manifestations of diversity (racial, cultural, linguistic, ability) in the classroom and prepare students to create a teaching philosophy that is responsive to the realities of today’s diverse classrooms. Coursework involves 10 hours of volunteering in a public-school classroom with a credentialed teacher and online assignments.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1096,LSTP 85C - Fieldwork in Humanities C,1 unit(s),"Part of a 3-course series (LSTP 85A, B, C) that provides prospective K-8 teachers to California’s public school classrooms with initial exposure to the elementary classroom setting. LSTP 85C focuses on ethnography as a means of gaining a deeper understanding of classroom observations and classroom dynamics. It combines readings on conducting classroom ethnographies with field experience to produce “Ethnographic Snapshots”. Coursework involves 10 hours of volunteering in a public-school classroom with a credentialed teacher and online assignments.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1097,LSTP 139 - Education and (In)Equality,3 unit(s),"Education and (In)Equality explores the ways in which education cultivates free and empowered individuals that shape democratic egalitarian societies. It also explores the ways education can be used to oppress, dominate, and perpetuate social, economic and political inequalities. SJSU Studies Area: S",Class structure not found,"Completion of Core General Education and upper division standing are prerequisites to all SJSU studies courses. Completion of, or co-registration in, 100W is strongly recommended.",No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1098,LSTP 185 - Field Experience in Humanities,3 unit(s),"Supervised field work for liberal studies and humanities majors. Includes weekly meetings to discuss readings and field work experiences and to reflect upon humanities education, multicultural school settings and other nonprofit agencies and organizations that promote the humanities.",Lecture 3 hours/lab 2 hours.,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Credit/No Credit,No notes listed


In [30]:
# Run for pages 33 to 43
courses_df33_43 = extract_all_course_details(driver, start_page=33, end_page=43)
courses_df33_43_display = courses_df33_43.drop(columns='id')
courses_df33_43_display

Found 54 pages for -1
Scraping page 33 now...
Scraping page 34 now...
Scraping page 35 now...
Scraping page 36 now...
Scraping page 37 now...
Scraping page 38 now...
Scraping page 39 now...
Scraping page 40 now...
Scraping page 41 now...
Scraping page 42 now...
Scraping page 43 now...


Unnamed: 0,title,units,description,class_structure,prerequisite(s),corequisite(s),pre/corequisite(s),grading,note(s)
0,LING 21 - Critical Thinking and Language,3 unit(s),"Exploring systems of language and logic in oral and written discourse, with a focus on the role of shared cultural assumptions, language style and the media of presentation in shaping the form and content of argumentation.",Class structure not found,Completion of GE Area A2 with a grade of C- or better.,No corequisites listed,No pre/corequisites listed,Letter Graded.,No notes listed
1,LING 22 - Language across the Lifespan,3 unit(s),"Introduction to what is known about how people successfully learn second languages, with a focus on physiological, psychological, social-cultural and linguistic factors that affect second language acquisition, and on skills and strategies that promote language learning across the lifespan.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
2,"LING 24 - Language Variation in Space, Time, & Culture",3 unit(s),"Exploring the diverse structural patterns and social functions found in English; analyzing the social, cultural, political, historical, and technological factors underlying language change; developing critical thinking and effective argumentation in writing.",Class structure not found,Completion of GE Area A2 with a grade of C- or better.,No corequisites listed,No pre/corequisites listed,Letter Graded.,No notes listed
3,LING 25 - Languages of the World,3 unit(s),"Explore the diversity of the world’s languages through studying major language families, including basic language structure, language change and historical relationships, the sociocultural aspects of language, typological patterns and universals, writing systems and language vitality and endangerment.",Class structure not found,No prerequisites listed,No corequisites listed,No pre/corequisites listed,Letter Graded.,No notes listed
4,LING 26 - Quantitative Reasoning in Linguistic Diversity,3 unit(s),"An introduction to descriptive and inferential statistics, including the visual interpretation and presentation of data about linguistic diversity and related social phenomena like education and immigration. Interpretation of numerical and graphical data to draw inferences about complex social issues.",Class structure not found,"Mathematics Enrollment Category M-I or M-II; for Categories III and IV, LING 26W is required as a corequisite, unless a GE Area B4 course was previously completed with a grade of C- or better.",No corequisites listed,No pre/corequisites listed,Letter Graded,A grade of C- or better is required to satisfy GE Area B4.
...,...,...,...,...,...,...,...,...,...
1095,PHIL 118 - Latin American Philosophy,3 unit(s),"Analysis of main themes of Latin-American, Mexican and Mexican-American thought.",Class structure not found,3 units of philosophy or upper division standing.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1096,PHIL 119 - Africana Philosophy and Culture,3 unit(s),Philosophical examination of the ideological roots of social movements in black diaspora cultures from Be-Bop to Hip-Hop.,Class structure not found,3 units of philosophy or upper division standing.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1097,PHIL 120 - Comparative Philosophy Theory & Practice,3 unit(s),Examination of how different philosophical traditions (distinguished by culture or by style) via their relevant resources can talk to and learn from each other and make substantial joint contributions to our understanding and treatment of a range of philosophical issues.,Class structure not found,3 units of philosophy or upper division standing,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1098,PHIL 121 - Philosophy and Feminism,3 unit(s),"A philosophical examination of writings that deal with issues of special concern to women, with emphasis on feminist writings.",Class structure not found,3 units of philosophy or upper division standing.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed


In [31]:
# Run for pages 44 to 54
courses_df44_54 = extract_all_course_details(driver, start_page=44, end_page=54)
courses_df44_54_display = courses_df44_54.drop(columns='id')
courses_df44_54_display

Found 54 pages for -1
Scraping page 44 now...
Scraping page 45 now...
Scraping page 46 now...
Scraping page 47 now...
Scraping page 48 now...
Scraping page 49 now...
Scraping page 50 now...
Scraping page 51 now...
Scraping page 52 now...
Scraping page 53 now...
Scraping page 54 now...


Unnamed: 0,title,units,description,class_structure,prerequisite(s),corequisite(s),pre/corequisite(s),grading,note(s)
0,PHIL 126 - Environmental Ethics and Philosophy,3 unit(s),"Extensions and applications of Kantian, Lockean, consequentialist and other philosophical theories of value to problems of the environment such as pollution, global warming, species depletion and overpopulation.",Class structure not found,3 units of philosophy or upper division standing.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1,PHIL 132 - Ethical Theory,3 unit(s),"Theoretical problems in the understanding of right conduct, value, obligation, justice, and virtue.",Class structure not found,3 units of philosophy or upper division standing.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
2,PHIL 133 - Ethics in Science,3 unit(s),"An examination of values and practices in the culture of science. Issues: transmission of values in scientific communities, interactions between scientific and lay communities, historical development of norms of responsible research, cultural influence on scientific values.",Class structure not found,"Completion of Core General Education and upper division standing are prerequisites to all SJSU studies courses. Completion of, or co-registration in, 100W is strongly recommended.",No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
3,"PHIL 134 - Computers, Ethics and Society",3 unit(s),The nature of privacy in a technologically interconnected world; the role of computer technologies in the exercise of the human intellect and imagination with respect to freedom of expression and the social good; rights and responsibilities of intellectual property ownership.,Class structure not found,"Completion of Core General Education and upper division standing are prerequisites to all SJSU studies courses. Completion of, or co-registration in, 100W is strongly recommended.",No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
4,PHIL 137 - Puzzles in Innovation and Law,3 unit(s),Issues surrounding innovation: what exactly innovation is; whether it is created or discovered; why the law should protect it; whether innovators have special natural claims over their innovations; and whether the law ought to aim at a set of rules regarding innovation that maximizes societal well-being.,Class structure not found,Upper division standing or instructor consent.,No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
...,...,...,...,...,...,...,...,...,...
1043,ZOOL 115 - Invertebrate Zoology and Natural History,4 unit(s),"The evolution, distribution, structure, natural history and systematics of invertebrates other than insects.",Lecture 2 hours/lab-field trips 6 hours.,BIOL 115 or BIOL 118 (with a grade of “C” or better). Must be a declared major in Biological Sciences; other majors with instructor consent.,No corequisites listed,No pre/corequisites listed,Letter Graded.,No notes listed
1044,ZOOL 116 - Vertebrate Evolution and Natural History,4 unit(s),"Origin, evolution, distribution and natural history of the vertebrates. Development, reproductive patterns, anatomy, morphology, behavior, ecology and systematics.",Lecture 2 hours/lab 6 hours with several field trips.,"BIOL 118 (with a grade of “C” or better), or instructor consent. Must be a declared major in Biological Sciences; other majors with instructor consent.",No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1045,ZOOL 143 - Biogeography,3 unit(s),"Examination of the patterns of biodiversity over space and through time. Using data and models from a variety of sources including botany, zoology, ecology, evolutionary biology, paleontology, and geology, effects of isolation, elevation, and latitude are examined to understand spatial patterns of biodiversity.",Class structure not found,"BIOL 31 or equivalent (with a grade of “C-” or better), or instructor consent. Must be a declared Biology Major (all).",No corequisites listed,No pre/corequisites listed,Letter Graded,No notes listed
1046,ZOOL 180 - Individual Studies,1-4 unit(s),Advanced lab work in special fields. Course is repeatable for credit.,Class structure not found,Instructor consent. Must be a declared Biology Major (all).,No corequisites listed,No pre/corequisites listed,Credit/No Credit,No notes listed

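The batch cells above all follow the same pattern: pick a start and end page, call `extract_all_course_details`, and keep the result. A minimal sketch of how the page ranges could be generated instead of typed by hand is shown here. The helper name `page_batches` and the fixed batch size are assumptions for illustration, so the ranges it produces differ slightly from the hand-picked ones used in this notebook.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive (start_page, end_page) tuples covering 1..total_pages.

    Hypothetical helper, not part of the original notebook.
    """
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    start = 1
    while start <= total_pages:
        # The last batch is clipped so it never runs past the final page.
        end = min(start + batch_size - 1, total_pages)
        yield (start, end)
        start = end + 1


# The scraper reported 54 pages; batches of 11 pages give five ranges.
batches = list(page_batches(54, 11))
print(batches)

# Possible use with the notebook's scraper (assumes `driver` is still open):
# frames = [extract_all_course_details(driver, start_page=s, end_page=e)
#           for s, e in batches]
```

Scraping in batches keeps each run short, so a failure partway through only costs one batch rather than the whole catalog.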

In [32]:
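# Collect the per-batch DataFrames in page order
df_list = [courses_df1_10, courses_df11_21, courses_df22_32, courses_df33_43, courses_df44_54]

# Concatenate the batches into one DataFrame with a fresh 0..n-1 index
merged_df = pd.concat(df_list, ignore_index=True)

# Write the merged dataset to disk
save_to_json(merged_df, filename="SJSU_courses_dataset.json")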

Data saved to SJSU_courses_dataset.json
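
The scraped rows shown above contain placeholder strings such as `No prerequisites listed`, inconsistent grading labels (`Letter Graded` vs. `Letter Graded.`), and units stored as text like `1-4 unit(s)`. The following is a rough sketch of one way to normalize a single record before using it downstream. It assumes each course is available as a plain dict keyed by the column names in the tables above; the function names `parse_units` and `clean_record` are hypothetical, and the exact layout written by `save_to_json` is not shown in this notebook.

```python
import re

# Placeholder strings observed in the scraped tables above.
PLACEHOLDERS = {
    "Class structure not found",
    "No prerequisites listed",
    "No corequisites listed",
    "No pre/corequisites listed",
    "No notes listed",
}


def parse_units(text):
    """Turn '3 unit(s)' or '1-4 unit(s)' into a (min, max) tuple of ints.

    Returns None when the text does not match (assumes whole-number units).
    """
    match = re.match(r"\s*(\d+)(?:\s*-\s*(\d+))?\s*unit", text)
    if match is None:
        return None
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return (low, high)


def clean_record(record):
    """Return a copy of one course record with placeholders set to None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value in PLACEHOLDERS:
                value = None
        cleaned[key] = value
    # Unify 'Letter Graded.' and 'Letter Graded'.
    if cleaned.get("grading"):
        cleaned["grading"] = cleaned["grading"].rstrip(".")
    # Replace the units string only when it parses; otherwise keep the text.
    if isinstance(cleaned.get("units"), str):
        parsed = parse_units(cleaned["units"])
        if parsed is not None:
            cleaned["units"] = parsed
    return cleaned


example = {
    "title": "BUS 299 - Master’s Thesis",
    "units": "1-4 unit(s)",
    "grading": "Letter Graded.",
    "note(s)": "No notes listed",
}
print(clean_record(example))
```

Mapping placeholders to `None` makes it easier to tell courses with no prerequisites apart from courses whose prerequisite text simply was not found on the page.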
