Throughout the project, our analysis uses data from RateMyProfessors and Reddit. Because the web-scraping code is long and takes a while to run, we split it out into this notebook. All of the scraped data is written to CSV files.

In [6]:
# imports for the entire project

import requests
import bs4   # also needs html5lib
import string
import pandas
import csv

The function below scrapes posts from both the UVA and Virginia Tech subreddits. Its parameters are the name of the subreddit, the name of the .csv file to produce, and the number of posts to collect. For this analysis, we collect 500 posts per subreddit. The data is stored in "VirginiaTechPosts.csv" and "UVAPosts.csv".

The following YouTube tutorial was used to help implement this functionality: https://www.youtube.com/watch?v=MpFvkptjekk

In [4]:
# imports for the Reddit web-scraping API

import praw
import time

In [None]:
def scrape_subreddit_posts_to_csv(subreddit_name, filename="subreddit_posts.csv", num_posts=100):

    # Fill in the credentials you acquire from reddit.com
    try:
        reddit = praw.Reddit(
            client_id="",
            client_secret="",
            user_agent="",
        )

        subreddit = reddit.subreddit(subreddit_name)

        posts = []
        start_time = time.time()
        rate_limit = 100
        for i, post in enumerate(subreddit.new(limit=num_posts)):
            posts.append(post)

            # Abiding by the crawl limit so we don't get booted
            elapsed_time = time.time() - start_time
            if (i + 1) % rate_limit == 0:  # Check if we've reached the rate limit
                if elapsed_time < 60:
                    time.sleep(60 - elapsed_time)  # Wait for the remaining time
                start_time = time.time()  # Reset the timer

        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'url', 'selftext', 'score', 'num_comments', 'created_utc']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            writer.writeheader()
            for post in posts:
                writer.writerow({
                    'title': post.title,
                    'url': post.url,
                    'selftext': post.selftext,
                    'score': post.score,
                    'num_comments': post.num_comments,
                    'created_utc': post.created_utc,
                })

        # Report the number actually collected, which can be lower than num_posts
        print(f"Successfully saved {len(posts)} posts to {filename}")

    except Exception as e:
        print(f"An error occurred: {e}")
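
The pacing logic inside the loop above can also be read on its own. The following is a minimal sketch, not part of the original notebook: `paced` is a hypothetical helper, and its `clock` and `sleep` parameters exist only so the behavior can be checked without actually waiting. It pauses after every `per_window` items until `window_seconds` have passed since the window began, which is the same idea as the `rate_limit` check above.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    items: Iterable[T],
    per_window: int = 100,
    window_seconds: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items, pausing after every `per_window` items.

    Hypothetical helper for illustration: after each batch of `per_window`
    items, wait out whatever is left of the current time window.
    """
    window_start = clock()
    for count, item in enumerate(items, start=1):
        yield item
        if count % per_window == 0:
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)  # wait for the rest of the window
            window_start = clock()  # start a new window


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
naps = []


def fake_clock() -> float:
    return fake_now[0]


def fake_sleep(seconds: float) -> None:
    naps.append(seconds)
    fake_now[0] += seconds


collected = list(
    paced(range(5), per_window=2, window_seconds=10.0,
          clock=fake_clock, sleep=fake_sleep)
)
```

In the scraper, the same helper could wrap the post listing, for example `for post in paced(subreddit.new(limit=num_posts)):`, assuming the default clock and sleep functions.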

In [None]:
# Scrapes the Virginia Tech Reddit Posts
subreddit_name = "VirginiaTech"
scrape_subreddit_posts_to_csv(subreddit_name, filename="VirginiaTechPosts.csv", num_posts=500)

In [None]:
# Scrapes the UVA Reddit posts
subreddit_name = "UVA"
scrape_subreddit_posts_to_csv(subreddit_name, filename="UVAPosts.csv", num_posts=500)
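
Once the two files exist, they can be read back for analysis. The sketch below is an illustration rather than part of the original pipeline: `read_posts` is a hypothetical helper, the sample row is made up, and it assumes `created_utc` holds epoch seconds, as written by the scraper above.

```python
import csv
import io
from datetime import datetime, timezone


def read_posts(csv_file):
    """Read rows in the scraper's CSV layout and parse the numeric fields.

    Hypothetical helper: adds a timezone-aware `created` datetime to each row.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        row["score"] = int(row["score"])
        row["num_comments"] = int(row["num_comments"])
        # created_utc is assumed to be seconds since the Unix epoch
        row["created"] = datetime.fromtimestamp(
            float(row["created_utc"]), tz=timezone.utc
        )
        rows.append(row)
    return rows


# Made-up sample in the same column layout as the scraper's output.
sample = io.StringIO(
    "title,url,selftext,score,num_comments,created_utc\n"
    "Example post,https://example.com,,12,3,1700000000.0\n"
)
posts = read_posts(sample)
print(posts[0]["title"], posts[0]["created"].isoformat())
```

For the real files, the same function would be called on a handle opened with `open("UVAPosts.csv", newline="", encoding="utf-8")`, matching how the scraper wrote them.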

To gain more insight into the RateMyProfessors comments, we extracted the overall rating attached to each comment. With this, we hope to explore whether there is a relationship between the sentiment of a comment and its rating.

Starting off, we used Selenium to scrape these ratings and attach them to the same comments we used in our initial analysis. This data is then stored in "VTRMPNEW.csv" and "UVARMPNEW.csv".

In [None]:
# Virginia Tech data
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Initialize the WebDriver

driver = webdriver.Safari()

# Open the target RateMyProfessors page
url = "https://www.ratemyprofessors.com/school/1349"
driver.get(url)

# Wait for the page to load initially
time.sleep(2)

# List to store all the reviews
reviews = []

# Define a function to scrape reviews after clicking "Show More"
def scrape_reviews():

    # Find all review elements
    rating_elements = driver.find_elements(By.CLASS_NAME, 'GradeSquare__ColoredSquare-sc-6d97x2-0')
    review_elements = driver.find_elements(By.CLASS_NAME, 'SchoolRating__RatingComment-sb9dsm-6')

    # Extract the text of each review
    for rating, review in zip(rating_elements[12:], review_elements):
        reviews.append((float(rating.text), review.text))



# Try to click the "Show More" button and load more reviews
# Loop to click the "Show More" button multiple times
clicks = 10
while clicks != 0:
    try:
        # Find the "Show More" button by its XPath
        show_more_button = driver.find_element(By.XPATH, '//*[@id="root"]/div/div/main/div[3]/div[1]/button')

        # Click the "Show More" button
        show_more_button.click()

        # Wait for the content to load after clicking
        time.sleep(2)  # Adjust sleep time based on the page's loading speed


        clicks -= 1

    except Exception as e:
        # If there's no more "Show More" button (end of the reviews), break the loop
        print("No more 'Show More' button found or end of reviews.")
        break

scrape_reviews()


# Print the reviews collected
for review in reviews:
    print(review)


# Close the browser window
driver.quit()

df_vtrmp_new = pandas.DataFrame(reviews, columns=["Rating", "Review"])
df_vtrmp_new.to_csv("VTRMPNEW.csv", index=False, encoding='utf-8')
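
The pairing step in `scrape_reviews` can be tried without a browser. This is a rough sketch under stated assumptions: `pair_ratings_with_reviews` is a hypothetical helper, the sample strings are made up, and `skip=12` simply mirrors the `[12:]` slice in the cell above (the notebook does not say what the first 12 rating squares are). It also assumes a rating square may contain non-numeric text, in which case that pair is dropped rather than raising an error.

```python
def pair_ratings_with_reviews(rating_texts, review_texts, skip=12):
    """Pair each review with its rating, ignoring the first `skip` rating squares.

    Hypothetical helper: mirrors zip(rating_elements[12:], review_elements)
    but works on plain strings and tolerates non-numeric rating text.
    """
    pairs = []
    for rating_text, review_text in zip(rating_texts[skip:], review_texts):
        try:
            rating = float(rating_text)
        except ValueError:
            continue  # assumption: skip squares whose text is not a number
        pairs.append((rating, review_text))
    return pairs


# Made-up input: 12 leading squares, then one square per review.
rating_texts = ["4.1"] * 12 + ["5.0", "N/A", "3.0"]
review_texts = ["Great campus", "No rating shown", "Food could be better"]
pairs = pair_ratings_with_reviews(rating_texts, review_texts)
```

Note that `zip` stops at the shorter sequence, so a mismatch between the number of ratings and reviews is silently truncated, both here and in the original cell.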

In [None]:
# UVA Data
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Initialize the WebDriver

driver = webdriver.Safari()

# Open the target RateMyProfessors page
url = "https://www.ratemyprofessors.com/school/1277"
driver.get(url)

# Wait for the page to load initially
time.sleep(2)

# List to store all the reviews
reviews = []

# Define a function to scrape reviews after clicking "Show More"
def scrape_reviews():

    # Find all review elements
    rating_elements = driver.find_elements(By.CLASS_NAME, 'GradeSquare__ColoredSquare-sc-6d97x2-0')
    review_elements = driver.find_elements(By.CLASS_NAME, 'SchoolRating__RatingComment-sb9dsm-6')

    # Extract the text of each review
    for rating, review in zip(rating_elements[12:], review_elements):
        reviews.append((float(rating.text), review.text))


# Try to click the "Show More" button and load more reviews
# Loop to click the "Show More" button multiple times
clicks = 10
while clicks != 0:
    try:
        # Find the "Show More" button by its XPath
        show_more_button = driver.find_element(By.XPATH, '//*[@id="root"]/div/div/main/div[3]/div[1]/button')

        # Click the "Show More" button
        show_more_button.click()

        # Wait for the content to load after clicking
        time.sleep(2)  # Adjust sleep time based on the page's loading speed


        clicks -= 1

    except Exception as e:
        # If there's no more "Show More" button (end of the reviews), break the loop
        print("No more 'Show More' button found or end of reviews.")
        break

scrape_reviews()


# Print the reviews collected
for review in reviews:
    print(review)


# Close the browser window
driver.quit()

df_uvarmp_new = pandas.DataFrame(reviews, columns=["Rating", "Review"])
df_uvarmp_new.to_csv("UVARMPNEW.csv", index=False, encoding='utf-8')

Now we want to assess how the 10 different aspects of student life (Food, Facilities, Reputation, Happiness, Safety, Opportunities, Clubs, Social, Internet, Location) factor into these reviews. To do this, we collect the rating for each attribute attached to each individual RateMyProfessors comment. Each rating is displayed as a bar of 5 segments, so we record the color class of each segment and count a segment as unfilled when it is gray. The rating is 5 minus the number of gray segments. For example, if no bar segments are gray (unfilled), then the attribute has received a rating of 5.

These results are then stored in CSV files named "all_vt.csv" and "all_uva.csv".
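
As a small worked example of the filled-versus-unfilled logic, the sketch below converts one bar's segment classes into a rating. The three gray class names are the ones used later in this notebook. `rating_from_segments` and the `"filled"` placeholder are hypothetical names used only for illustration.

```python
GRAY_CLASSES = {"gytHZE", "guPMgv", "dDIWaC"}  # gray (unfilled) class names used in this notebook
TOTAL_SEGMENTS = 5  # each attribute bar has 5 segments


def rating_from_segments(segment_classes):
    """Return the 0-5 rating for one bar: total segments minus gray ones."""
    unfilled = sum(1 for cls in segment_classes if cls in GRAY_CLASSES)
    return TOTAL_SEGMENTS - unfilled


# "filled" stands in for whatever non-gray class the page uses.
full_bar = ["filled"] * 5
partial_bar = ["filled", "filled", "filled", "gytHZE", "gytHZE"]
empty_bar = ["dDIWaC"] * 5

print(rating_from_segments(full_bar))
print(rating_from_segments(partial_bar))
print(rating_from_segments(empty_bar))
```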

In [None]:
import requests
import bs4   # also needs html5lib
from bs4 import BeautifulSoup
import os
import time
import html5lib
import pandas as pd
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()


# Load the page
driver.get("https://www.ratemyprofessors.com/school/1349")
time.sleep(2)


show_more_xpath = '//*[@id="root"]/div/div/main/div[3]/div[1]/button'

# Click "Show More" if available
for iteration in range(10):
    try:
        # Wait until the button is clickable
        wait = WebDriverWait(driver, 10)
        btn = wait.until(EC.element_to_be_clickable((By.XPATH, show_more_xpath)))

        driver.execute_script(
            "var iframe = document.getElementsByName('IL_SR_FRAME2');"
            "if(iframe.length > 0){ iframe[0].remove(); }"
        )

        # Scroll the button into view
        driver.execute_script("arguments[0].scrollIntoView(true);", btn)
        time.sleep(1)  # Allow time for scrolling

        # Attempt to click the button
        btn.click()
        print(f"Clicked 'Show More' button ({iteration+1}/10)")
        time.sleep(2)  # Wait for new content to load

    except Exception as e:
        print(f"Error clicking 'Show More' at iteration {iteration+1}: {e}")
        try:
            # Fallback: Use JavaScript click if normal click fails

            btn = driver.find_element(By.XPATH, show_more_xpath)
            driver.execute_script("arguments[0].click();", btn)
            print(f"JavaScript clicked 'Show More' button at iteration {iteration+1}")
            time.sleep(2)
        except Exception as js_e:
            print(f"JavaScript click also failed at iteration {iteration+1}: {js_e}")
            break

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "div.DisplaySlider__DisplaySliderContainer-sc-6etfq5-0")
))

# Extract updated HTML content
html = driver.page_source

driver.quit()


In [None]:
soup = bs4.BeautifulSoup(html, 'html5lib')

comment_containers = soup.find_all('div', class_="SchoolRatingSummary__SchoolRatingSummaryContainer-sc-50tcmg-0")

all_comments_data = []

for idx, comment in enumerate(comment_containers):
    # For each comment container, find the 10 slider sections.
    # Here, we assume each slider section has the class "DisplaySlider__DisplaySliderContainer-sc-6etfq5-0".
    slider_sections = comment.find_all('div', class_="DisplaySlider__DisplaySliderContainer-sc-6etfq5-0")

    if len(slider_sections) != 10:
        # Optionally, print a message if the expected number of slider sections is not found.
        print(f"Comment {idx}: Expected 10 slider sections but found {len(slider_sections)}. Skipping this comment.")
        continue  # or handle it differently if needed

    # Dictionary to hold the data for the current comment.
    comment_data = {}

    for section in slider_sections:
        label_element = section.find('div', class_="DisplaySlider__DisplaySliderLabel-sc-6etfq5-1")
        label = label_element.text.strip() if label_element else "Unknown Label"

        slider_boxes = section.find_all('div', class_="DisplaySlider__DisplaySliderBox-sc-6etfq5-3")

        # Keep the last CSS class of each bar segment; it identifies the segment's color (gray = unfilled)
        values = [box.get('class')[-1] for box in slider_boxes]

        comment_data[label] = values if values else "N/A"

    all_comments_data.append(comment_data)

df = pd.DataFrame(all_comments_data)
print(df)

In [None]:
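# Converts one bar's list of segment color classes into a 0-5 rating
def count_grays(color_list):
    gray_values = {"gytHZE", "guPMgv", "dDIWaC"} # Set of known gray (unfilled) class names
    if not isinstance(color_list, list): # Missing sliders (e.g. the "N/A" placeholder) get no rating
        return None
    return 5 - sum(1 for color in color_list if color in gray_values) # Rating = 5 minus the gray segments

# Note: pandas 2.1+ deprecates DataFrame.applymap in favor of DataFrame.map
df_gray_counts = df.applymap(count_grays) # Gets the level of each datapoint in the frame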

df_gray_counts

In [None]:
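# ASSUMPTION: vt_data is not defined in this notebook; we assume it is the table saved as "VTRMPNEW.csv"
vt_data = pd.read_csv("VTRMPNEW.csv")
# Concat along columns (axis=1); rows are matched by index
result = pd.concat([vt_data, df_gray_counts], axis=1)
result.to_csv("all_vt.csv", index=False)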

In [None]:
driver = webdriver.Chrome()


# Load the page
driver.get("https://www.ratemyprofessors.com/school/1277")
time.sleep(2)


show_more_xpath = '//*[@id="root"]/div/div/main/div[3]/div[1]/button'

# Click "Show More" if available
for iteration in range(10):
    try:
        # Wait until the button is clickable
        wait = WebDriverWait(driver, 10)
        btn = wait.until(EC.element_to_be_clickable((By.XPATH, show_more_xpath)))

        driver.execute_script(
            "var iframe = document.getElementsByName('IL_SR_FRAME2');"
            "if(iframe.length > 0){ iframe[0].remove(); }"
        )

        # Scroll the button into view
        driver.execute_script("arguments[0].scrollIntoView(true);", btn)
        time.sleep(1)  # Allow time for scrolling

        # Attempt to click the button
        btn.click()
        print(f"Clicked 'Show More' button ({iteration+1}/10)")
        time.sleep(2)  # Wait for new content to load

    except Exception as e:
        print(f"Error clicking 'Show More' at iteration {iteration+1}: {e}")
        try:
            # Fallback: Use JavaScript click if normal click fails

            btn = driver.find_element(By.XPATH, show_more_xpath)
            driver.execute_script("arguments[0].click();", btn)
            print(f"JavaScript clicked 'Show More' button at iteration {iteration+1}")
            time.sleep(2)
        except Exception as js_e:
            print(f"JavaScript click also failed at iteration {iteration+1}: {js_e}")
            break

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "div.DisplaySlider__DisplaySliderContainer-sc-6etfq5-0")
))

# Extract updated HTML content
html = driver.page_source

driver.quit()


In [None]:

soup = bs4.BeautifulSoup(html, 'html5lib')

comment_containers = soup.find_all('div', class_="SchoolRatingSummary__SchoolRatingSummaryContainer-sc-50tcmg-0")

all_comments_data = []

for idx, comment in enumerate(comment_containers):
    # For each comment container, find the 10 slider sections.
    # Here, we assume each slider section has the class "DisplaySlider__DisplaySliderContainer-sc-6etfq5-0".
    slider_sections = comment.find_all('div', class_="DisplaySlider__DisplaySliderContainer-sc-6etfq5-0")

    if len(slider_sections) != 10:
        # Optionally, print a message if the expected number of slider sections is not found.
        print(f"Comment {idx}: Expected 10 slider sections but found {len(slider_sections)}. Skipping this comment.")
        continue  # or handle it differently if needed

    # Dictionary to hold the data for the current comment.
    comment_data = {}

    for section in slider_sections:
        label_element = section.find('div', class_="DisplaySlider__DisplaySliderLabel-sc-6etfq5-1")
        label = label_element.text.strip() if label_element else "Unknown Label"

        slider_boxes = section.find_all('div', class_="DisplaySlider__DisplaySliderBox-sc-6etfq5-3")

        # Keep the last CSS class of each bar segment; it identifies the segment's color (gray = unfilled)
        values = [box.get('class')[-1] for box in slider_boxes]

        comment_data[label] = values if values else "N/A"

    all_comments_data.append(comment_data)

df = pd.DataFrame(all_comments_data)
print(df)

In [None]:
df_gray_counts = df.applymap(count_grays) # Gets the level of each datapoint in the frame

df_gray_counts

In [None]:
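# ASSUMPTION: uva_data is not defined in this notebook; we assume it is the table saved as "UVARMPNEW.csv"
uva_data = pd.read_csv("UVARMPNEW.csv")
# Concat along columns (axis=1); rows are matched by index
result = pd.concat([uva_data, df_gray_counts], axis=1)
result.to_csv("all_uva.csv", index=False)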