
Harvard Gems

Screenshot of the Harvard Gems website

Web scraper, data analysis and website for the best classes at Harvard.

If you found it useful, you can

"Buy Me A Coffee"

Further analytics

Course ratings correlate well with recommendation scores.

Course score vs recommendation score graph

Course ratings also correlate well with lecturer scores, but with more scatter.

Course score vs lecturer score graph

Sentiment analysis of the course comments also agrees well with the average course rating.

Course score vs sentiment score graph

Most high-scoring courses have low workload.

Course score vs workload score graph

Harvard classes tend to have high ratings. It is rare for a class to get a low score.

Histogram of the courses by rating

Most Harvard classes demand around 5 hours of work per week outside of class, though the distribution is skewed, so some classes have a much higher workload.

Histogram of the courses by workload hours

There is little correlation between the number of students in a class and its score.

Course score vs number of students graph

More analysis and the code for the graphs can be found in this Colab notebook. A copy of the notebook is also available in this repo as course_ratings_analysis.ipynb. Remember to upload verbose_course_ratings.csv if you want to tinker with it.
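
If you only want a quick look at the numbers without opening the notebook, the correlations above can be reproduced in a few lines of pandas. This is a minimal sketch, not the notebook's code: the column names below are assumptions, so check the header of verbose_course_ratings.csv for the names actually used.

```python
# Minimal sketch: reproduce the correlation checks with pandas.
# The column names are assumptions; inspect the CSV header (df.columns)
# for the names actually used in verbose_course_ratings.csv.
import pandas as pd

df = pd.read_csv("verbose_course_ratings.csv")

# Pearson correlation between course rating and recommendation score
print(df["course_score"].corr(df["recommendation_score"]))

# Correlation matrix across the score columns of interest
cols = ["course_score", "recommendation_score", "lecturer_score", "workload_score"]
print(df[cols].corr())
```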

Website

The code for the website can be found at this repo. The current repo is for the scraping and analytics.

Installation

pip install -r requirements.txt

Usage

You probably don't need to follow the steps below, since the results are already available in verbose_course_ratings.csv, but here is a step-by-step guide on how to create that CSV from scratch.

  1. Download the webpage from the link in scrapper.py as an HTML file named QReports.html. Run scrapper.py to scrape the links to the QGuides for each course. The links will be stored in courses.csv.
  2. Visit any QGuide link scraped into courses.csv to get the cookie (see the code of downloader.py and search for secret_cookie) and paste it into a new file named secret_cookie.txt. Run downloader.py to download all the QGuides using the links scraped in the previous step. The QGuides will be stored in the QGuides folder (see the download sketch after this list for the general idea).
  3. Run analyzer.py to generate course_ratings.csv.
  4. Now we need to add details like the divisional requirement and whether a class fulfils Quantitative Reasoning with Data (QRD), but most importantly whether the class is offered in Fall 2024 (the QGuides are for Fall 2023). Run myharvarddriver.py, which uses Selenium to get these details from my.harvard.edu. Depending on your machine, you might need more setup to use Selenium, so check out the official guide. The webpage for each class will be stored as an HTML file in the myharvard folder. This is the longest step (around 1.5 hours); I usually leave it running overnight. The Selenium sketch after this list shows the general shape of the loop.
  5. New in Fall 2024, some classes have sections to be chosen during registration, like CHNSE 130 and EXPOS 40. Run rescrape.py to handle these cases, which require an additional click.
  6. Process these webpages to get the data by running append_details.py. This will generate verbose_course_ratings.csv as required.
  7. Start a Jupyter notebook session (jupyter notebook) and open course_ratings_analysis.ipynb. Running it will generate the graphs above and the data in output_data. Follow along in the notebook and play around!
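
For readers curious what step 2 amounts to, here is a hedged sketch of downloading the QGuides with the saved cookie. It is not the repo's downloader.py: the raw Cookie header and the courses.csv column name are assumptions, so check the real script for the details.

```python
# Hedged sketch of the QGuide download step (not the actual downloader.py).
# Assumptions: the cookie is sent as a raw "Cookie" header, and courses.csv
# has a column named "qguide_url".
import csv
import pathlib
import requests

cookie = pathlib.Path("secret_cookie.txt").read_text().strip()
out_dir = pathlib.Path("QGuides")
out_dir.mkdir(exist_ok=True)

with open("courses.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        url = row["qguide_url"]  # assumed column name
        resp = requests.get(url, headers={"Cookie": cookie}, timeout=30)
        resp.raise_for_status()
        (out_dir / f"qguide_{i}.html").write_text(resp.text, encoding="utf-8")
```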
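
And here is the general shape of the Selenium loop in step 4, again only a sketch: the my.harvard search URL and the courses.csv column name are placeholders, and the real myharvarddriver.py uses its own selectors and waits.

```python
# Hedged sketch of the my.harvard scraping loop (not the actual myharvarddriver.py).
# The search URL pattern and the "course_code" column are hypothetical placeholders.
import csv
import pathlib
import time
from selenium import webdriver

driver = webdriver.Chrome()  # may need extra driver setup; see the official Selenium guide
out_dir = pathlib.Path("myharvard")
out_dir.mkdir(exist_ok=True)

with open("courses.csv", newline="") as f:
    for row in csv.DictReader(f):
        code = row["course_code"]  # assumed column name
        driver.get(f"https://my.harvard.edu/search?q={code}")  # placeholder URL pattern
        time.sleep(5)  # crude wait for the page to render; explicit waits are more robust
        safe_name = code.replace(" ", "_").replace("/", "-")
        (out_dir / f"{safe_name}.html").write_text(driver.page_source, encoding="utf-8")

driver.quit()
```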

About

Web scraper, data analytics and sentiment analysis of Harvard FAS classes from my.harvard and the Q Guide, for the Harvard Gems website.
