In [3]:
##########################################################################
##### Template code and instructions to scrape Rate My Professor Comments#
##########################################################################
### Step 1: install and load packages
##### Packages must be installed and loaded because they provide the built-in functions needed to run complex tasks.
## They are one of the biggest time savers available; without them, scripts would become overly long and
## duplicative.

## install pkgs 
import sys
!{sys.executable} -m pip install numpy
# !{sys.executable} -m pip install requests # this pattern can be used to install any package from within anaconda/jupyter notebook
### I believe the packages below are available by default in an Anaconda install
import requests # downloads web pages
from bs4 import BeautifulSoup # parses the downloaded html
import itertools # standard-library helpers for efficient looping
import pandas as pd # necessary for reading in, creating, and manipulating data frames
import csv ## standard-library module for reading/writing csvs
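
Before running the imports, it can help to check which packages Python can actually find. The cell below is a minimal sketch using only the standard library; the list of names simply mirrors the imports in step 1, and anything reported as missing could then be installed with the pip pattern shown above.

```python
import importlib.util

# Names as they are imported in step 1 (bs4 is the import name of BeautifulSoup).
needed = ["requests", "bs4", "pandas", "itertools", "csv"]

status = {}
for name in needed:
    # find_spec returns None when the package cannot be located
    status[name] = importlib.util.find_spec(name) is not None
    print(name, "available" if status[name] else "missing")
```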



In [4]:
### Step 2: set the url you will be scraping from, the school, and the prof's name
## note: This is the section you will update by hand most often. The rest should be automated.

## url to use: the prof's page; only the numeric code at the end changes between profs, and it must be changed manually
url = 'https://www.ratemyprofessors.com/professor/1190096'

## college 
college = "OHIO STATE UNIVERSITY" # change as needed 

##prof last name 
prof_lastname = "BOWEN"

# prof first name 
prof_firstname = "RACHEL"

## note: you will want to manually search the Rate My Professors website for the professor of interest. From there, you will
# be able to grab the url for that professor's page; the numeric digits at the end of the url are the only part that varies.
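
Since only the numeric code changes between professors, the url could also be assembled from that code alone. This is a minimal sketch; the names `base_url` and `prof_id` are hypothetical and not part of the original script.

```python
# Hypothetical names; only the numeric ID differs from one professor to the next.
base_url = "https://www.ratemyprofessors.com/professor/"
prof_id = 1190096  # copied by hand from the end of the professor's page url

url = base_url + str(prof_id)  # str() turns the number into text so it can be joined
print(url)
```

With this layout, switching professors means retyping one number rather than editing the whole url.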

In [5]:
### Step 3: request the page from the url 
page = requests.get(url) ## for syntax: "requests" is the library, the "." accesses a command that lives inside that 
## library ("get"), and the () hold what the command acts on, which in this case is the url 
page # should show 200 (success); any other status code probably means the request failed 

<Response [200]>
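
Rather than reading the status off the screen, the check can be written as code. The helper below is a hypothetical sketch: it relies only on the `status_code` attribute that `requests` response objects expose, and it uses a stand-in object so the cell runs without a network connection.

```python
from types import SimpleNamespace

def describe_status(response):
    """Return a short message based on the response's HTTP status code."""
    code = response.status_code  # requests responses carry this attribute
    if code == 200:
        return "OK: page downloaded"
    return f"Problem: server returned status {code}"

# Stand-in object for illustration; in the notebook you would pass `page` instead.
fake_page = SimpleNamespace(status_code=404)
print(describe_status(fake_page))
```

The `requests` library also offers `page.raise_for_status()`, which stops the script with an error when the request failed.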

In [6]:
### Step 4: create the "Soup" object, i.e. ALL the elements (html code) from the url 
### best not to print it, since it is a huge amount of markup 
soup = BeautifulSoup(page.text, "html.parser") 

## for syntax, the command is "BeautifulSoup", and "page" is the object pulled in step 3. The ".text" says 
# "grab the page's content as text", and the part after the comma tells BeautifulSoup to parse that text as html 

### uncomment the line in the next box if you want to see what's going on

In [5]:
# soup 
# as you see, it is A LOT. Thankfully, we do not need to sort through this mess 


<!DOCTYPE html>

<!-- SSR -->
<html>
<head>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#000000" name="theme-color"/>
<meta content="https://www.ratemyprofessors.com/build/thumbnail.svg" name="thumbnail"/>
<link href="/build/manifest.json" rel="manifest"/>
<link href="/static/css/main.1773c5b7.css" rel="stylesheet" type="text/css"/>
<!-- Google Optimize Anti-flicker snippet -->
<style>.async-hide { opacity: 0 !important} </style>
<script>(function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
        h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
        (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
        })(window,document.documentElement,'async-hide','dataLayer',4000,
        {'OPT-MLW3VTZ':true});</script>
<!-- Google Optimize -->
<script async="" src="https://www.googleoptimize.com/optimize.js?id=OPT-MLW3VTZ"></script>
<script async="" data-pubkey="alticermp" data-sitekey

In [7]:
### Step 5: grab the comments 
## note: This might require additional inspection of the webpage, which is described in the readme. Updates to the 
# website cause the class names to change on occasion. At the very least, a class name that holds for one comment on a 
# given prof's page will hold for all comments on that page 

prof_comments = soup.find_all("div", {"class": "Comments__StyledComments-dzzyvm-0 gRjWel" }) 
prof_comments 
## excellent! The trick was to highlight the relevant text in the browser's inspector and note the tag name that comes just before "class"; for
# the tags it was "span", though in the case of the comments it was "div". From there we just had to copy the text 
# within the quotes. I recommend right clicking, copying the element into a cell, and then extracting what is necessary

[<div class="Comments__StyledComments-dzzyvm-0 gRjWel">There was a lot of reading but it was not necessary. Could retake most quizzes an unlimited number of times. Directions for papers were sometimes unclear, but she was very good at responding to emails in a timely manner and answered questions well. She is a very easy grader.   </div>,
 <div class="Comments__StyledComments-dzzyvm-0 gRjWel">Dr. Bowen is a really great professor. Her class was really interesting, interactive, and you can tell she's really passionate about what she does. She's also extremely nice; I was 0.02% away from a solid A in her class and didn't realize you could ask for a bump until the end of the next semester. She bent over backwards and bumped me anyway.</div>,
 <div class="Comments__StyledComments-dzzyvm-0 gRjWel">Professor Bowen is a decent professor at OSU Mansfield. She remains unbiased in the politics class, even though her opinion on the matter is clear. She is easy to get along with, and is quite unde

In [1]:
### Step 5.5: finding patterns through inspection 

## again, these patterns should be consistent, but they might change. If they do, you will need to inspect the page and work out 
# the new class names/patterns for the purpose of scraping/parsing 


## note: see that quality and difficulty are nearly the same, save for the text following "-2"
## quality = <div class="CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 gcFhmN">5.0</div>
## difficulty = <div class="CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 cDKJcc">1.0</div>

### note 2: example of being aware of inconsistencies 


## the following are patterns found by inspecting various quality scores. Note that scores of 4.5+ share the 
# same pattern. HOWEVER, the 3.5 score has a different pattern following the "-2". This means that if we are too specific
# about the text we are looking for, we would miss the lower ratings. This would be problematic, so we'll have to widen
# our net by matching only the shared part of the class name. This will require some extra parsing later, though 

# <div class="CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 gcFhmN">4.5</div> # from a good class 
# <div class="CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 gcFhmN">5.0</div>
# <div class="CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 icXUyq">3.5</div> # this is from an average class

## class = <div class="RatingHeader__StyledClass-sc-1dlkqw1-2 gxDIt"> POLSCI1300</div>

## note on grade: it is missing for many reviews; probably best to skip it for the time being, since handling the 
# missing values automatically when building the data frame is fiddly 

## grade = <div class="MetaItem__StyledMetaItem-y0ixml-0 LXClX">Grade: <span>A</span></div>
#<div class="Comments__StyledComments-dzzyvm-0 gRjWel">Professor Bowen is a decent professor at OSU Mansfield. She remains unbiased in the politics class, even though her opinion on the matter is clear. She is easy to get along with, and is quite understanding when you must miss a class. However, some of her material is difficult to understand. I often found myself using Google to look up answers.</div>
#<div class="Comments__StyledComments-dzzyvm-0 gRjWel">Pretty good professor. She's very liberal but is pretty good at hiding her bias. Her lectures can be pretty boring. She decided to drop all exams so all there is to do is a final constitutional 'simulation', that is a lot of fun. Easy A and fairly interesting GE.</div>
# <span class="Tag-bs9vf4-0 hHOVKF">Clear grading criteria</span>
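
The "widen our net" idea can be illustrated in plain Python. A class attribute is a space-separated list of names, so matching on the shared first name catches every score, while matching on the full attribute text misses the variants. The cell below is only an illustration of that logic using the class strings noted above; it is not BeautifulSoup's own code, and the variable names are hypothetical.

```python
# Class attributes copied from the patterns noted above.
class_attrs = [
    "CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 gcFhmN",  # 4.5 and 5.0 scores
    "CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 icXUyq",  # 3.5 score
    "RatingHeader__StyledClass-sc-1dlkqw1-2 gxDIt",            # class code, not a score
]

narrow = "CardNumRating__CardNumRatingNumber-sc-17t4b9u-2 gcFhmN"
wide = "CardNumRating__CardNumRatingNumber-sc-17t4b9u-2"

# Narrow: the whole attribute must be identical, so the 3.5 score is missed.
narrow_hits = [c for c in class_attrs if c == narrow]

# Wide: the shared name only has to be one of the space-separated class names.
wide_hits = [c for c in class_attrs if wide in c.split()]

print(len(narrow_hits), len(wide_hits))
```

The wide match returns both score patterns but still leaves out the class-code element, which is the behavior we want for the score data.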

In [8]:
### Step 6: grab the meta information (i.e. difficulty, quality, class)

### note: this object will hold both quality and difficulty data; it will be split apart in step 8 
prof_score = soup.find_all("div", {"class": "CardNumRating__CardNumRatingNumber-sc-17t4b9u-2" })

## get the class name 
prof_class = soup.find_all("div", {"class": "RatingHeader__StyledClass-sc-1dlkqw1-2 gxDIt" }) # each class appears twice; keep every other element (done in step 8) 

#prof_grade = soup.find_all("div", {"class": "MetaItem__StyledMetaItem-y0ixml-0 LXClX" })
## the above is missing too many elements; better to skip grade and just go with difficulty 
                                  
                  

In [9]:
### step 7: check the lengths of what we pulled 

## we first get the length of each object with the "len" command, storing the results in their own objects 
prof_comment_len = len(prof_comments)
prof_score_len = len(prof_score)
prof_class_len = len(prof_class)


## next, we print out a message that combines text with the numbers stored in the three objects above. If all goes well,
# the score and class lengths should each be exactly double the comment length 
print("The length of the student comments are ", prof_comment_len, "compared to ", prof_score_len, "for the score data, and"
     , prof_class_len, "for the class info.")

The length of the student comments are  16 compared to  32 for the score data, and 32 for the class info.
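
The same check can be done by the computer instead of by eye. This is a minimal sketch with a hypothetical helper name; the factor of two is an assumption taken from the output above (16 comments, 32 scores, 32 class entries) and would need revisiting if the website changes.

```python
def lengths_line_up(n_comments, n_scores, n_classes):
    """True when scores and class entries are exactly double the comments."""
    return n_scores == 2 * n_comments and n_classes == 2 * n_comments

print(lengths_line_up(16, 32, 32))  # the counts from the output above
print(lengths_line_up(16, 31, 32))  # one score missing
```

In the notebook, the three `_len` objects from this step would be passed in place of the literal numbers.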


In [10]:
### Step 8: parse the data into smaller bits of the same length as the comments 

### First, let's grab every "difficulty" score from the score list 
prof_dif = prof_score[1::2] # grabs all of the difficulties
## note: what's going on here?
# prof_score is a list that contains 32 elements. When we scraped the data, each individual post contributed two elements 
# in a row: first its "quality" score, then its "difficulty" score. 
# We want the difficulty data grouped with difficulty data alone, and quality with quality. Having identified
# the pattern, we use the []'s to subset the list. The syntax is [start:stop:step]: the first number is the position to 
# start from, the empty middle slot means "go to the end", and the last number means "take every nth element." 
# Importantly, python counts the first element as "0." Therefore, by specifying "1", we start from the second element and then
# grab every other element (the 2nd, 4th, 6th, ...), which are all of the difficulty scores.
# Cool, no? 

prof_qual = prof_score[0::2] ## Same logic as above, though starting with the first element, thereby grabbing all of the 
# quality scores 

## For prof_class, it appears that every element is simply duplicated. Therefore, we could start with either the first or 
# second element; it does not matter 
prof_class = prof_class[1::2]
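
To see the slicing at work on something small, here is a sketch with a made-up six-element list standing in for `prof_score`. The labels are invented for illustration only.

```python
# Made-up stand-in: quality and difficulty alternate, as in prof_score.
scores = ["q1", "d1", "q2", "d2", "q3", "d3"]

quality = scores[0::2]     # start at index 0, take every 2nd element
difficulty = scores[1::2]  # start at index 1, take every 2nd element

print(quality)     # ['q1', 'q2', 'q3']
print(difficulty)  # ['d1', 'd2', 'd3']
```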

In [11]:
### Step 9: create a data frame from all these lists 

## note: loops are the bread and butter of python. We will use these in a lot of scripts, so feel free to ask questions, 
# and do not be embarrassed if it does not click immediately. I'll annotate where I can to help. 

### use a loop and pandas to combine the multiple lists 

d1 = [] # this creates an empty list called "d1" that we can fill later. This is done
# by simply setting it equal to empty brackets. Each item we add will become one row of the data frame 

for i, element in enumerate(prof_qual): ## this is the loop proper. "enumerate" walks through prof_qual one item at a time 
    # and hands back two things on each pass: a counter "i" (starting at 0) and the item itself ("element"). The loop 
    # therefore runs once per quality score 
    d1.append( # note that indentation is important. The indentation signals that the code following the ":" is part of the 
        # loop. "d1" is the empty list we created above. The "." signals that we are going to run a command that belongs to 
        # d1. "append" is that command; it adds one item (here, one row) to the end of the list on every pass, so we end up 
        # with as many rows as prof_qual has elements. The ({}) notation and further indentation matter here too. 
        # If you ever get an error, the most common cause is parentheses and brackets that do not match up. 
        {
            'row': i+1, # text in quotes followed by ":" names a column; what follows the ":" is that column's value for this row.
            # in this case, I want a column that simply notes the row number. I added the +1, since python 
            # starts counting at 0. 
            'quality_of_class': prof_qual[i].get_text(), # column for quality of the class. It comes from the list 
            # we created by slicing the prof_score object back in step 8. The [] with "i" inside is
            # necessary because the ith row of the column "quality_of_class" should be 
            # pulled from the ith element of the list. The ".get_text()" keeps only the text inside the element, dropping the 
            # surrounding html tags that would otherwise make the values harder to read. 
            
            "difficulty_of_class": prof_dif[i].get_text(), # same logic as above, though now pulling from the prof_dif object,
            #which comprises the difficulty. Also, note that commas are necessary to separate the columns, up until the last one
            
            "class_code": prof_class[i].get_text(), # same logic as above, though now pulling the class name
            
            "college": college, ## We created this single element object in step 2, if you recall. We are now just saying for 
            # the entire data frame, just have the college column be the single value in the college object we created. 
            
            "prof_firstname": prof_firstname, # same logic as above, now for the prof's first name
            
            "prof_lastname": prof_lastname, # same logic as above, now for the prof's last name
            
            'comment': prof_comments[i].get_text() # same logic as class code, though now pulling the comments 
        }
    ) 
# end of loop 

## Finally, we convert the list into a pandas data frame, which will be necessary for exporting later. 
# We do this here:
df = pd.DataFrame(d1) # df is the pandas data frame object we output. The "pd." part says we want to pull 
# something from the pandas package. In our case, it is the command "DataFrame", which turns the now full 
# list inside the parentheses into a pandas data frame.
df # this will have the data frame appear below. 

Unnamed: 0,row,quality_of_class,difficulty_of_class,class_code,college,prof_firstname,prof_lastname,comment
0,1,5.0,1.0,POL1200,OHIO STATE UNIVERSITY,RACHEL,BOWEN,There was a lot of reading but it was not nece...
1,2,5.0,1.0,POLSCI1300,OHIO STATE UNIVERSITY,RACHEL,BOWEN,Dr. Bowen is a really great professor. Her cla...
2,3,3.0,3.0,POLITSC1100,OHIO STATE UNIVERSITY,RACHEL,BOWEN,Professor Bowen is a decent professor at OSU M...
3,4,4.0,2.0,POL1200,OHIO STATE UNIVERSITY,RACHEL,BOWEN,Pretty good professor. She's very liberal but ...
4,5,5.0,2.0,POLITSC1100,OHIO STATE UNIVERSITY,RACHEL,BOWEN,Gives great feedback and clearly outlines what...
5,6,5.0,2.0,POLSC101,OHIO STATE UNIVERSITY,RACHEL,BOWEN,I loved her class! She gives great feedback an...
6,7,5.0,2.0,POLSC1300,OHIO STATE UNIVERSITY,RACHEL,BOWEN,one o my favorite professors eve!
7,8,5.0,4.0,POLSC1300,OHIO STATE UNIVERSITY,RACHEL,BOWEN,"Rachel Bowen is an awesome professor, you real..."
8,9,5.0,4.0,POLISCI1300,OHIO STATE UNIVERSITY,RACHEL,BOWEN,Dr. Bowen is very passionate about Political S...
9,10,4.5,3.0,AMPOLITICS2002,OHIO STATE UNIVERSITY,RACHEL,BOWEN,Very hands on course. She encourages students ...
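
If the loop in step 9 is hard to follow, a smaller version may help. The sketch below uses made-up two-element lists in place of the scraped ones, and it pairs the lists with `zip` so that each pass hands over one item from every list. This is an alternative way of writing the loop, not the code used above.

```python
# Made-up stand-ins for the scraped lists.
quality = ["5.0", "3.0"]
difficulty = ["1.0", "3.0"]
comments = ["Great class.", "Hard exams."]

rows = []
for i, (q, d, c) in enumerate(zip(quality, difficulty, comments)):
    # enumerate supplies the counter i (starting at 0);
    # zip supplies one item from each list per pass.
    rows.append({
        "row": i + 1,
        "quality_of_class": q,
        "difficulty_of_class": d,
        "comment": c,
    })

print(rows[0])
```

One design note: `zip` stops at the shortest list, so mismatched lengths shorten the result silently instead of raising an error. Passing `rows` to `pd.DataFrame` would give a data frame laid out like the one above.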


In [20]:
## Step 10: Export the data frame

# note: We want to save the data while reducing the chance that you accidentally overwrite other data. Therefore, we are going 
# to build the file name from the text objects we set up in step 2. 

save_name = "scraped_data/rmp_data" +"_"+college +"_"+prof_lastname+"_"+prof_firstname+".csv"
save_name # what I just did is called "concatenation": the constant text is in quotes, and the objects we 
# created in step 2 are joined onto it, so the save name of the csv changes whenever those objects change.
# note: by having "scraped_data/" (with that slash) precede "rmp_data", we are saying we want the data saved into the 
# folder "scraped_data." This is important, as you always want to keep folder hierarchy in mind, so as to not clutter
# a project. 

## the command below takes the object "df" -- which is a pandas data frame -- and runs its "to_csv" command, which
# comes with pandas itself (not the csv library we imported in step 1). Inside the parentheses we give the name 
# under which we want to save the file, which in our case is the text object "save_name" we created above. The "index=False"
# argument tells pandas not to write its own row-number column to the csv, since we 
# already created one. 
df.to_csv(save_name, index=False)
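
Two practical caveats about the save step: as far as I know, `to_csv` does not create a missing folder, and running the script twice for the same professor writes over the earlier file. The sketch below shows one way to handle both with the standard library. The helper name `build_save_name` is hypothetical, and the example runs in a temporary directory so nothing lands in the project.

```python
import os
import tempfile

def build_save_name(folder, college, lastname, firstname):
    """Create the folder if needed and return the csv path inside it."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    file_name = "rmp_data_" + college + "_" + lastname + "_" + firstname + ".csv"
    return os.path.join(folder, file_name)

# Demonstrated in a temporary directory for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "scraped_data")
    path = build_save_name(folder, "OHIO STATE UNIVERSITY", "BOWEN", "RACHEL")
    folder_exists = os.path.isdir(folder)
    already_saved = os.path.exists(path)  # True would mean saving overwrites a file

print(os.path.basename(path), folder_exists, already_saved)
```

In the notebook, checking `os.path.exists(save_name)` before calling `to_csv` would give a chance to rename the file instead of overwriting it.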




In [13]:
names = df["prof_firstname"]
names[1]

'RACHEL'