# Scrape RateMyTeachers.com

In this notebook we scrape the entirety of RateMyTeachers.com using a layered approach. The first step is to download a map of all the reviews on the site.

In [1]:
import multiprocessing
import pandas as pd
import os 

import rmt_scraper

# During dev, reload the module after any alterations
#from importlib import reload 
#rmt_scraper = reload(rmt_scraper)

# use all but one core as workers (never fewer than one), leaving a core free for the rest of the system
processors = max(1, multiprocessing.cpu_count() - 1)

Instantiate the scraper for the US by passing it the US homepage for RMT. RMT contains reviews from many English-speaking countries, and the homepage for any of them can be passed to the scraper. Once instantiated, the scraper automatically populates all the available states (or the country's equivalent) for the submitted homepage.

In [2]:
# instantiate the scraper to scrape the U.S. database
scraper = rmt_scraper.rmt_scraper('https://www.ratemyteachers.com')

# load the last instance of the database 
scraper.load_sitemap()

# load the school description database so it can be built out as well
scraper.load_schooldb()
print('\nFind the sitemap under "scraper.reference_database" and the school database has been saved in "scraper.school_database"')

Scraper setup for https://www.ratemyteachers.com
Populating states

Find the sitemap under "scraper.reference_database" and the school database has been saved in "scraper.school_database"


You can review the populated states by inspecting the *scraper.states* attribute.

## Generate Site Map
<br>**THIS STEP GENERATES THE <font color = red>"reference_database.csv"</font> AND <font color=red>"school_database.csv"</font> DATASETS, SKIP IF YOU PLAN ON USING THE EXISTING VERSIONS IN THE "data" DIRECTORY**

<br>In this next step we use the *refresh_database* method to generate a ".csv" file containing the site map: a database with the nested fields ***State > City > School > Teacher***, where Teacher is the most specific webpage. Reviews can then be collected from each teacher page.

**Scraping the sitemap can take a few hours and may fail due to connection interruptions, but all is not lost!** The *refresh_database* method automatically saves everything that was scraped to "reference_database.csv" in the data directory, along with a complementary dataset of school descriptions called "school_database.csv". If you rerun the scraper (even after instantiating a new one), it automatically looks for these files in the data directory and asks whether you want to load them. You can also pass a custom pandas DataFrame to *refresh_database* for the scraper to build on.

Note that the *refresh_database* method generates the site map for all available states in *scraper.states*. To use a subset of states, set the *scraper.states* attribute to a subset of itself. To regain the full state list without re-instantiating the scraper, use *scraper.get_states(scraper.home_url)*.
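
As a rough illustration of the subsetting step, the sketch below filters a stand-in list of state URLs. It assumes that *scraper.states* is a plain list of URLs shaped like the State column of the site map; the real attribute may be structured differently, so treat this as a pattern rather than the scraper's actual behavior.

```python
# Stand-in for scraper.states. ASSUMPTION: a list of state URLs shaped
# like the State column of the site map; the real type may differ.
states = [
    'https://www.ratemyteachers.com/Florida',
    'https://www.ratemyteachers.com/Georgia',
    'https://www.ratemyteachers.com/New-York',
    'https://www.ratemyteachers.com/Texas',
]

wanted = {'Florida', 'Texas'}

# Keep only the states whose final URL segment is in `wanted`.
subset = [s for s in states if s.rsplit('/', 1)[-1] in wanted]
print(subset)

# With a live scraper the same idea would read (hypothetical usage):
# scraper.states = [s for s in scraper.states if s.rsplit('/', 1)[-1] in wanted]
```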

### Threaded Scraping 

The *rmt_scraper* object supports multi-threaded scraping, which can significantly speed up the process. Use the *workers* argument of the *refresh_database* method to set the number of CPUs to use; pass *(workers = -1)* to use all available cores. If you want to stop this code block at any time, I suggest disconnecting from the internet: it is faster than interrupting the kernel and does not risk losing your data in a crash.
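
The internals of *rmt_scraper* are not shown in this notebook, so the following is only a sketch of how a *workers* argument of this kind is commonly handled, using the standard library's `ThreadPoolExecutor`. The names `resolve_workers`, `fetch_state`, and `scrape_all` are hypothetical, and `fetch_state` is a stand-in that makes no network request.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(workers):
    """Map -1 to every available core; otherwise use at least one worker."""
    if workers == -1:
        return os.cpu_count() or 1
    return max(1, workers)


def fetch_state(state):
    """Stand-in for a network request (hypothetical); returns a dummy pair."""
    return state, len(state)


def scrape_all(states, workers=-1):
    # One task per state; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=resolve_workers(workers)) as pool:
        return list(pool.map(fetch_state, states))


results = scrape_all(['Alabama', 'Alaska', 'Texas'], workers=2)
print(results)
```

Threads suit this kind of work because most of the time is spent waiting on the network rather than computing.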

**Site Map Dataset** - The columns represent the nested levels Teacher < School < City < State  

In [3]:
# you don't need to scrape this if the reference database already has everything you need
#%%time  # uncomment to time 

# use the command below to scrape; the use_existing option resumes from the last saved point
#scraper.refresh_database(workers = processors, use_existing = scraper.reference_database, use_schools = scraper.school_database) #uncomment to scrape

# drop leftover 'Unnamed' index columns from the dataset and display a sample of each dataset collected so far
scraper.site_map = scraper.site_map.drop(columns=[c for c in scraper.site_map.columns if 'Unnamed' in c])
print('Site Map:')
scraper.site_map.sample(10).head(5)

Site Map:


Unnamed: 0,City,School,State,Teacher
228754,/georgia/conyers,/rockdale-county-high-school/6086-s,https://www.ratemyteachers.com/Georgia,/mary-lyles/6926695-t
147130,/florida/orlando,/timber-creek-high-school/40034-s,https://www.ratemyteachers.com/Florida,/victoria-velez/7473759-t
593106,/new-york/long-island-city,/long-island-city-high-school/156849-s,https://www.ratemyteachers.com/New-York,/disanto/4117381-t
169822,/florida/sarasota,/mcintosh-middle-school/45945-s,https://www.ratemyteachers.com/Florida,/ralph-jarvis/1045200-t
786846,/texas/spring,/the-woodlands-high-school/19038-s,https://www.ratemyteachers.com/Texas,/donna-brawner/4309963-t
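
The 'Unnamed' columns dropped in the cell above most likely come from saving a DataFrame to ".csv" with its index and reading it back; that is an assumption about how the files were written, not something the notebook states. A small sketch of the effect and two ways around it:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'City': ['/georgia/conyers'],
    'State': ['https://www.ratemyteachers.com/Georgia'],
})

# By default to_csv writes the index as an unlabeled first column,
# which read_csv then names 'Unnamed: 0'.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))

# Option 1: drop such columns after loading.
cleaned = reloaded.loc[:, ~reloaded.columns.str.startswith('Unnamed')]

# Option 2: avoid them by writing without the index.
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
roundtrip = pd.read_csv(buf2)
print(list(roundtrip.columns))
```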


**School Description Dataset** - Descriptions of each school, can be merged with the Site Map Dataset by School

In [4]:
# do the same for the school descriptions dataset
scraper.school_database = scraper.school_database.drop(columns=[c for c in scraper.school_database.columns if 'Unnamed' in c])
print('Here is a sample description:\n %s' % str(scraper.school_database.loc[0,'school_description']).strip())
print('\nSchool Descriptions:')
scraper.school_database.sample(10).head(5)

Here is a sample description:
 Abbott Loop Elementary School is a public elementary school located in Anchorage, Alaska and part of Anchorage School District.

School Descriptions:


Unnamed: 0,School,address,gender,phone,school_description,student,website
3771,/st-gregory-the-great-school/39923-s,"85 Great Plain Rd, Danbury, Connecticut 06811",,203-748-1217,St Gregory The Great School is a private midd...,,
15198,/highland-elementary-school/290015-s,"1915 Buffalo Lake Rd, Sanford, North Carolina ...",428 Male | 419 Female,919-499-2200,Highland Elementary School is a public elemen...,847 Students Enrolled,http://www.harnett.k12.nc.us/Schools/highland/...
6209,/pivot-charter-school/272203-s,"8129 N Pine Island Rd, Tamarac, Florida 33321",,954-720-3001,Pivot Charter School is a public charter high...,,http://pivotcharterschoolfortlauderdale.com
25283,/collin-college/141032-s,"3452 Spur 399, Mc Kinney, Texas 75069",0 Male | 0 Female,972-548-6790,"Collin College is located in Mc Kinney, Texas...",0 Students Enrolled,http://www.collin.edu/
24249,/white-rock-elementary-school/220745-s,"9229 Chiswell Rd, Dallas, Texas 75238",,469-593-2700,White Rock Elementary School is a public elem...,,http://www.richardson.k12.tx.us/administration...
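
Since the two datasets share the School column, they can be combined with a left merge so that every teacher row picks up its school's details. The sketch below uses small made-up placeholder rows (not real scraped data); only the column names are taken from the tables above.

```python
import pandas as pd

# Placeholder rows for illustration only; column names follow the site map.
site_map = pd.DataFrame({
    'State': ['/example-state', '/example-state', '/example-state'],
    'City': ['/example-state/city-a', '/example-state/city-a', '/example-state/city-b'],
    'School': ['/school-one/1-s', '/school-one/1-s', '/school-two/2-s'],
    'Teacher': ['/teacher-a/11-t', '/teacher-b/12-t', '/teacher-c/13-t'],
})

# Placeholder school descriptions; '/school-two/2-s' is deliberately missing.
school_database = pd.DataFrame({
    'School': ['/school-one/1-s'],
    'phone': ['555-0100'],
    'website': ['http://example.com'],
})

# A left merge keeps every teacher row; schools without a description get NaN.
# validate= raises an error if a school appears more than once on the right side.
merged = site_map.merge(school_database, on='School', how='left',
                        validate='many_to_one')
print(merged[['Teacher', 'School', 'phone']])
```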


## Scraping Reviews

We use the site map to guide the review scraper. We can scrape by State, School, City or Teacher and use the *for_values* option to scrape only certain States, Schools, Cities or Teachers respectively (the default is by City). The scraped data is saved as .pickle files at the level selected (e.g. one .pickle per City). The level cannot be 'Teacher'.

**Note** that this can take a very long time depending on how much you want to scrape. You may run into connection issues similar to those above, where the server cuts off requests for a time or the connection is persistently interrupted. In these cases simply rerun the command. The *get_reviews_by* method detects the files that have already been collected and collects only those that are missing. Use the *overwrite* option to ignore existing files.
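
The resume behavior described here can be pictured as a simple check of the data directory against the expected output names. This is a sketch of the idea only; `missing_outputs` is a hypothetical helper, not part of *rmt_scraper*, and the real method may decide what to skip differently.

```python
import os
import tempfile


def missing_outputs(expected_names, data_dir, overwrite=False):
    """Return the expected file names that still need to be scraped."""
    if overwrite:
        # Ignore whatever is on disk and redo everything.
        return list(expected_names)
    existing = set(os.listdir(data_dir))
    return [name for name in expected_names if name not in existing]


with tempfile.TemporaryDirectory() as data_dir:
    # Pretend an earlier, interrupted run already saved one city.
    open(os.path.join(data_dir, 'US Alabama alabaster.csv'), 'w').close()

    expected = ['US Alabama alabaster.csv', 'US Alabama albertville.csv']
    todo = missing_outputs(expected, data_dir)
    redo = missing_outputs(expected, data_dir, overwrite=True)

print(todo)
print(redo)
```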

In [5]:
# scrape reviews 
#scraper.get_reviews_by(by = 'City', workers = processors, prefix = 'US ')

# show the files in the data folder
files = [f for f in next(os.walk(os.path.join(os.getcwd(), 'data')))[2] if 'US' in f]
files

['US Alabama alabaster.csv',
 'US Alabama albertville.csv',
 'US Alabama alexander-city.csv',
 'US Alabama alexandria.csv',
 'US Alabama anniston.csv',
 'US Alabama arab.csv',
 'US Alabama ardmore.csv',
 'US Alabama ashford.csv',
 'US Alabama athens.csv',
 'US Alabama auburn.csv',
 'US Alabama birmingham.csv',
 'US Alabama carbon-hill.csv',
 'US Alabama citronelle.csv',
 'US Alabama clanton.csv',
 'US Alabama cleveland.csv',
 'US Alabama cullman.csv',
 'US Alabama dadeville.csv',
 'US Alabama daphne.csv',
 'US Alabama decatur.csv',
 'US Alabama dora.csv',
 'US Alabama dothan.csv',
 'US Alabama enterprise.csv',
 'US Alabama eufaula.csv',
 'US Alabama falkville.csv',
 'US Alabama florence.csv',
 'US Alabama fort-payne.csv',
 'US Alabama gadsden.csv',
 'US Alabama gardendale.csv',
 'US Alabama hartselle.csv',
 'US Alabama harvest.csv',
 'US Alabama hazel-green.csv',
 'US Alabama heflin.csv',
 'US Alabama holly-pond.csv',
 'US Alabama homewood.csv',
 'US Alabama hoover.csv',
 'US Alabama h