# Review Collection Script [Script A]

*This notebook contains the script used to collect reviews from the Glassdoor job posting site. We use the Python library Selenium to carry out the scraping. The structure of the code was heavily inspired by the Glassdoor scraping script written by Maria Vasilenko and published on Medium. The content of the code was written from scratch to serve our own scraping goals.*

---
*References: https://mashavasilenko.medium.com/scrape-your-way-to-thousands-of-interview-reviews-f6dba8063539*


## Package installations and imports

In [None]:
!pip install selenium --quiet
!apt-get update --quiet # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver --quiet
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
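# chromedriver was copied to /usr/bin above, so it is found on the PATH; recent Selenium releases expect the options via `options=`
driver = webdriver.Chrome(options=chrome_options)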

In [None]:
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException, TimeoutException, StaleElementReferenceException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import time
import pandas as pd
import re
import math

In [None]:
VERBOSE = False #Used for debugging

## Notebook Methods

### Login method

*This method makes sure we are logged in and have full access to the Glassdoor website*

In [None]:
## Login Method
## Code based on login method from https://mashavasilenko.medium.com/scrape-your-way-to-thousands-of-interview-reviews-f6dba8063539

def login(driver, login_url, email, password):

    try:
        driver.get(login_url)
        if VERBOSE:
            print('Arrived at login page')
    except Exception:
        print('Failed to find login page')
        return

    # Navigate to the e-mail and password fields, located relative to the submit button
    try:
        if VERBOSE:
            print('Finding email and password fields')
        email_field = driver.find_element(By.XPATH, "//*[@type='submit']//preceding::input[2]")
        pwd_field = driver.find_element(By.XPATH, "//*[@type='submit']//preceding::input[1]")
    except NoSuchElementException:
        # Without both fields we cannot log in, so stop here instead of failing on an undefined variable
        print('Could not find email or password field')
        return

    # Send e-mail and password, then submit the form
    email_field.send_keys(email)
    pwd_field.send_keys(password)
    pwd_field.submit()
    time.sleep(3)
    print('Successful login')
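
The login method pauses for a fixed three seconds after submitting the form, and the other methods in this notebook use similar fixed pauses. One alternative is to poll for a condition until it holds or a timeout expires. The sketch below is a plain-Python stand-in written for illustration only: `wait_until`, its parameters, and the example condition are hypothetical names that do not appear in the original script.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Stand-in condition that becomes true on the third check.
# In the notebook, the condition could instead inspect the driver, for
# example `lambda: driver.current_url != login_url` (an assumption about
# how a completed login shows up).
attempts = []

def ready_on_third_check():
    attempts.append(1)
    return len(attempts) >= 3

print(wait_until(ready_on_third_check, timeout=5.0, poll_interval=0.01))
```

Selenium ships its own version of this idea: the `WebDriverWait` and `expected_conditions` imports at the top of the notebook allow calls along the lines of `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "gdReview")))`, which raises `TimeoutException` rather than returning `None`.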

### Accepting Recommended Cookies Method

*This method ensures that we are not blocked by the cookie consent prompt that most sites have displayed since the introduction of GDPR restrictions*

In [None]:
def accept_cookies(driver):
    #Accepting recommended cookies
    try:
        accept_button = driver.find_element(By.ID, 'onetrust-accept-btn-handler')
        accept_button.click()
        if VERBOSE:
            print('Recommended cookies accepted')
    except NoSuchElementException:
        if VERBOSE:
            print('No accept recommended cookies button')

### Get Company IDs Method

*This method allows us to collect a list of companies with technical positions within the United States, along with their IDs. The company names and IDs on Glassdoor will be needed to collect their reviews*

In [None]:
def get_company_ids(page_limit):
    
    #Accepting recommended cookies if necessary
    accept_cookies(driver)
        
    url = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page=1&isHiringSurge=0&locId=1&locType=N&locName=US&occ=Data'
    driver.get(url)
    
    
    #Give time for the page to load
    time.sleep(4)
    
    #Accepting recommended cookies
    accept_cookies(driver)
    
    
    #Initialize list of companies
    companies = []
    
    #Get the number of pages of companies which fit the search criteria (10 companies per page), capped at page_limit
    companies_count = driver.find_element(By.XPATH, '//span[@class="common__commonStyles__subtleText resultCount"]').find_elements(By.TAG_NAME, 'strong')[2].text
    num_pages = math.ceil(int(companies_count.replace(",", ""))/10)
    if num_pages > page_limit:
        num_pages = page_limit
    current_page = 1
    
    #Loop through pages and get companies
    while current_page <= num_pages: 
        
        #Give time for the page to load
        time.sleep(4)
    
        employer_cards = driver.find_elements(By.XPATH, '//section[@class="employerCard__EmployerCardStyles__employerCard common__commonStyles__module p-std mt-0 mb-std mx-std mx-sm-0"]')
        for employer_card in employer_cards:
            link = employer_card.find_elements(By.TAG_NAME, 'a')[0].get_attribute('href')
            relevant_info = re.search(r"Reviews/(.*)-Reviews-(.*)\.", link)
            #Only record the company when its link matches the expected pattern
            if relevant_info:
                company_name = relevant_info.group(1)
                company_id = relevant_info.group(2)
                companies.append({"company_id":company_id, "company_name":company_name})
        
        #Go to next page
        try:
            next_button = driver.find_element(By.CSS_SELECTOR, "[aria-label=Next]")
            next_button.click()
            current_page +=1
            print("Successfully moved to next page "+str(current_page))
        except NoSuchElementException:
            print ('No more pages to scan')
            break
    
    return companies
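
The name and ID extraction in `get_company_ids` depends on the shape of the review links. As a minimal sketch, assuming links look like `https://www.glassdoor.com/Reviews/<name>-Reviews-<id>.htm`, the parsing step can be tried on its own. The sample URL and the helper name `parse_review_link` are made up for illustration.

```python
import re

# Same idea as the pattern in get_company_ids, with the ID stopping at the first dot.
REVIEW_LINK = re.compile(r"Reviews/(.+)-Reviews-([^.]+)\.")

def parse_review_link(link):
    """Return {'company_id', 'company_name'} for a review link, or None if it does not match."""
    match = REVIEW_LINK.search(link)
    if match is None:
        return None
    return {"company_id": match.group(2), "company_name": match.group(1)}


# Hypothetical links used only to exercise the helper
sample = "https://www.glassdoor.com/Reviews/Example-Corp-Reviews-E12345.htm"
print(parse_review_link(sample))
print(parse_review_link("https://www.glassdoor.com/Explore/browse-companies.htm"))
```

Returning `None` for a link that does not match makes it easy to skip such cards instead of recording a stale name and ID.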
        

### Get Company Reviews

*This method scrapes Glassdoor to get reviews from the companies that are passed in as a parameter. If at any point, the review collection fails, the method returns the reviews it has managed to collect*

In [None]:
def get_reviews(companies, page_limit, debug):
    
    
    #Accepting recommended cookies
    #accept_cookies(driver)
    
    
    column_names = ['company', 'rating', 'employee', 'headline', 'jobtitledate', 'pros', 'cons']
    reviews = pd.DataFrame(columns = column_names)
    
    try:

      for company in companies:
          
          url = 'https://www.glassdoor.com/Reviews/'+company['company_name']+'-Reviews-' + company['company_id'] +'.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
          driver.get(url)
      
          #Give time for the page to load
          time.sleep(4)


          #Accepting recommended cookies
          accept_cookies(driver)

          #Setup lists of data to be extracted
          ratings, employee_types, headlines, job_title_dates, pros, cons = ([] for i in range(6))

          #Get the number of review pages. The footer reports the total number of reviews (10 per page),
          #so convert it to a page count and cap it at page_limit
          pagination_element = driver.find_element(By.CLASS_NAME, "paginationFooter").text
          pagination = re.search('of (.+?) Reviews', pagination_element)
          num_pages = 1
          if pagination:
              num_pages = math.ceil(int(pagination.group(1).replace(",", ""))/10)
          if num_pages > page_limit:
              num_pages = page_limit
          current_page = 1
          
          try:
              #Loop through pages and get reviews
              while current_page <= num_pages: 

                  if VERBOSE:
                      print('Moving through reviews for company ' + str(company['company_name']))

                  #Let the page load. Change this number based on your internet speed.
                  #Or, wait until the webpage is loaded, instead of hardcoding it.
                  time.sleep(4)

                  #Going through each review on page
                  review_boxes = driver.find_elements(By.CLASS_NAME, "gdReview")
                  rating_spans = driver.find_elements(By.XPATH, '//span[@class="ratingNumber mr-xsm"]')
                  employee_type_spans = driver.find_elements(By.XPATH, '//span[@class="pt-xsm pt-md-0 css-1qxtz39 eg4psks0"]')
                  headline_spans = driver.find_elements(By.CLASS_NAME, "reviewLink")
                  job_title_date_spans = driver.find_elements(By.CLASS_NAME, "authorInfo")
                  
                  for rating_span in rating_spans:
                      ratings.append(rating_span.text)

                  for employee_type_span in employee_type_spans:
                      employee_types.append(employee_type_span.text)

                  for headline_span in headline_spans:
                      headlines.append(headline_span.text)

                  for job_title_date_span in job_title_date_spans:
                      job_title_dates.append(job_title_date_span.text)
                      
                  #Extract pros and cons per review, filling in 'NA' when one of them is missing
                  for review_box in review_boxes:
                      pros_cons_div = review_box.find_elements(By.CLASS_NAME, "v2__EIReviewDetailsV2__fullWidth")
                      if len(pros_cons_div) == 2:
                          if VERBOSE:
                              print('Both pros and cons found')
                          for div in pros_cons_div:
                              paragraphs = div.find_elements(By.TAG_NAME, 'p')
                              if paragraphs[0].text == 'Pros':
                                  pros.append(paragraphs[1].text)
                              elif paragraphs[0].text == 'Cons':
                                  cons.append(paragraphs[1].text)
                      elif len(pros_cons_div) == 1:
                          if VERBOSE:
                              print('Only pros OR cons found')
                          paragraphs = pros_cons_div[0].find_elements(By.TAG_NAME, 'p')
                          if paragraphs[0].text == 'Pros':
                              if VERBOSE:
                                  print('Only pros found')
                              pros.append(paragraphs[1].text)
                              cons.append('NA')
                          elif paragraphs[0].text == 'Cons':
                              if VERBOSE:
                                  print('Only cons found')
                              pros.append('NA')
                              cons.append(paragraphs[1].text)
                      else:
                          if VERBOSE:
                              print('No pros or cons')
                          pros.append('NA')
                          cons.append('NA')


                  #Go to next page
                  try:
                      driver.find_element(By.CLASS_NAME, "nextButton").click()  #Find next page button and click
                      current_page +=1
                      if VERBOSE:
                          print("Successfully moved to next page "+str(current_page))
                  except NoSuchElementException:
                      print ('No more pages to scan')
                      break
          except Exception:
              print('Review collection failed for company '+ str(company['company_name']))

          company_reviews = pd.DataFrame({'company': company['company_name'], 'rating': ratings, 'employee': employee_types, 'headline': headlines, 'jobtitledate': job_title_dates, 'pros': pros, 'cons' : cons})
          reviews = pd.concat([reviews, company_reviews])
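    except (Exception, KeyboardInterrupt):
      #Stop collecting (also on a manual interrupt), then fall through so the reviews gathered so far are returned
      print('Review collection stopped')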
    return reviews
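
The number of pages to visit can be derived from the pagination footer before the loop starts. A minimal sketch, assuming the footer text reads like `Viewing 1 - 10 of 1,234 Reviews` and that each page holds 10 reviews, as noted in the collection cell. The helper name `pages_to_visit` and the sample footer strings are illustrative and not part of the original script.

```python
import math
import re

REVIEWS_PER_PAGE = 10  # per the notebook's note of 10 reviews per page

def pages_to_visit(footer_text, page_limit, per_page=REVIEWS_PER_PAGE):
    """Turn a pagination footer into a page count capped at page_limit."""
    match = re.search(r"of ([\d,]+) Reviews", footer_text)
    if match is None:
        # Footer not recognised: fall back to scanning a single page
        return 1
    total_reviews = int(match.group(1).replace(",", ""))
    return min(math.ceil(total_reviews / per_page), page_limit)


# Made-up footer strings in the assumed format
print(pages_to_visit("Viewing 1 - 10 of 1,234 Reviews", page_limit=200))
print(pages_to_visit("Viewing 1 - 10 of 1,234 Reviews", page_limit=100))
print(pages_to_visit("Viewing 1 - 7 of 7 Reviews", page_limit=100))
```

Keeping this calculation in a small function makes it possible to check the page arithmetic without opening a browser.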

## Review Collection

In [None]:
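#Login to Glassdoor
#Credentials are requested at runtime so that they are not stored in the notebook
from getpass import getpass
login(driver, 'https://www.glassdoor.com/profile/login_input.htm', input('Glassdoor email: '), getpass('Glassdoor password: '))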

In [None]:
#Collect a list of companies with Data roles in the United States.
company_list = []

#Note that this line will collect 1000 companies (10 companies per page for 100 pages).
#In reality, we broke down the collection per 200 companies and ran this script multiple 
#times over a three-month period 
company_list.append(get_company_ids(100))

#We export the company list to CSV in case we need access to it in other scripts
company_list = [c for company in company_list for c in company]
import csv
keys = company_list[0].keys()
company_file = open("company_list.csv", "w")
dict_writer = csv.DictWriter(company_file, keys)
dict_writer.writeheader()
dict_writer.writerows(company_list)
company_file.close()
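
Since the company list is exported so that other scripts can use it, reading it back is the natural counterpart. A small sketch using the standard library, assuming the file keeps the `company_id` and `company_name` columns written above. The helper name `load_company_list` and the sample rows are hypothetical.

```python
import csv
import io

def load_company_list(csv_file):
    """Read rows written by csv.DictWriter back into a list of dicts."""
    return list(csv.DictReader(csv_file))


# In another script this would be `open("company_list.csv", newline="")`;
# an in-memory file with made-up rows keeps the example self-contained.
sample_csv = io.StringIO(
    "company_id,company_name\n"
    "E12345,Example-Corp\n"
    "E67890,Another-Co\n"
)
companies = load_company_list(sample_csv)
print(companies[0]["company_name"], len(companies))
```

The resulting list has the same shape as `company_list`, so it could be passed to `get_reviews` directly.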

In [None]:
#Collect company reviews based on company list
#Note that this line will collect up to 1000 reviews per company (10 reviews per page for 100 pages).
#In reality, we broke down the collection and ran this script multiple times over a three-month 
#period 
reviews = get_reviews(company_list, 100, True)
#Clean up reviews collected and split the jobtitledate column into the date and title columns
reviews.reset_index(drop=True, inplace=True)
#Split on the first hyphen only, so that hyphens inside the job title are kept
date_title = reviews['jobtitledate'].str.split('-', n=1, expand=True)
reviews['date'] = date_title[0]
reviews['title'] = date_title[1]
#drop returns a new DataFrame, so the result has to be assigned back
reviews = reviews.drop(columns='jobtitledate')

# We export the reviews to CSV in case we need access to them in other scripts
reviews.to_csv('review_data.csv')
# files.download is a Google Colab helper, so it needs this import
from google.colab import files
files.download('review_data.csv')
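
Splitting `jobtitledate` is sensitive to hyphens inside the job title. The sketch below splits on the first hyphen only, assuming the scraped text has the form `<date> - <title>`. The sample strings are invented and the helper name `split_job_title_date` is hypothetical; the exact text Glassdoor returns may differ.

```python
def split_job_title_date(text):
    """Split 'date - title' on the first hyphen; title is None when no hyphen is present."""
    date, separator, title = text.partition("-")
    return date.strip(), (title.strip() if separator else None)


# Invented examples in the assumed format
print(split_job_title_date("Jan 5, 2021 - Data Scientist"))
print(split_job_title_date("Jan 5, 2021 - Co-Founder"))
print(split_job_title_date("Jan 5, 2021"))
```

On a DataFrame column, the same first-hyphen split is available through `str.split('-', n=1, expand=True)`.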