# Webscraping Glassdoor Reviews

In this section, I walk through the steps for building a script that scrapes company reviews from glassdoor.com.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
import re

We need to send a browser-like User-Agent header; otherwise the site may block requests coming from the Python script.

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

Let's pull in a single page from Booking.com reviews (page 50) and display the HTML contents.

In [3]:
page = requests.get('https://www.glassdoor.com/Reviews/Booking-com-Reviews-E256653_P50.htm', headers=headers) # page 50
soup = BeautifulSoup(page.content, 'lxml') # pip install lxml

In [4]:
print(soup.prettify())
# looks like a hot mess

<!DOCTYPE html>
<html class="flex" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraph.org/schema/">
 <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# glassdoor: http://ogp.me/ns/fb/glassdoor#">
  <meta content="2,770 Booking.com reviews. A free inside look at company reviews and salaries posted anonymously by employees." name="description"/>
  <meta content="2,770 Booking.com reviews. A free inside look at company reviews and salaries posted anonymously by employees." name="og:description"/>
  <link href="https://www.glassdoor.com/Reviews/Booking-com-Reviews-E256653_P50.htm" rel="canonical"/>
  <link href="https://www.glassdoor.com/Reviews/Booking-com-Reviews-E256653_P49.htm" rel="prev"/>
  <link href="https://www.glassdoor.com/Reviews/Booking-com-Reviews-E256653_P51.htm" rel="next"/>
  <!-- because the getter clears the value -->
  <script>
   window.gdGlobals = window.gdGlobals ||
		[{
			'analyticsId':                      "UA-2595786-1",



From parsing through the HTML, we can obtain the key elements of each review (title, rating, pros, cons) by referencing certain tags. It looks like the review title is contained within an "a" tag, the rating is stored in the "title" attribute of a "span" tag with the "value-title" class, and the pros and cons are contained in "p" tags with the "mt-0 mb-xsm v2__..." class.

The functions below extract this information from the HTML:

### Functions for retrieving titles, ratings, and pros/cons from Glassdoor reviews:

In [7]:
# Obtain all review titles
def get_titles(soup):
    parsed_titles = soup.select('div h2 a')
    return [title.text for title in parsed_titles]


# Obtain all star ratings
# need to omit first rating since it refers to overall rating of the company
def get_ratings(soup):
    parsed_ratings = soup.find_all('span', class_="value-title")
    return [float(rating['title']) for rating in parsed_ratings][1:]
    

# Pros and Cons per page. Should be 20 total for 10 reviews. Pros and Cons are required fields when entering a review
# so there should not be any empty fields.
# returns a single tuple with a list of pros, and a list of cons
def get_pros_cons(soup):
    parsed_reviews = soup.find_all('p', class_="mt-0 mb-xsm v2__EIReviewDetailsV2__bodyColor v2__EIReviewDetailsV2__lineHeightLarge")
    reviews = [review.text for review in parsed_reviews]
    pros = reviews[0::2]
    cons = reviews[1::2]
    return pros,cons
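
To see how the three lookups behave without hitting the site, here is a minimal sketch that runs the same selectors on a small hand-written snippet. The snippet and the `review-body` class are hypothetical stand-ins (the real class name is much longer), and `html.parser` is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a reviews page. "review-body" is a hypothetical
# placeholder for the long class name used on the real site.
html = """
<span class="value-title" title="4.1"></span>
<div><h2><a href="#">"Good environment"</a></h2></div>
<span class="value-title" title="5.0"></span>
<p class="review-body">Positive work environment</p>
<p class="review-body">Pressure from management</p>
<div><h2><a href="#">"Great"</a></h2></div>
<span class="value-title" title="4.0"></span>
<p class="review-body">Great benefits and pay</p>
<p class="review-body">Nothing to complain about</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Titles: every <a> inside an <h2> inside a <div>
titles = [a.text for a in soup.select("div h2 a")]

# Ratings: read the "title" attribute, dropping the first (overall company) rating
ratings = [float(s["title"]) for s in soup.find_all("span", class_="value-title")][1:]

# Pros and cons alternate, so even positions are pros and odd positions are cons
bodies = [p.text for p in soup.find_all("p", class_="review-body")]
pros, cons = bodies[0::2], bodies[1::2]

print(titles)   # ['"Good environment"', '"Great"']
print(ratings)  # [5.0, 4.0]
print(pros)     # ['Positive work environment', 'Great benefits and pay']
print(cons)     # ['Pressure from management', 'Nothing to complain about']
```

The alternating split only works when every review contributes exactly two paragraphs, which is why any extra paragraph on a page throws the lists out of step.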

Let's test the functions on page 50 of the Booking.com reviews. Glassdoor shows 10 reviews per page, so each list should contain 10 elements. Some reviews include a company response, however, and these inconveniently end up in our pros/cons lists: on this page the pros list has 12 elements and the cons list has 11.

In [6]:
# page = requests.get('https://www.glassdoor.com/Reviews/Booking-com-Reviews-E256653_P50.htm', headers=headers) # page 50
# soup = BeautifulSoup(page.content, 'lxml')
print(len(get_titles(soup)))
print(len(get_ratings(soup)))
print(len(get_pros_cons(soup)[0]))
print(len(get_pros_cons(soup)[1]))

10
10
12
11


The extra paragraphs also shift the alternating pros/cons split, so the rows from such a page cannot be trusted. The web-scraping script will have to detect pages with company responses and skip them.
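
One way to make that check explicit is to compare the list lengths before building any rows. The sketch below is only an illustration; `lengths_match` is a hypothetical helper name, and the counts are the ones printed in the page-50 test above.

```python
def lengths_match(*columns):
    """Return True when every sequence has the same number of elements."""
    return len({len(col) for col in columns}) <= 1

# Page 50 of Booking.com: 10 titles, 10 ratings, 12 pros, 11 cons
print(lengths_match(range(10), range(10), range(12), range(11)))  # False

# A page without company responses: 10 of everything
print(lengths_match(range(10), range(10), range(10), range(10)))  # True
```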

## Web-scraping script:

In [11]:
# url = page 1 url of reviews
# company = company name, stored in a 'company' column
# start_pg = first page to scrape
# end_pg = last page to scrape
# df = data frame to append reviews to (default=blank)

def scrape_reviews(url, company, start_pg, end_pg, df=None):
    # create the blank default here: a DataFrame default argument is built once and shared across calls
    if df is None:
        df = pd.DataFrame({'title': [], 'rating':[], 'pros':[], 'cons':[]})
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/718.93 (KHTML, like Gecko) Chrome/79.0.4927.49 Safari/489.72'}
    t0 = time.time()
    page_num = ''
    counter = 0
    error_pages = []

    for j in range(start_pg, end_pg+1):
        # only add page reference to url for 2nd page onwards
        if j !=1:
            page_num = f"_P{j}"
        
        # add '_P#' etc to end of url
        new_url = f"{url[:-4]}{page_num}.htm"
        # wait 1 second before requesting page
        time.sleep(1) 
        page = requests.get(new_url, headers=headers)
        print(new_url)
        
        # using predefined functions to create lists of titles, ratings, pros, cons
        soup = BeautifulSoup(page.content, 'lxml')
        titles = get_titles(soup)
        ratings = get_ratings(soup)
        pros, cons = get_pros_cons(soup)
        
        print('Array lengths: ', len(titles), len(ratings), len(pros), len(cons))
        
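        # handle company responses, as they will get included in pros/cons:
        # unequal list lengths make pd.DataFrame raise ValueError, so the page is recorded and skipped
        try: 
            temp_df = pd.DataFrame({'title':titles, 'rating':ratings, 'pros':pros, 'cons':cons})
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            df = pd.concat([df, temp_df], ignore_index=True)
        except ValueError:
            counter +=1
            print(f'Company responses on {counter} pages')
            error_pages.append(j)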

        print('Dataframe shape: ', df.shape, '\n')
        
        #save the df every 50 pages just in case the script fails
        if j%50 == 0:
            df.to_csv('../data/reviews_TEMP.csv')
        
    # print number of pages with company responses
    print('\n', f'Errors on {len(error_pages)} pages:{error_pages}')
    
    # add a 'company' column to the dataframe
    df['company'] = company
    
    t1 = time.time()
    print(f'This operation took {(t1-t0)/60} minutes to scrape {end_pg-start_pg+1} pages at a pace of {(t1-t0)/(end_pg-start_pg+1)} sec/page.')
    
    return df
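
The page URLs follow a simple pattern: page 1 is the bare reviews URL, and later pages insert `_P<n>` before the `.htm` extension. As a small sketch of that step in isolation (`build_page_url` is a hypothetical helper, and it assumes the first-page URL always ends in `.htm`):

```python
def build_page_url(first_page_url, page):
    """Return the URL for a given reviews page, following the '_P<n>' pattern."""
    if page == 1:
        return first_page_url
    # strip the trailing '.htm', then re-attach it after the page marker
    base = first_page_url[:-len('.htm')]
    return f"{base}_P{page}.htm"

first = 'https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm'
print(build_page_url(first, 1))  # https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm
print(build_page_url(first, 2))  # https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P2.htm
```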

## Web-scraping script in action:

Let's test the web-scraping script on 10 pages of Google reviews:

In [21]:
df = scrape_reviews('https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm','Google',1,10)

https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm
Array lengths:  10 10 10 10
Dataframe shape:  (10, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P2.htm
Array lengths:  10 10 10 10
Dataframe shape:  (20, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P3.htm
Array lengths:  10 10 10 10
Dataframe shape:  (30, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P4.htm
Array lengths:  10 10 10 10
Dataframe shape:  (40, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P5.htm
Array lengths:  10 10 10 10
Dataframe shape:  (50, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P6.htm
Array lengths:  10 10 10 10
Dataframe shape:  (60, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P7.htm
Array lengths:  10 10 10 10
Dataframe shape:  (70, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P8.htm
Array lengths:  10 10 10 10
Dataframe shape:  (80, 4) 

https://www.glassdoor.com/Reviews/Google-Reviews-E9079_P9.h

In [22]:
print(df.shape, '\n')
print(df.info())

(100, 5) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
title      100 non-null object
rating     100 non-null float64
pros       100 non-null object
cons       100 non-null object
company    100 non-null object
dtypes: float64(1), object(4)
memory usage: 4.0+ KB
None


The script worked as planned, printing the URL, array lengths, and running dataframe shape for each page as it went!
Let's preview the data:

In [23]:
df.head()

| | title | rating | pros | cons | company |
|---|---|---|---|---|---|
| 0 | "Good environment" | 5.0 | Positive work environment, lots of room for gr... | Can be lots of pressure from management at times. | Google |
| 1 | "Moving at the speed of light, burn out is ine... | 4.0 | 1) Food, food, food. 15+ cafes on main campus... | 1) Work/life balance. What balance? All thos... | Google |
| 2 | "Great balance between big-company security an... | 5.0 | \* If you're a software engineer, you're among ... | \* It \*is\* becoming larger, and with it comes g... | Google |
| 3 | "The best place I've worked and also the most ... | 5.0 | You can't find a more well-regarded company th... | I live in SF so the commute can take between 1... | Google |
| 4 | "Great" | 5.0 | Great benefits and pay | Nothing really to complain about | Google |


### Text cleaning

We'll have to do some text cleaning to remove line endings from the review text:

In [19]:
# raw strings keep '\s' from being treated as an invalid string escape
df.pros = df.pros.map(lambda x: re.sub(r'\s+', ' ', x))
df.cons = df.cons.map(lambda x: re.sub(r'\s+', ' ', x))

Now to lowercase the text and get rid of special characters (letters, digits, spaces, and periods are kept):

In [26]:
df.title = df.title.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n.]', '', x.lower()))
df.pros = df.pros.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n.]', '', x.lower()))
df.cons = df.cons.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n.]', '', x.lower()))
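
The two cleaning steps can also be combined into one function, which makes them easy to try on a single string. This is a sketch rather than the notebook's own code; `clean_text` is a hypothetical name, and since whitespace is collapsed first, the character class no longer needs to allow newlines.

```python
import re

def clean_text(text):
    """Collapse whitespace, lowercase, and drop everything except letters, digits, spaces and periods."""
    text = re.sub(r'\s+', ' ', text)                  # line endings and repeated spaces become one space
    return re.sub(r'[^a-z0-9 .]', '', text.lower())   # strip punctuation and other special characters

print(clean_text('"Moving at the speed of light,\r\nburn out is inevitable"'))
# moving at the speed of light burn out is inevitable

print(clean_text('1) Work/life balance. What balance?'))
# 1 worklife balance. what balance
```

With the dataframe above, the same function could be applied per column, for example `df.pros.map(clean_text)`.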

Preview the final dataframe:

In [27]:
df.head(10)

| | title | rating | pros | cons | company |
|---|---|---|---|---|---|
| 0 | good environment | 5.0 | positive work environment lots of room for gro... | can be lots of pressure from management at times. | Google |
| 1 | moving at the speed of light burn out is inevi... | 4.0 | 1 food food food. 15 cafes on main campus mtv... | 1 worklife balance. what balance all those p... | Google |
| 2 | great balance between bigcompany security and ... | 5.0 | if youre a software engineer youre among the ... | it is becoming larger and with it comes growi... | Google |
| 3 | the best place ive worked and also the most de... | 5.0 | you cant find a more wellregarded company that... | i live in sf so the commute can take between 1... | Google |
| 4 | great | 5.0 | great benefits and pay | nothing really to complain about | Google |
| 5 | great | 5.0 | product is great in todays market | no cons to report at this time | Google |
| 6 | working at google | 5.0 | amazing perks incredible people great resume b... | hard to get promoted differentiate yourself. | Google |
| 7 | good company | 5.0 | nice environment to new grads | na. i think it is good. | Google |
| 8 | good | 5.0 | google has good work culture | no cons has been found | Google |
| 9 | wonderful company and excellent experience | 5.0 | friendly supervisors positive work environment... | some may not enjoy certain aspects of job such... | Google |


In [28]:
# Export final dataframe
df.to_csv('../data/sample_reviews.csv')
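
Note that `to_csv` writes the row index by default, so reading the file back with `read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the difference on a tiny made-up frame, using in-memory buffers instead of files; passing `index=False` is one option if the index carries no information.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for the scraped reviews
sample = pd.DataFrame({'title': ['great'], 'rating': [5.0], 'company': ['Google']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'title', 'rating', 'company']
print(cols_without_index)  # ['title', 'rating', 'company']
```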