<a href="https://colab.research.google.com/github/monipip3/women_in_data_datathon/blob/main/Scraping_Glassdoor_Company_Reviews_parental_leave_only.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**[Check out the full post on my blog here.](https://bulletbyte.weebly.com/tech/how-to-scrape-a-companys-glassdoor-reviews-using-python)**

#1. Setting Up

In [1]:
#import the libraries
import os
import time
import joblib
import numpy as np
import pandas as pd
import math

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

#2. Scraping Company Glassdoor Reviews - Global

##2.1 Creating a Function to Scrape any Glassdoor Company review page (Global)


In [2]:
#create a function to scrape any Glassdoor company review page
#this code worked when I ran it on 7 Sep 2021, but Glassdoor changes its page HTML frequently
#if any of the lists comes back empty, inspect the webpage and update the HTML tags/classes below

def review_scraper(url=''):
  #scraping the web page content
  hdr = {'User-Agent': 'Mozilla/5.0'}

  req = Request(url,headers=hdr)
  page = urlopen(req)
  soup = BeautifulSoup(page, "html.parser") 

  #define some lists
  Summary=[]
  Date=[]
  JobTitle=[]
  OverallRating=[]

  #get the Summary
  for x in soup.find_all('p', {'class':'mt-std mb-std mt-sm-xsm'}):
    Summary.append(x.text)

  #get the Posted Date
  for x in soup.find_all('span', {'data-test':'review-date'}):
    Date.append(x.text)

  #get Job Title
  for x in soup.find_all('div', {'class':'mt-xxsm mt-sm-0 ml-0 ml-sm-xsm css-1uyte9r'}):
    JobTitle.append(x.text)

  #get Overall Rating
  for x in soup.find_all('strong', {'class':'mr-xxsm css-b63kyi'}):
    OverallRating.append(float(x.text))

  #putting everything together
  #note: zip stops at the shortest list, so one empty list produces an empty DataFrame
  Reviews = pd.DataFrame(list(zip(Summary, Date, JobTitle, OverallRating)), 
                    columns = ['Summary', 'Date', 'JobTitle', 'OverallRating'])
  
  return Reviews
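
Because `zip` stops at the shortest list, a single outdated selector makes `review_scraper` return an empty DataFrame without raising any error. Here is a minimal sketch of a sanity check you could run on the four lists before zipping them. The helper names `find_empty_fields` and `lengths_match` and the sample data are hypothetical and not part of the original notebook.

```python
def find_empty_fields(fields):
    """Return the names of the scraped fields whose lists came back empty."""
    return [name for name, values in fields.items() if len(values) == 0]


def lengths_match(fields):
    """Return True when every scraped list has the same number of items."""
    return len({len(values) for values in fields.values()}) <= 1


# Hypothetical example: the JobTitle selector no longer matches anything.
scraped = {
    "Summary": ["Available for up to 16 weeks", "12 weeks paid time off"],
    "Date": ["Sep 16, 2022", "Jul 20, 2022"],
    "JobTitle": [],
    "OverallRating": [4.0, 5.0],
}

print(find_empty_fields(scraped))  # ['JobTitle']
print(lengths_match(scraped))      # False
```

If `find_empty_fields` reports a name, that is the selector to re-inspect on the live page.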

##2.2 Setting the Target URL and Checking for Max number of Pages to Scrape

In [None]:
#paste the URL of the first page of the company's Glassdoor reviews between the quotes
input_url="https://www.glassdoor.com/Benefits/Walmart-Maternity-and-Paternity-Leave-US-BNFT23_E715_N1.htm"

In [None]:
#scraping the first page content
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(input_url, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, "html.parser") 

In [None]:
#check the total number of reviews
countReviews = soup.find('div', {'data-test':'pagination-footer-text'}).text
countReviews = float(countReviews.split(' Reviews')[0].split('of ')[1].replace(',',''))

#calculate the max number of pages (assuming 10 reviews a page)
countPages = math.ceil(countReviews/10)
countPages

29
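
The chained `split` calls above break with an `IndexError` if the footer wording changes. As a rough alternative, the sketch below does the same parsing with a regular expression and raises a clearer error. The function name `count_pages` and the sample footer strings are assumptions for illustration; the only thing taken from the notebook is that the text contains "of <number> Reviews" and that a page holds 10 reviews.

```python
import math
import re


def count_pages(footer_text, per_page=10):
    """Parse the total review count from the footer text and return the page count."""
    match = re.search(r"of\s+([\d,]+)\s+Reviews", footer_text)
    if match is None:
        raise ValueError(f"could not find a review count in: {footer_text!r}")
    total_reviews = int(match.group(1).replace(",", ""))
    return math.ceil(total_reviews / per_page)


# Assumed footer wording, used here only as sample input.
print(count_pages("Viewing 1 - 10 of 1,234 Reviews"))  # 124
```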

##2.3 Scraping Multiple Pages of Glassdoor Company Review

In [None]:
#I'm setting the max pages to scrape to 3 here to save time
maxPage = 3 + 1
#uncomment the line below to set the max page to scrape (based on total number of reviews)
#maxPage = countPages + 1

In [None]:
output = review_scraper(url=input_url)

In [None]:
print(output)

                                             Summary          Date  \
0  I dont really know much about it I do know you...  Sep 19, 2022   
1  I was able to take 4 months fully paid materni...  Sep 18, 2022   
2                       Available for up to 16 weeks  Sep 16, 2022   
3                 as usual not more that competitors   Sep 8, 2022   
4  It is fairly long comapared to others, you can...   Sep 1, 2022   
5                Maternity leave of absence 3 months  Aug 23, 2022   
6       generous benefits for both mother and father  Aug 15, 2022   
7                            Nothing is bad about it   Aug 7, 2022   
8  Available for full time associates with 1 year...   Aug 1, 2022   
9                             12 weeks paid time off  Jul 20, 2022   

                                            JobTitle  OverallRating  
0                         Current Claims in nullnull            3.0  
1                                    Former Employee            5.0  
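2                                   Current Employee            4.0  
3             Former Inventory Associate in nullnull            5.0  
4  Current Customer Service Representative (CSR) ...            3.0  
5           Current Fufillment Associate in nullnull            5.0  
6                                   Current Employee            5.0  
7                                    Former Employee            5.0  
8                    Current People Lead in nullnull            5.0  
9                                   Current Employee            5.0  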

In [None]:
#scraping multiple pages of company glassdoor review
#DataFrame.append was removed in pandas 2.0, so collect the pages and use pd.concat
base_url = input_url.split('.htm')[0]
pages = [review_scraper(url=input_url)]
for x in range(2,maxPage):
  page_url = base_url + "_IP{}.htm".format(x)
  pages.append(review_scraper(url=page_url))
output = pd.concat(pages, ignore_index=True)
#display the output
display(output)

| | Summary | Date | JobTitle | OverallRating |
|---|---|---|---|---|
| 0 | I dont really know much about it I do know you... | Sep 19, 2022 | Current Claims in nullnull | 3.0 |
| 1 | I was able to take 4 months fully paid materni... | Sep 18, 2022 | Former Employee | 5.0 |
| 2 | Available for up to 16 weeks | Sep 16, 2022 | Current Employee | 4.0 |
| 3 | as usual not more that competitors | Sep 8, 2022 | Former Inventory Associate in nullnull | 5.0 |
| 4 | It is fairly long comapared to others, you can... | Sep 1, 2022 | Current Customer Service Representative (CSR) ... | 3.0 |
| 5 | Maternity leave of absence 3 months | Aug 23, 2022 | Current Fufillment Associate in nullnull | 5.0 |
| 6 | generous benefits for both mother and father | Aug 15, 2022 | Current Employee | 5.0 |
| 7 | Nothing is bad about it | Aug 7, 2022 | Former Employee | 5.0 |
| 8 | Available for full time associates with 1 year... | Aug 1, 2022 | Current People Lead in nullnull | 5.0 |
| 9 | 12 weeks paid time off | Jul 20, 2022 | Current Employee | 5.0 |


#3. Scraping Parental Leave Reviews for the Top 50 Companies in the United States

In [3]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [4]:
%cd drive/MyDrive/Datathon/

/content/drive/MyDrive/Datathon


In [5]:
large_companies = pd.read_excel('large_companies.xlsx',sheet_name='Paid leave review links')

In [6]:
large_companies.head(5)

| | Rank | Company | Url | avg_paid_leave_2022 | glassdoor_paid_leave_reviews_link |
|---|---|---|---|---|---|
| 0 | 1.0 | Walmart | https://www.walmart.com/ | 3.9 | https://www.glassdoor.com/Benefits/Walmart-Mat... |
| 1 | 2.0 | Amazon | https://www.amazon.com/ | 4.1 | https://www.glassdoor.com/Benefits/Amazon-Mate... |
| 2 | 3.0 | Apple | https://www.apple.com/ | 4.6 | https://www.glassdoor.com/Benefits/Apple-Mater... |
| 3 | 4.0 | CVS Health | https://www.cvshealth.com/ | 3.2 | https://www.glassdoor.com/Benefits/CVS-Health-... |
| 4 | 5.0 | UnitedHealth Group | https://www.unitedhealthgroup.com/ | 3.6 | https://www.glassdoor.com/Benefits/UnitedHealt... |


In [7]:
companies = large_companies.Company.values
company_urls = large_companies.glassdoor_paid_leave_reviews_link.values

In [14]:
companies[33]

'Lowes'

In [15]:
company_urls[33]

'https://www.glassdoor.com/Benefits/Lowe-s-Home-Improvement-Maternity-and-Paternity-Leave-US-BNFT23_E415_N1.htm'

In [None]:
company_urls[0].split('.htm')

['https://www.glassdoor.com/Benefits/Walmart-Maternity-and-Paternity-Leave-US-BNFT23_E715_N1',
 '']
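
The split above is the basis for building the URL of any later page: drop the `.htm` suffix, then add `_IP{n}.htm`. A small sketch of that step as a reusable function follows. The name `build_page_url` is hypothetical; the `_IP{n}` pattern is the one already used in this notebook.

```python
def build_page_url(first_page_url, page):
    """Return the URL of the given review page (page 1 is the original URL)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return first_page_url
    base = first_page_url.split('.htm')[0]
    return f"{base}_IP{page}.htm"


first = "https://www.glassdoor.com/Benefits/Walmart-Maternity-and-Paternity-Leave-US-BNFT23_E715_N1.htm"
print(build_page_url(first, 3))
```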

In [17]:
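#make sure the output folder exists before saving
os.makedirs('./paid_leave_reviews', exist_ok=True)

#note: the range starts at 1, so companies[0] (Walmart) is skipped here
for i in range(1,len(companies)):
#single-company run (index 33 is Lowes), which produced the output shown below
#for i in range(33,34):
  print(f'Scraping {companies[i]} Reviews')
  base_url = company_urls[i].split('.htm')[0]
  #page 1 uses the original url; pages 2 to 49 add the _IP{n} suffix
  pages = [review_scraper(url=company_urls[i])]
  for x in range(2,50):
   # print(f'On Page {x}')
    pages.append(review_scraper(url=base_url + "_IP{}.htm".format(x)))
  #DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat
  output = pd.concat(pages, ignore_index=True)
  output['company'] = companies[i]
  joblib.dump(output,f'./paid_leave_reviews/paid_leave_reviews_{companies[i]}.pkl')
  print(f'Finished Scraping {companies[i]} Reviews')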

Scraping Lowes Reviews
Finished Scraping Lowes Reviews
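
The loop above always requests pages 2 to 49 for every company, even when a company has far fewer review pages, and it sends the requests back to back. One possible refinement, sketched below, is to pause between requests and stop at the first page that comes back empty. This assumes that a page past the end yields no reviews, which I have not verified against Glassdoor. The function `scrape_until_empty` and the fake fetcher are hypothetical stand-ins so the example runs without network access.

```python
import time


def scrape_until_empty(fetch_page, urls, delay_seconds=1.0):
    """Fetch pages in order, pausing between requests, and stop at the first empty one."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the server
        page = fetch_page(url)
        if len(page) == 0:  # works for lists and for DataFrames
            break
        pages.append(page)
    return pages


# Fake fetcher standing in for review_scraper: two pages of reviews, then nothing.
fake_site = {"page1": ["review a", "review b"], "page2": ["review c"]}
requested = []


def fake_fetch(url):
    requested.append(url)
    return fake_site.get(url, [])


pages = scrape_until_empty(fake_fetch, ["page1", "page2", "page3", "page4"], delay_seconds=0)
print(len(pages))  # 2
print(requested)   # ['page1', 'page2', 'page3']
```

With the real scraper you would pass `review_scraper` as `fetch_page` and combine the returned DataFrames with `pd.concat(pages, ignore_index=True)`.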
