# Glassdoor Company Scraper


This scraper downloads company reviews from the Glassdoor website.
The goal is to build a small database for research purposes.

The notebook is organized into the following sections:

- Set up the environment (install libraries, set variables and credentials, ...)
- Sign in with your credentials
- Download the index (with Selenium and ChromeDriver)
- Parse the DOM of the web pages and download the reviews
- Store the data in JSON or CSV files

### Setup of the env

Install and import the Python libraries.

In [2]:
!pip3 install selenium
!pip3 install pprint
!pip3 install pandas

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0
Collecting pprint
  Downloading https://files.pythonhosted.org/packages/99/12/b6383259ef85c2b942ab9135f322c0dce83fdca8600d87122d2b0181451f/pprint-0.1.tar.gz
Building wheels for collected packages: pprint
  Building wheel for pprint (setup.py) ... done
  Created wheel for pprint: filename=pprint-0.1-cp37-none-any.whl size=1250 sha256=c27a6256ada274ef77ac33df82dee021da880b38948ef2dd94c0a585b9897f5a
  Stored in directory: /Users/mauropelucchi/Library/Caches/pip/wheels/42/d4/c6/16a6495aecc1bda5d5857bd036efd50617789ba9bea4a05124
Successfully built pprint
Installing collected packages: pprint
Successfully installed pprint-0.1


In [3]:
import requests
import pprint
import pandas as pd
import time
from selenium import webdriver as wd
import selenium
import json

Set the following variables to control the download:

- `locations`: list of places whose firms you want to download
- `max_page`: maximum number of index pages to visit for each location
- `max_page_reviews`: maximum number of review pages to download for each firm
- `sleep_time`: number of seconds to wait between requests, to be polite to Glassdoor




In [4]:
locations = ['milano','roma']
max_page = 1
max_page_reviews = 10
sleep_time = 1


This notebook uses ChromeDriver to simulate user interaction with Glassdoor.
To set up ChromeDriver on your machine, please refer to https://chromedriver.chromium.org/downloads

The notebook was tested with
`ChromeDriver 79.0.3945.36`

Please set `chromedriver_path` to the location of your ChromeDriver executable.
For example:

~~~~~
chromedriver_path =  '/Users/mauropelucchi/Downloads/chromedriver2'
~~~~~

In [5]:
chromedriver_path =  '/Users/mauropelucchi/Downloads/chromedriver'
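
Before starting the browser, it can help to confirm that `chromedriver_path` points at a file you are allowed to execute. The following is a minimal sketch using only the standard library; the helper name `chromedriver_is_usable` and the placeholder path are assumptions for illustration and are not part of the original notebook.

```python
import os

def chromedriver_is_usable(path):
    """Return True if `path` is an existing file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)

# Hypothetical placeholder path: replace it with your own chromedriver_path
print(chromedriver_is_usable('/path/to/chromedriver'))
```

A `False` result usually means the path is wrong, points at a folder instead of the executable, or the file lacks execute permission.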

### Glassdoor credentials

To obtain firm reviews you have to sign in to Glassdoor.
Please provide your credentials here:

In [13]:
username = "****@**"
password = "******"
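
As an alternative to typing credentials into the notebook, they could be read from environment variables so that they are not saved with the file. This is only a sketch: the variable names `GLASSDOOR_USERNAME` and `GLASSDOOR_PASSWORD` and the helper `read_credentials` are hypothetical and chosen for this example.

```python
import os

def read_credentials(env=os.environ):
    """Return (username, password) taken from environment variables.

    GLASSDOOR_USERNAME and GLASSDOOR_PASSWORD are hypothetical names
    chosen for this sketch.
    """
    try:
        return env['GLASSDOOR_USERNAME'], env['GLASSDOOR_PASSWORD']
    except KeyError as missing:
        raise RuntimeError(f'Environment variable {missing} is not set') from None
```

With both variables exported in the shell that launches the notebook, `username, password = read_credentials()` would fill the two variables used by the sign-in step.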

# Sign in to Glassdoor

The `get_browser` function configures the browser and starts ChromeDriver.

The `sign_in` function simulates a user logging in to Glassdoor:

- Click the "accept cookies" button
- Type your username
- Type your password
- Click the login button


In [29]:
# from https://github.com/MatthewChatham/glassdoor-review-scraper/blob/master/main.py
from selenium.common.exceptions import NoSuchElementException

def get_browser():
    # Start Chrome through ChromeDriver with reduced console logging
    chrome_options = wd.ChromeOptions()
    chrome_options.add_argument('log-level=3')
    browser = wd.Chrome(chromedriver_path, options=chrome_options)
    return browser

browser = get_browser()

def sign_in():
    print(f'Signing in to {username}')
    url = 'https://www.glassdoor.it/profile/login_input.htm'
    browser.get(url)
    time.sleep(4)  # give the login page time to load
    # Accept the cookie banner, if it is shown
    try:
        browser.find_element_by_id('_evidon-accept-button').click()
    except NoSuchElementException:
        pass
    # Fill in the login form and submit it
    email_field = browser.find_element_by_name('username')
    password_field = browser.find_element_by_name('password')
    submit_btn = browser.find_element_by_xpath('//button[@type="submit"]')
    email_field.send_keys(username)
    password_field.send_keys(password)
    submit_btn.click()
    time.sleep(1)

sign_in()

Signing in to mauro.pelucchi@unimib.it
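
The sign-in code waits a fixed number of seconds for the page to load. A possible alternative is to poll until the element you need is present. The helper below is a generic sketch, not part of the original notebook; `wait_until` and its parameters are assumptions. Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met in time')
        time.sleep(poll)
```

For instance, `wait_until(lambda: browser.find_elements_by_name('username'))` would return as soon as the login field appears, since `find_elements_by_name` returns an empty list while the field is missing.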


# Get firm data

`get_firms` reads the firms listed on the current index page and builds one dict per firm, for example:

~~~~
{'company_name': ' Accenture ',
  'link': '/Panoramica/Lavorando-in-Accenture-EI_IE4138.13,22.htm',
  'rating': 3.8}
~~~~

It then calls `get_firm_data` for each firm and stores the result under the `reviews` key.

`get_firm_data` takes the link to a company page and returns a list of review dicts, following the "next page" link for up to `max_page_reviews` pages. You can call it directly to obtain the dataset of reviews for a single firm by following these steps:

- Set up the link to the Glassdoor company page
~~~~~
company_url = "https://www.glassdoor.it/Panoramica/Lavorando-in-Intesa-Sanpaolo-EI_IE10537.13,28.htm"
~~~~~
- Run `get_firm_data(company_url)`
- Store the result in a CSV file


For example:
~~~~~
company_url = "https://www.glassdoor.it/Panoramica/Lavorando-in-Intesa-Sanpaolo-EI_IE10537.13,28.htm"
reviews = get_firm_data(company_url)
df = pd.DataFrame.from_dict(reviews)
df.to_csv('reviews.csv')
~~~~~

In [28]:

from selenium.common.exceptions import NoSuchElementException

def get_firms():
    """Collect the firms on the current index page, then download their reviews."""
    doc_firms = browser.find_elements_by_class_name('eiHdrModule')
    print(len(doc_firms))
    my_firms = []
    # First pass: read everything from the index page before navigating away,
    # because the element references are no longer valid after browser.get()
    for d_firm in doc_firms:
        name_link = d_firm.find_element_by_class_name("tightAll")
        rating_text = d_firm.find_element_by_class_name("ratingsSummary").text.replace(",", ".")
        try:
            rating = float(rating_text)
        except ValueError:
            # Keep the raw text when the rating is not a number
            rating = rating_text
        my_firms.append({
            "company_name": name_link.text,
            "rating": rating,
            "link": name_link.get_attribute('href').replace("Panoramica", "Recensioni"),
        })
    # Second pass: visit each firm and attach its reviews
    for my_firm in my_firms:
        my_firm['reviews'] = get_firm_data(my_firm['link'])
    return my_firms

def get_firm_data(link):
    """Return the reviews of one firm, following up to max_page_reviews pages."""
    reviews = []
    if link.endswith(".htm"):
        page_link = link
        for page_number in range(1, max_page_reviews + 1):
            print(page_link)
            reviews.extend(get_firm_reviews(page_link))
            # Follow the "next page" arrow; stop when there is no further page
            try:
                page_link = browser.find_element_by_class_name(
                    "pagination__ArrowStyle__nextArrow").get_attribute('href')
            except NoSuchElementException:
                break
            if not page_link:
                break
    else:
        reviews.extend(get_firm_reviews(link))
    return reviews

def get_firm_reviews(link):
    """Open one review page and return its reviews as a list of dicts."""
    browser.get(link.replace("Panoramica", "Recensioni"))
    time.sleep(5)
    reviews = []
    doc_reviews = browser.find_elements_by_class_name('empReview')
    for doc_rev in doc_reviews:
        main_text = doc_rev.find_element_by_class_name('mainText').text.replace('\n', ' ')
        date = doc_rev.find_element_by_class_name('date').text.replace('\n', ' ')
        reviewer = doc_rev.find_element_by_class_name('reviewer').text.replace('\n', ' ')
        # Free-text fields, in page order: benefits, drawbacks, tips
        texts = doc_rev.find_elements_by_class_name('common__EiReviewTextStyles__allowLineBreaks')
        benefits = texts[0].text.replace('\n', ' ') if len(texts) > 0 else ''
        drawbacks = texts[1].text.replace('\n', ' ') if len(texts) > 1 else ''
        tips = texts[2].text.replace('\n', ' ') if len(texts) > 2 else ''
        # Sub-ratings, in page order; a missing rating is stored as ''
        ratings = doc_rev.find_elements_by_css_selector(".subRatings ul li .gdBars")
        balance = ratings[0].get_attribute('title') if len(ratings) > 0 else ''
        culture = ratings[1].get_attribute('title') if len(ratings) > 1 else ''
        opportunity = ratings[2].get_attribute('title') if len(ratings) > 2 else ''
        salary = ratings[3].get_attribute('title') if len(ratings) > 3 else ''
        executives = ratings[4].get_attribute('title') if len(ratings) > 4 else ''
        review = {"main_text": main_text,
                  "date": date,
                  "reviewer": reviewer,
                  "benefits": benefits,
                  "drawbacks": drawbacks,
                  "tips": tips,
                  "balance": balance,
                  "culture": culture,
                  "opportunity": opportunity,
                  "salary": salary,
                  "executives": executives
                 }
        reviews.append(review)
    return reviews
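
`get_firm_reviews` repeats the same "take the n-th item, or an empty string if it is missing" pattern for every text and rating field. A small helper could express that once. This is a sketch with made-up sample values; `item_at` is a hypothetical name and is not used by the notebook itself.

```python
def item_at(items, index, default=''):
    """Return items[index], or `default` when the list is too short."""
    return items[index] if len(items) > index else default

fields = ['pros text', 'cons text']   # made-up sample values
benefits = item_at(fields, 0)
drawbacks = item_at(fields, 1)
tips = item_at(fields, 2)             # missing, so it falls back to ''
```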


## Download a list of companies by locations

`download_index` downloads the index pages for a location from Glassdoor and calls `get_firms` on each page to build a list of firms with their reviews.

In [None]:

def download_index(location):
    results = []
    for page_number in range(1, max_page + 1):
        # Note: the "0,6" and "IM1058" parts of the URL are fixed in this template
        page_index = f"https://www.glassdoor.it/Recensioni/{location}-recensioni-SRCH_IL.0,6_IM1058_IP{page_number}.htm"
        print(f"Download data from {page_index} - Page {page_number}")
        browser.get(page_index)
        current_firms = get_firms()
        results.extend(current_firms)
        time.sleep(sleep_time)
    return results

In [None]:
total_firms = []
for location in locations:
    total_firms.extend(download_index(location))

Review the downloaded data:

In [None]:
pprint.pprint(total_firms)

Store the data in a JSON file:

In [None]:
with open('my_data1.json', 'w') as fp:
    json.dump(total_firms, fp)
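
The JSON file keeps the reviews nested inside each firm. To get one CSV row per review instead, the nested structure can be flattened first. The sketch below uses only the standard library and a made-up sample with the same shape as `total_firms`; the names `flatten_firms`, `write_rows_csv` and `company_rating` are assumptions, and it assumes every review dict has the same keys.

```python
import csv
import io

def flatten_firms(firms):
    """Yield one flat dict per review, with the firm fields copied onto each row."""
    for firm in firms:
        for review in firm.get('reviews', []):
            row = {'company_name': firm['company_name'],
                   'company_rating': firm['rating'],
                   'link': firm['link']}
            row.update(review)
            yield row

def write_rows_csv(rows, fp):
    """Write dict rows to the open text file `fp` as CSV and return the row count."""
    rows = list(rows)
    if not rows:
        return 0
    writer = csv.DictWriter(fp, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)

# Made-up sample with the same shape as `total_firms`
sample = [{'company_name': 'Example Bank', 'rating': 3.8, 'link': 'https://example.com',
           'reviews': [{'main_text': 'Good', 'date': '1 Jan 2020'},
                       {'main_text': 'Fine', 'date': '2 Jan 2020'}]}]
buffer = io.StringIO()
count = write_rows_csv(flatten_firms(sample), buffer)
```

To write a real file, pass a handle opened with `open('my_reviews.csv', 'w', newline='')` in place of the in-memory buffer; the file name here is only an example.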

## Download reviews of banks

For example, we can use this notebook to download reviews of major banks in order to compare them as places to work.

Here is an example that builds a dataset for UniCredit, Intesa Sanpaolo and Deutsche Bank.



In [30]:
company_url = "https://www.glassdoor.it/Panoramica/Lavorando-in-Intesa-Sanpaolo-EI_IE10537.13,28.htm"
reviews = get_firm_data(company_url)
df = pd.DataFrame.from_dict(reviews)
# to_csv returns None when given a path, so do not reassign df
df.to_csv('intesa_san_paolo.csv')

https://www.glassdoor.it/Panoramica/Lavorando-in-Intesa-Sanpaolo-EI_IE10537.13,28.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P2.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P3.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P4.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P5.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P6.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P7.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P8.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P9.htm
https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537_P10.htm
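
The printed links suggest that review pages after the first one share a `_P<n>.htm` suffix. Under that assumption, a page URL could be built directly instead of being read from the "next page" arrow. This is a sketch only: `review_page_url` is a hypothetical helper, and the first-page URL below is inferred from the printed links rather than taken from the notebook.

```python
def review_page_url(first_page_url, page_number):
    """Build the URL of review page `page_number` from the first review page URL."""
    if page_number < 1:
        raise ValueError('page_number starts at 1')
    if page_number == 1 or not first_page_url.endswith('.htm'):
        return first_page_url
    return first_page_url[:-len('.htm')] + f'_P{page_number}.htm'

# Assumed first-page URL, inferred from the printed links
first = 'https://www.glassdoor.it/Recensioni/Intesa-Sanpaolo-Recensioni-E10537.htm'
print(review_page_url(first, 2))
```

Following the arrow, as the notebook does, has the advantage of not depending on this URL pattern staying the same.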


In [31]:
company_url = "https://www.glassdoor.it/Panoramica/Lavorando-in-UniCredit-Group-EI_IE10546.13,28.htm"
reviews = get_firm_data(company_url)
df = pd.DataFrame.from_dict(reviews)
# to_csv returns None when given a path, so do not reassign df
df.to_csv('unicredit.csv')

https://www.glassdoor.it/Panoramica/Lavorando-in-UniCredit-Group-EI_IE10546.13,28.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P2.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P3.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P4.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P5.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P6.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P7.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P8.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P9.htm
https://www.glassdoor.it/Recensioni/UniCredit-Group-Recensioni-E10546_P10.htm


In [32]:
company_url = "https://www.glassdoor.it/Panoramica/Lavorando-in-Deutsche-Bank-EI_IE3150.13,26.htm"
reviews = get_firm_data(company_url)
df = pd.DataFrame.from_dict(reviews)
# to_csv returns None when given a path, so do not reassign df
df.to_csv('deutsche_bank.csv')

https://www.glassdoor.it/Panoramica/Lavorando-in-Deutsche-Bank-EI_IE3150.13,26.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P2.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P3.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P4.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P5.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P6.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P7.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P8.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P9.htm
https://www.glassdoor.it/Recensioni/Deutsche-Bank-Recensioni-E3150_P10.htm
