# GlassDoor Scrape Analysis and Function
## Table of Contents
- [Brief EDA](#Brief-EDA)
- [Data Scrape - Selenium (with Problem Statement)](#Data-Scrape---Selenium)
- [Final Working Function](#Final-Working-Function)
- [Takeaways and conclusion](#Takeaways-and-conclusion)
- [Example Codes](#Example-Codes)


## Executive Summary:
We successfully built a scraper.
- Inputs: keyword, location (optional).
- Outputs: City, State, Job name, Average Salary, and the number of postings by posting age (last 3, 7, and 30 days)

We intend to use this scraper to capture live job listing and wage data and return it in a format that can be iterated over to build a DataFrame. State and City will match the search criteria from our main DataFrame. How we use the scraper will depend on the use case and context; for now it captures basic job information and can be run repeatedly to build a DataFrame over time.

##### Sources:
- [Glass Door Developer Page](https://www.glassdoor.com/developer/index.htm) : Dev page for API scrape
- [Glass Door first URL](https://www.glassdoor.com/Job/index.htm) : Initial URL
- [Glass Door final URL](https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientes&sc.keyword=data+scientest&locT=&locId=&jobType=) : Final URL that worked with our scraper
- [Glassdoor scrape example - Selenium (git)](https://github.com/arapfaik/scraping-glassdoor-selenium) : github link to author of Medium article
- [Glassdoor scrape example - Selenium (Medium)](https://medium.com/@jamievaron/to-anyone-who-has-lost-themselves-9c5e3049cb13) : Link to Medium article covering Glassdoor and Selenium
- [Selenium Review SEA-FLEX-11](https://git.generalassemb.ly/charles-rice/SEA-Flex-11/tree/master/08_week/selenium-webscraping) : Selenium flex review lab completed with our instructor.
- [Xpath guide](https://devhints.io/xpath) : reference guide for Xpath commands
- [Selenium Keys_Input Documentation](https://selenium-python.readthedocs.io/api.html) : reference guide for special keys

### Brief EDA

In [1]:
# imports
from selenium import webdriver
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
import string

In [2]:
# load in dataset
df = pd.read_csv('../data/main_df.csv')

In [3]:
# check the head, verify columns and information
df.head(2)

Unnamed: 0,disasterNumber,state,incidentType,year,month,occ_code,occ_title,tot_emp,h_mean,a_mean,employment_rate_during,employment_rate_before,employment_rate_after,employment_rate_change,wage_change
0,1190,NE,Severe Storm(s),1997.0,11.0,13002,Financial Managers,3730.0,24.5,50960,71.5,71.5,71.6,-0.1,-43864.8
1,1190,NE,Severe Storm(s),1997.0,11.0,13005,"Personnel, Training, and Labor Relations Managers",1420.0,21.41,44540,71.5,71.5,71.6,-0.1,-14593.056


In [4]:
# verify columns look clean, no mistakes slipped in
df.columns

Index(['disasterNumber', 'state', 'incidentType', 'year', 'month', 'occ_code',
       'occ_title', 'tot_emp', 'h_mean', 'a_mean', 'employment_rate_during',
       'employment_rate_before', 'employment_rate_after',
       'employment_rate_change', 'wage_change'],
      dtype='object')

In [5]:
# basic eda of dtypes for future references
df.dtypes

disasterNumber              int64
state                      object
incidentType               object
year                      float64
month                     float64
occ_code                   object
occ_title                  object
tot_emp                   float64
h_mean                    float64
a_mean                     object
employment_rate_during    float64
employment_rate_before    float64
employment_rate_after     float64
employment_rate_change    float64
wage_change               float64
dtype: object

In [6]:
# We want to categorize each of the business types/categories in our main df. This is shared code run by each member
## of our group who is working with this dataframe.

# Code from Patrick Kajibale 
df['Business_type']= df['occ_code'].apply(lambda x : x[:2])
#pd.set_option('display.max_row()',None)
industry = {
        '13' :'Business and Financial Operations',
        '15' :'Computer and Mathematical',
        '17' :'Architecture and Engineering',
        '19' :'Life, Physical, and Social Science',
        '21' :'Community and Social Service',
        '23' :'Legal',
        '25' :'Educational Instruction and Library',
        '27' :'Arts, Design, Entertainment, Sports, and Media',
        '29' :'Healthcare Practitioners and Technical',
        '31' :'Healthcare Support',
        '33' :'Protective Service',
        '35' :'Food Preparation and Serving Related',
        '37' :'Building and Grounds Cleaning and Maintenance',
        '39' :'Personal Care and Service',
        '41' :'Sales and Related',
        '43' :'Office and Administrative Support',
        '45' :'Farming, Fishing, and Forestry',
        '47' :'Construction and Extraction',
        '49' :'Installation, Maintenance, and Repair',
        '51' :'Production',
        '53' :'Transportation and Material Moving'}
# assign the result back rather than calling replace(inplace=True) on a column selection
df['Business_type'] = df['Business_type'].replace(industry)

In [7]:
df.head(2)
## we'll keep business_type categorization in case we need it later

Unnamed: 0,disasterNumber,state,incidentType,year,month,occ_code,occ_title,tot_emp,h_mean,a_mean,employment_rate_during,employment_rate_before,employment_rate_after,employment_rate_change,wage_change,Business_type
0,1190,NE,Severe Storm(s),1997.0,11.0,13002,Financial Managers,3730.0,24.5,50960,71.5,71.5,71.6,-0.1,-43864.8,Business and Financial Operations
1,1190,NE,Severe Storm(s),1997.0,11.0,13005,"Personnel, Training, and Labor Relations Managers",1420.0,21.41,44540,71.5,71.5,71.6,-0.1,-14593.056,Business and Financial Operations
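
The prefix lookup above can also be checked on plain strings without pandas. This is a minimal sketch, assuming `occ_code` values are strings whose first two characters identify the major group. `business_type` is a hypothetical helper name that is not part of our notebook, and only three entries of the `industry` dictionary are repeated here.

```python
# Subset of the industry dictionary from the cell above
industry = {
    '13': 'Business and Financial Operations',
    '15': 'Computer and Mathematical',
    '17': 'Architecture and Engineering',
}

def business_type(occ_code, lookup=industry):
    """Return the industry name for an occupation code, or the bare prefix if unmapped."""
    prefix = str(occ_code)[:2]
    # Like Series.replace with a dict, values without a mapping are left unchanged
    return lookup.get(prefix, prefix)

print(business_type('13002'))  # Business and Financial Operations
print(business_type('99001'))  # 99 (no mapping, so the prefix comes back as-is)
```

Returning the bare prefix for unmapped codes makes it easy to spot occupation groups that are missing from the dictionary.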


In [229]:
# nothing else from the dataset is needed for now as we focus on scraping

### Data Scrape - Selenium

My initial workflow is as follows:
1. Establish a working cell that opens the page successfully
2. Build a function that can output some piece of information from a page (a test)
3. Continue to iterate on the function to include more extraction features<br>
### Problem Statement: 
We want a function that can scrape Glassdoor's job listings based on job name and location, and return the average salary, location, and listing quantities. Our goal is to have a scraper that can assist with building a database from live listings to report and model on during/after a major disaster.

Notes:
I referenced two examples of Selenium usage
##### Medium/Github
 - This was a Medium article and Jupyter notebook (pulled from GitHub) containing a dataframe scrape function.
 - I had a few issues with it:
  - The base URL had an embedded location and preferences, so the tool only worked for that area.
  - Glassdoor has anti-scraping measures that prevent customizing search selections through the URL. No matter what you type for the keyword, it selects a location (with a keyed ID number) and inserts that into the URL as the search criteria. **This is where Selenium's ability to mimic a human user helps.**
  - This code did lead me to read more about XPath, how it works, and how to use it for building data frames.
 

##### SEA-Flex-11 
 - The second guide I used was a recording from our local instructor and data aficionado.
 - It taught me how to use Beautiful Soup alongside Selenium.
 - The code included examples of mimicking human operations, which I studied extensively.
 - THANK YOU AS ALWAYS CHARLIE!
 
 
##### Last notes before Function Code 
- One guide I used suggests I will see a pop-up asking me to sign up for Glassdoor.
 - I am not seeing a log-in prompt every time I click. I will keep this code as a safety net in case the behavior depends on my own machine.
  - This is not included in the current running code
 - `options.add_argument('headless')` to bypass the Chrome window opening
  - This is not included in the current example


#### Failed URL Options:
***These are my initial notes and breakdown of url to find a way to specify location***
- full search url example:
 - `https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist&locT=&locId=&jobType=`
 <br><br> What we need is: 
 - `https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=` + *JOBNAME* + `&locT=&locId=&jobType=`
 - Result: Unfortunately, Chrome defaulted to Glassdoor's search homepage. This is a problem when driving Chrome through Selenium: Glassdoor ignored any location we entered and defaulted to our IP geo-tag.
 <br>
 - **Solution**: We use a URL with results already displayed, because its `location` and `keyword` search boxes DO respond to input.
  - NOTE: Glassdoor requires the search URL to contain a numerical locId that is locked behind their API. My workaround was to have Selenium press arrow-down and then tab to select the best match. I applied this to both the `keyword` and `location` search boxes.
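
For reference, the keyword-only URL described above can be assembled with the standard library instead of string concatenation. This is only a sketch: `build_search_url` is a hypothetical helper, and as noted in the result above, Glassdoor did not honor this URL for location in our tests.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.glassdoor.com/Job/jobs.htm'

def build_search_url(job_name):
    """Assemble the keyword-only search URL from the notes above (location left blank)."""
    params = {
        'suggestCount': 0,
        'suggestChosen': 'false',
        'clickSource': 'searchBtn',
        'typedKeyword': job_name,
        'sc.keyword': job_name,
        'locT': '',
        'locId': '',
        'jobType': '',
    }
    # urlencode escapes each value and turns spaces into '+'
    return BASE_URL + '?' + urlencode(params)

print(build_search_url('data scientist'))
```

Letting `urlencode` handle the escaping avoids stray quote characters and unescaped spaces in the query string.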
  
#### Blank search form:
- https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=&locT=&locId=&jobType=&context=Jobs&sc.keyword=&dropdown=0
- Result: Still defaulted to geo-ip.

#### Final URL:
- Our best result came from typing into the search bars on a page that already displays results (as opposed to the Glassdoor search homepage and its auto-location issues).
- https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist&locT=C&locId=1138213&jobType=
- Result: It's a nice nod to Data Science and a workaround for our problem.

Search Dictionary:
- `typedKeyword=&` >> captures job search input in search bar
- `sc.keyword=&` >> mimics above
- `locT=&` >> required* location type code (`C` in our working URL)
- `locId=&` >> required* numerical
- `jobType=` >> unused

Notes: Originally I tracked the URL parameters for search terms. Ultimately, tinkering with them gave poor results because Glassdoor masks many of its categories (location ID, settings functionality, etc.) behind numerical IDs with no key. The API is also very restrictive and requires a partner's license to access (we were denied).
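
To see the search dictionary for a given URL, the query string can be split with the standard library. A small sketch using our final working URL; nothing here touches Glassdoor, it only parses the text of the URL.

```python
from urllib.parse import urlsplit, parse_qs

final_url = ('https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false'
             '&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist'
             '&locT=C&locId=1138213&jobType=')

# keep_blank_values=True keeps empty parameters such as jobType in the result
params = parse_qs(urlsplit(final_url).query, keep_blank_values=True)

for name, values in params.items():
    print(f'{name:15} {values[0]!r}')
```

Each value comes back as a list because a query string may repeat a parameter; here every list has a single entry.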

#### First Step: Ensure we can open Glassdoor.

In [8]:
# set url for test
# this url will take us to a blank search page with 'Jobs' and 'Location' 
url = 'https://www.glassdoor.com/Job/index.htm'

In [9]:
driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")
driver.get(url)

In [33]:
# this opens up a page with two search criteria, Job Title and Area.
# we will have to mimic clicking and entering our inputs for each category

### Inspect Elements:
This is a reference page I built out for tracking the two search bars from a clean page and a page with results already displayed. The **2nd attempt** was our working url and searchbar combo.


#### Job Search Bar
**Xpath**<br>`//*[@id="LocationSearch"]//div[1]` <br>
**HTML** <br> `<input id="LocationSearch" class="loc" type="text" tabindex="0" value="Bellevue, WA" data-srch-type="popular" data-test="search-bar-location-input" placeholder="Location" aria-label="Location">`

#### Location Search Bar
**Xpath** <br> `//*[@id="LocationSearch"]` <br>
**HTML** <br> `<input id="LocationSearch" class="loc" type="text" tabindex="0" value="Bellevue, WA" data-srch-type="popular" data-test="search-bar-location-input" placeholder="Location" aria-label="Location">`


## 2nd attempt, via new hyperlink
#### Job Search Bar
**Xpath**<br>`//*[@id="sc.keyword"]` <br>
**HTML** <br> `<input name="sc.keyword" id="sc.keyword" class="keyword" type="text" tabindex="0" value="" placeholder="Job Title, Keywords, or Company" data-auto-complete="true" data-ac-version="New" data-test="search-bar-keyword-input" aria-label="Keyword" autocomplete="off">`

#### Location Search Bar
**Xpath** <br> `//*[@id="sc.location"]` <br>
**HTML** <br> `<input id="sc.location" class="loc" type="text" tabindex="0" value="Bellevue, WA" data-srch-type="popular" data-test="search-bar-location-input" placeholder="Location" aria-label="Location" autocomplete="off">`

#### 2nd Step: build a function that successfully navigates us to the requested search page.

In [10]:
def job_track(keyword, location = None):
    # query chrome driver for selenium
    driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")
    
    # set url
    url = 'https://www.glassdoor.com/Job/index.htm'
    
    # execute driver on url
    driver.get(url)
    
    
    #locates the search field specifying 'Job Title, Keywords, or Company'
    kw = driver.find_element_by_xpath('//*[@id="KeywordSearch"]')
    kw.clear()
    kw.send_keys(keyword)
    sleep(2)
    #locates the search field specifying 'Location'
    ## the current issue is that glassdoor defaults your location by mac/ip
    ## so we manually enter our location in the search field
    loc = driver.find_element_by_xpath('//*[@id="LocationSearch"]')
    loc.clear()
    loc.send_keys(location)
    sleep(2)
    loc.send_keys(u'\ue006')
    
    # and we xpath the search button to click search
#     click = driver.find_element_by_xpath('//*[@id="HeroSearchButton"]')
#     click.click()
    sleep(1)
    driver.close()


In [11]:
job_track('data scientist', location = 'Virginia')

Findings: This works, but regardless of the location input, Glassdoor overwrites your location with its own handling.

#### Step 2 revised: changing url to that of returned results (still has access to search bars)
Our idea is to work around the autocompletion: locate the location field, type our query, add a short delay, and select the first option from the suggestion menu.
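
The fixed delays in the cells below are a guess at how long the suggestion menu takes to appear. One alternative is to poll for a condition with a timeout. The sketch below is plain Python, and `wait_until` and `menu_ready` are hypothetical names used only for illustration; Selenium also ships its own `WebDriverWait` class for explicit waits, which we have not used in this notebook.

```python
import time

def wait_until(condition, timeout=5.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None  # gave up waiting
        time.sleep(interval)

# Stand-in for "has the autocomplete menu appeared yet?"; succeeds on the third check
attempts = {'count': 0}

def menu_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3

print(wait_until(menu_ready, timeout=2.0, interval=0.01))
```

In a real run, the condition would be a check against the page, for example whether a suggestion element can be found.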

In [12]:
# running a test cell to verify new url works
# changing url - this cell was run to verify our search bars updated
url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist&locT=C&locId=1138213&jobType="
driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
sleep(2)
driver.close()

# this is only needed if page waits too long, it closes the pop-up I've only seen occur once.
# driver.find_element_by_class_name("qual_x_svg_X").click()

## below are the search bar xpath locations
# kw = driver.find_element_by_xpath('//*[@id="KeywordSearch"]')    # searches key word
# loc = driver.find_element_by_xpath('//*[@id="LocationSearch"]')  # searches location

#### ... Now I try extracting information and processing it for dataframe building.

In [13]:
# we just rebuild the function with the new url and test
def job_track(keyword, location = None):
    ''' This function takes a job name and location, opens up glassdoor,
    and runs the search with your criteria. It does not return anything yet.'''
    # set base url
    url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist&locT=C&locId=1138213&jobType="
    # query chrome driver for selenium
    driver = webdriver.Chrome(
        executable_path="./chromedriver/macos/chromedriver")
    
    # execute driver on url
    driver.get(url)
    
    
    #locates the search field specifying 'Job Title, Keywords, or Company'
    kw = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[5]')
    kw.clear()
    kw.send_keys(keyword)
    sleep(1)
    kw.send_keys(u'\ue015') # down arrow
    sleep(1)
    kw.send_keys(u'\ue004') # tab to select closest match
    sleep(1)
    #locates the search field specifying 'Location'
    ## the current issue is that glassdoor defaults your location by mac/ip
    ## so we manually enter our location in the search field
    loc = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[6]')
    loc.clear()
    loc.send_keys(location)
    sleep(1)
    loc.send_keys(u'\ue015') # down arrow
    sleep(1)
    loc.send_keys(u'\ue004') # tab to select closest match
    sleep(1)
    
    # and we xpath the search button to click search
    click = driver.find_element_by_xpath('//*[@id="HeroSearchButton"]/span')
    click.click()
    sleep(.1)
    driver.close()



In [14]:
job_track("data scientist", 'bellevue')

#### ... now let's try extracting information from the new page
- note: we had to tell the function to update the new url once it opened up search results

In [21]:
# now we extract information, let's pull total listings
## confirmed working function!
def total_listings(keyword, location = None):
    ''' This function simply takes a job name and location, opens up glassdoor
    and returns the number of available jobs in that area'''
    # set base url
    url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist&locT=C&locId=1138213&jobType="
    # query chrome driver for selenium
    driver = webdriver.Chrome(
        executable_path="./chromedriver/macos/chromedriver")
    
    # execute driver on url
    driver.get(url)
    
    
    #locates the search field specifying 'Job Title, Keywords, or Company'
    kw = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[5]')
    kw.clear()
    kw.send_keys(keyword)   # enters keyword search (Job name preferably)
    sleep(1)
    kw.send_keys(u'\ue015') # down arrow
    sleep(1)
    kw.send_keys(u'\ue004') # tab to select closest match
    sleep(1)
    #locates the search field specifying 'Location'
    ## the current issue is that glassdoor defaults your location by mac/ip
    ## so we manually enter our location in the search field
    loc = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[6]')
    loc.clear()
    loc.send_keys(location)  # enters location details in location search
    sleep(1)
    loc.send_keys(u'\ue015') # down arrow
    sleep(1)
    loc.send_keys(u'\ue004') # tab to select closest match
    sleep(1)
    
    # and we xpath the search button to click search
    click = driver.find_element_by_xpath('//*[@id="HeroSearchButton"]/span') 
    click.click() # boop!
    
    url = driver.current_url  # since we opened a new page, we need to update our url reference
    driver.get(url)
    text = None  # returned as-is if the summary element cannot be found
    try:
        text = driver.find_element_by_xpath('//*[@id="MainColSummary"]/div/div/div[2]').text
    except Exception:
        print("Please check your search criteria")
    finally:
        # close the browser whether or not the lookup succeeded
        sleep(.1)
        driver.close()
    return text

In [22]:
total_listings('data scientist', 'Bellevue WA')

'441 Lead Data Scientist Jobs in Bellevue, WA'

In [23]:
# testing str.split for parsing results pulled
listing_text = '441 Lead Data Scientist Jobs in Bellevue, WA'  # renamed so it does not shadow the imported string module

In [24]:
listing_text.split(' ')[0]

'441'
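
Splitting on spaces only recovers the count. A regular expression can pull the count, title, city, and state out of the same summary string in one pass. This is a sketch that assumes the summary always reads `<count> <title> Jobs in <City>, <ST>`, which we have only confirmed for the example above; `parse_summary` is a hypothetical helper.

```python
import re

# count, job title, city, two-letter state
SUMMARY_PATTERN = re.compile(r'^([\d,]+)\s+(.+?)\s+Jobs in\s+(.+?),\s*([A-Z]{2})$')

def parse_summary(text):
    """Split a listing summary into its parts, or return None if the format differs."""
    match = SUMMARY_PATTERN.match(text.strip())
    if match is None:
        return None
    count, title, city, state = match.groups()
    return {
        'total': int(count.replace(',', '')),  # tolerate thousands separators
        'title': title,
        'city': city,
        'state': state,
    }

print(parse_summary('441 Lead Data Scientist Jobs in Bellevue, WA'))
```

A search that returns a summary in a different shape (for example a state-wide search with no city) would return `None` here, so the caller can fall back to the plain split.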

### Final Working Function
- Here we have a scraper that pulls job information based on:
<br>**inputs**
  - `keyword` : job title/name
  - `location` : US-based city or state
<br> Our function bypasses login and the geo-tracked auto-completion on searches by starting from the URL of a completed search. <br><br> Once Selenium opens the page, it visits each search bar, enters our inputs, then scrapes the resulting page into a simple dataframe.<br><br>If we choose, we can loop this function over each state in our data frame for a specific job, or even multiple jobs. *Granted*, this is a Selenium scrape, so it is limited by time frame and is better used for commenting on current status. <br><br><br>
##### Another note on `location`
Glassdoor does not allow searching on a keyword match for locations (as it does for jobs). Every search auto-completes to one of their location names. In the URL, the location is encoded as a unique ID, and the ID list is hidden for anti-scraping purposes. Our solution is simply to have Selenium use the arrow keys to select the first (and typically best/closest) match.
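
The key sequence described above can be sketched without opening a browser. The two codes are the ones our cells send (`'\ue015'` and `'\ue004'`); Selenium exposes the same values as `Keys.ARROW_DOWN` and `Keys.TAB` in `selenium.webdriver.common.keys`. `FakeInput` and `pick_first_suggestion` are hypothetical names for illustration, with `FakeInput` standing in for a real search-bar element.

```python
ARROW_DOWN = '\ue015'  # same value as selenium's Keys.ARROW_DOWN
TAB = '\ue004'         # same value as selenium's Keys.TAB

class FakeInput:
    """Hypothetical stand-in for a selenium element; it only records what it receives."""
    def __init__(self):
        self.calls = []

    def clear(self):
        self.calls.append('clear')

    def send_keys(self, value):
        self.calls.append(value)

def pick_first_suggestion(element, text, pause=lambda: None):
    """Type `text`, then arrow down and tab to accept the first autocomplete match."""
    element.clear()
    element.send_keys(text)
    pause()                        # give the suggestion menu time to open
    element.send_keys(ARROW_DOWN)
    pause()
    element.send_keys(TAB)

box = FakeInput()
pick_first_suggestion(box, 'Detroit')
print(box.calls)
```

With a real driver, `pause` could be something like `lambda: sleep(1)`, and the same helper would serve both the keyword and location search bars.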

In [27]:
## now we piece together the original function to return a data frame of our findings

def job_status(keyword, location = None):
    ''' This function simply takes a job name and location, opens up glassdoor
    and returns an informative dataframe regarding the job status and availability''' 
    # set base url
    url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=data+scientist&sc.keyword=data+scientist&locT=C&locId=1138213&jobType="
    
    # query chrome driver for selenium
    driver = webdriver.Chrome(
        executable_path="./chromedriver/macos/chromedriver")
    
    # execute driver on url
    driver.get(url)
    
    #locates the search field specifying 'Job Title, Keywords, or Company'
    kw = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[5]')  # locates keyword search bar
    
    kw.clear()              # clears the search bar
    
    kw.send_keys(keyword)   # enters our key word
    
    sleep(.1)
    
    kw.send_keys(u'\ue015') # down arrow to select closest match for search (more important for location)
    
    sleep(.1)
    
    kw.send_keys(u'\ue004') # tab to select closest match
    
    sleep(.1)
    
    #locates the search field specifying 'Location'
    ## the current issue is that glassdoor defaults your location by mac/ip
    ## so we manually enter our location in the search field when one is given
    loc = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[6]')
    if location:                 # location is optional; send_keys(None) would raise
        loc.clear()              # clears the search bar
        loc.send_keys(location)  # updates location
        sleep(.1)
        loc.send_keys(u'\ue015') # down arrow
        sleep(.1)
        loc.send_keys(u'\ue004') # tab to select closest match
    sleep(1)
    
    # and we xpath the search button to click search
    click = driver.find_element_by_xpath('//*[@id="HeroSearchButton"]/span')
    click.click() # boop!
    sleep(2)
    
  
    url = driver.current_url #since we opened a new page, we need to update our url reference
    driver.get(url)
    sleep(.1)
    
    #now we build our columns
    
    # empty list that we'll convert into a dataframe
    frame = []  
    
    # add city information
    try:
        city = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[6]').get_attribute("value").split(',')[0]
    except:
        city = 'Failed to obtain'
    
    
    # add state information
    try:
        state = driver.find_element_by_xpath('/html/body/header/div[3]/div[2]/form/input[6]').get_attribute("value").split(',')[1]
    except:
        state = 'Failed to obtain'
    
    
    # job name
    
    job_name = keyword
    
    # total job postings
    try:
        total_jobs = driver.find_element_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[2]/section/article/div[1]/div[1]/div/div/div[2]').text.split(' ')[0]
    except:                
        total_jobs = 'Failed to obtain'

    # average salary, replaces K with numerical thousand
    try:
        avg_salary = driver.find_element_by_xpath('//*[@id="filter_minSalary"]/span[1]').text.replace('K', ',000\$')
    except:
        avg_salary = 'Failed to obtain'
    
    # we need to click a drop down menu to collect job posting quantity details
    drop = driver.find_element_by_xpath('//*[@id="filter_fromAge"]/span[1]')
    drop.click()


    # postings in last 3 days    
    try:       
        last_3_days =  driver.find_element_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[1]/header/div[2]/ul/li[3]/span[1]/span').text.replace('(', '').replace(')', '')
    except:
        last_3_days = 'Failed to obtain'


    # postings in last week
    try:
        last_7_days =  driver.find_element_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[1]/header/div[2]/ul/li[4]/span[1]/span').text.replace('(', '').replace(')', '')
    except:
        last_7_days = 'Failed to obtain'


    # last month
    try:
        last_30_days =  driver.find_element_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[1]/header/div[2]/ul/li[6]/span[1]/span').text.replace('(', '').replace(')', '')
    except:
        last_30_days = 'Failed to obtain'
    
    frame.append({
        'City' : city,
        'State' : state,
        'Job' : keyword,
        'Total Listings' : total_jobs,
        'Average Salary (usd)' : avg_salary,
        'Posts (3 days)' : last_3_days,
        'Posts (7 days)' : last_7_days,
        'Posts (30 Days)' : last_30_days
    })
    driver.close()
    return pd.DataFrame(frame)

In [26]:
job_status('Electrician', 'Detroit')

Unnamed: 0,City,State,Job,Total Listings,Average Salary (usd),Posts (3 days),Posts (7 days),Posts (30 Days)
0,Detroit,MI,Electrician,109,"$16,000\$-$86,000\$",20,35,92
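
The salary column above is still a string. To model on it, the range has to become numbers. This sketch assumes the raw filter text looks like `$16K-$86K`, which is inferred from the `.replace('K', ...)` call in `job_status` rather than taken from a saved page; `parse_salary_range` is a hypothetical helper.

```python
import re

def parse_salary_range(text):
    """Convert a range like '$16K-$86K' into integer dollar amounts (low, high)."""
    amounts = [int(float(value) * 1000)
               for value in re.findall(r'\$(\d+(?:\.\d+)?)K', text)]
    if len(amounts) != 2:
        return None  # unexpected format, let the caller decide what to do
    return amounts[0], amounts[1]

low, high = parse_salary_range('$16K-$86K')
print(low, high, (low + high) / 2)  # 16000 86000 51000.0
```

The last number is only the midpoint of the displayed range, not a true average of the listed salaries.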


### Takeaways and conclusion
- *Medium scraper*: too specific, but a great source for studying XPath and interaction with Selenium
- *Glassdoor*: Plenty of obvious anti-scraping methods
 - Starting with the partnership required to receive developer API access to their data
 - You cannot search Glassdoor without creating an account or, if you bypass that via URL entry, you will be prompted to log in on every click
 - Geo-tagged IP addresses and location IDs were other forms of anti-scraping.

- `job_status(keyword, location= None)`
 - With this function, we can decide how we want to gather data to complement the dataset we are using, which is separate from our scrape.
 - Options: 
  - loop over each state for a job title and build a data frame
  - update the code to grab other information, and loop it through several pages
  - With enough time, we can attempt time-series inspections of job postings following disasters.
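
The first option could look roughly like the loop below. It is a sketch under assumptions: `collect_job_status` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for `job_status` so the cell runs without a browser.

```python
def collect_job_status(scrape, job_title, locations):
    """Call `scrape(job_title, location)` for each location and collect the results."""
    rows, failures = [], []
    for location in locations:
        try:
            rows.append(scrape(job_title, location))
        except Exception as error:  # keep going if one search fails
            failures.append((location, repr(error)))
    return rows, failures

# Hypothetical stand-in for job_status, with one location that fails on purpose
def fake_scrape(job_title, location):
    if location == 'Atlantis':
        raise ValueError('no such location')
    return {'State': location, 'Job': job_title, 'Total Listings': 'n/a'}

rows, failures = collect_job_status(fake_scrape, 'Electrician', ['NE', 'WA', 'Atlantis'])
print(len(rows), failures)
```

With the real `job_status`, each result is a one-row DataFrame, so the collected list could be combined with `pd.concat(rows, ignore_index=True)`. Recording failures separately keeps one bad search from ending a long run.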

#### Current status:
Our requirement of building a scrape tool that can report back current job status, salary, and listings is satisfied.
            

### Example Codes

- [Glassdoor scrape example - Selenium (git)](https://github.com/arapfaik/scraping-glassdoor-selenium) : github link to author of Medium article
- [Glassdoor scrape example - Selenium (Medium)](https://medium.com/@jamievaron/to-anyone-who-has-lost-themselves-9c5e3049cb13) : Link to Medium article covering Glassdoor and Selenium
- [Selenium Review SEA-FLEX-11](https://git.generalassemb.ly/charles-rice/SEA-Flex-11/tree/master/08_week/selenium-webscraping) : Selenium flex review lab completed with our instructor.
- These are reviewed heavily throughout the document.

In [None]:
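import time  # this cell calls time.sleep; the imports cell only provides sleep
from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException

def get_jobs(keyword, num_jobs, verbose):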
    
    '''Gathers jobs as a dataframe, scraped from Glassdoor'''
    
    #Initializing the webdriver
    options = webdriver.ChromeOptions()
    
    #Uncomment the line below if you'd like to scrape without a new Chrome window every time.
    #options.add_argument('headless')
    
    #Change the path to where chromedriver is in your home folder.
    driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver", options=options)
    driver.set_window_size(1120, 1000)

    url = 'https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + keyword + '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0'
    driver.get(url)
    jobs = []

    while len(jobs) < num_jobs:  #If true, should be still looking for new jobs.

        #Let the page load. Change this number based on your internet speed.
        #Or, wait until the webpage is loaded, instead of hardcoding it.
        time.sleep(4)

        #Test for the "Sign Up" prompt and get rid of it.
        try:
            driver.find_element_by_class_name("selected").click()
        except ElementClickInterceptedException:
            pass

        time.sleep(.1)

        try:
            driver.find_element_by_class_name("ModalStyle__xBtn___29PT9").click()  #clicking to the X.
        except NoSuchElementException:
            pass

        
        #Going through each job in this page
        job_buttons = driver.find_elements_by_class_name("jl")  #jl for Job Listing. These are the buttons we're going to click.
        for job_button in job_buttons:  

            print("Progress: {}".format("" + str(len(jobs)) + "/" + str(num_jobs)))
            if len(jobs) >= num_jobs:
                break

            job_button.click()  #You might 
            time.sleep(1)
            collected_successfully = False
            
            while not collected_successfully:
                try:
                    company_name = driver.find_element_by_xpath('.//div[@class="employerName"]').text
                    location = driver.find_element_by_xpath('.//div[@class="location"]').text
                    job_title = driver.find_element_by_xpath('.//div[contains(@class, "title")]').text
                    job_description = driver.find_element_by_xpath('.//div[@class="jobDescriptionContent desc"]').text
                    collected_successfully = True
                except:
                    time.sleep(5)

            try:
                salary_estimate = driver.find_element_by_xpath('.//span[@class="gray small salary"]').text
            except NoSuchElementException:
                salary_estimate = -1 #You need to set a "not found value. It's important."
            
            try:
                rating = driver.find_element_by_xpath('.//span[@class="rating"]').text
            except NoSuchElementException:
                rating = -1 #You need to set a "not found value. It's important."

            #Printing for debugging
            if verbose:
                print("Job Title: {}".format(job_title))
                print("Salary Estimate: {}".format(salary_estimate))
                print("Job Description: {}".format(job_description[:500]))
                print("Rating: {}".format(rating))
                print("Company Name: {}".format(company_name))
                print("Location: {}".format(location))

            #Going to the Company tab...
            #clicking on this:
            #<div class="tab" data-tab-type="overview"><span>Company</span></div>
            try:
                driver.find_element_by_xpath('.//div[@class="tab" and @data-tab-type="overview"]').click()

                try:
                    #<div class="infoEntity">
                    #    <label>Headquarters</label>
                    #    <span class="value">San Francisco, CA</span>
                    #</div>
                    headquarters = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Headquarters"]//following-sibling::*').text
                except NoSuchElementException:
                    headquarters = -1

                try:
                    size = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Size"]//following-sibling::*').text
                except NoSuchElementException:
                    size = -1

                try:
                    founded = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Founded"]//following-sibling::*').text
                except NoSuchElementException:
                    founded = -1

                try:
                    type_of_ownership = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Type"]//following-sibling::*').text
                except NoSuchElementException:
                    type_of_ownership = -1

                try:
                    industry = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Industry"]//following-sibling::*').text
                except NoSuchElementException:
                    industry = -1

                try:
                    sector = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Sector"]//following-sibling::*').text
                except NoSuchElementException:
                    sector = -1

                try:
                    revenue = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Revenue"]//following-sibling::*').text
                except NoSuchElementException:
                    revenue = -1

                try:
                    competitors = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Competitors"]//following-sibling::*').text
                except NoSuchElementException:
                    competitors = -1

            except NoSuchElementException:  #Rarely, some job postings do not have the "Company" tab.
                headquarters = -1
                size = -1
                founded = -1
                type_of_ownership = -1
                industry = -1
                sector = -1
                revenue = -1
                competitors = -1

                
            if verbose:
                print("Headquarters: {}".format(headquarters))
                print("Size: {}".format(size))
                print("Founded: {}".format(founded))
                print("Type of Ownership: {}".format(type_of_ownership))
                print("Industry: {}".format(industry))
                print("Sector: {}".format(sector))
                print("Revenue: {}".format(revenue))
                print("Competitors: {}".format(competitors))
                print("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")

            jobs.append({"Job Title" : job_title,
            "Salary Estimate" : salary_estimate,
            "Job Description" : job_description,
            "Rating" : rating,
            "Company Name" : company_name,
            "Location" : location,
            "Headquarters" : headquarters,
            "Size" : size,
            "Founded" : founded,
            "Type of ownership" : type_of_ownership,
            "Industry" : industry,
            "Sector" : sector,
            "Revenue" : revenue,
            "Competitors" : competitors})
            #add job to jobs

        #Clicking on the "next page" button
        try:
            driver.find_element_by_xpath('.//li[@class="next"]//a').click()
        except NoSuchElementException:
            print("Scraping terminated before reaching target number of jobs. Needed {}, got {}.".format(num_jobs, len(jobs)))
            break

    return pd.DataFrame(jobs)  #This line converts the dictionary object into a pandas DataFrame.

- From SEA-FLEX-11 review with Charlie Rice

In [None]:
df = pd.DataFrame(columns=['name', 'location', 'price', 'cuisine','rating','reviews'])

# one big for loop!
for row in soup.find_all('div', {'class': 'rest-row-info'}):
    name = row.find('span', {'class':'rest-row-name-text'}).text
    loc = row.find('span',{'class':'rest-row-meta--location rest-row-meta-text sfx1388addContent'}).text
    price = int(row.find('i', {'class':'pricing--the-price'}).text.count('$'))
    cuisine = row.find('span', {'class':'rest-row-meta--cuisine rest-row-meta-text sfx1388addContent'}).text
    try:
        rating = row.find('div',{'class':'star-rating-score'}).attrs['aria-label'].rsplit('s')[0].strip()
    except:
        rating = 0
    try:
        reviews = row.find('a',{'class':'review-link'}).find('span').text.strip('()')
    except:
        reviews = 0
    df.loc[len(df)] = [name, loc, price, cuisine, rating, reviews]
    

df.head()

In [None]:
# URL template from the Medium example; `keyword` is the search term passed to get_jobs
('https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' +
 keyword +
 '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0')
