# Job Search Scraper

Job search sites such as Glassdoor and Indeed are very useful thanks to their filters and job alerts, but they don't offer an advanced filter for education level or years of experience. For example, the "entry level" filter often returns senior positions or postings that ask for 5+ years of experience. Scanning each job description by hand to narrow the results down to those that fit my criteria is very time-consuming, which sparked the idea for this project!

Because of the aforementioned restrictions, I want to personalize and automate the job search by scraping sites for job postings, narrowing down results based on my criteria, and updating a CSV file of relevant postings.

(https://github.com/umangkshah/job-scraping-python/blob/master/job_scraper.ipynb)

# Scraping Logic

1. Before scraping anything, it is good practice to check whether the site offers an API for fetching the data. If it doesn't, or if the API doesn't return the information we want, web scraping is the alternative.
    - API
        - Advantages:
            - Much more stable process for retrieving info
            - Strictly regulated syntax (JSON or XML rather than HTML)
        - Disadvantages:
            - Query limitations
            - Less customizable because it is governed by the API's rules
            - The API can disappear
    - Web scraping
        - Advantages:
            - Inexpensive
            - Easy to implement
            - Low maintenance
            - Accurate
        - Disadvantages:
            - Less stable because it relies on HTML/CSS fields to capture data
            - Will break if front-end labels are changed
            - Slower than API calls

If an API is not available or manual scraping is needed:

1. Construct the URL for the search results on the job search sites (Indeed, LinkedIn, Glassdoor).
2. For Glassdoor, we'll use Selenium because it drives a real Chrome browser, which ensures that our code receives the same content we see in the browser even when content is purposely hidden from scrapers.
3. Record a brief description of each post and its relative link.
4. Browse through the list of links and retrieve each post's description.
5. Steps 3 and 4 are both handled by a self-defined `glassdoorScrape()` function: pass `get_short=True` for short descriptions and `get_short=False` for the complete post.

(https://github.com/nycdatasci/bootcamp006_project/blob/master/Project3-WebScraping/DiegoDeLazzari/Presentation2.ipynb)
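
Step 1 above, constructing the search URL, can be sketched with the standard library. This is only an illustration, not any site's documented URL scheme: the base URL `https://example.com/jobs` and the parameter names `q`, `l`, and `page` are placeholders, and each real site uses its own format.

```python
from urllib.parse import urlencode


def build_search_url(base_url, keywords, city, page=1):
    """Return a search-results URL with the query string safely encoded.

    base_url and the parameter names (q, l, page) are hypothetical; replace
    them with whatever the target site actually uses.
    """
    params = {'q': keywords, 'l': city, 'page': page}
    return '{0}?{1}'.format(base_url, urlencode(params))


print(build_search_url('https://example.com/jobs', 'data science', 'San Mateo, CA'))
# https://example.com/jobs?q=data+science&l=San+Mateo%2C+CA&page=1
```

Letting `urlencode` handle the spaces and commas avoids hand-building query strings that break on city names like "San Mateo, CA".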

# Web Scraping Indeed to Retrieve Job Search Data

In [89]:
from selenium import webdriver
from selenium.webdriver.common import action_chains, keys, alert
from selenium.webdriver.common.alert import Alert # Needed for the Alert(browser) call below
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options

from time import sleep # To prevent overwhelming the server between connections
from collections import Counter
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import sys
import re
import csv

from bs4 import BeautifulSoup
import urllib as ul
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
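
The `sleep` calls throughout this notebook use hand-picked durations. One way to vary them, sketched here as an assumption rather than something the notebook does, is a small helper that sleeps for a random interval. The name `polite_pause` and its default bounds are illustrative.

```python
import random
from time import sleep


def polite_pause(low=2.0, high=5.0, sleeper=sleep):
    """Sleep for a random duration between low and high seconds.

    Returns the chosen duration so callers can log it. The sleeper argument
    exists only so the helper can be exercised without actually waiting.
    """
    duration = random.uniform(low, high)
    sleeper(duration)
    return duration
```

A call such as `polite_pause(2, 4)` could then stand in for a fixed `sleep(3)` between page loads.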

## Personalize the Filters

**Let's create a list of words to avoid or include.**

In [2]:
red_flags = ['senior', 'sr.', 'staff', 'manager', 'director', 'lead', 'head', 'principal', 'photography', 'journalist',
             'frontend', 'backend', 'fullstack', 'front-end', 'back-end', 'full-stack', 'front end', 'back end', 
             'full stack', 'bio', 'recruiter', 'sourcer', 'phd', 'master', 'mba', 'lecturer', 'tutor', 'instructor']

**Let's write a function that determines whether or not to check a job posting based on whether the title contains red flag words.**

In [3]:
def title_qualifies(title):
    title = title.lower()
    for word in red_flags:
        if word in title:
            return False
    return True

title_qualifies('Director of Data Science')

False
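
Because `title_qualifies` uses substring matching, a short red flag such as 'head' also rejects a title like 'Analyst, Headquarters Team'. If that turns out to be too aggressive, one alternative is whole-word matching. The sketch below is an assumption about how that could look, using a shortened copy of the list and a hypothetical function name. The trade-off is that a whole-word 'bio' would no longer catch 'bioinformatics'.

```python
import re

# A shortened copy of the red-flag list, for illustration only.
RED_FLAGS = ['senior', 'sr.', 'lead', 'head', 'director']

# (?<!\w) and (?!\w) act like word boundaries but also work for 'sr.',
# which ends in a non-word character.
RED_FLAG_PATTERN = re.compile(
    r'(?<!\w)(?:' + '|'.join(re.escape(word) for word in RED_FLAGS) + r')(?!\w)',
    re.IGNORECASE,
)


def title_qualifies_whole_word(title):
    """Return False if the title contains a red flag as a whole word."""
    return RED_FLAG_PATTERN.search(title) is None


print(title_qualifies_whole_word('Director of Data Science'))    # False
print(title_qualifies_whole_word('Sr. Data Analyst'))            # False
print(title_qualifies_whole_word('Analyst, Headquarters Team'))  # True
```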

**Next, let's define the Regex to personalize filters for:**
1. **Years of experience: no more than two years of experience required**
2. **Education level: Bachelor or BA or BS**

In [4]:
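# Should not have 3 or more years of experience
# (raw string so that \s reaches the regex engine unchanged)
# Note: the leading digit must be 3-9, so "10 years" or "12 years" is not caught
yr_exper = re.compile(r'[3-9]\s*\+?-?\s*[2-9]?\s*[Yy]e?a?[Rr][Ss]?')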

# Should not have master's requirement
# (dots are escaped so they match a literal "." rather than any character)
masters1 = re.compile(r"[Mm]aster's required")
masters2 = re.compile(r'[Mm][Ss] required')
masters3 = re.compile(r'[Mm]\.[Ss]\. required')
masters4 = re.compile(r'[Mm][Bb][Aa] required')
masters5 = re.compile(r'[Mm]\.[Bb]\.[Aa]\. required')

# Should not have PHD requirement
phd1 = re.compile(r'[Pp][Hh][Dd] required')
phd2 = re.compile(r'[Pp][Hh]\.[Dd]\. required')
phd3 = re.compile(r'[Pp]\.[Hh]\.[Dd]\. required')

# Should not have any advanced degree requirement
adv_deg = re.compile('[Aa]dvanced degree required')

print(yr_exper.search('2 years of experience'))
print(yr_exper.search('2+ years of experience'))
print(yr_exper.search('2-4 years of experience'))
print(masters1.search("master's required"))
print(masters1.search('bachelor'))
print(masters3.search('M.s. required'))
print(phd1.search('PHd required'))
print(phd2.search('PH.d. required'))
print(adv_deg.search('Advanced degree required'))

None
None
<_sre.SRE_Match object at 0x1a155b3ac0>
<_sre.SRE_Match object at 0x1a155b3ac0>
None
<_sre.SRE_Match object at 0x1a155b3ac0>
<_sre.SRE_Match object at 0x1a155b3ac0>
<_sre.SRE_Match object at 0x1a155b3ac0>
<_sre.SRE_Match object at 0x1a155b3ac0>
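
The separate patterns above can also be folded into a single check. The sketch below is one possible consolidation, not the code used in this notebook: the names `YEARS`, `ADVANCED_DEGREE`, and `description_qualifies` are made up for illustration, and unlike `yr_exper` it reads the numbers as integers so that two-digit requirements such as "10+ years" are also caught.

```python
import re

# Numbers attached to "years"/"yrs", e.g. "2 years", "2+ years", "2-4 years".
YEARS = re.compile(
    r'(?<!\d)(\d{1,2})\s*\+?\s*(?:(?:-|to)\s*(\d{1,2}))?\s*\+?\s*(?:years?|yrs?)',
    re.IGNORECASE,
)

# "<degree> required" for the advanced degrees filtered out above.
ADVANCED_DEGREE = re.compile(
    r"\b(?:master's|m\.?s\.?|m\.?b\.?a\.?|p\.?h\.?d\.?|advanced degree)\s+required",
    re.IGNORECASE,
)


def description_qualifies(text, max_years=2):
    """Return True if text asks for at most max_years and no advanced degree."""
    for match in YEARS.finditer(text):
        # A range such as "2-4 years" yields two numbers; judge by the larger.
        numbers = [int(n) for n in match.groups() if n is not None]
        if max(numbers) > max_years:
            return False
    return ADVANCED_DEGREE.search(text) is None


print(description_qualifies('2+ years of experience'))   # True
print(description_qualifies('2-4 years of experience'))  # False
print(description_qualifies('10+ years of experience'))  # False
print(description_qualifies('PH.d. required'))           # False
```

Judging a range by its upper bound mirrors the behavior shown above, where "2-4 years of experience" is treated as a match to exclude.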


## Create a Web Scraper

**The code below tests how to dismiss the popups on Glassdoor. I still haven't cracked this problem, so for now I need to manually click the "X" button on two popups for my code to run through all 30 pages of search results. The run is limited to 30 pages, or roughly 900 jobs, because Glassdoor doesn't return jobs past the 30th page.**

In [107]:
path_name = '/Users/ngapuileung/Desktop/DS/chromedriver'

chrome_options = Options()
chrome_options.add_argument('--disable-extensions')
chrome_options.add_argument('--profile-directory=Default')
chrome_options.add_argument('--incognito')
chrome_options.add_argument('--disable-plugins-discovery')
chrome_options.add_argument('--start-maximized')

browser = webdriver.Chrome(path_name, chrome_options=chrome_options)

browser.implicitly_wait(8)

# browser.get() navigates to the page at the given URL; the sleep() that follows pauses for 3 seconds
browser.get('https://www.glassdoor.com/Job/san-mateo-analyst-jobs-SRCH_IL.0,9_IC1147406_KO10,17_IP2.htm?minSalary=120000&jobType=fulltime')
sleep(3)

while True:
    try:
        sleep(2.1)
        browser.find_element_by_css_selector('.next').click()

        Alert(browser).dismiss()
        
        #xbutton = browser.find_element_by_class_name('xBtn')
        #xbutton = browser.find_element_by_class_name('prettyEmail modalContents')
        #xbutton = browser.find_element_by_class_name(' prettyEmail modalContents ')
        #xbutton.click()
        
        #button = browser.find_element_by_xpath("//div[contains(@class, 'xBtn')]")
        #button = browser.find_element_by_xpath("//div[@class='xBtn']")
        #button = browser.find_element_by_xpath("//div[@class=' prettyEmail modalContents ']/div[@class='xBtn']")
        #button.click()
        
        print('Got rid of the alert!')
        sleep(3.3)
    except:
        print("Darn! Couldn't find it :(")
        break

#browser.quit()



Darn! Couldn't find it :(
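
One thing worth noting about the attempt above: `Alert(browser).dismiss()` only handles native JavaScript alert dialogs, so a popup that is part of the page's HTML has to be closed by clicking its own close element. The sketch below shows one possible approach under assumptions. The `.xBtn` selector is taken from the commented-out attempts above and may not match the live page, the helper name `close_popups` is illustrative, and it uses Selenium's generic `find_elements(by, value)` form instead of the `find_element_by_*` shortcuts.

```python
CSS_SELECTOR = 'css selector'  # same string as selenium's By.CSS_SELECTOR


def close_popups(browser, selector='.xBtn'):
    """Click every visible element matching selector; return how many were clicked.

    find_elements returns an empty list when nothing matches, so no exception
    handling is needed for the "no popup" case.
    """
    clicked = 0
    for element in browser.find_elements(CSS_SELECTOR, selector):
        if element.is_displayed():
            element.click()
            clicked += 1
    return clicked
```

Calling `close_popups(browser)` just before clicking the `.next` button would be the natural place to try it. I haven't verified that this clears both Glassdoor popups.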


**Let's create a function `scrape_glassdoor` that scrapes Glassdoor jobs by using Selenium.**

In [105]:
def scrape_glassdoor(keyword_search, city_search, path_name):
    """
    Function that scrapes Glassdoor jobs and outputs a CSV file of the relevant job posting
    information when given the keyword(s), city, and path name of the chrome driver.
    """
    # Specifying incognito mode as you launch your browser
    chrome_options = Options()
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--profile-directory=Default')
    chrome_options.add_argument('--incognito')
    chrome_options.add_argument('--disable-plugins-discovery')
    chrome_options.add_argument('--start-maximized')

    # Create new Instance of Chrome in incognito mode
    browser = webdriver.Chrome(path_name, chrome_options=chrome_options)

    browser.implicitly_wait(10)

    # browser.get() navigates to the page at the given URL; the sleep() that follows pauses for 5 seconds
    browser.get('https://www.glassdoor.com/index.htm')

    sleep(5)

    # Search for bar to enter job title, keywords, or company
    job = browser.find_element_by_id('KeywordSearch')

    # Search for bar to enter location
    location = browser.find_element_by_id('LocationSearch')

    # Clear pre-populated location entry
    location.clear()

    sleep(3)

    # Type in job name in search
    job.send_keys(keyword_search)

    sleep(2)

    # Type in location in search
    location.send_keys(city_search)

    sleep(2)

    # Click the search button
    browser.find_element_by_class_name('gd-btn-mkt').click()

    sleep(8.2)

    # Print how many jobs resulted from the search
    num_jobs = browser.find_elements_by_xpath("//p[@class='jobsCount']")
    sleep(2)
    num_jobs = [x.text for x in num_jobs]
    num_jobs = num_jobs[0]
    num_jobs = int(''.join(c for c in num_jobs if c.isdigit()))
    print('There are {0} {1} jobs in {2} on Glassdoor.'.format(num_jobs, keyword_search, city_search))

    sleep(3)

    # Create an empty list to store our relevant job listings
    all_job_listings = []

    # Loop through each page and stop when there is no more "Next" button
    while True:
        try:
            # Get all job postings on page
            job_postings = browser.find_elements_by_xpath('//ul[@class="jlGrid hover"]//descendant::li')
            job_post = [x.text for x in job_postings]

            sleep(2.8)

            # Get all job posting URLs on page
            job_urls = browser.find_elements_by_xpath('//div[@class="logoWrap"]/a')
            sleep(3.6)
            urls = []
            for job in job_urls:
                url = job.get_attribute('href')
                sleep(5.4)
                urls.append(url)

            sleep(3.2)

            # Perform preliminary cleaning to split each part of the string into elements of a list
            job_post = [x.encode('utf-8') for x in job_post]
            job_post = [x.replace(' \xe2\x80\x93 ', '\n') for x in job_post]
            job_post = [x.splitlines() for x in job_post]
            replace_list = ['Hot', 'New', 'Today']
            remove_list = ['day ago', 'days ago', ' Logo', 'no.logo.alt', 'Top Company', 'EASY APPLY', "We're Hiring"]

            sleep(4.1)
            
            # Standardize the elements within each job post in the order of job title, company, location, and salary
            ratings = [str(rating) for rating in np.arange(0.0, 5.1, 0.1)]
            new_job_post = []
            for job in job_post:
                # Replace specific words with empty string to prevent some jobs from being removed entirely
                for word in replace_list:
                    job = [x.replace(word, '') for x in job]
                # Remove elements in each job post containing "days ago" or any other remove_list marker
                job = [x for x in job if not any(marker in x for marker in remove_list)]
                # Remove all empty elements
                job = [x for x in job if len(x) > 0]
                # Remove ratings
                job = [x for x in job if not any(x in rating for rating in ratings)]

                # Fill in with empty string if job post is missing a salary
                if len(job) == 3:
                    job.append('')
                elif len(job) == 5:
                    job.pop(1)

                sleep(3.6)
                
                # Print a "STOP" message and stop if a job does not have exactly 4 fields (job title, company, location, and salary)
                if len(job) != 4:
                    print('STOP! Format of job information is off!')
                    break

                new_job_post.append(job)

            sleep(1.8)
            
            # Add url of each job posting as the last element in each list
            # (zip stops at the shorter list, so a skipped job cannot raise an IndexError)
            for job, url in zip(new_job_post, urls):
                job.append(url)

            sleep(2)
            
            for job in new_job_post:
                all_job_listings.append(job)

            browser.find_element_by_css_selector('.next').click()
            
            sleep(3.3)
        # Any failure on the page, including a missing "Next" button, ends pagination
        except Exception:
            print('No more pages!')
            break

    sleep(4)
    
    job_listings = []
    # Loop through all_job_listings to create a new list job_listings of only the matched jobs
    for job in all_job_listings:
        # Extract the job title info
        jt = job[0]
        # Extract the company info
        comp = job[1]
        # Extract the location info
        location = job[2]
        # Extract the salary info
        salary = job[3]
        # Extract the URL info
        url = job[4]

        # If the job title qualifies our criteria, create a BeautifulSoup string object from the HTML
        if title_qualifies(jt):
            sleep(2.1)
            try:
                sleep(4.4)
                browser.get(url)
                sleep(3.6)
                html = browser.page_source
                sleep(2.3)
                soup = str(BeautifulSoup(html, 'html.parser').encode('utf-8'))
                sleep(3)
            except Exception:
                # Skip this posting if its page could not be loaded
                continue

            # Search through the job posting to sort jobs that fit the level of education and years of experience criteria
            a = yr_exper.search(soup)
            b = masters1.search(soup)
            c = masters2.search(soup)
            d = masters3.search(soup)
            e = masters4.search(soup)
            f = masters5.search(soup)
            g = phd1.search(soup)
            h = phd2.search(soup)
            i = phd3.search(soup)
            j = adv_deg.search(soup)

            # Append all jobs that fulfill criteria to the job_listings list
            if not any([a, b, c, d, e, f, g, h, i, j]):
                jobs = {
                        'Company': comp,
                        'Job Title': jt,
                        'Location': location,
                        'Salary': salary,
                        'URL': url
                        }
                job_listings.append(jobs)

    # Export as a csv file
    with open('glassdoor_job_results.csv', 'wb') as csvfile:
        fieldnames = ['Company', 'Job Title', 'Location', 'Salary', 'URL']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        if job_listings:
            for job in job_listings:
                writer.writerow(job)
            print('Done!')
        else:
            print('No matches for {0} jobs on Glassdoor.'.format(keyword_search))

    # Exit out of Chrome        
    browser.quit()
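
The card-cleaning step inside `scrape_glassdoor` is easier to test when pulled out into a pure function. The sketch below is a simplified variant, not a drop-in replacement: the function name `clean_card` and the sample layout are assumptions, it drops 'Hot', 'New', and 'Today' only when they appear as standalone lines (so 'New York' is left intact), and it does not handle the five-field case.

```python
import re

# Markers copied from replace_list and remove_list above; the card layout
# itself is an assumption.
BADGES = ('Hot', 'New', 'Today')
NOISE = ('day ago', 'days ago', ' Logo', 'no.logo.alt', 'Top Company',
         'EASY APPLY', "We're Hiring")
RATING = re.compile(r'[0-5]\.\d')


def clean_card(text):
    """Split one job card's text into [title, company, location, salary]."""
    fields = []
    for line in text.replace(' \u2013 ', '\n').splitlines():
        line = line.strip()
        if line in BADGES or RATING.fullmatch(line):
            continue  # drop standalone badges and ratings such as "3.9"
        if not line or any(marker in line for marker in NOISE):
            continue  # drop blanks, "5 days ago", logo alt text, and so on
        fields.append(line)
    if len(fields) == 3:
        fields.append('')  # no salary listed
    if len(fields) != 4:
        raise ValueError('unexpected card layout: {0!r}'.format(fields))
    return fields


# Made-up sample card text, for illustration only.
print(clean_card('3.9\nData Analyst\nAcme Corp \u2013 San Mateo, CA\n$60k-$80k\nNew\n5 days ago'))
# ['Data Analyst', 'Acme Corp', 'San Mateo, CA', '$60k-$80k']
```

Raising an error for an unexpected layout, instead of printing and breaking out of the loop, makes it obvious which card caused the problem.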

In [106]:
scrape_glassdoor('data science', 'San Mateo, CA', '/Users/ngapuileung/Desktop/DS/chromedriver')



There are 5023 data science jobs in San Mateo, CA on Glassdoor.
No more pages!


KeyboardInterrupt: 
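
A note on the export step: `scrape_glassdoor` opens the CSV file in `'wb'` mode, which is how the `csv` module expects files to be opened on Python 2. On Python 3 the module expects a text-mode file opened with `newline=''`. Below is a minimal sketch of the same export for Python 3, with a hypothetical helper name `write_jobs_csv`.

```python
import csv

FIELDNAMES = ['Company', 'Job Title', 'Location', 'Salary', 'URL']


def write_jobs_csv(job_listings, path='glassdoor_job_results.csv'):
    """Write a list of job dicts to path and return the number of rows written."""
    with open(path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(job_listings)
    return len(job_listings)
```

Opening the file as UTF-8 text also means the job fields can stay as ordinary strings instead of being encoded by hand first.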

**Let's remove duplicates in the csv file if applicable.**

In [None]:
import pandas as pd
glassdoor_search_results = pd.read_csv('glassdoor_job_results.csv', header=0)
glassdoor_search_results = glassdoor_search_results.drop_duplicates()
# index=False keeps pandas from writing its row index as an extra column
glassdoor_search_results.to_csv('glassdoor_job_results.csv', index=False)
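
`drop_duplicates()` with no arguments only removes rows that are identical in every column, so the same posting reached through two different URLs would survive. Whether that actually happens in this data is an assumption on my part. If it does, one option is to compare only the identifying columns. In pandas that is `drop_duplicates(subset=[...])`, and the sketch below shows the same idea on plain dictionaries with a hypothetical helper name.

```python
def drop_duplicate_jobs(rows, key_fields=('Company', 'Job Title', 'Location')):
    """Return rows with later repeats of the same key removed, order preserved."""
    seen = set()
    unique = []
    for row in rows:
        # Compare case-insensitively and ignore surrounding whitespace.
        key = tuple(row[field].strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```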