# Job Search Scraper

Job search sites such as Glassdoor and Indeed are very useful with their filters and job alerts, but they lack advanced filters for education level and years of experience. For example, the "entry level" filter often returns senior positions or postings that ask for 5+ years of experience. Scanning each job description by hand to narrow the results down to those that fit my criteria is very time-consuming, which sparked the idea for this project!

Because of the aforementioned restrictions, I want to personalize and automate the job search by scraping sites for job postings, narrowing down results based on my criteria, and updating a CSV file of relevant postings.

(https://github.com/umangkshah/job-scraping-python/blob/master/job_scraper.ipynb)

# Scraping Logic

1. First, before scraping anything, it is good practice to check whether there is an API to fetch the data. If there isn't, web scraping is the alternative. Web scraping can also be used when the API does not return the information that we want.
    - API
        - Advantages:
            - Much more stable process for retrieving info
            - Extremely regulated syntax (JSON or XML rather than HTML)
        - Disadvantages:
            - Query limitations
            - Less customizable because governed by API regulations
            - API can disappear
    - Web scraping
        - Advantages:
            - Inexpensive
            - Easy to implement
            - Low maintenance
            - Accurate
        - Disadvantages:
            - Less stable because it relies on HTML/CSS fields to capture data
            - Breaks if front-end labels are changed
            - Slower than API calls

If an API is not available:

2. Construct the URL for the search results from the job search sites (Indeed, LinkedIn, Glassdoor).
3. Download the HTML of the search result page using Python Requests.
4. Parse the page using LXML: LXML lets you navigate the HTML tree structure using XPaths.
5. Save the data to a CSV file.

(https://www.scrapehero.com/how-to-scrape-job-listings-from-glassdoor-using-python-and-lxml/)
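
The parse-and-save steps can be sketched on a small inline snippet, written for Python 3. This is only an illustration under stated assumptions: the sample HTML, its class names (`result`, `title`, `company`), and the helpers `parse_results` and `to_csv` are made up, and the standard library's `xml.etree.ElementTree` stands in for LXML so the sketch has no third-party dependency. ElementTree only handles well-formed markup and a subset of XPath, which is one reason real scrapers reach for LXML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a downloaded search results page
SAMPLE_PAGE = """
<html><body>
  <div class="result"><a class="title" href="/job/1">Data Analyst</a><span class="company">Acme</span></div>
  <div class="result"><a class="title" href="/job/2">Data Scientist</a><span class="company">Globex</span></div>
</body></html>
"""

def parse_results(page):
    """Walk the tree with XPath-style queries and return one dict per posting."""
    root = ET.fromstring(page.strip())
    rows = []
    for post in root.findall(".//div[@class='result']"):
        link = post.find("./a[@class='title']")
        company = post.find("./span[@class='company']")
        rows.append({
            'Job Title': link.text,
            'Company': company.text,
            'URL': link.get('href'),
        })
    return rows

def to_csv(rows):
    """Serialize the rows to CSV text (a real run would write to a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['Job Title', 'Company', 'URL'])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = parse_results(SAMPLE_PAGE)
print(to_csv(rows))
```

Keeping parsing and saving in separate functions means the selectors can change when the site's markup changes without touching the CSV code.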

# Web Scraping Indeed to Retrieve Job Search Data

!pip install beautifulsoup4

In [1]:
from lxml import html, etree
import re
import os
import sys
import unicodecsv as csv
import argparse
import json

from bs4 import BeautifulSoup
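try:
    import urllib.request as ul  # Python 3: urlopen lives in urllib.request
except ImportError:
    import urllib as ul  # Python 2: urlopen lives in urllib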
import requests
from requests import get
from requests.exceptions import RequestException
from contextlib import closing

## Personalize the Filters

**Let's create a list of words to avoid or include.**

In [2]:
red_flags = ['senior', 'sr.', 'staff', 'manager', 'director', 'lead', 'head', 'principal']

**Let's write a function that decides whether to check a job posting, based on whether its title contains a red-flag word.**

In [3]:
def title_qualifies(title):
    title = title.lower()
    for word in red_flags:
        if word in title:
            return False
    return True

title_qualifies('Director of Data Science')

False
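
One thing to keep in mind is that `word in title` is a substring test, so `'lead'` also rejects a title containing "Leadership" and `'head'` rejects "Headquarters". A possible whole-word variant is sketched below. The name `title_qualifies_strict` and the pattern are illustrative assumptions, not part of the original notebook.

```python
import re

red_flags = ['senior', 'sr.', 'staff', 'manager', 'director', 'lead', 'head', 'principal']

# Match a red-flag word only when it starts at a word boundary and is not
# followed by another word character, so 'lead' does not match 'leadership'
red_flag_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in red_flags) + r')(?!\w)',
    re.IGNORECASE,
)

def title_qualifies_strict(title):
    """Hypothetical variant of title_qualifies that matches whole words only."""
    return red_flag_pattern.search(title) is None

print(title_qualifies_strict('Director of Data Science'))             # False
print(title_qualifies_strict('Data Analyst, Leadership Development'))  # True
```

The trade-off is that whole-word matching no longer catches variants such as "Managers", so the list of red flags may need a few extra entries.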

**Next, let's define the Regex to personalize filters for:**
1. **Years of experience: no more than two years of experience required**
2. **Education level: Bachelor or BA or BS**

In [4]:
# Should not have 3 or more years of experience
# (matches a digit 3-9 right before "years", so e.g. "10 years" slips through)
yr_exper = re.compile(r'[3-9]\s*\+?-?\s*[2-9]?\s*[Yy]e?a?[Rr][Ss]?')

# Should not have master's requirement
masters1 = re.compile("[Mm]aster's required")
# \b keeps words that merely end in "ms" (e.g. "systems required") from matching
masters2 = re.compile(r'\b[Mm][Ss] required')
masters3 = re.compile(r'[Mm]\.[Ss]\. required')
masters4 = re.compile('[Mm][Bb][Aa] required')
masters5 = re.compile(r'[Mm]\.[Bb]\.[Aa]\. required')

# Should not have PhD requirement
phd1 = re.compile('[Pp][Hh][Dd] required')
phd2 = re.compile(r'[Pp][Hh]\.[Dd]\. required')
phd3 = re.compile(r'[Pp]\.[Hh]\.[Dd]\. required')

# Should not have any advanced degree requirement
adv_deg = re.compile('[Aa]dvanced degree required')

print(yr_exper.search('2 years of experience'))
print(yr_exper.search('2+ years of experience'))
print(yr_exper.search('2-4 years of experience'))
print(masters1.search("master's required"))
print(masters1.search('bachelor'))
print(masters3.search('M.s. required'))
print(phd1.search('PHd required'))
print(phd2.search('PH.d. required'))
print(adv_deg.search('Advanced degree required'))

None
None
<_sre.SRE_Match object at 0x10bbefd30>
<_sre.SRE_Match object at 0x10bbefd30>
None
<_sre.SRE_Match object at 0x10bbefd30>
<_sre.SRE_Match object at 0x10bbefd30>
<_sre.SRE_Match object at 0x10bbefd30>
<_sre.SRE_Match object at 0x10bbefd30>
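
The years-of-experience pattern only looks for a single digit from 3 to 9, so a posting asking for "10 years" is not flagged. A possible alternative, sketched below for Python 3, captures the stated minimum as a number and compares it with a threshold. The names `years_pattern` and `asks_too_much_experience` are hypothetical, and reading "2-4 years" as a minimum of 2 (and therefore acceptable) is an assumption that differs from the pattern above, which flags it.

```python
import re

# The pattern from above: a single digit 3-9 followed by some spelling of "years"
yr_exper = re.compile(r'[3-9]\s*\+?-?\s*[2-9]?\s*[Yy]e?a?[Rr][Ss]?')
print(yr_exper.search('10 years of experience'))  # None: neither '1' nor '0' is in [3-9]

# Hypothetical alternative: capture the lower bound as a number and compare it
years_pattern = re.compile(
    r'(?<!\d)(\d{1,2})\s*(?:\+|(?:-|to)\s*\d{1,2})?\s*(?:years?|yrs?)',
    re.IGNORECASE,
)

def asks_too_much_experience(text, max_years=2):
    """Return True if any stated minimum exceeds max_years."""
    return any(int(m.group(1)) > max_years for m in years_pattern.finditer(text))

print(asks_too_much_experience('10+ years of experience'))  # True
print(asks_too_much_experience('2-4 years of experience'))  # False
```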


## Create a Web Scraper

**Looking at the URL bar in the browser, we can extract the base URL. The page number (here, the `start` offset) usually comes at the end of the URL to indicate which page of the search results you are on.**

In [6]:
indeed_base_url = 'https://www.indeed.com/jobs?q=data+(science+or+analysis+or+python+or+machine+or+learning+or+statistics)+-senior,+-manager,+-staff,+-head,+-director,+-sr.,+-principal,+-lead,+-JAVA,+-CSS,+-C%2B%2B,+-C%2B,+-HTML,+-full+-stack,+-front+-end,+-back+-end&l=San+Francisco+Bay+Area,+CA&limit=50&start='
#indeed_base_url = 'https://www.indeed.com/jobs?q=data+(science+or+analysis+or+python+or+machine+or+learning+or+statistics)+-senior,+-manager,+-staff,+-head,+-director,+-sr.,+-principal,+-lead,+-JAVA,+-CSS,+-C%2B%2B,+-C%2B,+-HTML,+-full+-stack,+-front+-end,+-back+-end&l=San+Francisco+Bay+Area,+CA&limit=50&radius=25&start='
pg_num = 0

try:
    response = ul.urlopen(indeed_base_url + str(pg_num))
    html_doc = response.read()
except Exception as err:
    print('URL not accessible: {}'.format(err))
    raise  # without html_doc there is nothing to parse
    
soup = BeautifulSoup(html_doc, 'html.parser')
'Ready.'

'Ready.'
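
Rather than pasting a long hand-encoded URL, the query string could also be assembled from its parts. The sketch below is for Python 3 and uses `urllib.parse.urlencode`. The parameter names `q`, `l`, `limit`, and `start` are taken from the URL above, while the helper `build_search_url` and the shortened query are illustrative assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

INDEED_SEARCH = 'https://www.indeed.com/jobs'

def build_search_url(query, location, start=0, limit=50):
    """Assemble a search URL; `start` is the offset of the first result on the page."""
    params = {'q': query, 'l': location, 'limit': limit, 'start': start}
    return INDEED_SEARCH + '?' + urlencode(params)

# Hypothetical, shortened query for the second page of results
url = build_search_url('data science -senior', 'San Francisco Bay Area, CA', start=50)
print(url)
```

Letting `urlencode` handle the escaping avoids mistakes such as a missing `=` after the last parameter.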

**Let's see how many jobs our search query returned on Indeed.**

In [7]:
try:
    # Get the text, which is in the format "Page 1 of {total_results} jobs"
    total_results = soup.find(id='searchCount').get_text()
    # Strip all the text in front of the total_results number
    total_results = total_results[total_results.find('of ') + 3:]
    # Strip all the text after the total_results number
    total_results = total_results[:total_results.find(' jobs')]
    # Remove the comma and convert the string into an integer
    total_results = int(total_results.replace(',', ''))
    print('There are {} jobs on Indeed.'.format(total_results))
except (AttributeError, ValueError):
    # searchCount is missing or its text is not in the expected format
    print('No jobs found.')

There are 1474 jobs on Indeed.
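
The same extraction can be written with a regular expression, and the total then tells us which `start` offsets to request. This is a sketch under the assumption that the count text keeps the "Page 1 of N jobs" shape. The helper names `parse_total_results` and `page_offsets` are made up for illustration.

```python
import re

def parse_total_results(count_text):
    """Pull the total out of text shaped like 'Page 1 of 1,474 jobs'."""
    match = re.search(r'of\s+([\d,]+)\s+jobs', count_text)
    if match is None:
        return 0
    return int(match.group(1).replace(',', ''))

def page_offsets(total_results, jobs_per_page):
    """Return the `start` offset of every results page (ceiling division)."""
    last_page = (total_results + jobs_per_page - 1) // jobs_per_page
    return list(range(0, last_page * jobs_per_page, jobs_per_page))

total = parse_total_results('Page 1 of 1,474 jobs')
offsets = page_offsets(total, 50)
print(total, len(offsets), offsets[-1])  # 1474 30 1450
```

With 1,474 results and 50 per page, the last of the 30 pages starts at offset 1450.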


**Let's write a function that builds this web scraper from scratch.**

In [9]:
def scrape_indeed(url_base, jobs_per_page):
    """
    Write a function that scrapes Indeed for the first time. After the first time, we will automate the web scraper to
    scrape new jobs every day and append the results to the existing csv file.
    """    
    # Use the first page of the search results to create a BeautifulSoup object
    pg_num = 0
    response = ul.urlopen(url_base + str(pg_num))
    html_doc = response.read()
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Print the number of jobs returned from the search query
    total_results = soup.find(id='searchCount').get_text()
    total_results = total_results[total_results.find('of ') + 3:]
    total_results = total_results[:total_results.find(' jobs')]
    total_results = int(total_results.replace(',', ''))
    print('There are {} jobs on Indeed.'.format(total_results))
    
    # Extract the desired job posting information for the jobs that meet our filtering criteria
    last_page = (total_results + jobs_per_page - 1) // jobs_per_page
    job_listings = []
    print('Getting jobs from Indeed...')
    for num in range(0, last_page * jobs_per_page, jobs_per_page):
        try:
            response = ul.urlopen(url_base + str(num))
            page_html = response.read()
        except Exception:
            # Stop paging at the first page that fails to download
            break
        soup = BeautifulSoup(page_html, 'html.parser')
        for post in soup.find_all(class_='result'):
            # Get job post URL; skip results that have no link to follow
            link = post.find(class_='turnstileLink')
            if link is None:
                continue
            # Get job title aka jt
            jt = link.get('title') or ''
            # Get job post company
            try:
                comp = post.find(class_='company').get_text().strip()
            except:
                comp = ''
            # Get job post location
            try:
                location = post.find(class_='location').get_text().strip()
            except:
                location = ''
            # Get job post salary
            try:
                salary = post.find(class_='salary no-wrap').get_text().strip()
            except:
                salary = ''

            if title_qualifies(jt):
                job_match_url = 'http://www.indeed.com' + link.get('href')
                try:
                    html_doc = ul.urlopen(job_match_url).read().decode('utf-8')
                except Exception:
                    # Skip postings that cannot be downloaded or decoded
                    continue

                # Keep the posting only if none of the exclusion patterns match
                exclusion_patterns = [yr_exper, masters1, masters2, masters3, masters4,
                                      masters5, phd1, phd2, phd3, adv_deg]
                if not any(p.search(html_doc) for p in exclusion_patterns):
                    jobs = {
                            'Job Title': jt,
                            'Company': comp,
                            'Location': location,
                            'Salary': salary,
                            'URL': job_match_url
                            }
                    job_listings.append(jobs)
    
    # Export as a csv file
    with open('indeed_job_results.csv', 'wb') as csvfile:
        fieldnames = ['Job Title', 'Company', 'Location', 'Salary', 'URL']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        if job_listings:
            for job in job_listings:
                writer.writerow(job)
            print('Done!')
        else:
            print('No matches for Data Science jobs in Indeed.')

In [10]:
scrape_indeed(indeed_base_url, 50)

There are 1474 jobs on Indeed.
Getting jobs from Indeed...
Done!


## Automate the Web Scraper to Scrape New Job Postings in the Last 3 Days

In [None]:
indeed_recent_base_url = 'https://www.indeed.com/jobs?q=data+%28science+or+analysis+or+python+or+machine+or+learning+or+statistics%29+-senior%2C+-manager%2C+-staff%2C+-head%2C+-director%2C+-sr.%2C+-principal%2C+-lead%2C+-JAVA%2C+-CSS%2C+-C%2B%2B%2C+-C%2B%2C+-HTML%2C+-full+-stack%2C+-front+-end%2C+-back+-end&l=San+Francisco+Bay+Area%2C+CA&sort=date&limit=50&fromage=3&radius=25&start='
#scrape_indeed(indeed_recent_base_url, 50)
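
For the recurring runs, new postings need to be appended to the existing CSV instead of overwriting it. One way to do that is sketched below for Python 3: read the URLs already saved, then append only the postings that are not there yet. The helper `append_new_jobs` and the sample postings are hypothetical, and the standard library's `csv` module is used in place of `unicodecsv` to keep the sketch dependency-free.

```python
import csv
import os
import tempfile

FIELDNAMES = ['Job Title', 'Company', 'Location', 'Salary', 'URL']

def append_new_jobs(path, job_listings):
    """Append postings whose URL is not already in the CSV; return how many were added."""
    seen = set()
    file_exists = os.path.exists(path)
    if file_exists:
        with open(path, newline='', encoding='utf-8') as csvfile:
            seen = {row['URL'] for row in csv.DictReader(csvfile)}

    new_jobs = [job for job in job_listings if job['URL'] not in seen]
    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL)
        if not file_exists:
            writer.writeheader()  # only a brand-new file needs the header row
        writer.writerows(new_jobs)
    return len(new_jobs)

# Usage with made-up postings and a throwaway file
demo_path = os.path.join(tempfile.mkdtemp(), 'indeed_job_results.csv')
job_a = {'Job Title': 'Data Analyst', 'Company': 'Acme', 'Location': 'Oakland, CA',
         'Salary': '', 'URL': 'http://www.indeed.com/job/1'}
job_b = {**job_a, 'Job Title': 'Data Scientist', 'URL': 'http://www.indeed.com/job/2'}
first = append_new_jobs(demo_path, [job_a])
second = append_new_jobs(demo_path, [job_a, job_b])
print(first, second)  # 1 1
```

Using the posting URL as the key is a design choice for this sketch. If the site adds tracking parameters to its links, the same job could appear under different URLs and would need to be normalized first.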