 # Web Scraping Job Postings

#### What industry factors are most important in predicting the salary amounts?
For example, can required skills accurately predict job title?

We will focus on data-related job postings, e.g. data scientist, data analyst, research scientist, and business intelligence. We will decrease the scope by limiting our search to a single region.

#### Factors that impact salary
To predict salary we will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, we will be estimating the listed salary amounts. If we choose to frame this as a classification problem, we will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

We will use the following techniques;

- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models
- Whatever you decide to use, the most important thing is to justify your choices and interpret your results. Communication of your process is key. 
- Note that most listings DO NOT come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

Collecting data from Indeed.com on data-related jobs to use in predicting salary trends for your analysis.

In [1]:
import requests
from scrapy.selector import Selector
import pandas as pd
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

from time import sleep, time
from random import randint

from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
import itertools
import math

In [214]:
homepage = 'https://au.indeed.com/'
response = requests.get(homepage)
print (response.status_code)
HTML = response.text
HTML[0:150]

200


'<!DOCTYPE html>\n<html dir="ltr" lang="en">\n<head>\n    <script id="polyfill-script-bundle">\n            /* Disable minification (remove `.min` from URL'

In [121]:
xpath_selector_indeed = Selector(text=HTML)
xpath_selector_indeed

<Selector xpath=None data='<html dir="ltr" lang="en">\n<head>\n   ...'>

In [682]:
# visit the indeed page
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get("https://au.indeed.com/")
# always good to check we've got the page we think we do
assert "Indeed" in driver.title

In [683]:
# find the search position
elem = driver.find_element_by_name("q")
# clear it
elem.clear()
# type in pycon
elem.send_keys("data scientist")
# find the location position
elem_l = driver.find_element_by_name("l")
# clear it
elem_l.clear()
# type in sydney
elem_l.send_keys("sydney")
# send those keys
elem.send_keys(Keys.RETURN)

In [684]:
# Advanced Search Options
elem = driver.find_element_by_class_name(('sl')).click();
# Sleep 1 second
sleep(1)
# Drop down search results count option
elem = driver.find_element_by_id('limit')
for option in elem.find_elements_by_tag_name('option'):
    if option.text == '50':
        option.click()
        break
# Return to job listing page
elem = driver.find_element_by_id('fj').click();

In [685]:
#grab the page source
html = driver.page_source
html = BeautifulSoup(html, 'lxml')

In [185]:
html.find_all('div', {'class':'title'})[0]

<div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0BAWXrQOY2d9ZyjtwpyS-77zGxevxFwit0Fl2sK8fghHFpESM8jRFl4xNHUOoYNumxHNLG4zE7BR4EBxRjnWbhjkGeaG0e79owVMKxo6yq8BfcE00EZlCet1VZ1PW1C241ZLkH5mpoQ6Hs2PCaXtwHQC59qC8E_mQFFY1q5zHX-jNtY_i67FuY6DMhde6BWPaBggNcMMQru-Kat26MAjiVLsq9vG_slCfHT0rIboQpfBK4G3uTV5iImAEE2a4zRbJ--R6-D3uc-U2V9n7d1KAqlB0A6EO_lZK5Oa1Yitv_isca9FSnmrSH9z22hBg4iz8lKZGcBPhCH7rtFhjdTQO4qbcKK6JOojxwsFvgAyG2LJiufRHr5a4BW6UnASUjxtflQc8QBOQHHeXLVNcrEmfWvTFxS-l0xg092qvW7-fJlBnWzrSec7Q69c6KFYBt08OUq7WcXdyhOmTQCyhQ8YsK0Q29Cj4oQbCeNnC_FNxY0fybG8XtPR-dF0JATtKXR8A4MLH7GcqudI4-gvI3ZJT0gZmtXUm7L4W7x5KvLL5x4D_VF4ZGXHniPcbGv9eWdAfdNRURHR9VvnKoD31ZEBsADpAnvcVmlKbY=&amp;p=0&amp;fvj=0&amp;vjs=3" id="sja0" onclick="setRefineByCookie(['radius']); sjoc('sja0', 1); convCtr('SJ'); rclk(this,jobmap[0],true,1);" onmousedown="sjomd('sja0'); clk('sja0'); rclk(this,jobmap[0],1);" rel="noopener nofollow" target="_blank" title="Logistics Anal

In [136]:
title = html.find_all('div', {'class':'title'})

In [137]:
titles = []
for i in range(len(title)):
    titles.append(title[i].get_text(strip=True))

In [138]:
company = html.find_all('span', {'class':'company'})

In [139]:
companies = []
for i in range(len(company)):
    companies.append(company[i].get_text(strip=True))

In [140]:
links = html.find_all('div', {'class':'title'})

htmls = []
for i in range(len(links)):
    htmls.append(links[i].a['href'])

jobs = []
for i in htmls:
    r = requests.get("https://au.indeed.com/" + i)
    html1 = BeautifulSoup(r.text, 'lxml')
    jobs.append(html1)

In [141]:
jobs = []
for i in htmls:
    r = requests.get("https://au.indeed.com/" + i)
    html1 = BeautifulSoup(r.text, 'lxml')
    jobs.append(html1)

In [278]:
salaries = []
for i in range(len(jobs)):
    try:
        a = jobs[i].find_all('span', {'class':'jobsearch-JobMetadataHeader-iconLabel'})
        salaries.append(a[1].text)
    except:
        salaries.append('NaN')

In [182]:
descriptions = []
for i in range(len(jobs)):
        a = jobs[i].find_all('div', {'class':'jobsearch-jobDescriptionText'})
        descriptions.append(a[0].get_text(strip=True))

In [680]:
#-------------------------------------------------
# Generate Delay
def generate_delay():
        mean = 1
        sigma = 0.4
        return math.fabs(np.random.normal(mean,sigma,1)[0])
    
#-------------------------------------------------
#Checking for 'Next' button on page
def next_check(driver):
                driver.find_elements_by_xpath('//*[@class="pagination"]/a')[-1].text == 'Next »'
                next_url = driver.find_elements_by_xpath('//*[@class="pagination"]/a')[-1].get_attribute('href')
                driver.get(next_url)
# -------------------------------------------------
def indeed_scrape(homepage ='https://au.indeed.com/advanced_search',
                  role = ['Data Analytics', 'Data Scientist', 'Business Intelligence', 'Data Consultant'],
                  salary_bands = ['$80,000-$89,000','$90,000-$99,000','$100,000-$109,000','$110,000-$119,000','$120,000-$139,000','$140,000-$160,000']):
    
    queries = list(itertools.product(role, salary_bands))
    
    driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
    
    
    soup = []
    titles = []
    companies = []
    htmls = []
    salary_band = []
    URLS = [] 
    salaries = []
    descriptions = []
    suburbs = []
    
    
    for query in queries:
        driver.get(homepage)
        
        #search job title
        title = driver.find_element_by_name('as_and')
        title.clear()
        title.send_keys(query[0])

        #salary estimate band
        salary = driver.find_element_by_name('salary')
        salary.clear()
        salary.send_keys(query[1])

        #location
        location = driver.find_element_by_name('l')
        location.clear()
        location.send_keys('Sydney')

        # no recruiters
        elem = driver.find_element_by_id('norecruiters').click()

        #distance from Sydney (25km)
        dist = driver.find_element_by_xpath('//select[@id="radius"]/option[5]').click()
        
        #full-time job dropdown
        elem = driver.find_element_by_id('jt')
        for option in elem.find_elements_by_tag_name('option'):
            if option.text == 'full-time':
                option.click()
                break
        
        # Drop down no of results
        elem = driver.find_element_by_id('limit')
        for option in elem.find_elements_by_tag_name('option'):
            if option.text == '50':
                option.click()
                break

        #Bot detection avoidance
        generate_delay()

        # Return to job listing page
        elem = driver.find_element_by_id('fj').click();

        #grab the page source
        html = driver.page_source
        html_b4 = BeautifulSoup(html, 'lxml')
        
        # Getting individual job URL's    
        htmls = []
        links = html_b4.find_all('div', {'class':'title'})    
        for i in range(len(links)):
            htmls.append("https://au.indeed.com/" + links[i].a['href'])
            
        jobs = []
        for i in htmls:
            r = requests.get(i)
            html1 = BeautifulSoup(r.text, 'lxml')
            jobs.append(html1)
            URLS.append(i)
            
        #companies
        companies = []
        for i in range(len(jobs)):
            try:
                a1 = jobs[i].find('div', {'class':'icl-u-lg-mr--sm-icl-u-xs-mr--xs'})
                companies.append(a1.get_text(strip=True))
            except:
                companies.append('NaN')
        
        #Job Titles
        for i in range(len(jobs)):
            try:
                a2 = jobs[i].get_text(strip=True).split('-')[0]
                titles.append(a2)
            except:
                titles.append('NaN')
                
        #Suburbs
        for i in range(len(jobs)):
            try:
                a3 = jobs[i].find_all('span', {'class':'jobsearch-JobMetadataHeader-iconLabel'})
                suburbs.append(a3[0].get_text(strip=True))
            except:
                suburbs.append('NaN')

        #Salaries
        for i in range(len(jobs)):
            try:
                a4 = jobs[i].find_all('span', {'class':'jobsearch-JobMetadataHeader-iconLabel'})
                salaries.append(a4[2].text)
                salary_band.append(query[1])
            except:
                salaries.append('NaN')
                salary_band.append(query[1])
        
        #Job Descriptions
        for i in range(len(jobs)):
                a5 = jobs[i].find('div',{'id':'jobDescriptionText'}).get_text(strip=True)
                descriptions.append(a5)
        
        generate_delay()
                
        try:
            while driver.find_elements_by_xpath('//*[@class="pagination"]/a')[-1].text == 'Next »':
                next_url = driver.find_elements_by_xpath('//*[@class="pagination"]/a')[-1].get_attribute('href')
                driver.get(next_url)
                
                #Next Page Source
                html_next = driver.page_source
                html_b4_next = BeautifulSoup(html_next, 'lxml')
            
                generate_delay()
                     
                # Getting individual job URL's    
                links = html_b4_next.find_all('div', {'class':'title'})
                htmls1 = []
                for i in range(len(links)):
                    htmls1.append("https://au.indeed.com/" + links[i].a['href'])
                    
                jobs = []
                for i in htmls1:
                    r = requests.get(i)
                    html1 = BeautifulSoup(r.text, 'lxml')
                    jobs.append(html1)
                    URLS.append(i)
                    
                #companies
                companies = []
                for i in range(len(jobs)):
                    try:
                        a1 = jobs[i].find('div', {'class':'icl-u-lg-mr--sm-icl-u-xs-mr--xs'})
                        companies.append(a1.get_text(strip=True))
                    except:
                        companies.append('NaN')
                
                #Job Titles
                for i in range(len(jobs)):
                    try:
                        a2 = jobs[i].get_text(strip=True).split('-')[0]
                        titles.append(a2)
                        role.append(query[0])
                    except:
                        titles.append('NaN')
                        role.append(query[0])
                
                #Suburbs
                for i in range(len(jobs)):
                    try:
                        a3 = jobs[i].find_all('span', {'class':'jobsearch-JobMetadataHeader-iconLabel'})
                        suburbs.append(a3[0].get_text(strip=True))
                    except:
                        suburbs.append('NaN')
                
                #Salaries
                for i in range(len(jobs)):
                    try:
                        a4 = jobs[i].find_all('span', {'class':'jobsearch-JobMetadataHeader-iconLabel'})
                        salaries.append(a4[2].text)
                        salary_band.append(query[1])

                    except:
                        salaries.append('NaN')
                        salary_band.append(query[1])
                
                #Job Descriptions
                for i in range(len(jobs)):
                        a5 = jobs[i].find('div',{'id':'jobDescriptionText'}).get_text(strip=True)
                        descriptions.append(a5)
                
                generate_delay()
        
        except:
            pass
        
        print(query,len(titles),
                    len(companies),
                    len(suburbs),
                    len(descriptions),
                    len(salaries),
                    len(salary_band),
                    len(URLS))
        
    # Assigning to a DataFrame
    df = pd.DataFrame({'Job_Titles':titles, 'Company':companies, 'Suburbs':suburbs, 'Salaries':salaries, 'Descriptions':descriptions, 'Salary_Bracket':salary_band, 'URL':URLS}) 
    df.to_csv(path_or_buf="/")

    return df

In [681]:
indeed_scrape()

('Data Analytics', '$80,000-$89,000') 122 14 122 122 122 122 122
('Data Analytics', '$90,000-$99,000') 233 56 233 233 233 233 243


MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=61297): Max retries exceeded with url: /session/fb12f16596760af70fdc8f715bd8da44/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x14d051f10>: Failed to establish a new connection: [Errno 61] Connection refused'))