## LinkedIn Job Scraper

The code below scrapes LinkedIn job postings anonymously, that is, without logging in. The search is anonymous because the LinkedIn Terms of Service prohibit using programs for automated extraction of data from the site, and doing so while logged in could result in account suspension.

Searching LinkedIn jobs without logging in has two major drawbacks:
- The search page often fails to display all results, so it is advisable to keep each search narrow.
- Unlike a logged-in search, an anonymous search often does not show the exact number of applicants. Instead, the posting simply says that you will be among the first 25 applicants. Since the number of applicants is one of the more valuable pieces of information in a LinkedIn job search, this is a serious drawback. Even so, we believe anonymous scraping is worthwhile because of the large number of postings one can collect information from.
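
Because the applicants field is sometimes an exact count and sometimes only a bound, it can help to record which of the two was seen. The following is a minimal sketch, not part of the original scraper: the helper name `parse_applicants` is hypothetical, and the caption wordings other than "among the first 25 applicants" are assumptions made for illustration.

```python
import re

def parse_applicants(caption):
    """Return (count, is_exact) parsed from an applicants caption.

    The caption formats handled here are assumptions for illustration.
    """
    match = re.search(r'\d+', caption.replace(',', ''))
    if match is None:
        return None, False  # no number found, e.g. the 'None' placeholder
    # wording such as "among the first 25" or "over 200" signals a bound, not an exact count
    is_bound = re.search(r'\b(first|over)\b', caption, flags=re.IGNORECASE) is not None
    return int(match.group()), not is_bound

for caption in ['Be among the first 25 applicants', '87 applicants', 'None']:
    print(caption, '->', parse_applicants(caption))
```

Keeping the flag alongside the number makes it possible to filter out the bounded values later instead of mixing them with exact counts.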

For this project, we scrape job postings from searches for full-time Data Scientist positions in several US metropolitan areas. To reduce the effect of the limited number of postings displayed in a single search, each search covers only postings from the last week and is run separately for each seniority level: entry, associate, and senior.
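
The notebook pastes a search URL copied from a manual search. As a rough sketch of how such URLs could be assembled for several searches, the helper below rebuilds a query string from the parameter names that appear in that URL (keywords, location, geoId, f_TP, f_JT, f_E). The function name `build_search_url` is hypothetical, the meaning of each code is not documented here, and the values should be copied from a manual search rather than guessed.

```python
from urllib.parse import urlencode, quote

BASE_URL = 'https://www.linkedin.com/jobs/search'

def build_search_url(keywords, location, geo_id, experience_code,
                     time_posted='1,2', job_type='F'):
    """Assemble a job search URL from values copied out of a manual search."""
    params = {
        'keywords': keywords,
        'location': location,
        'geoId': geo_id,
        'f_TP': time_posted,      # time-posted filter, as seen in the copied URL
        'f_JT': job_type,         # job-type filter, as seen in the copied URL
        'f_E': experience_code,   # seniority filter; copy the code from a manual search
    }
    # quote_via=quote encodes spaces as %20 instead of '+'
    return BASE_URL + '?' + urlencode(params, quote_via=quote)

print(build_search_url('Data Scientist', 'Greater Chicago Area', 90000014, 4))
```

Only the seniority code and the location values change between searches, so a loop over those two inputs would cover all the searches described above.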

The scraper uses Selenium with ChromeDriver. Special credit is due to Omer Sakarya, whose tutorial (https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905) was very valuable in building this scraper.

Python regular expressions (the re module) are used to extract relevant information from each job description.
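
To illustrate, the snippet below applies the experience pattern used later in the notebook to a made-up description. The sample sentence is invented for this example and does not come from a real posting.

```python
import re

# same pattern as the one used in the scraping function
EXPERIENCE_PATTERN = r'(\d+\s+\byears|[\d-]+\d+\s+\byears|[\d+]+\s+\byears)'

sample = 'Requires 3-5 years of Python, 5+ years of SQL, and 2 years of cloud experience.'

# findall returns every non-overlapping match, scanning left to right
print(re.findall(EXPERIENCE_PATTERN, sample))
```

The pattern covers plain numbers, ranges, and "N+" forms, but only when followed by the word "years"; a singular "1 year" or an abbreviation such as "yrs" is not captured.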

In [None]:
# import libraries and packages

# use Selenium to get various content from the job postings
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium import webdriver

# necessary packages for extracting, processing, and saving data 
import pandas as pd
import re

# use these to create some random breaks in the process of scraping to mimic human activity
import time
from random import randrange

In [None]:
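# use chromedriver with Selenium
# note: this notebook uses the Selenium 3 API (positional driver path, find_element_by_* methods);
# Selenium 4 replaces these with a Service object and find_element(By.<strategy>, ...)
driver = webdriver.Chrome('/Users/marin/chromedriver_win32/chromedriver') # provide path where chromedriver has been saved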

In [None]:
# go to https://www.linkedin.com/
driver.get('https://www.linkedin.com')

In [None]:
# go to job search page - perform the search manually first and copy the link below

driver.get('https://www.linkedin.com/jobs/search?keywords=Data%20Scientist&location=Greater%2BChicago%2BArea&geoId=90000014&trk=public_jobs_jobs-search-bar_search-submit&f_TP=1%2C2&redirect=false&position=1&pageNum=0&f_JT=F&f_E=4')

In [None]:
# cancel sign-in prompt

try:
    driver.find_element_by_class_name("cta-modal__dismiss-btn").click()  # click the X to cancel
except NoSuchElementException:
    pass

In [None]:
# compile a list of all job links on the left of the page
# note: a glitch limits the page to 40-50 jobs at first -->
# scroll down manually to load the maximum number of jobs before running this cell

# initiate list for job links
job_links = []

# find all job links on page
elems = driver.find_elements_by_xpath("//*[contains(@href, 'full-click')]")

for elem in elems:
    job_links.append(elem.get_attribute("href"))
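
If the page happens to expose the same link through more than one element, or an element without an href, the list can contain repeats or empty entries. A minimal sketch of a cleanup step follows; the helper name `unique_links` and the example URLs are hypothetical.

```python
def unique_links(links):
    """Drop missing and repeated hrefs, keeping first-seen order."""
    # dict.fromkeys keeps the first occurrence of each key and preserves insertion order
    return list(dict.fromkeys(link for link in links if link))

example = ['https://example.com/job/1',
           'https://example.com/job/2',
           'https://example.com/job/1',
           None]
print(unique_links(example))
```

Applying it as `job_links = unique_links(job_links)` before scraping would avoid visiting the same posting twice.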

In [None]:
# check how many job links were collected

print(len(job_links))

In [None]:
# check the first few links - this is mostly for testing; can remove/comment it out for actual scraping

n_jobs = 2

for link in job_links[:n_jobs]: # slicing avoids an IndexError when fewer than n_jobs links were found
    print(link)
    print('\n')

In [None]:
# job search parameters --> don't forget to change for new search!

# list of cities/metro areas
# cities = ['atl', 'aus', 'bos', 'chi', 'dal', 'dc', 'hou', 'la', 'nc', 'ny', 'phi', 'phx', 'por', 'sd', 'sea', 'sf']

city = 'atl'

seniority = 'entry'
#seniority = 'associate'
#seniority = 'senior'

In [None]:
# define search lists for education and data science terms which will be used with regex to extract relevant information

# education key words to find education requirements for the position
search_bs_terms = ['B.S.','BS','Bachelor']
search_ms_terms = ['M.S.','MS','Master']
search_phd_terms = ['PhD','PHD','Ph.D.','Doctor']

# data science key words to find whether position description corresponds to data scientist role
search_ds_terms = ['data scien', 'machine learning', 'deep learning','unsupervised learning',
                   'artificial intelligence','data model','predictive model','data visualization', 
                   'classification','regression','clustering']
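
Short abbreviations need some care when used as regex patterns: an unescaped '.' matches any character, and a bare 'MS' or 'BS' can match inside a longer uppercase word. The sketch below shows one way to tighten the match, assuming a hypothetical helper `has_term`; the sample strings are invented.

```python
import re

def has_term(term, text):
    """True if `term` occurs in `text` as literal text starting at a word boundary."""
    # re.escape makes '.' literal; \b requires the term to start a word
    return re.search(r'\b' + re.escape(term), text) is not None

# plain patterns give false positives on these strings
print(bool(re.search('MS', 'DISTRIBUTED SYSTEMS')), has_term('MS', 'DISTRIBUTED SYSTEMS'))
print(bool(re.search('B.S.', 'BASS player')), has_term('B.S.', 'BASS player'))

# genuine mentions are still found, including stems such as "Master's"
print(has_term('MS', 'MS in Statistics'), has_term('Master', "Master's degree"))
```

Only the start of the term is anchored, so stems like 'Doctor' still match "Doctoral" and "Doctorate".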

In [None]:
# define function to get jobs info
jobs = [] # jobs placeholder

n_jobs = len(job_links)

# initialize education indicators
edu_bs = 0
edu_ms = 0
edu_phd = 0

count_ds_terms = 0

def get_jobs_info(jobs, job_links, n_jobs, edu_bs, edu_ms, edu_phd, 
             search_bs_terms, search_ms_terms, search_phd_terms, search_ds_terms, count_ds_terms):
    
    for i in range(n_jobs):
        
        # set education indicators to 0 before scanning each position
        edu_bs = 0
        edu_ms = 0
        edu_phd = 0
        
        # set data science terms counter to 0 before scanning each position
        count_ds_terms = 0
        
        if i%4 == 0:
            time.sleep(randrange(6) + 5) # every fourth posting wait 5-10 sec 
        
        print('Job #: ', i+1) # this line is mostly for testing -- can comment out during actual scraping
        
        driver.get(job_links[i])
        
        time.sleep(randrange(3) + 3) # random wait of 3 to 5 sec
        
        # position title
        try: 
            pos_title = driver.find_element_by_xpath('//*[@class="topcard__title"]').text
        except NoSuchElementException:
            pos_title = 'None' # it is important to set a "not found value"
        
        # company
        try:
            company = driver.find_element_by_xpath('//*[@class="topcard__org-name-link topcard__flavor--black-link"]').text
        except NoSuchElementException:
            company = 'None' # it is important to set a "not found value"
        
        # location
        try:
            location = driver.find_element_by_xpath('//*[@class="topcard__flavor topcard__flavor--bullet"]').text
        except NoSuchElementException:
            location = 'None' # it is important to set a "not found value"
        
        # time posted
        try:
            time_posted = driver.find_element_by_xpath('//*[contains(@class, "posted-time-ago")]').text
        except NoSuchElementException:
            time_posted = 'None' # it is important to set a "not found value"
        
        # number of applicants
        try:
            num_applicants = driver.find_element_by_xpath('//*[contains(@class, "num-applicants__caption")]').text
        except NoSuchElementException:
            num_applicants = 'None' # it is important to set a "not found value"
        
        # get all info below description
        try:
            industry_info = driver.find_element_by_xpath('//*[@class="job-criteria__list"]').text
            industry_info = industry_info.split('\n')
        except NoSuchElementException:
            industry_info = 'None' # it is important to set a "not found value"
        
        # expand job description by clicking "Show more" button
        try:
            driver.find_element_by_xpath('//*[@class="show-more-less-html__button show-more-less-html__button--more"]').click()
        except (NoSuchElementException, ElementClickInterceptedException):
            pass # button is missing or covered by another element; use whatever text is visible
        
        # get the text of the job description
        try:
            pos_description = driver.find_elements_by_xpath('//*[@class="show-more-less-html__markup"]')
            pos_description = pos_description[0].text
        except (NoSuchElementException, IndexError):
            # find_elements_* returns an empty list rather than raising, so a missing description shows up as IndexError
            pos_description = 'None' # it is important to set a "not found value"
                
        # get education requirements from position description
        
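        # terms are escaped so that '.' is matched literally, and \b keeps short
        # abbreviations such as 'BS' or 'MS' from matching inside longer words
        
        # check for Bachelor degree
        for term in search_bs_terms:
            if re.search(r'\b' + re.escape(term), pos_description):
                edu_bs = 1
                
        # check for Master degree
        for term in search_ms_terms:
            if re.search(r'\b' + re.escape(term), pos_description):
                edu_ms = 1
        
        # check for Doctoral degree
        for term in search_phd_terms:
            if re.search(r'\b' + re.escape(term), pos_description):
                edu_phd = 1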
        
        # get experience requirements from position description
        res_experience = re.findall(r'(\d+\s+\byears|[\d-]+\d+\s+\byears|[\d+]+\s+\byears)', pos_description)
        
        # get data science terms count in position description to gauge how relevant the position is
        for term in search_ds_terms:
            for match in re.finditer(term, pos_description, flags = re.IGNORECASE):
                count_ds_terms += 1
        
        # add job info to jobs
        jobs.append({"Job Title" : pos_title,
                     "Company Name" : company,
                     "Location" : location,
                     "Metro Area" : city.upper(), 
                     "Time Posted" : time_posted,
                     "Number of Applicants" : num_applicants,
                     "Industry Info" : industry_info,
                     "Education-Bachelor" : edu_bs,
                     "Education-Master" : edu_ms,
                     "Education-Doctor" : edu_phd,
                     "Experience" : res_experience,
                     "Data Science Terms Count" : count_ds_terms})

In [None]:
# perform scraping of job postings
jobs = []
n_jobs = len(job_links)

get_jobs_info(jobs, job_links, n_jobs, edu_bs, edu_ms, edu_phd, 
             search_bs_terms, search_ms_terms, search_phd_terms, search_ds_terms, count_ds_terms)

In [None]:
# convert the collected info to a data frame

data = pd.DataFrame(jobs)

data.head()
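
The "Industry Info" column holds the criteria block split into lines. If those lines alternate between a label and its value, which is an assumption about the page layout and not something the notebook confirms, they can be paired into a dictionary. The helper name `criteria_to_dict` and the sample values are hypothetical.

```python
def criteria_to_dict(items):
    """Pair alternating label/value lines into a dict (the layout is an assumption)."""
    if isinstance(items, str):
        return {}  # the scraper stores the string 'None' when the block was not found
    # even positions are treated as labels, odd positions as their values
    return dict(zip(items[0::2], items[1::2]))

sample = ['Seniority level', 'Entry level', 'Employment type', 'Full-time']
print(criteria_to_dict(sample))
print(criteria_to_dict('None'))
```

With pandas this could be applied per row, for example `data['Industry Info'].apply(criteria_to_dict)`, before expanding the result into separate columns.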

In [None]:
# save job links to an Excel file so that they can be reused later
import os

os.makedirs('raw_data_5', exist_ok = True) # create the output folder if it does not exist yet

links_filename = 'raw_data_5/joblinks_ds_' + city + '_' + seniority + '_5.xlsx'

data_links = pd.DataFrame(job_links)
data_links.to_excel(links_filename, index = False)

In [None]:
# save job data as an excel file
jobs_filename = 'raw_data_5/jobs_ds_' + city + '_' + seniority + '_5.xlsx'

data.to_excel(jobs_filename, index = False)

In [None]:
# read saved file to check if it reads correctly

data_test = pd.read_excel(jobs_filename)

data_test.head()

In [None]:
# display the job link at index 23 for inspection
print(job_links[23])

In [None]:
# remove the job link at index 23 from the list
del job_links[23]