# Automatically scrape job postings of a competitor

We will now see scraping in action. Imagine you are working in HR for a major retailer. Your boss asks you to monitor the strategic hiring decisions of your close competitors. Naturally, you cannot just call them up, but you could take a look at their job postings to see ($i$) how much they are hiring and ($ii$) what types of positions they are hiring for.

Now, you could log onto their website every day, see what job postings there are, compare that with the job postings from before, and save the relevant data. But why go through so much effort if we can just automate the task?

## 1. Using BeautifulSoup

The first example relies purely on what we have learned about BeautifulSoup and Requests (and a bit of Pandas!).

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We are searching for positions in the head office of Aldi. On the website, we see that there are different types of head office positions, each with their own website. Let's get the links to those sub-sites.

In [None]:
url = "https://www.aldirecruitment.co.uk/head-office"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)

In [None]:
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
print(links)

We want to get only the links to actual job postings, so we have to clean the results somewhat:

In [None]:
cleaned_links = []
for link in links:
    if link is not None and link != '/head-office/' and link.startswith('/head-office/'):
        cleaned_links.append(link.replace('/head-office',''))
links = cleaned_links
print(links)
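As an alternative to stripping the prefix and concatenating strings later, the standard library's `urljoin` can build absolute URLs directly. The following is a minimal sketch; the helper name `category_urls` and the example links are made up for illustration and are not taken from the Aldi site.

```python
from urllib.parse import urljoin


def category_urls(base_url, hrefs, prefix="/head-office/"):
    """Return absolute URLs for all hrefs that point below `prefix`."""
    urls = []
    for href in hrefs:
        # Skip missing links, the overview page itself, and unrelated pages
        if href is None or href == prefix or not href.startswith(prefix):
            continue
        absolute = urljoin(base_url, href)
        # Keep each category only once, in order of first appearance
        if absolute not in urls:
            urls.append(absolute)
    return urls


# Hypothetical hrefs, similar in shape to what soup.find_all('a') returns
example_hrefs = [None, "/", "/head-office/", "/head-office/buying",
                 "/head-office/buying", "/stores/manager"]
example_urls = category_urls("https://www.aldirecruitment.co.uk/head-office", example_hrefs)
print(example_urls)
```

The rest of this notebook keeps the relative links, since the department name is derived from them later on.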

Let's see how many postings there are on one of the sub-sites. For this, we have to find the right tags, using their class argument. Again, inspecting the site is very important!

In [None]:
url + links[0]

In [None]:
category_url = url + links[0]
page = requests.get(category_url)
soup = BeautifulSoup(page.content, "html.parser")
postings = soup.find_all("div", class_="c-career--dropdown")
len(postings)

We now extract some information from the actual position: the job title.

In [None]:
title = postings[0].find("div", class_="c-career--dropdown__content").find('h2')
print(title.text)

Aside from the title and the text description (which we will ignore in this example, but which can hold extremely useful information), there are some key details about the job, such as the work time and the salary.

In [None]:
details = postings[0].find_all("div", class_="c-job-details__content")
print(details)
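If the live site has changed or is unavailable, the same `find` calls can be tried on a small piece of hand-written HTML. The markup below is hypothetical: it only mimics the class names used above and is not copied from the Aldi site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in this notebook
demo_html = """
<div class="c-career--dropdown">
  <div class="c-career--dropdown__content"><h2>Buying Assistant</h2></div>
  <div class="c-job-details__content">
    <span class="c-job-details__title">Salary</span>
    <div class="c-job-details__text">£30,000 rising to £35,000</div>
  </div>
  <div class="c-job-details__content">
    <span class="c-job-details__title">Hours and benefits</span>
    <div class="c-job-details__text">40-hour week, 5 weeks holiday</div>
  </div>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_posting = demo_soup.find("div", class_="c-career--dropdown")
demo_title = demo_posting.find("div", class_="c-career--dropdown__content").find("h2").get_text(strip=True)

# Collect each detail block as a title -> text pair
demo_details = {}
for block in demo_posting.find_all("div", class_="c-job-details__content"):
    key = block.find("span", class_="c-job-details__title").get_text(strip=True)
    demo_details[key] = block.find("div", class_="c-job-details__text").get_text(strip=True)

print(demo_title)
print(demo_details)
```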

We definitely want to get the salary information. Sometimes the text gives multiple values, so let's make sure to save the lowest and the highest value. (Multiple values may be due to changes over time or to different starting requirements; we can adapt our scraper to capture that complexity later on.)

In [None]:
detail = details[0]
detail_text = detail.find('div', class_="c-job-details__text").text
print(detail_text)

In [None]:
temp = detail_text.replace(',','')
## Replace hyphens with spaces so that a range like £30,000-£35,000 splits into two numbers
temp = temp.replace('-',' ')
temp = temp.split()
salary_numbers = [float(s[1:]) for s in temp if s.startswith('£')]
ub = max(salary_numbers)
lb = min(salary_numbers)
print(ub)
print(lb)
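The string-splitting approach above depends on how the salary text happens to be spaced. A regular expression is one way to make the extraction less sensitive to that. This is only a sketch: the function name `parse_salary_range` and the sample strings are invented for illustration.

```python
import re


def parse_salary_range(text):
    """Return (lowest, highest) pound amount found in `text`, or (None, None)."""
    # A pound sign, optional whitespace, then digits with optional commas and decimals
    matches = re.findall(r"£\s*(\d[\d,]*(?:\.\d+)?)", text)
    amounts = [float(m.replace(",", "")) for m in matches]
    if not amounts:
        return None, None
    return min(amounts), max(amounts)


# Invented sample strings
print(parse_salary_range("£44,090 rising to £54,500"))
print(parse_salary_range("£30,000-£35,000 per year"))
print(parse_salary_range("Competitive"))
```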

Let's also try to capture the weekly working hours:

In [None]:
detail = details[2]
detail_text = detail.find('div', class_="c-job-details__text").text
## Default to None in case the text does not mention weekly hours
work_time = None
for s in detail_text.split():
    if '-hour' in s:
        work_time = int(s.replace('-hour',''))
print(work_time)
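The same idea works for the weekly hours. The sketch below assumes the hours are written as a number followed by `-hour` or `-hours`; the function name `parse_weekly_hours` and the sample strings are hypothetical.

```python
import re


def parse_weekly_hours(text):
    """Return the number in front of '-hour' or '-hours' as a float, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)-hours?\b", text)
    return float(match.group(1)) if match else None


# Invented sample strings
print(parse_weekly_hours("A 40-hour week, plus five weeks' holiday"))
print(parse_weekly_hours("37.5-hours per week"))
print(parse_weekly_hours("Flexible working"))
```

Returning a float rather than an int avoids an error for values such as 37.5.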

The following code combines our extraction of job details:

In [None]:
details = postings[0].find_all("div", class_="c-job-details__content")
## Default to None in case a detail is missing from the posting
ub = None
lb = None
work_time = None
for detail in details:
    detail_title = detail.find('span', class_="c-job-details__title").text
    detail_text = detail.find('div', class_="c-job-details__text").text
    if detail_title == 'Salary':
        temp = detail_text.replace(',','')
        ## Replace hyphens with spaces so that a range like £30,000-£35,000 splits into two numbers
        temp = temp.replace('-',' ')
        temp = temp.split()
        salary_numbers = [float(s[1:]) for s in temp if s.startswith('£')]
        ub = max(salary_numbers)
        lb = min(salary_numbers)
    elif detail_title == 'Hours and benefits':
        for s in detail_text.split():
            if '-hour' in s:
                work_time = int(s.replace('-hour',''))
print(ub)
print(lb)
print(work_time)

Finally, we put it all together in a simple-to-call function that returns a data frame of job postings. We have to make a few adjustments to avoid errors. These are marked with comments.

In [None]:
def scrape_aldi_jobs(starting_page = 'head-office'):
    url = "https://www.aldirecruitment.co.uk/" + starting_page
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    links = []
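    ## Build the link prefix from the argument instead of hard-coding '/head-office'
    prefix = '/' + starting_page
    for link in soup.find_all('a'):
        new_link = link.get('href')
        if new_link is not None and new_link.startswith(prefix + '/'):
            new_link = new_link.replace(prefix, '')
            if new_link != '/':
                links.append(new_link)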
    
    department = []
    titles = []
    ubs = []
    lbs = []
    hours = []
    for link in links:
        category_url = url + link
        page = requests.get(category_url)
        soup = BeautifulSoup(page.content, "html.parser")
        postings = soup.findAll("div", class_="c-career--dropdown")
        for posting in postings:
            ## Also grabbing the department information
            dep_name = link.replace('-',' ').replace('/','')
            department.append(dep_name)
            titles.append(posting.find("div", class_="c-career--dropdown__content").find('h2').text)
            ## Default to None so that every posting adds exactly one value to each list,
            ## even if a detail block is missing (otherwise the lists end up with different lengths)
            ub = None
            lb = None
            work_time = None
            details = posting.find_all("div", class_="c-job-details__content")
            for detail in details:
                detail_title = detail.find('span', class_="c-job-details__title").text
                detail_text = detail.find('div', class_="c-job-details__text").text
                if detail_title == 'Salary':
                    temp = detail_text.replace(',','')
                    ## Replace hyphens with spaces so that a range like £30,000-£35,000 splits into two numbers
                    temp = temp.replace('-',' ')
                    temp = temp.split()
                    salary_numbers = [float(s[1:]) for s in temp if s.startswith('£')]
                    ## Salary may not be specified
                    if len(salary_numbers) > 0:
                        ## Salaries are sometimes specified as per week instead of per year
                        if 'per' in temp and 'week' in temp:
                            salary_numbers = [salary*52 for salary in salary_numbers]
                        ub = max(salary_numbers)
                        lb = min(salary_numbers)
                ## Some postings say "Benefits" instead of "Hours and benefits", and sometimes the spelling is capitalized differently
                elif detail_title.lower() in ('hours and benefits', 'benefits'):
                    ## Some postings do not specify a number of hours per week
                    for s in detail_text.split():
                        if '-hour' in s:
                            ## Some postings write, e.g., 40-hour per week, some 40-hours per week
                            work_time = int(s.replace('-hours','').replace('-hour',''))
            ubs.append(ub)
            lbs.append(lb)
            hours.append(work_time)
                        
    job_data = pd.DataFrame(
        {'Department': department,
         'Job title': titles,
         'Salary lower': lbs,
         'Salary upper': ubs,
         'Weekly hours': hours
        })
    return job_data

Let's try it out:

In [None]:
aldi_job_data = scrape_aldi_jobs()
aldi_job_data.head()

We can now explore the data frame, improve our code if we find issues, and then analyze it. For example, let's have a look at a simple histogram of postings per department.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fig.set_size_inches(30, 10)
sns.histplot(data=aldi_job_data, x="Department",ax=ax)
plt.show()

Finally, save the job postings we found as a CSV:

In [None]:
aldi_job_data.to_csv('Aldi_postings_2021-10-20.csv', index=False)
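Once there are snapshots from several days, they can be compared to see which postings appeared and which disappeared, which was the original motivation for automating the task. The sketch below assumes two data frames with the `Department` and `Job title` columns produced above; the function name `compare_snapshots` and the example rows are invented.

```python
import pandas as pd


def compare_snapshots(old, new, keys=("Department", "Job title")):
    """Return (added, removed) postings between two snapshots, matched on `keys`."""
    keys = list(keys)
    # An outer merge with indicator=True marks where each row was found
    merged = old[keys].drop_duplicates().merge(
        new[keys].drop_duplicates(), on=keys, how="outer", indicator=True
    )
    added = merged.loc[merged["_merge"] == "right_only", keys].reset_index(drop=True)
    removed = merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)
    return added, removed


# Invented example snapshots; in practice these would come from pd.read_csv
old_postings = pd.DataFrame({"Department": ["buying", "it"],
                             "Job title": ["Buying Assistant", "Data Analyst"]})
new_postings = pd.DataFrame({"Department": ["buying", "finance"],
                             "Job title": ["Buying Assistant", "Accountant"]})

added, removed = compare_snapshots(old_postings, new_postings)
print(added)
print(removed)
```

Matching on department and title is a simplification: two different vacancies with the same title in the same department would count as one.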

## 2. A more advanced case - using Selenium to enter details

Let's get data from a second competitor. We will use Lidl here (I am, of course, not biased in my choices). Check out Lidl's hiring page https://careers.lidl.co.uk/ and start a search. Then look at the link where you landed. Can you see why things are a bit more complex here?

Since we cannot just find the right links, we need to act like a browser. This is where Selenium comes in - it will literally run a browser!

In [None]:
import requests
from bs4 import BeautifulSoup
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import pandas as pd
import re
import time

We need to choose the type of browser that Selenium runs, and each comes with its own access and installation requirements. I personally recommend using Chrome. However, to use Chrome with Selenium, you need ChromeDriver. Recent Selenium releases (4.6 and later) include Selenium Manager, which downloads a matching driver automatically, so the manual steps below may be unnecessary. If you do need to install it yourself, the site https://sites.google.com/chromium.org/driver/downloads provides download files, which should work fine for Windows users. Simply download and unpack the Zip, which gives you a .exe file. Either move it to a folder that is already on your PATH, or add its folder to your PATH (https://stackoverflow.com/questions/4822400/register-an-exe-so-you-can-run-it-from-any-command-line-in-windows gives a good description of how to do this).

On Mac, you may run into access issues. The easiest way to proceed is to use Homebrew (https://brew.sh/ shows how to use it). Once done, type
```
brew install chromedriver
```
into your terminal.
Other options can be found here: https://www.kenst.com/2015/03/installing-chromedriver-on-mac-osx/ (note that the syntax can be a bit outdated).

Once done, the below code will open a new window in the browser of your choice (here Chrome):

In [None]:
driver = webdriver.Chrome()
driver.get("https://careers.lidl.co.uk/jobsearch")

You will notice that this is a completely new Chrome process, so cookies are not yet accepted. To see what's going on, let's start by accepting cookies. How do we do this? We simply find the right button (by inspecting the site, then copying the XPath), and then let Selenium click this button!

In [None]:
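from selenium.webdriver.common.by import By

## Selenium 4 removed find_element_by_xpath; use find_element with a By locator instead
cookie_button = driver.find_element(By.XPATH, '//button[@class="cookie-alert-extended-button"]')
cookie_button.click()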

The Lidl jobs site offers the option to select head office positions, just like Aldi. However, the link is relatively complex, so we will simply let Selenium click on the right button again.

In [None]:
from selenium.webdriver.common.by import By

head_office_button = driver.find_element(By.XPATH, '//h4[contains(text(),"Head Office Roles")]')
head_office_button.click()

There are a few positions here. If you click on any of those, you'll notice that the links are relatively simple in structure and don't depend on your website interaction. Hence, the easiest approach is to collect all the posting links:

In [None]:
from selenium.webdriver.common.by import By

posting_urls = []
postings = driver.find_elements(By.XPATH, '//a[@class="jobResult"]')
for posting in postings:
    posting_urls.append(posting.get_attribute('href'))
print(posting_urls)

It may be that postings are spread across multiple pages (delete the filters to see this). Luckily, there is a forward button that lets us step through the pages. We can easily combine this with our previous code. Note that we only move forward if the next-page element actually exists and is enabled.
There can be a problem with identifying the button location. Usually, this can be fixed by maximizing the window in which Selenium runs.

Also, we add an implicit wait, which tells Selenium to keep looking for an element for up to five seconds before giving up, so that the page has time to load before we read or click anything.

In [None]:
from selenium.webdriver.common.by import By

## Maximizing the window helps Selenium locate the pagination button
driver.maximize_window()
## The implicit wait only needs to be set once; it applies to every element lookup
driver.implicitly_wait(5)

stop = False
posting_urls = []
while not stop:
    postings = driver.find_elements(By.XPATH, '//a[contains(@class,"jobResult")]')
    print("Found " + str(len(postings)) + " postings")
    for posting in postings:
        posting_urls.append(posting.get_attribute('href'))
    next_elements = driver.find_elements(By.CLASS_NAME, 'paginationArrow_next')
    ## Only move forward if the next-page button exists and is enabled
    if len(next_elements) > 0 and next_elements[0].is_enabled():
        driver.execute_script("arguments[0].click();", next_elements[0])
        time.sleep(3)
    else:
        stop = True

In [None]:
lidl_job_data = pd.DataFrame({'url': posting_urls})
lidl_job_data.to_csv('Lidl_postings_2021-10-20.csv', index=False)
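If the scraper runs regularly, the date in the file name should not be typed by hand. A small helper, sketched here with the hypothetical name `snapshot_filename`, can build it from the current date.

```python
from datetime import date


def snapshot_filename(company, day=None):
    """Build a file name like 'Lidl_postings_2021-10-20.csv' (default day: today)."""
    day = day or date.today()
    return f"{company}_postings_{day.isoformat()}.csv"


print(snapshot_filename("Lidl", date(2021, 10, 20)))
```

Calling `lidl_job_data.to_csv(snapshot_filename("Lidl"), index=False)` would then write one file per day.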

## Exercise 1

Now that we have loaded the urls of the relevant vacancies, can you extract key information (e.g., title and postcode of location, maybe also salary)? You might want to take a look at what we did for the Aldi vacancies.

If you are having problems running Selenium, you can use the uploaded list of urls (note: some may no longer be working):

In [None]:
posting_urls = pd.read_csv('Lidl_postings_2021-10-20.csv')['url'].tolist()