## Scraping job search results from Jobsite with Selenium on Google Colab

In this notebook, I automate a job search on the Jobsite website using Python and Selenium, scrape the listed jobs, then process the data and store it in a CSV file.

The site to scrape:
https://www.jobsite.co.uk/

For each listed job, the following fields are scraped:
- Job title
- Company
- Location
- Salary
- Job description

## Install Chromedriver and Selenium on Google Colab

In [None]:
%%shell

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

In [None]:
!apt-get update
!apt-get install -y chromium chromium-driver

In [None]:
!pip install selenium

In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

# from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# from selenium.common.exceptions import ElementClickInterceptedException

import time  # the cells below call time.sleep(...)

import pandas as pd
import numpy as np

In [6]:
def web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--verbose")
    options.add_argument('--no-sandbox') # needed, because colab runs as root
    options.add_argument('--headless')  # or use pyvirtualdisplay
    options.add_argument('--disable-gpu')
    options.add_argument("--window-size=1920,1200")  # no space after the comma
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver

## Scraping data with selenium

In [188]:
driver = web_driver()

url = 'https://www.jobsite.co.uk/'

driver.get(url)

The cookie consent popup needs handling first: if it appears, the browser clicks the Accept button.

In [189]:
for i in range(3):
  try:
    cookie_accept = driver.find_element(By.ID, 'ccmgt_explicit_accept')
    cookie_accept.click()
    print('ACCEPT CLICKED')
    break
  except Exception:
    print('THERE IS NO COOKIES POPUP or WAITING TO LOAD')
    time.sleep(5)

THERE IS NO COOKIES POPUP or WAITING TO LOAD
ACCEPT CLICKED
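
The retry pattern in the cell above can be pulled into a small reusable function. This is only a minimal sketch: `click_when_present` is a hypothetical helper name, and it takes any zero-argument callable that returns an object with a `.click()` method, so it can be tried without a browser. It catches `Exception` broadly for simplicity; in the notebook it could be narrowed to Selenium's `NoSuchElementException`.

```python
import time


def click_when_present(find_button, attempts=3, delay=5.0, sleep=time.sleep):
    """Try to find and click a button, retrying a few times.

    find_button: zero-argument callable returning an object with .click();
                 it is expected to raise if the button is not there yet.
    sleep:       injectable so the wait can be replaced when experimenting.
    Returns True if a click succeeded, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            find_button().click()
            return True
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(delay)
    return False
```

In this notebook it could be called as `click_when_present(lambda: driver.find_element(By.ID, 'ccmgt_explicit_accept'))`, with the return value telling you whether the popup was actually dismissed.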


A browser screenshot can be taken to check that the page loaded as expected.

In [190]:
# driver.save_screenshot('1.jpg')

## Job search query automation

First, I will run a job search with the following example entries:

- Job title to search - 'Software Engineer'
- Location - 'Manchester'
- Radius - 'within 30 miles' (select from dropdown selection)

In [191]:
job_query = 'Software Engineer'
location_query = 'Manchester'
radius_query = '30 miles'

Now enter these details into the input fields to build the search query.

In [192]:
job_title = driver.find_element(By.ID, 'keywords')
job_title.send_keys(job_query)

location = driver.find_element(By.ID, 'location')
location.send_keys(location_query)

# choose '30 miles' from the dropdown using Selenium's Select helper
dropdown = driver.find_element(By.ID, 'Radius')
radius = Select(dropdown)
radius.select_by_visible_text(radius_query)

wait = WebDriverWait(driver, 20)

search = wait.until(EC.element_to_be_clickable((By.ID, 'search-button')))
driver.execute_script('arguments[0].click()', search)
print('Search button clicked')

Search button clicked


The search has now been made, and the next step is to scrape the listed jobs.

But first, I need to find out how many pages I actually need to scrape.

For example, a search might return 100 matching jobs, but Jobsite often extends the results with additional listings, announced with a message like this:

**"We have extended your search with 652 more results from outside the region"**

In that case, scraping through to the very last page of results would pick up those extended jobs, which I do not want in my final data.

Therefore, I take the number of jobs reported for the search, divide it by 25 (the number of jobs listed per page) and round up, so that only the pages that are needed get scraped, for speed and efficiency.
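
As a worked example of that page calculation, here is a minimal sketch. `pages_needed` is a hypothetical helper name, and 25 per page is the figure stated above. It rounds up with integer arithmetic, which matters when the job count is an exact multiple of the page size: truncating and then adding one would request one page too many in that case.

```python
def pages_needed(total_jobs, per_page=25):
    """Return how many result pages hold `total_jobs` listings."""
    if total_jobs <= 0:
        return 0
    # Ceiling division without floats: add (per_page - 1) before dividing.
    return (total_jobs + per_page - 1) // per_page


print(pages_needed(552))  # 22 full pages plus one partial page
print(pages_needed(550))  # exactly 22 full pages
```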

Find out how many jobs are listed for the search query:

In [193]:
jobs_found = driver.find_element(By.XPATH, '//h1[@class="resultlist-zf9fsu at-facets-header-title"]').text
jobs_found

'552 Software Engineer jobs in Manchester + 30 miles'

In [194]:
max_jobs_no = driver.find_element(By.XPATH, '//h1[@class="resultlist-zf9fsu at-facets-header-title"]/span')
max_jobs = int(max_jobs_no.text)
max_jobs

552
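
The count can also be read from the heading text itself, which avoids depending on the inner `span`. This is a minimal sketch, assuming the heading always starts with the number as in the output above. `parse_job_count` is a hypothetical helper, and the comma handling is an assumption about how larger counts might be displayed, not something observed on the site.

```python
import re


def parse_job_count(heading):
    """Extract the leading job count from a results heading string."""
    match = re.match(r"\s*(\d[\d,]*)", heading)
    if match is None:
        raise ValueError(f"no leading number in heading: {heading!r}")
    # Strip any thousands separators before converting (assumed format).
    return int(match.group(1).replace(",", ""))


print(parse_job_count("552 Software Engineer jobs in Manchester + 30 miles"))
```

Raising `ValueError` when no number is found makes a changed page layout fail loudly instead of producing a wrong page count.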

Get the number of the last page to scrape for this search:

In [195]:
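# ceiling division: int(max_jobs / 25) + 1 would add an extra page when max_jobs is a multiple of 25
last_page_no = (max_jobs + 24) // 25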
last_page_no

23

Now scrape each page and store the data for each listed job in the respective lists:

In [196]:
job_list, company_list, location_list, salary_list, job_details_list = [], [], [], [], []

print('last_page_no is', last_page_no)
print()

for page in range(last_page_no):
  print('Page', page+1)

  more_button = driver.find_elements(By.XPATH, '//span[@class="ExpandButtonText-sc-1v4cep4-2 jHPToM"]')

  # click every "more" button on the page
  for button in more_button:
    button.click()

  box = driver.find_elements(By.CSS_SELECTOR, 'div.sc-fznNTe.kxgehf')

  for item in box:
    # print('------------------------')
    job_title = item.find_element(By.CSS_SELECTOR, 'h2').text
    # print(job_title)
    company = item.find_element(By.CSS_SELECTOR, 'div.sc-fzoiQi.kuzZTz').text
    # print(company)
    
    location = item.find_element(By.CSS_SELECTOR, 'li').text
    salary = item.find_element(By.CSS_SELECTOR, 'dl').text

    # use find_elements because some jobs have two 'a span' elements; the last one is taken as the job details
    l_job_details = item.find_elements(By.CSS_SELECTOR, 'a span')
    job_details = l_job_details[-1].text
    # print(job_details)
    # print('------------------------')

    job_list.append(job_title)
    company_list.append(company)
    location_list.append(location)
    salary_list.append(salary)
    job_details_list.append(job_details)

  # named next_button so the built-in next() is not shadowed
  next_button = driver.find_element(By.XPATH, '//a[@aria-label="Next"]')
  next_button.click()
  print('NEXT CLICKED')
  time.sleep(3)
  
driver.quit()  
print('ALL DONE')
print(len(job_list))
print(len(company_list))

last_page_no is 23

Page 1
NEXT CLICKED
Page 2
NEXT CLICKED
Page 3
NEXT CLICKED
Page 4
NEXT CLICKED
Page 5
NEXT CLICKED
Page 6
NEXT CLICKED
Page 7
NEXT CLICKED
Page 8
NEXT CLICKED
Page 9
NEXT CLICKED
Page 10
NEXT CLICKED
Page 11
NEXT CLICKED
Page 12
NEXT CLICKED
Page 13
NEXT CLICKED
Page 14
NEXT CLICKED
Page 15
NEXT CLICKED
Page 16
NEXT CLICKED
Page 17
NEXT CLICKED
Page 18
NEXT CLICKED
Page 19
NEXT CLICKED
Page 20
NEXT CLICKED
Page 21
NEXT CLICKED
Page 22
NEXT CLICKED
Page 23
NEXT CLICKED
ALL DONE
575
575


## Handling the scraped data

In [197]:
data = list(zip(job_list, company_list, location_list, salary_list, job_details_list))
# checking the data of first 2 jobs listed
data[:2]

[('Software Engineer - iOS',
  'BBC',
  'Salford, Greater Manchester',
  'Competitive Salary + Benefits',
  "Job Introduction We are looking for Mobile Software Engineers to join the Sport App team. BBC Sport App is one of the UK's most well-known and loved brands, and we're looking for passionate team members to join our collaborative, agile, iOS team. We welcome applications from all, regardless of age, gender, ethnicity, disability, sexuality, social background, religion and/or belief. As a Software Engineer for the Sport App team, you will have the opportunity to join an engineering team that delivers an intuitive and engaging sport-oriented experience to millions of audience members every day. To be successful in this role you will need a good understanding of object-oriented programming, clean architecture, and test-driven development."),
 ('Software Engineer',
  'Ultimate Performance Fitness',
  'M1, Manchester',
  'From £50,000 to £60,000 per annum',
  'Location- based in the h

Trim the data list to the number of jobs found for the search query, dropping the surplus entries scraped from the last page:

In [198]:
data = data[:max_jobs]
len(data)

552
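
One thing to keep in mind with `zip` is that it stops at the shortest input, so if one of the lists ever ended up shorter than the others, rows would be dropped silently. The sketch below shows a more defensive version of the zip-and-trim step. `build_rows` is a hypothetical helper name and the sample values are made up for illustration.

```python
def build_rows(columns, limit=None):
    """Zip equal-length columns into row tuples, optionally keeping the first `limit`."""
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    rows = list(zip(*columns))
    return rows if limit is None else rows[:limit]


# Illustrative values only, not scraped data.
titles = ["Job A", "Job B", "Job C"]
companies = ["Company A", "Company B", "Company C"]
print(build_rows([titles, companies], limit=2))
```

In the notebook, the five lists and `max_jobs` would take the place of the sample columns and `limit`.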

In [199]:
file = pd.DataFrame(data, columns = ['job_title', 'company', 'location', 'salary', 'job_description'])

# Start the index at 1 so that it matches the job's position in the list
file.index = np.arange(1, len(file) + 1)
file

Unnamed: 0,job_title,company,location,salary,job_description
1,Software Engineer - iOS,BBC,"Salford, Greater Manchester",Competitive Salary + Benefits,Job Introduction We are looking for Mobile Sof...
2,Software Engineer,Ultimate Performance Fitness,"M1, Manchester","From £50,000 to £60,000 per annum",Location- based in the heart of Manchester Sal...
3,Software Engineer,Precisely Software Limited,UK,Competitive,Precisely is the leader in data integrity. We ...
4,Software Engineer - Manchester - £57k,DGH Recruitment Ltd,"Manchester, Greater Manchester",£45000 - £57000 per annum,My client is currently recruiting for an exper...
5,Graduate Software Engineer,ITECCO Limited,"Chorley, Lancashire",£20000 - £30000 per annum,Are you ready to kick start your career as a G...
...,...,...,...,...,...
548,Project Manager (HR Systems and Change),Reed Technology,"Manchester, Greater Manchester",£475 - £525 per day,Manchester/Remote (will need to go to site) £4...
549,Senior IT Project Manager,Calisen,"Spring Gardens, M2 1HW","Up to £55,000 per annum",An exciting opportunity has become available f...
550,Quality & Compliance Manager,Morson Talent,"The Oaks Business Park, M23 9SS",Market related,As the Quality & Compliance Manager you will b...
551,Project Test Lead,Corecom Consulting,"M2, Manchester","Up to £55,000 per annum",Are you an accomplished Project Test Lead with...


In [200]:
# check that there are no missing values in the dataframe

file.isnull().sum()

job_title          0
company            0
location           0
salary             0
job_description    0
dtype: int64
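
Note that `isnull()` only counts missing values such as `None` or `NaN`. An element that exists but contains no text comes back from Selenium's `.text` as an empty string, which is not counted. A complementary check on the raw tuples can be sketched as follows; `count_blank_fields` is a hypothetical helper and the sample rows are made up for illustration.

```python
def count_blank_fields(rows, columns):
    """Count None, empty, or whitespace-only values per column."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            if value is None or not str(value).strip():
                counts[name] += 1
    return counts


columns = ["job_title", "company", "location", "salary", "job_description"]
# Illustrative rows only, not scraped data.
sample = [
    ("Job A", "Company A", "Town A", "Competitive", "Some description"),
    ("Job B", "", "Town B", "   ", None),
]
print(count_blank_fields(sample, columns))
```

Running it on `data` with the same column names would show whether any field was scraped as blank text.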

In [201]:
file.to_csv('Jobsite.csv', index=False)