The objective of this project is to build a web scraper that finds jobs matching my query, "junior data (analyst, engineer)", in the UK, Denmark, or Norway.

In [1]:
# Libraries
import csv                        # to export data
from datetime import datetime     # retrieve the current date
import requests                   # send HTTP requests to retrieve the .html
from bs4 import BeautifulSoup     # parse and extract data from Indeed.com
import pandas as pd               # collect the results in a DataFrame

Indeed URLs look like this:

https://uk.indeed.com/jobs?q=data+engineer&l=United+Kingdom&fromage=7

The values after "q=" (the query) and "l=" (the location) will be filled in by a function, so that it is easy to search for different queries.

In [2]:
# Function which creates URL according to our query
def get_url_indeed(position, location):
    # URL template
    template_url = 'https://uk.indeed.com/jobs?q={}&l={}&fromage=3'
    # allow arguments to be inserted into URL
    final_url = template_url.format(position, location).replace(" ", "+")
    return final_url

# Demonstration
print('Demonstration URL:\n',
      get_url_indeed(position = 'data analyst',
                     location = 'United Kingdom'))

Demonstration URL:
 https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3
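
Replacing spaces with "+" works for simple queries, but a query containing a character such as "&" would break the URL. As a minimal sketch of an alternative, the standard library's `urlencode` can build the query string and escape such characters. The function name `build_indeed_url` is hypothetical, and the `fromage` value is simply passed through as it appears in the template above.

```python
from urllib.parse import urlencode

def build_indeed_url(position, location, fromage=3):
    # Hypothetical alternative to get_url_indeed.
    # urlencode turns spaces into '+' and percent-encodes characters
    # such as '&' that would otherwise split the query string.
    base_url = 'https://uk.indeed.com/jobs'
    params = {'q': position, 'l': location, 'fromage': fromage}
    return base_url + '?' + urlencode(params)

print(build_indeed_url('data analyst', 'United Kingdom'))
print(build_indeed_url('research & development', 'United Kingdom'))
```

For a plain query, this produces the same URL as the demonstration above; the second call shows "&" being encoded as "%26".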


Now to build a demo web scraper. For this we need to:
- Get the **response** for the URL, using the *requests* library.
- Use the *Beautiful Soup* library to 1) **parse** and then 2) **extract** the data.

We wish to extract information about:
- Job title.
- Company.
- Location.
- Date of posting.
- Summary.
- Salary.
- Original query.
- URL.

In [4]:
# Retrieving the response
demo_response = requests.get(get_url_indeed(position = 'accountant',
                                            location = 'united kingdom'))
# Parsing the response
demo_parse = BeautifulSoup(demo_response.text, 'html.parser')
# Get all job cards which appear on a webpage
demo_cards = demo_parse.find_all('div',                    # all 'div' elements ...
                                 'jobsearch-SerpJobCard')  # ... with this CSS class
print('The number of job cards on each page is',len(demo_cards))
print('\nSelecting the first 5 jobs:\n')
# Extracting information from the first 5 entries (job cards) on the webpage.
for i in range(5):
    demo_title = demo_cards[i].h2.a.get('title')
    print('Job title:', demo_title)
    demo_company = demo_cards[i].find('span', 'company').text.strip()
    print('Company:', demo_company)
    demo_location = demo_cards[i].find('div', 'recJobLoc').get('data-rc-loc')
    print('Location:', demo_location)
    demo_date = demo_cards[i].find('span', 'date date-a11y').text
    print('Date:', demo_date, 'on', datetime.today().strftime('%d-%m-%Y'))
    demo_summary = demo_cards[i].find('div', 'summary').text.strip().replace('\n',' ')
    print('Summary:',demo_summary)
    try:
        demo_salary = demo_cards[i].find('span','salaryText').text.strip()
    except AttributeError:
        demo_salary = 'N/A'
    print('Salary:',demo_salary)
    demo_url = demo_cards[i].h2.a.get('href')
    print('URL:', 'https://www.indeed.com' + demo_url)
    print('')

The number of job cards on each page is 15

Selecting the first 5 jobs:

Job title: Technical Accountant NHS AfC: Band 8a
Company: St Helens & Knowsley Teaching Hospitals
Location: Prescot
Date: Just posted on 16-06-2021
Summary: Full time - 37.5 hours per week (Working patterns may be negotiable). The Technical Accountant’s primary purposes are to develop and enact policies, systems,…
Salary: £45,753 - £51,668 a year
URL: https://www.indeed.com/rc/clk?jk=5ed52d00fbc95353&fccid=b937bf49aac15d18&vjs=3

Job title: Trainee Accountant
Company: Rothmans
Location: Southampton
Date: 2 days ago on 16-06-2021
Summary: We are currently looking for an enthusiastic graduate to join our team as a Trainee Accountant based at our Southampton office in Chilworth. If you are a high…
Salary: N/A
URL: https://www.indeed.com/rc/clk?jk=bfc703ae8ae965e6&fccid=93d0c73593dc9935&vjs=3

Job title: Assistant Accountant
Company: SSE plc
Location: Perthshire
Date: 1 day ago on 16-06-2021
Summary: Working Pattern: 

# Full-code
Let's now create a function which can be combined with the URL-maker function to simplify things. Two aspects must be improved: **1)** handling errors where a job card has no value for a field; **2)** moving on to the next page of results if there is one.
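
For the first point, one possible approach is a small helper that returns a default when an element is missing, instead of wrapping each field in its own `try`/`except`. This is only a sketch: the helper name `text_or_default` and the simplified HTML snippet are assumptions made for illustration, not Indeed's actual markup.

```python
from bs4 import BeautifulSoup

def text_or_default(card, tag, css_class, default='-'):
    # Return the stripped text of the first matching element,
    # or `default` if the card has no such element.
    element = card.find(tag, css_class)
    return element.get_text(strip=True) if element is not None else default

# Hypothetical, simplified job card used only for this illustration.
sample_html = '''
<div class="jobsearch-SerpJobCard">
  <span class="company"> Example Ltd </span>
</div>
'''
sample_card = BeautifulSoup(sample_html, 'html.parser').find('div', 'jobsearch-SerpJobCard')

print(text_or_default(sample_card, 'span', 'company'))      # element present
print(text_or_default(sample_card, 'span', 'salaryText'))   # element missing, prints the default
```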

In [17]:
# Function which retrieves useful information
def get_job_indeed(job_card):
    job_title = job_card.h2.a.get('title')
    job_company = job_card.find('span', 'company').text.strip()
    job_location = job_card.find('div', 'recJobLoc').get('data-rc-loc')
    job_date = job_card.find('span', 'date date-a11y').text + ' on ' + datetime.today().strftime('%d-%m-%Y')
    job_summary = job_card.find('div', 'summary').text.strip().replace('\n',' ')
    try:
        job_salary = job_card.find('span','salaryText').text.strip()
    except AttributeError:
        job_salary = '-'
    job_url = 'https://uk.indeed.com' + job_card.h2.a.get('href')
    # Description: to get the full description we have to follow a link to the job's own page
    desc_template = 'https://www.indeed.com/viewjob?jk={}'
    desc_data_jk = job_card.get('data-jk')
    description_url = desc_template.format(desc_data_jk)
    response_desc = requests.get(description_url)
    soup_desc = BeautifulSoup(response_desc.text, 'html.parser')

    try:
        job_description = soup_desc.find('div', 'jobsearch-jobDescriptionText').text.strip().replace('\n', ' ')
    except AttributeError:
        job_description = '-'

    job_extract = (job_title, job_company, job_location, job_date, job_summary, job_salary, job_description, job_url)
    return job_extract
# Demonstration
get_job_indeed(demo_cards[0])

('Technical Accountant NHS AfC: Band 8a',
 'St Helens & Knowsley Teaching Hospitals',
 'Prescot',
 'Just posted on 16-06-2021',
 'Full time - 37.5 hours per week (Working patterns may be negotiable). The Technical Accountant’s primary purposes are to develop and enact policies, systems,…',
 '£45,753 - £51,668 a year',
 "Main area Finance Grade NHS AfC: Band 8a Contract Permanent Hours Full time - 37.5 hours per week (Working patterns may be negotiable) Job ref 409-S3189108 Site Nightingale House, Whiston Hospital Town Prescot Salary £45,753 - £51,668 per annum Salary period Yearly Closing 20/06/2021 23:59 Interview date 29/06/2021 Job overview  A vacancy has arisen for a talented and proactive member of the Trust’s Financial Services department. The Technical Accountant’s primary purposes are to develop and enact policies, systems, models and processes in Financial Services, to take a strategic and analytical approach towards transactions including aged debts and debt management, and t

In [None]:
def main_function_jobs(position, location):
    job_records = []
    # Part 1: Retrieve the appropriate URL using get_url_indeed
    url = get_url_indeed(position, location)
    # Part 2: Parsing and extracting the data
    while True:     # Required to go through multiple pages
        print(url)
        # Retrieving the response
        response = requests.get(url)
        # Parsing the response
        parse = BeautifulSoup(response.text, 'html.parser')
        # Retrieving job cards found on webpage(s)
        cards = parse.find_all('div', 'jobsearch-SerpJobCard')

        for card in cards:
            job_record = get_job_indeed(card)
            job_records.append(job_record)

        # Follow the 'Next' link if there is one; otherwise this was the last page
        try:
            url = 'https://uk.indeed.com' + parse.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break
    return job_records
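
The page-following loop relies on the last page having no "Next" link. As a hedged sketch of the same idea with two safeguards added (a cap on the number of pages and a check for URLs already visited), the loop can be separated from the network code. The names `collect_pages`, `fetch_page`, and `fake_site` are hypothetical; `fetch_page` stands in for the request-and-parse step and is assumed to return the records on a page together with the next URL, or `None` when there is none.

```python
def collect_pages(first_url, fetch_page, max_pages=50):
    # fetch_page(url) is a hypothetical callable returning (records, next_url),
    # where next_url is None on the last page.
    all_records = []
    seen_urls = set()
    url = first_url
    # Stop on the last page, on a repeated URL, or once max_pages pages were read.
    while url is not None and url not in seen_urls and len(seen_urls) < max_pages:
        seen_urls.add(url)
        records, url = fetch_page(url)
        all_records.extend(records)
    return all_records

# Stand-in for the network: three fake pages, the last one linking back to the first.
fake_site = {
    'page1': (['job A', 'job B'], 'page2'),
    'page2': (['job C'], 'page3'),
    'page3': (['job D'], 'page1'),
}
print(collect_pages('page1', fake_site.get))
```

Because the loop takes the fetching step as an argument, it can be tried out on fake pages like these without sending any requests.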

In [None]:
def webscrape_jobs(level, position, location):
    # Run the web scraper for every combination of level, position and location.
    # Each argument is a list of strings, e.g. level = ['', 'junior ', 'graduate '].
    records = []
    for loc in location:
        for pos in position:
            for lvl in level:
                records.extend(main_function_jobs(position = lvl + pos, location = loc))
    # N.B. I may wish to filter the data, by title / description, at this stage - TBC
    # The column names follow the order of the tuple returned by get_job_indeed
    columns = ['job_title','job_company','job_location','job_date',
               'job_summary','job_salary','job_description','job_url']
    return pd.DataFrame(records, columns = columns)

In [None]:
# Input job titles and locations I am interested in
level_list = ['','junior ','graduate ']
position_list = ['data analyst', 'data engineer', 'machine learning']
location_list = ['United Kingdom']

# Run the web scraper
webscrape_jobs(level = level_list, position = position_list, location = location_list)
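
To see which searches this call will run, the combinations of level, position and location can be listed without touching the network. The sketch below uses `itertools.product` as one way to enumerate them; the variable name `queries` is introduced only for this illustration.

```python
from itertools import product

level_list = ['', 'junior ', 'graduate ']
position_list = ['data analyst', 'data engineer', 'machine learning']
location_list = ['United Kingdom']

# One (search phrase, location) pair per combination; level varies fastest.
queries = [(level + position, location)
           for location, position, level in product(location_list, position_list, level_list)]

print(len(queries))
for query in queries[:3]:
    print(query)
```

With the lists above this gives 9 searches. A search for "data analyst" may also return junior and graduate roles, so the combined results could contain duplicates.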

Excellent. I now have a DataFrame containing a list of jobs that match my specification. ✌

The next stage in this project would be to automate running it, using PyCharm and a PostgreSQL database. I could email the results to myself...
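
Before moving to a database, a simple intermediate step could be exporting the records with the `csv` module imported at the top. The following is a minimal sketch under assumptions: the function name `write_jobs_csv`, the sample record, and its URL are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

COLUMNS = ['job_title', 'job_company', 'job_location', 'job_date',
           'job_summary', 'job_salary', 'job_description', 'job_url']

def write_jobs_csv(job_records, file_obj):
    # Write a header row followed by one row per job tuple.
    writer = csv.writer(file_obj)
    writer.writerow(COLUMNS)
    writer.writerows(job_records)

# Hypothetical record, for illustration only.
sample_records = [('Data Analyst', 'Example Ltd', 'Leeds', 'Just posted on 16-06-2021',
                   'Short summary', '-', 'Longer description', 'https://uk.indeed.com/example')]

buffer = io.StringIO()   # stands in for open('jobs.csv', 'w', newline='', encoding='utf-8')
write_jobs_csv(sample_records, buffer)
print(buffer.getvalue())
```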