The objective of this project is to build a web scraper that finds jobs matching my query of "junior data (analyst, engineer)" in UK, Danish, or Norwegian locations.

In [1]:
# Libraries
import csv                        # to export data
from datetime import datetime     # retrieve current date
import requests                   # send requests to retrieve .html
from bs4 import BeautifulSoup     # parse and extract data from Indeed.com
import pandas as pd

Indeed URLs look like this:

https://uk.indeed.com/jobs?q=data+engineer&l=United+Kingdom&fromage=7

The "q=" (query) and "l=" (location) parameters will be filled in by a function, so it is easy to search for different queries.

In [2]:
# Function which creates URL according to our query
def get_url_indeed(position, location):
    # URL template
    template_url = 'https://uk.indeed.com/jobs?q={}&l={}&fromage=3'
    # allow arguments to be inserted into URL
    final_url = template_url.format(position, location).replace(" ", "+")
    return (final_url)

# Demonstration
print('Demonstration URL:\n',
      get_url_indeed(position = 'data analyst',
                     location = 'United Kingdom'))

Demonstration URL:
 https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3
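
Replacing spaces with "+" works for simple queries, but a query containing characters such as "&" or "#" would break the URL. A minimal sketch of an alternative, assuming the same three parameters as the template above: `build_indeed_url` is a hypothetical name, not part of this notebook, and it relies on the standard library's `urlencode` to escape each argument.

```python
from urllib.parse import urlencode

def build_indeed_url(position, location, fromage=3,
                     base='https://uk.indeed.com/jobs'):
    """Hypothetical variant of get_url_indeed that percent-encodes its arguments."""
    params = {'q': position, 'l': location, 'fromage': fromage}
    # urlencode turns spaces into '+' and escapes reserved characters such as '&'
    return base + '?' + urlencode(params)

print(build_indeed_url('data analyst', 'United Kingdom'))
# -> https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3
print(build_indeed_url('R&D engineer', 'United Kingdom'))
# -> https://uk.indeed.com/jobs?q=R%26D+engineer&l=United+Kingdom&fromage=3
```

For plain queries the result is identical to the output of `get_url_indeed`, so the two are interchangeable in the rest of the notebook.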


Now, to build a demo web scraper. For this we need to:
- Get the **response** for the URL, using the *requests* library.
- Use the *Beautiful Soup* library to 1) **parse** and then 2) **extract** the data.

We wish to extract information about:
- Job title.
- Company.
- Location.
- Date of posting.
- Summary.
- Salary.
- Original query.
- URL.
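
The parse and extract steps can be tried offline before sending any request. The sketch below runs Beautiful Soup on a small hand-written HTML string; the markup and its values are made up for illustration and only imitate the class names that the selectors in this notebook look for.

```python
from bs4 import BeautifulSoup

# Hand-written HTML (an assumption) that imitates one job card
SAMPLE_HTML = """
<div class="jobsearch-SerpJobCard">
  <h2><a title="Junior Data Analyst" href="/rc/clk?jk=example">Junior Data Analyst</a></h2>
  <span class="company"> Example Ltd </span>
  <div class="recJobLoc" data-rc-loc="Leeds"></div>
  <div class="summary">Build reports.</div>
</div>
"""

# 1) Parse the HTML text into a searchable tree
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# 2) Extract: find every <div> whose class is 'jobsearch-SerpJobCard'
cards = soup.find_all('div', 'jobsearch-SerpJobCard')
card = cards[0]

title = card.h2.a.get('title')                              # attribute value
company = card.find('span', 'company').text.strip()         # element text
location = card.find('div', 'recJobLoc').get('data-rc-loc')
salary_tag = card.find('span', 'salaryText')                # None: no such element

print(title, '|', company, '|', location, '|', salary_tag)
# -> Junior Data Analyst | Example Ltd | Leeds | None
```

Note that `find` returns `None` when nothing matches, which is why calling `.text` on a missing salary raises `AttributeError`.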

In [3]:
# Retrieving the response
demo_response = requests.get(get_url_indeed(position = 'accountant',
                                            location = 'united kingdom'))
# Parsing the response
demo_parse = BeautifulSoup(demo_response.text, 'html.parser')
# Get all job cards which appear on a webpage
demo_cards = demo_parse.find_all('div',                    # all 'div' tags
                                 'jobsearch-SerpJobCard')  # with this CSS class
print('The number of job cards on each page is',len(demo_cards))
print('\nSelecting the first 5 jobs:\n')
# Extracting information from the first 5 entries (job cards) on the webpage.
for i in range(5):
    demo_title = demo_cards[i].h2.a.get('title')
    print('Job title:', demo_title)
    demo_company = demo_cards[i].find('span', 'company').text.strip()
    print('Company:', demo_company)
    demo_location = demo_cards[i].find('div', 'recJobLoc').get('data-rc-loc')
    print('Location:', demo_location)
    demo_date = demo_cards[i].find('span', 'date date-a11y').text
    print('Date:', demo_date, 'on', datetime.today().strftime('%d-%m-%Y'))
    demo_summary = demo_cards[i].find('div', 'summary').text.strip().replace('\n',' ')
    print('Summary:',demo_summary)
    try:
        demo_salary = demo_cards[i].find('span','salaryText').text.strip()
    except AttributeError:
        demo_salary = 'N/A'
    print('Salary:',demo_salary)
    demo_url = demo_cards[i].h2.a.get('href')
    print('URL:', 'https://www.indeed.com' + demo_url)
    print('')

The number of job cards on each page is 15

Selecting the first 5 jobs:

Job title: Assistant Management Accountant
Company: Coop
Location: Manchester
Date: 3 days ago on 13-06-2021
Summary: An ACA, CIMA or ACCA part qualified accountant. Up to £30,000 plus benefits (Grade F14). Remote / flexible working, with some weekly travel to Manchester city…
Salary: £30,000 a year
URL: https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0Ae8w1yikstuQNQleHQytgaCy3iZ06N5iqAjinK6cLir-wbdh19nDKZIAQTJRkhikUW4NH-V5C2vYe5n8F_7iG9aFUE8Ww3hWhfsbfyhUjyhYxaWcsn2VTcWY0PhiM8jPRFo8KuYqZLNhV9e-as6cOqvtyU3zDTXS_Oc_1YSP2zY6njDN42GqTnD3WPQvb9lLYRlPBRtOBm4CXancrqSdsBVGt09Cd86uAITXIeHEjpbweYqtGAlvfCDlkYjbaIH4hQ0cf2rsAwGTA_Kr5Tau47cRtpdxcTG3UM23_9Gh3ZGbXDdWLVJ6C_yg8rHvZhSYisUjubiosG_74_VLE9inoWfWF7TmryIkdDGjr3u_Cm5XqmGZ-2cw7TrTBRpEZlQQko6ytw6q1SrE--WSmS4D0v5xlEErwJ18zBmLrbzRB_8vzDIa_5KGrK-jFimrE_338=&p=0&fvj=0&vjs=3

Job title: Trainee Accountant
Company: Ballantyne & Co Chartered Accountants
Location: Glasgow
Date: 

# Full code
Let's now create a function which can be combined with the URL-maker function to simplify things. Two aspects must be improved upon: **1)** accounting for errors when a field has no value on a job card; **2)** moving on to the next page of results if there is one.
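
The first point applies to every field, not only the salary: `find` returns `None` when an element is missing, and any attribute access on it then raises `AttributeError`. A minimal sketch of one way to handle this uniformly is shown here. `safe_extract` and `EmptyCard` are hypothetical names introduced for illustration; `EmptyCard` is only a stand-in for a job card on which nothing matches.

```python
def safe_extract(extractor, default='-'):
    """Run extractor() and return default if the element or value is missing."""
    try:
        value = extractor()
    except (AttributeError, TypeError):
        # find() returned None, so .text, .get() or indexing failed
        return default
    return value if value not in (None, '') else default

class EmptyCard:
    """Stand-in (assumption) for a job card in which no element matches."""
    def find(self, *args, **kwargs):
        return None

card = EmptyCard()
company = safe_extract(lambda: card.find('span', 'company').text.strip())
salary = safe_extract(lambda: card.find('span', 'salaryText').text.strip(),
                      default='N/A')
print(company, salary)
# -> - N/A
```

Wrapping each extraction in a lambda delays it until it is inside the `try` block, so one helper covers every field.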

In [4]:
def get_job_indeed(job_card):
    job_title = job_card.h2.a.get('title')
    job_company = job_card.find('span', 'company').text.strip()
    job_location = job_card.find('div', 'recJobLoc').get('data-rc-loc')
    job_date = job_card.find('span', 'date date-a11y').text + ' on ' + datetime.today().strftime('%d-%m-%Y')
    job_summary = job_card.find('div', 'summary').text.strip().replace('\n',' ')
    try:
        job_salary = job_card.find('span','salaryText').text.strip()
    except AttributeError:
        job_salary = '-'
    job_url = 'https://uk.indeed.com' + job_card.h2.a.get('href')
    job_extract = (job_title, job_company, job_location, job_date, 
           job_summary, job_salary, job_url)
    return (job_extract)
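
The `job_date` field stores a relative phrase plus the scrape date, for example "3 days ago on 13-06-2021". If an absolute posting date is more useful, the relative part can be converted. The sketch below is an assumption about how that might be done: `posting_date` is a hypothetical helper, and it only recognises "Today" and "N day(s) ago", the phrases that appear in the output in this notebook.

```python
import re
from datetime import date, timedelta

def posting_date(relative_text, scraped_on):
    """Convert e.g. '3 days ago' into a date; return None if not recognised."""
    text = relative_text.strip().lower()
    if text == 'today':
        return scraped_on
    match = re.search(r'(\d+)\s+days?\s+ago', text)
    if match:
        return scraped_on - timedelta(days=int(match.group(1)))
    return None   # unknown wording: leave it to the caller to decide

scraped = date(2021, 6, 13)
print(posting_date('3 days ago', scraped))   # -> 2021-06-10
print(posting_date('Today', scraped))        # -> 2021-06-13
```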

In [5]:
def main_function_jobs(position, location):
    job_records = []
    # Part 1: Retrieve the appropriate URL using get_url_indeed
    url = get_url_indeed(position, location)
    # Part 2: Parsing and extracting the data
    while True:     # Required to go through multiple pages
        print(url)
        # Retrieving the response
        response = requests.get(url)
        # Parsing the response
        parse = BeautifulSoup(response.text, 'html.parser')
        # Retrieving job cards found on webpage(s)
        cards = parse.find_all('div', 'jobsearch-SerpJobCard')
        
        for i in cards:
            job_record = get_job_indeed(i)
            job_records.append(job_record)
        
        try:
            url = 'https://uk.indeed.com' + parse.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break
    return (job_records)
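
The `while True` loop in `main_function_jobs` requests pages as fast as it can and stops only when no "Next" link is found. A hedged sketch of a more defensive loop is given below: it pauses between requests, caps the number of pages, and stops if a URL repeats. `crawl_pages` and `fetch_page` are hypothetical names; `fetch_page(url)` is assumed to return the records on that page together with the next URL (or `None`), and a dictionary stands in for the website so the example runs offline.

```python
import time

def crawl_pages(start_url, fetch_page, max_pages=20, delay_seconds=1.0):
    """Follow 'next' links from start_url and return all records collected.

    fetch_page(url) must return (records, next_url); next_url is None on the
    last page.
    """
    records, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_records, next_url = fetch_page(url)
        records.extend(page_records)
        if next_url:
            time.sleep(delay_seconds)   # pause before requesting the next page
        url = next_url
    return records

# In-memory stand-in for the website (illustrative data only)
fake_site = {
    'page1': (['job A', 'job B'], 'page2'),
    'page2': (['job C'], 'page3'),
    'page3': (['job D'], None),
}
all_jobs = crawl_pages('page1', fake_site.get, delay_seconds=0)
print(all_jobs)
# -> ['job A', 'job B', 'job C', 'job D']
```

In the real scraper, `fetch_page` would wrap the request, the parsing, and the lookup of the "Next" link.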

In [6]:
def webscrape_jobs(level, position, location):
    # Run the web scraper for every combination of location, position and level.
    # Each argument is a list of strings (the original ignored them and read globals).
    columns = ['job_title','job_company','job_location','job_date','job_summary','job_salary','job_url']
    frames = []
    for loc in location:
        for pos in position:
            for lvl in level:
                records = main_function_jobs(position = lvl + pos, location = loc)
                frames.append(pd.DataFrame(records, columns = columns))
    # N.B. I may wish to filter the data, by title / description, at this stage - TBC
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(frames)

In [7]:
# Input job titles and locations I am interested in
level_list = ['','junior ','graduate ']
position_list = ['data analyst', 'data engineer', 'machine learning']
location_list = ['United Kingdom']

# Run web scraper
webscrape_jobs(level = level_list, position = position_list, location = location_list)

https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=10
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=20
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=30
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=40
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=50
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=60
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=70
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=80
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=90
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=100
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&start=110
https://uk.indeed.com/jobs?q=data+analyst&l=United+Kingdom&fromage=3&st

Unnamed: 0,job_title,job_company,job_location,job_date,job_summary,job_salary,job_url
0,Data Quality Analyst,"Wrightington, Wigan and Leigh Teaching Hospita...",Wigan,3 days ago on 13-06-2021,You will work very closely with all facets of ...,"£24,907 - £30,615 a year",https://uk.indeed.com/rc/clk?jk=c15a0fa529f61a...
1,Data Analyst,go-centric,Glasgow,2 days ago on 13-06-2021,An ability to draw conclusions from data. Abil...,"£23,000 - £25,000 a year",https://uk.indeed.com/company/Go--centric/jobs...
2,Data Quality Analyst,Sky,Isleworth,3 days ago on 13-06-2021,"Results-oriented, diligent, champions best pra...",-,https://uk.indeed.com/rc/clk?jk=483714a287915a...
3,Data Analyst - Telemetry,"JPMorgan Chase Bank, N.A.",Glasgow,Today on 13-06-2021,Strong analytical skills with a concentration ...,-,https://uk.indeed.com/rc/clk?jk=9a7188e02d214a...
4,Healthcare Analyst – Data Engineer,NHS Midlands and Lancashire Commissioning Supp...,London,2 days ago on 13-06-2021,Build and maintain data pipelines used to proc...,"£31,365 - £37,890 a year",https://uk.indeed.com/rc/clk?jk=8861d7f92a6f17...
...,...,...,...,...,...,...,...
0,Technology Graduate – 6-Month Internship,BearingPoint UK,London,1 day ago on 13-06-2021,The technology consulting team help clients ga...,-,https://uk.indeed.com/rc/clk?jk=12686108039bda...
1,Turing Talent Graduate Program - Women in Tech...,Turing Talent,London,3 days ago on 13-06-2021,Using deep learning for early diagnosis of dem...,"£30,000 - £50,000 a year",https://uk.indeed.com/company/Turing-Talent/jo...
2,Turing Talent Internship - Data Science & Mach...,Turing Talent,London,3 days ago on 13-06-2021,Using deep learning for early diagnosis of dem...,"£36,000 a year",https://uk.indeed.com/company/Turing-Talent/jo...
3,IT Support Engineer,Concirrus Ltd,London,2 days ago on 13-06-2021,Cutting edge risk models driven by the latest ...,-,https://uk.indeed.com/rc/clk?jk=35975c621aebde...
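
The `job_salary` column holds free text such as "£24,907 - £30,615 a year", which cannot be sorted or filtered numerically. A minimal sketch of one way to extract the figures follows; `parse_salary` is a hypothetical helper, and it deliberately ignores the pay period ("a year") and any pence.

```python
import re

def parse_salary(text):
    """Return (low, high) in whole pounds, or None if no amount is found."""
    amounts = [int(a.replace(',', '')) for a in re.findall(r'£(\d[\d,]*)', text)]
    if not amounts:
        return None
    return (min(amounts), max(amounts))

print(parse_salary('£24,907 - £30,615 a year'))   # -> (24907, 30615)
print(parse_salary('£36,000 a year'))             # -> (36000, 36000)
print(parse_salary('-'))                          # -> None
```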


Excellent. I now have a DataFrame which contains a list of jobs according to my specification. ✌
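
The `csv` module was imported at the top to export data but has not been used yet. A short sketch of how the job tuples could be written out is given here, under the assumption that each record has the seven fields returned by `get_job_indeed`. `write_jobs_csv` is a hypothetical name, the sample row is made up, and an in-memory buffer stands in for a file.

```python
import csv
import io

COLUMNS = ['job_title', 'job_company', 'job_location', 'job_date',
           'job_summary', 'job_salary', 'job_url']

def write_jobs_csv(job_records, file_obj):
    """Write a header row followed by one row per job tuple."""
    writer = csv.writer(file_obj)
    writer.writerow(COLUMNS)
    writer.writerows(job_records)

# Illustrative row only; not real scraped data
sample = [('Data Analyst', 'Example Ltd', 'Glasgow', 'Today on 13-06-2021',
           'Summary text', '£23,000 - £25,000 a year',
           'https://uk.indeed.com/example')]

buffer = io.StringIO()
write_jobs_csv(sample, buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])
```

To write a real file, pass `open('jobs.csv', 'w', newline='', encoding='utf-8')` instead of the buffer. Fields containing commas, such as the salary, are quoted automatically by `csv.writer`.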

The next stage in this project would be to automate running it, for example using PyCharm and a PostgreSQL database. I could also email the results to myself...
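
As a rough idea of what the storage step might look like, the sketch below uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs without a server. Everything here is an assumption: the table layout, the `store_jobs` name, and the choice of `job_url` as the key used to skip jobs already stored (tracking URLs may change between runs, in which case another key would be needed).

```python
import sqlite3

def store_jobs(connection, job_records):
    """Insert job tuples, skipping URLs already stored; return the row count."""
    connection.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_title TEXT, job_company TEXT, job_location TEXT,
               job_date TEXT, job_summary TEXT, job_salary TEXT,
               job_url TEXT PRIMARY KEY)"""
    )
    # INSERT OR IGNORE is SQLite syntax; PostgreSQL would use
    # INSERT ... ON CONFLICT DO NOTHING instead
    connection.executemany(
        "INSERT OR IGNORE INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?)", job_records
    )
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]

conn = sqlite3.connect(':memory:')
row = ('Data Analyst', 'Example Ltd', 'Glasgow', 'Today on 13-06-2021',
       'Summary text', '-', 'https://uk.indeed.com/example-1')   # made-up row
first_count = store_jobs(conn, [row])
second_count = store_jobs(conn, [row])   # same URL again, so nothing is added
print(first_count, second_count)
# -> 1 1
```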