# Overview

**Project Scenario**

The team at the recruitment agency is trying to improve its sourcing of job vacancies. The agency relies on multiple job posting sites to identify potential openings for its clients. However, manually searching through each site is time-consuming and often leads to missed opportunities.

We will use web scraping tools to automatically extract job posting data from job posting sites and then analyze it. The team will use the analysis to source vacancies more efficiently and serve its clients better. Getting relevant openings to clients more quickly gives them a competitive advantage over other applicants.

**Project Objectives**

- Increase the efficiency of job vacancy sourcing

- Improve the quality of job vacancy sourcing  

- Gain a competitive advantage



**The task** is to conduct a web scraping data analysis that automatically extracts job posting data from a job posting site. To do this, we set up an environment, identify the job posting site, scrape the data, and then process, analyze, and visualize it.

The site used for web scraping is myjobmag.com.

We will create a general-purpose job scraper for [www.myjobmag.com](https://www.myjobmag.com)

### Importing the important libraries to be used

In [None]:
### Import necessary Libraries

from bs4 import BeautifulSoup
import pandas as pd
import os
from datetime import datetime
import requests
import csv

### Getting the URL

Go to myjobmag.com, search for a job title, then copy the link up to where the location parameter ends

In [None]:
### Assign URL to variable
### Sample URL (Url = "https://www.myjobmag.com/search/jobs?q=Data+analyst&location=Lagos")

template = 'https://www.myjobmag.com/search/jobs?q={}&location={}'

In [None]:
def get_url(role, location):
  """Generate Url for role"""
  template = 'https://www.myjobmag.com/search/jobs?q={}&location={}'
  url = template.format(role, location)
  return url

In [None]:
url = get_url('Data Analyst', 'lagos')

url

'https://www.myjobmag.com/search/jobs?q=Data Analyst&location=lagos'
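
The generated URL above still contains a raw space. `requests` will usually percent-encode it when sending the request, but the query string can also be encoded explicitly. A minimal sketch, assuming the site accepts `+` for spaces as in the sample URL; `build_search_url` and `BASE_SEARCH_URL` are hypothetical names, not part of the notebook:

```python
from urllib.parse import urlencode

# Assumed search endpoint, taken from the template used above
BASE_SEARCH_URL = 'https://www.myjobmag.com/search/jobs'

def build_search_url(role, location):
    """Build the search URL with encoded query parameters."""
    query = urlencode({'q': role, 'location': location})
    return f'{BASE_SEARCH_URL}?{query}'

print(build_search_url('Data Analyst', 'lagos'))
# https://www.myjobmag.com/search/jobs?q=Data+Analyst&location=lagos
```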

### Extract raw html

In [None]:
### Request URL from server
response = requests.get(url)

In [None]:
### Check response
response

<Response [200]>

In [None]:
### Reason for response
response.reason

'OK'

In [None]:
# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
### After inspecting the HTML, we can see that each job posting sits in
### an li element with class "job-info" (the title is in a nested li with class "mag-b")
### We use the "job-info" class to find all job cards

cards = soup.find_all('li', 'job-info')

In [None]:
### Check the length of the cards

len(cards)

18

### We can then prototype the model with a single record

In [None]:
### Get the first element in the job posting page (i.e. the first li element with class "job-info")
card = cards[0]
card

<li class="job-info">
<ul>
<li class="mag-b">
<h2><a href="/job/data-analyst-buybeta-investment-limited">Data Analyst at Buybeta Investment Limited</a></h2>
</li>
<li class="job-desc">
Job Summary
The Data Analyst will support the skincare company's data needs by analyzing sales, inventory, and market trends to optimize product availability, streamline operations, and enha </li>
<li class="job-item">
<ul>
<li id="job-date">09 October</li>
<li id="job-duration">
</li>
</ul>
</li>
</ul>
</li>

In [None]:
### Get the a tag, where the job title is located
atag = card.h2.a
atag

<a href="/job/data-analyst-buybeta-investment-limited">Data Analyst at Buybeta Investment Limited</a>

In [None]:
### Extract the job title from the atag
job_title = atag.text.split(' at ')[0].strip()
job_title

'Data Analyst'

In [None]:
### Extract the job URL from the href attribute and prepend the website address
job_url = 'https://www.myjobmag.com/' + atag.get('href')
job_url

'https://www.myjobmag.com//job/data-analyst-buybeta-investment-limited'
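
The result above has a double slash after the domain because both the base string and the href contribute a `/`. One way to avoid this is `urllib.parse.urljoin`, which resolves the href against the base. A small sketch; `absolute_job_url` and `BASE_URL` are hypothetical names used only for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.myjobmag.com/'

def absolute_job_url(href):
    """Resolve a relative href from the page against the site root."""
    return urljoin(BASE_URL, href)

print(absolute_job_url('/job/data-analyst-buybeta-investment-limited'))
# https://www.myjobmag.com/job/data-analyst-buybeta-investment-limited
```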

In [None]:
### Extract company name
company = atag.text.split(' at ')[1].strip()
company

'Buybeta Investment Limited'
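
Indexing the result of `split(' at ')` raises an `IndexError` when a title has no `' at '` in it, and it picks the wrong parts when the role itself contains `' at '`. A possible defensive variant, assuming the company is whatever follows the last `' at '` (an assumption that will not hold for every listing); `split_title` is a hypothetical helper:

```python
def split_title(text):
    """Split 'Role at Company' on the last ' at '; company is None if absent."""
    title, separator, company = text.rpartition(' at ')
    if not separator:
        # No ' at ' found: treat the whole text as the job title
        return text.strip(), None
    return title.strip(), company.strip()

print(split_title('Data Analyst at Buybeta Investment Limited'))
# ('Data Analyst', 'Buybeta Investment Limited')
print(split_title('Data Analyst'))
# ('Data Analyst', None)
```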

In [None]:
### Use the li tag to find the description
job_description = card.find('li', 'job-desc').text.replace('\n', '')
job_description

"Job SummaryThe Data Analyst will support the skincare company's data needs by analyzing sales, inventory, and market trends to optimize product availability, streamline operations, and enha "

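Removing newlines outright glues neighbouring words together, as in "Job SummaryThe" above. A small sketch of an alternative that collapses every run of whitespace into a single space; `clean_text` is a hypothetical helper, not part of the notebook:

```python
def clean_text(text):
    """Collapse all whitespace runs (including newlines) into single spaces."""
    return ' '.join(text.split())

print(clean_text('\nJob Summary\nThe Data Analyst will support '))
# Job Summary The Data Analyst will support
```
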
In [None]:
### Use the li id="job-date" to find the date job was posted
post_date = card.find('li', id='job-date').text
post_date

'09 October'
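
The posting date carries no year, which makes it awkward to sort or filter later. A rough sketch of turning it into a full date, assuming the format is always day plus English month name and that a date falling after the scrape date belongs to the previous year. `parse_post_date` is a hypothetical helper, and 29 February is not handled:

```python
from datetime import date, datetime

def parse_post_date(text, scraped_on):
    """Parse e.g. '09 October' into a date, inferring the year from the scrape date."""
    parsed = datetime.strptime(f'{text.strip()} {scraped_on.year}', '%d %B %Y').date()
    if parsed > scraped_on:
        # A posting cannot be dated after the scrape, so assume last year
        parsed = parsed.replace(year=scraped_on.year - 1)
    return parsed

print(parse_post_date('09 October', date(2024, 10, 15)))
# 2024-10-09
```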

In [None]:
### Get the current date to record when the scraping was done
today = datetime.today().strftime('%Y-%m-%d')
today

'2024-10-15'

### Putting everything together into a model with a function

In [None]:
### Put all the extraction steps into a function that returns one record per card

def get_record(card):
  """Extract job data from a single record"""
  atag = card.h2.a
  job_title = atag.text.split(' at ')[0].strip()
  job_url = 'https://www.myjobmag.com/' + atag.get('href')
  company = atag.text.split(' at ')[1].strip()
  job_description = card.find('li', 'job-desc').text.replace('\n', '')
  post_date = card.find('li', id='job-date').text
  today = datetime.today().strftime('%Y-%m-%d')

  record = (job_title, job_url, company, job_description, post_date, today)

  return record

In [None]:
### Create an empty list
records = []

### Then iterate through the cards on the page
for card in cards:
  record = get_record(card)
  records.append(record)

In [None]:
records[0]

('Data Analyst',
 'https://www.myjobmag.com//job/data-analyst-buybeta-investment-limited',
 'Buybeta Investment Limited',
 "Job SummaryThe Data Analyst will support the skincare company's data needs by analyzing sales, inventory, and market trends to optimize product availability, streamline operations, and enha ",
 '09 October',
 '2024-10-15')

In [None]:
soup.find('ul', 'setPaginate')

<ul class="setPaginate"><li><a class="current_page">1</a></li><li><a href="/search/jobs?q=Data%20Analyst&amp;location=lagos¤tpage=2">2</a></li><li><a href="/search/jobs?q=Data%20Analyst&amp;location=lagos¤tpage=3">3</a></li><li><a href="/search/jobs?q=Data%20Analyst&amp;location=lagos¤tpage=4">4</a></li><li><a href="/search/jobs?q=Data%20Analyst&amp;location=lagos¤tpage=5">5</a></li></ul>

In [None]:
### The parser decodes "&curren" inside "&currentpage" as the ¤ symbol, so we restore it
url = 'https://www.myjobmag.com' + soup.find('ul', 'setPaginate').find('a', 'current_page').parent.find_next_sibling('li').a.get('href').replace('¤tpage', '&currentpage')
url

'https://www.myjobmag.com/search/jobs?q=Data%20Analyst&location=lagos&currentpage=2'
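
The `¤tpage` text in the pagination links is most likely an HTML entity side effect: `&curren` is a legacy entity name for `¤`, so `&currentpage` gets decoded as `¤` followed by `tpage`. The sketch below reproduces the effect with the standard library and shows a repair that does not depend on the location name. The raw href is an assumption about what the page source contains, and `repair_href` is a hypothetical helper:

```python
import html

# Assumed raw href as it might appear in the page source
raw_href = '/search/jobs?q=Data%20Analyst&location=lagos&currentpage=2'

# Entity decoding turns '&curren' into the currency sign (U+00A4)
decoded = html.unescape(raw_href)
print(decoded)

def repair_href(href):
    """Restore the '&currentpage' parameter regardless of the location value."""
    return href.replace('\xa4tpage', '&currentpage')

print(repair_href(decoded))
```

Replacing only the `¤tpage` fragment keeps the fix valid for any location, not just `lagos`.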


### Getting to the next page

The model created so far only collects details from the first page.
We now need a way to move on to the next page:

- First, get the href for the next page
- Then iterate through each page
- Then use get_record to add each page's details to records

In [None]:
### Follow the "next page" link until there are no more pages

while True:
  try:
    url = 'https://www.myjobmag.com' + soup.find('ul', 'setPaginate').find('a', 'current_page').parent.find_next_sibling('li').a.get('href').replace('¤tpage', '&currentpage')
  except AttributeError:
    # No next-page link was found, so the last page has been reached
    break
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  cards = soup.find_all('li', 'job-info')

  for card in cards:
    record = get_record(card)
    records.append(record)

In [None]:
len(records)

1054
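
A loop that runs until a lookup fails has no upper bound and sends requests back to back. The sketch below shows one possible shape for a bounded, throttled pagination loop. Fetching and parsing are passed in as a function so the loop can be tried without network access; `crawl_pages`, `fetch_page`, `max_pages` and `delay_seconds` are all hypothetical names, not part of the notebook:

```python
import time

def crawl_pages(start_url, fetch_page, max_pages=100, delay_seconds=1.0):
    """Collect records from consecutive result pages.

    fetch_page(url) is supplied by the caller and returns
    (records_on_page, next_url), with next_url set to None on the last page.
    """
    records = []
    visited = set()
    url = start_url
    # Stop on the last page, on a repeated URL, or at the page cap
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_records, url = fetch_page(url)
        records.extend(page_records)
        if url:
            time.sleep(delay_seconds)  # pause between requests
    return records

# Usage with an in-memory stand-in for the site
fake_site = {
    'page1': (['job A', 'job B'], 'page2'),
    'page2': (['job C'], None),
}
print(crawl_pages('page1', fake_site.__getitem__, delay_seconds=0))
# ['job A', 'job B', 'job C']
```

In the scraper, `fetch_page` would wrap the request, the `find_all` call and the next-page lookup shown above.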

### Putting it all together

In [52]:
import csv
import pandas as pd
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import os
import re

def get_url(role, location):
    """Generate Url for role"""
    template = 'https://www.myjobmag.com/search/jobs?q={}&location={}'
    url = template.format(role, location)
    return url

def get_record(card):
    """Extract job data from a single record"""
    atag = card.h2.a
    job_title = atag.text.split(' at ')[0].strip()
    job_url = 'https://www.myjobmag.com/' + atag.get('href')
    company = atag.text.split(' at ')[1].strip()
    job_description = card.find('li', 'job-desc').text.replace('\n', '')
    post_date = card.find('li', id='job-date').text
    today = datetime.today().strftime('%Y-%m-%d')
    record = (job_title, job_url, company, job_description, post_date, today)
    return record

def main(role, location):
    """Run the main program routine"""
    records = []
    url = get_url(role, location)

    while True:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('li', 'job-info')

        for card in cards:
            record = get_record(card)
            records.append(record)

        try:
            # Restore "&currentpage" for any location, not only 'lagos'
            url = 'https://www.myjobmag.com' + soup.find('ul', 'setPaginate').find('a', 'current_page').parent.find_next_sibling('li').a.get('href').replace('¤tpage', '&currentpage')
        except AttributeError:
            break

    # Create a DataFrame from the list of records
    df = pd.DataFrame(records, columns=['job_title', 'job_url', 'company', 'job_description', 'post_date', 'scrap_date'])

    # Check if the CSV file already exists
    if os.path.exists('jobs_listing_data_analyst_151024.csv'):
        # Load the existing DataFrame from the CSV file
        existing_df = pd.read_csv('jobs_listing_data_analyst_151024.csv')
        # Concatenate the existing DataFrame with the new DataFrame
        df = pd.concat([existing_df, df], ignore_index=True)
        # Remove duplicate rows
        df = df.drop_duplicates()

    # Save DataFrame to CSV file
    df.to_csv('jobs_listing_data_analyst_151024.csv', index=False)
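
Calling `drop_duplicates()` without arguments compares whole rows, so a job scraped on two different days differs in the scrape date column and is kept twice. The plain-Python sketch below shows the idea of deduplicating on the job URL instead, keeping the first record seen. `merge_records` is a hypothetical helper, and the record layout is assumed to match the tuples built by `get_record` (URL at index 1):

```python
def merge_records(existing, new, key_index=1):
    """Merge two record lists, keeping the first record seen for each job URL."""
    merged = {}
    for record in list(existing) + list(new):
        # setdefault keeps the earlier record when the key already exists
        merged.setdefault(record[key_index], record)
    return list(merged.values())

old = [('Data Analyst', 'https://example.com/job/1', 'Acme', 'desc', '09 October', '2024-10-15')]
fresh = [
    ('Data Analyst', 'https://example.com/job/1', 'Acme', 'desc', '09 October', '2024-10-16'),
    ('Risk Analyst', 'https://example.com/job/2', 'Beta', 'desc', '10 October', '2024-10-16'),
]
print(len(merge_records(old, fresh)))
# 2
```

With pandas, a comparable effect should be achievable with `df.drop_duplicates(subset='job_url')`.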

In [53]:
def get_alphabetic_input(prompt):
    """Get user input and ensure it contains only alphabetic characters and spaces"""
    while True:
        user_input = input(prompt)
        if re.match(r'^[a-zA-Z\s]+$', user_input):
            return user_input
        else:
            print("Please enter only alphabetic characters and spaces.")

# Run the main program
if __name__ == "__main__":
    job_title = get_alphabetic_input("Enter the job title (e.g., data analyst): ")
    location = get_alphabetic_input("Enter the location (e.g., lagos): ")
    main(job_title, location)

Enter the job title (e.g., data analyst): data analyst
Enter the location (e.g., lagos): lagos


In [54]:
### Read the CSV file that was created
jobs_listing_data = pd.read_csv('jobs_listing_data_analyst_151024.csv')
jobs_listing_data.tail()

Unnamed: 0,job_title,job_url,company,job_description,post_date,scrap_date
1048,Planning Analyst,https://www.myjobmag.com//job/planning-analyst...,Hobark International Limited (HIL),Job DescriptionBudgeting and Forecasting: Supp...,16 April,2024-10-15
1049,Risk and Collection Analyst,https://www.myjobmag.com//job/risk-and-collect...,HRD Solutions,Job DescriptionAs a Risk and Collection Analys...,15 April,2024-10-15
1050,Credit Risk Analyst,https://www.myjobmag.com//job/credit-risk-anal...,Coolbucks,Role DescriptionThis is a full-time on-site ro...,15 April,2024-10-15
1051,Legal Analyst,https://www.myjobmag.com//job/legal-analyst-rs...,RS Hunter Limited,Interested and qualified candidates should apply,15 April,2024-10-15
1052,Corporate Finance Analyst,https://www.myjobmag.com//job/corporate-financ...,RS Hunter Limited,Interested and qualified candidates should apply,15 April,2024-10-15
