# Overview

**Project Scenario**

The team at the recruitment agency wants to improve how it sources job vacancies. The agency relies on multiple job posting sites to identify potential openings for its clients. However, manually searching through each site is time-consuming and often leads to missed opportunities.

We will use web scraping tools to automatically extract job posting data from multiple job posting sites and then analyze that data. The team will use the analysis to source job vacancies more efficiently and serve its clients better. Relevant openings will reach the agency's clients more quickly, giving them a competitive advantage over other applicants.

**Project Objectives**

- Increase the efficiency of job vacancy sourcing

- Improve the quality of job vacancy sourcing  

- Gain a competitive advantage



**The task** is to conduct a web scraping data analysis that automatically extracts job posting data from a job posting site. This involves setting up an environment, identifying the job posting site, scraping the data, and then processing, analyzing, and visualizing it.

### The site to be used in our web scraping is shine.com

We will create a general-purpose job scraper for [www.shine.com](https://www.shine.com).

### Importing the important libraries to be used

In [181]:
import csv
import pandas as pd
from datetime import datetime
import requests
from bs4 import BeautifulSoup

### Connect to Google Drive

In [182]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Getting the URL

Go to shine.com, search for a job title and location, then copy the link up to where the location ends.

In [183]:
### Assign the URL template to a variable
### Replace the job title (in the path and in q=...) and the location (in the path and in loc=...) with {} placeholders

template = 'https://www.shine.com/job-search/{}-jobs-in-{}?q={}&loc={}'

In [184]:
def get_url(position, location):
    """Generate a URL from position and location"""
    template = 'https://www.shine.com/job-search/{}-jobs-in-{}?q={}&loc={}'
    url = template.format(position, location, position, location)
    return url
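
The template inserts the search terms verbatim, so a position such as `'Big Data Analyst'` places spaces directly into the URL. Below is a minimal sketch of a variant that encodes the terms first. The name `get_encoded_url` is hypothetical, and the lowercase hyphenated slug in the path is an assumption about how shine.com formats its search URLs, not something this notebook confirms.

```python
from urllib.parse import quote, quote_plus

def get_encoded_url(position, location):
    """Build a search URL with URL-safe path and query parts (hypothetical helper)."""
    template = 'https://www.shine.com/job-search/{}-jobs-in-{}?q={}&loc={}'
    # Assumption: the path segment uses lowercase words joined by hyphens
    position_slug = quote('-'.join(position.lower().split()), safe='')
    location_slug = quote('-'.join(location.lower().split()), safe='')
    # quote_plus percent-encodes special characters and turns spaces into '+'
    return template.format(position_slug, location_slug,
                           quote_plus(position), quote_plus(location))

print(get_encoded_url('Big Data Analyst', 'Bangalore'))
# https://www.shine.com/job-search/big-data-analyst-jobs-in-bangalore?q=Big+Data+Analyst&loc=Bangalore
```

Whether the site returns the same results for this form as for the unencoded one would need to be checked in the browser before switching over.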

In [185]:
url = get_url('Big Data Analyst', 'Bangalore')

### Extract raw html

In [186]:
### Request URL from server
response = requests.get(url)

In [187]:
### Check response
response

<Response [200]>

In [188]:
### Reason for response
response.reason

'OK'
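
Checking `response` and `response.reason` by hand works in a notebook, but a script needs the check to happen automatically. The sketch below wraps the request in a hypothetical helper, `fetch_html`, that sets a timeout and raises an exception on an error status. The User-Agent value is illustrative only, and whether shine.com requires one is an assumption.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx response."""
    if session is None:
        session = requests.Session()
    # Assumption: an explicit User-Agent; the value here is only an example
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; job-scraper-example)'}
    response = session.get(url, headers=headers, timeout=timeout)
    # raise_for_status() turns error status codes into exceptions
    response.raise_for_status()
    return response.text
```

Passing the session in as an argument lets several page requests reuse one connection, and makes the helper easy to exercise with a stand-in object.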

In [189]:
# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [190]:
### Inspecting the page HTML shows that every job posting sits inside
### a div with the class jobCard_jobCard__jjUmu,
### which we use to find all job cards

cards = soup.find_all('div', 'jobCard_jobCard__jjUmu')

In [191]:
len(cards)

15

### We can then prototype the model with a single record

In [192]:
### Get the first element in the job posting page (i.e. the first div element)
card = cards[0]

In [193]:
### Get the a tag in which the job title is located
atag = card.h2.a
atag

<a href="/jobs/data-analyst-permanent/diraa-hr-services/14054678">Data Analyst</a>

In [194]:
### Extract the job title from the atag
job_title = atag.text
job_title

'Data Analyst'

In [195]:
### Extract the job URL from the href attribute and append it to the website string
job_url = 'https://www.shine.com' + atag.get('href')
job_url

'https://www.shine.com/jobs/data-analyst-permanent/diraa-hr-services/14054678'

In [196]:
### Extract company name
company = card.find('div', 'jobCard_jobCard_cName__mYnow').span.text
company

'DIRAA HR SERVICES Hiring For MNCs'

In [197]:
### Use the div tag to find the location
job_location = card.find('div', 'jobCard_locationIcon__zrWt2').text
job_location

'Bangalore+3Chennai, Hyderabad, Coimbatore'
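
The location text comes back fused: the first city, a `+3` counter, and the remaining cities run together. A minimal sketch of splitting it into a list follows. The helper name `split_locations` is hypothetical, and the pattern assumes every fused value has the form `<first place>+<count><other places, comma-separated>`, which is inferred from this one record.

```python
import re

def split_locations(text):
    """Split a fused location string into a list of place names."""
    # Assumption: fused values look like 'First+<count>Second, Third, ...'
    match = re.match(r'^(.*?)\+\d+(.*)$', text)
    if not match:
        # No '+<count>' marker: treat the whole text as a single location
        return [text.strip()]
    first, rest = match.groups()
    others = [part.strip() for part in rest.split(',') if part.strip()]
    return [first.strip()] + others

print(split_locations('Bangalore+3Chennai, Hyderabad, Coimbatore'))
# ['Bangalore', 'Chennai', 'Hyderabad', 'Coimbatore']
```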

In [198]:
### Getting the job type
job_type = card.find('ul', 'jobCard_jobCard_jobDetail__jD82J').text
job_type

'Regular10 Positions'

In [199]:
### Then get how long ago the job was posted
post_date = card.find('div', 'jobCard_jobCard_features__wJid6').find_all('span')[-1].text
post_date

'2 months ago'
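
A relative value such as `'2 months ago'` is hard to sort or filter on. The sketch below converts it to an approximate calendar date. The helper `approx_post_date` is hypothetical, a month is assumed to be 30 days and a year 365 days, and only the `<number> <unit> ago` wording seen in this notebook is handled; any other wording returns `None`.

```python
import re
from datetime import date, timedelta

# Rough day counts per unit; the month and year lengths are approximations
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def approx_post_date(text, today=None):
    """Convert text like '3 days ago' into an approximate date, or None."""
    if today is None:
        today = date.today()
    match = re.fullmatch(r'(\d+)\s+(day|week|month|year)s?\s+ago',
                         text.strip().lower())
    if not match:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return today - timedelta(days=count * DAYS_PER_UNIT[unit])

print(approx_post_date('2 months ago', today=date(2024, 3, 19)))
# 2024-01-19
```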

In [200]:
### Get current date
today = datetime.today().strftime('%Y-%m-%d')
today

'2024-03-19'

In [201]:
### We also need to get the years of experience needed
try:
  work_experience = card.find('div', 'jobCard_jobCard_lists_item__YxRkV jobCard_jobIcon__3FB1t').text
except AttributeError:
  work_experience = ''
work_experience

'0 to 1 Yr'
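
For analysis, the experience range is easier to work with as two numbers. A minimal sketch, assuming every value follows the `<min> to <max> Yr(s)` wording seen here; `parse_experience` is a hypothetical helper name.

```python
import re

def parse_experience(text):
    """Return (min_years, max_years) from text like '0 to 1 Yr'."""
    match = re.search(r'(\d+)\s*to\s*(\d+)\s*Yrs?', text, flags=re.IGNORECASE)
    if not match:
        # Empty or unexpected text: no usable range
        return (None, None)
    return (int(match.group(1)), int(match.group(2)))

print(parse_experience('0 to 1 Yr'))
# (0, 1)
```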

### Putting everything together into a model with a function

In [202]:
def get_record(card):
  """Extract job data from a single record"""
  atag = card.h2.a
  job_title = atag.text
  job_url = 'https://www.shine.com' + atag.get('href')
  company = card.find('div', 'jobCard_jobCard_cName__mYnow').span.text
  job_location = card.find('div', 'jobCard_locationIcon__zrWt2').text
  job_type = card.find('ul', 'jobCard_jobCard_jobDetail__jD82J').text
  post_date = card.find('div', 'jobCard_jobCard_features__wJid6').find_all('span')[-1].text
  today = datetime.today().strftime('%Y-%m-%d')
  try:
    work_experience = card.find('div', 'jobCard_jobCard_lists_item__YxRkV jobCard_jobIcon__3FB1t').text
  except AttributeError:
    work_experience = ''

  record = (job_title, job_url, company, job_location, job_type, post_date, today, work_experience)

  return record
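
In `get_record`, only the work experience lookup is guarded; if any other element is missing from a card, `card.find(...)` returns `None` and the `.text` access raises `AttributeError`. The sketch below shows one way to guard every lookup with a single hypothetical helper, `text_or_default`. It also joins nested text with a space, which avoids fused values such as `'Regular10 Positions'`. The HTML is made up for the demonstration and is not shine.com's real markup.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, name, class_name, default=''):
    """Return the text of the first matching tag, or default if it is missing."""
    tag = parent.find(name, class_=class_name)
    if tag is None:
        return default
    # Join the text of nested tags with a space and strip surrounding whitespace
    return tag.get_text(' ', strip=True)

# Hypothetical markup, for illustration only
html = '''
<div class="card">
  <ul class="detail"><li>Regular</li><li>10 Positions</li></ul>
</div>
'''
demo_card = BeautifulSoup(html, 'html.parser').find('div', class_='card')
print(text_or_default(demo_card, 'ul', 'detail'))          # Regular 10 Positions
print(repr(text_or_default(demo_card, 'div', 'company')))  # ''
```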

### create a list of records

In [203]:
### Create an empty list
records = []

### then iterate through site using the cards
for card in cards:
  record = get_record(card)
  records.append(record)

In [204]:
records[0]

('Data Analyst',
 'https://www.shine.com/jobs/data-analyst-permanent/diraa-hr-services/14054678',
 'DIRAA HR SERVICES Hiring For MNCs',
 'Bangalore+3Chennai, Hyderabad, Coimbatore',
 'Regular10 Positions',
 '2 months ago',
 '2024-03-19',
 '0 to 1 Yr')

### Getting to the next page

The model created so far only collects details from the first page of results.
We now need a way to move on to the following pages:

- First, we get the href of the 'Next' page link
- Then we iterate through each page
- Then we add the details from each page to `records`

In [205]:
### Keep following the 'Next' link until no further page is found

while True:
  try:
    url = 'https://www.shine.com/job-search/' + soup.find('div', 'jsrpcomponent_pagination__gBo0y').find('a', {'title': 'Next'}).get('href')
  except (AttributeError, TypeError):
    # No pagination block, no 'Next' link, or no href: this is the last page
    break
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  cards = soup.find_all('div', 'jobCard_jobCard__jjUmu')

  for card in cards:
    record = get_record(card)
    records.append(record)

In [206]:
len(records)

15
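
The loop above runs until no 'Next' link is found, with no upper bound on the number of requests. Below is a minimal sketch of the same idea with a page cap and a guard against revisiting a URL. The names `scrape_all_pages` and `fetch_page` are hypothetical: `fetch_page` stands for any function that takes a URL and returns that page's records together with the next URL (or `None`), so the control flow can be tried without touching the network.

```python
def scrape_all_pages(start_url, fetch_page, max_pages=20):
    """Follow 'next' links until they run out, repeat, or max_pages is reached."""
    records = []
    seen = set()
    url = start_url
    pages = 0
    while url and url not in seen and pages < max_pages:
        seen.add(url)
        # fetch_page returns (records_on_this_page, next_url_or_None)
        page_records, url = fetch_page(url)
        records.extend(page_records)
        pages += 1
    return records

# Stand-in pages instead of real requests
fake_site = {
    'page-1': (['job A', 'job B'], 'page-2'),
    'page-2': (['job C'], 'page-3'),
    'page-3': (['job D'], None),
}
print(scrape_all_pages('page-1', lambda u: fake_site[u]))
# ['job A', 'job B', 'job C', 'job D']
print(scrape_all_pages('page-1', lambda u: fake_site[u], max_pages=2))
# ['job A', 'job B', 'job C']
```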

### Putting it all together

In [207]:
import csv
import pandas as pd
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import os

def get_url(position, location):
    """Generate a URL from position and location"""
    template = 'https://www.shine.com/job-search/{}-jobs-in-{}?q={}&loc={}'
    url = template.format(position, location, position, location)
    return url

def get_record(card):
  """Extract job data from a single record"""
  atag = card.h2.a
  job_title = atag.text
  job_url = 'https://www.shine.com' + atag.get('href')
  company = card.find('div', 'jobCard_jobCard_cName__mYnow').span.text
  job_location = card.find('div', 'jobCard_locationIcon__zrWt2').text
  job_type = card.find('ul', 'jobCard_jobCard_jobDetail__jD82J').text
  post_date = card.find('div', 'jobCard_jobCard_features__wJid6').find_all('span')[-1].text
  today = datetime.today().strftime('%Y-%m-%d')
  try:
    work_experience = card.find('div', 'jobCard_jobCard_lists_item__YxRkV jobCard_jobIcon__3FB1t').text
  except AttributeError:
    work_experience = ''

  record = (job_title, job_url, company, job_location, job_type, post_date, today, work_experience)

  return record

def main(position, location):
  """Run the main program routine"""
  records = []
  url = get_url(position, location)

  while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    cards = soup.find_all('div', 'jobCard_jobCard__jjUmu')

    for card in cards:
      record = get_record(card)
      records.append(record)

    try:
      url = 'https://www.shine.com/job-search/' + soup.find('div', 'jsrpcomponent_pagination__gBo0y').find('a', {'title': 'Next'}).get('href')
    except AttributeError:
      break

  # Create a DataFrame from the list of records
  df = pd.DataFrame(records, columns=['Job Title', 'Job URL', 'Company', 'Job Location', 'Job Type', 'Post Date', 'Scraping Date', 'Work Experience'])

  # Check if the CSV file already exists
  if os.path.exists('jobs_listing_data.csv'):
      # Load the existing DataFrame from the CSV file
      existing_df = pd.read_csv('jobs_listing_data.csv')
      # Concatenate the existing DataFrame with the new DataFrame
      df = pd.concat([existing_df, df], ignore_index=True)
      # Remove duplicate rows
      df = df.drop_duplicates()

  # Save DataFrame to CSV file
  df.to_csv('jobs_listing_data.csv', index=False)
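
Called without arguments, `drop_duplicates()` compares every column, including `Scraping Date`, so a posting that is scraped again on a later day is kept as a second row. A minimal sketch of de-duplicating on the job URL instead is shown below. The helper name `merge_listings` is hypothetical, and treating `Job URL` as a unique key for a posting is an assumption.

```python
import pandas as pd

def merge_listings(existing_df, new_df, key='Job URL'):
    """Combine old and new listings, keeping the latest row for each key value."""
    combined = pd.concat([existing_df, new_df], ignore_index=True)
    # keep='last' retains the most recently appended row for each job URL
    deduped = combined.drop_duplicates(subset=[key], keep='last')
    return deduped.reset_index(drop=True)

# Made-up rows for illustration
old = pd.DataFrame({'Job URL': ['u1', 'u2'],
                    'Scraping Date': ['2024-03-18', '2024-03-18']})
new = pd.DataFrame({'Job URL': ['u2', 'u3'],
                    'Scraping Date': ['2024-03-19', '2024-03-19']})
merged = merge_listings(old, new)
print(merged)
```

Here `u2` appears once in the result, with the later scraping date.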

#### Whenever new job postings are needed, we just enter the job title and location

In [208]:
### Run the main program
main('Big Data Analyst', 'India')

In [209]:
### Read the data back from the created CSV file
jobs_listing_data = pd.read_csv('jobs_listing_data.csv')
jobs_listing_data.head()

| | Job Title | Job URL | Company | Job Location | Job Type | Post Date | Scraping Date | Work Experience |
|---|---|---|---|---|---|---|---|---|
| 0 | Graphic Designer | https://www.shine.com/jobs/graphic-designer-pe... | Connexions | Jaipur+1Other Rajasthan | Regular | 3 days ago | 2024-03-19 | 1 to 5 Yrs |
| 1 | Graphics Designer | https://www.shine.com/jobs/graphics-designer-p... | Connexions | Jaipur+1Other Rajasthan | Regular | 3 days ago | 2024-03-19 | 2 to 6 Yrs |
| 2 | Assistant Graphic Designer | https://www.shine.com/jobs/assistant-graphic-d... | Scaleswift Digital Services Pvt. Lt... | All India | Regular | 3 weeks ago | 2024-03-19 | 1 to 2 Yrs |
| 3 | Graphic Designer | https://www.shine.com/jobs/graphic-designer-pe... | RK Websoft Technologies Pvt Ltd | Other Gujarat+1Anand | Regular | 1 day ago | 2024-03-19 | 2 to 6 Yrs |
| 4 | Graphic Designer | https://www.shine.com/jobs/graphic-designer-pe... | Sourcedesk Global | Kolkata+1Other West Bengal | Regular | 1 day ago | 2024-03-19 | 2 to 6 Yrs |
