## Introduction
This practical walks through scraping job listings from Indeed. Using Python, bs4 (Beautiful Soup), Selenium, and pandas, we'll extract information from indeed.com and build a pandas DataFrame. Before we begin, let's look briefly at what web scraping is.

Imagine you need to collect a lot of information about a topic from various web pages and articles and store it in a suitable format, for instance an Excel file. One way is to visit every website and copy the useful information into the spreadsheet by hand. Programmers tend to automate this with web scraping: the technique of extracting large amounts of data from web pages and storing it in a structured format.

## Scraping job details from Indeed
Indeed is one of the largest job listing portals. It is based in the United States and hosts millions of job listings from all over the world, posted by companies of every size, including startups. Scraping job details from Indeed gives you a large amount of information about different jobs, locations, actively hiring companies, ratings, and so on.

Here are the steps involved:

1. Install and import necessary modules
2. Send some basic queries, such as a job title or company name and a location, to the Indeed website using Selenium
3. Fetch the current URL after sending the queries to the website using Selenium
4. Parse the page using requests and Beautiful Soup
5. Fetch the information about job title, company name, salary, location, short description, date of posting, etc.
6. Store this information into a CSV file using pandas


## Load Libraries
First, we need to install the required modules and a ChromeDriver binary for Selenium. After downloading ChromeDriver, move it to the working directory.

Next, we import the libraries used in this practical. requests sends HTTP requests from Python, Selenium is a browser automation tool that we use to send queries to the website, and lxml is the HTML/XML parser that Beautiful Soup uses here. The bs4 module parses the web page, and pandas stores the data and writes it to a CSV file.



In [1]:
# Load packages
import requests 
import pandas as pd 
import time

from bs4 import BeautifulSoup 
from selenium import webdriver 
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By

## Sending job title and location using selenium
Now let's create a function that sends the queries to the web page and returns the resulting URL. The function opens Indeed at the URL passed as its first parameter, then uses Selenium to type the job title (or company name) and the location into the search form and submit it. The browser then lands on a results page listing the jobs that match the title and location you specified. Finally, the function returns the URL of that results page so that we can scrape it in the next step.

In [2]:
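from selenium.common.exceptions import WebDriverException

def get_current_url(url, job_title, location):
    # Path to the ChromeDriver binary; adjust it to where the driver is stored
    service = Service(executable_path="/chromedriver")
    driver = webdriver.Chrome(service=service)
    try:
        driver.get(url)
        time.sleep(3)  # give the page time to load
        # Fill in the "what" (job title) and "where" (location) search boxes
        driver.find_element(By.XPATH, '//*[@id="text-input-what"]').send_keys(job_title)
        time.sleep(3)
        driver.find_element(By.XPATH, '//*[@id="text-input-where"]').send_keys(location)
        time.sleep(3)
        # Click the first top-level div, away from the input boxes
        driver.find_element(By.XPATH, "/html/body/div").click()
        time.sleep(3)
        # The search button sits in one of two places depending on the page layout
        try:
            driver.find_element(By.XPATH, '//*[@id="jobsearch"]/button').click()
        except WebDriverException:
            driver.find_element(By.XPATH, '//*[@id="whatWhereFormId"]/div[3]/button').click()
        time.sleep(3)  # wait for the results page before reading its URL
        return driver.current_url
    finally:
        driver.quit()  # always close the browser, even if a step fails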

current_url = get_current_url("https://sg.indeed.com/", "Data Scientist", "Singapore")
print(current_url)

https://sg.indeed.com/jobs?q=Data%20Scientist&l=Singapore%20General%20Hospital&from=searchOnHP
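
The printed URL carries the search terms as query parameters: q for the job title and l for the location. As a small sketch, the standard library's urllib.parse can decode that URL and build a similar one. Note that build_search_url is a hypothetical helper that is not part of the original notebook, and whether Indeed accepts a directly constructed URL without going through the search form is an assumption.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# URL printed by get_current_url() above
result_url = (
    "https://sg.indeed.com/jobs?q=Data%20Scientist"
    "&l=Singapore%20General%20Hospital&from=searchOnHP"
)

# Decode the query string into a dict that maps each parameter to a list of values
params = parse_qs(urlsplit(result_url).query)
print(params["q"][0])  # Data Scientist
print(params["l"][0])  # Singapore General Hospital


def build_search_url(base_url, job_title, location):
    """Hypothetical helper: build a search URL from a job title and a location."""
    # quote_via=quote encodes spaces as %20, matching the URL shown above
    query = urlencode({"q": job_title, "l": location}, quote_via=quote)
    return f"{base_url.rstrip('/')}/jobs?{query}"


print(build_search_url("https://sg.indeed.com/", "Data Scientist", "Singapore"))
# https://sg.indeed.com/jobs?q=Data%20Scientist&l=Singapore
```

The decoded location differs from the "Singapore" string that was passed to the function, which suggests the site's search box adjusted the input. Inspecting the parameters this way makes such surprises visible before scraping.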


## Scraping jobs using Beautiful Soup
Now let's get to the scraping part. As in the previous lab, we can try fetching the page with requests and parsing it with BeautifulSoup. Try the following code and look at what comes back.


In [3]:
resp = requests.get(current_url)
content = BeautifulSoup(resp.content, 'lxml')
print(content)

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]--><!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]--><!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="noindex, nofollow" name="robots"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="/cdn-cgi/styles/cf.errors.css" id="cf_styles-css" rel="stylesheet"/>
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->
<style>body{margin:0;padding:0}</style>
<!--[if gte IE 10]><!-->
<script>
  if (!navigator.cookieEnabled) {
    window.addEventListener('DOMContentLoaded', function () {
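
The response above is not a job listing. Its title is "Attention Required! | Cloudflare", which means the plain requests call was answered with a block page. Below is a minimal sketch of checking for this before parsing. looks_blocked is a hypothetical helper, and matching on the title text is a heuristic based only on the output shown here.

```python
from bs4 import BeautifulSoup


def looks_blocked(html):
    """Heuristic: True if the page title matches the block page seen above."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    return "Attention Required" in title


# Shortened stand-ins for a blocked response and an ordinary page
blocked_html = "<html><head><title>Attention Required! | Cloudflare</title></head></html>"
normal_html = "<html><head><title>Data Scientist Jobs</title></head></html>"

print(looks_blocked(blocked_html))  # True
print(looks_blocked(normal_html))   # False
```

In the notebook this could be called as looks_blocked(resp.text) right after the request, so that a blocked response is noticed before any parsing is attempted.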

In [4]:
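from selenium.common.exceptions import NoSuchElementException

def scrape_job_details(url):
    # Path to the ChromeDriver binary; adjust it to where the driver is stored
    service = Service(executable_path="/chromedriver")
    driver = webdriver.Chrome(service=service)

    jobs_list = []
    try:
        driver.get(url)
        # Each job card on the results page carries the class "job_seen_beacon"
        content = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

        for post in content:
            try:
                data = {
                    "job_title": post.find_element(By.CLASS_NAME, 'jobTitle').text,
                    "company": post.find_element(By.CLASS_NAME, 'companyName').text,
                    "salary": post.find_element(By.CLASS_NAME, 'attribute_snippet').text,
                    "location": post.find_element(By.CLASS_NAME, 'companyLocation').text,
                    "date": post.find_element(By.CLASS_NAME, 'date').text,
                    "job_desc": post.find_element(By.CLASS_NAME, 'job-snippet').text
                }
            except NoSuchElementException:
                # find_element raises this when a field is missing; skip that card
                continue
            jobs_list.append(data)
    finally:
        driver.quit()  # always close the browser, even if a step fails

    return pd.DataFrame(jobs_list)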


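Because the plain requests call returned a Cloudflare "Attention Required!" page instead of job listings, this function loads the page in a real browser with Selenium. driver.get() opens the URL in that browser; it does not return the page data itself. The next step is to locate elements by their class names and read the text inside them with .text. The class names in the code matched the page when this was written, but the site may change them at any time.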
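
Not every job card necessarily has every field (the salary, for example, may be absent), and class names can change. As a sketch of one way to tolerate that, the HTML that Selenium has loaded (available as driver.page_source) can be handed to Beautiful Soup, whose find method returns None for a missing element instead of raising an exception. The class names are the ones used in the function above. The sample HTML and the parse_job_cards helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Field name -> class name, taken from scrape_job_details() above
FIELDS = {
    "job_title": "jobTitle",
    "company": "companyName",
    "salary": "attribute_snippet",
    "location": "companyLocation",
    "date": "date",
    "job_desc": "job-snippet",
}


def parse_job_cards(html):
    """Hypothetical helper: one dict per job card, None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all(class_="job_seen_beacon"):
        row = {}
        for field, class_name in FIELDS.items():
            element = card.find(class_=class_name)
            row[field] = element.get_text(" ", strip=True) if element else None
        jobs.append(row)
    return jobs


# Made-up sample standing in for driver.page_source; it has no salary element
sample_html = """
<div class="job_seen_beacon">
  <h2 class="jobTitle">Data Scientist</h2>
  <span class="companyName">Example Pte. Ltd.</span>
  <div class="companyLocation">Singapore</div>
  <span class="date">Posted 2 days ago</span>
  <div class="job-snippet">Build models.</div>
</div>
"""

jobs = parse_job_cards(sample_html)
print(jobs[0]["job_title"])  # Data Scientist
print(jobs[0]["salary"])     # None
```

With this approach a card that lacks a salary still produces a row, with None in that column, rather than being dropped.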

By looping through all the job posts we collect the details of each one. Finally, we convert the list into a pandas DataFrame and return it. It contains the job title, company name, salary, location, date of posting, and a short job description. You can save it as a CSV file using df_jobs.to_csv("jobs.csv", index=False).


In [5]:
df_jobs = scrape_job_details(current_url)
df_jobs.to_csv("jobs.csv", index=False)
df_jobs.head(20)

| | job_title | company | salary | location | date | job_desc |
|---|---|---|---|---|---|---|
| 0 | Cybersecurity Data Analyst – Data Analytics an... | Info-communications Media Development Authority | Full-time\n+2 | Singapore | Posted\nPosted 30+ days ago | At least 1 year of experience in data transfor... |
| 1 | Quality Engineer (Video Analytics), Data Scien... | GVT Government Technology Agency (GovTech) | Full-time | Remote in Singapore | Posted\nPosted 30+ days ago | Develop and maintain records on all test suite... |
| 2 | Data Scientist | SINGAPORE TELECOMMUNICATIONS LIMITED | $5,500 - $11,000 a month | Singapore | Posted\nPosted 2 days ago | Communicate data insights and findings to wide... |
| 3 | Data Scientist - New Start-up! | SCAYLER PTE. LTD. | Full-time\n+4 | Singapore | Posted\nPosted 18 days ago | Create future NLP technology that is powering ... |
| 4 | Data Scientist | SPH MEDIA LIMITED | $6,000 - $12,000 a month | Singapore | Posted\nPosted 3 days ago | You are familiar with SQL and data stores such... |
| 5 | Data Scientist | HUBBLE PTE. LTD. | $5,000 - $7,000 a month | Singapore | Posted\nToday | Able to assess the effectiveness and accuracy ... |
| 6 | Data Scientist (Transportation) | GATEWAY SEARCH PTE. LTD. | $9,000 - $12,000 a month | Singapore | Posted\nPosted 30+ days ago | Experience applying data science in end-to-end... |
| 7 | Data Scientist | TANGSPAC CONSULTING PTE LTD | $8,000 - $12,000 a month | Singapore | Posted\nPosted 15 days ago | Min 4 years of experienced working as data sci... |
| 8 | Data Scientist | SWIRE SHIPPING PTE. LTD. | $5,500 - $11,000 a month | Singapore | Posted\nPosted 4 days ago | Minimum 10 year’s experience in data profiling... |
| 9 | Data Scientist | AMBITION GROUP SINGAPORE PTE. LTD. | $10,000 - $19,000 a month | Singapore | Posted\nPosted 30+ days ago | Ability to apply statistical tests to large da... |
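
The date values above repeat the word "Posted" and contain a newline. Below is a hedged sketch of tidying that column with pandas string methods. The two date strings are copied from the output above, and the cleaning rule (keep the last line, then drop a leading "Posted") is an assumption about how these strings are shaped.

```python
import pandas as pd

# Two date strings copied from the scraped output above
df = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Scientist"],
        "date": ["Posted\nPosted 2 days ago", "Posted\nToday"],
    }
)

# Keep only the last line of each value, then drop a leading "Posted " if present
last_line = df["date"].str.split("\n").str[-1]
df["date_clean"] = last_line.str.replace(r"^Posted\s+", "", regex=True)

print(df["date_clean"].tolist())  # ['2 days ago', 'Today']
```

The same pattern could be applied to df_jobs before calling to_csv, so that the saved file holds the cleaned values.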
