## Job Scraping with Python using Selenium and BeautifulSoup
### Target site: jobstreet.co.id

I target elements using XPath expressions and class names. By the time you try this, the XPaths or class names may differ from the ones used here, because jobstreet appears to use something like styled-components, which generates seemingly random class names.
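
Because generated class names can change at any time, selectors built on tag names, attributes, or URL patterns may last longer than selectors built on classes. Below is a minimal sketch of that idea. The sample markup, the `article` wrapper, and the `/job/` and `/companies/` href prefixes are assumptions made for illustration, not jobstreet's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names imitate generated hashes, while the
# tag names and href patterns are assumptions made for this illustration.
SAMPLE_HTML = """
<article>
  <a class="_18qlyvc12 sx2jih0" href="/job/123"><span class="sx2jih0">Data Engineer</span></a>
  <a class="sx2jihe _2sRFr" href="/companies/acme">Acme</a>
  <time class="zcydq82q" datetime="2022-01-31T00:00:00Z">1 day ago</time>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS attribute selectors match on href prefixes instead of class names.
job_link = soup.select_one('a[href^="/job/"]')
company_link = soup.select_one('a[href^="/companies/"]')

# A tag that carries a datetime attribute can be found without any class.
posted = soup.find("time", attrs={"datetime": True})

print(job_link.get_text(strip=True),
      company_link.get_text(strip=True),
      posted["datetime"])
```

Whether such attributes are stable on the real site has to be checked in the browser's developer tools before relying on them.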

In [None]:
# Import libraries

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time

In [None]:
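# Prepare Selenium options
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()

# Start Chrome with a maximized window
options.add_argument('--start-maximized')

# Selenium 4 takes the driver path through a Service object
# (the positional path argument was deprecated and later removed)
driver = webdriver.Chrome(service=Service('./driver/chromedriver.exe'), options=options)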

Open the target site with Selenium and wait 5 seconds. The 5-second pause is there to reduce the risk of the IP address being blocked.

In [None]:
driver.get('https://www.jobstreet.co.id')
time.sleep(5)
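
A fixed 5-second pause can also be replaced by a slightly randomized one. The sketch below uses a hypothetical helper, `polite_pause`, which is not part of Selenium. Whether a randomized delay actually lowers the chance of being blocked depends on the site and is an assumption here.

```python
import random
import time


def polite_pause(min_seconds=4.0, max_seconds=6.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Usage: call it wherever the notebook calls time.sleep(5).
waited = polite_pause(0.0, 0.01)  # tiny bounds so the demo finishes instantly
print(f"paused for {waited:.3f}s")
```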

Find the search input using XPath and type in the target keyword.

In [None]:
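# find_element_by_xpath was removed in Selenium 4.3; use find_element with By.XPATH
from selenium.webdriver.common.by import By

job_position_input = driver.find_element(By.XPATH, '//*[@id="searchKeywordsField"]')
job_position_input.send_keys('Data Engineer')

Find the search button and click it to list the jobs that match the keyword, then wait another 5 seconds.

In [None]:
# click() returns None, so keep the element lookup and the click as separate steps
search_button = driver.find_element(By.XPATH, '//*[@id="contentContainer"]/div/div[1]/div/div/div/div[2]/div/form/div/div/div[2]/div[4]/button')
search_button.click()
time.sleep(5)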

## Prepare DataFrame using Pandas Library
This scrape collects the following fields: job title, job link, company, company link, location, salary, and posting date.

In [None]:
job_data = pd.DataFrame({'Link': [],
                          'Position': [],
                          'Company': [],
                          'Company Link': [],
                          'Location': [],
                          'Salary': [],
                          'Published Date': []})
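
An alternative to growing a DataFrame one row at a time is to collect plain dictionaries in a list and build the frame once at the end, which is usually faster for many rows. This is a minimal sketch: the names `COLUMNS`, `rows`, and `job_data_sketch` are hypothetical, and the values are placeholders rather than scraped data.

```python
import pandas as pd

COLUMNS = ['Link', 'Position', 'Company', 'Company Link',
           'Location', 'Salary', 'Published Date']

rows = []  # one dict per job card

# Placeholder values for illustration only, not scraped data.
rows.append({'Link': 'https://www.jobstreet.co.id/example-job',
             'Position': 'Data Engineer',
             'Company': 'Example Co',
             'Company Link': 'https://www.jobstreet.co.id/example-company',
             'Location': 'Jakarta',
             'Salary': 'NA',
             'Published Date': '2022-01-31'})

# Build the DataFrame in one step once all rows are collected.
job_data_sketch = pd.DataFrame(rows, columns=COLUMNS)
print(job_data_sketch.shape)
```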

Because the job list spans several pages, a while loop is used to walk through every available page.

In [None]:
# Page counter (0 means no page has been processed yet)
i = 0

# Start scraping
while True:
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # Find the information wrapper (Card)
    job_lists = soup.find_all(
        'div', class_='sx2jih0 zcydq87a zcydq86a zcydq84y zcydq85a')

    # Count the page that is being processed.
    i += 1

    # Looping the job lists
    for job_list in job_lists:
        full_link = 'https://www.jobstreet.co.id'
        company_temp = job_list.find('a', class_='sx2jih0 sx2jihe _2sRFr')
        link = job_list.find(
            'a', class_='_18qlyvc12 _9tnmfh1 _18qlyvc2 sx2jih0 sx2jihe zcydq824').get('href')
        full_job_link = full_link + link
        position = job_list.find('span', class_='sx2jih0').text
        company = company_temp.text
        company_link = full_link + company_temp.get('href')
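        # NOTE: this selector is identical to the company selector above, so
        # find() returns the same element; point it at the location link instead.
        location = job_list.find('a', class_='sx2jih0 sx2jihe _2sRFr').text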
        published_date = job_list.find(
            'time', class_='sx2jih0 zcydq82q').attrs['datetime'].strip()

        # Not every card shows a salary; fall back to 'NA' when the second span is missing.
        try:
            salary = job_list.find_all(
                'span', class_='sx2jih0 zcydq82q')[1].text
        except IndexError:
            salary = 'NA'

        # Add the row to the DataFrame (DataFrame.append was removed in pandas 2.0)
        job_data = pd.concat(
            [job_data,
             pd.DataFrame([{'Link': full_job_link,
                            'Position': position,
                            'Company': company,
                            'Company Link': company_link,
                            'Location': location,
                            'Salary': salary,
                            'Published Date': published_date}])],
            ignore_index=True)

    # Next Page
    # The first page uses the first link with this class; on later pages the
    # same class matches two links and the second one leads to the next page.
    pagination_links = soup.find_all(
        'a', class_='sx2jih0 zcydq872 zcydq862 zcydq88 zcydq82b zcydq832 zcydq8c6 zcydq824 zcydq82l zcydq82k CyicE_0')
    try:
        next_page = pagination_links[0 if i == 1 else 1].get('href')
        driver.get('https://www.jobstreet.co.id' + next_page)
    except (IndexError, TypeError):
        # No next-page link (or no href): the last page has been reached
        break
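
The per-card extraction in the loop above can also be isolated into a small function, which makes missing fields easier to handle in one place. This is a sketch under stated assumptions: `text_or_default` and `parse_card` are hypothetical helpers, and the `a[href^="/job/"]` and `span.salary` selectors are illustrative rather than jobstreet's real markup. `urljoin` from the standard library builds the absolute link from a relative `href`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.jobstreet.co.id'


def text_or_default(tag, default='NA'):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_card(card, base_url=BASE_URL):
    """Extract a few fields from one job card (the selectors are assumptions)."""
    job_anchor = card.select_one('a[href^="/job/"]')
    link = urljoin(base_url, job_anchor['href']) if job_anchor is not None else 'NA'
    return {
        'Link': link,
        'Position': text_or_default(job_anchor),
        'Salary': text_or_default(card.select_one('span.salary')),
    }


# Hypothetical card without salary information.
card = BeautifulSoup(
    '<div><a href="/job/123"><span>Data Engineer</span></a></div>',
    'html.parser').div
print(parse_card(card))
```

Returning a default instead of raising keeps one incomplete card from stopping the whole run.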


### Export DataFrame to .csv

In [None]:
job_data.to_csv('job_lists.csv')
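
Two details are worth knowing when exporting and re-reading the file. `to_csv` writes the row index as an extra column unless `index=False` is passed, and `read_csv` converts the string `'NA'` to a missing value by default. The sketch below uses a placeholder frame and an in-memory buffer in place of a file on disk; the names `frame`, `buffer`, and `restored` are hypothetical.

```python
import io

import pandas as pd

# Placeholder frame for illustration only, not scraped data.
frame = pd.DataFrame({'Position': ['Data Engineer'], 'Salary': ['NA']})

buffer = io.StringIO()  # stands in for a file on disk
frame.to_csv(buffer, index=False)  # omit the row index column
print(buffer.getvalue())

# keep_default_na=False keeps the literal string 'NA' instead of NaN.
buffer.seek(0)
restored = pd.read_csv(buffer, keep_default_na=False)
print(restored.loc[0, 'Salary'])
```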