# Analysis of Job Postings (Data Analytics)

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. 


---


### QUESTION 1: Factors that impact salary

To predict salary, you can frame the task as a classification problem: create labels from the salaries (high vs. low salary, for example) using a threshold such as the median salary.
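
As a minimal sketch of the labelling step, assuming the salaries have already been parsed into numeric monthly values (the `salaries` list below is made-up illustration data, not scraped results), a median threshold could be applied like this:

```python
from statistics import median

# Hypothetical monthly salaries; illustration data only.
salaries = [3500, 4200, 5000, 6500, 8000, 12000]

# The median splits the sample into two roughly equal halves.
threshold = median(salaries)

# 1 = "high" (strictly above the median), 0 = "low" (at or below it).
labels = [1 if s > threshold else 0 for s in salaries]

print(threshold, labels)
```

With a pandas DataFrame whose `salary` column is numeric, the same labels can be produced with `(df["salary"] > df["salary"].median()).astype(int)`. Splitting at the median keeps the two classes roughly balanced, so the baseline accuracy to beat is about 50%.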


### QUESTION 2: Factors that distinguish job category

There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry?
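
One way to turn the first two questions into target variables is simple keyword matching on the job title. The sketch below is an assumption about how such labels could be built, not necessarily the method used later in this analysis; the titles, keyword lists, and function names are hypothetical.

```python
import re

# Hypothetical job titles; illustration data only.
titles = [
    "Senior Data Scientist",
    "Data Analyst",
    "Junior Data Engineer",
    "Lead Data Scientist (NLP)",
]

def is_data_scientist(title):
    """Return 1 if the title mentions 'data scientist', else 0."""
    return int(bool(re.search(r"data scientist", title, flags=re.IGNORECASE)))

def seniority_level(title):
    """Classify a title as 'senior', 'junior' or 'unspecified' by keyword."""
    lowered = title.lower()
    if re.search(r"\b(senior|lead|principal|head)\b", lowered):
        return "senior"
    if re.search(r"\b(junior|associate|intern|entry)\b", lowered):
        return "junior"
    return "unspecified"

ds_labels = [is_data_scientist(t) for t in titles]
levels = [seniority_level(t) for t in titles]
```

Keyword rules like these are easy to audit but will miss unusual titles, so it is worth inspecting the "unspecified" group before modelling.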

### Overview

Part 1. Scrape and prepare your own data.

Part 2. Data cleaning and exploratory data analysis (EDA)

Part 3. Modelling and evaluation

Part 4. Executive summary

# Part 1 - Web scraping

In [2]:
from selenium import webdriver
import os
from bs4 import BeautifulSoup
import urllib
import numpy as np
from time import sleep
import csv
from selenium.webdriver.common.keys import Keys
import random
import pandas as pd

In [3]:
# web scraping(job url links)
# go to mycareersfuture website and search for "data"
# search through the pages and get the url links to the jobs and save into csv file

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
links=[]

# for loop to search through the pages and add the url links into the list
for page in range(0, 2):  # the actual data was scraped using range(0,200)
    # load the results page first, then wait for it to render before reading the HTML
    driver.get("https://www.mycareersfuture.sg/search?search=data&sortBy=new_posting_date&page={}".format(page))
    sleep(random.randint(4,5))
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")

    if len(soup)==0:
        pass       # skip the page if it is empty
    else:
        for link in soup.find_all("a", {"class": "bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3"}):
            link = link.get("href")
            links.append("https://www.mycareersfuture.sg"+link)

# convert to dataframe
links_df = pd.DataFrame({"Links":links})
# save to csv
links_df.to_csv("links_df_sample.csv")
# close the browser and end the driver session
driver.quit()
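
The class string matched in the cell above ends in what looks like a build-generated suffix (`JobCard__card___22xP3`), so an exact match may stop working if the site's markup changes. A more tolerant approach, sketched below with the standard-library `html.parser` on a made-up HTML fragment, keeps any anchor whose class list contains a token starting with `JobCard__card`. The prefix, the sample HTML, and the `JobLinkParser` name are assumptions for illustration.

```python
from html.parser import HTMLParser

BASE_URL = "https://www.mycareersfuture.sg"

class JobLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that have a class token starting with a prefix."""

    def __init__(self, class_prefix):
        super().__init__()
        self.class_prefix = class_prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        # Keep the link only if one class token starts with the prefix.
        if href and any(c.startswith(self.class_prefix) for c in classes):
            self.links.append(BASE_URL + href)

# Made-up page fragment; the real markup may differ.
sample_html = """
<a class="bg-white pa3 JobCard__card___22xP3" href="/job/abc-123">A</a>
<a class="nav-link" href="/about">About</a>
<a class="JobCard__card___9zZ1q" href="/job/def-456">B</a>
"""

parser = JobLinkParser("JobCard__card")
parser.feed(sample_html)
job_links = parser.links
```

In the scraping loop, `parser.feed(driver.page_source)` would take the place of the sample string. With BeautifulSoup, a comparable substring match can be written as `soup.select('a[class*="JobCard__card"]')`.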

In [None]:
# web scraping(job info)
# load csv file with the url links
# named links_csv so the imported csv module is not shadowed
links_csv = './links_df_sample.csv'  # actual file used is links_df.csv
df = pd.read_csv(links_csv)

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
# Setting up lists
company=[]
job_title=[]
location=[]
employment_type=[]
seniority=[]
job_categories=[]
salary=[]
payment_period=[]
job_description=[]
requirements=[]
count=0

# search through links to get the job info, append NaN if info not found

for info in df.Links:
    sleep(random.randint(8,10))   # pause between requests
    driver.get(info)              # load the job page before reading from it
    count +=1
    sleep(random.randint(5,8))    # give the page time to render
    print('no.',count)
    
    # catch Exception rather than using a bare except, so Ctrl+C can still stop the scrape
    try:
        company.append(driver.find_element_by_name('company').text)
    except Exception:
        company.append(np.nan)
    try:
        job_title.append(driver.find_element_by_id('job_title').text)
    except Exception:
        job_title.append(np.nan)
    try:
        location.append(driver.find_element_by_id('address').text)
    except Exception:
        location.append(np.nan)
    try:
        employment_type.append(driver.find_element_by_id('employment_type').text)
    except Exception:
        employment_type.append(np.nan)
    try:
        seniority.append(driver.find_element_by_id('seniority').text)
    except Exception:
        seniority.append(np.nan)
    try:
        job_categories.append(driver.find_element_by_id('job-categories').text)
    except Exception:
        job_categories.append(np.nan)
    try:
        salary.append(driver.find_element_by_class_name('lh-solid').text)
    except Exception:
        salary.append(np.nan)
    try:
        payment_period.append(driver.find_element_by_class_name('salary_type').text)
    except Exception:
        payment_period.append(np.nan)
    try:
        job_description.append(driver.find_element_by_id('job_description').text)
    except Exception:
        job_description.append(np.nan)
    try:
        requirements.append(driver.find_element_by_id('requirements').text)
    except Exception:
        requirements.append(np.nan)

# save info into a dataframe
jobs_info = pd.DataFrame({
    'company':company,
    'job_title':job_title,
    'location':location,
    'employment_type':employment_type,
    'seniority':seniority,
    'job_categories':job_categories,
    'salary':salary,
    'payment_period':payment_period,
    'job_description':job_description,
    'requirements':requirements,
})

# add to the first links dataframe
result = pd.concat([df,jobs_info], axis=1,sort=False)
# save to a new csv file
result.to_csv('job_info_sample.csv') # actual file used is job_info.csv

# close the browser and end the driver session
driver.quit()

no. 1
no. 2
no. 3
no. 4
no. 5
no. 6
no. 7
no. 8
no. 9
no. 10
no. 11
no. 12
no. 13
no. 14
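
The ten near-identical `try`/`except` blocks in the cell above could be collapsed into one helper. The sketch below takes the lookup function as an argument so that it runs without a browser; `text_or_nan`, `FakeElement`, `fake_find`, and `PAGE` are hypothetical names, and `float("nan")` stands in for `np.nan` to keep the example free of third-party imports.

```python
import math

def text_or_nan(find, locator):
    """Return find(locator).text, or NaN if the lookup raises an exception."""
    try:
        return find(locator).text
    except Exception:  # e.g. Selenium's NoSuchElementException
        return float("nan")

# Stand-ins for a Selenium driver so the sketch runs without a browser.
class FakeElement:
    def __init__(self, text):
        self.text = text

PAGE = {"job_title": FakeElement("Data Analyst")}

def fake_find(locator):
    return PAGE[locator]  # raises KeyError when the element is missing

# One mapping of field name -> locator replaces the repeated blocks.
fields = {"job_title": "job_title", "seniority": "seniority"}
row = {name: text_or_nan(fake_find, loc) for name, loc in fields.items()}

print(row["job_title"], math.isnan(row["seniority"]))
```

In the scraper, a call such as `text_or_nan(driver.find_element_by_id, 'job_title')` would replace one `try`/`except` pair, and collecting one `row` dictionary per page avoids keeping ten parallel lists in sync.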
