## Requirements
1. Scrape and prepare your own data.

2. Create and compare at least two models for each section. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).

 - Section 1: Job Salary Trends
 - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists.

 - Make sure to clearly describe and label each section.
 - Comment on your code so that others could, in theory, replicate your work.

4. A brief writeup in an executive summary, written for a non-technical audience.

 - Writeups should be 500-1000 words, define any technical terms, and explain your approach as well as any risks and limitations.


## BONUS
5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

## Suggestions for Getting Started
1. Collect data from Indeed.com (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
 - Select and parse data from at least 1000 postings for jobs, potentially from multiple location searches.
 
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
 - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
 
3. Discover which features have the greatest importance when determining a low vs. high paying job.
 - Your Boss is interested in what overall features hold the greatest significance.
 - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.
 
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.

5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# QUESTION 1: Factors that impact salary
To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
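
As a minimal sketch of the classification framing, the snippet below labels salaries as high or low relative to the median. The salary values are made up for illustration, and the names `salaries`, `known`, and `labels` are hypothetical rather than part of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up salaries for illustration only; NaN marks a listing without salary information.
salaries = pd.Series([55000, 90000, 120000, np.nan, 75000], name='salary')

# Only listings with a known salary can be labelled.
known = salaries.dropna()
threshold = known.median()

# 1 = high salary (above the median), 0 = low salary (at or below the median).
labels = (known > threshold).astype(int)

print(threshold)
print(labels.tolist())
```

Other thresholds (quartiles, a fixed dollar amount) would work the same way; the median simply gives roughly balanced classes.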

You have learned a variety of new skills and models that may be useful for this problem:

- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. Communication of your process is key. Note that most listings DO NOT come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

# QUESTION 2: Factors that distinguish job category
Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:

- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

# BONUS PROBLEM
Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
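
One way to read this request is as a decision-threshold adjustment: predicting "high salary" only when the model is more confident reduces false "high" predictions at the cost of missing some genuinely high-salary jobs. The sketch below illustrates that tradeoff with made-up labels and probabilities; `rates_at_threshold` is a hypothetical helper, not part of any library. Each (FPR, TPR) pair it returns is one point on the ROC curve.

```python
import numpy as np

def rates_at_threshold(y_true, y_score, threshold):
    """Return (false positive rate, true positive rate) when
    score >= threshold is predicted as 'high salary' (1)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    true_pos = np.sum((y_pred == 1) & (y_true == 1))
    negatives = np.sum(y_true == 0)
    positives = np.sum(y_true == 1)
    return false_pos / negatives, true_pos / positives

# Made-up labels (1 = high salary) and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.7, 0.4, 0.65, 0.8, 0.9]

for threshold in (0.5, 0.75):
    fpr, tpr = rates_at_threshold(y_true, y_score, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

In this toy data, raising the threshold from 0.5 to 0.75 removes every false "high" prediction but also lowers the share of high-salary jobs that are detected.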

##### Choose a website to scrape
We will scrape data-related jobs from "seek.com.au".

- Website used: seek.com.au
- Jobs searched: any job that includes 'data' as a keyword
- Industry: all industries
- Job location: all of Australia
- Number of scraped jobs: 3300 jobs from 150 pages (2861 after removing duplicates)

# Scraping Seek

In [6]:
# Import useful libraries

import pandas as pd
import numpy as np
import json
import re

import requests
from scrapy.selector import Selector
from bs4 import BeautifulSoup
import validators
import urllib
import urllib.request, urllib.parse, urllib.error

from selenium import webdriver
from selenium.webdriver.common.by import By

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The information we want is:

- Job title
- State
- Work type (Full time, Part time, Contract, etc...)
- Field (IT, Communication, Medical etc)
- Salary
- Description


To obtain the above, we need to visit each individual job URL, which can be collected from the aggregate (search results) page.

## Create a function which returns a list of links for individual jobs

In [113]:
def url_list(total_page_number):
    
    link_list = []
    
    for page_number in range(1, (total_page_number) +1):
    
        homepage   = 'https://www.seek.com.au/data-jobs?page=%s' %(page_number)
        response   = requests.get(homepage)
        html       = response.text
        soup       = BeautifulSoup(html, 'lxml')
    
    
        job_titles = soup.find_all(name='a', 
                                   attrs={'data-automation':'jobTitle'})

    
        for job_title in job_titles:
            
            link_part = job_title.get('href')
            link      = 'https://www.seek.com.au'+link_part
            
            link_list.append(link)
    
            
    return (link_list)

In [114]:
# Check the length of url list if we scrape the first page.
len(url_list(1))

22
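
The link-building step in `url_list` can also be sketched with the standard library's `urljoin`, which handles both relative and already-absolute `href` values. The helper names `absolute_job_url` and `unique_in_order` and the `/job/...` paths are hypothetical, used only for illustration. Removing repeated links before requesting each page would avoid fetching the same job twice.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.seek.com.au'

def absolute_job_url(href, base_url=BASE_URL):
    """Turn an href scraped from a results page into an absolute URL."""
    return urljoin(base_url, href)

def unique_in_order(links):
    """Drop repeated links while keeping the order of first appearance."""
    return list(dict.fromkeys(links))

# Made-up hrefs for illustration only.
hrefs = ['/job/111', '/job/222', '/job/111']
links = unique_in_order(absolute_job_url(h) for h in hrefs)
print(links)
```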

# Scrape & create a dictionary of salaries and other features.

The variables scraped are:
- Job title
- State (initially the city or region, mapped to a state later)
- Work type (Full time, Part time, Contract, etc...)
- Field (IT, Communication, Medical etc)
- Salary
- Description (stored as 'summary')

In [115]:
# Initialise an empty dictionary for data jobs

job_dictionary  = {}


# Initialise empty lists for each variable

title_list      = []
state_list      = []
worktype_list   = []
field_list      = []
salary_list     = []
summary_list    = []



###############################################################################################################################
# Iterate through the 150 pages
total_page = 150


for url in url_list(total_page):

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    
   # TITLE list --------------------------------------------------------------------------------------------------------------    
  
    title = soup.find('h1').text
    title_list.append(title)

    

   # Access 'job_info' box on each job page-----------------------------------------------------------------------------------
   # This box contains STATE, WORK-TYPE, FIELD and SALARY information.
    
    job_info = soup.find(name='section', attrs={'aria-labelledby':'jobInfoHeader'}) 
    

   # STATE list -----------------------------------------------------------------------------------------------------
   
    state    = job_info.find(name='strong', attrs={'class':'lwHBT6d'}).text
    state_list.append(state)
    
    
    
    
    
   # WORK-TYPE list (Full-time/ Part-time/ Casual etc)------------------------------------------------------------------------
    
    work_type = soup.find(name='dd', attrs={'data-automation':'job-detail-work-type'}).text
    worktype_list.append(work_type)

    
    
   
   # FIELD list (IT/ Consulting/ etc)----------------------------------------------------------------------------------
   
    # if field & department info exists, it is under the last 'dd' node.
    field = job_info.find_all('dd')[-1].find(name='strong', attrs={'class': 'lwHBT6d'}).text
    field_list.append(field)


    
    
   # SALARY list -------------------------------------------------------------------------------------------------------------
    
    dd_nodes = job_info.find_all('dd')
    
    # There are 5 'dd' nodes only if salary information exists & it is always under the 3rd 'dd' node.
    if len(dd_nodes) == 5: 
        
        salary = dd_nodes[2].find(name='span', attrs={'class': 'lwHBT6d'}).text  
    else:
        salary = np.nan
    
    salary_list.append(salary)
    
    
   # SUMMARY list ------------------------------------------------------------------------------------------------------------
    
    summary = soup.find(name='div', attrs={'data-automation':'mobileTemplate'}).text
    summary_list.append(summary)

    

# Put everything together in a dictionary & put it in a dataframe

job_dictionary = {'title'      : title_list, 
                  'state'      : state_list, 
                  'work_type'  : worktype_list,
                  'field'      : field_list,
                  'salary'     : salary_list,
                  'summary'    : summary_list
                 }
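
The loop above assumes every node exists on every job page. `soup.find` returns `None` when nothing matches, and calling `.text` on `None` raises an `AttributeError`, which would stop the scrape partway through. A minimal sketch of a defensive helper follows; `safe_text` is a hypothetical name, and the HTML snippet is hand-written for illustration rather than real Seek markup.

```python
import numpy as np
from bs4 import BeautifulSoup

def safe_text(node, default=np.nan):
    """Return the stripped text of a BeautifulSoup node, or `default` if the node is None."""
    return node.get_text(strip=True) if node is not None else default

# Tiny hand-written HTML snippet, for illustration only.
html = '<section><h1> Data Analyst </h1></section>'
soup = BeautifulSoup(html, 'html.parser')

print(safe_text(soup.find('h1')))      # text of the existing <h1>
print(safe_text(soup.find('strong')))  # no <strong> node, so the default is returned
```

Wrapping each lookup this way would record a missing field as `NaN` instead of aborting, at the cost of silently hiding layout changes on the site.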

In [116]:
jobs_df = pd.DataFrame(job_dictionary)

In [117]:
# Check jobs_df dataframe
jobs_df.head(5)

Unnamed: 0,title,state,work_type,field,salary,summary
0,Junior Market Research Data Analyst,Sydney,Full Time,Marketing & Communications,"$55,000 - $64,999",About the business Small but mighty – our te...
1,Head of Data Analytics Strategy,Sydney,Part Time,Consulting & Strategy,,"The Actuaries Institute, a highly respected pr..."
2,Data Analyst / Database Officer,Sydney,Full Time,Information & Communication Technology,,A full-time position is available for a Data A...
3,Data Analyst,Sydney,Full Time,Information & Communication Technology,$90k base + super + bonus,This well established company has a start-up c...
4,BI/Data Analyst,Brisbane,Full Time,Information & Communication Technology,Attractive Salary + Super,We are on the lookout for motivated BI/Data An...


## Remove duplicates

In [119]:
jobs_df2 = jobs_df.drop_duplicates(subset=['salary', 'summary'], keep='first')
jobs_df2.reset_index(inplace=True, drop=True)

In [120]:
print('Number of scraped data:', len(jobs_df))
print('Number of data without duplicates:', len(jobs_df2))

Number of scraped data: 3300
Number of data without duplicates: 2861


## Clean 'state' column

In [123]:
# Find unique values under 'state' column
jobs_df2['state'].unique()

array(['Sydney', 'Brisbane', 'Wollongong, Illawarra & South Coast',
       'Melbourne', 'Perth', 'ACT', 'Gold Coast', 'Mackay & Coalfields',
       'South West Coast VIC', 'Adelaide', 'Gladstone & Central QLD',
       'Newcastle, Maitland & Hunter', 'Tamworth & North West NSW',
       'Port Macquarie & Mid North Coast', 'Darwin', 'Hobart',
       'Alice Springs & Central Australia', 'Gosford & Central Coast',
       'Mildura & Murray', 'Toowoomba & Darling Downs', 'Sunshine Coast',
       'Albury Area', 'Northern QLD', 'Port Hedland, Karratha & Pilbara',
       'Rockhampton & Capricorn Coast', 'Ballarat & Central Highlands',
       'Bairnsdale & Gippsland', 'Coffs Harbour & North Coast',
       'Mt Gambier & Limestone Coast', 'Asia Pacific',
       'Lismore & Far North Coast', 'Kalgoorlie, Goldfields & Esperance',
       'Yarra Valley & High Country', 'Dubbo & Central NSW',
       'Wagga Wagga & Riverina', 'Mornington Peninsula & Bass Coast',
       'Devonport & North West', 'Albany & 

In [127]:
# Create a dictionary that maps each state or territory (key) to its list of Seek regions (value)

state_dict = {'VIC': ['Melbourne', 'South West Coast VIC', 'Bendigo, Goldfields & Macedon Ranges', 'Bairnsdale & Gippsland',
                      'Ballarat & Central Highlands', 'West Gippsland & Latrobe Valley', 'Yarra Valley & High Country',
                      'Mornington Peninsula & Bass Coast', 'Shepparton & Goulburn Valley', 'Mildura & Murray'],
              
              'NSW': ['Sydney', 'Newcastle, Maitland & Hunter', 'Wollongong, Illawarra & South Coast', 
                      'Tamworth & North West NSW', 'Port Macquarie & Mid North Coast', 'Gosford & Central Coast',
                      'Albury Area', 'Coffs Harbour & North Coast', 'Lismore & Far North Coast', 'Dubbo & Central NSW',
                      'Blue Mountains & Central West', 'Wagga Wagga & Riverina', 'Far West & North Central NSW',
                      'Southern Highlands & Tablelands'],
              
              'QLD': ['Brisbane', 'Gold Coast', 'Mackay & Coalfields', 'Toowoomba & Darling Downs', 'Sunshine Coast',
                      'Northern QLD', 'Gladstone & Central QLD', 'Rockhampton & Capricorn Coast', 
                      'Cairns & Far North', 'Hervey Bay & Fraser Coast'],
              
              'SA' : ['Adelaide', 'Mt Gambier & Limestone Coast'],
              
              'WA' : ['Perth', 'Port Hedland, Karratha & Pilbara', 'Kalgoorlie, Goldfields & Esperance',
                      'Albany & Great Southern', 'Geraldton, Gascoyne & Midwest', 'Mandurah & Peel', 'Broome & Kimberley'],
              
              'TAS': ['Hobart', 'Devonport & North West'],
              
              'NT' : ['Darwin', 'Alice Springs & Central Australia'],
              
              'ACT': ['ACT']
             }

In [128]:
# Invert state_dict into a region -> state lookup, then replace each region with its state.
# Regions that are not in the lookup (e.g. 'Asia Pacific') are left unchanged.
# Working on an explicit copy avoids pandas' SettingWithCopyWarning.

region_to_state = {region: state
                   for state, regions in state_dict.items()
                   for region in regions}

jobs_df2 = jobs_df2.copy()
jobs_df2['state'] = jobs_df2['state'].map(lambda region: region_to_state.get(region, region))


In [129]:
jobs_df2['state'].unique()

array(['NSW', 'QLD', 'VIC', 'WA', 'ACT', 'SA', 'NT', 'TAS',
       'Asia Pacific'], dtype=object)

In [130]:
# Check data frame
jobs_df2.head()

Unnamed: 0,title,state,work_type,field,salary,summary
0,Junior Market Research Data Analyst,NSW,Full Time,Marketing & Communications,"$55,000 - $64,999",About the business Small but mighty – our te...
1,Head of Data Analytics Strategy,NSW,Part Time,Consulting & Strategy,,"The Actuaries Institute, a highly respected pr..."
2,Data Analyst / Database Officer,NSW,Full Time,Information & Communication Technology,,A full-time position is available for a Data A...
3,Data Analyst,NSW,Full Time,Information & Communication Technology,$90k base + super + bonus,This well established company has a start-up c...
4,BI/Data Analyst,QLD,Full Time,Information & Communication Technology,Attractive Salary + Super,We are on the lookout for motivated BI/Data An...


In [131]:
print('There are ', jobs_df2['salary'].notnull().sum(), ' jobs with salary information.')

There are  998  jobs with salary information.


## Save data frame as a csv file

In [132]:
jobs_df2.to_csv('./jobs_raw.csv', index=False)

In [133]:
jobs_raw = pd.read_csv('./jobs_raw.csv')
jobs_raw.head(20)

Unnamed: 0,title,state,work_type,field,salary,summary
0,Junior Market Research Data Analyst,NSW,Full Time,Marketing & Communications,"$55,000 - $64,999",About the business Small but mighty – our te...
1,Head of Data Analytics Strategy,NSW,Part Time,Consulting & Strategy,,"The Actuaries Institute, a highly respected pr..."
2,Data Analyst / Database Officer,NSW,Full Time,Information & Communication Technology,,A full-time position is available for a Data A...
3,Data Analyst,NSW,Full Time,Information & Communication Technology,$90k base + super + bonus,This well established company has a start-up c...
4,BI/Data Analyst,QLD,Full Time,Information & Communication Technology,Attractive Salary + Super,We are on the lookout for motivated BI/Data An...
5,Business Intelligence Analyst,NSW,Full Time,Banking & Financial Services,,12 Month Fixed Term ContractGreat Company Cult...
6,Data Analyst - Big Data Exposure,NSW,Full Time,Information & Communication Technology,,Data AnalystDigital and broadcasting leader12m...
7,Data Analyst (Data Science Team),NSW,Full Time,Information & Communication Technology,Up to 120K Base + Super + Bonus,This is a newly created role and Data Science ...
8,Data and Marketing Analyst,NSW,Full Time,Marketing & Communications,,"CompanyEstablished, highly regarded and well k..."
9,Data & Statistical Analyst,NSW,Full Time,Healthcare & Medical,"$110,961.00 - $126,496.00 per annum plus super",Collaboration. Innovation. Better Healthcare...
