# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Seek.com](https://www.seek.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like seek.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Seek.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
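
As a minimal sketch of the classification framing, the snippet below turns salary amounts into high/low labels using the median as the threshold. The salary figures are made-up illustrative values and `label_by_median` is a hypothetical helper name, not part of the project materials.

```python
from statistics import median

def label_by_median(salaries):
    """Label each salary 1 (high) if it is above the median, else 0 (low)."""
    threshold = median(salaries)
    return [1 if s > threshold else 0 for s in salaries]

# made-up annual salaries, for illustration only
example_salaries = [70000, 85000, 90000, 120000, 150000]
labels = label_by_median(example_salaries)
print(labels)
```

A salary exactly equal to the median is labelled low here; whether that boundary case should count as high or low is a choice you would need to justify.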

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.
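
One possible way to frame a junior vs. senior target is to derive it from the job title. The sketch below assumes that seniority is signalled by keywords in the title; the keyword lists and the `seniority_label` helper are hypothetical and would need tuning against the titles you actually scrape.

```python
import re

# hypothetical keyword lists; adjust them after inspecting the scraped titles
SENIOR_WORDS = {"senior", "lead", "principal", "head", "manager"}
JUNIOR_WORDS = {"junior", "graduate", "intern", "entry"}

def seniority_label(title):
    """Return 'senior', 'junior', or 'mid' based on whole-word keyword matches."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & SENIOR_WORDS:
        return "senior"
    if words & JUNIOR_WORDS:
        return "junior"
    return "mid"

titles = [
    "Senior Data Analyst - NSW Government",
    "Junior Business Intelligence Analyst",
    "Data Scientist",
]
labels = [seniority_label(t) for t in titles]
print(labels)
```

Titles with no keyword fall into a catch-all "mid" class, which is an assumption of this sketch rather than something the postings state.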
 
## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, or SVM).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - The summary should be 500-1000 words, defining any technical terms and explaining your approach as well as any risks and limitations.

 

## Suggestions for Getting Started

1. Collect data from [seek.com](https://www.seek.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your boss is interested in which overall features hold the greatest significance.
  - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.   
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
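
As a rough starting point for the skills and keywords question, the sketch below counts how many job descriptions in each salary group mention a given skill. The skill list, the example descriptions, and the `skill_counts` helper are all hypothetical; a fuller analysis would more likely use a vectorizer and model coefficients or feature importances.

```python
from collections import Counter
import re

# hypothetical skill keywords; extend this after reading some postings
SKILLS = ("python", "sql", "tableau", "excel", "spark")

def skill_counts(descriptions):
    """Count how many descriptions mention each skill keyword at least once."""
    counts = Counter()
    for text in descriptions:
        tokens = set(re.findall(r"[a-z+#]+", text.lower()))
        counts.update(skill for skill in SKILLS if skill in tokens)
    return counts

# made-up description snippets, for illustration only
high_paying = ["Strong Python and Spark experience", "Python, SQL and cloud skills"]
low_paying = ["Excel reporting and basic SQL", "Advanced Excel required"]

high_counts = skill_counts(high_paying)
low_counts = skill_counts(low_paying)
print(high_counts, low_counts)
```

Comparing the two counters gives a first impression of which keywords lean towards higher or lower paying roles, without claiming any statistical significance.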
 
---

## Useful Resources

- Scraping is one of the most fun, useful and interesting skills out there. Don’t lose out by copying someone else's code!
- [Here is some advice on how to write for a non-technical audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).

---

### Project Feedback + Evaluation

For all projects, students will be evaluated on a simple 4-point scale (0, 1, 2 or 3). Instructors will use this rubric when scoring student performance on each of the core project **requirements:** 

 Score | Expectations
 ----- | ------------
 **0** | _Did not complete. Try again._
 **1** | _Does not meet expectations. Try again._
 **2** | _Meets expectations._
 **3** | _Surpasses expectations. Brilliant!_
 
 # Project 4 feedback
| Requirement | Rubric   |
|------|------|
|   Scrape and prepare your own data  | |
|   Create and compare at least two models, one a decision tree or ensemble and the other a classifier or regression for Section 1: Job Salary Trends and Section 2: Job Category Factors (so at least 4 models in total)  | |
|   Polished Jupyter notebook with your analysis annotated for a peer audience of data scientists  | |
|  Executive summary at the beginning of your notebook, written for your superiors to use to make business decisions. Make sure that it includes the ‘So what…’ regarding your analysis, risks and limitations. ||

__Qualitative feedback:__


In [2]:
import pandas as pd

data = pd.read_csv("kk_proj4_json.csv")
data

Unnamed: 0.1,Unnamed: 0,Job Title,Company,multiple_details,Description
0,0,Data Scientist,Fundo Loans,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",We are scaling up! An exciting opportunity exi...
1,1,Data Analyst,AIA Australia Limited,"['Melbourne', 'CBD & Inner Suburbs', 'Banking ...",The focus of this role is to provide support t...
2,2,Junior Business Intelligence Analyst,Hays Talent Solutions,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...","Your new companyAt Hays, we are on a journey t..."
3,3,Data Analyst,Atos Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Atos:Atos is a global leader in digital ...
4,4,Data Analyst,Infrastructure Partnerships Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Infrastructure Partnerships AustraliaInf...
...,...,...,...,...,...
1074,534,Data Analyst,Aurec,"['ACT', 'Information & Communication Technolog...",We are looking to engage a skilled and enthusi...
1075,535,Senior Data Analyst - NSW Government,Talenza,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",Senior Data Analyst - NSW GovernmentLocation: ...
1076,536,Principal Engineer (Knowledge Graph),SEEK Limited,"['Melbourne', 'CBD & Inner Suburbs', 'Informat...",Company DescriptionAbout SEEKSEEK’s portfolio ...
1077,537,Data Management Analyst,Humanised Group,"['Brisbane', 'CBD & Inner Suburbs', 'Informati...",About the role:In this role you will be requir...


In [3]:
data.head()

Unnamed: 0.1,Unnamed: 0,Job Title,Company,multiple_details,Description
0,0,Data Scientist,Fundo Loans,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",We are scaling up! An exciting opportunity exi...
1,1,Data Analyst,AIA Australia Limited,"['Melbourne', 'CBD & Inner Suburbs', 'Banking ...",The focus of this role is to provide support t...
2,2,Junior Business Intelligence Analyst,Hays Talent Solutions,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...","Your new companyAt Hays, we are on a journey t..."
3,3,Data Analyst,Atos Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Atos:Atos is a global leader in digital ...
4,4,Data Analyst,Infrastructure Partnerships Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Infrastructure Partnerships AustraliaInf...


In [4]:
# string form of list
data.multiple_details.head().values

array(["['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Engineering - Software', '$110,000 - $149,999', 'Full time']",
       "['Melbourne', 'CBD & Inner Suburbs', 'Banking & Financial Services', 'Analysis & Reporting', 'Full time']",
       "['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Business/Systems Analysts', 'Full time']",
       "['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Business/Systems Analysts', 'Full time']",
       "['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Consulting & Strategy', 'Analysts', 'Full time']"],
      dtype=object)

In [5]:
# parse each string into a real Python list using ast.literal_eval
import ast
data['multiple_details_lst'] = data.multiple_details.apply(ast.literal_eval)

In [6]:
# check how many items each list contains
data['items'] = data.multiple_details_lst.apply(len)
data['items'].value_counts()

5    653
6    277
4    149
Name: items, dtype: int64

In [7]:
# hypothesis: lists of different lengths contain different sets of features,
# and within lists of the same length each feature sits at the same index.

# region, industry, role_type, offer_type
data.loc[(data['items']==4),'multiple_details_lst'].sample(10).values

array([list(['Adelaide', 'Consulting & Strategy', 'Analysts', 'Full time']),
       list(['Melbourne', 'Accounting', 'Analysis & Reporting', 'Contract/Temp']),
       list(['Adelaide', 'Information & Communication Technology', 'Business/Systems Analysts', 'Full time']),
       list(['Gosford & Central Coast', 'Manufacturing, Transport & Logistics', 'Analysis & Reporting', 'Contract/Temp']),
       list(['Adelaide', 'Information & Communication Technology', 'Business/Systems Analysts', 'Full time']),
       list(['Adelaide', 'Accounting', 'Business Services & Corporate Advisory', 'Full time']),
       list(['Sydney', 'Information & Communication Technology', 'Business/Systems Analysts', 'Full time']),
       list(['Melbourne', 'Information & Communication Technology', 'Help Desk & IT Support', 'Contract/Temp']),
       list(['Melbourne', 'Information & Communication Technology', 'Business/Systems Analysts', 'Full time']),
       list(['Melbourne', 'Information & Communication Technology

In [8]:
# region, sub-region, industry, role_type, offer_type
data.loc[(data['items']==5),'multiple_details_lst'].sample(10).values

array([list(['Melbourne', 'CBD & Inner Suburbs', 'Information & Communication Technology', 'Other', 'Full time']),
       list(['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Business/Systems Analysts', 'Contract/Temp']),
       list(['Melbourne', 'CBD & Inner Suburbs', 'Insurance & Superannuation', 'Other', 'Full time']),
       list(['Melbourne', 'Information & Communication Technology', 'Business/Systems Analysts', 'Enjoy a Flexi Day on us!', 'Full time']),
       list(['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Insurance & Superannuation', 'Actuarial', 'Full time']),
       list(['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Business/Systems Analysts', 'Part time']),
       list(['Brisbane', 'CBD & Inner Suburbs', 'Information & Communication Technology', 'Database Development & Administration', 'Full time']),
       list(['Brisbane', 'CBD & Inner Suburbs', 'Information & Communication Techn

In [9]:
# region, sub-region, industry, role_type, renumeration, offer_type
data.loc[(data['items']==6),'multiple_details_lst'].sample(10).values

array([list(['Melbourne', 'CBD & Inner Suburbs', 'Government & Defence', 'Government - Federal', '$75,200 to $84,468 plus 15.4% super ', 'Full time']),
       list(['Sydney', 'North Shore & Northern Beaches', 'Information & Communication Technology', 'Business/Systems Analysts', '$110000.00 - $120k p.a.', 'Full time']),
       list(['Melbourne', 'CBD & Inner Suburbs', 'Information & Communication Technology', 'Business/Systems Analysts', 'Up to $140,000 + Super', 'Full time']),
       list(['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Engineering - Software', 'Base + Super + Profit Share', 'Full time']),
       list(['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Database Development & Administration', '$935 per day (inclusive of super)', 'Contract/Temp']),
       list(['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology', 'Database Development & Administration', '

In [10]:
# example parser function
def list_to_dict(row):
    d = dict()
    if row['items'] ==4:
        # region, industry, role_type, offer_type
        d['region'] = row['multiple_details_lst'][0]
        d['industry'] = row['multiple_details_lst'][1]
        d['role_type'] = row['multiple_details_lst'][2]
        d['offer_type'] = row['multiple_details_lst'][3]
    if row['items'] ==5:
        # region, sub_region, industry, role_type, offer_type
        d['region'] = row['multiple_details_lst'][0]
        d['sub_region'] = row['multiple_details_lst'][1]
        d['industry'] = row['multiple_details_lst'][2]
        d['role_type'] = row['multiple_details_lst'][3]
        d['offer_type'] = row['multiple_details_lst'][4]
    if row['items'] ==6:
        # region, sub-region, industry, role_type, renumeration, offer_type
        d['region'] = row['multiple_details_lst'][0]
        d['sub_region'] = row['multiple_details_lst'][1]
        d['industry'] = row['multiple_details_lst'][2]
        d['role_type'] = row['multiple_details_lst'][3]
        d['renumeration'] = row['multiple_details_lst'][4] 
        d['offer_type'] = row['multiple_details_lst'][5]        
    return d

data['multiple_items_dict'] = data.apply(list_to_dict,axis=1)
data['multiple_items_dict'].head().values

array([{'region': 'Sydney', 'sub_region': 'CBD, Inner West & Eastern Suburbs', 'industry': 'Information & Communication Technology', 'role_type': 'Engineering - Software', 'renumeration': '$110,000 - $149,999', 'offer_type': 'Full time'},
       {'region': 'Melbourne', 'sub_region': 'CBD & Inner Suburbs', 'industry': 'Banking & Financial Services', 'role_type': 'Analysis & Reporting', 'offer_type': 'Full time'},
       {'region': 'Sydney', 'sub_region': 'CBD, Inner West & Eastern Suburbs', 'industry': 'Information & Communication Technology', 'role_type': 'Business/Systems Analysts', 'offer_type': 'Full time'},
       {'region': 'Sydney', 'sub_region': 'CBD, Inner West & Eastern Suburbs', 'industry': 'Information & Communication Technology', 'role_type': 'Business/Systems Analysts', 'offer_type': 'Full time'},
       {'region': 'Sydney', 'sub_region': 'CBD, Inner West & Eastern Suburbs', 'industry': 'Consulting & Strategy', 'role_type': 'Analysts', 'offer_type': 'Full time'}],
      dt

In [11]:
# then pull each field out of the dict into its own column; .get returns None when a key is missing
data['region'] = data.multiple_items_dict.apply(lambda x: x.get('region'))
data['sub_region'] = data.multiple_items_dict.apply(lambda x: x.get('sub_region'))
data['industry'] = data.multiple_items_dict.apply(lambda x: x.get('industry'))
data['role_type'] = data.multiple_items_dict.apply(lambda x: x.get('role_type'))
data['renumeration'] = data.multiple_items_dict.apply(lambda x: x.get('renumeration'))
data['offer_type'] = data.multiple_items_dict.apply(lambda x: x.get('offer_type'))
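
The length-based parser above could also be written as a lookup table, which keeps the assumed field order for each list length in one place. This is only a sketch: `SCHEMAS` and `details_to_dict` are hypothetical names, and the field order is the one assumed in the cell comments. Note that one of the sampled five-item rows (`'Melbourne', 'Information & Communication Technology', 'Business/Systems Analysts', 'Enjoy a Flexi Day on us!', 'Full time'`) does not follow that order, so a purely positional parser will mislabel some rows.

```python
# assumed field order for each list length, taken from the comments in the cells above
SCHEMAS = {
    4: ("region", "industry", "role_type", "offer_type"),
    5: ("region", "sub_region", "industry", "role_type", "offer_type"),
    6: ("region", "sub_region", "industry", "role_type", "renumeration", "offer_type"),
}

def details_to_dict(details):
    """Pair each item with the field name assumed for a list of that length."""
    keys = SCHEMAS.get(len(details))
    if keys is None:
        return {}  # unexpected length: leave every field missing
    return dict(zip(keys, details))

example = ['Melbourne', 'CBD & Inner Suburbs', 'Banking & Financial Services',
           'Analysis & Reporting', 'Full time']
parsed = details_to_dict(example)
print(parsed)
```

With pandas this could be applied as `data.multiple_details_lst.apply(details_to_dict)`, assuming the column holds the parsed lists.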

In [12]:
# you will still need regex to get the numeric values out
data.renumeration.value_counts()

$75,200 to $84,468 plus 15.4% super                   8
Attractive salary package + bonus + benefits          6
$70,000 - $75,000                                     4
$130,000 - $149,999                                   4
Competitive Salary & Flexible Working Arrangements    4
                                                     ..
$600 - $725 per day                                   1
$836 per day + super                                  1
$850 - $875 p.d. incl. Super                          1
$100 - $106.25 p.h.                                   1
$110k - $140k p.a.                                    1
Name: renumeration, Length: 186, dtype: int64
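
A minimal sketch of the regex step, assuming amounts are written with optional thousands separators, optional decimals, and an optional `k` suffix as in the strings above. `AMOUNT` and `extract_amounts` are hypothetical names. The pattern requires each number to start with a digit so that the dots in abbreviations such as `p.a.` are not picked up.

```python
import re

# a number starts with a digit (so the dots in 'p.a.' are skipped) and may end in 'k'
AMOUNT = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)

def extract_amounts(text):
    """Return the numeric amounts found in text as floats; non-strings give []."""
    if not isinstance(text, str):
        return []
    amounts = []
    for number, k_suffix in AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        amounts.append(value * 1000 if k_suffix else value)
    return amounts

print(extract_amounts("$110,000 - $149,999"))
print(extract_amounts("$110k - $140k p.a."))
print(extract_amounts(None))
```

Percentages such as the super rate in `'$75,200 to $84,468 plus 15.4% super'` are still returned as numbers, so they would need separate filtering before averaging.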

In [13]:
data['renumeration']

0               $110,000 - $149,999
1                              None
2                              None
3                              None
4                              None
                   ...             
1074                           None
1075               $700 - $780 p.d.
1076    Base + Super + Profit Share
1077                           None
1078                           None
Name: renumeration, Length: 1079, dtype: object

In [14]:
# maybe there's some fun in looking at offer types
data.offer_type.value_counts()

Full time          812
Contract/Temp      256
Part time            6
Casual/Vacation      5
Name: offer_type, dtype: int64

In [15]:
data.head(5)

Unnamed: 0.1,Unnamed: 0,Job Title,Company,multiple_details,Description,multiple_details_lst,items,multiple_items_dict,region,sub_region,industry,role_type,renumeration,offer_type
0,0,Data Scientist,Fundo Loans,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",We are scaling up! An exciting opportunity exi...,"[Sydney, CBD, Inner West & Eastern Suburbs, In...",6,"{'region': 'Sydney', 'sub_region': 'CBD, Inner...",Sydney,"CBD, Inner West & Eastern Suburbs",Information & Communication Technology,Engineering - Software,"$110,000 - $149,999",Full time
1,1,Data Analyst,AIA Australia Limited,"['Melbourne', 'CBD & Inner Suburbs', 'Banking ...",The focus of this role is to provide support t...,"[Melbourne, CBD & Inner Suburbs, Banking & Fin...",5,"{'region': 'Melbourne', 'sub_region': 'CBD & I...",Melbourne,CBD & Inner Suburbs,Banking & Financial Services,Analysis & Reporting,,Full time
2,2,Junior Business Intelligence Analyst,Hays Talent Solutions,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...","Your new companyAt Hays, we are on a journey t...","[Sydney, CBD, Inner West & Eastern Suburbs, In...",5,"{'region': 'Sydney', 'sub_region': 'CBD, Inner...",Sydney,"CBD, Inner West & Eastern Suburbs",Information & Communication Technology,Business/Systems Analysts,,Full time
3,3,Data Analyst,Atos Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Atos:Atos is a global leader in digital ...,"[Sydney, CBD, Inner West & Eastern Suburbs, In...",5,"{'region': 'Sydney', 'sub_region': 'CBD, Inner...",Sydney,"CBD, Inner West & Eastern Suburbs",Information & Communication Technology,Business/Systems Analysts,,Full time
4,4,Data Analyst,Infrastructure Partnerships Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Infrastructure Partnerships AustraliaInf...,"[Sydney, CBD, Inner West & Eastern Suburbs, Co...",5,"{'region': 'Sydney', 'sub_region': 'CBD, Inner...",Sydney,"CBD, Inner West & Eastern Suburbs",Consulting & Strategy,Analysts,,Full time


In [16]:
data['renumeration'][0]

'$110,000 - $149,999'

In [17]:
data['renumeration'][1]  # this is None; 802 rows have no remuneration value

In [18]:
data.isna().sum()

Unnamed: 0                0
Job Title                 0
Company                   0
multiple_details          0
Description               0
multiple_details_lst      0
items                     0
multiple_items_dict       0
region                    0
sub_region              149
industry                  0
role_type                 0
renumeration            802
offer_type                0
dtype: int64

In [19]:
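# fill the 802 missing values with a placeholder string so every row holds a str for the regex below
data['salary'] = data['renumeration'].fillna('90000')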

In [20]:
import re
re.findall(r'[0-9,.]+', data.salary[0])

['110,000', '149,999']

In [21]:
re.findall(r'[0-9,.]+', data.renumeration[0])

['110,000', '149,999']

In [22]:
re.findall(r'[0-9,.]+', data.salary[1])

['90000']

In [23]:
re.findall(r'[0-9,.]+', data.renumeration[1])  # renumeration[1] is None (not a string), so re.findall raises the TypeError below;
                                             # salary[1] is the placeholder string '90000'

TypeError: expected string or bytes-like object

In [None]:
# Working with Anita

# a number must start with a digit, so the dots in 'p.a.' or 'p.d.' are not matched as numbers
NUMBER_PATTERN = r'\d[\d,]*(?:\.\d+)?'

def average_value(string_salary):
    """Return the single amount, or the midpoint of a two-amount range, as a float."""
    values = [float(s.replace(',', '')) for s in re.findall(NUMBER_PATTERN, string_salary)]
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return (values[0] + values[1]) / 2
    # no amount, or more than two numbers (e.g. a super percentage): use the default
    return float(90000)

In [None]:
def salary_converter(salary):
    """Convert a salary string to an approximate annual amount (default 90000)."""
    try:
        if isinstance(salary, str):
            text = salary.lower()
            if re.search(r'[0-9]', text):
                # check spelled-out and abbreviated pay periods, e.g. 'per day' and 'p.d.'
                if 'year' in text or 'p.a.' in text:
                    job_salary = average_value(salary)
                elif 'month' in text:
                    job_salary = average_value(salary)*12
                elif 'week' in text:
                    job_salary = average_value(salary)*52
                elif 'day' in text or 'p.d.' in text:
                    job_salary = average_value(salary)*52*5
                elif 'hour' in text or 'p.h.' in text:
                    job_salary = average_value(salary)*52*38
                else:
                    job_salary = average_value(salary)
            else:
                job_salary = float(90000)
        else:
            job_salary = float(salary)
        # amounts quoted in thousands, e.g. '$110k'
        if job_salary < 300:
            job_salary = job_salary * 1000
        # anything still implausibly low falls back to the default
        if job_salary < 10000:
            job_salary = 90000
        return job_salary
    except (TypeError, ValueError):
        # None or otherwise unparseable input
        return float(90000)

In [None]:
data.salary

In [None]:
for i in data.salary:
    print(salary_converter(i))

In [None]:
data['salary'] = data.salary.map(salary_converter)
data['salary'].value_counts()

In [None]:
# export the converted salaries under a new column name
salary_export = data[['salary']].rename(columns={'salary': 'salary_import'})
salary_export.to_csv('salary_import.csv', index=False)

pd.read_csv('salary_import.csv')['salary_import']


In [None]:
import seaborn as sns

In [None]:
sns.histplot(data.salary, kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
# map the converter over the raw remuneration strings; missing values fall back to the default
data['renumeration02'] = data.renumeration.map(salary_converter)
data['renumeration02']

In [None]:
data['renumeration02'].value_counts()