# Web scraping data jobs from Seek

### Outline of the project:
In this project we will look for data-specific jobs on Seek and use NLP to determine what characterizes a job as high or low paying.

To achieve this, we will classify the jobs as high paying (>= AUD 100,000) or low paying (< AUD 100,000).

To keep the project reproducible and robust we need at least 500 samples. If we cannot get that number from this web scrape, we will use a pool of jobs provided by GA.

A second analysis will look at the job descriptions and titles to ascertain what determines whether a job is related to data. This will be a more straightforward NLP task.


### Loading Basic Libraries
More will be loaded as needed.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import re
import datetime
import string

%matplotlib inline

# Data Scraping
Data scraping was done from Seek in April 2020.

In [2]:
import requests
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from selenium import webdriver
from time import sleep
# from IPython import display


#### Scraping - First step
Get the unique job URLs from Seek.

We will get the job title and as much info as possible from this first page.

We will use Beautiful Soup for this.

In [36]:
driver = webdriver.Chrome()

In [37]:
driver.get('https://www.seek.com.au/data-jobs')

In [42]:
url = 'https://www.seek.com.au/data-jobs'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')

In [60]:
'https://www.seek.com.au'+soup.find_all('a',{'class':'_2iNL7wI'})[int(j)].text

'https://www.seek.com.auData Network Engineer'
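The output above shows the link text glued onto the domain, which is not a usable address: `.text` returns the anchor's visible label, while the relative path lives in its `href` attribute. A minimal sketch of building the absolute URL with the standard library's `urljoin`; the helper name `absolute_job_url` and the `href` value are made up for illustration and are not taken from the scrape.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.seek.com.au'

def absolute_job_url(href, base=BASE_URL):
    """Join a relative href from a listing page onto the site root."""
    return urljoin(base, href)

# Hypothetical href, for illustration only
example_href = '/job/12345678?type=standard'
print(absolute_job_url(example_href))
```

`urljoin` also copes with hrefs that are already absolute, which plain string concatenation does not.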

In [93]:
# Initiating empty lists to append rows
link_list = []
job_list = []
def link_puller(url):
    # Note: the url argument is overwritten below with the paginated search URL
    for i in range(1, 500):
        # Counter for URLs on this page that were already collected
        repeated_counter = 0
        # Getting to the right page and parsing it with BeautifulSoup
        url = 'https://www.seek.com.au/data-jobs?page=' + str(i)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        # Job links and salary spans found on this page
        links = soup.find_all('a', {'class': '_2iNL7wI'})
        salaries = soup.find_all('span', {'data-automation': 'jobSalary'})
        # Accessing the individual jobs on the page
        for j in range(1, 20):
            # Initiating empty dict for this job
            row = {}
            # Getting job title
            row['title'] = links[j].text
            # Getting job URL
            row['url'] = 'https://www.seek.com.au' + links[j].attrs['href']
            # Getting job pay
            try:
                row['salary'] = salaries[j].text
            except IndexError:
                row['salary'] = "No Information"

            job_list.append(row)
            print(len(job_list))  # Number of jobs pulled so far, to check progress

            # Flag URLs that were already collected
            check = row['url']
            if check in link_list:
                repeated_counter += 1
                print(" ----> REPEATED", row['title'], row['url'])
            else:
                print(" --> ", row['title'], row['url'])

            # Appending to link_list after the check for this row
            link_list.append(check)

In [None]:
link_puller('https://www.seek.com.au/data-jobs')
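`link_puller` checks every new URL against `link_list`, which means a linear scan per job. As a sketch of an alternative (not what was run for this scrape), a set gives constant-time membership checks while a list keeps the first-seen order. The function name `unique_in_order` and the sample URLs are hypothetical.

```python
def unique_in_order(urls):
    """Return (unique urls in first-seen order, number of repeats)."""
    seen = set()
    unique = []
    repeated = 0
    for url in urls:
        if url in seen:
            repeated += 1
        else:
            seen.add(url)
            unique.append(url)
    return unique, repeated

# Made-up URLs for illustration
sample = [
    'https://www.seek.com.au/job/1',
    'https://www.seek.com.au/job/2',
    'https://www.seek.com.au/job/1',
]
unique, repeated = unique_in_order(sample)
print(len(unique), repeated)
```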

In [107]:
link_df = pd.DataFrame(job_list)
link_df.to_csv('Link_list.csv',index=False)
link_df.shape

(3800, 3)

### Checking if the job postings are all unique
Doing this step early will save us a lot of headache in the future

In [99]:
# Checking to see if I have two of the same 
df = pd.DataFrame(job_list)
df = df.drop_duplicates(['url'], keep='first')

df.shape

(3800, 3)

In [101]:
# Double checking to see if I have two of the same 
set_links = set(link_df.url)
len(set_links)

3800

In [103]:
# Triple checking to see if I have two of the same 
if link_df.shape[0] > len(set(link_df.url)):
    print('different')
else:
    print('all unique')

all unique


#### Second Step
Get the job info by visiting each URL.

In [None]:
# Ok, from this point onwards we have 3800 unique job postings. It is probable that most are not related to "data". 

In [171]:
url = link_list[7]
res = requests.get(url)        
soup = BeautifulSoup(res.text, 'lxml')

In [173]:
# Creating an empty list to collect the rows
raw_data=[]
def job_puller (link_list):
    loop_counter = 0
    # looping through each job
    for j in link_list:
        #getting how much in percentage we ran through
        loop_counter += 1
        loop_percent = (loop_counter/len(link_list))*100
        print('%.2f' %loop_percent)
        # getting soup
        url = j
        res = requests.get(url)        
        soup = BeautifulSoup(res.text, 'lxml')
        # Getting information out of jobs page
        row = {}
        
        
        # finding the company name and cleaning it up
        try:
            company = soup.find_all('span', {'class':"_3FrNV7v _2QG7TNq E6m4BZb"})[0].text
        except:
            company = "Not Given"
        
        # creating Company Column
        row['company'] = company
        
        # cleaning up 'city'
        try:
            city = str(soup.find_all('strong', {'class':"lwHBT6d"})).replace('<strong class="lwHBT6d">','').split('</strong>')[1][1:].replace("amp;","")
        except:
            city = "Not Given"
        # Adding City Column
        row['city'] = city
        
        # cleaning up 'job cat'
        try:
            cat = str(soup.find_all('strong', {'class':"lwHBT6d"})).replace('<strong class="lwHBT6d">','').split('</strong>')[2].replace("amp;","")
        except:
            cat = "Not Given"
        # Adding Cat Column
        row['cat'] = cat            

    
        # full text pulling
        full_text = soup.find_all('div', {'class':"_2e4Pi2B"})
        full_text = str(full_text).replace('<div class="_2e4Pi2B" data-automation="mobileTemplate">','').replace('</div>','')

        # removing html tags from the full page text for clarity
        clean = re.compile('<.*?>')
        full_text = re.sub(clean, '', full_text)
        # creating full description column
        row['full_desc'] = full_text[1:-1]
        
        # looking for salary --> This is a great way of doing it! Credit to Tom G for helping!
        salary = soup.find_all('span', {'class':"lwHBT6d"})
        salary = str(salary).replace(",","")
        try:
            base_salary = re.search(r'([$])(\d+(?:\.\d{2})?)', salary).groups()
            base_salary = base_salary[1]
        except:
            base_salary = "Not Given"
        # creating compensation column
        row['pay'] = base_salary
        
        # looking for contract type
        if "Full Time" in salary:
            contract = "Full Time"
        elif "Casual" in salary:
            contract = "Casual"
        elif "Part Time" in salary:
            contract = "Part Time"
        elif "Contract/Temp" in salary:
            contract = "Contract/Temp"
        else:
            contract = "Not Given"
        # Creating the contract_type column
        row['contract'] = contract

        # cleaning up 'location'
        try:
            location = str(soup.find_all('span', {'class':"eBOHjGN"})).replace('<span class="eBOHjGN">','').replace('<span class="_2njvnpA">','').split("</span>")[1].replace("amp;","")
        except:
            location = "Not Given"
        # Creating Location
        row['location_desc'] = location

           
        # cleaning up 'job desc'
        try:
            job_des = str(soup.find_all('span', {'class':"eBOHjGN"})).replace('<span class="eBOHjGN">','').replace('<span class="_2njvnpA">','').split("</span>")[3].replace("amp;","")
        except:
            job_des = "Not Given"
        # Creating job description column
        row['short_desc'] = job_des 

        
        # Appending the row
        raw_data.append(row)

In [None]:
job_puller (link_list)
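`job_puller` strips markup with a regex and removes `amp;` by hand. A minimal sketch of the same idea on a made-up HTML fragment, using `html.unescape` from the standard library to decode entities such as `&amp;` in one step. The helper name `strip_tags` and the sample markup are assumptions for illustration; on real pages, BeautifulSoup's `get_text()` is usually the more robust route.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def strip_tags(fragment):
    """Remove HTML tags, then decode entities like &amp;."""
    no_tags = TAG_RE.sub('', fragment)
    return html.unescape(no_tags).strip()

# Made-up fragment, loosely shaped like the category markup
sample = '<strong class="x">Information &amp; Communication Technology</strong>'
print(strip_tags(sample))
```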

In [175]:
# Making it into a DataFrame
full_df = pd.DataFrame(raw_data)

In [177]:
# Joining with the previous scraped parts
full_df = link_df.join(full_df)

In [180]:
# Done! Saving a CSV to ensure I don't lose it again
# I'll consider this my corpus (pre-cleaning and vectorizing)
full_df.to_csv('full_file.csv', index=False)
full_df.head()

Unnamed: 0,title,url,salary,company,city,cat,full_desc,pay,contract,location_desc,short_desc
0,Deputy Project Managers x 2 (SFIA 4),https://www.seek.com.au/job/41308040?type=prom...,Top $'s Paid ! Contract extensions likely !,Bright Consulting,ACT,", Information & Communication Technology",Bright Consulting is Seeking two Deputy Projec...,Not Given,Contract/Temp,Programme & Project Management,Programme & Project Management
1,Data Analyst,https://www.seek.com.au/job/41346662?type=stan...,Top $'s Paid ! Contract extensions likely !,Quality People,Perth,", Information & Communication Technology",Our client URGENTLY requires a Data Analyst fo...,Not Given,Contract/Temp,Business/Systems Analysts,Business/Systems Analysts
2,Data Analyst,https://www.seek.com.au/job/41346677?type=stan...,"Remuneration: $97,812 - $116,013 per annum",Quality People,Brisbane,", Information & Communication Technology",Our client URGENTLY requires a Data Analyst fo...,Not Given,Contract/Temp,Business/Systems Analysts,Business/Systems Analysts
3,Data Analyst,https://www.seek.com.au/job/41346675?type=stan...,Top $'s Paid ! Contract extensions likely !,Quality People,Adelaide,", Information & Communication Technology",Our client URGENTLY requires a Data Analyst fo...,Not Given,Contract/Temp,Business/Systems Analysts,Business/Systems Analysts
4,"Data Analyst, Data Governance & Management",https://www.seek.com.au/job/41343264?type=stan...,Top $'s Paid ! Contract extensions likely !,Cancer Institute NSW,Sydney,", Government & Defence",Employment Type: Permanent Full TimePosition C...,97812,Full Time,"CBD, Inner West & Eastern Suburbs",Government - State


In [183]:
# just quadruple checking that there are no duplicates
full_df.drop_duplicates(['title','url'], keep='first').shape

(3800, 11)

Nice, the DataFrame looks good now.

I will have a look and start cleaning next.

---
##### back-up plan/side project
I just pulled the full HTML of a few pages. If I have the time I'll try to parse them with regex to get the info out.

Should be fun!

In [92]:
# Using Selenium to get the full HTML of each job page
# Initiating Selenium
driver = webdriver.Chrome()
# Creating an empty list to collect the rows
raw_html=[]
# looping through each job
for j in link_list:
    # Start a fresh row for every job so a failed request cannot overwrite the previous one
    row = {}
    try:
        driver.get(j)
        sleep(1)
        # Full page source of the job posting
        row['full_html'] = driver.page_source
    except Exception:
        # Placeholder when the page could not be loaded
        row['full_html'] = 'NaN'
    raw_html.append(row)
print(np.shape(raw_html))
html = pd.DataFrame(raw_html)
html.to_csv('html_part_1.csv',index=False)

(3781,)


In [6]:
driver.close()
raw_df_1 = pd.read_csv('html_part_1.csv')
# raw_df_2 = pd.read_csv()

#### "EDA" and cleaning
Checking what we got and possibly cleaning it a little.

I just want to look at what we have before starting to clean it all

In [4]:
full_df = pd.read_csv('full_file.csv')

In [29]:
full_df.title.value_counts().head(20)

Business Analyst                                           111
Solution Architect                                          49
Data Analyst                                                48
Senior Business Analyst                                     39
Technical Business Analyst                                  30
Data Engineer                                               28
Administration Assistant                                    15
Data Scientist                                              14
Administration Officer                                      12
Business Intelligence Developer                             11
Senior Data Engineer                                        11
Customer Service Officer                                    11
Clinical Nurse (Data Manager - Neonatal Intensive Care)     10
Administrator                                               10
Analyst                                                     10
Research Assistant                                     

From looking at the 50 job titles with the most entries, we see that a few have no connection to data, such as electrician.

Let's take a closer look at the ones that are not so obvious and decide what to do.

They are:
- Administration Assistant
- Administration Officer
- Analyst
- Administrator
- Research Assistant
- Commercial Analyst
- Health Information Manager
- Psychologist  Behaviour Support Team Leader 
- Intelligence Analyst (Signals)
- Financial Analyst
- Reporting Analyst
- Pricing Analyst
- Program Scheduler 
- Insolvency Analysts


In [8]:
list(full_df.loc[full_df['title'] == 'Insolvency Analysts']['full_desc'])[3] 

"\nNumber of positions available\nUtilise your analytical and/or legal experience in a varied role - AS 4 level\nCollegiate and supportive team environment\n\nAt ASIC there is a reason for everything we do, every law we regulate, every action we take, every interaction we have with industry and consumers. We're proud of the difference we make to Australia's economic reputation and wellbeing.\nThe team\nInsolvency Practitioners (IP) provide technical insolvency advice within ASIC and undertake three externally focused functions: Liquidator compliance- regulating the conduct of Registered Liquidators (RLs); the Assetless Administration Fund (AAF) - assessing liquidators' applications for funding; and Liaison activities - engagement with the insolvency profession and promoting ASIC's functions and expectations.\nThe role\nAs an Analyst, you will be required to:\n\nundertake compliance activities that monitor and regulate registered liquidator conduct, including transaction and risk-based 

Jobs to drop:
- APS4/5 Delivery and Engagement Officer
- Postdoctoral Research Fellow (Biostatistician)
- Property Manager
- Customer Experience Specialist (Pharmaceuticals) - Sydney
- Storeperson
- Revenue Officer
- Operational Services Officer                                                        
- Office Support / Receptionist / Data Entry  
- Graduate Geologist 
- Accounts Assistant
- Training Coordinator
- Automotive Technician/Mechanic
- Specialist Business Standards Governance | Perth
- Accountant                                                                          
- Accounts Payable Officer                                                            
- Research Associate/Fellow - Optimisation, Optimal Control or Operations Research
- Bioanalytical Chemist
- Warehouse Storeperson 
- Administration Officer - level 3
- Product Lead
- Accounts Payable 
- Administration & Office Support
- Biostatistician
- Geotechnical Engineer                                                 
- Customer Advisor - Car Servicing                                      
- Level 1/2  Network Integration Support (M17) 
- Administration Assistant - not related to data
- Administration Officer - not related to data
- Research Assistant
- Health Information Manager
- Psychologist  Behaviour Support Team Leader 
- Intelligence Analyst (Signals)
- Administration
- Insolvency Analysts
- Self-employed Field Technicians
- Bookkeeper
- Payroll Officer
- Acoustic Analyst Submariner
- Assistant Accountant
- Program Scheduler
- Medical Receptionist
- Service Desk Analyst Helpdesk Level 2
- Application Support Analyst
- Shift Lead Technician
- Project Officer
- Accounts Officer
- Electrician
- Purchasing Officer
- Recepciotinist
- Legal Officer-Child Advocate
- Accommodation Manager
- Management Accountant
- Laboratory Analyst
- Water Resource Officer, Evaluation & Reporting
- Lead Document Control Administration
- Chief Scientist
- Document Controller
- Asset Management Availability Focal
- Water Resource Officer, Social Analyst
- Laboratory Technician
- Mine Surveyor



In [32]:
jobs_to_drop = ['APS4/5 Delivery and Engagement Officer','Postdoctoral Research Fellow (Biostatistician)','Property Manager',
 'Customer Experience Specialist (Pharmaceuticals) - Sydney', 'Storeperson','Revenue Officer','Operational Services Officer',
 'Office Support / Receptionist / Data Entry','Graduate Geologist', 'Accounts Assistant','Training Coordinator',
 'Automotive Technician/Mechanic','Specialist Business Standards Governance | Perth','Accountant','Accounts Payable Officer',
 'Research Associate/Fellow - Optimisation, Optimal Control or Operations Research','Bioanalytical Chemist',
 'Warehouse Storeperson','Administration Officer - level 3','Product Lead','Accounts Payable', 'Administration & Office Support',
 'Biostatistician','Geotechnical Engineer','Customer Advisor - Car Servicing','Level 1/2  Network Integration Support (M17)',
 'Administration Assistant - Not relatedto data','Administration Officer- Not relatedto data','Research Assistant',
 'Health Information Manager','Psychologist  Behaviour Support Team Leader','Intelligence Analyst (Signals)','Administration',
 'Insolvency Analysts','Self-employed Field Technicians','Bookkeeper','Payroll Officer','Acoustic Analyst Submariner',
 'Assistant Accountant','Program Scheduler','Medical Receptionist','Service Desk Analyst Helpdesk Level 2',
 'Application Support Analyst','Shift Lead Technician','Project Officer','Accounts Officer','Electrician','Purchasing Officer',
 'Recepciotinist','Legal Officer-Child Advocate','Accommodation Manager','Legal Officer-Child Advocate','Accommodation Manager',
 'Management Accountant','Laboratory Analyst','Water Resource Officer, Evaluation & Reporting',
 'Lead Document Control Administration','Chief Scientist','Document Controller', 'Asset Management Availability Focal',
 'Water Resource Officer', 'Social Analyst','Laboratory Technician','Mine Surveyor']


In [104]:
# Keep only the rows whose title is not in the drop list
clean_df = full_df[~full_df['title'].isin(jobs_to_drop)].copy()

In [47]:
clean_df.shape

(3594, 11)

### Cleaning the description using regex
Now to clean the columns, starting with the description

In [48]:
# Cleaning the description to make a term matrix
# This function can be used with apply and covers the basics: lowercasing, punctuation, numbers
def first_clean_description(text):
    # Turning text into lowercase
    text = text.lower()
    # removing punctuation from text
    text = re.sub('[%s]' %re.escape(string.punctuation), '',text)
    # removing any token that contains a digit
    text = re.sub(r'\w*\d\w*', '', text)
    # returning the clean text
    return text
    

In [105]:
# Applying first round of cleaning on the title column
clean_df['title'] = clean_df['title'].apply(lambda x: first_clean_description(x) )

In [106]:
# Applying first round of cleaning on the description column
clean_df['full_desc'] = clean_df['full_desc'].apply(lambda x: first_clean_description(str(x)) )

In [63]:
list(clean_df.loc[clean_df['title'] == 'technical business analyst']['full_desc'])[5] 

'working for amp \nworking for amp means being part of a company that values diverse thinking encourages collaboration and promotes innovation \xa0it’s an environment that offers challenging and exciting work as well as opportunities for professional growth we’re flexible enough to allow you to make the most of your life both professionally and personally \nwe are looking for those that have the courage and agility to navigate changing and complex environments so that we can deliver the best solutions for our customers we value people with integrity an innate willingness to help others and an eagerness to perform to the best of their abilities\xa0\nwe’re transforming our business and we need people like you to join us on this journey\n\xa0\nabout the role\nas a key technical role reporting to the platform super manager – cloud business applications the technical business analyst is accountable for working closely with servicenow sme team integration engineers developers and testers to 

There are still a few problems, so we will do a second round of cleaning.

In [107]:
# Cleaning the description to make a term matrix
# There is still some stuff to clean out
def second_clean_description(text):
    # removing newline characters
    text = re.sub('\n', '', text)
    # removing non-breaking spaces
    text = re.sub('\xa0', '', text)
    # removing straight apostrophes
    text = re.sub("'", '', text)
    # removing email addresses
    text = re.sub(r'([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)' , '',text)
    # returning the clean text
    return text

In [108]:
# Applying second round of cleaning on the description column
clean_df['full_desc'] = clean_df['full_desc'].apply(lambda x: second_clean_description(str(x)) )

In [69]:
list(clean_df.loc[clean_df['title'] == 'technical business analyst']['full_desc'])[5] 

'working for amp working for amp means being part of a company that values diverse thinking encourages collaboration and promotes innovation it’s an environment that offers challenging and exciting work as well as opportunities for professional growth we’re flexible enough to allow you to make the most of your life both professionally and personally we are looking for those that have the courage and agility to navigate changing and complex environments so that we can deliver the best solutions for our customers we value people with integrity an innate willingness to help others and an eagerness to perform to the best of their abilitieswe’re transforming our business and we need people like you to join us on this journeyabout the roleas a key technical role reporting to the platform super manager – cloud business applications the technical business analyst is accountable for working closely with servicenow sme team integration engineers developers and testers to design document and test
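Two side effects are visible in the output above: replacing `\n` with an empty string fuses neighbouring words (`abilitieswe’re`, `journeyabout`, `roleas`), and the email pattern can no longer match because the first round already removed `@` and `.`. A possible variant, offered as a sketch and not the cleaning used for the results in this notebook, removes emails first and turns whitespace characters into a single space. The name `clean_description` and the sample text are hypothetical.

```python
import re
import string

EMAIL_RE = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

def clean_description(text):
    text = text.lower()
    # Emails go first, while '@' and '.' are still present
    text = EMAIL_RE.sub(' ', text)
    # Strip ASCII punctuation
    text = PUNCT_RE.sub('', text)
    # Drop tokens that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Collapse newlines, non-breaking spaces and repeated blanks into one space
    return re.sub(r'\s+', ' ', text).strip()

# Made-up text for illustration
sample = 'Join us on this journey\n\xa0\nAbout the role: email jobs@example.com by 2020!'
print(clean_description(sample))
```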

Good enough! Next column: salaries.

In [89]:
# Cleaning the salary column
# This function can be used with apply; it keeps only the digits and spaces
def first_clean_salary(text):
    # removing punctuation from text
    text = re.sub('[%s]' %re.escape(string.punctuation), '',text)
    # removing letters, leaving only the numbers
    text = re.sub('[a-zA-Z]*', '', text)
    # returning the clean text
    return text
    

In [109]:
clean_df['salary'] = clean_df['salary'].apply(lambda x: first_clean_salary(x))

In [110]:
clean_df['salary'].value_counts()

                    2455
                      51
                      36
                      32
                      31
                    ... 
4546  6153             1
800  900               1
  800                  1
140000  149999         1
750  800               1
Name: salary, Length: 676, dtype: int64

In [None]:
# Not the best, but I am not sure how to make it better
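One option for the free-text `salary` column, sketched here under the assumption that amounts are written with a leading `$` and optional thousands separators (as in `$97,812 - $116,013 per annum` from the table above): pull out every dollar figure and keep the lowest and highest. The function name `salary_range` is hypothetical, and strings with `k` suffixes or hourly and daily rates would need extra handling.

```python
import re

AMOUNT_RE = re.compile(r'\$\s?(\d[\d,]*(?:\.\d+)?)')

def salary_range(text):
    """Return (low, high) as floats, or None when no dollar amount is found."""
    amounts = [float(m.replace(',', '')) for m in AMOUNT_RE.findall(text)]
    if not amounts:
        return None
    return min(amounts), max(amounts)

print(salary_range('Remuneration: $97,812 - $116,013 per annum'))
print(salary_range("Top $'s Paid ! Contract extensions likely !"))
```

Returning `None` for text without a figure keeps marketing blurbs from being mistaken for numbers.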

In [112]:
clean_df['pay'].value_counts()

Not Given    2695
60000          27
700            26
100            23
50000          23
             ... 
715             1
69212           1
88410.40        1
68000.00        1
1300            1
Name: pay, Length: 221, dtype: int64

In [117]:
clean_df['cat'].value_counts()

, Information & Communication Technology    1600
, Administration & Office Support            277
, Government & Defence                       223
, Accounting                                 215
, Banking & Financial Services               209
, Manufacturing, Transport & Logistics       117
, Marketing & Communications                 112
, Healthcare & Medical                       106
, Mining, Resources & Energy                 100
, Science & Technology                        71
, Engineering                                 63
, Consulting & Strategy                       61
, Sales                                       54
, Human Resources & Recruitment               52
, Trades & Services                           42
, Insurance & Superannuation                  42
Not Given                                     35
, Retail & Consumer Products                  34
, Call Centre & Customer Service              32
, Education & Training                        31
, Community Services

In [None]:
# Cleaning helper for text columns
# This function can be used with apply; it removes punctuation, including ampersands
def first_clean_salary(text):
    # removing punctuation from text
    text = re.sub('[%s]' %re.escape(string.punctuation), '',text)
    # removing any remaining ampersands
    text = re.sub('&*', '', text)
    # returning the clean text
    return text

In [137]:
# Cleaning Cat a little.
clean_df['cat'] = clean_df['cat'].str.strip(',').str.strip(' ')

Dropping rows that have no pay information.

In [142]:
# Dropping the jobs that have no pay information
lean_df = clean_df.drop(list(clean_df.loc[clean_df['pay'] == 'Not Given'].index))

In [144]:
def first_clean_pay(text):
    # removing punctuation from text
    text = re.sub('[%s]' %re.escape(string.punctuation), '',text)
    # returning the clean text
    return text

In [146]:
lean_df['pay'] = lean_df['pay'].apply(lambda x: first_clean_pay(x))

In [148]:
# Turning the numbers into integers
lean_df['pay'] = lean_df['pay'].apply(lambda x: int(x))
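A caveat on the two cells above: `string.punctuation` includes the decimal point, so a value such as `88410.40` (present in the `pay` counts earlier) turns into `8841040` once the dot is stripped and the string is cast to `int`. A minimal sketch of a conversion that keeps the magnitude intact, assuming `pay` strings contain only digits and an optional decimal part; `pay_to_int` is a hypothetical helper, not the function used for the results reported here.

```python
def pay_to_int(text):
    """Convert a pay string like '88410.40' to whole dollars."""
    return int(float(text))

# What stripping the dot does to a decimal value
inflated = int('88410.40'.replace('.', ''))

print(inflated)
print(pay_to_int('88410.40'))
print(pay_to_int('60000'))
```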

In [154]:
lean_df['contract'].value_counts()

Full Time        584
Contract/Temp    280
Part Time         20
Casual            15
Name: contract, dtype: int64

In [159]:
# Dropping the salary column as it has no useful information
lean_df.drop('salary', axis=1, inplace=True)

In [213]:
# Making sure that the columns that will be dummified have a single token in each row
# Hopefully this makes the tokenized words easier to interpret
lean_df.city = lean_df.city.str.strip(' ').apply(lambda x: x.replace(' ','_').replace('&',''))
lean_df.cat = lean_df.cat.str.strip(' ').apply(lambda x: x.replace(' ','_').replace('&',''))
lean_df.contract = lean_df.contract.str.strip(' ').apply(lambda x: x.replace(' ','_').replace('&',''))
lean_df.location_desc = lean_df.location_desc.str.strip(' ').apply(lambda x: x.replace(' ','_').replace('&',''))
lean_df.short_desc = lean_df.short_desc.str.strip(' ').apply(lambda x: x.replace(' ','_').replace('&',''))

In [214]:
# Dropping jobs that pay 60k or less. They are likely paid per day or have a different remuneration structure
model_df = lean_df.loc[lean_df['pay'] > 60000]

In [215]:
# Saving CSV, just in case...
model_df.to_csv('clean_data.csv')

---
Okay! Let's do some count vectorizing and see what we get!

I want to get the top words and the unique words used (specific/technical vocabulary).

First I will tokenize (CountVectorizer), then I'll run some models to find which works best with this type of data.
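To make the idea concrete before bringing in scikit-learn: count vectorizing boils down to counting how often each token appears in the documents. A minimal standard-library sketch of that idea follows. It is a stand-in for illustration and not `CountVectorizer` itself; the helper name `top_words`, the two sample descriptions and the tiny stop-word set are all made up.

```python
from collections import Counter

# Illustrative stop-word set only
STOP_WORDS = {'and', 'the', 'for', 'with', 'a', 'of', 'to'}

def top_words(docs, n=3):
    """Count whitespace-separated tokens across docs and return the n most common."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in doc.lower().split() if t not in STOP_WORDS)
    return counts.most_common(n)

docs = [
    'data engineer with python and sql',
    'senior data analyst sql reporting',
]
print(top_words(docs, n=2))
```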

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics
# Using NB models to test the tokenizer
# And LogisticRegression because the target is binary
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

In [3]:
model_df = pd.read_csv('clean_data.csv')

In [7]:
model_df.drop('Unnamed: 0',axis=1,inplace=True)

In [8]:
# The first thing will be to determine the target
# I will try a classification model using a threshold to determine high/low salary
# creating a column for low/high salary
model_df['salary'] = 0
model_df.loc[model_df['pay'] >= 100000, 'salary'] = 1

In [9]:
# anything over 100K is high. Pretty good class division!
model_df['salary'].value_counts()

0    156
1    150
Name: salary, dtype: int64

From the split above we can see that we only have 306 jobs.

As stated at the beginning of the project, we need at least 500 jobs to get an acceptable result.
We likely fell short of 500 jobs because of when the scrape was made. It will be interesting to run the same code in the future and see how many jobs we get.

For this project we will start a new notebook using the CSV provided by GA.