## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
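
As a minimal sketch of the classification framing described above, the snippet below turns a numeric salary column into a binary high/low label by thresholding at the median. The toy salary values and the `high_salary` name are assumptions for illustration only, not values from any scraped data.

```python
import pandas as pd

# Toy monthly salaries (illustrative values only)
salaries = pd.Series([2500, 3000, 4500, 7000, 11000], name="mean_salary")

# Threshold at the median: 1 = above the median ("high"), 0 = at or below it ("low")
threshold = salaries.median()
high_salary = (salaries > threshold).astype(int)

print(threshold)              # 4500.0
print(high_salary.tolist())   # [0, 0, 0, 1, 1]
```

Other cut points (quartiles, for example) would give a multi-class target instead; the median is just the threshold the brief suggests.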

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

In [1]:
import pickle
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import unicodedata

import numpy as np
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

In [2]:
# load the pickled DataFrame of scraped postings; the context manager closes the file
with open('all_jobs', 'rb') as file:
    all_jobs = pickle.load(file)

all_jobs.head(5)

# 2. Data Cleaning

In [3]:
# Flag repeated long job titles: a title longer than 40 characters that
# appears more than once is likely the same posting scraped twice.
uniquetitle, duplicate_index = [], []
for i, x in zip(all_jobs.index, all_jobs['job_title']):
    if len(x) > 40:
        if x in uniquetitle:
            duplicate_index.append(i)  # index label of the repeated posting
        else:
            uniquetitle.append(x)

In [4]:
all_jobs.drop(index=duplicate_index, axis=0, inplace=True)
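
The loop-based duplicate check above can also be written with vectorized pandas operations. The sketch below is one possible equivalent, not the notebook's own method; the toy `jobs` frame and its titles are assumptions made up for illustration.

```python
import pandas as pd

# Toy frame with a non-default index, to show that index labels (not positions) are returned
jobs = pd.DataFrame(
    {"job_title": [
        "data analyst",                                            # short: never flagged
        "senior data scientist, machine learning and analytics",   # long, first occurrence
        "data analyst",                                            # short repeat: kept
        "senior data scientist, machine learning and analytics",   # long repeat: flagged
    ]},
    index=[10, 11, 12, 13],
)

is_long = jobs["job_title"].str.len() > 40               # same 40-character rule as the loop
is_repeat = jobs["job_title"].duplicated(keep="first")   # True for second and later occurrences
duplicate_labels = jobs.index[is_long & is_repeat].tolist()

deduped = jobs.drop(index=duplicate_labels)
print(duplicate_labels)        # [13]
print(deduped.index.tolist())  # [10, 11, 12]
```

Because `duplicated` works on index labels rather than positions, this version stays correct even if the DataFrame's index is no longer a clean 0..n-1 range.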

In [5]:
all_jobs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3088 entries, 0 to 3234
Data columns (total 20 columns):
company                         2961 non-null object
country                         2334 non-null object
job_title                       3088 non-null object
description                     281 non-null object
required_skills                 3088 non-null object
date_created                    586 non-null object
equity                          62 non-null object
job_category                    2334 non-null object
job_type                        281 non-null object
last_updated                    281 non-null object
vacancies                       281 non-null float64
salary_range                    488 non-null object
years_of_experience_required    2333 non-null object
currency                        226 non-null object
lower                           226 non-null object
higher                          226 non-null float64
rate                            226 non-null float64
l

In [6]:
# fill missing lower_sgd and higher_sgd values from salary_range, then cast both to float
all_jobs['lower_sgd'] = all_jobs['lower_sgd'].fillna(all_jobs['salary_range']).astype(float)
all_jobs['higher_sgd'] = all_jobs['higher_sgd'].fillna(all_jobs['salary_range']).astype(float)

# create new average salary column
all_jobs['mean_salary_sgd'] = (all_jobs['lower_sgd'] + all_jobs['higher_sgd'])/2

In [7]:
# replace missing descriptions with empty strings so string concatenation does not produce NaN
all_jobs['description'] = all_jobs['description'].fillna('')

In [8]:
# combine JD with required skills into 1 column
all_jobs['full_description'] = all_jobs['description'] + ' ' + all_jobs['required_skills']

In [9]:
# normalize Unicode in the description to its NFKD (compatibility decomposition) form
def clean_unicode(unicode_str):
    return unicodedata.normalize("NFKD", unicode_str)

all_jobs['full_description'] = all_jobs['full_description'].str.replace('\n',' ').apply(clean_unicode)
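
To make the effect of NFKD normalization concrete, here is a small standalone sketch using only the standard library. The sample strings are made up for illustration; they are not taken from the scraped descriptions.

```python
import unicodedata

samples = {
    "no-break space": "data\u00a0scientist",   # U+00A0 between the two words
    "ligature": "\ufb01nance",                 # U+FB01 LATIN SMALL LIGATURE FI
    "accented": "r\u00e9sum\u00e9",            # precomposed e-acute
}

normalized = {k: unicodedata.normalize("NFKD", v) for k, v in samples.items()}

print(normalized["no-break space"] == "data scientist")   # True
print(normalized["ligature"])                             # finance
print(len(normalized["accented"]))                        # 8: each accent becomes a separate combining mark
```

Note that NFKD decomposes accented characters rather than stripping the accents, so text that should match plain ASCII tokens may need a further step to drop combining marks.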

In [10]:
all_jobs['job_title'] = all_jobs['job_title'].str.lower()

# data_list=['data','deep learning','machine learning','ai ',' ai','nlp','ml ', ' ml','artificial intelligence']

def is_category(x):
    """Assign a coarse role label from keywords in the job title.

    Rules are applied in order and later matches overwrite earlier ones,
    so the effective priority is intern > leadership > scientist > analyst
    > engineer > database.
    """
    x = x.lower()
    ans = "dont_care"

    if "database" in x or "data base" in x:
        ans = "database"

    engin_set = ['engineer', 'dev', 'operation', 'architect', 'admin',
                 'program', 'analy', 'miner', 'algorithm']
    if any(item in x for item in engin_set):
        ans = "engineer"

    if "analy" in x:
        ans = "analyst"

    if "scien" in x or "research" in x:
        ans = "scientist"

    leadership_set = ['head', 'director', 'manager', 'consult', 'lead', 'vp', 'presi', 'chief']
    if any(item in x for item in leadership_set):
        ans = "leadership"

    if "intern" in x:
        ans = "intern"
    return ans

all_jobs["is_category"] = all_jobs['job_title'].apply(is_category)
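
The keyword rules above are order-dependent and rely on substring matching, which is easy to misread. The sketch below restates the same rules in condensed form and runs them on a few made-up titles (assumptions for illustration) to show which rule wins. The word-boundary regex at the end is a hypothetical refinement, not something the notebook uses.

```python
import re

def is_category(title):
    """Condensed restatement of the labeller above; later rules overwrite earlier ones."""
    x = title.lower()
    ans = "dont_care"
    if "database" in x or "data base" in x:
        ans = "database"
    if any(k in x for k in ["engineer", "dev", "operation", "architect", "admin",
                            "program", "analy", "miner", "algorithm"]):
        ans = "engineer"
    if "analy" in x:
        ans = "analyst"
    if "scien" in x or "research" in x:
        ans = "scientist"
    if any(k in x for k in ["head", "director", "manager", "consult", "lead",
                            "vp", "presi", "chief"]):
        ans = "leadership"
    if "intern" in x:
        ans = "intern"
    return ans

labels = {t: is_category(t) for t in [
    "Data Scientist",
    "Database Administrator",   # "admin" overrides "database"
    "Research Analyst",         # "research" overrides "analy"
    "Data Analytics Lead",      # "lead" overrides "analy"
    "Internal Audit Analyst",   # "intern" matches inside "internal"
]}
for title, label in labels.items():
    print(f"{title!r} -> {label}")

# Hypothetical refinement: match "intern" only as a whole word
whole_word_intern = re.compile(r"\bintern\b")
print(bool(whole_word_intern.search("internal audit analyst")))   # False
print(bool(whole_word_intern.search("data science intern")))      # True
```

The "Internal Audit Analyst" case shows the main risk of plain substring rules: any title containing "internal" or "international" is labelled `intern`.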

In [11]:
# cleaning the job categories column and using it to remove dubious scrapes

all_jobs.job_category.unique()

array(['Data & Analytics', 'Community Management',
       'Sales & Business Development', 'Project & Product Management',
       'Marketing & PR', 'Finance, Legal & Accounting',
       'Logistics & Operations', 'Web Development',
       'Enterprise Software & Systems', 'Executive & Management',
       'Customer Service', 'Administrative & Clerical',
       'Media & Journalism', 'UX/UI Design', 'DevOps & Cloud Management',
       'Mobile Development', 'Graphic & Motion Design', nan,
       'Human Resources Management/Consulting', 'Government/Defence',
       'Healthcare/Medical',
       'BioTechnology/Pharmaceutical/Clinical research', 'Insurance',
       'Banking/Financial Services', 'Retail/Merchandise',
       'Advertising/Marketing/Promotion/PR', 'Oil/Gas/Petroleum',
       'Semiconductor/Wafer Fabrication', 'Accounting/Audit/Tax Services',
       'Utilities/Power', 'Manufacturing/Production', 'Hotel/Hospitality',
       'Consulting (IT, Science, Engineering & Technical)',
       'A

In [12]:
# Remove job categories unrelated to data roles (hard-coded list)

jobdel = [
    'Administrative & Clerical', 'Media & Journalism', 'Graphic & Motion Design',
    'Oil/Gas/Petroleum', 'Semiconductor/Wafer Fabrication', 'Accounting/Audit/Tax Services',
    'Hotel/Hospitality', 'Telecommunication', 'Exhibitions/Event management/MICE',
    'Chemical/Fertilizers/Pesticides', 'Architectural Services/Interior Designing',
    'Polymer/Plastic/Rubber/Tyres', 'Marine/Aquaculture', 'Wood/Fibre/Paper',
    'Repair & Maintenance Services', 'Grooming/Beauty/Fitness', 'Stockbroking/Securities',
    'Law/Legal', 'Automobile/Automotive Ancillary/Vehicle', 'Environment/Health/Safety',
    'Sports', 'Travel/Tourism', 'Heavy Industrial/Machinery/Equipment',
]

In [13]:
all_jobs = all_jobs[~all_jobs['job_category'].isin(jobdel)]

In [14]:
# keep a copy of the cleaned postings (before any salary filtering) for the Q2 analysis
all_jobsraw = all_jobs.copy()

In [15]:
# check for outliers across all countries - per month salary logically should not be less than $50 SGD
# The average wage per person in Vietnam is around 3.2 million VND ($150) a month
outliers = all_jobs[all_jobs['mean_salary_sgd'] < 50]

# first outlier is real - edit salary to match
condition = all_jobs['job_title'] == outliers['job_title'].iloc[0]
all_jobs['mean_salary_sgd'] = np.where(condition, 2800, all_jobs['mean_salary_sgd'])

# remove remaining monthly salaries of 50 SGD or less as irrelevant jobs
# note: rows with a missing salary (NaN) also fail this comparison and are dropped here
all_jobs = all_jobs[all_jobs['mean_salary_sgd'] > 50]
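
A comparison filter like the one above silently drops rows whose salary is missing, because any comparison with NaN evaluates to False. The toy example below (made-up values, for illustration only) shows the effect and an explicit equivalent.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"mean_salary_sgd": [3000.0, np.nan, 20.0, 50.0, 4500.0]})

# NaN > 50 is False and 50 > 50 is False, so both rows are removed along with the 20.0 row
kept = toy[toy["mean_salary_sgd"] > 50]
print(kept["mean_salary_sgd"].tolist())   # [3000.0, 4500.0]

# Spelling out the missing-value condition gives the same rows but makes the intent visible
explicit = toy[toy["mean_salary_sgd"].notna() & (toy["mean_salary_sgd"] > 50)]
print(explicit.equals(kept))              # True
```

This is why the later salary summary covers only postings that listed a salary, even though no explicit `dropna` call appears.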

In [16]:
# typo - should be 3500 instead of 35005, and 4300 instead of 43001
condition = all_jobs['mean_salary_sgd'] == 35005
all_jobs['mean_salary_sgd'] = np.where(condition, 3500, all_jobs['mean_salary_sgd'])

condition = all_jobs['mean_salary_sgd'] == 43001
all_jobs['mean_salary_sgd'] = np.where(condition, 4300, all_jobs['mean_salary_sgd'])

# incorrect units - likely yearly salary instead of monthly salary (MYR 70,000 and SGD 1,000,000)
# divide outlier salary by 12 to convert to monthly salary
condition = all_jobs['mean_salary_sgd'] > 20000
all_jobs['mean_salary_sgd'] = np.where(condition, all_jobs['mean_salary_sgd']/12, all_jobs['mean_salary_sgd'])

# remove monthly salaries greater than 20,000sgd as outliers
all_jobs = all_jobs[all_jobs['mean_salary_sgd'] < 20000]

In [17]:
all_jobs['mean_salary_sgd'].describe()

count      480.000000
mean      3231.245851
std       2349.369360
min         70.000000
25%       1127.000000
50%       3000.000000
75%       4500.000000
max      11000.000000
Name: mean_salary_sgd, dtype: float64

In [18]:
all_jobs.is_category.value_counts()

engineer      141
analyst       125
dont_care      84
leadership     69
scientist      39
intern         22
Name: is_category, dtype: int64

In [19]:
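# all_jobs now holds only postings with a salary (the outlier filters above also dropped NaN rows),
# so take a copy of it for modelling
df1 = all_jobs.copy()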

In [20]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 0 to 3216
Data columns (total 23 columns):
company                         479 non-null object
country                         480 non-null object
job_title                       480 non-null object
description                     480 non-null object
required_skills                 480 non-null object
date_created                    223 non-null object
equity                          57 non-null object
job_category                    480 non-null object
job_type                        223 non-null object
last_updated                    223 non-null object
vacancies                       223 non-null float64
salary_range                    480 non-null object
years_of_experience_required    479 non-null object
currency                        223 non-null object
lower                           223 non-null object
higher                          223 non-null float64
rate                            223 non-null float64
lower_sg

In [21]:
df1['jobcat'] = df1['job_category'].map(lambda x: 1 if x == 'Human Resources Management/Consulting' else 0)

In [22]:
X = df1.loc[:,['is_category','jobcat', 'full_description']]
X = pd.get_dummies(X, columns=['is_category'],drop_first=True)
X = X.reset_index(drop=True)
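
For readers unfamiliar with `drop_first=True`, the sketch below shows what it does on a tiny made-up column: the alphabetically first category is dropped and becomes the implicit baseline.

```python
import pandas as pd

toy = pd.DataFrame({"is_category": ["engineer", "analyst", "scientist", "analyst"]})

# Categories are sorted (analyst, engineer, scientist) and the first one is dropped
encoded = pd.get_dummies(toy, columns=["is_category"], drop_first=True)

print(list(encoded.columns))                 # ['is_category_engineer', 'is_category_scientist']
print(encoded.astype(int).values.tolist())   # [[1, 0], [0, 0], [0, 1], [0, 0]]
```

A row of all zeros therefore means the baseline category (`analyst` here), which matters when interpreting model coefficients for the remaining dummy columns.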

# 3. Feature extraction using TF-IDF

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', ngram_range=(2,3),max_features=2000)
textvec = vect.fit_transform(X['full_description'])

In [24]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
columns = list(vect.get_feature_names_out())


In [25]:
# toarray() returns a plain ndarray (todense() returns the legacy np.matrix type)
textdf = pd.DataFrame(textvec.toarray(), columns=columns)

In [26]:
X = X.drop('full_description', axis=1)
X = X.reset_index(drop=True)
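
Resetting the index matters because `pd.concat(..., axis=1)` aligns rows on index labels, not on position. The sketch below uses made-up frames to show what happens when a filtered frame keeps its old labels and is joined to one with a fresh 0..n-1 index.

```python
import pandas as pd

# Left frame keeps labels from a filtered DataFrame; right frame has a fresh RangeIndex
left = pd.DataFrame({"jobcat": [0, 1, 0]}, index=[0, 5, 9])
right = pd.DataFrame({"tfidf": [0.1, 0.2, 0.3]})   # index 0, 1, 2

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape)   # (5, 2): union of the two indexes, padded with NaN
print(aligned.shape)      # (3, 2): rows line up one-to-one
```

With matching indexes on both sides, a plain column-wise concat is enough and no extra alignment argument is needed.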

In [27]:
textdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Columns: 2000 entries, 000 000 to ying wen ea
dtypes: float64(2000)
memory usage: 7.3 MB


In [28]:
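# X and textdf share the same 0..n-1 RangeIndex, so a plain column-wise concat aligns them
# (the join_axes argument was removed in pandas 1.0)
X1 = pd.concat([X, textdf], axis=1)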
X1.head(2)

,jobcat,is_category_dont_care,is_category_engineer,is_category_intern,is_category_leadership,is_category_scientist,000 000,000 000 basic,...,ying wen,ying wen ea
0,0,0,0,0,0,1,0.0,0.0,...

(wide output abridged: the six structured columns are followed by the 2,000 TF-IDF n-gram columns, `000 000` through `ying wen ea`; almost all values are 0.0)
0,0.11796,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.133566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.138497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139844,0.170549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.186683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
,0.0,0.0,0.0,0.0,0.0,0.099726,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0,0,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171684,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.211438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.42287
7,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.247229,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.251267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.19002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0


In [29]:
y = df1.mean_salary_sgd.values

In [30]:
X1.fillna(0, inplace=True)

In [31]:
X1.isnull().sum()

jobcat                                       0
is_category_dont_care                        0
is_category_engineer                         0
is_category_intern                           0
is_category_leadership                       0
is_category_scientist                        0
000 000                                      0
000 000 basic                                0
000 500                                      0
000 500 higher                               0
000 basic                                    0
000 basic commensurate                       0
000 higher                                   0
000 salary                                   0
000 salary commensurate                      0
00pm salary                                  0
01 singapore                                 0
01c4394 rcb                                  0
01c4394 rcb 200007268e                       0
01c4394cei registration                      0
01c4394cei registration r1219269             0
04c4294ea per

In [32]:
from sklearn.decomposition import PCA
pca = PCA(n_components=15)
pca.fit(X1)

PCA(copy=True, iterated_power='auto', n_components=15, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [33]:
print(pca.explained_variance_ratio_)  
print(sum(pca.explained_variance_ratio_))

[0.17547903 0.1269256  0.08849329 0.05611374 0.03669752 0.02350112
 0.01906956 0.01444719 0.01286519 0.01149481 0.01019614 0.0093229
 0.00823705 0.00807786 0.00669269]
0.607613709219762
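The 15 components kept above account for roughly 61% of the total variance. If a higher share is wanted, the number of components can be picked from the cumulative explained variance instead of being fixed by hand. The sketch below illustrates this on a synthetic matrix; `X_demo` and the 0.90 `target` are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X1 (assumption: any numeric feature matrix works here)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

# Fit with all components so the full variance spectrum is available
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.90  # hypothetical threshold
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, cumulative[n_needed - 1])
```

scikit-learn can also do this selection itself: passing a float between 0 and 1 as `n_components` asks `PCA` to keep enough components to explain that fraction of the variance.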


In [34]:
print(pca.singular_values_)  

[12.30681521 10.46664931  8.73953316  6.95933396  5.62796562  4.5037816
  4.05698529  3.53121967  3.33227732  3.14980743  2.96654568  2.83666841
  2.66636158  2.64047054  2.40344058]


In [35]:
X2 = pca.transform(X1)  # reuse the PCA fitted in In [32] instead of refitting it
X2

array([[-0.75034396,  0.20321523, -0.00817118, ..., -0.01534943,
        -0.01246861,  0.00649175],
       [-0.7484945 ,  0.19890875, -0.00359643, ..., -0.00658009,
         0.00862011,  0.02516281],
       [-0.71339689,  0.17750153, -0.0069416 , ..., -0.00894187,
        -0.02024981,  0.0156703 ],
       ...,
       [ 0.79709552,  0.40306006, -0.04179092, ..., -0.05775552,
        -0.18174244, -0.07667307],
       [ 0.77450072,  0.39479134, -0.04604745, ...,  0.01567915,
         0.08453139, -0.03312637],
       [ 0.33388561, -0.84317728, -0.50721352, ...,  0.04497515,
         0.10771495, -0.03703474]])

In [36]:
from sklearn.model_selection import train_test_split

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=42)
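Note that in the cells above the PCA is fitted on all of `X1` before the split, so the held-out rows influence the components. One way to avoid that is to put the PCA and the regressor in a single pipeline and fit it on the training split only. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the derived names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins for X1 and y (assumption, not the scraped data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 40))
y_demo = X_demo[:, :5].sum(axis=1) * 1000 + rng.normal(scale=500, size=300)

# Split first, so the test rows never reach the PCA fit
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline refits PCA inside every cross-validation fold
model = make_pipeline(PCA(n_components=15), LinearRegression())
cv_scores = cross_val_score(model, X_tr, y_tr, cv=3)

# Final fit on the training split, scored (R^2) on the untouched test split
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(cv_scores, test_r2)
```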

In [38]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [39]:
from sklearn.model_selection import cross_val_score

lr = linear_model.LinearRegression()

# 3-fold cross-validated R^2 on the training split
cross_val_score(lr, X_train, y_train, cv=3)



array([0.33167445, 0.35048543, 0.27485087])

In [40]:
# Train the model using the training sets
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [41]:
y_pred = lr.predict(X_test)

# The coefficients
print('Coefficients: \n', lr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

Coefficients: 
 [ 1243.89496909  -468.09446036  2222.11631839  -236.71480176
 -2128.69657481  -352.07290429  -999.67905082   644.65494035
  -217.85727799  -582.58632336   675.33128479 -2689.96770364
 -3142.92094566 -1867.13195313  3615.08723129]
Mean squared error: 2697821.18


In [42]:
# R^2 (coefficient of determination), computed with r2_score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Variance score: 0.52


In [43]:
lasso = linear_model.Lasso()
lasso.fit(X_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [44]:
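# cv=3 is set explicitly to match the three scores shown (the default is 5 in newer scikit-learn)
cross_val_score(lasso, X_train, y_train, cv=3)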

array([0.33293736, 0.35060433, 0.27800117])

In [45]:
y_pred = lasso.predict(X_test)

# The coefficients
print('Coefficients: \n', lasso.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

Coefficients: 
 [ 1240.71748822  -462.59215024  2215.14691749  -225.41913544
 -2109.89030805  -327.22110887  -976.31850488   608.75956347
  -173.8406491   -525.20852812   625.66194878 -2623.4846474
 -3067.27596946 -1788.03523016  3521.19561575]
Mean squared error: 2702897.93


In [46]:
print('Variance score: %.2f' % r2_score(y_test, y_pred))

Variance score: 0.52
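The mean squared errors printed above are in squared SGD, which is hard to read against actual salaries. Taking the square root gives an error in the same units as `mean_salary_sgd`. The short calculation below uses only the two MSE values printed in this notebook; the variable names are illustrative.

```python
import math

# Test-set MSE values copied from the outputs above
mse_linear = 2697821.18
mse_lasso = 2702897.93

# RMSE is in the same units as the target (SGD)
rmse_linear = math.sqrt(mse_linear)
rmse_lasso = math.sqrt(mse_lasso)

print(f"Linear regression RMSE: {rmse_linear:.0f} SGD")
print(f"Lasso RMSE:             {rmse_lasso:.0f} SGD")
```

Both models are off by roughly 1,640 SGD on average, and the Lasso with its default `alpha=1.0` is almost indistinguishable from ordinary least squares here, which is consistent with the nearly identical coefficients and R^2 values shown above.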


### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.


### BONUS PROBLEM

Your boss would rather incorrectly tell a client that they would get a low-salary job than incorrectly tell them that they would get a high-salary job. Adjust one of your models to ease his mind, and explain what the adjustment does and any tradeoffs. Plot the ROC curve.
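One common way to approach this is to raise the probability threshold for predicting "high salary", which trades false positives (wrongly promising a high salary) for false negatives. The sketch below shows the idea on synthetic data; the dataset, the 0.8 threshold, and the helper `error_counts` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in: class 1 plays the role of "high salary"
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_high = clf.predict_proba(X_te)[:, 1]

def error_counts(threshold):
    """Return (false positives, false negatives) at a given threshold."""
    pred = (proba_high >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return fp, fn

fp_default, fn_default = error_counts(0.5)
fp_strict, fn_strict = error_counts(0.8)  # hypothetical stricter threshold
print("threshold 0.5:", fp_default, "false positives,", fn_default, "false negatives")
print("threshold 0.8:", fp_strict, "false positives,", fn_strict, "false negatives")

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_te, proba_high)
auc = roc_auc_score(y_te, proba_high)
print("AUC:", round(auc, 3))
```

Passing `fpr` and `tpr` to a plotting call such as `matplotlib.pyplot.plot(fpr, tpr)` draws the curve. Each point on it corresponds to one threshold, so moving the threshold up slides the operating point toward the lower-left corner: fewer false "high salary" predictions, at the cost of missing more genuinely high-salary jobs.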

---


# 1. Feature extraction

In [47]:
all_jobsraw.head(2)

Unnamed: 0,company,country,job_title,description,required_skills,date_created,equity,job_category,job_type,last_updated,vacancies,salary_range,years_of_experience_required,currency,lower,higher,rate,lower_sgd,higher_sgd,index,mean_salary_sgd,full_description,is_category
0,Tech in Asia,Singapore,data scientist,"Tech in Asia (YC W15) is a media, events, and ...",Amazon Web Services (AWS) Data Visualization S...,28 Jan 2018,Yes,Data & Analytics,Full-time,28 Jan 2018,1.0,"SGD 4,500 - 6,000",1 – 4 years,SGD,4500.0,6000.0,1.0,4500.0,6000.0,,5250.0,"Tech in Asia (YC W15) is a media, events, and ...",scientist
1,F Corporation,Indonesia,data scientist,Description\n\nThe focus of our data science t...,What is it like working at F Corporation?Worki...,26 Jan 2018,,Data & Analytics,Full-time,26 Jan 2018,1.0,"IDR 5,000,000 - 9,000,000",1 – 4 years,IDR,5000000.0,9000000.0,9.8e-05,490.0,882.0,,686.0,Description The focus of our data science tea...,scientist


In [48]:
all_jobsraw.is_category.value_counts()

engineer      818
analyst       688
leadership    624
dont_care     375
scientist     361
intern         54
database        1
Name: is_category, dtype: int64

In [49]:
# Binary target: 1 for data scientist postings, 0 otherwise.
# The feature matrix X will be built from full_description.
all_jobsraw['is_datasci'] = all_jobsraw['is_category'].map(lambda x: 1 if x == 'scientist' else 0)


In [50]:
df = all_jobsraw.loc[:,['is_datasci', 'full_description']]

In [51]:
# Imbalanced dataset: far fewer data scientist postings than other postings
df.is_datasci.value_counts()

0    2560
1     361
Name: is_datasci, dtype: int64

In [52]:
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2921 entries, 0 to 2920
Data columns (total 2 columns):
is_datasci          2921 non-null int64
full_description    2921 non-null object
dtypes: int64(1), object(1)
memory usage: 45.7+ KB


In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

In [54]:
vect = TfidfVectorizer(stop_words='english', ngram_range=(2,4), max_features=1000)

# 2. Imbalanced dataset: for simplicity, use random oversampling

In [55]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
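Random oversampling balances the classes by repeating minority-class rows. The sketch below reproduces the idea with plain NumPy, as a stand-in for `RandomOverSampler` rather than its actual implementation, on a made-up 90/10 split; all array names are assumptions. Resampling is normally applied to the training split only, so that duplicated rows cannot leak into the test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training split: 90 negatives, 10 positives
y_train = np.array([0] * 90 + [1] * 10)
X_train = np.arange(100).reshape(-1, 1)

minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)

# Draw extra minority rows with replacement until both classes match
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep = np.concatenate([majority_idx, minority_idx, extra_idx])
X_bal, y_bal = X_train[keep], y_train[keep]
print(np.bincount(y_bal))  # class counts after oversampling
```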

In [56]:
# The index was already reset in In [52]; just preview the first rows
df.head(10)

Unnamed: 0,is_datasci,full_description
0,1,"Tech in Asia (YC W15) is a media, events, and ..."
1,1,Description The focus of our data science tea...
2,1,"Job Description: Transform large, complex data..."
3,1,Data Scientist The position is for one of our ...
4,1,Job Description We are looking for a Data Scie...
5,1,"Knorex develops a cloud-based, highly scalable..."
6,1,"Knorex develops a cloud-based, highly scalable..."
7,1,Responsibility Part of data science team Respo...
8,1,"Do research using combine statistic, data anal..."
9,0,Taralite is a marketplace lender in Southeast ...


In [57]:
df_text = vect.fit_transform(df.full_description)

In [58]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_text1 = pd.DataFrame(data=df_text.toarray(), columns=vect.get_feature_names_out())
df_text1.head(2)

Unnamed: 0,01 singapore,01 singapore 079120,01 singapore 079120 tel,01 singapore sg,079120 tel,079120 tel 6778,079120 tel 6778 5288,08c2893 rcb,09 01,09 01 singapore,09 01 singapore 079120,10 years,208 368,208 368 4748,208 368 4748 keywords,336 8918,336 8918 208,336 8918 208 368,368 4748,368 4748 keywords,368 4748 keywords singapore,4748 keywords,4748 keywords singapore,4748 keywords singapore central,5288 fax,5288 fax 6578,5288 fax 6578 7400,6578 7400,6778 5288,6778 5288 fax,6778 5288 fax 6578,78 shenton,78 shenton way,78 shenton way 09,800 336,800 336 8918,800 336 8918 208,8918 208,8918 208 368,8918 208 368 4748,ability work,ability work independently,able work,able work independently,accommodation team,accommodation team members,accommodation team members disabilities,according experience,ad hoc,added advantage,additional selection,additional selection criteria,additional selection criteria based,administration policy,administration policy administrator,administration policy administrator monitor,administrator human,administrator human resources,administrator human resources responsible,administrator monitor,administrator monitor compliance,administrator monitor compliance available,age national,age national origin,age national origin disability,agreed consented,agreed consented collecting,agreed consented collecting using,amended time,amended time time,amended time time notice,analysis data,analytical problem,analytical problem solving,analytical skills,analytics data,answer questions,answer questions eeo,answer questions eeo matters,applicants apply,applicants notified,applicants send,application emailing,application emailing detailed,application emailing detailed copy,application employment,application employment people,application employment people profilers,application people,application people profilers,application people profilers collected,application process,application process contact,application process contact micron,application sume,application sume 
deemed,application sume deemed agreed,applications treated,applications treated strictest,applications treated strictest confidence,apply button,apply interested,apply interested applicants,apply online,apply team,apply team player,apply team player meticulous,artificial intelligence,asia pacific,assistance application,assistance application process,assistance application process contact,availability commence,available answer,available answer questions,available answer questions eeo,bachelor degree,bachelor degree computer,bachelor degree computer science,bachelor degree post,bachelor degree post graduate,based prevailing,based prevailing recruitment,based prevailing recruitment policies,based singapore,behalf people,behalf people profilers,behalf people profilers determine,beliefs practices,beliefs practices manager,beliefs practices manager supervisor,believe make,believe make difference,believe make difference like,best practices,big data,big data analytics,business acumen,business analysis,business analyst,business analytics,business development,business intelligence,business needs,business problems,business process,business processes,business requirements,business unit,business units,business users,candidate future,candidate future suitable,candidate future suitable positions,candidate notified,candidate notified applications,candidate notified applications treated,candidate possess,candidates expect,candidates expect competitive,candidates expect competitive remuneration,candidates notified,candidates wish,candidates wish apply,capita pte,carrying policy,carrying policy eeo,carrying policy eeo administrator,central singapore,central singapore sg,central singapore sg 01,change management,classifications protected,classifications protected law,classifications protected law includes,click apply,clients including,clients including identifying,clients including identifying potential,clients manage,clients manage application,clients manage application 
employment,collected used,collected used disclosed,collected used disclosed behalf,collecting using,collecting using retaining,collecting using retaining disclosing,color religion,color religion sex,color religion sex age,com sg,commensurate according,communication interpersonal,communication interpersonal skills,communication presentation,communication skills,communication skills ability,competitive remuneration,competitive remuneration package,competitive remuneration package comprehensive,compliance available,compliance available answer,compliance available answer questions,comprehensive range,comprehensive range benefits,computer engineering,computer science,computer science engineering,computer science information,computer science related,conditions employment,conditions employment regard,conditions employment regard person,confidence submitting,confidence submitting application,confidence submitting application sume,confidential discussion,confidential discussion indicate,confidential discussion indicate information,consented collecting,consented collecting using,consented collecting using retaining,consideration success,consideration success achievement,contact micron,contact micron human,contact micron human resources,continuous improvement,copy updated,copy updated resume,copy updated resume ms,criteria based,criteria based prevailing,criteria based prevailing recruitment,criteria exhaustive,criteria exhaustive star,criteria exhaustive star include,cross functional,cross functional teams,current expected,current expected salary,current expected salaryreason,customer service,cutting edge,data analysis,data analyst,data analytics,data architecture,data collection,data driven,data governance,data management,data mining,data modelling,data provided,data provided way,data provided way job,data quality,data science,data scientist,data scientists,data sets,data sources,data visualization,data warehouse,day day,days work,decision making,deemed agreed,deemed agreed 
consented,deemed agreed consented collecting,deep learning,degree business,degree computer,degree computer science,degree diploma,degree engineering,degree post,degree post graduate,degree post graduate diploma,demonstrated ability,department 800,department 800 336,department 800 336 8918,design develop,design development,detailed copy,detailed copy updated,detailed copy updated resume,detailed resume,determine investigate,determine investigate suitability,determine investigate suitability eligibility,develop implement,develop maintain,development experience,development implementation,development team,difference like,difference like hear,difference like hear simply,digital marketing,diploma degree,diploma professional,diploma professional degree,disabilities religious,disabilities religious beliefs,disabilities religious beliefs practices,disability sexual,disability sexual orientation,disability sexual orientation gender,discipline provide,discipline provide conditions,discipline provide conditions employment,disclosed behalf,disclosed behalf people,disclosed behalf people profilers,disclosing personal,disclosing personal information,disclosing personal information prospective,discussion indicate,discussion indicate information,discussion indicate information resume,duties assigned,duties responsibilities,ea licence,ea license,ea license 08c2893,ea personnel,ea personnel reg,ea personnel registration,ea registration,eeo administrator,eeo administrator human,eeo administrator human resources,eeo matters,eeo matters request,eeo matters request assistance,electrical engineering,eligibility criteria,eligibility criteria exhaustive,eligibility criteria exhaustive star,eligibility qualifications,eligibility qualifications employment,eligibility qualifications employment people,email detailed,email email,email email protected,email protected,email protected regret,email resume,email resume detailed,emailing detailed,emailing detailed copy,emailing detailed copy 
updated,employers consideration,employers consideration success,employers consideration success achievement,employment people,employment people profilers,employment people profilers clients,employment regard,employment regard person,employment regard person race,end end,engineering computer,engineering related,ensure data,excellent communication,excellent communication skills,exhaustive star,exhaustive star include,exhaustive star include additional,existing future,expect competitive,expect competitive remuneration,expect competitive remuneration package,expected salary,expected salaryreason,experience business,experience data,experience following,experience qualifications,experience related,experience related field,experience related field required,experience using,experience working,experience years,experienced regular,expression pregnancy,expression pregnancy veteran,expression pregnancy veteran status,fast paced,fast paced environment,fax 6578,fax 6578 7400,field required,field required position,financial services,following areas,format email,format email protected,friendly consultant,functional teams,future suitable,future suitable positions,future suitable positions notifying,gender identity,gender identity expression,gender identity expression pregnancy,good communication,good communication skills,good interpersonal,good knowledge,good understanding,graduate diploma,graduate diploma professional,graduate diploma professional degree,growing business,hands experience,hardware software,hear simply,hear simply submit,hear simply submit application,high level,high quality,highly motivated,hire train,hire train promote,hire train promote discipline,human resources,human resources department,human resources department 800,human resources responsible,human resources responsible administration,identifying potential,identifying potential candidate,identifying potential candidate future,identity expression,identity expression pregnancy,identity expression pregnancy 
veteran,importantly believe,importantly believe make,importantly believe make difference,include additional,include additional selection,include additional selection criteria,include following,includes providing,includes providing reasonable,includes providing reasonable accommodation,including identifying,including identifying potential,including identifying potential candidate,indicate information,indicate information resume,indicate information resume current,inform shortlisted,inform shortlisted candidates,inform shortlisted candidates notified,information location,information management,information prospective,information prospective employers,information prospective employers consideration,information resume,information resume current,information resume current expected,information security,information systems,information technology,informed personal,informed personal data,informed personal data provided,interested applicants,interested applicants apply,interested candidates,interested candidates wish,interested candidates wish apply,internal external,interpersonal communication,interpersonal skills,investigate suitability,investigate suitability eligibility,investigate suitability eligibility qualifications,job application,job application people,job application people profilers,job description,job id,job requirements,job responsibilities,job segment,join growing,key responsibilities,key stakeholders,keywords singapore,...,kindly send,knowledge data,knowledge experience,language processing,large scale,law includes,law includes providing,law includes providing reasonable,learning data,learning models,learning techniques,li sing,li sing job,li sing job segment,licence number,license 08c2893,license number,life cycle,like hear,like hear simply,like hear simply submit,listed candidate,listed candidate notified,listed candidate notified applications,listed candidates,long term,machine learning,machine learning techniques,make difference,make difference like,make 
difference like hear,manage application,manage application employment,manage application employment people,management experience,management skills,management team,manager supervisor,manager supervisor team,manager supervisor team member,manufacturing engineer,market intelligence,market research,master degree,matters request,matters request assistance,matters request assistance application,member responsible,member responsible carrying,member responsible carrying policy,members disabilities,members disabilities religious,members disabilities religious beliefs,meticulous organized,meticulous organized importantly,meticulous organized importantly believe,micron human,micron human resources,micron human resources department,microsoft excel,microsoft office,min years,minimum years,minimum years experience,minimum years relevant,mon fri,monday friday,monitor compliance,monitor compliance available,monitor compliance available answer,ms excel,ms office,ms word,ms word format,ms word format email,national origin,national origin disability,national origin disability sexual,natural language,new business,new product,new products,notice period,notice period availability,notice regret,notice regret shortlisted,notice regret shortlisted candidates,notified applications,notified applications treated,notified applications treated strictest,notifying positions,notifying positions existing,notifying positions existing future,organized importantly,organized importantly believe,organized importantly believe make,orientation gender,orientation gender identity,orientation gender identity expression,origin disability,origin disability sexual,origin disability sexual orientation,paced environment,package comprehensive,package comprehensive range,package comprehensive range benefits,people profilers,people profilers clients,people profilers clients including,people profilers clients manage,people profilers collected,people profilers collected used,people profilers determine,people 
profilers determine investigate,period availability,period availability commence,person race,person race color,person race color religion,personal data,personal data provided,personal data provided way,personal information,personal information prospective,personal information prospective employers,personnel reg,personnel registration,player meticulous,player meticulous organized,player meticulous organized importantly,policies amended,policies amended time,policies amended time time,policies policies,policies policies amended,policies policies amended time,policy administrator,policy administrator monitor,policy administrator monitor compliance,policy eeo,policy eeo administrator,policy eeo administrator human,positions existing,positions existing future,positions notifying,positions notifying positions,positions notifying positions existing,possess strong,post graduate,post graduate diploma,post graduate diploma professional,potential candidate,potential candidate future,potential candidate future suitable,practices manager,practices manager supervisor,practices manager supervisor team,pre sales,pregnancy veteran,pregnancy veteran status,pregnancy veteran status classifications,presentation skills,prevailing recruitment,prevailing recruitment policies,prevailing recruitment policies policies,prior experience,privacy policy,problem solving,problem solving skills,process contact,process contact micron,process contact micron human,process improvement,product development,product management,products services,professional degree,profilers clients,profilers clients including,profilers clients including identifying,profilers clients manage,profilers clients manage application,profilers collected,profilers collected used,profilers collected used disclosed,profilers determine,profilers determine investigate,profilers determine investigate suitability,profilers pte,programming languages,project management,project manager,promote discipline,promote discipline provide,promote 
discipline provide conditions,prospective employers,prospective employers consideration,prospective employers consideration success,protected law,protected law includes,protected law includes providing,protected regret,proven track,proven track record,provide conditions,provide conditions employment,provide conditions employment regard,provide support,provide technical,provided way,provided way job,provided way job application,providing reasonable,providing reasonable accommodation,providing reasonable accommodation team,pte ea,pte ea license,qualifications employment,qualifications employment people,qualifications employment people profilers,quality assurance,questions eeo,questions eeo matters,questions eeo matters request,race color,race color religion,race color religion sex,range benefits,real estate,real time,reason leaving,reasonable accommodation,reasonable accommodation team,reasonable accommodation team members,recruit hire,recruit hire train,recruit hire train promote,recruitment policies,recruitment policies policies,recruitment policies policies amended,regard person,regard person race,regard person race color,registration number,regret inform,regret inform shortlisted,regret inform shortlisted candidates,regret short,regret short listed,regret short listed candidate,regret shortlisted,regret shortlisted candidates,regret shortlisted candidates notified,regular engineering,related discipline,related field,related field required,related field required position,relevant experience,relevant working,relevant working experience,religion sex,religion sex age,religion sex age national,religious beliefs,religious beliefs practices,religious beliefs practices manager,remuneration package,remuneration package comprehensive,remuneration package comprehensive range,req id,request assistance,request assistance application,request assistance application process,required position,requirements bachelor,requirements bachelor degree,requirements degree,requirements 
diploma,requirements min,requirements minimum,resources department,resources department 800,resources department 800 336,resources responsible,resources responsible administration,resources responsible administration policy,responsibilities include,responsibilities provide,responsibilities requirements,responsible administration,responsible administration policy,responsible administration policy administrator,responsible carrying,responsible carrying policy,responsible carrying policy eeo,resume current,resume current expected,resume current expected salaryreason,resume detailed,resume email,resume email protected,resume ms,resume ms word,resume ms word format,retaining disclosing,retaining disclosing personal,retaining disclosing personal information,risk management,root cause,salary commensurate,sales marketing,sales team,science computer,science engineering,science information,science related,science technology,selection criteria,selection criteria based,selection criteria based prevailing,self motivated,send resume,send resume email,send resume email protected,send updated,senior management,services pte,sex age,sex age national,sex age national origin,sexual orientation,sexual orientation gender,sexual orientation gender identity,sg 01,sg 01 singapore,sg 01 singapore sg,shenton way,shenton way 09,shenton way 09 01,short listed,short listed candidate,short listed candidate notified,short listed candidates,shortlisted applicants,shortlisted applicants notified,shortlisted candidates,shortlisted candidates notified,simply submit,simply submit application,simply submit application emailing,sing job,sing job segment,singapore 079120,singapore 079120 tel,singapore 079120 tel 6778,singapore central,singapore central singapore,singapore central singapore sg,singapore sg,singapore sg 01,singapore sg 01 singapore,skills ability,skills able,skills experience,skills strong,social media,software development,software engineering,solving skills,sql server,star include,star 
include additional,star include additional selection,state art,statistical analysis,status classifications,status classifications protected,status classifications protected law,strictest confidence,strictest confidence submitting,strictest confidence submitting application,strong analytical,strong analytical skills,strong communication,strong knowledge,strong understanding,subject matter,submit application,submit application emailing,submit application emailing detailed,submit resume,submitting application,submitting application sume,submitting application sume deemed,success achievement,successful candidate,successful candidates,successful candidates expect,successful candidates expect competitive,suitability eligibility,suitability eligibility qualifications,suitability eligibility qualifications employment,suitable positions,suitable positions notifying,suitable positions notifying positions,sume deemed,sume deemed agreed,sume deemed agreed consented,supervisor team,supervisor team member,supervisor team member responsible,supply chain,support business,team member,team member responsible,team member responsible carrying,team members,team members disabilities,team members disabilities religious,team player,team player meticulous,team player meticulous organized,technical support,tel 6778,tel 6778 5288,tel 6778 5288 fax,time notice,time notice regret,time notice regret shortlisted,time time,time time notice,time time notice regret,timely manner,tools like,track record,train promote,train promote discipline,train promote discipline provide,treated strictest,treated strictest confidence,treated strictest confidence submitting,understand business,updated resume,updated resume ms,updated resume ms word,used disclosed,used disclosed behalf,used disclosed behalf people,using data,using retaining,using retaining disclosing,using retaining disclosing personal,using statistical,verbal communication,verbal communication skills,verbal written,verbal written 
communication,veteran status,veteran status classifications,veteran status classifications protected,visit www,way 09,way 09 01,way 09 01 singapore,way job,way job application,way job application people,wish apply,word format,word format email,word format email protected,work closely,work experience,work fast,work independently,work team,working closely,working days,working environment,working experience,working experience related,working experience related field,working knowledge,working location,world class,written communication,written communication skills,written verbal,written verbal communication,year working,year working experience,year working experience related,years experience,years relevant,years relevant experience,years relevant working,years working,years working experience
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.173089,0.0,0.0,0.0,0.190167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.181393,0.183186,0.0,0.0,0.0,0.0,0.0,0.196003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.10019,0.177626,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117544,0.0,0.0,0.0,0.16711,0.0,0.0,0.0,0.144109,0.0,0.0,0.0,0.0,0.0,0.28444,0.170323,0.0,0.0,0.0,0.181834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.158506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.399907,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130361,0.174546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.398536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.164131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.190714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.
0,0.0,0.0,0.0,0.138685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.159777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106457,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.309515,0.0,0.0,0.0,0.0,0.0,0.0,0.560493,0.0,0.0,0.246489,0.0,0.0,0.0,0.0,0.0,0.243257,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.246489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.340835,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.430903,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.234881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
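# Drop the raw description text, then renumber rows 0..n-1 so they can be
# matched row-for-row with df_text1 in the concat below.
df1 = df.drop('full_description', axis=1).reset_index(drop=True)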
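
The index reset matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below illustrates this with two small made-up frames; `meta` and `tfidf` are hypothetical stand-ins for `df1` and `df_text1`, and it assumes the text-feature frame carries a default 0..n-1 index.

```python
import pandas as pd

# Hypothetical stand-ins: `meta` plays the role of df1 with a leftover,
# non-default index; `tfidf` plays the role of df_text1 with a default index.
meta = pd.DataFrame({"is_datasci": [1, 0, 1]}, index=[10, 11, 12])
tfidf = pd.DataFrame({"data science": [0.4, 0.0, 0.2]})

# Without a reset, concat takes the union of the index labels, so no rows
# match and every mismatch is padded with NaN.
misaligned = pd.concat([meta, tfidf], axis=1)

# After the reset, both frames share the labels 0..2 and line up one-to-one.
aligned = pd.concat([meta.reset_index(drop=True), tfidf], axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```

A quick `df_train.isna().sum().sum()` after concatenating is a cheap way to confirm that the rows matched up as intended.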

In [60]:
# join_axes was removed in pandas 1.0; reindexing on df1's index is the documented replacement.
df_train = pd.concat([df1, df_text1], axis=1).reindex(df1.index)
df_train.head(2)

Unnamed: 0,is_datasci,01 singapore,01 singapore 079120,01 singapore 079120 tel,01 singapore sg,079120 tel,079120 tel 6778,079120 tel 6778 5288,08c2893 rcb,09 01,09 01 singapore,09 01 singapore 079120,10 years,208 368,208 368 4748,208 368 4748 keywords,336 8918,336 8918 208,336 8918 208 368,368 4748,368 4748 keywords,368 4748 keywords singapore,4748 keywords,4748 keywords singapore,4748 keywords singapore central,5288 fax,5288 fax 6578,5288 fax 6578 7400,6578 7400,6778 5288,6778 5288 fax,6778 5288 fax 6578,78 shenton,78 shenton way,78 shenton way 09,800 336,800 336 8918,800 336 8918 208,8918 208,8918 208 368,8918 208 368 4748,ability work,ability work independently,able work,able work independently,accommodation team,accommodation team members,accommodation team members disabilities,according experience,ad hoc,added advantage,additional selection,additional selection criteria,additional selection criteria based,administration policy,administration policy administrator,administration policy administrator monitor,administrator human,administrator human resources,administrator human resources responsible,administrator monitor,administrator monitor compliance,administrator monitor compliance available,age national,age national origin,age national origin disability,agreed consented,agreed consented collecting,agreed consented collecting using,amended time,amended time time,amended time time notice,analysis data,analytical problem,analytical problem solving,analytical skills,analytics data,answer questions,answer questions eeo,answer questions eeo matters,applicants apply,applicants notified,applicants send,application emailing,application emailing detailed,application emailing detailed copy,application employment,application employment people,application employment people profilers,application people,application people profilers,application people profilers collected,application process,application process contact,application process contact micron,application 
sume,application sume deemed,application sume deemed agreed,applications treated,applications treated strictest,applications treated strictest confidence,apply button,apply interested,apply interested applicants,apply online,apply team,apply team player,apply team player meticulous,artificial intelligence,asia pacific,assistance application,assistance application process,assistance application process contact,availability commence,available answer,available answer questions,available answer questions eeo,bachelor degree,bachelor degree computer,bachelor degree computer science,bachelor degree post,bachelor degree post graduate,based prevailing,based prevailing recruitment,based prevailing recruitment policies,based singapore,behalf people,behalf people profilers,behalf people profilers determine,beliefs practices,beliefs practices manager,beliefs practices manager supervisor,believe make,believe make difference,believe make difference like,best practices,big data,big data analytics,business acumen,business analysis,business analyst,business analytics,business development,business intelligence,business needs,business problems,business process,business processes,business requirements,business unit,business units,business users,candidate future,candidate future suitable,candidate future suitable positions,candidate notified,candidate notified applications,candidate notified applications treated,candidate possess,candidates expect,candidates expect competitive,candidates expect competitive remuneration,candidates notified,candidates wish,candidates wish apply,capita pte,carrying policy,carrying policy eeo,carrying policy eeo administrator,central singapore,central singapore sg,central singapore sg 01,change management,classifications protected,classifications protected law,classifications protected law includes,click apply,clients including,clients including identifying,clients including identifying potential,clients manage,clients manage application,clients manage 
application employment,collected used,collected used disclosed,collected used disclosed behalf,collecting using,collecting using retaining,collecting using retaining disclosing,color religion,color religion sex,color religion sex age,com sg,commensurate according,communication interpersonal,communication interpersonal skills,communication presentation,communication skills,communication skills ability,competitive remuneration,competitive remuneration package,competitive remuneration package comprehensive,compliance available,compliance available answer,compliance available answer questions,comprehensive range,comprehensive range benefits,computer engineering,computer science,computer science engineering,computer science information,computer science related,conditions employment,conditions employment regard,conditions employment regard person,confidence submitting,confidence submitting application,confidence submitting application sume,confidential discussion,confidential discussion indicate,confidential discussion indicate information,consented collecting,consented collecting using,consented collecting using retaining,consideration success,consideration success achievement,contact micron,contact micron human,contact micron human resources,continuous improvement,copy updated,copy updated resume,copy updated resume ms,criteria based,criteria based prevailing,criteria based prevailing recruitment,criteria exhaustive,criteria exhaustive star,criteria exhaustive star include,cross functional,cross functional teams,current expected,current expected salary,current expected salaryreason,customer service,cutting edge,data analysis,data analyst,data analytics,data architecture,data collection,data driven,data governance,data management,data mining,data modelling,data provided,data provided way,data provided way job,data quality,data science,data scientist,data scientists,data sets,data sources,data visualization,data warehouse,day day,days work,decision making,deemed 
agreed,deemed agreed consented,deemed agreed consented collecting,deep learning,degree business,degree computer,degree computer science,degree diploma,degree engineering,degree post,degree post graduate,degree post graduate diploma,demonstrated ability,department 800,department 800 336,department 800 336 8918,design develop,design development,detailed copy,detailed copy updated,detailed copy updated resume,detailed resume,determine investigate,determine investigate suitability,determine investigate suitability eligibility,develop implement,develop maintain,development experience,development implementation,development team,difference like,difference like hear,difference like hear simply,digital marketing,diploma degree,diploma professional,diploma professional degree,disabilities religious,disabilities religious beliefs,disabilities religious beliefs practices,disability sexual,disability sexual orientation,disability sexual orientation gender,discipline provide,discipline provide conditions,discipline provide conditions employment,disclosed behalf,disclosed behalf people,disclosed behalf people profilers,disclosing personal,disclosing personal information,disclosing personal information prospective,discussion indicate,discussion indicate information,discussion indicate information resume,duties assigned,duties responsibilities,ea licence,ea license,ea license 08c2893,ea personnel,ea personnel reg,ea personnel registration,ea registration,eeo administrator,eeo administrator human,eeo administrator human resources,eeo matters,eeo matters request,eeo matters request assistance,electrical engineering,eligibility criteria,eligibility criteria exhaustive,eligibility criteria exhaustive star,eligibility qualifications,eligibility qualifications employment,eligibility qualifications employment people,email detailed,email email,email email protected,email protected,email protected regret,email resume,email resume detailed,emailing detailed,emailing detailed copy,emailing 
detailed copy updated,employers consideration,employers consideration success,employers consideration success achievement,employment people,employment people profilers,employment people profilers clients,employment regard,employment regard person,employment regard person race,end end,engineering computer,engineering related,ensure data,excellent communication,excellent communication skills,exhaustive star,exhaustive star include,exhaustive star include additional,existing future,expect competitive,expect competitive remuneration,expect competitive remuneration package,expected salary,expected salaryreason,experience business,experience data,experience following,experience qualifications,experience related,experience related field,experience related field required,experience using,experience working,experience years,experienced regular,expression pregnancy,expression pregnancy veteran,expression pregnancy veteran status,fast paced,fast paced environment,fax 6578,fax 6578 7400,field required,field required position,financial services,following areas,format email,format email protected,friendly consultant,functional teams,future suitable,future suitable positions,future suitable positions notifying,gender identity,gender identity expression,gender identity expression pregnancy,good communication,good communication skills,good interpersonal,good knowledge,good understanding,graduate diploma,graduate diploma professional,graduate diploma professional degree,growing business,hands experience,hardware software,hear simply,hear simply submit,hear simply submit application,high level,high quality,highly motivated,hire train,hire train promote,hire train promote discipline,human resources,human resources department,human resources department 800,human resources responsible,human resources responsible administration,identifying potential,identifying potential candidate,identifying potential candidate future,identity expression,identity expression pregnancy,identity expression 
pregnancy veteran,importantly believe,importantly believe make,importantly believe make difference,include additional,include additional selection,include additional selection criteria,include following,includes providing,includes providing reasonable,includes providing reasonable accommodation,including identifying,including identifying potential,including identifying potential candidate,indicate information,indicate information resume,indicate information resume current,inform shortlisted,inform shortlisted candidates,inform shortlisted candidates notified,information location,information management,information prospective,information prospective employers,information prospective employers consideration,information resume,information resume current,information resume current expected,information security,information systems,information technology,informed personal,informed personal data,informed personal data provided,interested applicants,interested applicants apply,interested candidates,interested candidates wish,interested candidates wish apply,internal external,interpersonal communication,interpersonal skills,investigate suitability,investigate suitability eligibility,investigate suitability eligibility qualifications,job application,job application people,job application people profilers,job description,job id,job requirements,job responsibilities,job segment,join growing,key responsibilities,key stakeholders,...,kindly send,knowledge data,knowledge experience,language processing,large scale,law includes,law includes providing,law includes providing reasonable,learning data,learning models,learning techniques,li sing,li sing job,li sing job segment,licence number,license 08c2893,license number,life cycle,like hear,like hear simply,like hear simply submit,listed candidate,listed candidate notified,listed candidate notified applications,listed candidates,long term,machine learning,machine learning techniques,make difference,make difference like,make difference 
like hear,manage application,manage application employment,manage application employment people,management experience,management skills,management team,manager supervisor,manager supervisor team,manager supervisor team member,manufacturing engineer,market intelligence,market research,master degree,matters request,matters request assistance,matters request assistance application,member responsible,member responsible carrying,member responsible carrying policy,members disabilities,members disabilities religious,members disabilities religious beliefs,meticulous organized,meticulous organized importantly,meticulous organized importantly believe,micron human,micron human resources,micron human resources department,microsoft excel,microsoft office,min years,minimum years,minimum years experience,minimum years relevant,mon fri,monday friday,monitor compliance,monitor compliance available,monitor compliance available answer,ms excel,ms office,ms word,ms word format,ms word format email,national origin,national origin disability,national origin disability sexual,natural language,new business,new product,new products,notice period,notice period availability,notice regret,notice regret shortlisted,notice regret shortlisted candidates,notified applications,notified applications treated,notified applications treated strictest,notifying positions,notifying positions existing,notifying positions existing future,organized importantly,organized importantly believe,organized importantly believe make,orientation gender,orientation gender identity,orientation gender identity expression,origin disability,origin disability sexual,origin disability sexual orientation,paced environment,package comprehensive,package comprehensive range,package comprehensive range benefits,people profilers,people profilers clients,people profilers clients including,people profilers clients manage,people profilers collected,people profilers collected used,people profilers determine,people profilers determine 
investigate,period availability,period availability commence,person race,person race color,person race color religion,personal data,personal data provided,personal data provided way,personal information,personal information prospective,personal information prospective employers,personnel reg,personnel registration,player meticulous,player meticulous organized,player meticulous organized importantly,policies amended,policies amended time,policies amended time time,policies policies,policies policies amended,policies policies amended time,policy administrator,policy administrator monitor,policy administrator monitor compliance,policy eeo,policy eeo administrator,policy eeo administrator human,positions existing,positions existing future,positions notifying,positions notifying positions,positions notifying positions existing,possess strong,post graduate,post graduate diploma,post graduate diploma professional,potential candidate,potential candidate future,potential candidate future suitable,practices manager,practices manager supervisor,practices manager supervisor team,pre sales,pregnancy veteran,pregnancy veteran status,pregnancy veteran status classifications,presentation skills,prevailing recruitment,prevailing recruitment policies,prevailing recruitment policies policies,prior experience,privacy policy,problem solving,problem solving skills,process contact,process contact micron,process contact micron human,process improvement,product development,product management,products services,professional degree,profilers clients,profilers clients including,profilers clients including identifying,profilers clients manage,profilers clients manage application,profilers collected,profilers collected used,profilers collected used disclosed,profilers determine,profilers determine investigate,profilers determine investigate suitability,profilers pte,programming languages,project management,project manager,promote discipline,promote discipline provide,promote discipline provide 
conditions,prospective employers,prospective employers consideration,prospective employers consideration success,protected law,protected law includes,protected law includes providing,protected regret,proven track,proven track record,provide conditions,provide conditions employment,provide conditions employment regard,provide support,provide technical,provided way,provided way job,provided way job application,providing reasonable,providing reasonable accommodation,providing reasonable accommodation team,pte ea,pte ea license,qualifications employment,qualifications employment people,qualifications employment people profilers,quality assurance,questions eeo,questions eeo matters,questions eeo matters request,race color,race color religion,race color religion sex,range benefits,real estate,real time,reason leaving,reasonable accommodation,reasonable accommodation team,reasonable accommodation team members,recruit hire,recruit hire train,recruit hire train promote,recruitment policies,recruitment policies policies,recruitment policies policies amended,regard person,regard person race,regard person race color,registration number,regret inform,regret inform shortlisted,regret inform shortlisted candidates,regret short,regret short listed,regret short listed candidate,regret shortlisted,regret shortlisted candidates,regret shortlisted candidates notified,regular engineering,related discipline,related field,related field required,related field required position,relevant experience,relevant working,relevant working experience,religion sex,religion sex age,religion sex age national,religious beliefs,religious beliefs practices,religious beliefs practices manager,remuneration package,remuneration package comprehensive,remuneration package comprehensive range,req id,request assistance,request assistance application,request assistance application process,required position,requirements bachelor,requirements bachelor degree,requirements degree,requirements diploma,requirements 
min,requirements minimum,resources department,resources department 800,resources department 800 336,resources responsible,resources responsible administration,resources responsible administration policy,responsibilities include,responsibilities provide,responsibilities requirements,responsible administration,responsible administration policy,responsible administration policy administrator,responsible carrying,responsible carrying policy,responsible carrying policy eeo,resume current,resume current expected,resume current expected salaryreason,resume detailed,resume email,resume email protected,resume ms,resume ms word,resume ms word format,retaining disclosing,retaining disclosing personal,retaining disclosing personal information,risk management,root cause,salary commensurate,sales marketing,sales team,science computer,science engineering,science information,science related,science technology,selection criteria,selection criteria based,selection criteria based prevailing,self motivated,send resume,send resume email,send resume email protected,send updated,senior management,services pte,sex age,sex age national,sex age national origin,sexual orientation,sexual orientation gender,sexual orientation gender identity,sg 01,sg 01 singapore,sg 01 singapore sg,shenton way,shenton way 09,shenton way 09 01,short listed,short listed candidate,short listed candidate notified,short listed candidates,shortlisted applicants,shortlisted applicants notified,shortlisted candidates,shortlisted candidates notified,simply submit,simply submit application,simply submit application emailing,sing job,sing job segment,singapore 079120,singapore 079120 tel,singapore 079120 tel 6778,singapore central,singapore central singapore,singapore central singapore sg,singapore sg,singapore sg 01,singapore sg 01 singapore,skills ability,skills able,skills experience,skills strong,social media,software development,software engineering,solving skills,sql server,star include,star include additional,star 
include additional selection,state art,statistical analysis,status classifications,status classifications protected,status classifications protected law,strictest confidence,strictest confidence submitting,strictest confidence submitting application,strong analytical,strong analytical skills,strong communication,strong knowledge,strong understanding,subject matter,submit application,submit application emailing,submit application emailing detailed,submit resume,submitting application,submitting application sume,submitting application sume deemed,success achievement,successful candidate,successful candidates,successful candidates expect,successful candidates expect competitive,suitability eligibility,suitability eligibility qualifications,suitability eligibility qualifications employment,suitable positions,suitable positions notifying,suitable positions notifying positions,sume deemed,sume deemed agreed,sume deemed agreed consented,supervisor team,supervisor team member,supervisor team member responsible,supply chain,support business,team member,team member responsible,team member responsible carrying,team members,team members disabilities,team members disabilities religious,team player,team player meticulous,team player meticulous organized,technical support,tel 6778,tel 6778 5288,tel 6778 5288 fax,time notice,time notice regret,time notice regret shortlisted,time time,time time notice,time time notice regret,timely manner,tools like,track record,train promote,train promote discipline,train promote discipline provide,treated strictest,treated strictest confidence,treated strictest confidence submitting,understand business,updated resume,updated resume ms,updated resume ms word,used disclosed,used disclosed behalf,used disclosed behalf people,using data,using retaining,using retaining disclosing,using retaining disclosing personal,using statistical,verbal communication,verbal communication skills,verbal written,verbal written communication,veteran status,veteran 
status classifications,veteran status classifications protected,visit www,way 09,way 09 01,way 09 01 singapore,way job,way job application,way job application people,wish apply,word format,word format email,word format email protected,work closely,work experience,work fast,work independently,work team,working closely,working days,working environment,working experience,working experience related,working experience related field,working knowledge,working location,world class,written communication,written communication skills,written verbal,written verbal communication,year working,year working experience,year working experience related,years experience,years relevant,years relevant experience,years relevant working,years working,years working experience
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.173089,0.0,0.0,0.0,0.190167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.181393,0.183186,0.0,0.0,0.0,0.0,0.0,0.196003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.10019,0.177626,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117544,0.0,0.0,0.0,0.16711,0.0,0.0,0.0,0.144109,0.0,0.0,0.0,0.0,0.0,0.28444,0.170323,0.0,0.0,0.0,0.181834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.158506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.399907,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130361,0.174546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.398536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.164131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.190714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.138685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.159777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106457,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.309515,0.0,0.0,0.0,0.0,0.0,0.0,0.560493,0.0,0.0,0.246489,0.0,0.0,0.0,0.0,0.0,0.243257,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.246489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.340835,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.430903,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.234881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2921 entries, 0 to 2920
Columns: 1001 entries, is_datasci to years working experience
dtypes: float64(1000), int64(1)
memory usage: 22.3 MB


In [63]:
df_train.isnull().sum()

is_datasci                                            0
01 singapore                                          0
01 singapore 079120                                   0
01 singapore 079120 tel                               0
01 singapore sg                                       0
079120 tel                                            0
079120 tel 6778                                       0
079120 tel 6778 5288                                  0
08c2893 rcb                                           0
09 01                                                 0
09 01 singapore                                       0
09 01 singapore 079120                                0
10 years                                              0
208 368                                               0
208 368 4748                                          0
208 368 4748 keywords                                 0
336 8918                                              0
336 8918 208                                    

In [64]:
X_1 = df_train.drop('is_datasci', axis=1)
y_1 = df_train.is_datasci.values

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.3, random_state=10000)

In [66]:
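# `ros` is the oversampler object created in an earlier cell.
# Recent imbalanced-learn releases name this method fit_resample instead of fit_sample.
X_resampled, y_resampled = ros.fit_sample(X_train, y_train)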

In [70]:
# Class counts in the training labels before and after oversampling
print('Before resample: class 0 :  ', (y_train == 0).sum())
print('class 1: ', (y_train == 1).sum())
print('After resample: class 0 :  ', (y_resampled == 0).sum())
print('class 1: ', (y_resampled == 1).sum())

Before resample: class 0 :   1790
class 1:  254
After resample: class 0 :   1790
class 1:  1790
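
The counts above (254 minority rows grown to 1790) are what random oversampling would produce. The definition of `ros` is not shown in this section, so the following is only a minimal standard-library sketch of that idea, not the imbalanced-learn implementation. The function name `random_oversample` and the toy data are made up for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data with the same class counts as y_train above (1790 vs. 254).
toy_labels = [0] * 1790 + [1] * 254
toy_rows = [[i] for i in range(len(toy_labels))]
res_rows, res_labels = random_oversample(toy_rows, toy_labels)
print(Counter(res_labels))
```

Only the training split is resampled here, as in the notebook, so the test set keeps its original class balance.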


In [71]:
logreg = linear_model.LogisticRegression()
logreg.fit(X_resampled, y_resampled)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [72]:
from sklearn.model_selection import cross_val_score


In [75]:
cross_val_score(logreg, X_resampled, y_resampled, cv=5)

array([0.87430168, 0.91620112, 0.87430168, 0.87430168, 0.88687151])
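
One caveat on these scores: if `ros` duplicates minority rows (as a random oversampler would), cross-validating on the already resampled data can place copies of the same row in both the training and validation folds, which tends to make the scores optimistic. A safer pattern is to resample inside each fold, on the training part only. The sketch below shows that index logic with the standard library. The helper `folds_with_inner_oversampling` and the toy labels are hypothetical, and the folds are a plain shuffle rather than stratified.

```python
import random

def folds_with_inner_oversampling(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; only the training indices
    are oversampled, the validation indices are left untouched."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    for f in range(k):
        val = idx[f::k]
        val_set = set(val)
        train = [i for i in idx if i not in val_set]
        # Group training indices by class label.
        by_class = {}
        for i in train:
            by_class.setdefault(labels[i], []).append(i)
        target = max(len(members) for members in by_class.values())
        balanced = []
        for members in by_class.values():
            balanced.extend(members)
            # Top up smaller classes with randomly repeated indices.
            balanced.extend(rng.choice(members) for _ in range(target - len(members)))
        yield balanced, val

toy_labels = [0] * 90 + [1] * 10
for train_idx, val_idx in folds_with_inner_oversampling(toy_labels):
    # No validation row is ever a copy of a training row.
    assert not set(train_idx) & set(val_idx)
```

With the real libraries, the usual route is to wrap the sampler and the classifier in an imbalanced-learn pipeline and pass that to `cross_val_score`, so the sampler is applied only when each fold is fitted.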

In [76]:
y_pred = logreg.predict(X_test)

In [77]:
from sklearn.metrics import classification_report
target_names = ['non-science', 'science']
print(classification_report(y_test, y_pred, target_names=target_names))

             precision    recall  f1-score   support

non-science       0.96      0.94      0.95       770
    science       0.60      0.71      0.65       107

avg / total       0.92      0.91      0.91       877
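
As a quick sanity check on how to read this report, the F1 score is the harmonic mean of precision and recall, and the `avg / total` row is the support-weighted average of the per-class values. The sketch below recomputes both from the rounded figures printed above. The dictionary layout and helper names are illustrative only.

```python
# Per-class figures copied from the logistic regression report above.
report = {
    "non-science": {"precision": 0.96, "recall": 0.94, "f1": 0.95, "support": 770},
    "science": {"precision": 0.60, "recall": 0.71, "f1": 0.65, "support": 107},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def weighted_average(metric):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total

print(round(f1(0.60, 0.71), 2))  # 0.65, the science-class f1-score
for metric in ("precision", "recall", "f1"):
    print(metric, round(weighted_average(metric), 2))  # 0.92, 0.91, 0.91
```

Because the majority class has 770 of the 877 test rows, the averaged row is dominated by the non-science scores. The science row (precision 0.60, recall 0.71) is the more informative one for this problem.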



In [78]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=5, random_state=0, n_estimators=20)
clf.fit(X_resampled, y_resampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [79]:
y_pred = clf.predict(X_test)

In [80]:
target_names = ['non-science', 'science']
print(classification_report(y_test, y_pred, target_names=target_names))

             precision    recall  f1-score   support

non-science       0.95      0.94      0.95       770
    science       0.60      0.65      0.62       107

avg / total       0.91      0.90      0.91       877
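
To put both models in context, it helps to compare them with a baseline that always predicts the majority class. For single-label classification the support-weighted recall equals overall accuracy, so the `avg / total` recall values can stand in for accuracy here. This is a rough comparison based on the rounded figures in the two reports, and the variable names are illustrative.

```python
# Test-set class counts taken from the support column of the reports.
support = {"non-science": 770, "science": 107}
total = sum(support.values())

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(support.values()) / total

# Support-weighted recall from each report, used as overall accuracy.
model_accuracy = {"logistic regression": 0.91, "random forest": 0.90}

print(f"majority-class baseline: {baseline_accuracy:.3f}")
for name, acc in model_accuracy.items():
    print(f"{name}: {acc:.2f} (+{acc - baseline_accuracy:.3f} over baseline)")
```

Both models beat the baseline by only a few points of accuracy, so the science-class recall (0.71 for logistic regression, 0.65 for the random forest) is a better basis for choosing between them.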

