# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Posting + Salary Prediction + Job Title feature importance


In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler #statsmodels api module
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.cross_validation import cross_val_score, cross_val_predict

sns.set_style('darkgrid')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

  from pandas.core import datetools


# Load dataset

In [2]:
df_raw = pd.read_csv('./data_MCF_combined.csv', keep_default_na=False)
df_raw.drop(columns='Unnamed: 0' , axis=1, inplace=True)

In [3]:
df_raw.head()

Unnamed: 0,Company,JobTitle,Location,EmploymentType,Seniority,Category,GovSupport,SalaryRange,JobLink,PostedDate,ClosingDate,RoleResponsibility,Requirements
0,GOOGLE ASIA PACIFIC PTE. LTD.,"Data Center Strategic Negotiator, Site Acquisi...",South,Full Time,Executive,General Management,Government support available,"$10,500to$17,500Monthly",/job/c7be1cb4a491ee3740a7f44fa5e2f03c,19-Apr-18,19-May-18,Company overview: Google is not a conventional...,Minimum qualifications: Bachelor's degree in ...
1,BLOOMBERG L.P.,Data Contribution Support Representative,South,Full Time,Executive,General Management,Government support available,"$60,000to$75,000Annually",/job/8c04f46eebe78966a20d0f6b66874036,19-Apr-18,19-May-18,You are excited by the prospect of operating o...,You'll need to have - Excellent communication ...
2,A*STAR RESEARCH ENTITIES,SICS - Data Manager,Central,Full Time,Junior Executive,Banking and Finance,Government support available,"$2,500to$5,000Monthly",/job/56cd672437e47eec4b2f269fc7bb4e56,17-Apr-18,17-May-18,About Singapore Institute for Clinical Science...,Degree in Bioinformatics or relevant field 5-...
3,CARAT MEDIA SERVICES SINGAPORE PTE LTD,Data Engineer,Central,Full Time,Junior Executive,Banking and Finance,Government support available,"$3,500to$6,000Monthly",/job/fadf2ca5185e0bb439ca1ffae5f526e7,12-Apr-18,12-May-18,Summary: iProspect helps our clients achieve t...,Skills & Experience Required: The role will be...
4,UBS AG,CMO Maintenance Data Analyst,South,Contract ...,Non-executive,Sciences / Laboratory / R&D,,"$35,000to$58,000Annually",/job/531f8e41ed9984b7595b176d7edab410,11-Apr-18,11-May-18,Your role : Are you incredibly organized with ...,Your��experience and skills :�ۢ Past experienc...


# Data cleaning

In [4]:
print df_raw.get_dtype_counts()
print '----------------------------------'
print df_raw.dtypes
print '----------------------------------'
print df_raw.isnull().sum()
print '----------------------------------'
print df_raw.index
print '----------------------------------'
print df_raw.shape

object    13
dtype: int64
----------------------------------
Company               object
JobTitle              object
Location              object
EmploymentType        object
Seniority             object
Category              object
GovSupport            object
SalaryRange           object
JobLink               object
PostedDate            object
ClosingDate           object
RoleResponsibility    object
Requirements          object
dtype: object
----------------------------------
Company               0
JobTitle              0
Location              0
EmploymentType        0
Seniority             0
Category              0
GovSupport            0
SalaryRange           0
JobLink               0
PostedDate            0
ClosingDate           0
RoleResponsibility    0
Requirements          0
dtype: int64
----------------------------------
RangeIndex(start=0, stop=1020, step=1)
----------------------------------
(1020, 13)


In [5]:
# Drop irrelevant columns
print 'Before irrelevant columns remove: \n', df_raw.columns
df = df_raw.drop(['GovSupport','Location','JobLink','PostedDate','ClosingDate'] , axis = 1)
print '-----------------------------------------------------------------------'
print 'After irrelevant columns remove: \n', df.columns

Before irrelevant columns remove: 
Index([u'Company', u'JobTitle', u'Location', u'EmploymentType', u'Seniority',
       u'Category', u'GovSupport', u'SalaryRange', u'JobLink', u'PostedDate',
       u'ClosingDate', u'RoleResponsibility', u'Requirements'],
      dtype='object')
-----------------------------------------------------------------------
After irrelevant columns remove: 
Index([u'Company', u'JobTitle', u'EmploymentType', u'Seniority', u'Category',
       u'SalaryRange', u'RoleResponsibility', u'Requirements'],
      dtype='object')


In [6]:
for col in df.columns :
    print 'Column Name:' ,col
    print df[col].value_counts()
    print '-----------------------------------------------'
    

Column Name: Company
A*STAR RESEARCH ENTITIES                                53
NANYANG TECHNOLOGICAL UNIVERSITY                        28
DBS BANK LTD.                                           28
NATIONAL UNIVERSITY OF SINGAPORE                        25
COMTEL SOLUTIONS PTE LTD                                15
LAZADA SOUTH EAST ASIA PTE. LTD.                        14
MANPOWER STAFFING SERVICES (SINGAPORE) PTE LTD          10
STANDARD CHARTERED BANK                                  9
AMAZON ASIA-PACIFIC RESOURCES PRIVATE LIMITED            8
NATIONAL UNIVERSITY HOSPITAL (SINGAPORE) PTE LTD         7
BLUECHIP PLATFORMS ASIA PTE. LTD.                        7
ADECCO PERSONNEL PTE LTD                                 7
GRABTAXI HOLDINGS PTE. LTD.                              7
NTT DATA SINGAPORE PTE. LTD.                             7
Singapore Land Authority                                 7
HCL SINGAPORE PTE. LTD.                                  7
DIMENSION DATA (SINGAPORE) PTE. LTD

In [7]:
# What to clean
# Plan 1) Remove the ... from feature EmploymentType and Seniority and Category
# Plan 2) remove SalaryRange '$' and ',' and 'Monthly'
# Plan 3) Split  SalaryRange to min columns and max columns
# Plan 4) Remove remove_non_ascii characters in RoleResponsibility and Requirements and JobTitle


In [8]:
# Plan 1) Remove the ... and blank space from feature EmploymentType and Seniority and Category
print '######################################################################'
print "Before removal : " , df_raw.EmploymentType.unique()
df.EmploymentType = df.EmploymentType.str.replace(".", "")
print '---------------------------------------------------------------------'
df.EmploymentType = df.EmploymentType.str.replace('\s+', '')
print "After removal : ", df.EmploymentType.unique()
print '######################################################################'
print "Before removal : " , df_raw.Seniority.unique()
df.Seniority = df.Seniority.str.replace(".", "")
print '----------------------------------------------------------------------'
df.Seniority = df.Seniority.str.replace('\s+', '')
print "After removal : ", df.Seniority.unique()
print '######################################################################'
print "Before removal : " , df_raw.Category.unique()
df.Category = df.Category.str.replace(".", "")
print '----------------------------------------------------------------------'
df.Category = df.Category.str.replace('\s+', '')
print "After removal : ", df.Category.unique()
print '########################################################################'

######################################################################
Before removal :  ['Full Time' 'Contract ...' 'Permanent ...' 'Contract' 'Permanent'
 'Part Time ...' 'Temporary ...' 'Temporary' 'Freelance ...']
---------------------------------------------------------------------
After removal :  ['FullTime' 'Contract' 'Permanent' 'PartTime' 'Temporary' 'Freelance']
######################################################################
Before removal :  ['Executive' 'Junior Executive' 'Non-executive' 'Executive ...'
 'Professional' 'Senior Executive' 'Professional ...' 'Manager ...'
 'Manager' 'Senior Management' 'Middle Management' 'Fresh/entry level'
 'Fresh/entry level ...' 'Middle Management ...' 'Senior Management ...']
----------------------------------------------------------------------
After removal :  ['Executive' 'JuniorExecutive' 'Non-executive' 'Professional'
 'SeniorExecutive' 'Manager' 'SeniorManagement' 'MiddleManagement'
 'Fresh/entrylevel']
####################

In [9]:
# Plan 2) remove SalaryRange '$' and ',' and 'Monthly', so left only 'to'

# Must create a new column for Monthly AND Annually for easy computation later.
df['Monthly'] = df.SalaryRange.str.contains(pat='Monthly')
df['Annually'] = df.SalaryRange.str.contains(pat='Annually')

# Now clean up Salary
df.SalaryRange = df.SalaryRange.str.replace("Monthly", "")
df.SalaryRange = df.SalaryRange.str.replace("Annually", "")
df.SalaryRange = df.SalaryRange.str.replace("$", "")
df.SalaryRange = df.SalaryRange.str.replace(",", "")

In [10]:
# Plan 3) Split SalaryRange and compute and display to only Annual Pay (min and max)

# Split min_pay and Max_pay between 'to'
result = pd.concat([df, df.SalaryRange.str.split(pat='to', expand=True)], axis=1)
df = result.rename(columns={0:'Min_pay', 1:'Max_pay'})

# Convert column to int so that can compute later
df.Min_pay = pd.to_numeric(df.Min_pay, errors='coerce').fillna(0).astype(np.int64)
df.Max_pay = pd.to_numeric(df.Max_pay, errors='coerce').fillna(0).astype(np.int64)

# Compute new value for Annual pays
df['Annual_min'] = 0
df.loc[df['Monthly'] == True, 'Annual_min'] = df['Min_pay']*12
df.loc[df['Monthly'] == False, 'Annual_min'] = df['Min_pay']

df['Annual_max'] = 0
df.loc[df['Monthly'] == True, 'Annual_max'] = df['Max_pay']*12
df.loc[df['Monthly'] == False, 'Annual_max'] = df['Max_pay']

# Remove unnecessarily columns
#df = df.drop(columns=['SalaryRange', 'Monthly','Annually','Min_pay','Max_pay' ] , axis=1)

In [11]:
# Plan 4 : Remove remove_non_ascii characters 
def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i)<128)

df['Requirements'] = df['Requirements'].apply(remove_non_ascii)
df['RoleResponsibility'] = df['RoleResponsibility'].apply(remove_non_ascii)
df['JobTitle'] = df['JobTitle'].apply(remove_non_ascii)


In [12]:
df.columns

Index([u'Company', u'JobTitle', u'EmploymentType', u'Seniority', u'Category',
       u'SalaryRange', u'RoleResponsibility', u'Requirements', u'Monthly',
       u'Annually', u'Min_pay', u'Max_pay', u'Annual_min', u'Annual_max'],
      dtype='object')

In [13]:
# Plan 5: Combine all text and dump into 1 column
df['combined'] = df['Requirements'] 
df['Annual_medium'] = (df['Annual_max'] + df['Annual_min']) / 2
# df.loc[df['Monthly'] == False, 'Annual_max'] = df['Max_pay']
df['Annual_medium'].describe()

count      1020.000000
mean      69650.323529
std       47682.868933
min           0.000000
25%       42000.000000
50%       66000.000000
75%       90000.000000
max      390000.000000
Name: Annual_medium, dtype: float64

In [14]:
# Plan 6: Split those without salaries into another df, call df_topredict
df_topredict = df.loc[df['Annual_medium'] == 0]
df_train = df.loc[df['Annual_medium'] != 0]

In [15]:
X = df_train['combined']
y = df_train['Annual_medium']
# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

print X_train.shape
print X_test.shape
print y_train.shape
print y_test.shape

type(X_train)

(683L,)
(228L,)
(683L,)
(228L,)


pandas.core.series.Series



### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

### Answer Question 1: 
    

My approach: Every step below will checks regression score.<br>
1) CountVectorizer without stop words<br>
2) CountVectorizer again with default sklearn's stop words<br>
3) Add in more stop word, with CountVectorizer<br>
4) Stemming and CountVectorizer<br>
5) Lemmation and CountVectorizer<br>
6) TfidfVectorizer and Hashing Vectorizer /.<br>
7) Take the best scored model, re-run on all regression that I know.<br>
8) If all dun give good scores, I will give up and change to classification.

# NLP

### Load stop words either from NLTK or sklearn

In [16]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer, TfidfVectorizer, HashingVectorizer
import nltk
import numpy as np

# NLP : Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages

In [17]:
# NLTK and sklearn both have their own stop words list 
nltk.download('stopwords')
nltk_stops = stopwords.words()
custom_stop_words = list(ENGLISH_STOP_WORDS)

[nltk_data] Downloading package stopwords to C:\Users\default.LAPTOP-
[nltk_data]     2CI68M4P\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
Xs_train = vect.fit_transform(X_train)
Xs_test = vect.transform(X_test)
# Lets check the length of our data that is in a vectorized state
print('# of features: {}'.format( len(vect.get_feature_names()) ))
print 'Xs_train.shape', Xs_train.shape
print 'Xs_test.shape', Xs_test.shape

# of features: 6165
Xs_train.shape (683, 6165)
Xs_test.shape (228, 6165)


#### As you can see after my training set gets vectorized, the feature expanded MANY columns, and remained the same rows.

In [19]:
# But what the hell!! you can't see what's inside easily. 
Xs_train

<683x6165 sparse matrix of type '<type 'numpy.int64'>'
	with 57656 stored elements in Compressed Sparse Row format>

In [20]:
# to see what's inside, re-do this and put into a df.
# Transforming our x_train data using our fit cvec.
# And converting the result to a DataFrame.
Xs_train = pd.DataFrame(vect.transform(df['combined']).todense(),
                       columns=vect.get_feature_names())

Xs_train.head()

Unnamed: 0,00,000,0030,00am,00pm,02c3423,03,03c4590,03c5451,03c5577,...,zenhub,zephus,zeppelin,zero,zfs,zone,zones,zoo,zookeeper,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Which words appear the most?
word_counts = Xs_train.sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

and           6595
in            3700
to            3446
of            2713
experience    2437
with          2029
the           1859
or            1536
skills        1362
data          1302
for            833
knowledge      808
be             757
on             743
ability        737
strong         729
work           726
years          699
management     673
is             657
dtype: int64

Wait WHAT!!!! Top words has no meaning. Let's re-vec with stop-words.

# NLP : Stopword removal**: a/an/the

In [22]:
from sklearn.feature_extraction import text 
new_words = ['work','new', 'ensure', 'team', 'provide', 'including' , 'experience', 'skills', 'data', 'knowledge',
            'ability', 
'strong'            ,
'years'             ,
'management'        ,
'business'          ,
'working'           ,
'good'              ,
'degree'            ,
'communication'     ,
'excellent'         ,
'able'              ,
'science'           ,
'computer'          ,
'time'              ,
'development'       ,
'environment'       ,
'related'           ,
'engineering'       ,
'understanding'     ,
'technical'         ,
'analytical'        ,
'relevant'          ,
'requirements'      ,
'minimum'           ,
'tools'             ,
'systems'           ,
'analytics'         ,
'information',
            preferred   ,            
learning               , 
design                  ,
software                ,
written                 ,
technology              ,
project                 ,
plus                    ,
candidates              ,
using                   ,
required                
research                
com                     
industry                
interpersonal           
python                  
solutions               
excel                   
problem                 
projects                
advantage               
apply                   
solving                 
statistics              
programming             
level                   
high                    
player    ]
custom_stop_words.extend(new_words)
stop_words = text.ENGLISH_STOP_WORDS.union(custom_stop_words)
type(stop_words)

frozenset

In [32]:
# remove English stop words
vect = CountVectorizer(stop_words=stop_words, ngram_range=(1,2),lowercase=True)
Xs_train = vect.fit_transform(X_train)
Xs_test = vect.transform(X_test)

Xs_train = pd.DataFrame(vect.transform(df['combined']).todense(),
                       columns=vect.get_feature_names())

word_counts = Xs_train.sum(axis=0)
word_counts.sort_values(ascending = False).head

<bound method Series.head of analysis                254
sql                     254
preferred               253
learning                245
design                  240
software                239
written                 238
technology              230
project                 226
plus                    225
candidates              218
using                   210
required                200
research                195
com                     192
industry                188
interpersonal           184
python                  183
solutions               182
excel                   181
problem                 178
projects                175
advantage               172
apply                   172
solving                 171
statistics              171
programming             170
level                   165
high                    165
player                  163
                       ... 
nrfs create               1
nse                       1
nse level                 1
notch organizationa

That's more like it. But I see some repeated words like 'Data" and 'data'. So I re-vec again with lowercase=True.

### Baseline
Let's now check the score, use it as baseline.

In [24]:
print Xs_train.shape
print Xs_test.shape
print y_train.shape
print y_test.shape

(1020, 34446)
(228, 34446)
(683L,)
(228L,)


ValueError: Found input variables with inconsistent numbers of samples: [1020, 683] <br>
Oops!! Because Xs_train is not in df type. Just re-vec without turning to df.

### Re-Baseline
 (ignoring above X conversions, Re-vectorize X from here onwards) <br>
Let's now check the score, use it as baseline.

In [None]:
# check
type(X_train)
# X_train

#### CountVectorizer

In [None]:
vect = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2),lowercase=True)
Xs_train = vect.fit_transform(X_train)
Xs_test = vect.transform(X_test)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(Xs_train, y_train)
lr.score(Xs_test, y_test)

Wow!! The score is terrible. Let's try other vectorizer

#### TF-IDF Vectorizer - word importance

In [None]:
# TfidfVectorizer
tfidfvect = TfidfVectorizer(stop_words=stop_words, ngram_range = (1,2))
Xs_train = tfidfvect.fit_transform(X_train)
Xs_test = tfidfvect.transform(X_test)

Xs_train = pd.DataFrame(tfidfvect.transform(df['combined']).todense(),
                       columns=tfidfvect.get_feature_names())

word_counts = Xs_train.sum(axis=0)
word_counts.sort_values(ascending = False)

In [None]:
Xs_train = tfidfvect.fit_transform(X_train)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(Xs_train, y_train)
lr.score(Xs_test, y_test)

#### Hashing Vectorizer

In [None]:
hashvec = HashingVectorizer(stop_words=stop_words, non_negative=True, ngram_range = (1,2))
Xs_train = hashvec.fit_transform(X_train)
Xs_test = hashvec.transform(X_test)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(Xs_train, y_train)
lr.score(Xs_test, y_test)

I have tried different vectorization, but only TfidfVectorizer scores better. We need to do more. Let's try Stemming and Lemmatization.

### Stemming and Lemmatization
**Stemming**        'was' has 'wa' as stem
- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way

**Lemmatization**  "better" has "good" as its lemma
- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Can be better than stemming

In [None]:
nltk.download('punkt')

Stem and Lemma didn't like pandas dataframe. It takes a word at a time. So I had to use textblob to break each row into its word.

In [None]:
#check
type(df['combined'][0])

Textblob to word now

In [None]:
from textblob import TextBlob
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
# review = TextBlob(yelp_best_worst.text[0])

textblobbed = []

for rows in df['combined']:
    textblobbed.append(TextBlob(rows).words)
df['textblobbed']= textblobbed
df['textblobbed'] = df['textblobbed'].apply(', '.join)
df['textblobbed'] 

Stemming now.

In [None]:
df['stemmed'] = [stemmer.stem(word) for word in df['textblobbed']]
df['stemmed'][0]

I don't like Stemming, seem to remove many 'es' causing the meaning change. Don't use it. Let's try Lemmatize instead.

Lemmatization now

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized = []
for rows in df['textblobbed']:
    lemmatized.append(lemmatizer.lemmatize(rows))

df['lemmatized'] = lemmatized
df.lemmatized[0]

Lemmatized dataframe looked better, spelling intacted. Let's use this for modelling.

In [None]:
df_train_new = df.loc[df['Annual_medium'] != 0]
X_lem = df['lemmatized']
y_lem = df['Annual_medium']
# split the new DataFrame into training and testing sets
Xl_train, Xl_test, yl_train, yl_test = train_test_split(X_lem, y_lem, test_size=0.25, random_state=1)

print Xl_train.shape
print Xl_test.shape
print yl_train.shape
print yl_test.shape

In [None]:
# lemmatized text on tfidfvect with stop words and ngram1,2 on Linear regression
tfidfvect = TfidfVectorizer(stop_words=stop_words, ngram_range = (1,2))
Xls_train = tfidfvect.fit_transform(Xl_train)
Xls_test = tfidfvect.transform(Xl_test)


from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(Xls_train, yl_train)
lr.score(Xls_test, yl_test)

This is the best score so far. TfidfVectorizer with stop+ words, lemmatized, ngram 1 and 2.

Since it scored well, lets do ALL regression that I know.

All regression on TfidfVectorizer with stop+ words, lemmatized, ngram 1 and 2.

In [None]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

lr = LinearRegression()
lr_lasso = Lasso(random_state=0, alpha=500) # alpha 0 to 1
lr_ridge = Ridge(alpha=100)   # alpha 0 to 1
elast = ElasticNet(alpha=0.1, l1_ratio=0.1)   # alpha 0 to 1
svr = SVR(kernel='linear')  # almost the same as poly
dtr = DecisionTreeRegressor(random_state=0)
rfr = RandomForestRegressor(n_estimators = 10, random_state = 0) # number of trees 10. Pick random.
gradboost_reg = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1) # default n_estimators=100, learning_rate=0.1

lr.fit(Xls_train, yl_train)
lr_lasso.fit(Xls_train, yl_train)
lr_ridge.fit(Xls_train, yl_train)
elast.fit(Xls_train, yl_train)
svr.fit(Xls_train, yl_train)
dtr.fit(Xls_train, yl_train)
rfr.fit(Xls_train, yl_train)
gradboost_reg.fit(Xls_train, yl_train)


print 'Accuracy score :'
print 'LinearRegression          : ', lr.score(Xls_test, yl_test)
print 'Lasso                     : ', lr_lasso.score(Xls_test, yl_test)
print 'Ridge                     : ', lr_ridge.score(Xls_test, yl_test)
print 'ElasticNet                : ', elast.score(Xls_test, yl_test)
print 'SVR                       : ', svr.score(Xls_test, yl_test)
print 'DecisionTreeRegressor     : ', dtr.score(Xls_test, yl_test)
print 'RandomForestRegressor     : ', rfr.score(Xls_test, yl_test)
print "GradientBoostingRegressor : ", gradboost_reg.score(Xls_test, yl_test)
print '=====================================================================:'
print 'Cross_val_score :'
print 'LinearRegression          : ', cross_val_score(lr, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print 'Lasso                     : ', cross_val_score(lr_lasso, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print 'Ridge                     : ', cross_val_score(lr_ridge, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print 'ElasticNet                : ', cross_val_score(elast, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print 'SVR                       : ', cross_val_score(svr, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print 'DecisionTreeRegressor     : ', cross_val_score(dtr, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print 'RandomForestRegressor     : ', cross_val_score(rfr, Xls_train, yl_train, cv=3, n_jobs=-1).mean()
print "GradientBoostingRegressor : ", cross_val_score(gradboost_reg, Xls_train, yl_train, cv=3, n_jobs=-1).mean()

In [None]:
#Ensemble method ==> Bagging , AdaBoost, GradientBoost
# Bagging regressor.
lr = LinearRegression()
from sklearn.ensemble import BaggingRegressor
lr_bagger = BaggingRegressor(lr)
print "lr Bagging Score            : ", cross_val_score(lr_bagger, Xls_train, yl_train, cv=10, n_jobs=-1).mean()

In [None]:
# AdaBoost Regressor
from sklearn.ensemble import AdaBoostRegressor
lr_adaboost = AdaBoostRegressor(base_estimator=lr, learning_rate=0.1 )
print "lr_adaboost Score       : ", cross_val_score(lr_adaboost, Xls_train, yl_train, cv=10, n_jobs=-1).mean()

In [None]:
lr_adaboost.fit(Xls_train, yl_train)
y_pred_lr_adaboost = lr_adaboost.predict(Xls_test)

In [None]:
df_plot = pd.DataFrame({'y_pred': y_pred_lr_adaboost, 'y_true': yl_test})
#df.head()
sns.lmplot("y_pred", "y_true", data=df_plot)

Perhaps we should first look into the regression assumption. <br>
For linear regression, it needs the relationship between the independent and dependent variables to be linear.  

I give up regression!! None is giving good scores.   <br>
Let's turn it into a classification.<br>
1. The plan is to bin the salary into 4 classes as out target. <br>
2. Make sure that target is a balanced distribution <br>
3. Same like working on regression, I will do the same to classification, <br>
ie. countvectorize, tfidfVectorizer, HashingVectorizer (with/without stop_words), ngram stick to 1 & 2. <br>
4. Then do all classification models.<br>
5. If good, I will use it for prediction.

### Classification

In [None]:
df['Annual_medium'].hist(bins=20) 

In [None]:
df_train['Annual_medium'].describe()

In [None]:
# Plan 6: Split those without salaries into another df, call df_topredict
df_topredict = df.loc[df['Annual_medium'] == 0]
df_train = df.loc[df['Annual_medium'] != 0]

In [5]:
(df_train['Annual_medium'] >= 48000) & (df_train['Annual_medium'] < 72000) 

NameError: name 'df_train' is not defined

In [69]:
# Binning the salary and make sure it is balanced.
df_train.loc[(df_train['Annual_medium'] >= 0) & (df_train['Annual_medium'] < 48000) , 'Pay_class'] = 1
df_train.loc[(df_train['Annual_medium'] >= 48000) & (df_train['Annual_medium'] < 72000) , 'Pay_class'] = 2
df_train.loc[(df_train['Annual_medium'] >= 72000) & (df_train['Annual_medium'] < 96000) , 'Pay_class'] = 3
df_train.loc[(df_train['Annual_medium'] >= 96000) , 'Pay_class'] = 4
df_train['Pay_class']= df_train['Pay_class'].fillna(0.0).astype(int)  # Turn y into int
df_train['Pay_class'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


3    245
2    236
4    232
1    198
Name: Pay_class, dtype: int64

#### Above are the bins and the qty. Ok quite balanced.

#### Baseline = 207/911= 0.22

In [70]:
df_train.loc[df_train['Pay_class'] == 1, 'Pay_pred_median'] =  (0.0 + 48000.0)/2
df_train.loc[df_train['Pay_class'] == 2, 'Pay_pred_median'] = (48000.0 + 72000.0)/2
df_train.loc[df_train['Pay_class'] == 3, 'Pay_pred_median'] = (72000.0 + 96000.0)/2
df_train.loc[df_train['Pay_class'] == 4, 'Pay_pred_median'] = (96000.0 + 390000.0)/2
df_train

Unnamed: 0,Company,JobTitle,EmploymentType,Seniority,Category,SalaryRange,RoleResponsibility,Requirements,Monthly,Annually,...,Max_pay,Annual_min,Annual_max,combined,Annual_medium,textblobbed,stemmed,lemmatized,Pay_class,Pay_pred_median
0,GOOGLE ASIA PACIFIC PTE. LTD.,"Data Center Strategic Negotiator, Site Acquisi...",FullTime,Executive,GeneralManagement,10500to17500,Company overview: Google is not a conventional...,Minimum qualifications: Bachelor's degree in ...,True,False,...,17500,126000,210000,"Data Center Strategic Negotiator, Site Acquisi...",168000.0,"Data, Center, Strategic, Negotiator, Site, Acq...","data, center, strategic, negotiator, site, acq...","Data, Center, Strategic, Negotiator, Site, Acq...",4,243000.0
1,BLOOMBERG L.P.,Data Contribution Support Representative,FullTime,Executive,GeneralManagement,60000to75000,You are excited by the prospect of operating o...,You'll need to have - Excellent communication ...,False,True,...,75000,60000,75000,Data Contribution Support RepresentativeFullTi...,67500.0,"Data, Contribution, Support, RepresentativeFul...","data, contribution, support, representativeful...","Data, Contribution, Support, RepresentativeFul...",2,60000.0
2,A*STAR RESEARCH ENTITIES,SICS - Data Manager,FullTime,JuniorExecutive,BankingandFinance,2500to5000,About Singapore Institute for Clinical Science...,Degree in Bioinformatics or relevant field 5-...,True,False,...,5000,30000,60000,SICS - Data ManagerFullTimeJuniorExecutiveAbou...,45000.0,"SICS, Data, ManagerFullTimeJuniorExecutiveAbou...","sics, data, managerfulltimejuniorexecutiveabou...","SICS, Data, ManagerFullTimeJuniorExecutiveAbou...",1,24000.0
3,CARAT MEDIA SERVICES SINGAPORE PTE LTD,Data Engineer,FullTime,JuniorExecutive,BankingandFinance,3500to6000,Summary: iProspect helps our clients achieve t...,Skills & Experience Required: The role will be...,True,False,...,6000,42000,72000,Data EngineerFullTimeJuniorExecutiveSummary: i...,57000.0,"Data, EngineerFullTimeJuniorExecutiveSummary, ...","data, engineerfulltimejuniorexecutivesummary, ...","Data, EngineerFullTimeJuniorExecutiveSummary, ...",2,60000.0
4,UBS AG,CMO Maintenance Data Analyst,Contract,Non-executive,Sciences/Laboratory/R&D,35000to58000,Your role : Are you incredibly organized with ...,Yourexperience and skills : Past experience in...,False,True,...,58000,35000,58000,CMO Maintenance Data AnalystContractNon-execut...,46500.0,"CMO, Maintenance, Data, AnalystContractNon-exe...","cmo, maintenance, data, analystcontractnon-exe...","CMO, Maintenance, Data, AnalystContractNon-exe...",1,24000.0
5,MONEYSMART SINGAPORE PTE. LTD.,Data Analyst,Contract,Non-executive,Sciences/Laboratory/R&D,5000to7000,The mission As part of becoming one of the fou...,"Requirements Degree in Computer Science, Math...",True,False,...,7000,60000,84000,Data AnalystContractNon-executiveThe mission A...,72000.0,"Data, AnalystContractNon-executiveThe, mission...","data, analystcontractnon-executivethe, mission...","Data, AnalystContractNon-executiveThe, mission...",3,84000.0
6,PRICEWATERHOUSECOOPERS CONSULTING (SINGAPORE) ...,Technology Consulting - Data and Analytics Ass...,Permanent,Executive,Advertising/Media,3500to7000,Consulting We help organisations to work smar...,Requirements Below are the attributes and ski...,True,False,...,7000,42000,84000,Technology Consulting - Data and Analytics Ass...,63000.0,"Technology, Consulting, Data, and, Analytics, ...","technology, consulting, data, and, analytics, ...","Technology, Consulting, Data, and, Analytics, ...",2,60000.0
7,DENTSU AEGIS NETWORK HUB2050,Data Science Analyst,Permanent,Executive,Advertising/Media,4000to6000,BACKGROUND: About Dentsu Aegis Network Dentsu ...,We invite people passionate about data and art...,True,False,...,6000,48000,72000,Data Science AnalystPermanentExecutiveBACKGROU...,60000.0,"Data, Science, AnalystPermanentExecutiveBACKGR...","data, science, analystpermanentexecutivebackgr...","Data, Science, AnalystPermanentExecutiveBACKGR...",2,60000.0
8,COMTEL SOLUTIONS PTE LTD,Data Analyst / Business Data Analyst,FullTime,Professional,BankingandFinance,7000to10500,Responsible to perform Data Profiling Respons...,Over all 5 to 8 years of experience Analyst w...,True,False,...,10500,84000,126000,Data Analyst / Business Data AnalystFullTimePr...,105000.0,"Data, Analyst, Business, Data, AnalystFullTime...","data, analyst, business, data, analystfulltime...","Data, Analyst, Business, Data, AnalystFullTime...",4,243000.0
9,LAZADA SERVICES SOUTH EAST ASIA PTE. LTD.,Senior Data Engineer,FullTime,Professional,BankingandFinance,9000to11000,Introduction to Lazada eLogistics (LeL) Every...,"To succeed in the role, you should ideally hav...",True,False,...,11000,108000,132000,Senior Data EngineerFullTimeProfessionalIntrod...,120000.0,"Senior, Data, EngineerFullTimeProfessionalIntr...","senior, data, engineerfulltimeprofessionalintr...","Senior, Data, EngineerFullTimeProfessionalIntr...",4,243000.0


train_test_split

In [71]:
X_cat = df_train['combined']
y_cat = df_train['Pay_class']

# split the new DataFrame into training and testing sets
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_cat, y_cat, test_size=0.25, random_state=1)

print Xc_train.shape
print Xc_test.shape
print yc_train.shape
print yc_test.shape

(683L,)
(228L,)
(683L,)
(228L,)


In [72]:
#CountVectorizer with Log_regression
cvec_sn12 = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2),lowercase=True)
X_cvec_sn12_train = cvec_sn12.fit_transform(Xc_train)
X_cvec_sn12_test = cvec_sn12.transform(Xc_test)

from sklearn.linear_model import LogisticRegression
logr = LogisticRegression(random_state = 0)
logr.fit(X_cvec_sn12_train, yc_train)
print 'logr.score with X_cvec_sn12' , logr.score(X_cvec_sn12_test, yc_test)

yc_pred = logr.predict(X_cvec_sn12_test)

from sklearn.metrics import classification_report
print(classification_report(yc_test, yc_pred)) 

logr.score with X_cvec_sn12 0.47368421052631576
             precision    recall  f1-score   support

          1       0.59      0.67      0.63        43
          2       0.49      0.34      0.40        74
          3       0.36      0.53      0.43        51
          4       0.52      0.45      0.48        60

avg / total       0.49      0.47      0.47       228



In [73]:
#CountVectorizer with decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=None)
dtc.fit(X_cvec_sn12_train, yc_train)
print 'dtc.score with X_cvec_sn12' , dtc.score(X_cvec_sn12_test, yc_test)

ydtc_pred = dtc.predict(X_cvec_sn12_test)

from sklearn.metrics import classification_report
print(classification_report(yc_test, ydtc_pred))

dtc.score with X_cvec_sn12 0.43859649122807015
             precision    recall  f1-score   support

          1       0.56      0.53      0.55        43
          2       0.45      0.30      0.36        74
          3       0.31      0.47      0.37        51
          4       0.51      0.52      0.51        60

avg / total       0.45      0.44      0.44       228



In [74]:
#CountVectorizer with naive_bayes MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_cvec_sn12_train, yc_train)
print 'MultinomialNB .score with X_cvec_sn12: ' , nb.score(X_cvec_sn12_test, yc_test)

ynb_pred = nb.predict(X_cvec_sn12_test)


from sklearn.metrics import classification_report
print(classification_report(yc_test, ynb_pred))

MultinomialNB .score with X_cvec_sn12:  0.47368421052631576
             precision    recall  f1-score   support

          1       0.85      0.51      0.64        43
          2       0.79      0.20      0.32        74
          3       0.36      0.49      0.41        51
          4       0.41      0.77      0.53        60

avg / total       0.60      0.47      0.46       228



In [75]:
# TfidfVectorizer with LogisticRegression with Stop word and n-gram1,2.
tfidfvect = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 2))
X_tfidfvect_train = tfidfvect.fit_transform(Xc_train)
X_tfidfvect_test = tfidfvect.transform(Xc_test)

from sklearn.linear_model import LogisticRegression
logr2 = LogisticRegression(random_state = 0)
logr2.fit(X_tfidfvect_train, yc_train)
print 'logr2.score with tfidfvect sn12: ' , logr2.score(X_tfidfvect_test, yc_test)

ylogr2_pred = logr2.predict(X_tfidfvect_test)


from sklearn.metrics import classification_report
print(classification_report(yc_test, ylogr2_pred))

logr2.score with tfidfvect sn12:  0.4868421052631579
             precision    recall  f1-score   support

          1       0.64      0.65      0.64        43
          2       0.70      0.19      0.30        74
          3       0.34      0.55      0.42        51
          4       0.50      0.68      0.58        60

avg / total       0.56      0.49      0.46       228



In [76]:
#HashingVectorizer
hashvec = HashingVectorizer(stop_words=stop_words, non_negative=True, ngram_range=(1, 2))
Xhvec_train = hashvec.fit_transform(Xc_train)
Xhvec_test = hashvec.transform(Xc_test)

from sklearn.linear_model import LogisticRegression
logr3 = LogisticRegression(random_state = 0)
logr3.fit(Xhvec_train, yc_train)

print 'logr3.score with hashvec with stop' , logr3.score(Xhvec_test, yc_test)

ylogr3_pred = logr3.predict(Xhvec_test)


from sklearn.metrics import classification_report
print(classification_report(yc_test, ylogr3_pred))



logr3.score with hashvec with stop 0.4649122807017544
             precision    recall  f1-score   support

          1       0.56      0.63      0.59        43
          2       0.58      0.19      0.29        74
          3       0.35      0.57      0.43        51
          4       0.49      0.60      0.54        60

avg / total       0.50      0.46      0.44       228



In [77]:
#TfidfVectorizer with GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier  
gbc = GradientBoostingClassifier(n_estimators=10)
gbc.fit(X_tfidfvect_train, yc_train)

print 'gbc.score with tfidfvect' , gbc.score(X_tfidfvect_test, yc_test)

ygbc_pred = gbc.predict(X_tfidfvect_test)

from sklearn.metrics import classification_report
print(classification_report(yc_test, ygbc_pred))

gbc.score with tfidfvect 0.3991228070175439
             precision    recall  f1-score   support

          1       0.49      0.44      0.46        43
          2       0.45      0.19      0.27        74
          3       0.28      0.59      0.38        51
          4       0.56      0.47      0.51        60

avg / total       0.45      0.40      0.39       228



### Now Predict salary and show features that is important, that has higher impact to the prediction. We will use TfidfVectorizer with LogisticRegression with Stop word and n-gram1,2.


In [78]:
df_topredict.head(2)

Unnamed: 0,Company,JobTitle,EmploymentType,Seniority,Category,SalaryRange,RoleResponsibility,Requirements,Monthly,Annually,Min_pay,Max_pay,Annual_min,Annual_max,combined,Annual_medium,textblobbed,stemmed,lemmatized
10,BLUECHIP PLATFORMS ASIA PTE. LTD.,Data analyst - global wealth management (contr...,FullTime,SeniorExecutive,InformationTechnology,Salary undisclosed,"Develop and implement databases, data collecti...",Minimum Degree At least 3 years of Data rela...,False,False,0,0,0,0,Data analyst - global wealth management (contr...,0.0,"Data, analyst, global, wealth, management, con...","data, analyst, global, wealth, management, con...","Data, analyst, global, wealth, management, con..."
12,BEACON CONSULTING PTE LTD,Temp Admin (Data Analytics),Permanent,Executive,InformationTechnology,Salary undisclosed,Temp Admins are required for a Data Analytics ...,Proficient in Microsoft Excel (data tabulatio...,False,False,0,0,0,0,Temp Admin (Data Analytics)PermanentExecutiveT...,0.0,"Temp, Admin, Data, Analytics, PermanentExecuti...","temp, admin, data, analytics, permanentexecuti...","Temp, Admin, Data, Analytics, PermanentExecuti..."


In [79]:
X_pred = df_topredict['combined']
X_pred_vec = tfidfvect.transform(X_pred)    # CountVectorizer+stop_words ngram_range=(1, 2),lowercase=True)
df_topredict['Pay_class'] = logr2.predict(X_pred_vec)
df_topredict

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,Company,JobTitle,EmploymentType,Seniority,Category,SalaryRange,RoleResponsibility,Requirements,Monthly,Annually,Min_pay,Max_pay,Annual_min,Annual_max,combined,Annual_medium,textblobbed,stemmed,lemmatized,Pay_class
10,BLUECHIP PLATFORMS ASIA PTE. LTD.,Data analyst - global wealth management (contr...,FullTime,SeniorExecutive,InformationTechnology,Salary undisclosed,"Develop and implement databases, data collecti...",Minimum Degree At least 3 years of Data rela...,False,False,0,0,0,0,Data analyst - global wealth management (contr...,0.0,"Data, analyst, global, wealth, management, con...","data, analyst, global, wealth, management, con...","Data, analyst, global, wealth, management, con...",1
12,BEACON CONSULTING PTE LTD,Temp Admin (Data Analytics),Permanent,Executive,InformationTechnology,Salary undisclosed,Temp Admins are required for a Data Analytics ...,Proficient in Microsoft Excel (data tabulatio...,False,False,0,0,0,0,Temp Admin (Data Analytics)PermanentExecutiveT...,0.0,"Temp, Admin, Data, Analytics, PermanentExecuti...","temp, admin, data, analytics, permanentexecuti...","Temp, Admin, Data, Analytics, PermanentExecuti...",1
18,BLUECHIP PLATFORMS ASIA PTE. LTD.,"Data analyst, reference data (contract)",FullTime,Professional,InformationTechnology,Salary undisclosed,"Established for over 80 years, the Client is a...",At least 3 years of Data related experience w...,False,False,0,0,0,0,"Data analyst, reference data (contract)FullTim...",0.0,"Data, analyst, reference, data, contract, Full...","data, analyst, reference, data, contract, full...","Data, analyst, reference, data, contract, Full...",1
38,Health Promotion Board,"Senior Manager / Manager, Data Management",FullTime,Professional,BuildingandConstruction,Salary undisclosed,Job Responsibilities \r\rManage internal and e...,-,False,False,0,0,0,0,"Senior Manager / Manager, Data ManagementFullT...",0.0,"Senior, Manager, Manager, Data, ManagementFull...","senior, manager, manager, data, managementfull...","Senior, Manager, Manager, Data, ManagementFull...",3
47,GRABTAXI HOLDINGS PTE. LTD.,data scientist,Permanent,SeniorManagement,Advertising/Media,Salary undisclosed,Get to know our Team: Grabs Data Science Depa...,"The must haves: Ph.D. graduate, or Masters(wi...",False,False,0,0,0,0,data scientistPermanentSeniorManagementGet to ...,0.0,"data, scientistPermanentSeniorManagementGet, t...","data, scientistpermanentseniormanagementget, t...","data, scientistPermanentSeniorManagementGet, t...",4
54,TELEKOMUNIKASI INDONESIA INTERNATIONAL PTE. LTD.,"Senior Sales Manager, Data Center",Permanent,Professional,InformationTechnology,Salary undisclosed,Responsibilities Support the region and pursu...,Requirements Bachelor Degree in any disciplin...,False,False,0,0,0,0,"Senior Sales Manager, Data CenterPermanentProf...",0.0,"Senior, Sales, Manager, Data, CenterPermanentP...","senior, sales, manager, data, centerpermanentp...","Senior, Sales, Manager, Data, CenterPermanentP...",4
65,HNWI PRIVATE LIMITED,"Data Entry (3 Months Part Time Contract, Flexi...",Contract,Fresh/entrylevel,Sciences/Laboratory/R&D,Salary undisclosed,"About HNWI Private Limited You Deliver, We Gro...",JOB APPLICATION PROCESS If you share our missi...,False,False,0,0,0,0,"Data Entry (3 Months Part Time Contract, Flexi...",0.0,"Data, Entry, 3, Months, Part, Time, Contract, ...","data, entry, 3, months, part, time, contract, ...","Data, Entry, 3, Months, Part, Time, Contract, ...",1
68,BLUECHIP PLATFORMS ASIA PTE. LTD.,Data Scientist - Financial Services,Permanent,Manager,InformationTechnology,Salary undisclosed,Ambitious growth plans in Asia Pacific Valu...,This candidate should have a degree in Statis...,False,False,0,0,0,0,Data Scientist - Financial ServicesPermanentMa...,0.0,"Data, Scientist, Financial, ServicesPermanentM...","data, scientist, financial, servicespermanentm...","Data, Scientist, Financial, ServicesPermanentM...",4
73,HNWI PRIVATE LIMITED,"Data Entry (3 Months Part Time Contract, Flexi...",FullTime,Professional,InformationTechnology,Salary undisclosed,"About HNWI Private Limited You Deliver, We Gro...",JOB APPLICATION PROCESS If you share our missi...,False,False,0,0,0,0,"Data Entry (3 Months Part Time Contract, Flexi...",0.0,"Data, Entry, 3, Months, Part, Time, Contract, ...","data, entry, 3, months, part, time, contract, ...","Data, Entry, 3, Months, Part, Time, Contract, ...",1
84,Auditor-General's Office,AGO Summer Internship (Data Analytics Unit),Permanent,Professional,BankingandFinance,Salary undisclosed,The Auditor-General's Office (AGO) is an indep...,-,False,False,0,0,0,0,AGO Summer Internship (Data Analytics Unit)Per...,0.0,"AGO, Summer, Internship, Data, Analytics, Unit...","ago, summer, internship, data, analytics, unit...","AGO, Summer, Internship, Data, Analytics, Unit...",3


OK, we have the Pay_class predicted for the empty salary. Now, I will fill in the blank of Annual_medium Salary.

In [80]:
df_topredict.loc[df_topredict['Pay_class'] == 1, 'Annual_medium'] =  (0.0 + 48000.0)/2
df_topredict.loc[df_topredict['Pay_class'] == 2, 'Annual_medium'] = (48000.0 + 72000.0)/2
df_topredict.loc[df_topredict['Pay_class'] == 3, 'Annual_medium'] = (72000.0 + 96000.0)/2
df_topredict.loc[df_topredict['Pay_class'] == 4, 'Annual_medium'] = (96000.0 + 390000.0)/2
df_topredict

Unnamed: 0,Company,JobTitle,EmploymentType,Seniority,Category,SalaryRange,RoleResponsibility,Requirements,Monthly,Annually,Min_pay,Max_pay,Annual_min,Annual_max,combined,Annual_medium,textblobbed,stemmed,lemmatized,Pay_class
10,BLUECHIP PLATFORMS ASIA PTE. LTD.,Data analyst - global wealth management (contr...,FullTime,SeniorExecutive,InformationTechnology,Salary undisclosed,"Develop and implement databases, data collecti...",Minimum Degree At least 3 years of Data rela...,False,False,0,0,0,0,Data analyst - global wealth management (contr...,24000.0,"Data, analyst, global, wealth, management, con...","data, analyst, global, wealth, management, con...","Data, analyst, global, wealth, management, con...",1
12,BEACON CONSULTING PTE LTD,Temp Admin (Data Analytics),Permanent,Executive,InformationTechnology,Salary undisclosed,Temp Admins are required for a Data Analytics ...,Proficient in Microsoft Excel (data tabulatio...,False,False,0,0,0,0,Temp Admin (Data Analytics)PermanentExecutiveT...,24000.0,"Temp, Admin, Data, Analytics, PermanentExecuti...","temp, admin, data, analytics, permanentexecuti...","Temp, Admin, Data, Analytics, PermanentExecuti...",1
18,BLUECHIP PLATFORMS ASIA PTE. LTD.,"Data analyst, reference data (contract)",FullTime,Professional,InformationTechnology,Salary undisclosed,"Established for over 80 years, the Client is a...",At least 3 years of Data related experience w...,False,False,0,0,0,0,"Data analyst, reference data (contract)FullTim...",24000.0,"Data, analyst, reference, data, contract, Full...","data, analyst, reference, data, contract, full...","Data, analyst, reference, data, contract, Full...",1
38,Health Promotion Board,"Senior Manager / Manager, Data Management",FullTime,Professional,BuildingandConstruction,Salary undisclosed,Job Responsibilities \r\rManage internal and e...,-,False,False,0,0,0,0,"Senior Manager / Manager, Data ManagementFullT...",84000.0,"Senior, Manager, Manager, Data, ManagementFull...","senior, manager, manager, data, managementfull...","Senior, Manager, Manager, Data, ManagementFull...",3
47,GRABTAXI HOLDINGS PTE. LTD.,data scientist,Permanent,SeniorManagement,Advertising/Media,Salary undisclosed,Get to know our Team: Grabs Data Science Depa...,"The must haves: Ph.D. graduate, or Masters(wi...",False,False,0,0,0,0,data scientistPermanentSeniorManagementGet to ...,243000.0,"data, scientistPermanentSeniorManagementGet, t...","data, scientistpermanentseniormanagementget, t...","data, scientistPermanentSeniorManagementGet, t...",4
54,TELEKOMUNIKASI INDONESIA INTERNATIONAL PTE. LTD.,"Senior Sales Manager, Data Center",Permanent,Professional,InformationTechnology,Salary undisclosed,Responsibilities Support the region and pursu...,Requirements Bachelor Degree in any disciplin...,False,False,0,0,0,0,"Senior Sales Manager, Data CenterPermanentProf...",243000.0,"Senior, Sales, Manager, Data, CenterPermanentP...","senior, sales, manager, data, centerpermanentp...","Senior, Sales, Manager, Data, CenterPermanentP...",4
65,HNWI PRIVATE LIMITED,"Data Entry (3 Months Part Time Contract, Flexi...",Contract,Fresh/entrylevel,Sciences/Laboratory/R&D,Salary undisclosed,"About HNWI Private Limited You Deliver, We Gro...",JOB APPLICATION PROCESS If you share our missi...,False,False,0,0,0,0,"Data Entry (3 Months Part Time Contract, Flexi...",24000.0,"Data, Entry, 3, Months, Part, Time, Contract, ...","data, entry, 3, months, part, time, contract, ...","Data, Entry, 3, Months, Part, Time, Contract, ...",1
68,BLUECHIP PLATFORMS ASIA PTE. LTD.,Data Scientist - Financial Services,Permanent,Manager,InformationTechnology,Salary undisclosed,Ambitious growth plans in Asia Pacific Valu...,This candidate should have a degree in Statis...,False,False,0,0,0,0,Data Scientist - Financial ServicesPermanentMa...,243000.0,"Data, Scientist, Financial, ServicesPermanentM...","data, scientist, financial, servicespermanentm...","Data, Scientist, Financial, ServicesPermanentM...",4
73,HNWI PRIVATE LIMITED,"Data Entry (3 Months Part Time Contract, Flexi...",FullTime,Professional,InformationTechnology,Salary undisclosed,"About HNWI Private Limited You Deliver, We Gro...",JOB APPLICATION PROCESS If you share our missi...,False,False,0,0,0,0,"Data Entry (3 Months Part Time Contract, Flexi...",24000.0,"Data, Entry, 3, Months, Part, Time, Contract, ...","data, entry, 3, months, part, time, contract, ...","Data, Entry, 3, Months, Part, Time, Contract, ...",1
84,Auditor-General's Office,AGO Summer Internship (Data Analytics Unit),Permanent,Professional,BankingandFinance,Salary undisclosed,The Auditor-General's Office (AGO) is an indep...,-,False,False,0,0,0,0,AGO Summer Internship (Data Analytics Unit)Per...,84000.0,"AGO, Summer, Internship, Data, Analytics, Unit...","ago, summer, internship, data, analytics, unit...","AGO, Summer, Internship, Data, Analytics, Unit...",3


NOw lets find the top features from my model.

In [81]:
pd.DataFrame( logr2.coef_, columns=tfidfvect.get_feature_names())

Unnamed: 0,000,000 businesses,000 clients,000 colleagues,000 employees,000 individuals,000 patients,000 people,000 professional,000 singaporean,...,zomwork understand,zones,zones immerse,zones placing,zones start,zp,zp dashboard,zuellig,zuellig pharma,zuellig pharmas
0,-0.046956,-0.013668,-0.004738,-0.007786,-0.032804,-0.00834,-0.008073,-0.015399,0.052239,-0.007429,...,-0.009503,0.0347,-0.00426,-0.011307,0.056352,-0.006025,-0.006025,-0.012049,-0.006025,-0.006025
1,-0.042843,0.049644,-0.00708,-0.011778,-0.006539,-0.008074,-0.008528,-0.019452,-0.017846,-0.007229,...,0.033356,-0.035402,-0.007565,-0.013179,-0.019308,-0.008399,-0.008399,-0.016798,-0.008399,-0.008399
2,0.040332,-0.016323,-0.008715,0.0328,0.016678,0.027485,-0.008831,-0.025256,-0.01821,0.020057,...,-0.01135,-0.042989,-0.011573,-0.017096,-0.019897,-0.011304,-0.011304,-0.022607,-0.011304,-0.011304
3,0.043627,-0.018643,0.019676,-0.013637,0.02176,-0.010497,0.02628,0.054652,-0.017777,-0.005474,...,-0.012114,0.039218,0.022164,0.038485,-0.017786,0.02411,0.02411,0.04822,0.02411,0.02411


Top important features which have significate effect on my model

In [82]:
# logr.coef_
docs = pd.DataFrame( abs(logr2.coef_),
                    columns=tfidfvect.get_feature_names()).sum()
    
docs.sort_values(ascending=False).head(30)

business        3.058813
entry           2.837043
data entry      2.716768
research        2.205513
solutions       2.110293
organisation    2.072412
design          2.036304
assist          1.893735
risk            1.811746
global          1.710312
equipment       1.630824
product         1.597170
experience      1.581317
strategy        1.575858
requirements    1.547995
teams           1.529368
security        1.520442
prepare         1.515517
partner         1.509604
perform         1.504023
drive           1.501009
manager         1.492813
analytics       1.492066
technical       1.475487
data science    1.385011
day             1.364485
solution        1.345659
duties          1.343590
lead            1.315667
maintenance     1.291709
dtype: float64

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.


## Answer to Question 2

Create our Target , y. All Data job is = 1. The rest are 0.

In [83]:
df[df.JobTitle.str.contains(pat='Data')]

Unnamed: 0,Company,JobTitle,EmploymentType,Seniority,Category,SalaryRange,RoleResponsibility,Requirements,Monthly,Annually,Min_pay,Max_pay,Annual_min,Annual_max,combined,Annual_medium,textblobbed,stemmed,lemmatized
0,GOOGLE ASIA PACIFIC PTE. LTD.,"Data Center Strategic Negotiator, Site Acquisi...",FullTime,Executive,GeneralManagement,10500to17500,Company overview: Google is not a conventional...,Minimum qualifications: Bachelor's degree in ...,True,False,10500,17500,126000,210000,"Data Center Strategic Negotiator, Site Acquisi...",168000.0,"Data, Center, Strategic, Negotiator, Site, Acq...","data, center, strategic, negotiator, site, acq...","Data, Center, Strategic, Negotiator, Site, Acq..."
1,BLOOMBERG L.P.,Data Contribution Support Representative,FullTime,Executive,GeneralManagement,60000to75000,You are excited by the prospect of operating o...,You'll need to have - Excellent communication ...,False,True,60000,75000,60000,75000,Data Contribution Support RepresentativeFullTi...,67500.0,"Data, Contribution, Support, RepresentativeFul...","data, contribution, support, representativeful...","Data, Contribution, Support, RepresentativeFul..."
2,A*STAR RESEARCH ENTITIES,SICS - Data Manager,FullTime,JuniorExecutive,BankingandFinance,2500to5000,About Singapore Institute for Clinical Science...,Degree in Bioinformatics or relevant field 5-...,True,False,2500,5000,30000,60000,SICS - Data ManagerFullTimeJuniorExecutiveAbou...,45000.0,"SICS, Data, ManagerFullTimeJuniorExecutiveAbou...","sics, data, managerfulltimejuniorexecutiveabou...","SICS, Data, ManagerFullTimeJuniorExecutiveAbou..."
3,CARAT MEDIA SERVICES SINGAPORE PTE LTD,Data Engineer,FullTime,JuniorExecutive,BankingandFinance,3500to6000,Summary: iProspect helps our clients achieve t...,Skills & Experience Required: The role will be...,True,False,3500,6000,42000,72000,Data EngineerFullTimeJuniorExecutiveSummary: i...,57000.0,"Data, EngineerFullTimeJuniorExecutiveSummary, ...","data, engineerfulltimejuniorexecutivesummary, ...","Data, EngineerFullTimeJuniorExecutiveSummary, ..."
4,UBS AG,CMO Maintenance Data Analyst,Contract,Non-executive,Sciences/Laboratory/R&D,35000to58000,Your role : Are you incredibly organized with ...,Yourexperience and skills : Past experience in...,False,True,35000,58000,35000,58000,CMO Maintenance Data AnalystContractNon-execut...,46500.0,"CMO, Maintenance, Data, AnalystContractNon-exe...","cmo, maintenance, data, analystcontractnon-exe...","CMO, Maintenance, Data, AnalystContractNon-exe..."
5,MONEYSMART SINGAPORE PTE. LTD.,Data Analyst,Contract,Non-executive,Sciences/Laboratory/R&D,5000to7000,The mission As part of becoming one of the fou...,"Requirements Degree in Computer Science, Math...",True,False,5000,7000,60000,84000,Data AnalystContractNon-executiveThe mission A...,72000.0,"Data, AnalystContractNon-executiveThe, mission...","data, analystcontractnon-executivethe, mission...","Data, AnalystContractNon-executiveThe, mission..."
6,PRICEWATERHOUSECOOPERS CONSULTING (SINGAPORE) ...,Technology Consulting - Data and Analytics Ass...,Permanent,Executive,Advertising/Media,3500to7000,Consulting We help organisations to work smar...,Requirements Below are the attributes and ski...,True,False,3500,7000,42000,84000,Technology Consulting - Data and Analytics Ass...,63000.0,"Technology, Consulting, Data, and, Analytics, ...","technology, consulting, data, and, analytics, ...","Technology, Consulting, Data, and, Analytics, ..."
7,DENTSU AEGIS NETWORK HUB2050,Data Science Analyst,Permanent,Executive,Advertising/Media,4000to6000,BACKGROUND: About Dentsu Aegis Network Dentsu ...,We invite people passionate about data and art...,True,False,4000,6000,48000,72000,Data Science AnalystPermanentExecutiveBACKGROU...,60000.0,"Data, Science, AnalystPermanentExecutiveBACKGR...","data, science, analystpermanentexecutivebackgr...","Data, Science, AnalystPermanentExecutiveBACKGR..."
8,COMTEL SOLUTIONS PTE LTD,Data Analyst / Business Data Analyst,FullTime,Professional,BankingandFinance,7000to10500,Responsible to perform Data Profiling Respons...,Over all 5 to 8 years of experience Analyst w...,True,False,7000,10500,84000,126000,Data Analyst / Business Data AnalystFullTimePr...,105000.0,"Data, Analyst, Business, Data, AnalystFullTime...","data, analyst, business, data, analystfulltime...","Data, Analyst, Business, Data, AnalystFullTime..."
9,LAZADA SERVICES SOUTH EAST ASIA PTE. LTD.,Senior Data Engineer,FullTime,Professional,BankingandFinance,9000to11000,Introduction to Lazada eLogistics (LeL) Every...,"To succeed in the role, you should ideally hav...",True,False,9000,11000,108000,132000,Senior Data EngineerFullTimeProfessionalIntrod...,120000.0,"Senior, Data, EngineerFullTimeProfessionalIntr...","senior, data, engineerfulltimeprofessionalintr...","Senior, Data, EngineerFullTimeProfessionalIntr..."


In [101]:
df['DS'] = df.JobTitle.str.contains(pat='Data').astype(int)

In [102]:
df[df.JobTitle.str.contains(pat='Data')].shape

(282, 20)

In [103]:
X_ds = df['combined']
y_ds = df['DS']

print y_ds.shape
print X_ds.shape

(1020L,)
(1020L,)


The target y is unbalanced. imblearn RandomUnderSampler technique

In [104]:
#CountVectorizer with Log_regression
vectds = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2),lowercase=True)
X_ds_vec = vectds.fit_transform(X_ds)
# Xds_test_vec = vectds.transform(Xds_test)

In [105]:
Xds_train, Xds_test, yds_train, yds_test = train_test_split(X_ds_vec, y_ds, test_size=0.25, random_state=1)

In [106]:
from imblearn.under_sampling import RandomUnderSampler

us = RandomUnderSampler(ratio=0.5, random_state=1)
X_rus, y_rus = us.fit_sample(Xds_train, yds_train)



In [107]:
from sklearn.linear_model import LogisticRegression
logr4 = LogisticRegression(random_state = 0)
logr4.fit(X_rus, y_rus)
logr4.score(Xds_test, yds_test)
y_pred = logr4.predict(Xds_test)

In [108]:
from sklearn.metrics import classification_report
print(classification_report(yds_test, y_pred))

             precision    recall  f1-score   support

          0       0.90      0.85      0.87       182
          1       0.67      0.77      0.71        73

avg / total       0.83      0.82      0.83       255



Apply ALL Classifier.

In [109]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.ensemble import GradientBoostingClassifier  
from xgboost import XGBClassifier

logr = LogisticRegression(random_state = 0, n_jobs=1)
logr_lasso = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=100, cv=10, n_jobs=1)
logr_ridge = LogisticRegressionCV(penalty='l2', Cs=200, cv=5, n_jobs=1) #l2 is ridge, Cs: How many different (automatically-selected) regularization strengths should be tested.
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p=2, n_jobs=1) # KNN = Neaest Neighbors classifier # choosig Euclidean('yoo-klid-ee-uhn') , the other type is called Manhattan
dtc = DecisionTreeClassifier(max_depth=None)
gbc = GradientBoostingClassifier(n_estimators=10) # warm_start=True meaning retain old training data when fit, but careful might overfit
xgbc = XGBClassifier(n_jobs=1)  #Takes many model,

logr.fit(X_rus, y_rus)
logr_lasso.fit(X_rus, y_rus)
logr_ridge.fit(X_rus, y_rus)
knn.fit(X_rus, y_rus)
dtc.fit(X_rus, y_rus)
gbc.fit(X_rus, y_rus)
xgbc.fit(X_rus, y_rus)

print 'Classification Report (precision/recall/f1-score/support)'
print 'LogisticRegression             : ' , classification_report(yds_test, logr.predict(Xds_test))
print 'LogisticRegressionCV L1 Lasso  : ' , classification_report(yds_test, logr_lasso.predict(Xds_test))
print 'LogisticRegressionCV L2 Ridge  : ' , classification_report(yds_test, logr_ridge.predict(Xds_test))
print 'KNeighborsClassifier           : ' , classification_report(yds_test, knn.predict(Xds_test))
print 'DecisionTreeClassifier         : ' , classification_report(yds_test, dtc.predict(Xds_test))
print 'GradientBoostClassifier        : ' , classification_report(yds_test, gbc.predict(Xds_test))
print 'XGBClassifier                  : ' , classification_report(yds_test, xgbc.predict(Xds_test))


Classification Report (precision/recall/f1-score/support)
LogisticRegression             :               precision    recall  f1-score   support

          0       0.90      0.85      0.87       182
          1       0.67      0.77      0.71        73

avg / total       0.83      0.82      0.83       255

LogisticRegressionCV L1 Lasso  :               precision    recall  f1-score   support

          0       0.89      0.88      0.89       182
          1       0.72      0.74      0.73        73

avg / total       0.84      0.84      0.84       255

LogisticRegressionCV L2 Ridge  :               precision    recall  f1-score   support

          0       0.88      0.89      0.88       182
          1       0.71      0.68      0.70        73

avg / total       0.83      0.83      0.83       255

KNeighborsClassifier           :               precision    recall  f1-score   support

          0       0.96      0.12      0.21       182
          1       0.31      0.99      0.47        73



  if diff:


In [111]:
pd.DataFrame( abs(logr.coef_),
                    columns=vectds.get_feature_names()).sum()
    
docs.sort_values(ascending=False)

business                    3.058813
entry                       2.837043
data entry                  2.716768
research                    2.205513
solutions                   2.110293
organisation                2.072412
design                      2.036304
assist                      1.893735
risk                        1.811746
global                      1.710312
equipment                   1.630824
product                     1.597170
experience                  1.581317
strategy                    1.575858
requirements                1.547995
teams                       1.529368
security                    1.520442
prepare                     1.515517
partner                     1.509604
perform                     1.504023
drive                       1.501009
manager                     1.492813
analytics                   1.492066
technical                   1.475487
data science                1.385011
day                         1.364485
solution                    1.345659
d

Looks like it is a bad idea to look at Data Scientist or not, because it is too unbalanced give a reasonable insights.  Lets look at Junior vs Senior.

# BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.


## Answer Bonus: 

The utimate goal is to predict the annual salary based on the available features. In order to get the target y, I had to prepare for it first, as the scrapped salaries are joined together in a single string  with min, max amount and whether the amount is in monthly or annually basis (example 6000to10000monthly, 120000to180000annually). So I had to split and give each part to its own column, then multiply x12 for those in monthly based, and consolidate to a new Annual_median column. For the above example, we will 6000 + 10000 and divide by 2 for the median. So annual median column is done. 

I had to split the dataset to 2, one with annual median salary that can be our train set, which will be further split as test set. Whereas the other set without salary info will be our prediction set.

Once the target is properly defined, we will focus on train set feature X. Since it is a prediction of numbers, my first attempt is to use regression. However,the X features are all in text. So it is time to use the new skill learn from the course, yes, NLP. 

The first thing we learn is to tokenize the text. What it means is to separate the text into untis such as sentences or words. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm. We call this process feature extraction, also known as vectorization. This is when NLTL package comes in. NLTK feature extraction module comes with 3 very common vectorizer. 

1) CountVectorizer: to convert text to word count vectors .
2) TfidfVectorizer : to convert text to word frequency vectors.
3) HashingVectorizer : to convert text to unique integers.

Each have it's own stop word ability. Stop word is a list of words which its meaning has no effect to the meaning of the context. These words are stored in a list, and the vectorizer is smart enough to remove them.

Next feature that the vectorizer have is ngram. I put 1 to 2 settings. This is telling the vectorizer to put maximum 2 words together then do the counting on it as a whole. Of course I could have put 3 or 4, not only that the score drops, the 3 words combined don't make any sense. For example ngram=4, "research capabilities shared sensor".

Now with stop words, ngram applied to all 3 vectorizor, the regresion score appears to be terrible. So I decided to try stemming.

Stemming unnecessarily chops of "es", "is" which somehow removed the meaning for some of the words, example "analysis" becomes "analys". For this reason, I did not apply it on vectorizer.

Next I tried Lemmatizer. It properly uses vocab and morphological analyses the word, usually produces very good retrieval. After applying, the socre jumps to 0.42. This is a good sign. 

I did run back and forth changing the param and adding new stop words, but without any improvement. I have even tried to apply ALL the regression I know, example lasso, ridge, decisiontree, support vector, random forest, gridient boost, bagging with all regressor in, as well as adaboost. None of them gives any good score.


That is when I realized, it could be due to regression's assumption. For regression to work, the relationship between independent and depentent varible must be linear. I did not check this and jump into regression is a grave mistake. But a good one.

So I tried classification technique. Same thing, we need to prepare y target. I binned the salary into 4 bins, according to the distribution, so that all class is balanced. 

After y target is defined, and apply it to countvectorizer cum lemmatized words, stop word, ngram 1,2, into logistic regression, decisiontress, and multinomial navie bayes. The score highest achieved is only 0.48 from tfid logistic regression. 

With this score, although not good enough, I used it to predict the Annual_medium anyway. You can see the df_predict.Annual_medium at the end (initially there is no value).

Here I shopped question 1.



As for Question 2, we are supposed to find the Factors that distinguish job category. I changed it to factors that distinguish a job title "Data Scientist". However, I hit into a serious problem, yes, y_target becomes extremely unbalanced (i.e only 37 out of 1000. So I changed to just looking "data". That increased to about 282 roles. Still unbalanced, but I did not split any further and see how it goes. Just iike classification I did before in qestion 1, I applied into countvectorized with stop word, ngram1,2 and lemmatized it, same goes to tfid and hash, and throw them all into ALL the classification I know. Suprisingly KNN appears to have the least FP and FN. And looking into the coefficient, the words or aka factors that distinuguish "data" job title are "