# Analysis of Job Postings (Data Analytics)

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. 


---


### QUESTION 1: Factors that impact salary

To predict salary you can frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).


### QUESTION 2: Factors that distinguish job category

There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry?

### Overview:

Part 1. Scrape and prepare your own data.

Part 2. Data Cleaning and Exploratory data analysis (EDA)

Part 3. Modelling and evaluation

Part 4. Executive summary

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
# load csv file
csv = './cleaned_df.csv'
df = pd.read_csv(csv)

# Part 3. Modelling and evaluation
### Question 1 - Factors that impact salary. Set as a classification problem


1a) First, I looked at the columns 'company', 'job_title', 'location', 'employment_type', 'seniority' and 'job_categories' individually in relation to salary.

1b) Then I combined all of the features above and analysed them together (for the boss, who is interested in the overall picture).

1c) Finally, I looked at the columns 'job_description' and 'requirements' (for HR, who are interested in specific skills and keywords).

Models used: BernoulliNB, Logistic regression, Decision tree Classifier

In [3]:
df['high_pay'] = df.salary_avg
df.salary_avg.median()

5750.0

In [4]:
# change to binary classification
df.loc[df.salary_avg <= 5750, 'high_pay'] = 0

In [5]:
df.loc[df.salary_avg> 5750, 'high_pay'] = 1

In [6]:
df.high_pay.head(10)

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    1.0
7    0.0
8    1.0
9    1.0
Name: high_pay, dtype: float64
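
Cells 3 to 5 can also be written as a single vectorised comparison. The sketch below uses a small made-up frame (`toy`) rather than `cleaned_df.csv`, and it assumes that postings exactly at the median should count as low pay, as in the cells above.

```python
import pandas as pd

# Toy stand-in for df; the salary values are invented for illustration.
toy = pd.DataFrame({"salary_avg": [3000.0, 5750.0, 8000.0, 6750.0, 4200.0]})

# Compute the threshold from the data instead of hard-coding 5750.
threshold = toy["salary_avg"].median()

# 1 = above the median (high pay), 0 = at or below it (low pay).
toy["high_pay"] = (toy["salary_avg"] > threshold).astype(int)

print(threshold)
print(toy["high_pay"].tolist())
```

Deriving the threshold from the column keeps the labels correct if the data are re-scraped and the median moves.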

In [7]:
df.shape

(2637, 13)

In [8]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,company,job_title,location,employment_type,seniority,job_categories,job_description,requirements,salary_low,salary_high,salary_avg,high_pay
0,0,national university of singapore,"senior / associate director, data governance /...",lower kent ridge road,"permanent, full time",senior management,"education and training, information technology",\r\r\nthis leadership role will interact and e...,"\r\r\ndegree in information technology, comput...",7000.0,9000.0,8000.0,1.0
1,1,ntuc enterprise nexus co-operative limited,data scientist,marina boulevard,full time,executive,information technology,\r\r\nntuc enterprise is in the midst of its d...,"\r\r\n• masters in statistics, mathematics, co...",3500.0,10000.0,6750.0,1.0


# 1a) Features of 'company', 'job_title', 'location', 'employment_type', 'seniority', 'job_categories' that affect salary

In [9]:
import nltk
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Analyse company

In [10]:
group=df.groupby('company')[['salary_avg']].mean()
company_pay=group.sort_values(by='salary_avg',ascending = False)

# companies with the highest average salary
print(company_pay.head(5))

# companies with the lowest average salary
print(company_pay.tail(5))

                                                 salary_avg
company                                                    
cardinal health singapore 225 pte. ltd.        21500.000000
mastercard asia/pacific pte. ltd.              21125.000000
smith & nephew pte. limited                    21000.000000
airbnb singapore private limited               20000.000000
amazon web services singapore private limited  19261.904762
                                   salary_avg
company                                      
smartkarma innovations pte. ltd.       1000.0
ardex singapore pte. ltd.               900.0
fyreflyz pte. ltd.                      700.0
rht management services pte. ltd.       700.0
four star industries pte ltd            650.0


# Analyse job_title

In [11]:
cv = CountVectorizer(ngram_range=(1,2), max_features=100, binary=True, stop_words='english')
title_words = cv.fit_transform(df.job_title)
title_words = pd.DataFrame(title_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(title_words.values, df.high_pay.values, test_size=0.25)

In [12]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))
print(np.mean(y_train))

[0.75757576 0.71212121 0.75252525 0.77777778 0.69191919 0.72222222
 0.72222222 0.73232323 0.75126904 0.80612245]
0.7426078353199455
0.4795144157814871
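
The last number printed above is the share of high-pay postings in the training split, not the baseline accuracy itself. A classifier that always predicts the majority class scores `max(p, 1 - p)`. A minimal sketch of that calculation, with `majority_baseline` as a hypothetical helper name:

```python
import numpy as np

def majority_baseline(y):
    """Accuracy of always predicting the most common class in a 0/1 label array."""
    p = float(np.mean(y))  # share of positive (high-pay) labels
    return max(p, 1.0 - p)

# Using the positive share printed above (about 0.48 high pay in y_train).
p_train = 0.4795144157814871
baseline = max(p_train, 1 - p_train)
print(round(baseline, 4))

# Same calculation on a small invented label vector.
print(majority_baseline(np.array([1, 0, 0, 1, 0])))
```

The cross-validated accuracy of about 0.74 should therefore be compared with a baseline of about 0.52.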


In [13]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':title_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))

# model on the test set
print(nb.score(X_test, y_test))
print(np.mean(y_test))

      high_p     low_p     feature  high_diff
55  0.209474  0.094083     manager   0.115390
80  0.204211  0.107662      senior   0.096548
31  0.161053  0.064985        data   0.096067
26  0.088421  0.031038  consultant   0.057383
21  0.076842  0.032008    business   0.044834
53  0.054737  0.011639        lead   0.043098
37  0.067368  0.025218   developer   0.042150
86  0.049474  0.010669   singapore   0.038804
78  0.048421  0.009699   scientist   0.038722
40  0.038947  0.001940    director   0.037008
      high_p     low_p            feature  high_diff
15  0.005263  0.029098  assistant manager  -0.023835
84  0.004211  0.029098            service  -0.024887
96  0.001053  0.028128         technician  -0.027075
27  0.012632  0.039767           contract  -0.027136
72  0.031579  0.066925           research  -0.035346
1   0.001053  0.046557           accounts  -0.045504
61  0.003158  0.049467            officer  -0.046309
4   0.002105  0.056256              admin  -0.054151
43  0.009474  0.1

#### Job title

- Best paying features: manager, senior, consultant, business, developer, scientist, director

- Worst paying features: executive, assistant, accounts, admin, officer, technician
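
`feature_log_prob_` holds the log of the smoothed probability that a word appears in a posting of each class, so `high_p` and `low_p` above are per-class appearance rates. A sketch of the same estimate in plain NumPy, assuming scikit-learn's default `alpha=1.0` Laplace smoothing and an invented 0/1 word matrix:

```python
import numpy as np

# Invented binary word-presence matrix: 6 postings x 2 words.
X = np.array([[1, 0],
              [1, 1],
              [1, 0],
              [0, 1],
              [0, 1],
              [0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = high pay, 0 = low pay

def smoothed_presence(X, y, label, alpha=1.0):
    """P(word present | class) with Laplace smoothing over the two outcomes."""
    rows = X[y == label]
    return (rows.sum(axis=0) + alpha) / (len(rows) + 2 * alpha)

high_p = smoothed_presence(X, y, 1)
low_p = smoothed_presence(X, y, 0)
print(high_p - low_p)  # positive: word is more common in high-pay postings
```

Ranking by the raw difference favours frequent words; a rare word that appears almost only in one class gets a small `high_diff` even though it is highly discriminating.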

# Analyse seniority

In [14]:
cv = CountVectorizer(ngram_range=(1,2), max_features=20, binary=True, stop_words='english')
seniority_words = cv.fit_transform(df.seniority)
seniority_words = pd.DataFrame(seniority_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(seniority_words.values, df.high_pay.values, test_size=0.25)

In [15]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))

[0.72864322 0.7638191  0.73737374 0.78282828 0.68181818 0.70558376
 0.7106599  0.74111675 0.70050761 0.69035533]
0.7242705863831799


In [16]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':seniority_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))

# model on the test set
print(nb.score(X_test, y_test))

      high_p     low_p            feature  high_diff
16  0.333685  0.144101       professional   0.189585
17  0.261880  0.122824             senior   0.139056
10  0.166843  0.029981         management   0.136862
11  0.194298  0.074468            manager   0.119830
19  0.090813  0.009671  senior management   0.081142
18  0.173178  0.114120   senior executive   0.059059
12  0.080253  0.024178             middle   0.056075
13  0.080253  0.024178  middle management   0.056075
4   0.023231  0.031915   executive senior  -0.008684
3   0.008448  0.039652   executive junior  -0.031204
      high_p     low_p           feature  high_diff
5   0.007392  0.135397             fresh  -0.128005
6   0.007392  0.135397       fresh entry  -0.128005
9   0.007392  0.135397             level  -0.128005
1   0.007392  0.135397       entry level  -0.128005
0   0.007392  0.135397             entry  -0.128005
14  0.047518  0.177950               non  -0.130431
15  0.047518  0.177950     non executive  -0.130431
7

#### Seniority

- Best paying features: professional, management, senior management, middle management

- Worst paying features: executive, junior executive, non executive, entry level

# Analyse job_categories

In [17]:
cv = CountVectorizer(ngram_range=(1,2), max_features=30, binary=True, stop_words='english')
job_categories_words = cv.fit_transform(df.job_categories)
job_categories_words = pd.DataFrame(job_categories_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(job_categories_words.values, df.high_pay.values, test_size=0.25)

In [18]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.65326633 0.68686869 0.62626263 0.65656566 0.71212121 0.63131313
 0.63451777 0.63959391 0.68527919 0.71573604]
0.6641524548342905


In [19]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':job_categories_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))

# model on the test set
print(nb.score(X_test, y_test))


      high_p     low_p                 feature  high_diff
29  0.489097  0.241650              technology   0.247446
14  0.489097  0.241650  information technology   0.247446
13  0.489097  0.241650             information   0.247446
6   0.109034  0.052063                 banking   0.056971
7   0.109034  0.052063         banking finance   0.056971
10  0.109034  0.052063                 finance   0.056971
8   0.066459  0.025540              consulting   0.040919
22  0.043614  0.036346               relations   0.007268
21  0.043614  0.036346        public relations   0.007268
19  0.043614  0.036346        marketing public   0.007268
      high_p     low_p              feature  high_diff
17  0.020768  0.055992     logistics supply  -0.035224
27  0.020768  0.055992         supply chain  -0.035224
1   0.023884  0.122790  accounting auditing  -0.098906
5   0.023884  0.122790    auditing taxation  -0.098906
4   0.023884  0.122790             auditing  -0.098906
28  0.023884  0.122790          

#### Job categories

- Best paying features: information technology, consulting, banking finance

- Worst paying features: admin secretarial, accounting, auditing taxation
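
Because each category phrase is split into unigrams and bigrams, `information`, `technology` and `information technology` carry identical probabilities and crowd the top of the table. One alternative, assuming categories are comma-separated as in the `df.head(2)` output, is to one-hot encode whole category names with pandas' `str.get_dummies`. The first two sample values below are taken from that output; the third is invented.

```python
import pandas as pd

# First two values come from the df.head(2) output; the third is invented.
cats = pd.Series([
    "education and training, information technology",
    "information technology",
    "banking and finance",
])

# One 0/1 column per whole category name, split on the ", " separator.
cat_dummies = cats.str.get_dummies(sep=", ")
print(cat_dummies.columns.tolist())
print(cat_dummies.values.tolist())
```

Each category then appears once in the ranking instead of three times.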

# Analyse location

In [20]:
cv = CountVectorizer(ngram_range=(1,2), max_features=200, binary=True, stop_words=['boulevard','road','quay','way'])
location_words = cv.fit_transform(df.location)

In [21]:
location_words = pd.DataFrame(location_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(location_words.values, df.high_pay.values, test_size=0.25)

In [22]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.62311558 0.64646465 0.62121212 0.67171717 0.5959596  0.58080808
 0.6751269  0.60913706 0.62944162 0.60913706]
0.6262119833644971


In [23]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':location_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))
# model on the test set
print(nb.score(X_test, y_test))


       high_p     low_p          feature  high_diff
106  0.121901  0.030602           marina   0.091299
155  0.065083  0.029615          shenton   0.035468
144  0.050620  0.025666         robinson   0.024953
33   0.035124  0.011846  changi business   0.023278
23   0.047521  0.024679    business park   0.022841
22   0.047521  0.024679         business   0.022841
12   0.032025  0.011846          battery   0.020179
32   0.042355  0.022705           changi   0.019651
124  0.067149  0.048371             park   0.018778
139  0.076446  0.058243          raffles   0.018203
       high_p     low_p         feature  high_diff
143  0.028926  0.050346           ridge  -0.021420
87   0.028926  0.050346            kent  -0.021420
88   0.028926  0.050346      kent ridge  -0.021420
116  0.006198  0.030602  nanyang avenue  -0.024404
115  0.006198  0.030602         nanyang  -0.024404
117  0.032025  0.057256           north  -0.025231
99   0.006198  0.034551           lebar  -0.028352
132  0.006198  0.034

#### Location

- Best paying features: central business district (marina, robinson, shenton, raffles) and business park

- Worst paying features: universities (kent ridge, nanyang avenue) and paya lebar

# Analyse employment_type

In [24]:
cv = CountVectorizer(ngram_range=(1,2), max_features=40, binary=True, stop_words='english')
employment_type_words = cv.fit_transform(df.employment_type)
employment_type_words = pd.DataFrame(employment_type_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(employment_type_words.values, df.high_pay.values, test_size=0.25)

In [25]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.53030303 0.58585859 0.5        0.54040404 0.53030303 0.48989899
 0.58080808 0.55837563 0.46192893 0.53299492]
0.5310875249961544


In [26]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':employment_type_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))
# model on the test set
print(nb.score(X_test, y_test))


      high_p     low_p               feature  high_diff
9   0.451143  0.396467             permanent   0.054676
12  0.186071  0.173700        permanent time   0.012371
10  0.018711  0.015702    permanent contract   0.003009
15  0.001040  0.000981  temporary internship   0.000058
22  0.001040  0.000981        time temporary   0.000058
25  0.001040  0.001963       work internship  -0.000923
20  0.001040  0.001963       time internship  -0.000923
18  0.001040  0.001963         time contract  -0.000923
11  0.001040  0.001963   permanent temporary  -0.000923
2   0.001040  0.001963   contract internship  -0.000923
      high_p     low_p             feature  high_diff
5   0.001040  0.005888          flexi work  -0.004849
4   0.001040  0.005888               flexi  -0.004849
24  0.001040  0.005888                work  -0.004849
21  0.001040  0.006869      time permanent  -0.005830
8   0.001040  0.012758          internship  -0.011718
14  0.006237  0.018646  temporary contract  -0.012409
0   0.

#### Employment type

- Best paying features: permanent

- Worst paying features: temporary, contract, internship, flexi

# 1b) Putting the features together (except job description and requirements)

In [27]:
features = pd.concat([title_words, seniority_words,job_categories_words,location_words,employment_type_words], axis=1, sort=False)
print(features.shape)
print(features.columns)

(2637, 376)
Index(['account', 'accounts', 'accounts assistant', 'accounts executive',
       'admin', 'admin assistant', 'administrative', 'administrator',
       'advisory', 'analyst',
       ...
       'temporary time', 'time', 'time contract', 'time flexi',
       'time internship', 'time permanent', 'time temporary', 'time time',
       'work', 'work internship'],
      dtype='object', length=376)


In [28]:
companies = df['company']
# convert company names to dummy-coded variables
companies = pd.get_dummies(companies,drop_first=True) 
print(companies.shape)

(2637, 1060)


In [29]:
features_words = pd.concat([features,companies], axis=1, sort=False)
print(features_words.shape)

(2637, 1436)
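
The combined frame contains repeated column names: `manager` and `senior` come from both `job_title` and `seniority`, which is why they appear twice in the ranked outputs further down. A sketch of one way to keep them apart, using pandas' `add_prefix` on small invented frames (the prefixes `title_` and `seniority_` are arbitrary choices):

```python
import pandas as pd

# Invented stand-ins for title_words and seniority_words.
title_words = pd.DataFrame({"manager": [1, 0], "senior": [0, 1]})
seniority_words = pd.DataFrame({"manager": [1, 1], "senior": [0, 0]})

# Tag each block with its source column before concatenating.
features = pd.concat(
    [title_words.add_prefix("title_"), seniority_words.add_prefix("seniority_")],
    axis=1,
)
print(features.columns.tolist())
```

With unique names, a coefficient or importance can be traced back to the column it came from.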


In [30]:
X_train, X_test, y_train, y_test = train_test_split(features_words.values, df.high_pay.values, test_size=0.25)

# BernoulliNB

In [31]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.8040201  0.73366834 0.71859296 0.7979798  0.78680203 0.80203046
 0.81218274 0.75126904 0.74619289 0.8071066 ]
0.7759844961360152


In [32]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':features_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))
# model on the test set
print(nb.score(X_test, y_test))


       high_p     low_p                 feature  high_diff
134  0.484472  0.234483  information technology   0.249989
149  0.484472  0.234483              technology   0.249989
133  0.484472  0.234483             information   0.249989
116  0.322981  0.140887            professional   0.182095
117  0.260870  0.118227                  senior   0.142643
111  0.211180  0.079803                 manager   0.131377
110  0.154244  0.025616              management   0.128629
55   0.221532  0.095567                 manager   0.125966
80   0.199793  0.091626                  senior   0.108167
256  0.126294  0.035468                  marina   0.090826
       high_p     low_p            feature  high_diff
122  0.003106  0.121182              admin  -0.118077
123  0.003106  0.121182  admin secretarial  -0.118077
146  0.003106  0.121182        secretarial  -0.118077
115  0.045549  0.170443      non executive  -0.124895
114  0.045549  0.170443                non  -0.124895
14   0.013458  0.139901    

# Logistic regression

In [33]:
y = df.high_pay.values
X = features_words.values

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
Xs = ss.fit_transform(X)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.25)
# fit on the training split only
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predictions and pred prob.
yhat = lr.predict(X_test)
yhat_pp = lr.predict_proba(X_test)

# confusion matrix metrics
conmat = np.array(confusion_matrix(y_test, yhat, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['is_high', 'is_low'],
                         columns=['predicted_high','predicted_low'])
print(confusion)

         predicted_high  predicted_low
is_high             275             50
is_low               72            263
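
In the cell above the scaler is fitted on all rows before the split, so test-set statistics leak into training, and the split changes on every run. A hedged sketch of a leak-free variant using a scikit-learn `Pipeline` with a fixed `random_state`; the data are randomly generated stand-ins, not the job postings, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary features; the label follows the first column.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = X[:, 0].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The scaler learns its mean and variance from the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The same pipeline object can be passed to `cross_val_score`, which then refits the scaler inside every fold.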


In [34]:
from sklearn.metrics import classification_report
print (classification_report(y_test, yhat))
# Support is simply the number of observations of the labeled class.
# The marginal sum of rows in the confusion matrix or, in other words, the total number of observations belonging to a class, regardless of prediction.

# recall
from sklearn.metrics import recall_score
print ('recall',recall_score(y_test, yhat))

# accuracy
from sklearn.metrics import accuracy_score
print ('accuracy',accuracy_score(y_test, yhat))

# precision
from sklearn.metrics import precision_score
print ('precision',precision_score(y_test, yhat))

              precision    recall  f1-score   support

         0.0       0.84      0.79      0.81       335
         1.0       0.79      0.85      0.82       325

   micro avg       0.82      0.82      0.82       660
   macro avg       0.82      0.82      0.82       660
weighted avg       0.82      0.82      0.82       660

recall 0.8461538461538461
accuracy 0.8151515151515152
precision 0.792507204610951
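
Each of these scores can be recovered from the confusion matrix above, which is a useful check on which class is treated as positive (here `1.0`, high pay).

```python
# Counts read off the confusion matrix printed above.
tp, fn = 275, 50   # actual high pay: predicted high / predicted low
fp, tn = 72, 263   # actual low pay:  predicted high / predicted low

recall = tp / (tp + fn)                      # share of high-pay jobs found
precision = tp / (tp + fp)                   # share of "high" predictions that are right
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that are right

print(recall, precision, accuracy)
```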


In [35]:
coeffs = pd.DataFrame(lr.coef_, columns=features_words.columns)
coeffs_t = coeffs.transpose()
coeffs_t.columns = ['logreg_coefs']
coeffs_topbot = coeffs_t.sort_values('logreg_coefs', ascending=False)
print(coeffs_topbot.head(10))
print(coeffs_topbot.tail(10))

                               logreg_coefs
director                           1.868261
manager                            1.215284
manager                            1.058370
head                               1.037224
senior associate                   1.032047
management                         0.950654
google asia pacific pte. ltd.      0.921734
scientist                          0.825769
unilever asia private limited      0.825399
analytics                          0.822351
                                  logreg_coefs
junior executive                     -0.599292
engineering                          -0.664236
temporary time                       -0.707778
zalora south east asia pte. ltd.     -0.731701
battery                              -0.740962
specialist                           -0.772620
marketing                            -0.827126
assistant manager                    -0.894786
associate                            -0.911423
internship                           -1.230415
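
Because the features were standardised, each coefficient is the change in the log-odds of high pay per one standard deviation increase in that feature, with the other features held fixed. Exponentiating gives an odds ratio. A small sketch using two coefficients from the output above:

```python
import math

# Coefficients copied from the table above.
coefs = {"director": 1.868261, "internship": -1.230415}

# exp(coef) = multiplicative change in the odds of high pay per +1 std dev.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio above 1 raises the odds of high pay and a ratio below 1 lowers them.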

# Decision tree Classifier

In [36]:
# try decision tree
from sklearn.tree import DecisionTreeClassifier

#  Set up and run the gridsearch on the data
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,1,2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

from sklearn.model_selection import GridSearchCV
# set the gridsearch
dtc_gs = GridSearchCV(DecisionTreeClassifier(), 
                      dtc_params, 
                      cv=5, 
                      verbose=1, 
                      scoring='roc_auc', 
                      n_jobs=-1)

dtc_gs.fit(features_words.values, df.high_pay.values)

dtc_best = dtc_gs.best_estimator_
print('best_params:' ,dtc_gs.best_params_)
print('best_score:', dtc_gs.best_score_)


Fitting 5 folds for each of 330 candidates, totalling 1650 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   14.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   31.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   58.9s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 1650 out of 1650 | elapsed:  3.2min finished


best_params: {'max_depth': None, 'max_features': None, 'min_samples_split': 50}
best_score: 0.8065761302059312
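
The figure of 330 candidates is the product of the grid sizes (5 × 6 × 11), and each candidate is fitted once per fold. A quick check with `itertools.product`, reusing the parameter grid from the cell above:

```python
from itertools import product

dtc_params = {
    "max_depth": [None, 1, 2, 3, 4],
    "max_features": [None, 1, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*dtc_params.values()))
n_fits = len(candidates) * 5  # cv=5 folds per candidate

print(len(candidates), n_fits)
```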


In [37]:
fi = pd.DataFrame({
        'feature':features_words.columns,
        'importance':dtc_best.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
print(fi.head(10))
print(fi.tail(10))

                           feature  importance
102                      executive    0.168870
100                          entry    0.084616
118               senior executive    0.067741
133                    information    0.045160
784  google asia pacific pte. ltd.    0.041521
14                       assistant    0.026620
43                       executive    0.025121
44                          fellow    0.014566
78                       scientist    0.013384
80                          senior    0.010249
                                              feature  importance
549          business edge personnel services pte ltd         0.0
539                       bradbury consulting pte ltd         0.0
548                                    busads pte ltd         0.0
547   bureau van dijk electronic publishing pte. ltd.         0.0
545                        bruker singapore pte. ltd.         0.0
543                           british high commission         0.0
542               bri

#### Combined features

- Best paying features: google asia pacific

- Worst paying features: temporary, contract, internship, flexi

# 1c) Look at columns 'job_description' and 'requirements'

# job_description

In [39]:
cv = CountVectorizer(ngram_range=(1,2), max_features=1500, binary=True, stop_words='english')
jd_words = cv.fit_transform(df.job_description)
jd_words = pd.DataFrame(jd_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(jd_words.values, df.high_pay.values, test_size=0.25)

In [40]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.70707071 0.75252525 0.67171717 0.69191919 0.67171717 0.71212121
 0.72727273 0.73096447 0.74619289 0.75634518]
0.71678459724145


In [41]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':jd_words.columns.values})
#  Create a column that is the difference between high probability of appearance and low
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Look at the most likely words for high and low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)  # most low, use ascending=True
print(feat_probs.head(10))
print(feat_probs.tail(10))
# model on the test set
print(nb.score(X_test, y_test))

        high_p     low_p      feature  high_diff
171   0.589286  0.276968     business   0.312318
1368  0.359244  0.130224        teams   0.229020
1274  0.365546  0.137026    solutions   0.228520
1364  0.623950  0.410107         team   0.213843
398   0.403361  0.201166       design   0.202195
1377  0.307773  0.112731   technology   0.195042
1193  0.350840  0.172983         role   0.177857
168   0.256303  0.093294        build   0.163008
410   0.401261  0.242954      develop   0.158306
418   0.414916  0.257532  development   0.157384
        high_p     low_p     feature  high_diff
569   0.001050  0.103013      filing  -0.101962
36    0.043067  0.147716      ad hoc  -0.104649
35    0.047269  0.152575          ad  -0.105306
654   0.043067  0.148688         hoc  -0.105621
1152  0.169118  0.280855     reports  -0.111738
110   0.102941  0.215743    assigned  -0.112802
112   0.123950  0.241011      assist  -0.117061
356   0.007353  0.136054  data entry  -0.128701
511   0.013655  0.151603     

# Logistic regression

In [42]:
y = df.high_pay.values
X = jd_words.values


ss = StandardScaler()
Xs = ss.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.25)
# fit on the training split only
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predictions and pred prob.
yhat = lr.predict(X_test)
yhat_pp = lr.predict_proba(X_test)

# confusion matrix metrics
conmat = np.array(confusion_matrix(y_test, yhat, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['is_high', 'is_low'],
                         columns=['predicted_high','predicted_low'])
print(confusion)

         predicted_high  predicted_low
is_high             250             84
is_low               76            250


In [43]:
from sklearn.metrics import classification_report
print (classification_report(y_test, yhat))
# Support is simply the number of observations of the labeled class.
# The marginal sum of rows in the confusion matrix or, in other words, the total number of observations belonging to a class, regardless of prediction.

# recall
from sklearn.metrics import recall_score
print ('recall',recall_score(y_test, yhat))

# accuracy
from sklearn.metrics import accuracy_score
print ('accuracy',accuracy_score(y_test, yhat))

# precision
from sklearn.metrics import precision_score
print ('precision',precision_score(y_test, yhat))

              precision    recall  f1-score   support

         0.0       0.75      0.77      0.76       326
         1.0       0.77      0.75      0.76       334

   micro avg       0.76      0.76      0.76       660
   macro avg       0.76      0.76      0.76       660
weighted avg       0.76      0.76      0.76       660

recall 0.7485029940119761
accuracy 0.7575757575757576
precision 0.7668711656441718


In [44]:
coeffs = pd.DataFrame(lr.coef_, columns=jd_words.columns)
coeffs_t = coeffs.transpose()
coeffs_t.columns = ['logreg_coefs']
coeffs_topbot = coeffs_t.sort_values('logreg_coefs', ascending=False)
print(coeffs_topbot.head(10))
print(coeffs_topbot.tail(10))

               logreg_coefs
senior             1.003580
lead               0.703744
understanding      0.684230
like               0.663862
creating           0.652380
variety            0.630907
apac               0.630291
skill              0.629081
lifecycle          0.625729
example            0.625506
              logreg_coefs
need             -0.502146
stock            -0.502485
terms            -0.508739
analytical       -0.514360
conversion       -0.516166
house            -0.525220
drives           -0.553535
applications     -0.577807
duties           -0.635769
computer         -0.701632


# Decision tree Classifier

In [45]:
# try decision tree
from sklearn.tree import DecisionTreeClassifier

#  Set up and run the gridsearch on the data
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,1,2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

from sklearn.model_selection import GridSearchCV
# set the gridsearch
dtc_gs = GridSearchCV(DecisionTreeClassifier(), 
                      dtc_params, 
                      cv=5, 
                      verbose=1, 
                      scoring='roc_auc', 
                      n_jobs=-1)

# fit the grid search on the data
dtc_gs.fit(jd_words.values, df.high_pay.values)

dtc_best = dtc_gs.best_estimator_
print('best_params:' ,dtc_gs.best_params_)
print('best_score:', dtc_gs.best_score_)

# find "feature importances"

# Each importance lies between 0 and 1, and the importances of all features sum to 1.
# A feature's importance is the total reduction in the split criterion (Gini impurity by default)
# achieved by the splits that use it, weighted by the number of samples reaching those splits.
# Features the tree never splits on get an importance of 0.


Fitting 5 folds for each of 330 candidates, totalling 1650 fits
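
The candidate count in the log line above follows directly from the parameter grid: 5 x 6 x 11 combinations, each fitted once per fold. A small check using only the standard library (the names `candidates`, `n_candidates` and `n_fits` are illustrative, not part of the notebook):

```python
from itertools import product

# Same grid as in the cell above
dtc_params = {
    'max_depth': [None, 1, 2, 3, 4],
    'max_features': [None, 1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50],
}
n_folds = 5

# Every combination of one value per parameter is one candidate model
candidates = list(product(*dtc_params.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With a vocabulary of well over a thousand columns, `max_features` values of 1 to 5 let each split look at only a handful of randomly chosen words, which may be why the search tends to prefer `None` here.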


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   32.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   56.6s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 1650 out of 1650 | elapsed:  2.8min finished


best_params: {'max_depth': 4, 'max_features': None, 'min_samples_split': 2}
best_score: 0.7194037736747323


In [46]:
fi = pd.DataFrame({
        'feature':jd_words.columns,
        'importance':dtc_best.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
print(fi.head(10))
print(fi.tail(10))


            feature  importance
171        business    0.425636
451          duties    0.147678
1274      solutions    0.066504
1377     technology    0.059714
356      data entry    0.050953
1133         region    0.046191
398          design    0.038246
732   international    0.035566
1405          train    0.029286
1429          units    0.026101
                feature  importance
505       ensure timely         0.0
504         ensure data         0.0
503   ensure compliance         0.0
502              ensure         0.0
501           enquiries         0.0
500               enjoy         0.0
499        enhancements         0.0
498         enhancement         0.0
497             enhance         0.0
1499   years experience         0.0
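
To make the importance numbers above more concrete, here is a minimal sketch of the impurity decrease for a single split, assuming the default Gini criterion. The class counts are made up for illustration; in a fitted tree these decreases are weighted by node size, summed per feature and normalised so that all importances add up to 1.

```python
def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative samples."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2

# Hypothetical node: 5 high-pay and 5 low-pay postings,
# split on whether a word such as "business" is present
parent = (5, 5)
left = (4, 0)    # word present
right = (1, 5)   # word absent

n_parent = sum(parent)
weighted_children = (sum(left) / n_parent) * gini(*left) \
    + (sum(right) / n_parent) * gini(*right)

# The larger this decrease, the more the split (and its feature) is credited
decrease = gini(*parent) - weighted_children
print(round(decrease, 4))
```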


# look at requirements

In [47]:
cv = CountVectorizer(ngram_range=(1,2), max_features=500, binary=True, stop_words='english')
req_words = cv.fit_transform(df.requirements)

req_words = pd.DataFrame(req_words.todense(), columns=cv.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(req_words.values, df.high_pay.values, test_size=0.25)

In [48]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.69191919 0.75252525 0.73737374 0.67676768 0.77272727 0.70707071
 0.70707071 0.76142132 0.74111675 0.76142132]
0.7309413936317489


In [49]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':req_words.columns.values})
# Difference between the probability that a feature appears in a high-salary posting and in a low-salary posting
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Sort so that head() shows the words most associated with high salary and tail() those most associated with low salary
feat_probs.sort_values('high_diff', ascending=False, inplace=True)
print(feat_probs.head(10))
print(feat_probs.tail(10))
# accuracy on the held-out test set
print(nb.score(X_test, y_test))

       high_p     low_p           feature  high_diff
2    0.510395  0.299313           ability   0.211082
467  0.325364  0.118744     understanding   0.206620
111  0.312890  0.114818       development   0.198071
435  0.511435  0.329735            strong   0.181699
496  0.665281  0.486752             years   0.178529
147  0.904366  0.736016        experience   0.168350
497  0.319127  0.156035  years experience   0.163091
248  0.433472  0.271835        management   0.161637
89   0.424116  0.268891              data   0.155225
46   0.367983  0.230618          business   0.137365
       high_p     low_p     feature  high_diff
5    0.082121  0.155054   able work  -0.072933
48   0.085239  0.159961  candidates  -0.074722
7    0.025988  0.105986  accounting  -0.079999
176  0.340956  0.424926        good  -0.083970
344  0.090437  0.176644  proficient  -0.086207
274  0.073805  0.160942          ms  -0.087138
139  0.115385  0.214917       excel  -0.099532
264  0.076923  0.178606   microsoft  -0.1
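
The `high_diff` column ranks words by the absolute gap between the two probabilities, which favours common words. An alternative view, sketched below, ranks by the log of the ratio instead, so rare words that are strongly tied to one class stand out. The probabilities are copied from the table above; the choice of five words and the names `rows`, `log_ratio` and `ranked` are only for illustration.

```python
import math

# (high_p, low_p) pairs copied from the output above
rows = {
    'ability': (0.510395, 0.299313),
    'understanding': (0.325364, 0.118744),
    'experience': (0.904366, 0.736016),
    'accounting': (0.025988, 0.105986),
    'excel': (0.115385, 0.214917),
}

# Positive values lean towards high salary, negative values towards low salary
log_ratio = {word: math.log(high / low) for word, (high, low) in rows.items()}

# Strongest association first, regardless of direction
ranked = sorted(log_ratio, key=lambda word: abs(log_ratio[word]), reverse=True)
for word in ranked:
    print(word, round(log_ratio[word], 3))
```

On this view "accounting" is a stronger signal than its small `high_diff` suggests, while "experience" appears in most postings of both classes and is comparatively weak.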

# logistic regression

In [50]:
y = df.high_pay.values
X = req_words.values

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
Xs = ss.fit_transform(X)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.25)
# fit on the training split only
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predictions and pred prob.
yhat = lr.predict(X_test)
yhat_pp = lr.predict_proba(X_test)

# confusion matrix metrics
conmat = np.array(confusion_matrix(y_test, yhat, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['is_high', 'is_low'],
                         columns=['predicted_high','predicted_low'])
print(confusion)

         predicted_high  predicted_low
is_high             241            103
is_low               96            220
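
In the cell above the scaler is fitted on all rows before the train/test split, so the test rows influence the scaling. A possible alternative, sketched here on synthetic data, is to put the scaler and the model in one pipeline that is fitted on the training split only. The data, the variable names and the `random_state` values are assumptions for the example, not the notebook's own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary word-presence features and a noisy label driven by the first column
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 5)).astype(float)
signal = X_demo[:, 0] + rng.normal(scale=0.5, size=200)
y_demo = (signal > 0.5).astype(int)

# Split first, keeping the class balance similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# The scaler learns its mean and variance from the training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```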


In [51]:
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Per-class precision, recall and F1. "support" is the number of true observations
# of each class in y_test (the row sums of the confusion matrix), regardless of prediction.
print(classification_report(y_test, yhat))

# recall and precision are reported for the positive class (1 = high pay); accuracy is overall
print('recall', recall_score(y_test, yhat))
print('accuracy', accuracy_score(y_test, yhat))
print('precision', precision_score(y_test, yhat))

              precision    recall  f1-score   support

         0.0       0.68      0.70      0.69       316
         1.0       0.72      0.70      0.71       344

   micro avg       0.70      0.70      0.70       660
   macro avg       0.70      0.70      0.70       660
weighted avg       0.70      0.70      0.70       660

recall 0.7005813953488372
accuracy 0.6984848484848485
precision 0.7151335311572701
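
The three scores above can be reproduced by hand from the confusion matrix printed for this model (rows are the actual class, columns the predicted class, with high pay listed first). A short check, using the counts from that matrix:

```python
# Counts taken from the confusion matrix of the requirements model
tp, fn = 241, 103   # actual high pay: predicted high, predicted low
fp, tn = 96, 220    # actual low pay:  predicted high, predicted low

recall = tp / (tp + fn)                      # share of high-pay postings that were found
precision = tp / (tp + fp)                   # share of "high" predictions that were right
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that were right

print('recall', recall)
print('accuracy', accuracy)
print('precision', precision)
```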


In [52]:
coeffs = pd.DataFrame(lr.coef_, columns=req_words.columns)
coeffs_t = coeffs.transpose()
coeffs_t.columns = ['logreg_coefs']
coeffs_topbot = coeffs_t.sort_values('logreg_coefs', ascending=False)
print(coeffs_topbot.head(10))
print(coeffs_topbot.tail(10))

                      logreg_coefs
candidates notified       0.856865
shortlisted               0.821166
practical                 0.750684
player                    0.735283
regret shortlisted        0.652758
architecture              0.573697
phd                       0.519550
self starter              0.513826
communication skills      0.501911
ms office                 0.500870
                        logreg_coefs
regret                     -0.506502
hands                      -0.557009
computer                   -0.588008
sg                         -0.630626
shortlisted candidates     -0.640421
candidates                 -0.653466
diploma                    -0.687557
ms                         -0.743880
team player                -0.854804
notified                   -0.923174
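
Because the features were standardised before fitting, each coefficient is the change in the log-odds of high pay for a one standard deviation increase in that feature. Exponentiating gives an odds multiplier, which is often easier to explain. A small sketch using four coefficients copied from the output above (the selection and the name `odds_multiplier` are illustrative):

```python
import math

# Coefficients copied from the output above
coefs = {
    'candidates notified': 0.856865,
    'phd': 0.519550,
    'diploma': -0.687557,
    'notified': -0.923174,
}

# exp(coef) > 1 raises the odds of high pay, exp(coef) < 1 lowers them
odds_multiplier = {word: math.exp(c) for word, c in coefs.items()}
for word, mult in odds_multiplier.items():
    print(word, round(mult, 2))
```

Note that "candidates notified" and "notified" get large coefficients of opposite sign. The unigram occurs whenever the bigram does, so the two columns are strongly correlated and their individual coefficients should be read with caution.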


# Decision tree Classifier

In [53]:
# try decision tree
from sklearn.tree import DecisionTreeClassifier

#  Set up and run the gridsearch on the data
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,1,2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

from sklearn.model_selection import GridSearchCV
# set the gridsearch
dtc_gs = GridSearchCV(DecisionTreeClassifier(), 
                      dtc_params, 
                      cv=5, 
                      verbose=1, 
                      scoring='roc_auc', 
                      n_jobs=-1)

# fit the grid search on the data
dtc_gs.fit(req_words.values, df.high_pay.values)

dtc_best = dtc_gs.best_estimator_
print('best_params:' ,dtc_gs.best_params_)
print('best_score:', dtc_gs.best_score_)

# find "feature importances"

# Each importance lies between 0 and 1, and the importances of all features sum to 1.
# A feature's importance is the total reduction in the split criterion (Gini impurity by default)
# achieved by the splits that use it, weighted by the number of samples reaching those splits.
# Features the tree never splits on get an importance of 0.


Fitting 5 folds for each of 330 candidates, totalling 1650 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    9.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   17.8s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:   29.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:   43.9s
[Parallel(n_jobs=-1)]: Done 1650 out of 1650 | elapsed:   57.5s finished


best_params: {'max_depth': None, 'max_features': None, 'min_samples_split': 50}
best_score: 0.7264848830657634


In [54]:
fi = pd.DataFrame({
        'feature':req_words.columns,
        'importance':dtc_best.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
print(fi.head(10))
print(fi.tail(10))


           feature  importance
115        diploma    0.148540
147     experience    0.074869
467  understanding    0.046063
111    development    0.029808
287         office    0.023988
321      practical    0.019119
42             big    0.018009
1         10 years    0.017644
200      including    0.016916
348        project    0.014918
                    feature  importance
208              initiative         0.0
207          infrastructure         0.0
206  information technology         0.0
205             information         0.0
204               influence         0.0
203                industry         0.0
202           independently         0.0
201             independent         0.0
198             improvement         0.0
499           years working         0.0


# Question 2 - Factors that distinguish job category. 

Identify features of the job postings that distinguish job titles from each other:

2a) Data scientists vs. other data jobs

2b) Junior vs. senior positions

2c) Do the requirements for titles vary significantly with industry?

Models used: BernoulliNB, logistic regression, decision tree classifier, MultinomialNB

In [55]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,company,job_title,location,employment_type,seniority,job_categories,job_description,requirements,salary_low,salary_high,salary_avg,high_pay
0,0,national university of singapore,"senior / associate director, data governance /...",lower kent ridge road,"permanent, full time",senior management,"education and training, information technology",\r\r\nthis leadership role will interact and e...,"\r\r\ndegree in information technology, comput...",7000.0,9000.0,8000.0,1.0
1,1,ntuc enterprise nexus co-operative limited,data scientist,marina boulevard,full time,executive,information technology,\r\r\nntuc enterprise is in the midst of its d...,"\r\r\n• masters in statistics, mathematics, co...",3500.0,10000.0,6750.0,1.0


# 2a) data scientists vs other data jobs

In [56]:
data_scientist = df[df['job_title'].str.contains("scientist")]

In [57]:
data_scientist.shape

(76, 13)

In [58]:
df.loc[df['job_title'].str.contains("scientist"), 'data_sci'] = 1

In [59]:
df.data_sci.replace(np.nan, 0, inplace=True)
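
The two steps above (set the matching rows to 1, then replace the remaining NaN with 0) can also be written as a single expression. A minimal sketch on a toy frame; `demo` and its titles are made up for the example, and `regex=False` simply treats the pattern as a plain substring.

```python
import pandas as pd

# Toy stand-in for df with a lower-cased job_title column
demo = pd.DataFrame({'job_title': [
    'data scientist',
    'business data analyst',
    'scientist (data analytics)',
    'data engineer',
]})

# True/False from the substring match, converted to 1.0/0.0 in one step
demo['data_sci'] = demo['job_title'].str.contains('scientist', regex=False).astype(float)
print(demo)
```

As in the notebook, any title containing "scientist" counts as positive, so this flag is somewhat broader than "data scientist" alone.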

In [60]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,company,job_title,location,employment_type,seniority,job_categories,job_description,requirements,salary_low,salary_high,salary_avg,high_pay,data_sci
0,0,national university of singapore,"senior / associate director, data governance /...",lower kent ridge road,"permanent, full time",senior management,"education and training, information technology",\r\r\nthis leadership role will interact and e...,"\r\r\ndegree in information technology, comput...",7000.0,9000.0,8000.0,1.0,0.0
1,1,ntuc enterprise nexus co-operative limited,data scientist,marina boulevard,full time,executive,information technology,\r\r\nntuc enterprise is in the midst of its d...,"\r\r\n• masters in statistics, mathematics, co...",3500.0,10000.0,6750.0,1.0,1.0
2,2,a*star research entities,scientist (data analytics) / i2r (a*star),raffles place,"contract, full time",professional,sciences / laboratory / r&d,\r\r\nabout the institute for infocomm researc...,\r\r\nphd in computer science or other related...,4500.0,9000.0,6750.0,1.0,1.0
3,4,china aviation oil (singapore) corporation ltd,business data analyst,temasek boulevard,full time,senior executive,information technology,\r\r\nresponsible for front desk business syst...,\r\r\nbachelor degree or equivalent \r\r\n3-5...,4000.0,8000.0,6000.0,1.0,0.0
4,5,nityo infotech services pte. ltd.,big data security consultant,ubi crescent,contract,executive,information technology,\r\r\nsolid technical knowledge in data discov...,\r\r\nat least 4 years of experience in implem...,6500.0,8500.0,7500.0,1.0,0.0


In [61]:
df.shape

(2637, 14)

In [62]:
# combine all feature sets except the job-title words, because the target is derived from the job title
all_features = pd.concat([seniority_words,job_categories_words,location_words,employment_type_words,req_words,jd_words], axis=1, sort=False)
print(all_features.shape)

(2637, 2276)


In [63]:
X_train, X_test, y_train, y_test = train_test_split(all_features.values, df.data_sci.values, test_size=0.25)

In [64]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))


[0.9040404  0.94444444 0.94949495 0.89393939 0.93939394 0.94949495
 0.93939394 0.90909091 0.95939086 0.91326531]
0.9301949098359541


In [65]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':all_features.columns.values})
# Difference between the probability that a feature appears in a data-scientist posting (high_p) and in any other posting (low_p)
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Sort so that head() shows the words most associated with data scientists and tail() those most associated with other jobs
feat_probs.sort_values('high_diff', ascending=False, inplace=True)
print(feat_probs.head(10))
print(feat_probs.tail(10))
# accuracy on the held-out test set
print(nb.score(X_test, y_test))


        high_p     low_p           feature  high_diff
519   0.666667  0.034878  machine learning   0.631789
518   0.666667  0.038001           machine   0.628666
508   0.700000  0.078605          learning   0.621395
632   0.683333  0.089016            python   0.594317
1996  0.583333  0.038522           science   0.544812
708   0.516667  0.042166        statistics   0.474501
365   0.800000  0.339927              data   0.460073
1562  0.533333  0.083811          learning   0.449523
1136  0.466667  0.017699      data science   0.448968
1998  0.433333  0.009370         scientist   0.423963
        high_p     low_p     feature  high_diff
1933  0.066667  0.211869    required  -0.145202
809   0.050000  0.198334  activities  -0.148334
391   0.016667  0.166059     diploma  -0.149393
758   0.316667  0.470068        work  -0.153401
1609  0.250000  0.415409  management  -0.165409
1827  0.083333  0.258719     process  -0.175386
1606  0.100000  0.279542      manage  -0.179542
524   0.166667  0.3466

# use logistic regression

In [66]:
y = df.data_sci.values
X = all_features.values

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
Xs = ss.fit_transform(X)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.25)
# fit on the training split only
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predictions and pred prob.
yhat = lr.predict(X_test)
yhat_pp = lr.predict_proba(X_test)

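# confusion matrix; the positive class (1) is "data scientist" here,
# so the is_high / predicted_high labels below refer to data-scientist postings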
conmat = np.array(confusion_matrix(y_test, yhat, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['is_high', 'is_low'],
                         columns=['predicted_high','predicted_low'])
print(confusion)

         predicted_high  predicted_low
is_high              10              5
is_low               36            609


In [67]:
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Per-class precision, recall and F1. "support" is the number of true observations
# of each class in y_test (the row sums of the confusion matrix), regardless of prediction.
print(classification_report(y_test, yhat))

# recall and precision are reported for the positive class (1 = data scientist); accuracy is overall
print('recall', recall_score(y_test, yhat))
print('accuracy', accuracy_score(y_test, yhat))
print('precision', precision_score(y_test, yhat))

              precision    recall  f1-score   support

         0.0       0.99      0.94      0.97       645
         1.0       0.22      0.67      0.33        15

   micro avg       0.94      0.94      0.94       660
   macro avg       0.60      0.81      0.65       660
weighted avg       0.97      0.94      0.95       660

recall 0.6666666666666666
accuracy 0.9378787878787879
precision 0.21739130434782608
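
With only 15 data-scientist postings among the 660 test rows, accuracy on its own is misleading. The sketch below compares the model with a baseline that always predicts "not a data scientist", using the counts from the confusion matrix above (the variable names are illustrative).

```python
# Counts from the confusion matrix above (positive class = data scientist)
tp, fn = 10, 5
fp, tn = 36, 609
n = tp + fn + fp + tn

model_accuracy = (tp + tn) / n
# Always predicting the majority class gets every negative right and every positive wrong
baseline_accuracy = max(tp + fn, fp + tn) / n

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('model accuracy   ', round(model_accuracy, 4))
print('baseline accuracy', round(baseline_accuracy, 4))
print('f1 (positive)    ', round(f1, 2))
```

The baseline scores higher on accuracy than the model, so for this question precision, recall and F1 on the positive class are the more informative numbers.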


In [68]:
coeffs = pd.DataFrame(lr.coef_, columns=all_features.columns)
coeffs_t = coeffs.transpose()
coeffs_t.columns = ['logreg_coefs']
coeffs_topbot = coeffs_t.sort_values('logreg_coefs', ascending=False)
print(coeffs_topbot.head(10))
print(coeffs_topbot.tail(10))

                  logreg_coefs
scientist             0.414746
data scientist        0.366823
spark                 0.204635
scientific            0.174299
enterprise            0.169497
data science          0.166818
analytic              0.166506
anson                 0.160395
methods               0.152620
machine learning      0.151980
                logreg_coefs
license            -0.144409
engineer           -0.150412
automation         -0.154448
contract           -0.158879
platform           -0.163544
workplace          -0.169008
write              -0.169833
reg                -0.172625
implementation     -0.180173
battery            -0.238940


# 2b) distinguishing junior vs. senior positions

In [69]:
senior_position = df[df['seniority'].str.contains("senior")]

In [70]:
senior_position.shape

(506, 14)

In [71]:
df.loc[df['seniority'].str.contains("senior"), 'senior_position'] = 1

In [72]:
df.senior_position.replace(np.nan, 0, inplace=True)

In [73]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,company,job_title,location,employment_type,seniority,job_categories,job_description,requirements,salary_low,salary_high,salary_avg,high_pay,data_sci,senior_position
0,0,national university of singapore,"senior / associate director, data governance /...",lower kent ridge road,"permanent, full time",senior management,"education and training, information technology",\r\r\nthis leadership role will interact and e...,"\r\r\ndegree in information technology, comput...",7000.0,9000.0,8000.0,1.0,0.0,1.0
1,1,ntuc enterprise nexus co-operative limited,data scientist,marina boulevard,full time,executive,information technology,\r\r\nntuc enterprise is in the midst of its d...,"\r\r\n• masters in statistics, mathematics, co...",3500.0,10000.0,6750.0,1.0,1.0,0.0


In [74]:
# combine all feature sets except the seniority words, because the target is derived from the seniority column
all_features = pd.concat([title_words,job_categories_words,location_words,employment_type_words,req_words,jd_words], axis=1, sort=False)
print(all_features.shape)

(2637, 2356)


In [75]:
X_train, X_test, y_train, y_test = train_test_split(all_features.values, df.senior_position.values, test_size=0.25)

In [76]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))

[0.70854271 0.71356784 0.7638191  0.75252525 0.66497462 0.7106599
 0.69035533 0.7106599  0.67005076 0.69543147]
0.708058688046189


In [77]:
feat_lp = nb.feature_log_prob_
high_p = np.exp(feat_lp[1])
low_p = np.exp(feat_lp[0])
# Make a dataframe with the probabilities and features
feat_probs = pd.DataFrame({'high_p':high_p, 'low_p':low_p, 'feature':all_features.columns.values})
# Difference between the probability that a feature appears in a senior posting (high_p) and in a non-senior posting (low_p)
feat_probs['high_diff'] = feat_probs.high_p - feat_probs.low_p

# Sort so that head() shows the words most associated with senior roles and tail() those most associated with junior roles
feat_probs.sort_values('high_diff', ascending=False, inplace=True)
print(feat_probs.head(10))
print(feat_probs.tail(10))

# accuracy on the held-out test set
print(nb.score(X_test, y_test))


        high_p     low_p                 feature  high_diff
80    0.333333  0.114826                  senior   0.218507
1027  0.558081  0.407571                business   0.150510
852   0.684343  0.552681                   years   0.131662
791   0.507576  0.379180                  strong   0.128396
1254  0.396465  0.270662                  design   0.125802
1788  0.229798  0.114196             operational   0.115602
114   0.454545  0.345110  information technology   0.109435
113   0.454545  0.345110             information   0.109435
129   0.454545  0.345110              technology   0.109435
1909  0.303030  0.193691               processes   0.109339
        high_p     low_p         feature  high_diff
126   0.012626  0.069401     secretarial  -0.056774
1691  0.123737  0.181703         manager  -0.057966
902   0.022727  0.085804  administrative  -0.063077
14    0.027778  0.094637       assistant  -0.066859
731   0.118687  0.189905        required  -0.071218
1212  0.010101  0.088328    

#  use logistic regression

In [78]:
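# The target for this section is the senior flag, not data_sci.
# NOTE: the outputs recorded below (20 positives among 660 test rows, 'scientist' among the
# top coefficients) are consistent with a run that used df.data_sci, so they should be regenerated.
y = df.senior_position.values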
X = all_features.values

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
Xs = ss.fit_transform(X)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.25)
# fit on the training split only
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predictions and pred prob.
yhat = lr.predict(X_test)
yhat_pp = lr.predict_proba(X_test)

# confusion matrix metrics
conmat = np.array(confusion_matrix(y_test, yhat, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['is_high', 'is_low'],
                         columns=['predicted_high','predicted_low'])
print(confusion)

         predicted_high  predicted_low
is_high              20              0
is_low               38            602


In [79]:
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Per-class precision, recall and F1. "support" is the number of true observations
# of each class in y_test (the row sums of the confusion matrix), regardless of prediction.
print(classification_report(y_test, yhat))

# recall and precision are reported for the positive class (1); accuracy is overall
print('recall', recall_score(y_test, yhat))
print('accuracy', accuracy_score(y_test, yhat))
print('precision', precision_score(y_test, yhat))

              precision    recall  f1-score   support

         0.0       1.00      0.94      0.97       640
         1.0       0.34      1.00      0.51        20

   micro avg       0.94      0.94      0.94       660
   macro avg       0.67      0.97      0.74       660
weighted avg       0.98      0.94      0.96       660

recall 1.0
accuracy 0.9424242424242424
precision 0.3448275862068966


In [80]:
coeffs = pd.DataFrame(lr.coef_, columns=all_features.columns)
coeffs_t = coeffs.transpose()
coeffs_t.columns = ['logreg_coefs']
coeffs_topbot = coeffs_t.sort_values('logreg_coefs', ascending=False)
print(coeffs_topbot.head(10))
print(coeffs_topbot.tail(10))

                logreg_coefs
scientist           1.031016
data scientist      0.791236
scientist           0.183748
data                0.173265
data scientist      0.158966
vibrant             0.150510
analytic            0.136657
expertise           0.117478
goal                0.113929
budgets             0.113677
               logreg_coefs
developing        -0.082452
extensive         -0.084526
technology        -0.087750
cloud             -0.089297
engineer          -0.091662
value             -0.093068
real time         -0.099755
university        -0.107744
data analyst      -0.127364
data engineer     -0.127733


# Decision tree Classifier

In [81]:
# try decision tree
from sklearn.tree import DecisionTreeClassifier

#  Set up and run the gridsearch on the data
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,1,2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

from sklearn.model_selection import GridSearchCV
# set the gridsearch
dtc_gs = GridSearchCV(DecisionTreeClassifier(), 
                      dtc_params, 
                      cv=5, 
                      verbose=1, 
                      scoring='roc_auc', 
                      n_jobs=-1)

# fit the grid search on the data
dtc_gs.fit(all_features.values, df.senior_position.values)

dtc_best = dtc_gs.best_estimator_
print('best_params:' ,dtc_gs.best_params_)
print('best_score:', dtc_gs.best_score_)

Fitting 5 folds for each of 330 candidates, totalling 1650 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   41.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 1650 out of 1650 | elapsed:  5.7min finished


best_params: {'max_depth': None, 'max_features': None, 'min_samples_split': 25}
best_score: 0.6668057015050338


In [82]:
fi = pd.DataFrame({
        'feature':all_features.columns,
        'importance':dtc_best.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
print(fi.head(10))
print(fi.tail(10))

                feature  importance
1029  business function    0.092525
80               senior    0.076606
428            computer    0.016341
197                 hoe    0.016036
732        requirements    0.015082
21             business    0.015013
40             director    0.014895
1345             engage    0.014500
50                 head    0.014241
1205               data    0.014043
               feature  importance
825                use         0.0
824         university         0.0
823      understanding         0.0
822         understand         0.0
821    troubleshooting         0.0
820             trends         0.0
819             travel         0.0
818           training         0.0
817       track record         0.0
2355  years experience         0.0


# 2c) Do the requirements for titles vary significantly with industry?
Look at job_categories.

In [83]:
df.job_categories.value_counts()

information technology                                                                                         780
engineering                                                                                                    183
accounting / auditing / taxation                                                                               134
banking and finance                                                                                            124
others                                                                                                         105
sciences / laboratory / r&d                                                                                     99
admin / secretarial                                                                                             90
human resources                                                                                                 80
marketing / public relations                                                    

In [84]:
# keep only the four most frequent job_categories (those with more than 110 postings)
counts = df['job_categories'].value_counts()
mask = df['job_categories'].isin(counts[counts > 110].index)
df2 = df.loc[mask, :].copy()  # copy so new columns can be added without a SettingWithCopyWarning
print(df2['job_categories'].value_counts())
print(df2.shape)

information technology              780
engineering                         183
accounting / auditing / taxation    134
banking and finance                 124
Name: job_categories, dtype: int64
(1221, 15)


In [85]:
# encode the four categories as 0-3; the masks are built from df2 itself, which holds only these four exact values
df2.loc[df2['job_categories'] == "information technology", 'four_categories'] = 0
df2.loc[df2['job_categories'] == "engineering", 'four_categories'] = 1
df2.loc[df2['job_categories'] == "accounting / auditing / taxation", 'four_categories'] = 2
df2.loc[df2['job_categories'] == "banking and finance", 'four_categories'] = 3

In [86]:
df2.four_categories.value_counts()

0.0    780
1.0    183
2.0    134
3.0    124
Name: four_categories, dtype: int64

In [89]:
# combine all feature sets except the job_categories words, because the target is derived from job_categories
all_features = pd.concat([title_words,seniority_words,location_words,employment_type_words,req_words,jd_words], axis=1, sort=False)
X=all_features.loc[mask, :]
print(X.shape)

(1221, 2346)


In [90]:
X_train, X_test, y_train, y_test = train_test_split(X.values, df2['four_categories'].values, test_size=0.25)

In [91]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=10)
print(nb_scores)
print(np.mean(nb_scores))

[0.77173913 0.81521739 0.73913043 0.72826087 0.84782609 0.75
 0.77173913 0.82417582 0.76923077 0.79775281]
0.7815072445873618


In [92]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
clf_scores = cross_val_score(MultinomialNB(), X_train, y_train, cv=10)
print(clf_scores)
print(np.mean(clf_scores))

[0.7826087  0.82608696 0.76086957 0.7826087  0.86956522 0.7826087
 0.7826087  0.84615385 0.81318681 0.7752809 ]
0.8021578079956194


In [93]:
y_pred = clf.predict(X_test)

print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

accuracy 0.7777777777777778
              precision    recall  f1-score   support

         0.0       0.90      0.81      0.85       191
         1.0       0.68      0.68      0.68        56
         2.0       0.64      0.72      0.68        25
         3.0       0.54      0.79      0.64        34

   micro avg       0.78      0.78      0.78       306
   macro avg       0.69      0.75      0.71       306
weighted avg       0.80      0.78      0.78       306
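
The macro and weighted averages in the report differ because the four classes are very unequal in size. The sketch below recomputes both from the per-class rows above; since those rows are rounded to two decimals, the results are approximate but agree with the report after rounding.

```python
# (precision, recall, support) per class, copied from the report above
per_class = {
    'information technology': (0.90, 0.81, 191),
    'engineering': (0.68, 0.68, 56),
    'accounting / auditing / taxation': (0.64, 0.72, 25),
    'banking and finance': (0.54, 0.79, 34),
}
total = sum(support for _, _, support in per_class.values())

# Macro average: every class counts equally
macro_precision = sum(p for p, _, _ in per_class.values()) / len(per_class)
macro_recall = sum(r for _, r, _ in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support
weighted_precision = sum(p * s for p, _, s in per_class.values()) / total
weighted_recall = sum(r * s for _, r, s in per_class.values()) / total

print(round(macro_precision, 2), round(macro_recall, 2))
print(round(weighted_precision, 2), round(weighted_recall, 2))
```

The weighted figures are pulled up by the large information technology class, while the macro figures show that the three smaller categories are predicted less reliably.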



In [94]:
feat_lp = clf.feature_log_prob_
IT_p = np.exp(feat_lp[0])
engineering_p = np.exp(feat_lp[1])
accounting_p = np.exp(feat_lp[2])
banking_p = np.exp(feat_lp[3])

# Make a dataframe with the per-category word probabilities and the feature names
feat_probs = pd.DataFrame({'IT_p': IT_p, 'engineering_p': engineering_p,
                           'accounting_p': accounting_p, 'banking_p': banking_p,
                           'feature': all_features.columns.values})
# Create columns holding the difference between a word's probability under IT and under each other category
feat_probs['IT_engin'] = feat_probs.IT_p - feat_probs.engineering_p
feat_probs['IT_acc'] = feat_probs.IT_p - feat_probs.accounting_p
feat_probs['IT_bank'] = feat_probs.IT_p - feat_probs.banking_p

# For each pair, the head shows words more likely under IT and the tail shows words more likely under the other category
feat_probs.sort_values('IT_engin', ascending=False, inplace=True)  
print(feat_probs.loc[:,['feature','IT_engin']].head(10))
print(feat_probs.loc[:,['feature','IT_engin']].tail(10))
feat_probs.sort_values('IT_acc', ascending=False, inplace=True)
print(feat_probs.loc[:,['feature','IT_acc']].head(10))
print(feat_probs.loc[:,['feature','IT_acc']].tail(10))
feat_probs.sort_values('IT_bank', ascending=False, inplace=True)  
print(feat_probs.loc[:,['feature','IT_bank']].head(10))
print(feat_probs.loc[:,['feature','IT_bank']].tail(10))

               feature  IT_engin
1017          business  0.002379
781             strong  0.001560
435               data  0.001520
392           business  0.001446
771                sql  0.001393
594         management  0.001365
419   computer science  0.001275
823                web  0.001250
457        development  0.001185
797       technologies  0.001173
             feature  IT_engin
1338        engineer -0.001322
2134  specifications -0.001406
461          diploma -0.001521
1292        drawings -0.001638
604       mechanical -0.001638
474       electrical -0.001713
1361       equipment -0.001818
1339     engineering -0.002563
477      engineering -0.003212
41          engineer -0.003237
               feature    IT_acc
1244            design  0.003033
457        development  0.002489
1264       development  0.002333
419   computer science  0.002242
2217         technical  0.002208
735            science  0.002138
771                sql  0.002083
418           computer  0.002083
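
A raw probability difference tends to favour words that are frequent everywhere, because a small relative gap on a common word can outweigh a large relative gap on a rare one. As a hedged alternative (not what the notebook does), the words could be ranked by the log ratio of the two probabilities, which for `MultinomialNB` is simply `feat_lp[0] - feat_lp[1]`. The sketch below uses invented probabilities for three words to show how the two rankings can differ; `rank_by_difference` and `rank_by_log_ratio` are hypothetical helper names.

```python
import math

# Hypothetical per-category word probabilities, standing in for
# np.exp(clf.feature_log_prob_[k]). The numbers are invented for illustration.
it_p = {"data": 0.0100, "sql": 0.0020, "drawings": 0.0002}
engin_p = {"data": 0.0080, "sql": 0.0002, "drawings": 0.0020}


def rank_by_difference(p_a, p_b):
    """Sort words by p_a - p_b, largest first (the approach used in the notebook)."""
    diffs = {word: p_a[word] - p_b[word] for word in p_a}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


def rank_by_log_ratio(p_a, p_b):
    """Sort words by log(p_a / p_b), largest first (scale-free alternative)."""
    ratios = {word: math.log(p_a[word] / p_b[word]) for word in p_a}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


print([word for word, _ in rank_by_difference(it_p, engin_p)])
print([word for word, _ in rank_by_log_ratio(it_p, engin_p)])
```

In this toy case the difference puts "data" first even though it is only 1.25 times more likely under IT, while the log ratio puts "sql" first because it is ten times more likely. The default smoothing in `MultinomialNB` keeps every probability above zero, so the ratio is always defined.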

# Use logistic regression

In [95]:
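from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

y = df2.four_categories.values

# Standardise the features so coefficients are comparable across words.
# The scaler is fitted on all rows before the split, so test-set statistics reach the training data.
ss = StandardScaler()
Xs = ss.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.25)

# Fit on the training split only
lr = LogisticRegression()
lr.fit(X_train, y_train)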

# predictions and pred prob.
yhat = lr.predict(X_test)
yhat_pp = lr.predict_proba(X_test)

# confusion matrix metrics
conmat = np.array(confusion_matrix(y_test, yhat, labels=[0,1,2,3]))

confusion = pd.DataFrame(conmat, index=['is_IT', 'is_engin','is_acc','is_bank'],
                         columns=['predicted_IT', 'predicted_engin','predicted_acc','predicted_bank'])
print(confusion)

          predicted_IT  predicted_engin  predicted_acc  predicted_bank
is_IT              148               24              6              25
is_engin             4               32              3               0
is_acc               0                1             22               5
is_bank              5                1              1              29


In [96]:
from sklearn.metrics import classification_report
print(classification_report(y_test, yhat))
# Support is the number of observations whose true label is the given class:
# the row sum of the confusion matrix, regardless of what was predicted.


              precision    recall  f1-score   support

         0.0       0.94      0.73      0.82       203
         1.0       0.55      0.82      0.66        39
         2.0       0.69      0.79      0.73        28
         3.0       0.49      0.81      0.61        36

   micro avg       0.75      0.75      0.75       306
   macro avg       0.67      0.79      0.71       306
weighted avg       0.82      0.75      0.77       306
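
The link between the confusion matrix and this report can be checked by hand. The sketch below copies the matrix printed above and recomputes support (row sums), precision (diagonal over column sums) and recall (diagonal over row sums). `per_class_metrics` is a helper name introduced here for illustration, and it assumes every class was predicted at least once.

```python
# Confusion matrix copied from the logistic regression output above
# (rows = true class, columns = predicted class).
labels = ["IT", "engin", "acc", "bank"]
conmat = [
    [148, 24, 6, 25],
    [4, 32, 3, 0],
    [0, 1, 22, 5],
    [5, 1, 1, 29],
]


def per_class_metrics(matrix):
    """Return (support, precision, recall) lists, one entry per class."""
    n = len(matrix)
    # Support: how many observations truly belong to each class (row sums)
    support = [sum(row) for row in matrix]
    # How many observations were predicted as each class (column sums)
    predicted = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    precision = [matrix[k][k] / predicted[k] for k in range(n)]
    recall = [matrix[k][k] / support[k] for k in range(n)]
    return support, precision, recall


support, precision, recall = per_class_metrics(conmat)
for name, s, p, r in zip(labels, support, precision, recall):
    print(f"{name:>6}  support={s:3d}  precision={p:.2f}  recall={r:.2f}")
```

The row sums come to 203, 39, 28 and 36, matching the support column, and the recomputed precision and recall agree with the report to two decimals. The low precision for banking (29 correct out of 59 predicted) comes mostly from the 25 IT postings predicted as banking.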



In [97]:
lr.classes_

array([0., 1., 2., 3.])

In [98]:
# Collect the per-category coefficients alongside the feature names
df_log = pd.DataFrame({'IT': lr.coef_[0],
                       'Engin': lr.coef_[1],
                       'Acc': lr.coef_[2],
                       'Bank': lr.coef_[3],
                       'Features': X.columns.values})
# Look at the features with the largest coefficients for each of the 4 job categories below

In [99]:
df_log.sort_values('IT',ascending = False).head(20)

             IT      Engin        Acc       Bank        Features
332    0.443393  -0.482776   0.030415   0.128679  permanent time
329    0.345924  -0.305553   0.148922  -0.140378       permanent
2077   0.255096  -0.143817  -0.005332  -0.078297        security
1916   0.234966  -0.154956  -0.065215  -0.123942         program
323    0.216684  -0.100098   0.009005  -0.057110   contract time
819    0.212528  -0.152652   0.030893  -0.160489         various
320    0.212034  -0.147859  -0.040095   0.113079        contract
26     0.211756  -0.168358  -0.010947  -0.049815      consultant
1463   0.208301  -0.111962  -0.081939  -0.042447            good
59     0.197454  -0.159049  -0.032161  -0.060207         network


In [100]:
df_log.sort_values('Engin',ascending = False).head(20)

             IT      Engin        Acc       Bank       Features
477   -0.219835   0.408194  -0.098317  -0.007390    engineering
41    -0.283179   0.406128  -0.075604  -0.069100       engineer
604   -0.160297   0.392788  -0.137205  -0.045848     mechanical
1339  -0.181136   0.340608  -0.090579   0.005896    engineering
2049  -0.167880   0.327395  -0.083715  -0.056532         safety
1292  -0.135610   0.318694  -0.028294  -0.022619       drawings
474   -0.160364   0.305954  -0.125306  -0.070370     electrical
1687  -0.181636   0.280145  -0.025154   0.007266  manufacturing
1134  -0.186642   0.273703   0.036365   0.027232   construction
1361  -0.115977   0.246596  -0.067375   0.008166      equipment


In [101]:
df_log.sort_values('Acc',ascending = False).head(20)

             IT      Engin        Acc       Bank            Features
353   -0.260064  -0.075387   0.463401  -0.116321          accounting
2     -0.103687  -0.030477   0.445390  -0.230845  accounts assistant
149   -0.002899  -0.044628   0.411649  -0.426200             central
1623  -0.101340  -0.006576   0.363237  -0.040005                laws
862   -0.185559  -0.066813   0.310465  -0.056231          accounting
134   -0.118506   0.106562   0.303210  -0.313866                boon
14    -0.123473  -0.042845   0.288632  -0.189325           assistant
2154  -0.053144  -0.019720   0.281871  -0.018145           statutory
0     -0.110623  -0.010613   0.279675   0.014973             account
1736  -0.124422  -0.021453   0.275219   0.027056           month end


In [102]:
df_log.sort_values('Bank',ascending = False).head(20)

             IT      Engin        Acc       Bank       Features
1584  -0.177741  -0.060824  -0.148046   0.426246     investment
47    -0.163671  -0.046875  -0.170618   0.285032          group
1693  -0.200933   0.083871   0.014980   0.283410        markets
1985  -0.154924  -0.032635  -0.143968   0.277425     regulatory
382   -0.106049  -0.091034  -0.135334   0.268336        banking
2250  -0.086346  -0.007496   0.041818   0.266904        trading
1863  -0.041810  -0.076114  -0.080372   0.242304       positive
18    -0.133509   0.002822   0.038675   0.242253        banking
638   -0.077757   0.065640   0.012734   0.241314  opportunities
969   -0.043608   0.175665   0.232663   0.237345       audience
