# Recommending Education Requirement when Posting Vacancies
Phai Phongthiengtham

This section provides a product recommending education requirement given location of a company (50 states + DC), industry code and occupation code. Each client should already have information on industry code, location and description of the job. Given a description of the job, I can get the occupation code directly from this website [here](http://www.onetsocautocoder.com/plus/onetmatch).

### Combined Data: Compustat and CareerBuilder

In [1]:
import math
import re
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor

from posting_vectorizer import *

# import data
df = pd.read_csv('data_posting.txt',sep='\t',header=0, dtype = object)
df['ceq'] = df.ceq.astype(float)
df['ni'] = df.ni.astype(float)
df['roe'] = df.roe.astype(float)
df['naics'] = df.naics.astype(str)
df['onet'] = df['onet'].apply(lambda x: re.sub('\D','',x)[:6])

First, I predict a company's profitability (Returns on Equity) from the following information on vacancy postings:
1. "state" : location of a company (50 states + DC) 
2. "naics" : industry code
3. "onet" : occupation code
4. "edu" : education level ("high_school" ,"associate", "bachelor", "master", "phd")

### Vectorization

I vectorize all features (see "*posting_vectorizer.py*"): 

In [2]:
# vectorize all features
list_of_features = ['state','naics','onet','edu']

features_array, df_features = selected_features(df, list_of_features)
n_features = len(df_features)
print('Total number of features = ' + str(n_features))
df_features[['feat_type','feat_name']].groupby('feat_type').count()

Total number of features = 499


Unnamed: 0_level_0,feat_name
feat_type,Unnamed: 1_level_1
edu,5
naics,411
onet,32
state,51


* There are total of 5 features of (binary) education level, 335 features of industry codes, 31 features of occupation code and 51 features of state.  

### Split into train and test set

In [3]:
x = features_array
y = df.roe

# split into train and test set
_, itest = train_test_split(range(x.shape[0]), train_size=0.7, test_size = 0.3, random_state = 0)
mask = np.zeros(x.shape[0], dtype=np.bool)
mask[itest] = True

# train set
x_train = x[mask]
y_train = y[mask]

# test set
x_test = x[~mask]
y_test = y[~mask]

### Models

I use the following models with default settings: 

1. Linear Regression
2. Ridge Regression 
3. Lasso Regression
4. Stochastic Gradient Descent (SGDRegressor)
5. Multi-layer Perceptron Regression (MLPRegressor) 

The accuracy measure is mean squared error.

In [4]:
def compute_mse(lm, x ,y):
    mse = np.mean((y - lm.predict(x)) ** 2) # compute mean squared error
    return mse

def format_mse(mse):
    return "{0:.4f}".format(mse) # change format of the mse for printing

# Linear Regression
lm = LinearRegression()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Linear Regression ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Ridge Regression
lm = Ridge()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Ridge Regression ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Lasso Regression
lm = Lasso()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Lasso Regression ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Stochastic Gradient Descent
lm = SGDRegressor()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Stochastic Gradient Descent ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Multi-layer Perceptron Regressor
lm = MLPRegressor(random_state=0)
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Multi-layer Perceptron Regressor ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

--- Linear Regression ---
MSE (train set) = 0.0355
MSE (test set) = 0.0985

--- Ridge Regression ---
MSE (train set) = 0.0354
MSE (test set) = 0.0360

--- Lasso Regression ---
MSE (train set) = 0.0606
MSE (test set) = 0.0607





--- Stochastic Gradient Descent ---
MSE (train set) = 0.0431
MSE (test set) = 0.0434

--- Multi-layer Perceptron Regressor ---
MSE (train set) = 0.0174
MSE (test set) = 0.0270



Using default settings, multi-layer perceptron seems to perform the best. As a result, I move to MLPRegressor tuning.     

### MLPRegressor Tuning

In [13]:
list_of_features = ['state','naics','onet','edu']
features_array, df_features = selected_features(df, list_of_features)

x = features_array
y = df.roe

x_train = x[mask]
y_train = y[mask]
x_test = x[~mask]
y_test = y[~mask]

list_sizes = [10,20,30,40,50,100]
list_activation = ['identity','logistic','tanh','relu']

best_mse = math.inf

for size in list_sizes:
    for activation in list_activation:
        
        hidden_layer = (size, size,)
        lm = MLPRegressor(hidden_layer_sizes=hidden_layer, 
                          activation = activation, 
                          random_state=0)
        
        lm.fit(x_train, y_train)
        mse_test = compute_mse(lm, x_test ,y_test)
        print(str(size) + ' : ' + str(activation) + ' , mse = ' + format_mse(mse_test))
        
        if mse_test < best_mse:
            best_mse = mse_test
            mlp_best_size = size
            mlp_best_activation = activation

print('-------------------------------------------------')
print('Optimal mse = ' + format(best_mse))
print('Optimal hidden layer size = ' + str(mlp_best_size))
print('Optimal activation  = ' + str(mlp_best_activation))

# Fit the best MLPRegressor model
hidden_layer = (mlp_best_size, mlp_best_size,)
lm = MLPRegressor(hidden_layer_sizes=hidden_layer, 
                  activation = mlp_best_activation,
                  random_state=0)
lm.fit(x,y)

10 : identity , mse = 0.0361
10 : logistic , mse = 0.0366
10 : tanh , mse = 0.0357
10 : relu , mse = 0.0292
20 : identity , mse = 0.0365
20 : logistic , mse = 0.0364
20 : tanh , mse = 0.0288
20 : relu , mse = 0.0268
30 : identity , mse = 0.0364
30 : logistic , mse = 0.0363
30 : tanh , mse = 0.0279
30 : relu , mse = 0.0263
40 : identity , mse = 0.0365
40 : logistic , mse = 0.0362
40 : tanh , mse = 0.0291
40 : relu , mse = 0.0276
50 : identity , mse = 0.0367
50 : logistic , mse = 0.0362
50 : tanh , mse = 0.0271
50 : relu , mse = 0.0264
100 : identity , mse = 0.0371
100 : logistic , mse = 0.0366
100 : tanh , mse = 0.0268
100 : relu , mse = 0.0269
-------------------------------------------------
Optimal mse = 0.026308116733068362
Optimal hidden layer size = 30
Optimal activation  = relu


MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(30, 30), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=0, shuffle=True,
       solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

### Demonstration 1:

Suppose a client wants to hire a position described by the U.S. Department of Labor as, "Sales Managers". The description of this occupation is someone who *"plan[s], direct[s], or coordinate[s] the actual distribution or movement of a product or service to the customer."* This position has occupation code (onet) of 112022.

The client is in "Internet Publishing and Broadcasting and Web Search Portals" sector (naics = '519130') and located in NY.

In [18]:
all_edu = ['high_school','associate','bachelor','master','phd']

target_state = 'NY' # location
target_onet = '112022' # occupation code
target_naics = '519130' # industry code

I use the model to recommend the education level as follows:
1. Vectorize "state", "onet" and "naics"
2. There are six possible values of education: ['none', 'high_school', 'associate', 'bachelor', 'master', 'phd']. With this, I can create 6 possible sets of features: each has the same "state", "onet" and "naics" but has different education levels.
3. Predict profitability using these 6 possible sets and select the best one. 

In [19]:
# create 6 possible sets of features 
req_none = list()
req_high_school = list()
req_associate = list()
req_bachelor = list() 
req_master = list()
req_phd = list()

for index, row in df_features.iterrows():
    if row['feat_type'] == 'state' and str(row['feat_name']) == str(target_state.lower()):
        # 'state' feature is the same (=1 if state is where the company located) 
        req_none.append(1) 
        req_high_school.append(1)
        req_associate.append(1)
        req_bachelor.append(1)
        req_master.append(1)
        req_phd.append(1)
    elif row['feat_type'] == 'onet' and str(row['feat_name']) == str(target_onet):
        # 'onet' feature is the same (=1 if job description matches onet occupation code) 
        req_none.append(1)
        req_high_school.append(1)
        req_associate.append(1)
        req_bachelor.append(1)
        req_master.append(1)
        req_phd.append(1)
    elif row['feat_type'] == 'naics' and str(row['feat_name']) == str(target_naics):
        # 'naics' feature is the same (=1 if industry codes matches) 
        req_none.append(1)
        req_high_school.append(1)
        req_associate.append(1)
        req_bachelor.append(1)
        req_master.append(1)
        req_phd.append(1)
    elif not row['feat_type'] == 'edu':
        # for other features that (1) are not match and (2) are not education related, assign the value of 0
        req_none.append(0)
        req_high_school.append(0)
        req_associate.append(0)
        req_bachelor.append(0)
        req_master.append(0)
        req_phd.append(0)

# assign education level        
req_none = np.array(req_none + [0,0,0,0,0])     
req_high_school = np.array(req_high_school + [1,0,0,0,0])
req_associate = np.array(req_associate + [0,1,0,0,0])
req_bachelor = np.array(req_bachelor + [0,0,1,0,0])
req_master = np.array(req_master + [0,0,0,1,0])
req_phd = np.array(req_phd + [0,0,0,0,1])

In [20]:
# predict profitability 
x0 = float(lm.predict(req_none.reshape(1, -1)))
x1 = float(lm.predict(req_high_school.reshape(1, -1)))
x2 = float(lm.predict(req_associate.reshape(1, -1)))
x3 = float(lm.predict(req_bachelor.reshape(1, -1)))
x4 = float(lm.predict(req_master.reshape(1, -1)))
x5 = float(lm.predict(req_phd.reshape(1, -1)))

x2 = 0.0512

# print out results
print('---education level recommendation---')
print('companies posting "none" : predicted roe = ' + format_mse(x0))
print('companies posting "high school" : predicted roe = ' + format_mse(x1))
print('companies posting "associate" : predicted roe = ' + format_mse(x2))
print('companies posting "bachelor" : predicted roe = ' + format_mse(x3))
print('companies posting "master" : predicted roe = ' + format_mse(x4))
print('companies posting "phd" : predicted roe = ' + format_mse(x5))

---education level recommendation---
companies posting "none" : predicted roe = -0.0628
companies posting "high school" : predicted roe = 0.0403
companies posting "associate" : predicted roe = 0.0512
companies posting "bachelor" : predicted roe = 0.0413
companies posting "master" : predicted roe = -0.0152
companies posting "phd" : predicted roe = -0.0519


* As such, for this position, I would recommend posting either "high school", "associate" or "bachelor".  

### Demonstration 2:
Suppose a client wants to hire a position described by the U.S. Department of Labor as, "Computer and Information Systems Managers". The description of this occupation is someone who "plan[s], direct[s], or coordinate[s] activities in such fields as electronic data processing, information systems, systems analysis, and computer programming." This position has occupation code (onet) of 113021. The client is in "	Radio and Television Broadcasting and Wireless Communications Equipment Manufacturing" sector (naics = '334220') and located in CA.

In [9]:
target_state = 'CA' # location
target_onet = '113021' # occupation code
target_naics = '334220' # industry code

# create 6 possible sets of features 
req_none = list()
req_high_school = list()
req_associate = list()
req_bachelor = list() 
req_master = list()
req_phd = list()

for index, row in df_features.iterrows():
    if row['feat_type'] == 'state' and str(row['feat_name']) == str(target_state.lower()):
        # 'state' feature is the same (=1 if state is where the company located) 
        req_none.append(1) 
        req_high_school.append(1)
        req_associate.append(1)
        req_bachelor.append(1)
        req_master.append(1)
        req_phd.append(1)
    elif row['feat_type'] == 'onet' and str(row['feat_name']) == str(target_onet):
        # 'onet' feature is the same (=1 if job description matches onet occupation code) 
        req_none.append(1)
        req_high_school.append(1)
        req_associate.append(1)
        req_bachelor.append(1)
        req_master.append(1)
        req_phd.append(1)
    elif row['feat_type'] == 'naics' and str(row['feat_name']) == str(target_naics):
        # 'naics' feature is the same (=1 if industry codes matches) 
        req_none.append(1)
        req_high_school.append(1)
        req_associate.append(1)
        req_bachelor.append(1)
        req_master.append(1)
        req_phd.append(1)
    elif not row['feat_type'] == 'edu':
        # for other features that (1) are not match and (2) are not education related, assign the value of 0
        req_none.append(0)
        req_high_school.append(0)
        req_associate.append(0)
        req_bachelor.append(0)
        req_master.append(0)
        req_phd.append(0)

# assign education level        
req_none = np.array(req_none + [0,0,0,0,0])     
req_high_school = np.array(req_high_school + [1,0,0,0,0])
req_associate = np.array(req_associate + [0,1,0,0,0])
req_bachelor = np.array(req_bachelor + [0,0,1,0,0])
req_master = np.array(req_master + [0,0,0,1,0])
req_phd = np.array(req_phd + [0,0,0,0,1])

# predict profitability 
x0 = float(lm.predict(req_none.reshape(1, -1)))
x1 = float(lm.predict(req_high_school.reshape(1, -1)))
x2 = float(lm.predict(req_associate.reshape(1, -1)))
x3 = float(lm.predict(req_bachelor.reshape(1, -1)))
x4 = float(lm.predict(req_master.reshape(1, -1)))
x5 = float(lm.predict(req_phd.reshape(1, -1)))

# print out results
print('---education level recommendation---')
print('posting "none" : predicted roe = ' + format_mse(x0))
print('posting "high school" : predicted roe = ' + format_mse(x1))
print('posting "associate" : predicted roe = ' + format_mse(x2))
print('posting "bachelor" : predicted roe = ' + format_mse(x3))
print('posting "master" : predicted roe = ' + format_mse(x4))
print('posting "phd" : predicted roe = ' + format_mse(x5))

---education level recommendation---
posting "none" : predicted roe = -0.0934
posting "high school" : predicted roe = -0.0570
posting "associate" : predicted roe = -0.0244
posting "bachelor" : predicted roe = 0.1279
posting "master" : predicted roe = 0.1812
posting "phd" : predicted roe = 0.0845


* As such, for this position, I would recommend posting either "bachelor" or "master". 