# Job Titles and Profitability
Phai Phongthiengtham

***

This section explores relationships between job titles and profitability.

### Import necessary modules and dataset

In [1]:
import math
import re
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor
import nltk
from nltk import word_tokenize
#nltk.download("stopwords") # un-comment if run for the first time
#nltk.download("punkt") # un-comment if run for the first time

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

from posting_vectorizer import *

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 2000)

df = pd.read_csv('data_10K_no_description.txt',sep='\t',header=0,dtype = object)
df['roe'] = df.roe.astype(float)

### Detect job titles with locations

Upon visual inspection, a large fraction of job titles contain location information. This sub-section (1) creates an extra feature "titleloc", which takes a value of 1 if the job title contains location information and 0 otherwise, and (2) creates an extra feature counting the number of words in each job title.

In [2]:
# "state_name.csv" contains state names and their abbreviations.
state_name_data = pd.read_csv("state_name.csv")
state_name = [w.lower() for w in list(state_name_data.state)]
state_abb = [w.lower() for w in list(state_name_data.abbrevation)]

others = ['dc','north','south','east','west',
          'northeast','northwest','southeast','southwest',
          'pacific','atlantic']

combine = state_name + state_abb + others

print('--- list of state names and their abbrevations ---')
print('')
print(combine)
print('')
regex = '|'.join(['\\b' + w + '\\b' for w in combine])

print('--- some examples ---')
print('')

title = 'Area Manager - CA'
print('"' + title + '" : match output = ' + '|'.join(re.findall(regex,title,re.IGNORECASE)))

title = 'Area Manager - ca'
print('"' + title + '" : match output = ' + '|'.join(re.findall(regex,title,re.IGNORECASE)))

title = 'we can do this'
print('"' + title + '" : match output = ' + '|'.join(re.findall(regex,title,re.IGNORECASE)))

--- list of state names and their abbrevations ---

['alabama', 'alaska', 'arizona', 'arkansas', 'california', 'colorado', 'connecticut', 'delaware', 'florida', 'georgia', 'hawaii', 'idaho', 'illinois', 'indiana', 'iowa', 'kansas', 'kentucky', 'louisiana', 'maine', 'maryland', 'massachusetts', 'michigan', 'minnesota', 'mississippi', 'missouri', 'montana', 'nebraska', 'nevada', 'new hampshire', 'new jersey', 'new mexico', 'new york', 'north carolina', 'north dakota', 'ohio', 'oklahoma', 'oregon', 'pennsylvania', 'rhode island', 'south carolina', 'south dakota', 'tennessee', 'texas', 'utah', 'vermont', 'virginia', 'washington', 'west virginia', 'wisconsin', 'wyoming', 'al', 'ak', 'az', 'ar', 'ca', 'co', 'ct', 'de', 'fl', 'ga', 'hi', 'id', 'il', 'in', 'ia', 'ks', 'ky', 'la', 'me', 'md', 'ma', 'mi', 'mn', 'ms', 'mo', 'mt', 'ne', 'nv', 'nh', 'nj', 'nm', 'ny', 'nc', 'nd', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx', 'ut', 'vt', 'va', 'wa', 'wv', 'wi', 'wy', 'dc', 'north', 'south', '

* As seen above, a regular expression can be used to detect "CA" in the job title "Area Manager - CA". The "re.IGNORECASE" option is added to also capture "Area Manager - ca". Finally, the "\b" (word boundary) in the regular expression ensures that we do not detect "ca" in "we **ca**n do this".
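The word-boundary behaviour described above can be checked on a small, self-contained example. This is only a sketch: the term list is an illustrative subset (the notebook builds the full list from "state_name.csv"), and `build_regex` and `matches` are hypothetical helper names, not part of the original code. It also shows one caveat of case-insensitive matching: abbreviations such as "in" or "or" are ordinary English words, so they can be flagged as locations.

```python
import re

# Illustrative subset only; the notebook uses the full list of states and abbreviations.
terms = ["california", "new york", "ca", "ny", "in", "or"]

def build_regex(words):
    # Hypothetical helper: wrap each term in word boundaries.
    # re.escape guards against regex metacharacters inside a term.
    return "|".join(r"\b" + re.escape(w) + r"\b" for w in words)

regex = build_regex(terms)

def matches(title):
    # Return every location term found in the title, ignoring case.
    return re.findall(regex, title, re.IGNORECASE)

print(matches("Area Manager - CA"))    # ['CA']
print(matches("we can do this"))       # []
print(matches("Manager in Training"))  # ['in']  (false positive for Indiana)
```

If such false positives matter, one possible refinement is to match two-letter abbreviations case-sensitively (upper case only) while keeping full state names case-insensitive.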

In [3]:
def detect_location(title,regex):
    # check if the job title itself contains information on location (state, region)
    if re.findall(regex,title,re.IGNORECASE):
        output = 1 # return 1 if yes
    else:
        output = 0 # return 0 if no
    return output

def count_word(title):
    # tokenize, drop English stop words, and keep only tokens that contain letters
    tokens = [w for w in nltk.word_tokenize(title.lower()) if w not in stop_words]
    selected_tokens = [w for w in tokens if re.findall(r'[a-z]+',w)]
    return len(selected_tokens)

# create a new binary variable 'titleloc', which equals 1 if the title mentions a location
df['titleloc'] = df['original_jobtitle'].apply(lambda x: detect_location(x,regex))

# remove location from job titles
df['jobtitle'] = df['original_jobtitle'].apply(lambda x: re.sub(regex,'',x.lower()))

# count number of words in the job titles
df['countword'] = df['jobtitle'].apply(lambda x: count_word(x))

# see examples
df[['original_jobtitle','titleloc','jobtitle','countword']][df.titleloc == 1].head()

| | original_jobtitle | titleloc | jobtitle | countword |
|---|---|---|---|---|
| 2 | Human Resources Manager, AL | 1 | human resources manager, | 3 |
| 9 | General Manager - Pacific Region | 1 | general manager - region | 3 |
| 33 | Area Manager - KY | 1 | area manager - | 2 |
| 36 | Associate Finance Director - OR | 1 | associate finance director - | 3 |
| 40 | Territory Manager / Pittsburgh North | 1 | territory manager / pittsburgh | 3 |


* As seen above, the newly created "titleloc" feature equals one when location information is detected. The "jobtitle" feature converts the original job title from the posting, "original_jobtitle", to lowercase and removes the detected location terms. The "countword" feature counts the number of words in "jobtitle".
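The examples above show that removing a location term can leave a dangling separator behind (for instance "area manager -"). A minimal sketch of one way to tidy this up is given below; `strip_location` is a hypothetical helper that is not part of the notebook, and the two regular expressions passed to it are stand-ins for the full location pattern.

```python
import re

def strip_location(title, regex):
    """Hypothetical helper: remove location terms, then tidy what is left."""
    cleaned = re.sub(regex, "", title.lower())   # drop location terms
    cleaned = re.sub(r"\s+", " ", cleaned)       # collapse repeated whitespace
    return cleaned.strip(" ,-/")                 # drop separators left at either end

print(strip_location("Area Manager - KY", r"\bky\b"))            # area manager
print(strip_location("Human Resources Manager, AL", r"\bal\b"))  # human resources manager
```

Because "countword" only counts tokens that contain letters, this cleanup would not change the word counts; it only affects the text that is later vectorized.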

In [4]:
# vectorize all features (see posting_vectorizer.py)
list_of_features = ['state','edu','naics','onet','jobtitle','titleloc','countword']

features_array, df_features = selected_features(df, list_of_features)
n_features = len(df_features)
print('Total number of features = ' + str(n_features))
df_features[['feat_type','feat_name']].groupby('feat_type').count()

Total number of features = 2018


| feat_type | feat_name |
|---|---|
| countword | 3 |
| edu | 5 |
| jobtitle | 1583 |
| naics | 335 |
| onet | 40 |
| state | 51 |
| titleloc | 1 |


All types of features:
1. "countword" : number of words in the job title (polynomial terms up to the third degree).
2. "edu" : education requirement ("high_shool", "associate", "bachelor", "master", "phd").
3. "jobtitle" : count-vectorized job titles (CountVectorizer).
4. "naics" : industry code.
5. "onet" : occupation code (O*NET).
6. "state" : location of a company (50 states + DC).
7. "titleloc" : a binary variable (=1 if we detect location information in the job title).
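The three "countword" features are polynomial terms of the word count. The actual construction lives in posting_vectorizer.py, which is not shown here, so the following is only a sketch of what such an expansion might look like; `poly_terms` is a hypothetical name.

```python
import numpy as np

def poly_terms(counts, degree=3):
    """Hypothetical sketch: stack count, count^2, ..., count^degree as columns."""
    counts = np.asarray(counts, dtype=float)
    return np.column_stack([counts ** d for d in range(1, degree + 1)])

# A two-word and a three-word title.
print(poly_terms([2, 3]))
# [[ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

Including the squared and cubed terms lets a linear model capture a non-monotonic relationship between title length and ROE.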

In [5]:
pct_loc = (df.titleloc == 1).mean() * 100 # share of postings with titleloc = 1, in percent
print('Percentage of ads with information on location in the job titles = ' + str(pct_loc))

Percentage of ads with information on location in the job titles = 14.67


### Split into train and test set

In [6]:
x = features_array
y = df.roe

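# split row indices 70/30; only the 30% part (itest) is kept
# NOTE: the mask below is built from itest, so the rows labelled "train set"
# are the 30% split and the rows labelled "test set" are the remaining 70%.
# To train on 70% instead, build the mask from the first return value.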
_, itest = train_test_split(range(x.shape[0]), 
                            train_size=0.7, 
                            test_size=0.3, 
                            random_state = 0)

mask = np.zeros(x.shape[0], dtype=bool) # np.bool was removed in recent NumPy releases
mask[itest] = True

# train set
x_train = x[mask]
y_train = y[mask]

# test set
x_test = x[~mask]
y_test = y[~mask]
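When a boolean mask is used for the split, it helps to build it from the training rows so that the mask and the variable names agree. The sketch below is an assumption-laden alternative, not the code that produced the results in this notebook: it swaps in a NumPy permutation for `train_test_split`, so it will not select the same rows, and `split_mask` is a hypothetical helper.

```python
import numpy as np

def split_mask(n_rows, train_fraction=0.7, seed=0):
    """Hypothetical helper: boolean mask that is True for training rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)            # shuffled row indices
    n_train = int(round(train_fraction * n_rows))
    mask = np.zeros(n_rows, dtype=bool)
    mask[order[:n_train]] = True               # first 70% of the shuffle -> train
    return mask

mask = split_mask(10)
print(mask.sum(), (~mask).sum())  # 7 3
```

With such a mask, `x[mask]` and `y[mask]` hold the training rows and `x[~mask]` and `y[~mask]` hold the test rows.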

### Models

I use the following models with default settings: 

1. Linear Regression
2. Ridge Regression 
3. Lasso
4. Stochastic Gradient Descent (SGDRegressor)
5. Multi-layer Perceptron Regressor (MLPRegressor) 
6. Random Forest Regressor

Models are compared by mean squared error (MSE); lower is better.
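To put MSE values in context, it can help to compare them with a naive baseline that always predicts the mean ROE of the training set. The sketch below is not part of the original analysis; `mse` and `baseline_mse` are hypothetical helpers and the numbers are toy values.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def baseline_mse(y_train, y_test):
    """MSE of a constant prediction equal to the training-set mean."""
    constant = float(np.mean(y_train))
    return mse(y_test, np.full(len(y_test), constant))

print(mse([1, 2, 3], [1, 2, 5]))     # 1.3333333333333333
print(baseline_mse([0, 2], [1, 3]))  # 2.0
```

A model whose test MSE is not clearly below this baseline is learning little beyond the average ROE.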

In [7]:
def compute_mse(lm, x ,y):
    mse = np.mean((y - lm.predict(x)) ** 2) # compute mean squared error
    return mse

def format_mse(mse):
    return "{0:.4f}".format(mse)

# Linear Regression
lm = LinearRegression()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Linear Regression ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Ridge Regression
lm = Ridge()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Ridge Regression ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Lasso Regression
lm = Lasso()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Lasso Regression ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Stochastic Gradient Descent
lm = SGDRegressor()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Stochastic Gradient Descent ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# Multi-layer Perceptron Regressor
lm = MLPRegressor()
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Multi-layer Perceptron Regressor ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

# RandomForestRegressor
lm = RandomForestRegressor(random_state=0)
lm.fit(x_train, y_train)
mse_train = compute_mse(lm, x_train ,y_train)
mse_test = compute_mse(lm, x_test ,y_test)

print('--- Random Forest Regressor ---')
print('MSE (train set) = ' + format_mse(mse_train))
print('MSE (test set) = ' + format_mse(mse_test))
print('')

--- Linear Regression ---
MSE (train set) = 0.0120
MSE (test set) = 0.0896

--- Ridge Regression ---
MSE (train set) = 0.0192
MSE (test set) = 0.0355

--- Lasso Regression ---
MSE (train set) = 0.0610
MSE (test set) = 0.0605





--- Stochastic Gradient Descent ---
MSE (train set) = 0.0469
MSE (test set) = 0.0490

--- Multi-layer Perceptron Regressor ---
MSE (train set) = 0.0341
MSE (test set) = 0.0507

--- Random Forest Regressor ---
MSE (train set) = 0.0057
MSE (test set) = 0.0328



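* Overall, ridge regression performs relatively well: its test MSE (0.0355) is second only to the random forest (0.0328) and well below that of plain linear regression (0.0896), which overfits badly (train MSE 0.0120).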

### Ridge Regression Results

This sub-section explores some interesting results from the ridge regression, refit on the full dataset.
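For intuition about what the ridge coefficients mean, recall that ridge regression minimizes the squared error plus `alpha` times the squared norm of the coefficients, which shrinks them toward zero. The sketch below solves that problem in closed form on toy data. It is a simplification under stated assumptions: it omits the intercept (scikit-learn's `Ridge` fits one by default), and `ridge_closed_form` is a hypothetical helper, not the solver scikit-learn uses.

```python
import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    """Solve (X'X + alpha*I) w = X'y, i.e. ridge regression without an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

X = np.eye(2)              # toy design matrix
y = np.array([2.0, 4.0])

print(ridge_closed_form(X, y, alpha=0.0))  # [2. 4.]  (ordinary least squares)
print(ridge_closed_form(X, y, alpha=1.0))  # [1. 2.]  (shrunk toward zero)
```

With roughly two thousand sparse features, this shrinkage is a plausible reason why ridge generalizes better than unpenalized linear regression here.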

In [8]:
lm = Ridge()
lm.fit(x, y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [9]:
coef = lm.coef_ # one coefficient per feature (avoids overwriting the feature matrix x)
df_features['coef'] = coef # add coefficients to the dataframe
coef.shape

(2018,)

In [10]:
df_results_titleloc = df_features[['feat_type','feat_name','coef']][df_features.feat_type == 'titleloc']
df_results_titleloc

| | feat_type | feat_name | coef |
|---|---|---|---|
| 2014 | titleloc | titleloc | 0.01766 |


In [11]:
df_results_countword = df_features[['feat_type','feat_name','coef']][df_features.feat_type == 'countword']
df_results_countword

| | feat_type | feat_name | coef |
|---|---|---|---|
| 2015 | countword | countword_poly_1 | 0.004857 |
| 2016 | countword | countword_poly_2 | -0.000801 |
| 2017 | countword | countword_poly_3 | -1.9e-05 |


### Discussion

1. Job titles should not be too long. Ignoring the small cubic term, suppose the fitted polynomial is $ROE = a \cdot (word) + b \cdot (word)^2$ with $a = 0.004857$ and $b = -0.000801$. Because $b < 0$, predicted ROE peaks at $-\frac{a}{2b} = \frac{0.004857}{2 \times 0.000801} \approx 3.03$ words. This suggests that job titles should not exceed about 3-4 words.
2. Companies that put location information into job titles tend to perform better, as the ridge coefficient on "titleloc" is positive (~0.018). The causal relationship, however, is not clear. Most likely there is a confounding variable: expanding companies are the ones that are performing well, and as they expand into other areas they tend to put the new location into job titles.
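The arithmetic in the first point can be reproduced from the three "countword" coefficients reported above. The sketch below computes the quadratic-only optimum and, as an additional check that is not in the original discussion, the optimum when the cubic term is kept. Variable names are illustrative, and the coefficients are the rounded values from the results table.

```python
import math

# Rounded ridge coefficients for countword_poly_1, _2, _3.
a, b, c = 0.004857, -0.000801, -1.9e-05

# Quadratic approximation: maximize a*w + b*w^2  ->  w = -a / (2b).
w_quadratic = -a / (2 * b)

# Full cubic: set the derivative a + 2b*w + 3c*w^2 to zero and keep the positive root.
disc = (2 * b) ** 2 - 4 * (3 * c) * a
roots = [(-2 * b + sign * math.sqrt(disc)) / (2 * 3 * c) for sign in (1, -1)]
w_cubic = [r for r in roots if r > 0][0]

print(round(w_quadratic, 2))  # 3.03
print(round(w_cubic, 2))
```

Keeping the cubic term moves the optimum slightly below three words, so the qualitative conclusion about short job titles is unchanged.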