# Salary Predictions Based on Job Descriptions

## Part 1 - Define

**Author:**    Kevin O'Hara <br>
**Email:** kohara91@gmail.com <br>
**LinkedIn:**  https://www.linkedin.com/in/kevinodata/

Searching for jobs induces a sense of uncertainty. Will we find the right one? Will we enjoy our new responsibilities? And, of course, how will we be compensated? 

There are plenty of other things to consider, but this last aspect is something we can shed light on. There are online resources available to obtain this information, but we can also get an idea with machine learning. 

Using an imported list of job postings and their associated salaries, I will develop a model to predict salary from the features of each posting. The model's effectiveness will be evaluated using the mean squared error (MSE). This tool can prove useful both for businesses and for individuals searching for new careers: HR can use an effective model to determine an adequate salary to offer potential hires, and individuals can use it to request a reasonable salary. I will aim for an MSE below 360 before testing and deploying this model.
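
For reference, the mean squared error over $n$ postings is the average squared difference between the actual salary $y_i$ and the predicted salary $\hat{y}_i$. The symbols are notation introduced here for illustration, not names taken from the dataset:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

An MSE of 360 corresponds to a root mean squared error of about 19 in the units of the salary column.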

### Import Libraries

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
import joblib
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoLars
# Salary is a continuous target, so import the regressor variants
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline

In [40]:
# Function to display minimum and maximum values in a DataFrame
def min_and_max(df, axis=0):
    # axis=0 reports the min/max of each column; axis=1 would do so per row
    smallest = df.min(axis=axis)
    largest = df.max(axis=axis)
    print(smallest)
    print(largest)

# For cleaning our data to remove $0 salaries
def clean_this(df):
    # Keep only the rows with a positive salary
    return df[df.salary > 0]
    

## Part 2 - Discover

### Loading our datasets

In [6]:
train_salaries = pd.read_csv('data/train_salaries.csv')
train_job_postings = pd.read_csv('data/train_features.csv')
test_job_postings = pd.read_csv('data/test_features.csv')

### First Examination

In [8]:
train_salaries.head()

| | jobId | salary |
|---|---|---|
| 0 | JOB1362684407687 | 130 |
| 1 | JOB1362684407688 | 101 |
| 2 | JOB1362684407689 | 137 |
| 3 | JOB1362684407690 | 142 |
| 4 | JOB1362684407691 | 163 |


In [9]:
train_salaries.shape

(1000000, 2)

In [10]:
train_salaries.isnull().any()

jobId     False
salary    False
dtype: bool

In [11]:
train_job_postings.head()

| | jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis |
|---|---|---|---|---|---|---|---|---|
| 0 | JOB1362684407687 | COMP37 | CFO | MASTERS | MATH | HEALTH | 10 | 83 |
| 1 | JOB1362684407688 | COMP19 | CEO | HIGH_SCHOOL | NONE | WEB | 3 | 73 |
| 2 | JOB1362684407689 | COMP52 | VICE_PRESIDENT | DOCTORAL | PHYSICS | HEALTH | 10 | 38 |
| 3 | JOB1362684407690 | COMP38 | MANAGER | DOCTORAL | CHEMISTRY | AUTO | 8 | 17 |
| 4 | JOB1362684407691 | COMP7 | VICE_PRESIDENT | BACHELORS | PHYSICS | FINANCE | 8 | 16 |


In [12]:
train_job_postings.shape

(1000000, 8)

In [13]:
train_job_postings.isnull().any()

jobId                  False
companyId              False
jobType                False
degree                 False
major                  False
industry               False
yearsExperience        False
milesFromMetropolis    False
dtype: bool

In [14]:
train_job_postings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
jobId                  1000000 non-null object
companyId              1000000 non-null object
jobType                1000000 non-null object
degree                 1000000 non-null object
major                  1000000 non-null object
industry               1000000 non-null object
yearsExperience        1000000 non-null int64
milesFromMetropolis    1000000 non-null int64
dtypes: int64(2), object(6)
memory usage: 61.0+ MB


From here, we can verify that the data is loaded and see that there are 1,000,000 rows to work with. With .head() and .info(), we can see that most of the features are categorical and that there are no missing values (no nulls). We will need to encode several of our features to get them ready for modeling. Fortunately, since there are no missing values, we will not need to impute anything.

We can also see that the salaries and additional features are linked by a jobId. This allows us to stitch them together with the id to get them all on one dataframe.

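The last dataset, test_features.csv, contains all of the same features but no salaries, so it will act as our test data. Because it has no salaries, we cannot score predictions against it; we will instead use it to generate predictions once the best model has been chosen.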

### Cleaning/Pre-processing

The train_job_postings dataframe contains all of our features of interest. Of those features, 'jobType' and 'degree' are ordinal and could be encoded as such. We can assume that roles higher in a company carry a higher salary, so CFO, CEO, CTO, and VICE_PRESIDENT would rank above JUNIOR positions. We can also assume that higher degrees tend to increase the potential salary, so we could encode them ordinally as well.

For now, we will stitch the dfs together using jobId and make sure that our new df has values we would expect.

In [19]:
train_job_postings['jobType'].unique()

array(['CFO', 'CEO', 'VICE_PRESIDENT', 'MANAGER', 'JUNIOR', 'JANITOR',
       'CTO', 'SENIOR'], dtype=object)

In [20]:
train_job_postings['degree'].unique()

array(['MASTERS', 'HIGH_SCHOOL', 'DOCTORAL', 'BACHELORS', 'NONE'],
      dtype=object)

In [21]:
train_job_postings['industry'].unique()

array(['HEALTH', 'WEB', 'AUTO', 'FINANCE', 'EDUCATION', 'OIL', 'SERVICE'],
      dtype=object)
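
The ordinal encoding idea described above could be sketched as follows. This is only an illustration: the rank orders in `JOB_TYPE_ORDER` and `DEGREE_ORDER`, the helper `encode_ordinal`, and the `_rank` column names are assumptions introduced here, and the relative order of the executive roles in particular has not been checked against the salary data.

```python
import pandas as pd

# Assumed rank orders (hypothetical; lowest to highest expected salary)
JOB_TYPE_ORDER = ['JANITOR', 'JUNIOR', 'SENIOR', 'MANAGER',
                  'VICE_PRESIDENT', 'CFO', 'CTO', 'CEO']
DEGREE_ORDER = ['NONE', 'HIGH_SCHOOL', 'BACHELORS', 'MASTERS', 'DOCTORAL']


def encode_ordinal(df):
    """Return a copy of df with integer rank columns for jobType and degree."""
    out = df.copy()
    job_rank = {value: rank for rank, value in enumerate(JOB_TYPE_ORDER)}
    degree_rank = {value: rank for rank, value in enumerate(DEGREE_ORDER)}
    out['jobType_rank'] = out['jobType'].map(job_rank)
    out['degree_rank'] = out['degree'].map(degree_rank)
    return out


# A few rows copied from the head() output shown earlier
sample = pd.DataFrame({
    'jobType': ['CFO', 'CEO', 'VICE_PRESIDENT', 'MANAGER'],
    'degree': ['MASTERS', 'HIGH_SCHOOL', 'DOCTORAL', 'DOCTORAL'],
})
encoded = encode_ordinal(sample)
print(encoded)
```

Comparing the mean salary per category during EDA would be one way to confirm or correct the assumed orders before relying on them.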

In [24]:
train_data = pd.merge(train_salaries, train_job_postings, on='jobId')

# Verify the merge occurred properly
train_data.head()

| | jobId | salary | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB1362684407687 | 130 | COMP37 | CFO | MASTERS | MATH | HEALTH | 10 | 83 |
| 1 | JOB1362684407688 | 101 | COMP19 | CEO | HIGH_SCHOOL | NONE | WEB | 3 | 73 |
| 2 | JOB1362684407689 | 137 | COMP52 | VICE_PRESIDENT | DOCTORAL | PHYSICS | HEALTH | 10 | 38 |
| 3 | JOB1362684407690 | 142 | COMP38 | MANAGER | DOCTORAL | CHEMISTRY | AUTO | 8 | 17 |
| 4 | JOB1362684407691 | 163 | COMP7 | VICE_PRESIDENT | BACHELORS | PHYSICS | FINANCE | 8 | 16 |


Now that we've merged the dataframes, we will examine the whole table and make sure it's clean. The only columns that could be suspect are the numerical ones: salary, yearsExperience, and milesFromMetropolis. The categorical features all have unique values that make sense. We can check the range of the numerical values to see whether it looks normal.

In [46]:
# First we'll split our numerical and categorical features
cat_train_data = ['companyId', 'jobType', 'degree', 'major', 'industry']
num_train_data = ['jobId', 'salary', 'yearsExperience', 'milesFromMetropolis']
num_df = train_data[num_train_data]

min_and_max(num_df)

jobId                  JOB1362684407687
salary                                0
yearsExperience                       0
milesFromMetropolis                   0
dtype: object
jobId                  JOB1362685407686
salary                              301
yearsExperience                      24
milesFromMetropolis                  99
dtype: object
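
The minimum salary of 0 in the output above is not a plausible value for a job posting, which is presumably what the `clean_this` helper is meant to handle. A minimal sketch of that cleaning step is shown here on a small made-up frame; the `toy` data and the `reset_index` call are assumptions added for illustration, and the real step would operate on `train_data`.

```python
import pandas as pd

# Toy stand-in for train_data (values invented for illustration)
toy = pd.DataFrame({
    'jobId': ['JOB1', 'JOB2', 'JOB3', 'JOB4'],
    'salary': [130, 0, 137, 0],
})

# Count the suspicious rows before dropping them
n_zero = int((toy['salary'] <= 0).sum())
print(f'rows with non-positive salary: {n_zero}')

# Same filter that clean_this applies, followed by a fresh index
cleaned = toy[toy['salary'] > 0].reset_index(drop=True)
print(cleaned)
```

Counting the affected rows first makes it easy to confirm that only a small share of the data is being discarded.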


### ---- 4 Explore the data (EDA) ----

In [3]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features
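
One possible way to carry out the steps listed above is sketched here on the five rows shown in the earlier head() output. With so few rows the numbers mean nothing; the point is only the shape of the calls, and the variable names are hypothetical.

```python
import pandas as pd

# The five merged rows displayed earlier (a subset of the columns)
toy = pd.DataFrame({
    'salary': [130, 101, 137, 142, 163],
    'industry': ['HEALTH', 'WEB', 'HEALTH', 'AUTO', 'FINANCE'],
    'yearsExperience': [10, 3, 10, 8, 8],
    'milesFromMetropolis': [83, 73, 38, 17, 16],
})

# Summarize the target variable
target_summary = toy['salary'].describe()

# Relate a categorical feature to the target via group means
industry_means = (toy.groupby('industry')['salary']
                     .mean()
                     .sort_values(ascending=False))

# Correlation between the numeric features and the target, and among features
numeric_corr = toy[['salary', 'yearsExperience', 'milesFromMetropolis']].corr()

print(target_summary)
print(industry_means)
print(numeric_corr)
```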

### ---- 5 Establish a baseline ----

In [5]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
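
The baseline described above (predict the average salary of each industry, scored with 5-fold cross-validation) might look like the sketch below. The function name `baseline_cv_mse`, its parameters, and the synthetic data are assumptions made for illustration; group means are computed on the training folds only so that the validation fold does not leak into the prediction.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def baseline_cv_mse(df, group_col='industry', target='salary',
                    n_splits=5, seed=0):
    """Mean k-fold MSE when each row is predicted as its group's training mean."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, valid_idx in kf.split(df):
        train, valid = df.iloc[train_idx], df.iloc[valid_idx]
        group_means = train.groupby(group_col)[target].mean()
        # Fall back to the overall training mean for groups unseen in this fold
        preds = valid[group_col].map(group_means).fillna(train[target].mean())
        fold_mse.append(mean_squared_error(valid[target], preds))
    return float(np.mean(fold_mse))


# Synthetic data for illustration only (not the real salaries)
rng = np.random.default_rng(0)
industries = rng.choice(['HEALTH', 'WEB', 'AUTO', 'FINANCE'], size=200)
base = pd.Series(industries).map(
    {'HEALTH': 115, 'WEB': 120, 'AUTO': 110, 'FINANCE': 130})
toy = pd.DataFrame({'industry': industries,
                    'salary': base + rng.normal(0, 10, size=200)})

mse = baseline_cv_mse(toy)
print(mse)
```

Any candidate model would then need to beat this number on the real data to justify its added complexity.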

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model

### ---- 8 Create models ----

In [15]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [1]:
#do 5-fold cross validation on models and measure MSE
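
A minimal sketch of 5-fold cross-validation scored by MSE, using `cross_val_score` from the imports above. The data here is synthetic and the choice of `LinearRegression` with one-hot encoded `degree` is only an example, not the selected model. Note that scikit-learn reports the negated MSE for this scorer, so the sign is flipped before reading the result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'degree': rng.choice(['NONE', 'BACHELORS', 'MASTERS'], size=n),
    'yearsExperience': rng.integers(0, 25, size=n),
})
toy['salary'] = 80 + 2 * toy['yearsExperience'] + rng.normal(0, 5, size=n)

# One-hot encode the categorical column so the linear model can use it
X = pd.get_dummies(toy[['degree', 'yearsExperience']],
                   columns=['degree'], dtype=float)
y = toy['salary']

# scoring='neg_mean_squared_error' returns negative MSE values, one per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
cv_mse = -scores.mean()
print(cv_mse)
```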

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
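
The pipeline steps above (train on the full training set, save the model to disk, score the test set) could be sketched like this. Everything here is illustrative: the data is synthetic, `LinearRegression` stands in for whichever model is eventually selected, and the file name `salary_model.joblib` and the temporary directory are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the training and test sets
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'yearsExperience': rng.integers(0, 25, size=100),
    'milesFromMetropolis': rng.integers(0, 100, size=100),
})
train['salary'] = (100 + 2 * train['yearsExperience']
                   - 0.4 * train['milesFromMetropolis']
                   + rng.normal(0, 5, size=100))
test = pd.DataFrame({
    'jobId': ['JOB_A', 'JOB_B'],
    'yearsExperience': [5, 20],
    'milesFromMetropolis': [50, 10],
})

features = ['yearsExperience', 'milesFromMetropolis']

# Train on the entire training set
model = LinearRegression().fit(train[features], train['salary'])

# Save the fitted model to disk (hypothetical path in a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), 'salary_model.joblib')
joblib.dump(model, model_path)

# Reload the model and score the test set
reloaded = joblib.load(model_path)
predictions = pd.DataFrame({
    'jobId': test['jobId'],
    'salary': reloaded.predict(test[features]),
})
print(predictions)
```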

### ---- 12 Deploy solution ----

In [16]:
#save your prediction to a csv file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your prediction and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
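
As a rough sketch of the deployment step, the snippet below writes predictions to a CSV file and collects feature importances into a sorted Series that could feed a bar chart. The synthetic data, the use of `GradientBoostingRegressor`, and the file name `predictions.csv` are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-ins for the training and test sets
rng = np.random.default_rng(0)
features = ['yearsExperience', 'milesFromMetropolis']
train = pd.DataFrame({
    'yearsExperience': rng.integers(0, 25, size=200),
    'milesFromMetropolis': rng.integers(0, 100, size=200),
})
train['salary'] = (100 + 2 * train['yearsExperience']
                   - 0.4 * train['milesFromMetropolis']
                   + rng.normal(0, 5, size=200))
test = pd.DataFrame({
    'jobId': ['JOB_A', 'JOB_B'],
    'yearsExperience': [5, 20],
    'milesFromMetropolis': [50, 10],
})

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(train[features], train['salary'])

# Save the predictions to a CSV file (hypothetical path in a temp directory)
predictions = pd.DataFrame({
    'jobId': test['jobId'],
    'salary': model.predict(test[features]),
})
out_path = os.path.join(tempfile.mkdtemp(), 'predictions.csv')
predictions.to_csv(out_path, index=False)

# Summarize feature importances, largest first
importances = (pd.Series(model.feature_importances_, index=features)
                 .sort_values(ascending=False))
print(importances)
```

Calling `importances.plot(kind='barh')` would give a simple chart for stakeholders.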

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data