# Salary Predictions Based on Job Descriptions

## Part 1 - Define

Using a general list of job postings and their associated salaries, I will develop a model that predicts a salary from the features of a job posting. Its effectiveness will be evaluated using the mean squared error (MSE).
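
As a reminder of what the metric measures, here is a minimal sketch of the mean squared error. The helper name `mean_squared_error` and the predicted values are made up for illustration; only the three actual salaries come from the head of the training data.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Actual salaries from the first rows of the training data;
# the predictions are invented purely for this example.
actual = [130, 101, 137]
predicted = [120, 105, 140]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the errors are squared, a few large misses raise the score much more than many small ones.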

In [8]:
# Libraries used are imported here
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Kevin O'Hara
# kohara91@gmail.com

## Part 2 - DISCOVER

### ---- 2 Load the data ----

First, we need to load the training and test datasets. We then look at the first few rows of each to get a general idea of the features they contain.

In [9]:
salaries = pd.read_csv('data/train_salaries.csv')

In [6]:
salaries.head()

Unnamed: 0,jobId,salary
0,JOB1362684407687,130
1,JOB1362684407688,101
2,JOB1362684407689,137
3,JOB1362684407690,142
4,JOB1362684407691,163


In [10]:
job_postings = pd.read_csv('data/train_features.csv')

In [20]:
job_postings.head()

Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,JOB1362684407687,COMP37,CFO,MASTERS,MATH,HEALTH,10,83
1,JOB1362684407688,COMP19,CEO,HIGH_SCHOOL,NONE,WEB,3,73
2,JOB1362684407689,COMP52,VICE_PRESIDENT,DOCTORAL,PHYSICS,HEALTH,10,38
3,JOB1362684407690,COMP38,MANAGER,DOCTORAL,CHEMISTRY,AUTO,8,17
4,JOB1362684407691,COMP7,VICE_PRESIDENT,BACHELORS,PHYSICS,FINANCE,8,16


In [11]:
new_job_postings = pd.read_csv('data/test_features.csv')

In [22]:
new_job_postings.head()

Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,JOB1362685407687,COMP33,MANAGER,HIGH_SCHOOL,NONE,HEALTH,22,73
1,JOB1362685407688,COMP13,JUNIOR,NONE,NONE,AUTO,20,47
2,JOB1362685407689,COMP10,CTO,MASTERS,BIOLOGY,HEALTH,17,9
3,JOB1362685407690,COMP21,MANAGER,HIGH_SCHOOL,NONE,OIL,14,96
4,JOB1362685407691,COMP36,JUNIOR,DOCTORAL,BIOLOGY,OIL,10,44


### ---- 3 Clean the data ----

Now that the data is loaded, we can start to look for missing values and remove what we find unnecessary for our model.

In [26]:
salaries.isnull().sum()

jobId     0
salary    0
dtype: int64

In [27]:
job_postings.isnull().sum()

jobId                  0
companyId              0
jobType                0
degree                 0
major                  0
industry               0
yearsExperience        0
milesFromMetropolis    0
dtype: int64

In [28]:
new_job_postings.isnull().sum()

jobId                  0
companyId              0
jobType                0
degree                 0
major                  0
industry               0
yearsExperience        0
milesFromMetropolis    0
dtype: int64

Nothing seems to be missing. Now let's find out what unique values the categorical (string) features take. We will then turn those strings into integers so we can use them in our model.

In [12]:
job_postings['degree'].unique()

array(['MASTERS', 'HIGH_SCHOOL', 'DOCTORAL', 'BACHELORS', 'NONE'],
      dtype=object)

In [15]:
job_postings['industry'].unique()

array(['HEALTH', 'WEB', 'AUTO', 'FINANCE', 'EDUCATION', 'OIL', 'SERVICE'],
      dtype=object)
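
Rather than checking one column at a time, the number of distinct values in every categorical column can be counted in one step. The sketch below assumes a small stand-in dataframe called `toy`, built from the rows shown in the head of the training features. It is not the full dataset.

```python
import pandas as pd

# Toy stand-in for job_postings, copied from the five rows shown earlier
toy = pd.DataFrame({
    "jobType": ["CFO", "CEO", "VICE_PRESIDENT", "MANAGER", "VICE_PRESIDENT"],
    "degree": ["MASTERS", "HIGH_SCHOOL", "DOCTORAL", "DOCTORAL", "BACHELORS"],
    "major": ["MATH", "NONE", "PHYSICS", "CHEMISTRY", "PHYSICS"],
    "industry": ["HEALTH", "WEB", "HEALTH", "AUTO", "FINANCE"],
    "yearsExperience": [10, 3, 10, 8, 8],
})

# Columns that hold strings and will need encoding
categorical_cols = ["jobType", "degree", "major", "industry"]

# Number of distinct values per categorical column
unique_counts = toy[categorical_cols].nunique()
print(unique_counts)
```

On the real data, the same call on `job_postings[categorical_cols]` would show how many categories each feature has.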

After examining a few of these columns, we see that the values all make sense. Converting them into integers should not be a problem and should not require any further cleaning.

In [16]:
# We need to grab our label encoder from sklearn first
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

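# Then we'll apply it to our columns. Each fit_transform call refits the same
# encoder, so afterwards it only remembers the mapping for the last column.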
job_postings['jobType'] = label_encoder.fit_transform(job_postings['jobType'])
job_postings['degree'] = label_encoder.fit_transform(job_postings['degree'])
job_postings['major'] = label_encoder.fit_transform(job_postings['major'])
job_postings['industry'] = label_encoder.fit_transform(job_postings['industry'])

In [17]:
job_postings.head()

Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,JOB1362684407687,COMP37,1,3,6,3,10,83
1,JOB1362684407688,COMP19,0,2,7,6,3,73
2,JOB1362684407689,COMP52,7,1,8,3,10,38
3,JOB1362684407690,COMP38,5,1,2,0,8,17
4,JOB1362684407691,COMP7,7,0,8,2,8,16
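
If the same encoding later has to be applied to the test features, one option is to keep a separate encoder per column. This is a sketch under assumptions, not the approach used in this notebook: the `train` and `test` frames are toy data and the `encoders` dictionary is a name introduced here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training and test frames (illustrative values only)
train = pd.DataFrame({
    "degree": ["MASTERS", "HIGH_SCHOOL", "DOCTORAL", "DOCTORAL", "BACHELORS"],
    "industry": ["HEALTH", "WEB", "HEALTH", "AUTO", "FINANCE"],
})
test = pd.DataFrame({
    "degree": ["HIGH_SCHOOL", "MASTERS"],
    "industry": ["HEALTH", "AUTO"],
})

# One encoder per column, fitted on the training data only
encoders = {}
for col in ["degree", "industry"]:
    enc = LabelEncoder()
    train[col] = enc.fit_transform(train[col])
    # Reuse the fitted mapping; transform raises ValueError on unseen labels
    test[col] = enc.transform(test[col])
    encoders[col] = enc

# The stored encoder can translate integers back into the original strings
print(encoders["degree"].inverse_transform([3]))
```

`LabelEncoder` assigns integers in sorted order of the labels, so the numbers carry no meaning beyond identifying a category.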


Great, the label encoder has done what we needed it to do. Lastly, we drop the companyId column, which we assume will not help our model, and merge the salaries and job_postings tables on the jobId column. Once everything is in the same dataframe, we can start to look for correlations and decide which model to use.

In [18]:
job_postings = job_postings.drop('companyId', axis = 1)

job_postings = pd.merge(salaries, job_postings, on = 'jobId')

In [19]:
job_postings.head()

Unnamed: 0,jobId,salary,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,JOB1362684407687,130,1,3,6,3,10,83
1,JOB1362684407688,101,0,2,7,6,3,73
2,JOB1362684407689,137,7,1,8,3,10,38
3,JOB1362684407690,142,5,1,2,0,8,17
4,JOB1362684407691,163,7,0,8,2,8,16
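
A merge like this silently duplicates rows if a jobId appears more than once in either table. As a hedged sketch on toy data, pandas can check this through the `validate` argument of `pd.merge`. The frame names `salaries_toy` and `features_toy` are introduced here for illustration.

```python
import pandas as pd

# Toy versions of the two tables, using jobIds from the head shown earlier
salaries_toy = pd.DataFrame({
    "jobId": ["JOB1362684407687", "JOB1362684407688", "JOB1362684407689"],
    "salary": [130, 101, 137],
})
features_toy = pd.DataFrame({
    "jobId": ["JOB1362684407687", "JOB1362684407688", "JOB1362684407689"],
    "yearsExperience": [10, 3, 10],
})

# validate="one_to_one" raises MergeError if jobId is duplicated on either side
merged = pd.merge(
    salaries_toy, features_toy, on="jobId", how="inner", validate="one_to_one"
)

# With an inner join, dropped rows show up as a smaller row count
print(len(merged), len(features_toy))
```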


In [23]:
job_postings = job_postings.drop('jobId', axis = 1)

In [24]:
job_postings.head()

Unnamed: 0,salary,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,130,1,3,6,3,10,83
1,101,0,2,7,6,3,73
2,137,7,1,8,3,10,38
3,142,5,1,2,0,8,17
4,163,7,0,8,2,8,16


### ---- 4 Explore the data (EDA) ----

In [3]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features
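
The steps listed above could be sketched as follows. This uses a five-row toy frame taken from the head of the merged data, so the numbers it produces say nothing about the full dataset.

```python
import pandas as pd

# Toy frame built from the first five rows of the merged data
toy = pd.DataFrame({
    "salary": [130, 101, 137, 142, 163],
    "yearsExperience": [10, 3, 10, 8, 8],
    "milesFromMetropolis": [83, 73, 38, 17, 16],
})

# Summarize the target and each feature (count, mean, std, quartiles, ...)
summary = toy.describe()

# Correlation between each feature and the target
corr_with_target = toy.corr()["salary"].drop("salary")

# Correlation between the features themselves
feature_corr = toy.drop(columns="salary").corr()

print(summary)
print(corr_with_target)
print(feature_corr)
```

Pearson correlation on label-encoded categorical columns should be read with care, since the integer codes have no natural order.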

### ---- 5 Establish a baseline ----

In [5]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
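
One way to sketch the baseline described above is to predict the average training salary of each industry and score it with 5-fold cross-validation. The data below is synthetic: the industry effects and the noise level are assumptions chosen only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in data (industry names taken from the values seen earlier)
rng = np.random.default_rng(0)
n = 200
industries = ["AUTO", "FINANCE", "HEALTH", "OIL", "WEB"]
industry_shift = {"AUTO": 0, "FINANCE": 20, "HEALTH": 5, "OIL": 25, "WEB": 15}
toy = pd.DataFrame({"industry": rng.choice(industries, size=n)})
toy["salary"] = 100 + toy["industry"].map(industry_shift) + rng.normal(0, 10, size=n)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, valid_idx in kfold.split(toy):
    train, valid = toy.iloc[train_idx], toy.iloc[valid_idx]
    # The "model" is just the mean salary per industry in the training fold
    industry_means = train.groupby("industry")["salary"].mean()
    # Fall back to the overall training mean for industries unseen in this fold
    preds = valid["industry"].map(industry_means).fillna(train["salary"].mean())
    fold_mse.append(float(np.mean((valid["salary"] - preds) ** 2)))

baseline_mse = float(np.mean(fold_mse))
print(baseline_mse)
```

Computing the group means inside each fold keeps the validation salaries out of the predictions.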

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA and explain why they're reasonable solutions

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc.

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
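
As an illustration of the kinds of features mentioned in step 6, the sketch below adds a group summary statistic and an interaction variable to a toy frame. The column names `jobType_mean_salary` and `experience_x_distance` are hypothetical and were chosen for this example.

```python
import pandas as pd

# Toy frame built from the first five rows of the training data
toy = pd.DataFrame({
    "salary": [130, 101, 137, 142, 163],
    "jobType": ["CFO", "CEO", "VICE_PRESIDENT", "MANAGER", "VICE_PRESIDENT"],
    "yearsExperience": [10, 3, 10, 8, 8],
    "milesFromMetropolis": [83, 73, 38, 17, 16],
})

# Summary statistic for each group: mean salary of the row's jobType
toy["jobType_mean_salary"] = toy.groupby("jobType")["salary"].transform("mean")

# Interaction variable: product of two numeric features
toy["experience_x_distance"] = toy["yearsExperience"] * toy["milesFromMetropolis"]

print(toy)
```

Group statistics derived from the target should be computed on the training folds only. Otherwise the validation salaries leak into the features.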

### ---- 8 Create models ----

In [15]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [1]:
#do 5-fold cross validation on models and measure MSE

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model
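
Steps 9 and 10 could be sketched with scikit-learn's `cross_val_score`. The two candidate models and the synthetic data below are placeholders, not the models chosen for this project. Note that scikit-learn reports the negated MSE, so the sign is flipped before comparing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared feature matrix and salaries
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(200, 2))
y = 100 + 2.0 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(0, 5, size=200)

# Placeholder candidates; swap in whichever models were brainstormed
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, model in candidates.items():
    # scoring returns negative MSE so that "higher is better" holds
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = float(-scores.mean())

# The model with the lowest cross-validated MSE becomes the production model
best_name = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_name)
```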

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
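
A minimal sketch of such a script is shown below, assuming `joblib` for persistence and a linear model as a placeholder. The data is synthetic, and the file name `salary_model.joblib` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the full training set and the test features
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 25, size=(100, 2))
y_train = 100 + 2.0 * X_train[:, 0] + rng.normal(0, 5, size=100)
X_test = rng.uniform(0, 25, size=(10, 2))

# Train on the entire training set
model = LinearRegression().fit(X_train, y_train)

# Save the fitted model to disk (a temporary directory is used here)
model_path = os.path.join(tempfile.mkdtemp(), "salary_model.joblib")
joblib.dump(model, model_path)

# Load it back and score the test set
reloaded = joblib.load(model_path)
test_predictions = reloaded.predict(X_test)
print(test_predictions[:3])
```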

### ---- 12 Deploy solution ----

In [16]:
#save your prediction to a csv file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your prediction and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
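
The sketch below writes predictions to a CSV file and collects feature importances from a tree-based model. The random forest, the synthetic data, the `JOB0`-style ids, and the output file name are all assumptions made so the example is self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for the training data and test features
feature_names = ["yearsExperience", "milesFromMetropolis"]
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.uniform(0, 25, size=(100, 2)), columns=feature_names)
y_train = 100 + 2.0 * X_train["yearsExperience"] - 0.4 * X_train["milesFromMetropolis"]
X_test = pd.DataFrame(rng.uniform(0, 25, size=(5, 2)), columns=feature_names)
test_job_ids = [f"JOB{i}" for i in range(5)]  # hypothetical ids

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Save one predicted salary per jobId to a CSV file
predictions = pd.DataFrame({"jobId": test_job_ids, "salary": model.predict(X_test)})
output_path = os.path.join(tempfile.mkdtemp(), "test_salaries.csv")
predictions.to_csv(output_path, index=False)

# Feature importances, sorted from most to least important
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

For stakeholders, the same series could be drawn as a bar chart, for example with `importances.plot.barh()`.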

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data.