# Salary Predictions Based on Job Descriptions

## Part 1 - DEFINE

### ---- 1 Define the problem ----
Predict salaries based upon the job description.  

The HR department doesn't quite understand how to set salaries for each job. They would like to predict a job's salary from its job description. The supplied data will be used to determine how to appropriately set (predict) the salary for a given job description.

In [25]:
#import your libraries
import pandas as pd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
#etc


#your info here
__author__ = "Nicholas Arquette"
__email__ = "nicholas.arquette@gmail.com"

## Part 2 - DISCOVER

### ---- 2 Load the data ----



In [26]:
#load the data into a Pandas dataframe
test_features = pd.read_csv('data/test_features.csv', sep=',', low_memory=False)
test_features.head()


Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,JOB1362685407687,COMP33,MANAGER,HIGH_SCHOOL,NONE,HEALTH,22,73
1,JOB1362685407688,COMP13,JUNIOR,NONE,NONE,AUTO,20,47
2,JOB1362685407689,COMP10,CTO,MASTERS,BIOLOGY,HEALTH,17,9
3,JOB1362685407690,COMP21,MANAGER,HIGH_SCHOOL,NONE,OIL,14,96
4,JOB1362685407691,COMP36,JUNIOR,DOCTORAL,BIOLOGY,OIL,10,44


In [27]:
test_features.describe()

Unnamed: 0,yearsExperience,milesFromMetropolis
count,1000000.0,1000000.0
mean,12.002104,49.526414
std,7.213179,28.889713
min,0.0,0.0
25%,6.0,25.0
50%,12.0,50.0
75%,18.0,75.0
max,24.0,99.0


In [28]:
test_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
jobId                  1000000 non-null object
companyId              1000000 non-null object
jobType                1000000 non-null object
degree                 1000000 non-null object
major                  1000000 non-null object
industry               1000000 non-null object
yearsExperience        1000000 non-null int64
milesFromMetropolis    1000000 non-null int64
dtypes: int64(2), object(6)
memory usage: 61.0+ MB


In [30]:
test_features.jobType.value_counts(ascending=False)

VICE_PRESIDENT    125434
JANITOR           125253
SENIOR            125202
CFO               125092
JUNIOR            125022
CEO               124941
CTO               124665
MANAGER           124391
Name: jobType, dtype: int64

In [31]:
test_features.companyId.value_counts()

COMP13    16130
COMP41    16127
COMP54    16098
COMP56    16058
COMP61    16035
          ...  
COMP28    15670
COMP37    15644
COMP14    15638
COMP15    15611
COMP17    15595
Name: companyId, Length: 63, dtype: int64

In [32]:
test_features.degree.value_counts()

HIGH_SCHOOL    238255
NONE           237467
MASTERS        175236
DOCTORAL       175105
BACHELORS      173937
Name: degree, dtype: int64

In [33]:
test_features.major.value_counts()

NONE           534068
BIOLOGY         58804
ENGINEERING     58496
COMPSCI         58385
PHYSICS         58248
CHEMISTRY       58159
LITERATURE      58062
BUSINESS        57961
MATH            57817
Name: major, dtype: int64

In [34]:
test_features.industry.value_counts()

SERVICE      143161
FINANCE      143101
WEB          143012
HEALTH       142978
EDUCATION    142731
OIL          142535
AUTO         142482
Name: industry, dtype: int64

In [40]:
test_features.jobId.nunique()

1000000

In [53]:
duplicate = test_features[test_features.duplicated()]
duplicate

Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis


In [41]:
train_salaries = pd.read_csv('data/train_salaries.csv', sep=',', low_memory=False)
train_salaries.head()

Unnamed: 0,jobId,salary
0,JOB1362684407687,130
1,JOB1362684407688,101
2,JOB1362684407689,137
3,JOB1362684407690,142
4,JOB1362684407691,163


In [54]:
train_salaries[train_salaries.duplicated()]

Unnamed: 0,jobId,salary


In [62]:
#salaries that are zero or negative are treated as invalid
non_positive_salaries = train_salaries[train_salaries['salary'] <= 0]
non_positive_salaries

Unnamed: 0,jobId,salary
30559,JOB1362684438246,0
495984,JOB1362684903671,0
652076,JOB1362685059763,0
816129,JOB1362685223816,0
828156,JOB1362685235843,0


In [8]:
train_features = pd.read_csv('data/train_features.csv', sep=',', low_memory=False)
train_features.head()

Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
0,JOB1362684407687,COMP37,CFO,MASTERS,MATH,HEALTH,10,83
1,JOB1362684407688,COMP19,CEO,HIGH_SCHOOL,NONE,WEB,3,73
2,JOB1362684407689,COMP52,VICE_PRESIDENT,DOCTORAL,PHYSICS,HEALTH,10,38
3,JOB1362684407690,COMP38,MANAGER,DOCTORAL,CHEMISTRY,AUTO,8,17
4,JOB1362684407691,COMP7,VICE_PRESIDENT,BACHELORS,PHYSICS,FINANCE,8,16


In [63]:
train_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
jobId                  1000000 non-null object
companyId              1000000 non-null object
jobType                1000000 non-null object
degree                 1000000 non-null object
major                  1000000 non-null object
industry               1000000 non-null object
yearsExperience        1000000 non-null int64
milesFromMetropolis    1000000 non-null int64
dtypes: int64(2), object(6)
memory usage: 61.0+ MB


### ---- 3 Clean the data ----

In [74]:
#look for duplicate data, invalid data (e.g. salaries <=0), or corrupt data and remove it
less_than_zero = train_salaries.loc[train_salaries['salary'] <= 0]
train_salaries_clean = train_salaries.drop(less_than_zero.index)
train_salaries_clean.count()

jobId     999995
salary    999995
dtype: int64
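
Since the salaries and the features live in separate files, the invalid rows also need to be dropped from the feature side. A minimal sketch of one way to do that, assuming the two frames share a unique `jobId` key as the checks above suggest. The helper name `merge_and_clean` and the tiny demo frames are made up for illustration and are not part of the supplied data.

```python
import pandas as pd


def merge_and_clean(features: pd.DataFrame, salaries: pd.DataFrame) -> pd.DataFrame:
    """Join features to salaries on jobId and drop rows with invalid salaries."""
    # validate="one_to_one" raises if jobId is duplicated on either side
    merged = features.merge(salaries, on="jobId", how="inner", validate="one_to_one")
    # Keep only strictly positive salaries
    return merged[merged["salary"] > 0].reset_index(drop=True)


# Tiny made-up frames that mimic the columns shown above
demo_features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "industry": ["HEALTH", "WEB", "OIL"],
    "yearsExperience": [10, 3, 8],
})
demo_salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [130, 0, 142],
})

demo_clean = merge_and_clean(demo_features, demo_salaries)
print(demo_clean)
```

With the real data this would be called as `merge_and_clean(train_features, train_salaries)`.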

### ---- 4 Explore the data (EDA) ----

In [3]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features
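
One possible way to fill in the steps above is sketched here, assuming a merged frame that holds both the features and a `salary` column. The helper `summarize_by_category` and the four-row demo frame are hypothetical and only illustrate the shape of the summaries.

```python
import pandas as pd


def summarize_by_category(df: pd.DataFrame, column: str, target: str = "salary") -> pd.DataFrame:
    """Mean, median, and count of the target for each level of a categorical column."""
    return (
        df.groupby(column)[target]
        .agg(["mean", "median", "count"])
        .sort_values("mean", ascending=False)
    )


# Made-up rows using the column names from the training data
demo = pd.DataFrame({
    "jobType": ["CEO", "CEO", "JANITOR", "JANITOR"],
    "yearsExperience": [20, 10, 5, 1],
    "milesFromMetropolis": [5, 40, 60, 90],
    "salary": [200, 160, 70, 50],
})

job_summary = summarize_by_category(demo, "jobType")

# Pearson correlation of each numeric feature with the target
numeric_cols = ["yearsExperience", "milesFromMetropolis", "salary"]
numeric_corr = demo[numeric_cols].corr()["salary"].drop("salary")

print(job_summary)
print(numeric_corr)
```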

### ---- 5 Establish a baseline ----

In [5]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
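
The baseline described above could look roughly like the following sketch. It predicts the average training-fold salary of each industry and reports the mean MSE across 5 folds. The function name `baseline_cv_mse`, the fold-splitting approach, and the synthetic demo data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def baseline_cv_mse(df, group_col="industry", target="salary", n_splits=5, seed=0):
    """Mean MSE over k folds of a 'predict the group average' model."""
    rng = np.random.default_rng(seed)
    # Shuffle row positions, then split them into k roughly equal folds
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    fold_mse = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        train, valid = df.iloc[train_idx], df.iloc[valid_idx]
        group_means = train.groupby(group_col)[target].mean()
        # Groups unseen in the training fold fall back to the overall training mean
        preds = valid[group_col].map(group_means).fillna(train[target].mean())
        fold_mse.append(float(((valid[target] - preds) ** 2).mean()))
    return float(np.mean(fold_mse))


# Made-up data: salary depends on industry plus noise
rng = np.random.default_rng(42)
industries = rng.choice(["OIL", "WEB", "HEALTH"], size=300)
base = pd.Series(industries).map({"OIL": 130, "WEB": 120, "HEALTH": 110})
demo = pd.DataFrame({"industry": industries, "salary": base + rng.normal(0, 10, size=300)})

baseline_mse = baseline_cv_mse(demo)
print(f"baseline 5-fold MSE: {baseline_mse:.1f}")
```

The group means are computed on the training folds only, so the validation fold never leaks into its own prediction.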

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc.

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
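
As a sketch of what this step might involve, the example below adds a per-group salary average (one of the "summary statistics for each group" ideas from step 6) and one-hot encodes the categorical columns. The helper `add_group_mean`, the feature name `group_mean_salary`, and the demo frames are hypothetical.

```python
import pandas as pd


def add_group_mean(train, other, group_cols, target="salary", name="group_mean_salary"):
    """Attach the training-set mean of the target per group to another frame."""
    stats = train.groupby(group_cols)[target].mean().rename(name).reset_index()
    out = other.merge(stats, on=group_cols, how="left")
    # Groups that never appear in the training data get the overall training mean
    out[name] = out[name].fillna(train[target].mean())
    return out


# Made-up rows using column names from the supplied data
demo_train = pd.DataFrame({
    "jobType": ["CEO", "CEO", "JUNIOR", "JUNIOR"],
    "industry": ["OIL", "OIL", "WEB", "WEB"],
    "salary": [200, 180, 90, 110],
})
demo_test = pd.DataFrame({"jobType": ["CEO", "JANITOR"], "industry": ["OIL", "WEB"]})

demo_test_fe = add_group_mean(demo_train, demo_test, ["jobType", "industry"])

# One-hot encode the categorical columns for models that need numeric input
demo_test_encoded = pd.get_dummies(demo_test_fe, columns=["jobType", "industry"])
print(demo_test_encoded)
```

During cross-validation the group statistics should be recomputed from each training fold, otherwise the validation scores will look better than they really are.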

### ---- 8 Create models ----

In [15]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [1]:
#do 5-fold cross validation on models and measure MSE
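
A minimal sketch of this step using scikit-learn's `cross_val_score` with the `neg_mean_squared_error` scorer. The two candidate models and the synthetic data are placeholders chosen for illustration, not the models selected in step 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (yearsExperience, milesFromMetropolis) -> salary
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 25, 200), rng.integers(0, 100, 200)])
y = 100 + 2.0 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(0, 5, 200)

# Placeholder candidates; swap in the models brainstormed earlier
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for name, model in candidates.items():
    # The scorer returns negative MSE, so flip the sign before averaging
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[name] = float(-scores.mean())

for name, mse in cv_mse.items():
    print(f"{name}: {mse:.1f}")
```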

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model
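
Selection can be as simple as taking the minimum over the cross-validated scores and comparing it to the goal. In the sketch below the model names and MSE values are placeholders, not results from this notebook, and `select_best` is a hypothetical helper. The goal of 360 is the entry-level threshold stated in Part 3.

```python
def select_best(cv_mse: dict, goal: float):
    """Return (name, mse, meets_goal) for the model with the lowest MSE."""
    best_name = min(cv_mse, key=cv_mse.get)
    best_mse = cv_mse[best_name]
    return best_name, best_mse, best_mse < goal


# Placeholder scores for illustration only
placeholder_mse = {"model_a": 400.0, "model_b": 375.0, "model_c": 350.0}

best_name, best_mse, meets_goal = select_best(placeholder_mse, goal=360)
print(best_name, best_mse, meets_goal)
```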

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
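
The script described above might be organized along these lines. This is a sketch under several assumptions: the function names `train_and_save` and `load_and_score`, the choice of a one-hot encoder plus linear regression, and the use of `pickle` for persistence are all illustrative, and the demo frames are made up.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["jobType", "industry"]
NUMERIC = ["yearsExperience", "milesFromMetropolis"]


def train_and_save(train: pd.DataFrame, model_path: Path) -> None:
    """Fit the pipeline on the entire training set and save it to disk."""
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
    model.fit(train[CATEGORICAL + NUMERIC], train["salary"])
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)


def load_and_score(test: pd.DataFrame, model_path: Path) -> pd.DataFrame:
    """Load the saved pipeline and predict a salary for each test row."""
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    preds = model.predict(test[CATEGORICAL + NUMERIC])
    return pd.DataFrame({"jobId": test["jobId"].to_numpy(), "salary": preds})


# Made-up rows using the column names from the supplied data
demo_train = pd.DataFrame({
    "jobType": ["CEO", "CEO", "JUNIOR", "JUNIOR", "MANAGER", "MANAGER"],
    "industry": ["OIL", "WEB", "OIL", "WEB", "OIL", "WEB"],
    "yearsExperience": [20, 15, 2, 1, 10, 8],
    "milesFromMetropolis": [5, 30, 60, 80, 20, 40],
    "salary": [210, 190, 80, 70, 140, 130],
})
demo_test = pd.DataFrame({
    "jobId": ["JOB_A", "JOB_B"],
    "jobType": ["CEO", "JANITOR"],  # JANITOR is unseen and is encoded as all zeros
    "industry": ["OIL", "WEB"],
    "yearsExperience": [18, 3],
    "milesFromMetropolis": [10, 70],
})

model_path = Path(tempfile.mkdtemp()) / "salary_model.pkl"
train_and_save(demo_train, model_path)
predictions = load_and_score(demo_test, model_path)
print(predictions)
```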

### ---- 12 Deploy solution ----

In [16]:
#save your predictions to a csv file or optionally save them as a table in a SQL database
#additionally, save a visualization and summary of your predictions and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
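
A small sketch of the CSV route, assuming a fitted tree-based model that exposes `feature_importances_`. The random forest, the synthetic data, and the output file names are illustrative assumptions. The demo scores the same rows it was trained on only to keep the example short; in practice the test features would be scored instead.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with two of the numeric columns from the supplied files
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "yearsExperience": rng.integers(0, 25, 200),
    "milesFromMetropolis": rng.integers(0, 100, 200),
})
y = 100 + 2.0 * X["yearsExperience"] - 0.4 * X["milesFromMetropolis"] + rng.normal(0, 5, 200)

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

predictions = pd.DataFrame({
    "jobId": [f"JOB{i}" for i in range(len(X))],  # made-up identifiers
    "salary": model.predict(X),
})
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Write both artifacts to a temporary directory
out_dir = Path(tempfile.mkdtemp())
predictions.to_csv(out_dir / "predictions.csv", index=False)
importances.rename("importance").to_csv(out_dir / "feature_importances.csv", index_label="feature")

saved = pd.read_csv(out_dir / "predictions.csv")
print(saved.head())
print(importances)
```

If matplotlib is installed, `importances.plot.barh()` gives a quick bar chart of the same summary for stakeholders.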

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data