# Salary Predictions Based on Job Descriptions

## Part 1 - DEFINE

### ---- 1 Define the problem ----

Predict the salary of a job from the attributes in its job description.

Each job description carries the following attributes:

    1.) Job Type : Role / Position for the job ( Ex : Executive / Management / Fresher / Mid-Level roles etc. )
    2.) Degree of Education required for the job ( Ex : Masters / PhD / Bachelors etc. )
    3.) Field of Study / Major ( Ex : Math / Computer Science / Biology etc. )
    4.) Industry ( Ex : Textile / Automobile / Technology / Health etc. )
    5.) Years of Experience
    6.) Miles from Metropolis / How far the job is from the main city

In [24]:
#import your libraries
import pandas as pd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
#etc

#your info here
__author__ = "Viswanathan K S"
__email__ = "viswasiva2003@gmail.com"

## Part 2 - DISCOVER

In [25]:
# Find the Current Directory
import os
print(os.getcwd())

/home/ksviswa/DSDJ_Portfolio/salary_prediction


### ---- 2 Load the data ----

In [26]:
#load the data into a Pandas dataframe
file_path = "/home/ksviswa/DSDJ_Portfolio/salary_prediction/"

train_features_data = pd.read_csv(file_path + "data/train_features.csv")
print(train_features_data.head())

              jobId companyId         jobType       degree      major  \
0  JOB1362684407687    COMP37             CFO      MASTERS       MATH   
1  JOB1362684407688    COMP19             CEO  HIGH_SCHOOL       NONE   
2  JOB1362684407689    COMP52  VICE_PRESIDENT     DOCTORAL    PHYSICS   
3  JOB1362684407690    COMP38         MANAGER     DOCTORAL  CHEMISTRY   
4  JOB1362684407691     COMP7  VICE_PRESIDENT    BACHELORS    PHYSICS   

  industry  yearsExperience  milesFromMetropolis  
0   HEALTH               10                   83  
1      WEB                3                   73  
2   HEALTH               10                   38  
3     AUTO                8                   17  
4  FINANCE                8                   16  


In [27]:
train_salaries_data = pd.read_csv(file_path + "data/train_salaries.csv")
print(train_salaries_data.head())

              jobId  salary
0  JOB1362684407687     130
1  JOB1362684407688     101
2  JOB1362684407689     137
3  JOB1362684407690     142
4  JOB1362684407691     163


In [28]:
test_features_data = pd.read_csv(file_path + "data/test_features.csv")
print(test_features_data.head())

              jobId companyId  jobType       degree    major industry  \
0  JOB1362685407687    COMP33  MANAGER  HIGH_SCHOOL     NONE   HEALTH   
1  JOB1362685407688    COMP13   JUNIOR         NONE     NONE     AUTO   
2  JOB1362685407689    COMP10      CTO      MASTERS  BIOLOGY   HEALTH   
3  JOB1362685407690    COMP21  MANAGER  HIGH_SCHOOL     NONE      OIL   
4  JOB1362685407691    COMP36   JUNIOR     DOCTORAL  BIOLOGY      OIL   

   yearsExperience  milesFromMetropolis  
0               22                   73  
1               20                   47  
2               17                    9  
3               14                   96  
4               10                   44  


In [37]:
print("Train Features Data has " + str(train_features_data.shape[0]) + " rows and " + str(train_features_data.shape[1]) + " columns" )
print("Train Salaries Data has " + str(train_salaries_data.shape[0]) + " rows and " + str(train_salaries_data.shape[1]) + " columns" )

#Both training files have the same number of records, so we can join them on jobId.

train_data = pd.merge(train_features_data , train_salaries_data , on = 'jobId')

print("\n")
print(train_data.head())
print("\n")
print(train_data.shape)

Train Features Data has 1000000 rows and 8 columns
Train Salaries Data has 1000000 rows and 2 columns


              jobId companyId         jobType       degree      major  \
0  JOB1362684407687    COMP37             CFO      MASTERS       MATH   
1  JOB1362684407688    COMP19             CEO  HIGH_SCHOOL       NONE   
2  JOB1362684407689    COMP52  VICE_PRESIDENT     DOCTORAL    PHYSICS   
3  JOB1362684407690    COMP38         MANAGER     DOCTORAL  CHEMISTRY   
4  JOB1362684407691     COMP7  VICE_PRESIDENT    BACHELORS    PHYSICS   

  industry  yearsExperience  milesFromMetropolis  salary  
0   HEALTH               10                   83     130  
1      WEB                3                   73     101  
2   HEALTH               10                   38     137  
3     AUTO                8                   17     142  
4  FINANCE                8                   16     163  


(1000000, 9)
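
Equal row counts alone do not guarantee that every `jobId` in one file has a match in the other. The check below is a minimal sketch on toy frames (the values are made up, not taken from the dataset) that uses the `validate` and `indicator` options of `pd.merge` to confirm a one-to-one join.

```python
import pandas as pd

# Toy stand-ins for the two training files (values are illustrative only).
features = pd.DataFrame({"jobId": ["J1", "J2", "J3"], "yearsExperience": [10, 3, 8]})
salaries = pd.DataFrame({"jobId": ["J1", "J2", "J3"], "salary": [130, 101, 142]})

# validate raises MergeError if jobId is repeated on either side;
# indicator adds a _merge column showing which side each row came from.
merged = pd.merge(features, salaries, on="jobId", how="outer",
                  validate="one_to_one", indicator=True)

# Rows present in only one of the two frames.
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))
```

With an outer join, any `jobId` missing from either file appears in `unmatched` instead of being dropped silently.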


### ---- 3 Clean the data ----

In [53]:
#look for duplicate data, invalid data (e.g. salaries <=0), or corrupt data and remove it

# Check for any salary less than or equal to 0
train_data_salary_less_than_0 = train_data[train_data['salary'] <= 0]

print(len(train_data_salary_less_than_0))

# There are 5 records where salary is 0. This could be corrupt data; we can drop them because they are
# very few compared to the entire training set of 1 million records.

train_data = train_data.drop(train_data_salary_less_than_0.index.tolist())
print(train_data.shape)
      
# Check for duplicates 
train_data['is_duplicated'] = train_data.duplicated()
print(train_data['is_duplicated'].sum())

# The training data has no duplicates, hence we need not drop any.
# If there were duplicates, we could use df.drop_duplicates() to remove them.

# Check for any other corrupt data / missing data
print("Missing Values : ")
print("\n")
print(train_data.isnull().sum())

0
(999995, 10)
0
Missing Values : 


jobId                  0
companyId              0
jobType                0
degree                 0
major                  0
industry               0
yearsExperience        0
milesFromMetropolis    0
salary                 0
is_duplicated          0
dtype: int64
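
The cell above stores the duplicate flag in an `is_duplicated` column, which then shows up in the missing-value report and in later summaries. A possible alternative, sketched here on a toy frame with made-up values, counts duplicates without adding a column and applies both cleaning steps in sequence.

```python
import pandas as pd

# Toy frame: one exact duplicate row and a zero salary (illustrative values).
toy = pd.DataFrame({
    "jobId": ["J1", "J2", "J2", "J3"],
    "salary": [130, 0, 0, 142],
})

# Count fully duplicated rows without storing a flag column on the frame.
n_duplicates = int(toy.duplicated().sum())

# Drop exact duplicates first, then rows with a non-positive salary.
cleaned = toy.drop_duplicates()
cleaned = cleaned[cleaned["salary"] > 0].reset_index(drop=True)

print(n_duplicates, cleaned.shape)
```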


### ---- 4 Explore the data (EDA) ----

In [54]:
#summarize each feature variable
#summarize the target variable

# we can quickly check the summary for all the columns using the describe method

print('\n Summary :  \n')
print(train_data.describe(include='all'))

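#look for correlation between each feature and the target
#(the matrix printed below includes the salary column)

#look for correlation between features
#restrict to numeric and boolean columns: recent pandas versions raise an error
#when corr() is called on a frame that still contains string columns
print("\n")
print(train_data.select_dtypes(include=['number', 'bool']).corr())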


 Summary :  

                   jobId companyId jobType       degree   major industry  \
count             999995    999995  999995       999995  999995   999995   
unique            999995        63       8            5       9        7   
top     JOB1362684953073    COMP39  SENIOR  HIGH_SCHOOL    NONE      WEB   
freq                   1     16193  125886       236975  532353   143205   
mean                 NaN       NaN     NaN          NaN     NaN      NaN   
std                  NaN       NaN     NaN          NaN     NaN      NaN   
min                  NaN       NaN     NaN          NaN     NaN      NaN   
25%                  NaN       NaN     NaN          NaN     NaN      NaN   
50%                  NaN       NaN     NaN          NaN     NaN      NaN   
75%                  NaN       NaN     NaN          NaN     NaN      NaN   
max                  NaN       NaN     NaN          NaN     NaN      NaN   

        yearsExperience  milesFromMetropolis         salary is_duplicate
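
The feature-to-target step in the cell above could be approached roughly as follows. This is a sketch on a toy frame with made-up values: numeric features are compared to salary with a Pearson correlation, and a categorical feature is summarized by the mean salary per category.

```python
import pandas as pd

# Toy frame with the same column names as the training data (values are invented).
toy = pd.DataFrame({
    "industry": ["HEALTH", "WEB", "HEALTH", "AUTO", "WEB", "AUTO"],
    "yearsExperience": [10, 3, 12, 8, 5, 2],
    "milesFromMetropolis": [83, 73, 38, 17, 16, 90],
    "salary": [130, 101, 150, 142, 120, 80],
})

# Numeric features: Pearson correlation of each column with the target.
numeric_cols = ["yearsExperience", "milesFromMetropolis"]
target_corr = toy[numeric_cols].corrwith(toy["salary"])

# Categorical features: compare the mean salary across categories.
industry_means = toy.groupby("industry")["salary"].mean().sort_values(ascending=False)

print(target_corr)
print(industry_means)
```

The same `groupby` pattern applies to `jobType`, `degree`, and `major`; a large spread between category means suggests the feature is informative.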

### ---- 5 Establish a baseline ----

In [5]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
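
One way the baseline described in the comments could look is sketched below. It runs on synthetic data (the salary values are generated, not real), predicts the mean salary of each industry learned on the training folds, and averages the MSE over 5 folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for train_data (values are made up for illustration).
rng = np.random.default_rng(0)
industries = np.array(["HEALTH", "WEB", "AUTO", "FINANCE", "OIL"])
toy = pd.DataFrame({"industry": rng.choice(industries, size=500)})
base = toy["industry"].map({"HEALTH": 115, "WEB": 120, "AUTO": 110, "FINANCE": 130, "OIL": 130})
toy["salary"] = base + rng.normal(0, 30, size=500)

fold_mse = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(toy):
    train, valid = toy.iloc[train_idx], toy.iloc[valid_idx]
    # "Model": the mean salary of each industry, learned on the training folds only.
    industry_mean = train.groupby("industry")["salary"].mean()
    # Fall back to the overall training mean for an industry unseen in these folds.
    pred = valid["industry"].map(industry_mean).fillna(train["salary"].mean())
    fold_mse.append(float(np.mean((valid["salary"] - pred) ** 2)))

baseline_mse = float(np.mean(fold_mse))
print(baseline_mse)
```

Computing the group means inside each fold keeps the validation rows out of the "training" step, so the reported MSE is not optimistic.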

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc
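
As an illustration of the feature ideas mentioned above, the sketch below builds group summary statistics and a simple interaction key on a toy frame. The grouping columns and the new column names (`group_mean`, `jobType_industry`, and so on) are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "jobType": ["CEO", "CEO", "JUNIOR", "JUNIOR", "MANAGER"],
    "industry": ["WEB", "WEB", "AUTO", "AUTO", "WEB"],
    "yearsExperience": [20, 10, 1, 3, 8],
    "salary": [200, 180, 60, 70, 120],
})

group_cols = ["jobType", "industry"]

# Summary statistics of salary per (jobType, industry) group.
# On real data, compute these on the training folds only to avoid target leakage.
group_stats = (
    toy.groupby(group_cols)["salary"]
    .agg(group_mean="mean", group_median="median", group_min="min", group_max="max")
    .reset_index()
)
with_stats = toy.merge(group_stats, on=group_cols, how="left")

# Interaction feature: a string key combining two categorical columns.
with_stats["jobType_industry"] = with_stats["jobType"] + "_" + with_stats["industry"]

print(with_stats)
```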

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
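
Most scikit-learn models need numeric input, so the categorical columns have to be encoded before modeling. The following is a minimal sketch, assuming one-hot encoding is acceptable for these columns; the toy values are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with invented values.
toy = pd.DataFrame({
    "jobType": ["CEO", "JUNIOR", "MANAGER", "JUNIOR"],
    "degree": ["MASTERS", "NONE", "BACHELORS", "HIGH_SCHOOL"],
    "yearsExperience": [20, 1, 8, 3],
    "milesFromMetropolis": [10, 80, 35, 60],
})

categorical_cols = ["jobType", "degree"]
numeric_cols = ["yearsExperience", "milesFromMetropolis"]

# One-hot encode the categorical columns; pass the numeric columns through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
X = preprocess.fit_transform(toy[categorical_cols + numeric_cols])
print(X.shape)
```

Fitting the transformer on the training data and reusing it on the test data keeps the column layout identical between the two.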

### ---- 8 Create models ----

In [15]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [1]:
#do 5-fold cross validation on models and measure MSE
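
A possible shape for this step is sketched below with `cross_val_score`. The two candidate models and the synthetic data are placeholders chosen for illustration; they are not the models selected in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the engineered feature matrix.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 100 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(0, 5, size=300)

# Placeholder candidates; swap in the models brainstormed earlier.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mean_mse = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MSE is reported negated; flip the sign back.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    mean_mse[name] = float(-neg_mse.mean())

# The candidate with the lowest mean MSE across folds.
best_name = min(mean_mse, key=mean_mse.get)
print(mean_mse, best_name)
```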

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
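
The three steps in the comments (train on all data, save to disk, score the test set) could be wired together roughly as follows. This sketch uses synthetic data, a placeholder `LinearRegression` model, and a hypothetical file name `salary_model.joblib` in a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the prepared training and test matrices.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(200, 3))
y_train = 100 + 30 * X_train[:, 0] + rng.normal(0, 5, size=200)
X_test = rng.uniform(0, 1, size=(20, 3))

# Train on the entire training set.
model = LinearRegression().fit(X_train, y_train)

# Save the fitted model to disk, then reload it as a separate scoring script would.
model_path = os.path.join(tempfile.mkdtemp(), "salary_model.joblib")
joblib.dump(model, model_path)
loaded = joblib.load(model_path)

# Score the "test" set with the reloaded model.
test_predictions = loaded.predict(X_test)
print(test_predictions[:5])
```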

### ---- 12 Deploy solution ----

In [16]:
#save your prediction to a csv file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your prediction and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
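
A minimal sketch of the export step is shown below. It assumes a tree-based model that exposes `feature_importances_`; the data is synthetic and the output file name `test_salaries.csv` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which yearsExperience drives salary far more than distance.
rng = np.random.default_rng(0)
feature_names = ["yearsExperience", "milesFromMetropolis"]
X_train = pd.DataFrame(rng.uniform(0, 25, size=(200, 2)), columns=feature_names)
y_train = 80 + 2.0 * X_train["yearsExperience"] - 0.3 * X_train["milesFromMetropolis"]
X_test = pd.DataFrame(rng.uniform(0, 25, size=(10, 2)), columns=feature_names)
job_ids = [f"JOB{i}" for i in range(10)]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Save one predicted salary per jobId to a CSV file.
predictions = pd.DataFrame({"jobId": job_ids, "salary": model.predict(X_test)})
out_path = os.path.join(tempfile.mkdtemp(), "test_salaries.csv")
predictions.to_csv(out_path, index=False)

# Feature importances, sorted so the most influential feature comes first.
importances = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)
```

The sorted `importances` series can be passed to `importances.plot(kind="barh")` to produce the chart for stakeholders.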

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data.