# Salary Predictions Based on Job Descriptions

## DEFINE

This project aims to predict the salaries of future job postings based on the salaries of current job postings. The language of choice for this problem is Python.

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
__author__ = "Lukas Barbuscak"
__email__ = "lukas.barbuscak@gmail.com"

## DISCOVER

In [None]:
#loading the data
#forward slashes work on Windows, macOS, and Linux alike
df_features = pd.read_csv("data/train_features.csv")
df_salaries = pd.read_csv("data/train_salaries.csv")
df_test = pd.read_csv("data/test_features.csv")

#merging the features and salaries datasets based on jobId
df = pd.merge(df_features, df_salaries, on="jobId")

#releasing memory
del df_features, df_salaries

#examining the dataset
df.head(10)
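
An inner merge silently drops any rows whose `jobId` has no match in the other table, so it can be worth verifying the join. The following is a minimal sketch on hypothetical toy tables (the values are made up, only the `jobId` and `salary` column names come from the data above). It assumes each `jobId` should appear exactly once in both tables.

```python
import pandas as pd

# Toy stand-ins for the features and salaries tables (hypothetical values)
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "yearsExperience": [3, 10, 7],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [90, 150, 120],
})

# validate="one_to_one" raises a MergeError if a jobId repeats in either table
merged = pd.merge(features, salaries, on="jobId", how="inner", validate="one_to_one")

# An inner join drops unmatched rows without warning, so compare the row counts
rows_lost = len(features) - len(merged)
print("Rows lost in the merge:", rows_lost)
```

If `rows_lost` were greater than zero, some postings would have no salary attached and would be missing from the training data.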

### Cleaning the dataset

In [None]:
#checking the dataset
df.info()

Overall, the dataset looks very clean. Numeric values are stored as integers and string values are stored as objects. Let's check for any irregularities in the data.

In [None]:
#checking for the total number of missing values
df.isnull().sum()

In [None]:
#checking if missing values are encoded as "0"
df.eq(0).sum()

I assume that 0 years of experience does not indicate a missing value, since someone can have less than 1 year of experience. I also assume that 0 miles from Metropolis is valid, since a job located in Metropolis is indeed 0 miles away. A salary of 0, however, indicates either a missing value or volunteer work, neither of which is relevant for the model, so I will drop these observations.

In [None]:
#dropping observations which have salary as 0
df.drop(df[df.salary == 0].index, inplace=True)

In [None]:
#checking for possible irregularities for object dtypes
print("Job type values:", df.jobType.unique())
print("Degree values:", df.degree.unique())
print("Major values:", df.major.unique())
print("Industry values:", df.industry.unique())

In [None]:
#checking for possible irregularities for numeric dtypes
df.describe()

There seem to be no issues with the first two columns, but salary might have outliers. Let's check this using a box plot.

In [None]:
#creating a box plot
df.boxplot(column=["salary"])
plt.show()

In [None]:
#checking the percentage of outliers
salary_outliers = np.sum(df["salary"]>=210)
print("The percentage of outliers is:", salary_outliers/df["salary"].count()*100)

Because the percentage of outliers is relatively large and the values do not appear to be errors, I will keep them in the dataset.
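
The outlier cutoff can also be derived from the data instead of being read off the plot. A minimal sketch, assuming the conventional box plot rule of 1.5 times the interquartile range (IQR) beyond the quartiles; the salaries below are hypothetical, and the rule itself is my assumption about how the cutoff could be chosen:

```python
import pandas as pd

# Hypothetical salaries; in the notebook this would be df["salary"]
salary = pd.Series([60, 80, 95, 100, 110, 115, 120, 130, 150, 400])

# Quartiles and the interquartile range
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond the quartiles
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# Flag values outside the limits and compute their share
is_outlier = (salary > upper_bound) | (salary < lower_bound)
outlier_pct = is_outlier.mean() * 100
print("Upper bound:", upper_bound)
print("The percentage of outliers is:", outlier_pct)
```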

In [None]:
#dropping jobId and removing observations that are duplicated across all remaining columns
df_no_duplicates = df.drop('jobId', axis=1).drop_duplicates()
print(df.shape)
print(df_no_duplicates.shape)

In [None]:
#applying the changes to the original dataset
df = df_no_duplicates
print(df.shape)

del df_no_duplicates
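
If the `jobId` column needs to be kept, duplicates can be flagged on the remaining columns only. A minimal sketch on hypothetical toy data; the `subset` approach is an alternative to the cell above, not what the notebook does:

```python
import pandas as pd

# Hypothetical postings: JOB1 and JOB2 differ only in their ID
postings = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CEO", "CEO", "JANITOR"],
    "salary": [200, 200, 50],
})

# Compare every column except the identifier
content_cols = [col for col in postings.columns if col != "jobId"]
dup_mask = postings.duplicated(subset=content_cols, keep="first")

# Keep the first occurrence of each duplicated posting, IDs included
deduplicated = postings[~dup_mask]
print("Duplicates found:", dup_mask.sum())
```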

### Exploratory Data Analysis

In [None]:
#checking the differences between groups, and plotting them if necessary
#numeric_only=True restricts the means to the numeric columns
companyId_summary = df.groupby("companyId")
companyId_summary.mean(numeric_only=True).head(10)

In [None]:
degree_summary = df.groupby("degree")
degree_summary.mean(numeric_only=True)

In [None]:
#setting the style before drawing, so that it also applies to this plot
sns.set(style="whitegrid")
plt.figure(figsize=(9,4))
sns.boxplot(x=df["degree"],y=df["salary"],
            palette="GnBu_d", order=["DOCTORAL","MASTERS","BACHELORS","HIGH_SCHOOL","NONE"])
plt.xlabel("Degree")
plt.ylabel("Salary")
plt.show()

In [None]:
jobType_summary = df.groupby("jobType")
jobType_summary.mean(numeric_only=True)

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(11,4))
sns.boxplot(x=df["jobType"],y=df["salary"],
            palette="GnBu_d", order=["CEO","CTO","CFO","VICE_PRESIDENT","MANAGER","SENIOR","JUNIOR","JANITOR"])
plt.xlabel("Job Type")
plt.ylabel("Salary")
plt.show()

In [None]:
major_summary = df.groupby("major")
major_summary.mean(numeric_only=True)

In [None]:
industry_summary = df.groupby("industry")
industry_summary.mean(numeric_only=True)

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(11,4))
sns.boxplot(x=df["industry"],y=df["salary"], palette="GnBu_d",
            order=["OIL","FINANCE","WEB","HEALTH","AUTO","SERVICE","EDUCATION"])
plt.xlabel("Industry")
plt.ylabel("Salary")
plt.show()

Overall, the categorical variables show no meaningful differences between categories in mean years of experience or mean miles from Metropolis. The differences between categories appear mainly in salary.

There is a clear salary difference between observations with a high school diploma or no degree and observations with bachelor's, master's, or doctoral degrees, which also differ slightly from each other. The job type box plot holds no surprises: CEO observations have the highest salaries and janitor observations the lowest. The mean differences between majors are minimal; the only clearly visible difference is between having a major and not having one. There are also differences between industries, with oil and finance having the highest mean salaries, and service and education the lowest.
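
The group comparisons described above can be condensed into one ranked table per variable. A minimal sketch on hypothetical toy data, using only the `degree` and `salary` column names from the dataset; the choice of mean, median, and count as summary statistics is my own:

```python
import pandas as pd

# Hypothetical postings; the real data has the same column names
postings = pd.DataFrame({
    "degree": ["DOCTORAL", "DOCTORAL", "NONE", "NONE", "MASTERS"],
    "salary": [150, 130, 80, 100, 125],
})

# Summarize salary per degree and rank the groups by their mean
salary_by_degree = (
    postings.groupby("degree")["salary"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(salary_by_degree)
```

Selecting the `salary` column before aggregating keeps the table focused on the target and avoids averaging unrelated columns.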

In [None]:
#creating a countplot for each categorical variable presented as a subplot
sns.set(style="whitegrid")
fig, axes = plt.subplots(5, 1, figsize=(15, 20))
columns = ["companyId", "jobType", "major", "industry", "degree"]
labels = ["Company ID", "Job Type", "Major", "Industry", "Degree"]

#drawing each countplot directly on its own axes
for ax, column, label in zip(axes, columns, labels):
    sns.countplot(x=df[column], ax=ax)
    ax.set_xlabel(label)

#hiding the company ID tick labels, since there are too many of them to read
axes[0].xaxis.set_major_formatter(plt.NullFormatter())
plt.subplots_adjust(top = 0.9)
plt.show()

This dataset seems extremely balanced in terms of the distribution of each categorical variable: the count is roughly the same for all job types, there are only slightly more observations with no major or no post-secondary education than their counterparts, and all industries and companies are represented roughly equally. The sample is balanced enough that no resampling methods are needed.

Let's take a look at numerical variables.

In [None]:
#defining a group of numerical variables
num = df.select_dtypes(include=[np.int64])

#distribution of numerical variables
num.hist(bins=50, figsize=(15, 6), layout=(2, 3))
plt.show()

This compact view of the numerical variables shows some valuable information. Miles from Metropolis follows a roughly uniform distribution. The salary histogram is slightly skewed to the right, but its overall shape is close to a normal distribution. Years of experience is again roughly uniform. This points to good sampling, and again, there is no need for any resampling methods.

In [None]:
#drawing lineplots
sns.lineplot(x=df['yearsExperience'], y=df['salary'], data=df)
plt.show()

sns.lineplot(x=df['milesFromMetropolis'], y=df['salary'], data=df, color="red")
plt.show()

sns.lineplot(x=df['milesFromMetropolis'], y=df['yearsExperience'], data=df, color="yellow")
plt.show()

From the above graphs, we can see that salaries generally decrease with larger distance from Metropolis. Also, salaries tend to increase with years of experience in general. There does not seem to be a relationship between years of experience and miles from Metropolis. 

In [None]:
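#creating a correlation heatmap and a correlation table
#selecting the numeric columns first, since corr() cannot handle string columns in recent pandas versions
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr)
plt.show()
corr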

As expected, there is a positive correlation between years of experience and salary, and a slight negative correlation between miles from Metropolis and salary. There is no correlation between years of experience and miles from Metropolis.

The EDA has shown valuable information: I am dealing with a balanced dataset, and the relationship between the dependent variable and the predictors seems to be linear.

Because this is a regression problem with a continuous dependent variable, mean squared error (MSE) is a simple and fitting metric for my models. As my baseline, I will predict the average salary for every observation, since any useful ML model should outperform a prediction of the mean.
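
To make the baseline concrete: when every prediction equals the mean, the MSE is simply the population variance of the target. A minimal sketch with hypothetical salaries:

```python
import numpy as np

# Hypothetical salaries standing in for the target column
salaries = np.array([90.0, 150.0, 120.0, 60.0, 180.0])

# The baseline predicts the mean salary for every observation
baseline_pred = np.full_like(salaries, salaries.mean())

# MSE: the average of the squared prediction errors
baseline_mse = np.mean((salaries - baseline_pred) ** 2)

# For a mean-only prediction, this equals the population variance
print("Baseline MSE:", baseline_mse)
print("Variance:", np.var(salaries))
```

Any model whose MSE is not clearly below this value has learned nothing beyond the average salary.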

In [None]:
#creating dummies, appending them to the original dataset, and dropping the original columns
dummy_1 = pd.get_dummies(df["jobType"])
df = pd.concat([df, dummy_1], axis=1)
df.drop("jobType", axis = 1, inplace=True)

dummy_2 = pd.get_dummies(df["industry"])
df = pd.concat([df, dummy_2], axis=1)
df.drop("industry", axis = 1, inplace=True)

dummy_3 = pd.get_dummies(df["degree"], prefix='degree')
df = pd.concat([df, dummy_3], axis=1)
df.drop("degree", axis = 1, inplace=True)

dummy_4 = pd.get_dummies(df["major"], prefix='major')
df = pd.concat([df, dummy_4], axis=1)
df.drop("major", axis = 1, inplace=True)

dummy_5 = pd.get_dummies(df["companyId"])
df = pd.concat([df, dummy_5], axis=1)
df.drop("companyId", axis = 1, inplace=True)
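
The five encoding steps above could likely be collapsed into a single call. A minimal sketch on hypothetical toy data; note that passing `columns=` prefixes every dummy column with its source column name, which differs from the unprefixed names produced above for job type, industry, and company ID:

```python
import pandas as pd

# Hypothetical postings with two categorical columns and one numeric column
toy = pd.DataFrame({
    "jobType": ["CEO", "JANITOR", "CEO"],
    "degree": ["MASTERS", "NONE", "DOCTORAL"],
    "yearsExperience": [12, 3, 20],
})

# Encode the listed columns, drop the originals, and keep everything else
encoded = pd.get_dummies(toy, columns=["jobType", "degree"])
print(encoded.columns.tolist())
```

The prefixes also make it easier to trace each dummy column back to its original variable later on.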

In [None]:
#storing the target variable "salary" separately
target = df["salary"]

#creating a baseline model that predicts the average salary for every observation, manually computing MSE
target_mean = target.mean()
baseline_mse = ((target - target_mean)**2).mean()
print("The MSE of the model using just means is:", baseline_mse)

#dropping the target from the features dataset
df.drop("salary", axis=1, inplace=True)

The MSE of my simple mean-only model is very high, so I need models that improve on it. My models of choice are:

- Linear Regression: as seen in the EDA, the relationships between the numeric predictors and salary look roughly linear
- Decision Trees: like linear regression, a basic and fast modeling approach, which can also capture non-linear effects and interactions between features
- Gradient Boosting: it combines many weak learners, each one correcting the errors of the previous ones, and can directly minimize the squared error, which matches my MSE metric

## DEVELOP

In [None]:
#calculating MSE for Linear Regression

#importing packages
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score

#fitting the model
lm = LinearRegression()
lm.fit(df,target)

#measuring MSE during 5-fold cross-validation and printing the result
lm_scores = cross_val_score(lm,df,target,scoring="neg_mean_squared_error",cv=5)
lm_mse = -1*lm_scores.mean()
print("The average MSE of the linear regression:", lm_mse)


In [None]:
#calculating MSE for Decision Trees, repeating the process
from sklearn import tree

dt = tree.DecisionTreeRegressor()
dt.fit(df,target)

dt_scores = cross_val_score(dt,df,target,scoring="neg_mean_squared_error",cv=5)
dt_mse = -1*dt_scores.mean()
print("Average MSE of the decision tree model:", dt_mse)

In [None]:
#calculating MSE for Gradient Boosting, repeating the process
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(n_estimators=150, max_depth=5)
gb.fit(df,target)

gb_scores = cross_val_score(gb,df,target,scoring="neg_mean_squared_error",cv=5)
gb_mse = -1*gb_scores.mean()
print("Average MSE of the gradient boosting model:", gb_mse)

The lowest average MSE has been reached by the gradient boosting model. I will use this model to score the test dataset, and analyze which features are the most important for the prediction.
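
The model comparison can also be run in one loop, with the mean baseline evaluated under the same cross-validation. This is a minimal sketch on synthetic data from `make_regression`, not on the salary dataset. The `DummyRegressor` baseline, the shuffled `KFold` splitter, and the smaller `n_estimators` are my assumptions and are not part of the cells above.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded features and the salary target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "gradient boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# Shuffled folds with a fixed seed, so every model sees identical splits
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    cv_mse[name] = -scores.mean()  # scikit-learn reports the negated MSE

# Print the models from lowest to highest average MSE
for name, mse in sorted(cv_mse.items(), key=lambda item: item[1]):
    print(f"{name}: {mse:.1f}")
```

Evaluating the baseline on the same folds makes its MSE directly comparable to the cross-validated scores of the other models.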

## DEPLOY

### Automate pipeline

In [None]:
#saving the gradient boosting model, which was fitted on the entire training set, to disk
import joblib
joblib_file = "GB_salary_model.pkl"
joblib.dump(gb, joblib_file)

#loading the saved model
model = joblib.load("GB_salary_model.pkl")
print("The model used is:", model)

In [None]:
#pre-processing the test dataset
#only jobId is dropped; duplicate rows are kept, so that every job posting receives a prediction
df_test = df_test.drop('jobId', axis=1)

In [None]:
#creating dummies
dummy_test_1 = pd.get_dummies(df_test["jobType"])
df_test = pd.concat([df_test, dummy_test_1], axis=1)
df_test.drop("jobType", axis = 1, inplace=True)

dummy_test_2 = pd.get_dummies(df_test["industry"])
df_test = pd.concat([df_test, dummy_test_2], axis=1)
df_test.drop("industry", axis = 1, inplace=True)

dummy_test_3 = pd.get_dummies(df_test["degree"], prefix="degree")
df_test = pd.concat([df_test, dummy_test_3], axis=1)
df_test.drop("degree", axis = 1, inplace=True)

dummy_test_4 = pd.get_dummies(df_test["major"], prefix="major")
df_test = pd.concat([df_test, dummy_test_4], axis=1)
df_test.drop("major", axis = 1, inplace=True)

dummy_test_5 = pd.get_dummies(df_test["companyId"])
df_test = pd.concat([df_test, dummy_test_5], axis=1)
df_test.drop("companyId", axis = 1, inplace=True)
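
The model expects the test columns to match the training columns exactly, in both name and order. That holds only if every category appears in both datasets. A minimal sketch of a safeguard on hypothetical toy data, assuming the training columns are still available at scoring time:

```python
import pandas as pd

# Hypothetical data: the test set lacks the OIL and AUTO categories
train = pd.DataFrame({"industry": ["OIL", "WEB", "AUTO"], "yearsExperience": [5, 2, 9]})
test = pd.DataFrame({"industry": ["WEB", "WEB"], "yearsExperience": [1, 4]})

train_encoded = pd.get_dummies(train, columns=["industry"])
test_encoded = pd.get_dummies(test, columns=["industry"])

# Reorder to the training columns; categories absent from the test data
# become all-zero columns, and columns unknown to the model are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned.columns.tolist())
```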

In [None]:
#scoring the test dataset
salary_predictions = model.predict(df_test)

#saving the predictions
np.savetxt('salary_predictions.csv', salary_predictions, delimiter=',')
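
A file of bare numbers is hard to map back to job postings. A minimal sketch of saving the predictions next to their IDs, assuming the `jobId` values were set aside in the same row order as the features passed to `predict`. The IDs and predictions below are hypothetical, and an in-memory buffer stands in for the CSV file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical IDs and predictions, in matching row order
job_ids = pd.Series(["JOB1", "JOB2", "JOB3"], name="jobId")
predictions = np.array([101.5, 87.25, 140.0])

# One row per posting, with a header naming both columns
submission = pd.DataFrame({"jobId": job_ids, "salary": predictions})

# Writing to a buffer here; a file path works the same way
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```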

In [None]:
#showing feature importances
importances = pd.Series(model.feature_importances_, index=df.columns)
importances.nlargest(10).plot(kind='bar', figsize=(12,6))
plt.show()

#saving feature importances together with the feature names
importances.rename("importance").to_csv('salary_importances.csv', index_label="feature")

## SUMMARY

I have developed a model that predicts the salaries of future job postings based on the salaries of current job postings. After performing the exploratory data analysis, I fitted three models and compared their mean squared errors, both against the baseline model and against each other. The model that performed best on the training data was the gradient boosting model. I saved the model, scored the test data with it, and saved the predictions in a CSV file. I also analyzed the feature importances and saved them in a separate file.