# Group work - Assessment 2

In this assignment, we will focus on salary prediction. The data set for this assignment includes information on job descriptions and salaries. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description. This matters because such a model can recommend a salary as soon as a job description is entered into a system.

## Description of Variables

The descriptions of the variables are provided in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** first three models to be completed by the first group member and checked by the second; last two models to be completed by the second group member and checked by the first.

**Discussion:** to be completed by both group members

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (8 points in total)

## Data Prep (6 points)

In [1]:
!pip install tensorflow
!pip install xgboost



In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
import scipy
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

In [3]:
data_df = pd.read_csv('jobs_alldata.csv')[['Job Description', 'Salary']]
display(data_df.head())

print('NAs in data:',data_df.isna().sum())
data_df.describe()

Unnamed: 0,Job Description,Salary
0,Civil Service Title: Regional Director Mental ...,67206
1,The New York City Comptrollerâ€™s Office Burea...,88313
2,With minimal supervision from the Deputy Commi...,81315
3,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,76426
4,Only candidates who are permanent in the Princ...,55675


NAs in data: Job Description    0
Salary             0
dtype: int64


Unnamed: 0,Salary
count,2413.0
mean,77990.330294
std,29202.739636
min,3624.0
25%,58064.0
50%,72689.0
75%,90518.0
max,224351.0


In [4]:
input_text = data_df['Job Description'].to_numpy()
input_text = input_text.astype(str)
input_text[0]

'Civil Service Title: Regional Director Mental Health Services, M-2  Full Time; Evenings and Weekends as needed  New York City Department of Health and Mental Hygiene, Division of Mental Hygiene seeks one fulltime Executive Director for the Co-Response/Heat Program. The unit employs law enforcement, clinical and non-clinical professionals to engage and support individuals with mental health issues, substance use, co-occurring disorders and health issues who can benefit from short-term engagement, support and linkage to services in the promotion of better health and criminal justice outcomes.   Program Background:   The Unit is comprised of three primary components:   -\tTriage Desk: act as Teamâ€™s â€œair traffic control,â€\x9d receiving and processing all incoming referrals to determine the most appropriate response, and provide clinical consultation to members of the force.   -\tCo-Response team: Clinician and NYPD officer teams conducting community deployments.   -\tHealth Engagemen

In [5]:
x = input_text
y = data_df['Salary'].to_numpy().astype(int)

# Remove outliers in the salary variable: drop rows below the 0.01 quantile (1st percentile) of salary
min_quant = np.quantile(y, 0.01)
# max_quant = np.quantile(y, 0.99)
# inds = np.where((y >= min_quant) & (y <= max_quant))[0]
inds = np.where((y >= min_quant))[0]
y = y[inds]
x = x[inds]
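
To make the cutoff concrete: `np.quantile(y, 0.01)` returns the 0.01 quantile, which is the 1st percentile of the salaries, and the same index array is then used to filter both the target and the text so the two stay aligned. The following is a minimal sketch on made-up values (the toy arrays and the `_kept` names are assumptions for illustration, not part of the assignment data).

```python
import numpy as np

# Hypothetical toy data: 100 salaries and matching job texts.
y_toy = np.arange(1, 101)
x_toy = np.array([f"job {i}" for i in y_toy])

# The 0.01 quantile equals the 1st percentile.
cutoff = np.quantile(y_toy, 0.01)

# Keep only rows at or above the cutoff, using the same indices for x and y.
keep = np.where(y_toy >= cutoff)[0]
y_kept = y_toy[keep]
x_kept = x_toy[keep]

print("cutoff:", cutoff, "rows kept:", len(y_kept))
```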

## Feature Engineering (1 point)

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)

In [6]:
# TF-IDF VECTORIZATION
# We created this feature because most ML algorithms cannot process raw strings directly,
# so the text has to be converted into a numeric vector representation. TF-IDF is one such method.
# TF-IDF combines two aspects: how often a word occurs in a document and how rare it is across all documents.
# TF (term frequency) is the count of a word in a document; IDF (inverse document frequency) is
# the weight that gives higher importance to words occurring in fewer documents and lower importance to more common words.
from sklearn import decomposition
import sklearn.utils.sparsefuncs

# Transform each job description into a sparse TF-IDF vector
tfidf_vect = TfidfVectorizer(stop_words='english', strip_accents='unicode')
tf_idf_x = tfidf_vect.fit_transform(x)
tf_idf_x1 = tf_idf_x

# Reduce to 750 components with TruncatedSVD, retaining roughly 90% of the variance in the data
pca = decomposition.TruncatedSVD(n_components=750)
tf_idf_x = pca.fit_transform(tf_idf_x)

# Compute the share of variance retained after the truncated SVD (the `pca` object above)
exp = np.var(tf_idf_x, axis=0)
full = sklearn.utils.sparsefuncs.mean_variance_axis(tf_idf_x1, axis = 0)[1].sum()
explained_variance_ratios = exp / full
confidence = sum(explained_variance_ratios)
print('%Variance retained after PCA:',round(confidence*100, 2))

%Variance retained after PCA: 91.57
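
In the cell above, the vectorizer and the SVD are fit on all rows before the train/test split, so the test descriptions influence the features. One possible alternative, sketched below under stated assumptions, is to fit both steps on the training text only and apply them to the test text. The toy corpus, the salary values, and names such as `text_features` are hypothetical; `explained_variance_ratio_` is the attribute scikit-learn's `TruncatedSVD` exposes for the retained variance.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the job descriptions.
docs = np.array([
    "senior data engineer builds data pipelines",
    "junior analyst prepares weekly reports",
    "director of engineering leads large teams",
    "clerical associate files office records",
    "data scientist trains prediction models",
    "office assistant answers phones and files records",
    "engineering manager leads data teams",
    "analyst reviews budget reports",
])
salaries = np.array([120, 60, 180, 45, 130, 40, 160, 70]) * 1000

# Split the raw text first, so nothing is fit on the test rows.
docs_train, docs_test, sal_train, sal_test = train_test_split(
    docs, salaries, test_size=0.25, random_state=0
)

# TF-IDF followed by truncated SVD, chained so both are fit together.
text_features = make_pipeline(
    TfidfVectorizer(stop_words="english", strip_accents="unicode"),
    TruncatedSVD(n_components=3, random_state=0),
)
feats_train = text_features.fit_transform(docs_train)  # fit on train only
feats_test = text_features.transform(docs_test)        # reuse fitted steps

svd = text_features.named_steps["truncatedsvd"]
retained = svd.explained_variance_ratio_.sum()
print("%Variance retained on training text:", round(retained * 100, 2))
```

With the real data, `n_components` would stay at the value chosen in the notebook; 3 is used here only because the toy corpus is tiny.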


In [7]:
# NORMALIZING TARGET VALUES
y_norm = (y - min(y)) / (max(y) - min(y))

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(tf_idf_x, y_norm, test_size=0.2)
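
Because the target is min-max scaled, every RMSE in the later cells has to be converted back to the original salary units with the same `* (max(y) - min(y)) + min(y)` expression. A small pair of helpers could keep that logic in one place. This is only a sketch; `make_scaler` and `rmse_original_units` are hypothetical names, not functions used elsewhere in the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def make_scaler(y):
    """Return (scale, unscale) functions for min-max scaling based on y."""
    y_min, y_max = float(np.min(y)), float(np.max(y))
    span = y_max - y_min

    def scale(values):
        return (np.asarray(values, dtype=float) - y_min) / span

    def unscale(values):
        return np.asarray(values, dtype=float) * span + y_min

    return scale, unscale


def rmse_original_units(y_true_scaled, y_pred_scaled, unscale):
    """RMSE computed after mapping both arrays back to the original units."""
    mse = mean_squared_error(unscale(y_true_scaled), unscale(y_pred_scaled))
    return float(np.sqrt(mse))


# Made-up salaries to show the round trip.
toy_salaries = np.array([40000.0, 60000.0, 80000.0, 100000.0])
scale, unscale = make_scaler(toy_salaries)
toy_scaled = scale(toy_salaries)

# A constant error of 0.1 on the scaled target is 0.1 * (max - min) in salary units.
toy_error = rmse_original_units(toy_scaled, toy_scaled + 0.1, unscale)
print("RMSE in original units:", round(toy_error, 2))
```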

## Find the Baseline (1 point)

In [8]:
from sklearn.metrics import mean_squared_error

mean_value = np.mean(y_train)
baseline_pred = np.repeat(mean_value, len(y_test))

baseline_pred_rescaled = baseline_pred * (max(y) - min(y)) + min(y)
y_test_rescaled = y_test * (max(y) - min(y)) + min(y)

print('Baseline RMSE:', round(np.sqrt(sklearn.metrics.mean_squared_error(y_test_rescaled, baseline_pred_rescaled)),2))


Baseline RMSE: 27552.34
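
The mean-prediction baseline above can also be expressed with scikit-learn's `DummyRegressor`, which may be convenient when the baseline should go through the same fit/predict code as the other models. The sketch below uses synthetic data (all arrays are made up) and checks that both approaches give the same predictions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the features and salaries.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(80, 5))
X_te = rng.normal(size=(20, 5))
y_tr = rng.normal(loc=70000, scale=20000, size=80)
y_te = rng.normal(loc=70000, scale=20000, size=20)

# DummyRegressor with strategy="mean" ignores X and predicts the training mean.
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_pred = dummy.predict(X_te)

# Manual equivalent, as done in the cell above.
manual_pred = np.repeat(np.mean(y_tr), len(y_te))

dummy_rmse = np.sqrt(mean_squared_error(y_te, dummy_pred))
print("Baseline RMSE (synthetic data):", round(dummy_rmse, 2))
```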


# Section 2: (7 points in total)

Build the following models:


## Decision Tree: (1 point)

In [9]:
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(min_samples_leaf = 10, max_depth=5, random_state=0)
regressor.fit(x_train, y_train)

train_preds = regressor.predict(x_train)
train_preds_rescaled = train_preds * (max(y) - min(y)) + min(y)
y_train_rescaled = y_train * (max(y) - min(y)) + min(y)
dtr_train_rmse = round(np.sqrt(sklearn.metrics.mean_squared_error(y_train_rescaled, train_preds_rescaled)),2)
print('Train RMSE:', dtr_train_rmse)

test_preds = regressor.predict(x_test)
test_preds_rescaled = test_preds * (max(y) - min(y)) + min(y)
y_test_rescaled = y_test * (max(y) - min(y)) + min(y)
dtr_test_rmse = round(np.sqrt(sklearn.metrics.mean_squared_error(y_test_rescaled, test_preds_rescaled)),2)
print('Test RMSE:', dtr_test_rmse)


Train RMSE: 20718.23
Test RMSE: 24427.48


## Voting regressor (2 points):

The voting regressor should have at least 3 individual models.

In [10]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import VotingRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

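reg1 = DecisionTreeRegressor(random_state=0)
reg2 = XGBRegressor()  # defined but not included in the ensemble below
reg3 = RandomForestRegressor(random_state=1)
reg4 = LinearRegression()
# Ensemble of three models: random forest, decision tree, and linear regression
er = VotingRegressor([('rf', reg3), ('dt',reg1), ('lr', reg4)])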
er.fit(x_train, y_train)

train_preds = er.predict(x_train)
train_preds_rescaled = train_preds * (max(y) - min(y)) + min(y)
y_train_rescaled = y_train * (max(y) - min(y)) + min(y)
vr_train_rmse = round(np.sqrt(sklearn.metrics.mean_squared_error(y_train_rescaled, train_preds_rescaled)),2)
print('Train RMSE:', vr_train_rmse)

test_preds = er.predict(x_test)
test_preds_rescaled = test_preds * (max(y) - min(y)) + min(y)
y_test_rescaled = y_test * (max(y) - min(y)) + min(y)
vr_test_rmse = round(np.sqrt(sklearn.metrics.mean_squared_error(y_test_rescaled, test_preds_rescaled)),2)
print('Test RMSE:', vr_test_rmse)

Train RMSE: 5742.57
Test RMSE: 15372.86
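
With default (equal) weights, a `VotingRegressor` predicts the plain average of its fitted members' predictions. The sketch below illustrates this on synthetic data from `make_regression`; the data and the smaller `n_estimators` value are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the reduced TF-IDF features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

ensemble = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=20, random_state=1)),
    ("dt", DecisionTreeRegressor(random_state=0)),
    ("lr", LinearRegression()),
])
ensemble.fit(X_demo, y_demo)

# Predictions of each fitted member, one column per model.
member_preds = np.column_stack(
    [member.predict(X_demo) for member in ensemble.estimators_]
)
averaged = member_preds.mean(axis=1)

print("Ensemble equals member average:",
      np.allclose(ensemble.predict(X_demo), averaged))
```

Since an unrestricted decision tree fits its training data almost perfectly, averaging it with other models tends to pull the ensemble's training error down, which is worth keeping in mind when reading the train/test gap above.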


## A Boosting model: (1 point)

Build either an AdaBoost or a GradientBoost model.

In [11]:
from xgboost.sklearn import XGBRegressor

xgb_regressor = XGBRegressor()
xgb_regressor.fit(X=x_train, y=y_train)

#Train RMSE
train_preds = xgb_regressor.predict(x_train)
train_preds_rescaled = train_preds * (max(y) - min(y)) + min(y)
y_train_rescaled = y_train * (max(y) - min(y)) + min(y)
xgbr_train_rmse = round(np.sqrt(sklearn.metrics.mean_squared_error(y_train_rescaled, train_preds_rescaled)),2)
print('Train RMSE:', xgbr_train_rmse)

#Test RMSE
test_preds = xgb_regressor.predict(x_test)
test_preds_rescaled = test_preds * (max(y) - min(y)) + min(y)
y_test_rescaled = y_test * (max(y) - min(y)) + min(y)
xgbr_test_rmse = round(np.sqrt(sklearn.metrics.mean_squared_error(y_test_rescaled, test_preds_rescaled)),2)
print('Test RMSE:', xgbr_test_rmse)

Train RMSE: 1558.34
Test RMSE: 14528.65
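
The gap between the train RMSE (1558.34) and the test RMSE (14528.65) is large. One common way to limit such a gap in boosting is to use shallow trees, a small learning rate, row subsampling, and early stopping on a held-out fraction. The sketch below shows these settings with scikit-learn's `GradientBoostingRegressor` rather than XGBoost; the synthetic data and every hyperparameter value are assumptions for illustration, not tuned results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the reduced TF-IDF features.
X_gb, y_gb = make_regression(
    n_samples=300, n_features=20, noise=25.0, random_state=0
)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_gb, y_gb, test_size=0.2, random_state=0
)

gbr = GradientBoostingRegressor(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # smaller steps per tree
    max_depth=3,              # shallow trees
    subsample=0.8,            # each tree sees 80% of the training rows
    validation_fraction=0.2,  # held out from the training set internally
    n_iter_no_change=10,      # stop when the validation score stalls
    random_state=0,
)
gbr.fit(Xg_train, yg_train)

gb_train_rmse = np.sqrt(mean_squared_error(yg_train, gbr.predict(Xg_train)))
gb_test_rmse = np.sqrt(mean_squared_error(yg_test, gbr.predict(Xg_test)))
print("Trees actually fitted:", gbr.n_estimators_)
print("Train RMSE:", round(gb_train_rmse, 2))
print("Test RMSE:", round(gb_test_rmse, 2))
```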


## Neural network: (1 point)

In [12]:
from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(50,50,50,50,50),
                       max_iter=1000,
                       early_stopping=True,
                       alpha = 0.001
                     )
model.fit(x_train, y_train)

#Train RMSE
train_pred = model.predict(x_train)
train_preds_rescaled = train_pred * (max(y) - min(y)) + min(y)
y_train_rescaled = y_train * (max(y) - min(y)) + min(y)
train_mse = mean_squared_error(y_train_rescaled, train_preds_rescaled)
nn_train_rmse = round(np.sqrt(train_mse), 2)

print('Train RMSE: {}' .format(nn_train_rmse))

#Test RMSE
test_pred = model.predict(x_test)
preds_rescaled = test_pred * (max(y) - min(y)) + min(y)
y_test_rescaled = y_test * (max(y) - min(y)) + min(y)
test_mse = mean_squared_error(y_test_rescaled, preds_rescaled)
nn_test_rmse = round(np.sqrt(test_mse), 2)
print('Test RMSE: {}' .format(nn_test_rmse))

Train RMSE: 7218.91
Test RMSE: 15199.39


## Grid search (2 points)

Perform either a full or randomized grid search on any model you want. The search must cover at least two parameters.

In [13]:
model2 = MLPRegressor(hidden_layer_sizes=(50,50,50,50,50),
                       max_iter=1000,
                       early_stopping=True,
                       alpha = 0.001
                     )

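# NOTE: this grid is defined but never passed to a search object (e.g. GridSearchCV),
# so model2 below is trained with the fixed settings above. The keys are also not
# MLPRegressor parameter names (MLPRegressor uses hidden_layer_sizes and learning_rate_init).
nn_params ={
    'learning_rate': [0.001, 0.005, 0.01, 0.05],
    'hidden0__units': [25, 50, 100],
    'hidden1__units': [25, 50, 100],
    'hidden2__units': [25, 50, 100],
    'hidden3__units': [25, 50, 100],
    'hidden4__units': [25, 50, 100]}
model2.fit(x_train, y_train)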

#Train RMSE
train_pred = model2.predict(x_train)
train_preds_rescaled = train_pred * (max(y) - min(y)) + min(y)
y_train_rescaled = y_train * (max(y) - min(y)) + min(y)
train_mse = mean_squared_error(y_train_rescaled, train_preds_rescaled)
grid_search_nn_train_rmse = round(np.sqrt(train_mse), 2)
print('Train RMSE: {}' .format(grid_search_nn_train_rmse))

#Test RMSE
test_pred = model2.predict(x_test)
preds_rescaled = test_pred * (max(y) - min(y)) + min(y)
y_test_rescaled = y_test * (max(y) - min(y)) + min(y)
test_mse = mean_squared_error(y_test_rescaled, preds_rescaled)
grid_search_nn_test_rmse = round(np.sqrt(test_mse), 2)
print('Test RMSE: {}' .format(grid_search_nn_test_rmse))

Train RMSE: 11343.88
Test RMSE: 16106.29
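
In the cell above, `nn_params` is never handed to a search object, and its keys are not `MLPRegressor` parameter names, so the reported numbers come from a single fixed model. A randomized search over valid `MLPRegressor` parameters could look like the sketch below. The synthetic data, the candidate values, and `n_iter=4` are assumptions chosen to keep the example small; they are not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic data with a min-max scaled target, mirroring the notebook setup.
X_nn, y_nn = make_regression(
    n_samples=150, n_features=10, noise=5.0, random_state=0
)
y_nn = (y_nn - y_nn.min()) / (y_nn.max() - y_nn.min())

# Two or more parameters, using names MLPRegressor actually accepts.
param_distributions = {
    "hidden_layer_sizes": [(25,), (50,), (50, 50)],
    "learning_rate_init": [0.001, 0.005, 0.01, 0.05],
    "alpha": [0.0001, 0.001, 0.01],
}

search = RandomizedSearchCV(
    MLPRegressor(max_iter=300, random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                               # number of sampled combinations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    random_state=0,
)
search.fit(X_nn, y_nn)

best_params = search.best_params_
cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print("Best parameters:", best_params)
print("Cross-validated RMSE (scaled target):", round(cv_rmse, 4))
```

After the search, `search.best_estimator_` is refit on the full training data and can be evaluated on the test set in the same way as the other models.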


# Discussion (5 points in total)


## List the train and test values of each model you built (2 points)

In [15]:
print('Decision tree train rmse: ', dtr_train_rmse, '\t\t\t', 'Decision tree test rmse: ', dtr_test_rmse)
print('Voting regressor train rmse:', vr_train_rmse, '\t\t\t', 'Voting regressor test rmse:', vr_test_rmse)
print('XGBoost train rmse: ', xgbr_train_rmse, '\t\t\t\t', 'XGBoost test rmse: ', xgbr_test_rmse)
print('Neural Network train rmse: ', round(nn_train_rmse, 2), '\t\t\t', 'Neural Network test rmse: ', round(nn_test_rmse, 2))
print('Grid search Neural Network train rmse: ', round(grid_search_nn_train_rmse, 2), '\t', 'Grid search Neural Network test rmse: ', round(grid_search_nn_test_rmse, 2))


Decision tree train rmse:  20718.23 			 Decision tree test rmse:  24427.48
Voting regressor train rmse: 5742.57 			 Voting regressor test rmse: 15372.86
XGBoost train rmse:  1558.34 				 XGBoost test rmse:  14528.65
Neural Network train rmse:  7218.91 			 Neural Network test rmse:  15199.39
Grid search Neural Network train rmse:  11343.88 	 Grid search Neural Network test rmse:  16106.29
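
The printed values can also be collected into a small table, which makes the train/test gap and the comparison with the baseline easier to read. The sketch below copies the numbers reported above (including the baseline RMSE of 27552.34); the column names `gap` and `test_vs_baseline_pct` are hypothetical.

```python
import pandas as pd

baseline_rmse = 27552.34  # from the baseline cell output

results = pd.DataFrame(
    {
        "train_rmse": [20718.23, 5742.57, 1558.34, 7218.91, 11343.88],
        "test_rmse": [24427.48, 15372.86, 14528.65, 15199.39, 16106.29],
    },
    index=[
        "Decision tree",
        "Voting regressor",
        "XGBoost",
        "Neural network",
        "Grid search neural network",
    ],
)

# A large positive gap means the model fits the training data much better
# than the test data.
results["gap"] = results["test_rmse"] - results["train_rmse"]

# Percentage reduction in test RMSE relative to the mean-prediction baseline.
results["test_vs_baseline_pct"] = 100 * (1 - results["test_rmse"] / baseline_rmse)

# For RMSE, lower is better, so the best model has the smallest test RMSE.
best_model = results["test_rmse"].idxmin()

print(results.sort_values("test_rmse").round(2))
print("Lowest test RMSE:", best_model)
```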


## Which model performs the best and why? (0.5 points) 
## How does it compare to baseline? (0.5 points)

Hint: The best model is the one with the best TEST score (regardless of any of the training values). For an error metric such as RMSE, that means the lowest test RMSE. If you select your model based on TRAIN values, you will lose points.



## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (1 point)

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (1 point)