In this project, we focus on salary prediction. The data set contains job descriptions and salaries. Using it, we can predict the salary of a job posting (i.e., the `Salary` column in the data set) from the job description. This is useful because such a model can make a salary recommendation as soon as a job description is entered into a system.

## Description of Variables

Descriptions of the variables are provided in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.




## Setup


In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(36926175)

## Get the data

In [2]:
#We will predict the "salary" value in the data set:

jobs = pd.read_csv("jobs_alldata.csv")
jobs.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptrollerâ€™s Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10


## Split the data

In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(jobs, test_size=0.3)
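The split above relies on the global NumPy seed set in the setup cell. A minimal sketch of an alternative, assuming a small made-up frame named `toy`, is to pass `random_state` directly so the split is repeatable even if other code consumes random numbers first:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the jobs data
toy = pd.DataFrame({'Salary': np.arange(10), 'Min_years_exp': np.arange(10) % 3})

# The same random_state gives the same split on every call
a_train, a_test = train_test_split(toy, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(toy, test_size=0.3, random_state=42)

print(a_test.index.equals(b_test.index))
print(len(a_train), len(a_test))
```

The value 42 is arbitrary; any fixed integer works.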

In [4]:
train_set.isna().sum()

Salary             0
Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64

## Data Prep

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

## Separate the target variable

In [6]:
train_t = train_set['Salary']
test_t = test_set['Salary']

train_inputs = train_set.drop(['Salary'], axis=1)
test_inputs = test_set.drop(['Salary'], axis=1)

## Separate the text column from the train and test data sets

In [7]:
train_text = train_inputs['Job Description']
train_othr = train_inputs.drop('Job Description',axis=1)

test_text = test_inputs['Job Description']
test_othr = test_inputs.drop('Job Description',axis=1)


In [8]:
train_text

544     NYC DOT Division of Traffic Operations is seek...
2389    OATA IT Services division is committed to tran...
1366    ACS is establishing a new case consultation fu...
93      Your Team:  The Division Tenant Resources HPD'...
1257    DoITT provides for the sustained, efficient an...
                              ...                        
45      The New York City Department of Health and Men...
22      The Division of Environmental Health works to ...
1520    The Commission on Human Rights (the Commission...
1759    ***PLEASE NOTE APPLICANTS MUST BE PERMANENT IN...
310     The mission of Forestry, Horticulture, and Nat...
Name: Job Description, Length: 1689, dtype: object

In [9]:
test_text

1503    Please read this posting carefully to make cer...
1962    The NYC Department of Environmental Protection...
656     The New York City Department of Transportation...
2098    The Comptroller's Bureau of Contract Administr...
984     The mission of Forestry, Horticulture and Natu...
                              ...                        
413     Please read this posting carefully to make cer...
714     The New York City Department of Environmental ...
532     Please read this posting carefully to make cer...
766     The New York City Housing Authority (NYCHA) is...
1517    The NYC Department of Environmental Protection...
Name: Job Description, Length: 724, dtype: object

In [10]:
train_othr.shape

(1689, 5)

In [11]:
test_othr

Unnamed: 0,Location,Min_years_exp,Technical,Comm,Travel
1503,HQ,2,1,3,0
1962,HQ,5,5,3,0
656,HQ,5,2,3,0
2098,East campus,5,4,2,0
984,West campus,4,1,3,1-5
...,...,...,...,...,...
413,East campus,1,2,4,0
714,Remote,5,2,3,0
532,HQ,5,2,4,0
766,Remote,1,1,4,0


## Feature Engineering Column

We create a feature-engineered column, `location_job`, from the `Location` column: its value is 1 if `Location` is "Remote" and 0 for all other locations.

In [12]:
train_othr['Location'].describe()

count     1689
unique       5
top         HQ
freq       648
Name: Location, dtype: object

In [13]:
train_othr['Location']

544                   HQ
2389    Southeast campus
1366         East campus
93           West campus
1257         East campus
              ...       
45                    HQ
22      Southeast campus
1520              Remote
1759         West campus
310                   HQ
Name: Location, Length: 1689, dtype: object

In [14]:
def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    df1['location_job'] = np.where(df1['Location'] == 'Remote', 1, 0)
    
    return df1[['location_job']]
    # You can use this to check whether the calculation is made correctly:
    #return df1

In [15]:
#Let's test the new function:

# Send train set to the function we created
new_col(train_othr)

Unnamed: 0,location_job
544,0
2389,0
1366,0
93,0
1257,0
...,...
45,0
22,0
1520,1
1759,0
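As a quick sanity check of the function above, the sketch below runs the same logic on a tiny hand-made frame (the frame `toy` is an assumption for illustration, not part of the data set). It also confirms that the input frame is left unchanged because the function works on a copy.

```python
import numpy as np
import pandas as pd

def new_col(df):
    # Work on a copy so the caller's dataframe is not modified
    df1 = df.copy()
    df1['location_job'] = np.where(df1['Location'] == 'Remote', 1, 0)
    return df1[['location_job']]

# Hypothetical toy input with known answers
toy = pd.DataFrame({'Location': ['Remote', 'HQ', 'East campus', 'Remote']})
flags = new_col(toy)

print(flags['location_job'].tolist())  # [1, 0, 0, 1]
print(list(toy.columns))               # ['Location']
```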


In [16]:
train_inputs.dtypes

Job Description    object
Location           object
Min_years_exp       int64
Technical           int64
Comm                int64
Travel             object
dtype: object

In [17]:
train_othr.dtypes

Location         object
Min_years_exp     int64
Technical         int64
Comm              int64
Travel           object
dtype: object

In [18]:
# Identify the numerical columns
numeric_columns = ['Min_years_exp','Technical','Comm']

# Identify the categorical columns
categorical_columns = ['Travel']

## Identify feature engineering column
feat_eng_columns = ['Location']

## Pipeline

In [19]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [20]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [21]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col))])

In [22]:

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional here: every column in train_othr is already assigned to a transformer.

## Transform: fit_transform() for TRAIN

In [23]:
#Fit and transform the train data
train_othr_trans = preprocessor.fit_transform(train_othr)

train_othr_trans

array([[-1.08660042, -1.04304216,  0.98576743, ...,  0.        ,
         0.        ,  0.        ],
       [-1.08660042,  1.41057651,  2.10998159, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.58283691, -1.04304216,  0.98576743, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.58283691, -0.22516927, -0.13844674, ...,  0.        ,
         0.        ,  1.        ],
       [-1.08660042,  0.59270362, -1.26266091, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.13931602,  0.59270362,  0.98576743, ...,  0.        ,
         0.        ,  0.        ]])

In [24]:
train_othr_trans.shape

(1689, 8)

## Transform: transform() for TEST

In [25]:
# Transform the test data
test_othr_trans = preprocessor.transform(test_othr)

test_othr_trans

array([[-0.53012131, -1.04304216, -0.13844674, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.13931602,  2.22844941, -0.13844674, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.13931602, -0.22516927, -0.13844674, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 1.13931602, -0.22516927,  0.98576743, ...,  0.        ,
         0.        ,  0.        ],
       [-1.08660042, -1.04304216,  0.98576743, ...,  0.        ,
         0.        ,  1.        ],
       [-1.08660042,  2.22844941, -0.13844674, ...,  0.        ,
         0.        ,  0.        ]])

In [26]:
test_othr_trans.shape

(724, 8)
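The train and test arrays have the same number of columns because the preprocessor is fitted on the train set only. A minimal sketch of what `handle_unknown='ignore'` does when the test set contains a category the encoder never saw (the toy `Travel` values below are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: '10-15' appears only in the test frame
train_travel = pd.DataFrame({'Travel': ['0', '1-5', '5-10']})
test_travel = pd.DataFrame({'Travel': ['1-5', '10-15']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_travel)

encoded = encoder.transform(test_travel).toarray()
print(encoder.categories_[0])
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, so the column count stays fixed at the number of categories learned from the train set.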

## Text Mining

In [27]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(train_text)

In [28]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and will not match the train set!

test_x_tr = tfidf_vect.transform(test_text)


In [29]:
train_x_tr.shape, test_x_tr.shape

((1689, 9802), (724, 9802))
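Both matrices have 9802 columns because the vocabulary is learned from the train text only. A small sketch with made-up documents shows the same behavior: words that appear only in the test text are dropped, and a test document with no known words becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the jobs data
train_docs = ["traffic signal engineer", "budget analyst for city contracts"]
test_docs = ["senior traffic analyst", "forestry and horticulture"]

vect = TfidfVectorizer(stop_words='english')
train_m = vect.fit_transform(train_docs)   # learns the vocabulary
test_m = vect.transform(test_docs)         # reuses the learned vocabulary

print(train_m.shape, test_m.shape)
print('forestry' in vect.vocabulary_)
print(test_m.toarray()[1])
```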

In [30]:
# These data sets are sparse matrices. We can't see their values unless we convert them using toarray()
train_x_tr

<1689x9802 sparse matrix of type '<class 'numpy.float64'>'
	with 249060 stored elements in Compressed Sparse Row format>

In [31]:
# Convert the sparse matrix to a dense array with toarray() to see its values
train_x_tr.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.05220597, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [32]:
tfidf_vect.vocabulary_

{'nyc': 5903,
 'dot': 2998,
 'division': 2932,
 'traffic': 8950,
 'operations': 6090,
 'seeking': 7929,
 'experienced': 3602,
 'candidate': 1476,
 'serve': 7982,
 'highway': 4309,
 'transportation': 9007,
 'specialist': 8252,
 'signals': 8097,
 'unit': 9189,
 'involved': 4880,
 'development': 2742,
 'maintenance': 5360,
 'cityâ': 1718,
 'extensive': 3639,
 'network': 5808,
 'transit': 8981,
 'signal': 8095,
 'priority': 6799,
 'tsp': 9073,
 'real': 7149,
 'time': 8863,
 'passenger': 6321,
 'information': 4649,
 'rtpi': 7742,
 'wayfinder': 9524,
 'systems': 8673,
 'supervision': 8594,
 'develop': 2736,
 'complex': 1958,
 'simulation': 8117,
 'models': 5647,
 'aimsun': 651,
 'vissim': 9430,
 'similar': 8107,
 'software': 8191,
 'support': 8608,
 'vision': 9421,
 'zero': 9727,
 'projects': 6891,
 'new': 5813,
 'york': 9717,
 'city': 1713,
 'additional': 493,
 'responsibilities': 7560,
 'include': 4562,
 'project': 6886,
 'scoping': 7856,
 'cost': 2275,
 'proposal': 6922,
 'contract': 2177

## Latent Semantic Analysis (Singular Value Decomposition)

In [33]:
from sklearn.decomposition import TruncatedSVD

In [34]:
# For Latent Semantic Analysis, the commonly recommended number of components is 100; this notebook uses 1100 instead

svd = TruncatedSVD(n_components=1100, n_iter=10)

In [35]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [36]:
train_x_lsa.shape

(1689, 1100)

In [37]:
train_x_lsa

array([[ 1.60378495e-01, -8.78083173e-02,  5.32549366e-03, ...,
         8.18765282e-04,  4.40826058e-04, -3.82985697e-04],
       [ 1.19773596e-01, -1.01424142e-01,  2.50156424e-02, ...,
         1.19601880e-04, -6.38791401e-04, -3.51819842e-04],
       [ 1.28522211e-01, -1.35634113e-01, -8.24286173e-02, ...,
        -3.03807259e-04, -1.18854827e-03,  1.17230333e-03],
       ...,
       [ 1.78041070e-01, -1.48291560e-01, -2.35037276e-01, ...,
         2.85718503e-02,  4.06183865e-02,  1.27626303e-03],
       [ 4.64270193e-01,  1.06468559e-01,  1.82868320e-01, ...,
        -1.39284640e-03, -8.73721586e-05,  9.53114480e-04],
       [ 1.40325566e-01, -6.36563698e-02, -2.05515473e-02, ...,
        -1.61297740e-03, -2.60755435e-04, -7.96640739e-04]])

In [38]:
svd.explained_variance_.sum()

0.9366861654558561
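Note that `explained_variance_` holds the variance captured by each component in absolute terms, so its sum is not a proportion. The fraction of total variance retained is given by `explained_variance_ratio_`. A minimal sketch on random toy data (the matrix `X` is an assumption for illustration) shows how the two relate:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical dense toy matrix standing in for the TF-IDF features
rng = np.random.RandomState(0)
X = rng.rand(50, 20)

svd_demo = TruncatedSVD(n_components=10, random_state=0)
svd_demo.fit(X)

absolute = svd_demo.explained_variance_.sum()        # variance captured
fraction = svd_demo.explained_variance_ratio_.sum()  # share of total variance
total_var = X.var(axis=0).sum()                      # total variance of X

print(absolute, fraction, absolute / total_var)
```

On the notebook's objects, `svd.explained_variance_ratio_.sum()` would report the retained share directly.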

### Let's transform the test data set

In [39]:
test_x_lsa = svd.transform(test_x_tr)

In [40]:
test_x_lsa.shape

(724, 1100)

In [41]:
train = np.hstack((train_othr_trans,train_x_lsa))

In [42]:
train

array([[-1.08660042e+00, -1.04304216e+00,  9.85767426e-01, ...,
         8.18765282e-04,  4.40826058e-04, -3.82985697e-04],
       [-1.08660042e+00,  1.41057651e+00,  2.10998159e+00, ...,
         1.19601880e-04, -6.38791401e-04, -3.51819842e-04],
       [ 5.82836913e-01, -1.04304216e+00,  9.85767426e-01, ...,
        -3.03807259e-04, -1.18854827e-03,  1.17230333e-03],
       ...,
       [ 5.82836913e-01, -2.25169269e-01, -1.38446742e-01, ...,
         2.85718503e-02,  4.06183865e-02,  1.27626303e-03],
       [-1.08660042e+00,  5.92703623e-01, -1.26266091e+00, ...,
        -1.39284640e-03, -8.73721586e-05,  9.53114480e-04],
       [ 1.13931602e+00,  5.92703623e-01,  9.85767426e-01, ...,
        -1.61297740e-03, -2.60755435e-04, -7.96640739e-04]])

In [43]:
train.shape

(1689, 1108)

In [44]:
test = np.hstack((test_othr_trans,test_x_lsa))

In [45]:
test

array([[-5.30121308e-01, -1.04304216e+00, -1.38446742e-01, ...,
         1.06706775e-02, -8.33398060e-03, -2.13247438e-02],
       [ 1.13931602e+00,  2.22844941e+00, -1.38446742e-01, ...,
        -2.33828121e-04, -3.09336460e-04,  7.10953032e-04],
       [ 1.13931602e+00, -2.25169269e-01, -1.38446742e-01, ...,
        -9.71341872e-03, -5.71303982e-03,  8.70737520e-03],
       ...,
       [ 1.13931602e+00, -2.25169269e-01,  9.85767426e-01, ...,
        -1.27800469e-03,  1.03265449e-02,  7.00969462e-03],
       [-1.08660042e+00, -1.04304216e+00,  9.85767426e-01, ...,
        -3.35768007e-03, -5.53803043e-04, -4.20166505e-03],
       [-1.08660042e+00,  2.22844941e+00, -1.38446742e-01, ...,
         1.10190757e-03,  1.15642365e-03, -1.56375139e-03]])

In [46]:
test.shape

(724, 1108)

## Find the Baseline 

In [47]:
from sklearn.metrics import mean_squared_error

In [48]:
#First find the average value of the target

mean_value = np.mean(train_t)

mean_value

77585.89698046181

In [49]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_t))

baseline_pred

array([77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
       77585.89698046, 77585.89698046, 77585.89698046, 77585.89698046,
      

In [50]:
baseline_mse = mean_squared_error(test_t, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 30239.657830851982
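The same mean baseline can also be expressed with scikit-learn's `DummyRegressor`, which keeps the baseline in the same fit/predict form as the later models. This is a sketch on made-up salary values, not the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy targets; the features are ignored by the dummy model
y_train = np.array([50000., 60000., 70000., 100000.])
y_test = np.array([55000., 90000.])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyRegressor(strategy='mean').fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# Same result as repeating the train mean by hand
manual_pred = np.repeat(y_train.mean(), len(y_test))
dummy_rmse = np.sqrt(mean_squared_error(y_test, dummy_pred))

print(dummy_pred, dummy_rmse)
```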


## Decision Tree

In [51]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(min_samples_leaf = 10) 

tree_reg.fit(train, train_t)

DecisionTreeRegressor(min_samples_leaf=10)

In [52]:
#Train RMSE
train_pred = tree_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 14158.535084490719


In [53]:
#Test RMSE
test_pred = tree_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23919.07475723277


## Handling Overfitting in Decision Tree Model

In [238]:
## from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=3)

tree_reg.fit(train, train_t)

DecisionTreeRegressor(max_depth=3)

In [239]:
#Train RMSE
train_pred = tree_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 25050.834990378513


In [240]:
#Test RMSE
test_pred = tree_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 27933.29009320444


## Voting regressor 

A voting regressor built from three individual models: a decision tree, an SVR, and an SGD regressor.

In [57]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(train, train_t)



VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=20)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))])

In [58]:
#Train RMSE
train_pred = voting_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 11779.44884153116


In [59]:
#Test RMSE

test_pred = voting_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 20379.785102898302
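With no weights given, a `VotingRegressor` predicts the plain average of its fitted estimators' predictions. The sketch below checks this on synthetic data; the data and the two simple estimators are assumptions chosen to keep the example fast, not the models used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

vote = VotingRegressor(estimators=[
    ('dt', DecisionTreeRegressor(max_depth=3, random_state=0)),
    ('lr', LinearRegression())])
vote.fit(X, y)

# Average the individual fitted estimators by hand
individual = np.column_stack([est.predict(X) for est in vote.estimators_])
manual_average = individual.mean(axis=1)

print(np.allclose(vote.predict(X), manual_average))
```

Because the output is an average, one poorly fitted member can pull the ensemble's predictions away from the better members.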


## Handling Overfitting in Voting Regressor Model

In [147]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(min_samples_leaf=40)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(train, train_t)



VotingRegressor(estimators=[('dt', DecisionTreeRegressor(min_samples_leaf=40)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))])

In [148]:
#Train RMSE
train_pred = voting_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 18070.87359955238


In [149]:
#Test RMSE

test_pred = voting_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 22994.885591697108


## A Boosting model

Build either an AdaBoost or a Gradient Boosting model

In [129]:
from sklearn.ensemble import AdaBoostRegressor 

#Create Adaptive Boosting with Decision Stumps (depth=1)
ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=1), n_estimators=500, 
            learning_rate=0.1) 

ada_reg.fit(train, train_t)

AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=1),
                  learning_rate=0.1, n_estimators=500)

In [130]:
#Train RMSE
train_pred = ada_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 28897.44095790657


In [131]:
#Test RMSE
test_pred = ada_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 29995.868021861857
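The other option named above is gradient boosting. The following is a minimal sketch on synthetic data, with hyperparameter values chosen only for illustration; it is not tuned for the jobs data and makes no claim about how it would score there.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared train/test arrays
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Shallow trees fitted sequentially on the residuals of the previous ones
gb_reg = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                   max_depth=2, random_state=0)
gb_reg.fit(X_tr, y_tr)

gb_test_rmse = np.sqrt(mean_squared_error(y_te, gb_reg.predict(X_te)))
base_test_rmse = np.sqrt(mean_squared_error(y_te, np.repeat(y_tr.mean(), len(y_te))))

print(gb_test_rmse, base_test_rmse)
```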


## Neural network

In [66]:
from sklearn.neural_network import MLPRegressor

#Default settings create 1 hidden layer with 100 neurons; here we use 1 hidden layer with 900 neurons
mlp_reg = MLPRegressor(hidden_layer_sizes=(900,))

mlp_reg.fit(train, train_t)



MLPRegressor(hidden_layer_sizes=(900,))

In [67]:
#Train RMSE
train_pred = mlp_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 71899.03111003427


In [68]:
#Test RMSE
test_pred = mlp_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 73715.07679870729


## Changing hyperparameters for Neural Network to handle underfitting

In [69]:
#Use 1 hidden layer with 700 neurons and raise max_iter from its default of 200 to 1000
mlp_reg = MLPRegressor(hidden_layer_sizes=(700,), max_iter=1000)

mlp_reg.fit(train, train_t)



MLPRegressor(hidden_layer_sizes=(700,), max_iter=1000)

In [70]:
#Train RMSE
train_pred = mlp_reg.predict(train)

train_mse = mean_squared_error(train_t, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 19850.858254313574


In [71]:
#Test RMSE
test_pred = mlp_reg.predict(test)

test_mse = mean_squared_error(test_t, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 24012.624618942526
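One possible contributor to the underfitting of the first network is the scale of the target: salaries are in the tens of thousands while the inputs are roughly standardized, so the network may need many iterations just to reach the right output range. A hedged sketch of one option, scaling the target with `TransformedTargetRegressor`, is shown below on synthetic data. The shifted target and the small network are assumptions for illustration; this was not run on the jobs data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data with a salary-like offset added to the target
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
y = 77000 + 50 * y

# The target is standardized before fitting and un-scaled after predicting
scaled_mlp = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
    transformer=StandardScaler())
scaled_mlp.fit(X, y)

pred = scaled_mlp.predict(X)
scaled_rmse = np.sqrt(mean_squared_error(y, pred))

print(scaled_rmse, y.std())
```

Predictions come back in the original salary units, so the RMSE remains directly comparable to the other models.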


## Grid search 



In [84]:
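# Note: RandomizedSearchCV samples n_iter parameter combinations instead of trying every combination in the grid
from sklearn.model_selection import RandomizedSearchCV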

gridParameters = [ {'min_samples_leaf': np.arange(10,30), 'max_depth': np.arange(10,30)}]

tree_regress = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_regress, gridParameters, cv=5, n_iter=10, scoring='neg_mean_squared_error',
                                verbose=1, return_train_score=True)
grid_search.fit(train, train_t)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=DecisionTreeRegressor(),
                   param_distributions=[{'max_depth': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29]),
                                         'min_samples_leaf': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29])}],
                   return_train_score=True, scoring='neg_mean_squared_error',
                   verbose=1)

In [85]:
cvresult = grid_search.cv_results_

for mean_score, params in zip(cvresult["mean_test_score"], cvresult["params"]):
    print(np.sqrt(-mean_score), params)

26296.095218649214 {'min_samples_leaf': 16, 'max_depth': 25}
26403.11115356089 {'min_samples_leaf': 17, 'max_depth': 29}
26371.0979346163 {'min_samples_leaf': 23, 'max_depth': 16}
26096.25299482654 {'min_samples_leaf': 22, 'max_depth': 27}
26371.0979346163 {'min_samples_leaf': 23, 'max_depth': 26}
26096.25299482654 {'min_samples_leaf': 22, 'max_depth': 15}
25656.609354189535 {'min_samples_leaf': 14, 'max_depth': 21}
25942.619342300688 {'min_samples_leaf': 12, 'max_depth': 18}
25804.043423318624 {'min_samples_leaf': 13, 'max_depth': 11}
26022.731341735183 {'min_samples_leaf': 25, 'max_depth': 22}


In [86]:
grid_search.best_params_

{'min_samples_leaf': 14, 'max_depth': 21}

In [87]:
grid_search.best_estimator_

DecisionTreeRegressor(max_depth=21, min_samples_leaf=14)

In [88]:
#Train RMSE
train_pred = grid_search.best_estimator_ .predict(train)
train_mse = mean_squared_error(train_t, train_pred)
train_rmse = np.sqrt(train_mse)
print ('Train RMSE: {}'.format (train_rmse))

Train RMSE: 16014.812565136084


In [89]:
#Test RMSE
test_pred = grid_search.best_estimator_ .predict(test)
test_mse = mean_squared_error(test_t, test_pred)
test_rmse = np.sqrt(test_mse)
print ('Test RMSE: {}'.format (test_rmse))

Test RMSE: 24453.915989836583
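The search above is randomized: it evaluates 10 sampled combinations out of the 400 possible ones. For comparison, an exhaustive search with `GridSearchCV` tries every combination in a grid. The sketch below uses synthetic data and a deliberately small grid, both assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# 3 x 3 = 9 combinations, all of which are evaluated
param_grid = {'min_samples_leaf': [5, 10, 20], 'max_depth': [3, 5, 10]}

exhaustive = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                          cv=5, scoring='neg_mean_squared_error')
exhaustive.fit(X, y)

n_candidates = len(exhaustive.cv_results_['params'])
best_cv_rmse = np.sqrt(-exhaustive.best_score_)

print(n_candidates, exhaustive.best_params_, best_cv_rmse)
```

The exhaustive search costs one fit per combination per fold, which is why a randomized search is often preferred when the grid is large.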


## List the train and test values of each model:
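One way to compare the models is to collect the RMSE values printed in the cells above into a single table and sort by test RMSE. The numbers below are copied from those outputs and rounded to two decimals; the model labels are shorthand introduced here.

```python
import pandas as pd

nan = float('nan')  # the baseline has no train RMSE reported

results = pd.DataFrame(
    [
        ("Baseline (mean)", nan, 30239.66),
        ("Decision tree (min_samples_leaf=10)", 14158.54, 23919.07),
        ("Decision tree (max_depth=3)", 25050.83, 27933.29),
        ("Voting regressor (max_depth=20)", 11779.45, 20379.79),
        ("Voting regressor (min_samples_leaf=40)", 18070.87, 22994.89),
        ("AdaBoost (stumps)", 28897.44, 29995.87),
        ("MLP (900 neurons)", 71899.03, 73715.08),
        ("MLP (700 neurons, max_iter=1000)", 19850.86, 24012.62),
        ("Randomized search decision tree", 16014.81, 24453.92),
    ],
    columns=["model", "train_rmse", "test_rmse"],
)

# Lower RMSE is better, so sort ascending on the test column
results = results.sort_values("test_rmse").reset_index(drop=True)
results["gap"] = results["test_rmse"] - results["train_rmse"]

print(results)
```

The `gap` column (test minus train RMSE) gives a rough view of how much each model overfits.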

## Which model performs the best and why? 
## How does it compare to baseline? 

Hint: The best model is the one that has the best TEST score, which here means the lowest test RMSE (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what steps were taken:

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, steps taken: