In [None]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv(r'C:\Users\User\Desktop\midterm-project\midterm-project\data\cleaned_data.csv')

## Preparation of Data 

This is a continuation of the previous notebook. If you have not seen it, please refer to [the previous notebook!](https://www.kaggle.com/kwangyangchia/midterm-project-for-mle) There is also a copy of that notebook stored in the GitHub repository. The previous notebook was mostly for cleaning the data and visualisation, but I have decided to encode the data here instead, to make the workflow a bit cleaner and so that I can control the modules' versions in the environment that I'm working in.

If you recall from the dataset, there are numerical and categorical variables. The categorical variables need to be encoded before the models can use them.

However, for certain categorical variables, we will be using LabelEncoder instead of a OneHotEncoder (or in this case, a DictVectorizer).

In [None]:
# Reusing code from the previous notebook
categorical = []
numerical = []

for column in data.columns:
    if data[column].dtype == 'object':
        categorical.append(column)
    else:
        numerical.append(column)
        
# Finding categorical variables with more than 20 unique values

for column in categorical:
    if data[column].nunique() > 20:
        print(column)

## Difference between a label encoder and a dict vectorizer/one-hot encoder
A label encoder maps each distinct value to an integer, which suits variables with many unique values (here, more than 20). A one-hot encoder, by contrast, creates one new 0/1 column per distinct value. Label encoding changes only the values in the encoded columns; the rest of the dataset is untouched.

We will be using the dict vectorizer once we have split the data. 

e.g. If we have datapoints ['Amsterdam', 'Paris', 'Amsterdam', 'Berlin'], a label encoder converts them to [0, 2, 0, 1], because scikit-learn's LabelEncoder numbers the classes in sorted order (Amsterdam = 0, Berlin = 1, Paris = 2). A OneHotEncoder would instead create columns "Amsterdam", "Berlin" and "Paris", each holding 0 or 1.

For more information, do read the [Label Encoder documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) for details! 
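To make the difference concrete, here is a small sketch on the toy city list. The feature name `city` is made up for the illustration and is not a column from the dataset.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

cities = ['Amsterdam', 'Paris', 'Amsterdam', 'Berlin']

# Label encoding: one integer per distinct value, numbered in sorted order
le = LabelEncoder()
codes = le.fit_transform(cities)
print(list(le.classes_))   # the learned classes, sorted alphabetically
print(codes.tolist())      # one integer code per input row

# One-hot encoding through DictVectorizer: one 0/1 column per distinct value
# ('city' is a hypothetical feature name used only for this example)
dv_demo = DictVectorizer(sparse=False)
one_hot = dv_demo.fit_transform([{'city': c} for c in cities])
print(dv_demo.feature_names_)
print(one_hot.tolist())
```

The label-encoded version keeps a single column, while the one-hot version grows by one column per distinct value, which is why label encoding is convenient for columns with many unique values.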

In [None]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder

label_encoded_columns = ['area', 'country']
le = LabelEncoder()

for column in label_encoded_columns:
    data[column] = le.fit_transform(data[column])

## Train Test Split of the Data 

We will be splitting the dataset into 60%/20%/20%, in terms of the training dataset, the validation dataset, and the test dataset.
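The 60%/20%/20% split comes from two consecutive calls: the first holds out 20% for testing, and the second takes 25% of the remaining 80% (0.25 × 0.8 = 0.2) for validation. A quick check on 1000 dummy rows, used here only as a stand-in for the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the real dataset

# First split: 80% full train, 20% test
full_train_demo, test_demo = train_test_split(rows, test_size=0.2, random_state=25)
# Second split: 25% of the 80% becomes validation, i.e. 20% of the total
train_demo, val_demo = train_test_split(full_train_demo, test_size=0.25, random_state=25)

print(len(train_demo), len(val_demo), len(test_demo))
```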

In [None]:
data_full_train, data_test = train_test_split(data, random_state = 25, test_size = 0.2, shuffle = True)
data_train, data_val = train_test_split(data_full_train, random_state = 25, test_size = 0.25, shuffle = True)

In [None]:
data_train = data_train.reset_index(drop = True)
data_val = data_val.reset_index(drop = True)
data_test = data_test.reset_index(drop = True)

In [None]:
y_train = data_train['totalyearlycompensation']
y_val = data_val['totalyearlycompensation']
y_test = data_test['totalyearlycompensation']

del data_train['totalyearlycompensation']
del data_val['totalyearlycompensation']
del data_test['totalyearlycompensation']

## Training the models - 1st round

We will be training the data on a few different regression models and judging the models based on root mean squared error (RMSE). 

There will not be any hyperparameter tuning just yet; we're just looking for the best regression model first.

Since the dataset itself isn't too big, we will be using cross-validation in order to better judge the results.

I have decided to report the RMSE and the standard deviation (SD) to 5 decimal places, mainly to highlight any difference between the normal linear regression and Ridge Regression.

In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
# Training function 
def train(data_train, y_train, model):
    dicts = data_train.to_dict(orient = 'records')
    
    dv = DictVectorizer(sparse = False)
    X_train = dv.fit_transform(dicts)
    
    model.fit(X_train, y_train)
    return dv, model 

In [None]:
# Prediction function
def predict(data_val, dv, model):
    dicts = data_val.to_dict(orient = 'records')
    
    # Reuse the vectorizer fitted in train(); refitting here could change the feature columns
    X_val = dv.transform(dicts)
    
    y_pred = model.predict(X_val)
    
    return y_pred
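One detail worth illustrating is the difference between `fit_transform` and `transform` on a DictVectorizer. The sketch below uses made-up rows (`title` and `years` are hypothetical features, not columns from the dataset) to show that a vectorizer fitted on the training rows keeps the same columns for new rows, while a freshly fitted one may not.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows, for illustration only
train_rows = [{'title': 'Engineer', 'years': 3}, {'title': 'Manager', 'years': 7}]
val_rows = [{'title': 'Engineer', 'years': 5}]

dv_fitted = DictVectorizer(sparse=False)
X_train_demo = dv_fitted.fit_transform(train_rows)  # learns the columns from the training rows
X_val_demo = dv_fitted.transform(val_rows)          # maps new rows onto those same columns

# Fitting a new vectorizer on the validation rows only sees 'Engineer',
# so the 'title=Manager' column is missing and the width no longer matches
X_refit_demo = DictVectorizer(sparse=False).fit_transform(val_rows)

print(dv_fitted.feature_names_)
print(X_train_demo.shape, X_val_demo.shape, X_refit_demo.shape)
```

A model trained on `X_train_demo` expects three columns, so only `X_val_demo` is a valid input for it.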

In [None]:
# RMSE function
from sklearn.metrics import mean_squared_error

def rmse(y_pred, y_val):
    # mean_squared_error expects (y_true, y_pred); the result is the same either way
    score = float(mean_squared_error(y_val, y_pred)) ** 0.5
    return score
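As a small worked example of what RMSE measures, here is the calculation by hand with NumPy on four made-up values (not taken from the dataset): square the errors, average them, and take the square root.

```python
import numpy as np

# Made-up values, for illustration only
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_demo = np.array([1.0, 2.0, 3.0, 8.0])

squared_errors = (y_true_demo - y_hat_demo) ** 2  # [0, 0, 0, 16]
mse_demo = squared_errors.mean()                  # 16 / 4 = 4
rmse_demo = np.sqrt(mse_demo)                     # sqrt(4) = 2

print(rmse_demo)
```

Because the errors are squared before averaging, a single large miss (here, 4 units on one row) dominates the score.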

## Basic Linear Regression
We will start off with basic linear regression as a baseline for the rest of the models. After all, it is the simplest regression.

In [None]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

In [None]:
from sklearn.model_selection import KFold 
n_splits = 5
scores = []

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    
for train_idx, val_idx in kfold.split(data_full_train):
    # data splitting
    data_train = data_full_train.iloc[train_idx]
    data_val = data_full_train.iloc[val_idx]
        
    # y values
    y_train = data_train['totalyearlycompensation']
    y_val = data_val['totalyearlycompensation']
    
    del data_train['totalyearlycompensation']
    del data_val['totalyearlycompensation']
        
    # training and predicting
    dv, model = train(data_train, y_train, linreg)
    y_pred = predict(data_val, dv, model)
    
    score = rmse(y_pred, y_val)
    scores.append(score)
    
print('RMSE for model %s: %.5f +- %.5f' % (model, np.mean(scores), np.std(scores)))

The baseline linear regression has an RMSE of 0.41078, with an SD of about 0.003. I think that's a pretty good result, even for a base model!

## Ridge Regression
Ridge Regression is basically Linear Regression with regularization. To find out more about regularization, you can refer to [my notebook under 2.13](https://www.kaggle.com/kwangyangchia/notebook-for-lesson-2-mle) or refer to the [wikipedia page](https://en.wikipedia.org/wiki/Regularization_(mathematics))

(yes i know i plugged my own notebook XD) 
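To give a feel for what the regularization does, here is a minimal sketch on synthetic data (the data and the `alpha` values are assumptions chosen for the illustration). Ridge adds a penalty on the size of the coefficients, controlled by `alpha` (1.0 by default in scikit-learn), so larger values shrink the coefficients more.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_default = Ridge(alpha=1.0).fit(X_demo, y_demo)    # default penalty strength
ridge_strong = Ridge(alpha=1000.0).fit(X_demo, y_demo)  # deliberately heavy penalty

# The coefficient norm shrinks as alpha grows
norm_ols = np.linalg.norm(ols.coef_)
norm_default = np.linalg.norm(ridge_default.coef_)
norm_strong = np.linalg.norm(ridge_strong.coef_)
print(norm_ols, norm_default, norm_strong)
```

With a few hundred rows and the default `alpha`, the shrinkage is tiny, so plain linear regression and Ridge can end up with nearly identical scores.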

In [None]:
from sklearn.linear_model import Ridge 
from sklearn.model_selection import KFold 

ridge = Ridge()
n_splits = 5
scores = []

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    
for train_idx, val_idx in kfold.split(data_full_train):
    # data splitting
    data_train = data_full_train.iloc[train_idx]
    data_val = data_full_train.iloc[val_idx]
        
    # y values
    y_train = data_train['totalyearlycompensation']
    y_val = data_val['totalyearlycompensation']
    
    del data_train['totalyearlycompensation']
    del data_val['totalyearlycompensation']
        
    # training and predicting
    dv, model = train(data_train, y_train, ridge)
    y_pred = predict(data_val, dv, model)
    
    score = rmse(y_pred, y_val)
    scores.append(score)
    
print('RMSE for model %s: %.5f +- %.5f' % (model, np.mean(scores), np.std(scores)))

As we can tell, even with cross-validation, the RMSE and SD are exactly the same when rounded to 5 decimal places, so regularization with the default settings does not noticeably change the results.

## Random Forest Regressor 

If you recall what a Random Forest Classifier is, a Random Forest Regressor works the same way, but instead of predicting a class (or a class probability), it predicts a continuous value.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()

In [None]:
from sklearn.model_selection import KFold 
n_splits = 5
scores = []

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    
for train_idx, val_idx in kfold.split(data_full_train):
    # data splitting
    data_train = data_full_train.iloc[train_idx]
    data_val = data_full_train.iloc[val_idx]
        
    # y values
    y_train = data_train['totalyearlycompensation']
    y_val = data_val['totalyearlycompensation']
    
    del data_train['totalyearlycompensation']
    del data_val['totalyearlycompensation']
        
    # training and predicting
    dv, model = train(data_train, y_train, rfr)
    y_pred = predict(data_val, dv, model)
    
    score = rmse(y_pred, y_val)
    scores.append(score)
    
print('RMSE for model %s: %.5f +- %.5f' % (model, np.mean(scores), np.std(scores)))

We can tell that the RMSE has decreased a bit, and the standard deviation between the scores has also decreased, if only slightly. Seems like Random Forest is working better than the normal linear regression!

## XGB Regressor
XGBoost is a gradient boosting algorithm that was covered back in Lesson 6. While we used the Classifier for that lesson, this one uses a Regressor to give us a predicted value instead!

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()

In [None]:
from sklearn.model_selection import KFold 
n_splits = 5
scores = []

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    
for train_idx, val_idx in kfold.split(data_full_train):
    # data splitting
    data_train = data_full_train.iloc[train_idx]
    data_val = data_full_train.iloc[val_idx]
        
    # y values
    y_train = data_train['totalyearlycompensation']
    y_val = data_val['totalyearlycompensation']
    
    del data_train['totalyearlycompensation']
    del data_val['totalyearlycompensation']
        
    # training and predicting
    dv, model = train(data_train, y_train, xgb)
    y_pred = predict(data_val, dv, model)
    
    score = rmse(y_pred, y_val)
    scores.append(score)
    
print('RMSE for XGBRegressor(): %.5f +- %.5f' % (np.mean(scores), np.std(scores)))

The XGBoost algorithm gives us the best scores, along with the best SD so far! 

## CatBoost Regressor
CatBoost is another gradient boosting algorithm used in machine learning. It is often reported to be fast and to perform well compared with other gradient boosting algorithms, although this depends on the dataset.

In [None]:
from catboost import CatBoostRegressor

cbr = CatBoostRegressor(silent = True)

In [None]:
from sklearn.model_selection import KFold 
n_splits = 5
scores = []

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    
for train_idx, val_idx in kfold.split(data_full_train):
    # data splitting
    data_train = data_full_train.iloc[train_idx]
    data_val = data_full_train.iloc[val_idx]
        
    # y values
    y_train = data_train['totalyearlycompensation']
    y_val = data_val['totalyearlycompensation']
    
    del data_train['totalyearlycompensation']
    del data_val['totalyearlycompensation']
        
    # training and predicting
    dv, model = train(data_train, y_train, cbr)
    y_pred = predict(data_val, dv, model)
    
    score = rmse(y_pred, y_val)
    scores.append(score)
    
print('RMSE for CatBoostRegressor(): %.5f +- %.5f' % (np.mean(scores), np.std(scores)))

The CatBoost Regressor gives the best results, as expected! However, we do have to note that the standard deviation of the scores is a little higher than that of the XGBoost Regressor!

## Hyperparameter Tuning

In order to optimize these algorithms properly, we can tune the parameters of the different models and create even better predictions from this. We will be using RandomizedSearchCV and GridSearchCV in order to tune these parameters. 

We will be using the full_train dataset for the hyperparameter tuning, and the test dataset is kept for the final evaluation.

linreg, ridge, rfr, xgb, catboost

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

## GridSearchCV vs RandomizedSearchCV

GridSearchCV and RandomizedSearchCV are two cross-validated search methods for tuning hyperparameters. GridSearchCV takes a dictionary of parameter values and tries every combination in it. After fitting, it exposes attributes such as best_estimator_, best_score_ and, most importantly, best_params_. This helps when we already have a narrow range of promising values and want the best one within it.

RandomizedSearchCV works the same way, except that it does not try every combination; it samples a fixed number of them (n_iter, 10 by default). This helps when we just want a rough estimate of where the good parameters lie.
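Here is a minimal sketch of the two searches side by side, on synthetic data with a small Ridge grid (the data, the grid values and `cv=3` are assumptions for the illustration). The scoring is set explicitly to negative RMSE here; by default both searches use the estimator's own `score` method, which is R² for regressors.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.2, size=120)

param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0], 'fit_intercept': [True, False]}

# Grid search: evaluates all 4 x 2 = 8 combinations
grid = GridSearchCV(Ridge(), param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid.fit(X_demo, y_demo)

# Randomized search: evaluates only n_iter sampled combinations
rand = RandomizedSearchCV(Ridge(), param_grid, n_iter=3, cv=3,
                          scoring='neg_root_mean_squared_error', random_state=0)
rand.fit(X_demo, y_demo)

print(len(grid.cv_results_['params']), len(rand.cv_results_['params']))
print(grid.best_params_, -grid.best_score_)  # best_score_ is negative RMSE, so flip the sign
print(rand.best_params_, -rand.best_score_)
```

The randomized search does less work, but it can only match the grid search if it happens to sample the best combination.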

In [None]:
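# Take the target straight from data_full_train so it stays row-aligned with the features.
# (y_train and y_val still hold the last cross-validation fold, in a different row order.)
y_full_train = data_full_train['totalyearlycompensation']

del data_full_train['totalyearlycompensation']

dicts = data_full_train.to_dict(orient = 'records')
dv = DictVectorizer(sparse = False)
X_full_train = dv.fit_transform(dicts)

dicts = data_test.to_dict(orient = 'records')
# Reuse the vectorizer fitted on the training data; do not refit it on the test set
X_test = dv.transform(dicts)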

## Linear Regression parameters 

There aren't a lot of parameters to play with for the original linear regression algorithm, therefore we really should just look at a few parameters here.

These parameters are mainly fit_intercept and normalize. Just for fun, let's do a grid search on these parameters since there aren't many things to train.

In [None]:
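# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this grid only runs on older versions
parameters = {'fit_intercept':(True, False), 'normalize': (True, False)}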

clf = GridSearchCV(linreg, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
clf.best_params_

Just as expected, the best parameters are the default ones. Therefore, we will not need to train the linear regression model again with the optimized parameters.

## Random Forest Regression 

The random forest regressor has more parameters to tune, as compared to linear regression. From here on, we will be using RandomizedSearchCV first before we further tune it with GridSearchCV. 

In [None]:
# RandomizedSearchCV
parameters = {
    'n_estimators': [10, 100, 200, 350, 500, 700, 850, 1000],
    'max_depth':[None, 5, 10, 20, 30, 37],
    'min_samples_split':[2, 10, 40, 100, 200, 300, 500, 1000],
    'min_samples_leaf':[1, 2, 10, 40, 100, 200, 300, 500, 1000],
    'bootstrap':(True, False),
}

clf = RandomizedSearchCV(rfr, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
clf.best_params_

In [None]:
# GridSearchCV

rfr = RandomForestRegressor(bootstrap = True, max_depth = None)
parameters = {
    'n_estimators': [150, 170, 190, 200, 210, 230, 250],
    'min_samples_split': [150, 170, 190, 200, 210, 230, 250],
    'min_samples_leaf': [150, 170, 190, 200, 210, 230, 250]
}

clf = GridSearchCV(rfr, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
rfr_params = clf.best_params_
rfr_params

## XGBoost Parameters 

Like the random forest regressor, there are more parameters to tune. I have defined a new XGBRegressor with the tree_method as gpu_hist so that I am able to use my graphics card for hyperparameter tuning.

In [None]:
xgb = XGBRegressor(tree_method = 'gpu_hist')

parameters = {
    'eta': [0.001, 0.01, 0.1, 0.3, 0.5, 1, 2, 5],
    'max_depth':[None, 5, 10, 20, 30],
    'min_child_weight': [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100]
    
}

clf = RandomizedSearchCV(xgb, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
clf.best_params_

In [None]:
parameters = {'min_child_weight': [1.5, 2, 2.5, 3, 4],
              'max_depth': [6, 8, 10, 12, 14, 16],
              'eta': [0.05, 0.075, 0.1, 0.15, 0.2, 0.25]}

clf = GridSearchCV(xgb, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
clf.best_params_

## CatBoost Parameters



In [None]:
cbr = CatBoostRegressor(task_type = 'GPU')

parameters = {
    'learning_rate': [0.001, 0.01, 0.1, 0.3, 0.5, 1, 2, 5],
    'l2_leaf_reg': [0.001, 0.01, 0.1, 1, 2, 5, 10],
    'bagging_temperature': [0,1],
    'depth':[1,2,3,4,5,6,7,8]
}

clf = RandomizedSearchCV(cbr, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
clf.best_params_

In [None]:
cbr = CatBoostRegressor(task_type = 'GPU', depth = 2)

parameters = {'learning_rate': [0.01, 0.02, 0.05, 0.07],
              'l2_leaf_reg': [0.001, 0.002, 0.003, 0.004, 0.005],
              'bagging_temperature': [1, 2, 3, 4, 5, 6]}

clf = GridSearchCV(cbr, parameters)
clf.fit(X_full_train, y_full_train)

In [None]:
clf.best_params_

## Training the models - round 2
We will be training the models further, now with the optimised hyperparameters. We will be using the same models, just that we are training the models with X_full_train and y_full_train in this case. 

In [None]:
linreg = LinearRegression()
rfr = RandomForestRegressor(bootstrap = True, max_depth = None, 
                            min_samples_leaf= 250, min_samples_split = 210, n_estimators = 230)
xgb = XGBRegressor(tree_method = 'gpu_hist', min_child_weight= 2.5, max_depth= 16, eta= 0.075)
cbr = CatBoostRegressor(task_type = 'GPU', depth = 2, learning_rate = 0.01, bagging_temperature = 3, l2_leaf_reg = 0.002, silent = True)

In [None]:
models= [linreg, rfr, xgb, cbr]
for model in models:
    model.fit(X_full_train, y_full_train)
    y_pred = model.predict(X_test)
    score = rmse(y_pred, y_test)
    print("For model %s: RMSE is : %.3f" % (model, score))

It seems that when we train on X_full_train and evaluate on X_test, the RandomForestRegressor and XGBRegressor have a slightly higher RMSE than the CatBoostRegressor and LinearRegression.


## End of Notebook No.2 

Thank you for reading notebook no.2, on training the models and hyperparameter tuning. In the next few Python files, I will be deploying the CatBoostRegressor model to the cloud!