## Comparison of Different Validation Methods

#### By:
#### Swati Kohli & Poonam Patil

## Performing Supervised Machine Learning using Lasso Regression with different Validation Methods
#### The notebook is an analytics exercise exploring the use of Lasso regression in scikit-learn as an efficient tool   for a high dimensional dataset (too many predictors, too few observations), as a first step is to screen out insignificant features.  However, the goal is to demonstrate different validation methods used to avoid overfitting and their comparison along with pros and cons of each towards model building.
### Objective:
#### Implement 3 validation methods by building a Lasso regression model to predict the total number of non-violate crimes (per 100k population) and compare performance. 
Original Dataset Source: https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
### Technique:  
#### The validation method is used to determine best model. The three validation methods used for computation here are:
1. Train / Validation / Test split
2. 5-Fold Cross Validation
3. 10-Fold Cross Validation 
  
#### For Lasso Regression, hyperparameter selection is done for the following parameters:
1. **alpha:** the lamba value in the penalty
2. **max iter:** the maximum number of iterations for optimization algorithm (Min. 50).
3. **tol:** the tolerance for optimization (max 0.1)  

### Performance Comparison
#### Finally, the results are compared across the three methods along the following metrics :
1. Time taken for hyperparameter selection
2. Coefficients of selected predictors
3. Prediction results or MSE on test set.

### Pre processing
The dataset was cleaned beforehand to work with and finally consists of 2118 observations, and 101 predictors + 1 response (total number of non-violent crimes).  
All relevant libraries are imported for the task, dataset loaded and classified as X (Predictor variables) and y(Response Variable).

In [1]:
# Import relevant Libraries
import pandas as pd
import numpy as np 

# For Lasso Regression
from sklearn import linear_model # For LASSO Regression 
from sklearn.linear_model import Lasso
from sklearn import metrics # For evaluation
from sklearn.metrics import mean_squared_error # For evaluation
from sklearn.preprocessing import StandardScaler # For scaling/standardizing dataset

# For Validation methods
from sklearn.model_selection import train_test_split # Dataset Splitting
import itertools # To form all Hyperparameter combination pairs
from sklearn.pipeline import Pipeline # Package to perform instructions 
from sklearn.model_selection import GridSearchCV # CV method

import time # For time evaluation

import warnings # Suppress warnings because they are annoying
warnings.filterwarnings('ignore') 

In [2]:
# Import Dataset
community = pd.read_csv('community.csv')
community.head()

Unnamed: 0,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,agePct16t24,agePct65up,...,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LandArea,PopDens,PctUsePubTrans,LemasPctOfficDrugUn,nonViolPerPop
0,11980,3.1,1.37,91.78,6.5,1.88,12.47,21.44,10.93,11.33,...,10.66,53.72,65.29,78.09,89.14,6.5,1845.9,9.63,0.0,1394.59
1,23123,2.82,0.8,95.57,3.44,0.85,11.01,21.3,10.48,17.18,...,8.3,77.17,71.27,90.22,96.12,10.6,2186.7,3.84,0.0,1955.95
2,29344,2.43,0.74,94.33,3.43,2.35,11.36,25.88,11.01,10.28,...,5.0,44.77,36.6,61.26,82.85,10.6,2780.9,4.37,0.0,6167.51
3,11245,2.76,0.53,89.16,1.17,0.52,24.46,40.53,28.69,12.65,...,1.74,73.75,42.22,60.34,89.02,11.5,974.2,0.38,0.0,9988.79
4,140494,2.45,2.51,95.65,0.9,0.95,18.09,32.89,20.04,13.26,...,1.49,64.35,42.29,70.61,85.66,70.4,1995.7,0.97,0.0,6867.42


In [3]:
# Classify as X & y (Predictors and Response variable)
X = community.copy()
del X['nonViolPerPop']
y = community['nonViolPerPop']

### Validation methods

### **Method 1** - Train, Validation & Test set  
**About**  
In Train & Test method approach, there is quite a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. Therefore, to avoid data leakage, instead of train test, Train, Validation & Test set method is used where dataset is partitioned in 3 parts. The training is done on the training set, post that evaluation is done on the validation set, and ultimately, final evaluation is done on the test set.
For this exercise, 50%, 20% & 30% dataset split is used.

**Approach:**  
Dataset is split
1. Split the dataset and create Train, Validation & test set of 50-21-30% respectively.
2. Set up the combination of various candidate values for hyperparameter selection.
2. Scale the training data and transform training and Valid set. This is to change the values of numeric columns in the dataset to a common scale.
3. Perform Lasso to learn the best hyperparameters based on MSE of validation set and compute time taken for this process.
4. Scale train + valid and transform test set
5. Fit the train + valid set with best lambda
6. Predict and find MSE on test set
7. Extract the final features, coefficients and their count.

In [4]:
#   ******** lasso method 1 : Train / validation / test split method ******** 

# 1. Split the dataset and create Train, Validation & test set of 50-21-30% respectively.

# Create 70% - 30 % split as train-test split of original data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 861)

# create 70% - 30% as training-validation split of train set (30% of train set is 21% of original data set)
X_training, X_valid, y_training, y_valid = train_test_split(X_train, y_train, test_size = 0.3, random_state = 861)

In [5]:
# 2. Set up candidate values for hyper parameter selection
lambdas= np.logspace(-10, 10, 21) # tuning parameter 
max_iter = [50,60,70]             # Maximum number of iterations taken for the solver to converge
tol= [0.0001, 0.001, 0.01, 0.1]   # Tolerance for stopping criteria

# Now we will form sets of lambdas, max_iter and tol using the itertools library to form all combination pairs.
hyperparameter_sets = list(itertools.product(lambdas, max_iter, tol)) 

In [6]:
# 3. Scale the data

start1 = time.time() # record start time
scaler = StandardScaler()
scaler.fit(X_training)
X_training = pd.DataFrame(scaler.transform(X_training))
X_valid = pd.DataFrame(scaler.transform(X_valid))


# 4. Lasso to evaluate best hyperparameter
validation_mse =[] 

for ind, sets in enumerate(hyperparameter_sets):
    lm= Lasso(alpha =sets[0], max_iter=sets[1],tol=sets[2])
    lm.fit(X_training, y_training)         # fit lasso on training set
    # predict on validation set
    validation_mse.append(metrics.mean_squared_error(lm.predict(X_valid), y_valid))

end1 = time.time() # record end time

print('Hyperparameter selection execution time in seconds:',end1-start1)
print('Min. validation MSE :', min(validation_mse))
print('Best hyper parameter set :',hyperparameter_sets[np.argmin(validation_mse)])

# select best hyper parameters
bestlambda = hyperparameter_sets[np.argmin(validation_mse)][0]
best_max_iter= hyperparameter_sets[np.argmin(validation_mse)][1]
best_tol= hyperparameter_sets[np.argmin(validation_mse)][2]


# 5. Now fit scaler on train set and then transform train and test set for standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

X_train.columns = X.columns.values # to assign column names to X_train set as Scaler transform resets column names


# 6. fit lasso on train set using best hyper parameters selected
lm=Lasso(alpha = bestlambda, max_iter = best_max_iter, tol=best_tol)
lm.fit(X_train, y_train)


# 7. Evaluate model on test set
MSE_method1 = metrics.mean_squared_error(lm.predict(X_test),y_test)
print('MSE of train-valid-test split method on test set:',MSE_method1)

Hyperparameter selection execution time in seconds: 2.747811794281006
Min. validation MSE : 2361000.874450426
Best hyper parameter set : (10.0, 70, 0.0001)
MSE of train-valid-test split method on test set: 3590601.4151061825


In [7]:
# 8. Extract the final features, coefficients and their count. 

# save lasso method1 coefficient in a dataframe
lasso_method1 = pd.DataFrame(zip(X_train.columns.values,lm.coef_))

# rename column names
lasso_method1.columns = ['Predictor', 'Lasso_Coef']

# set index to predictor
lasso_method1.set_index('Predictor',inplace=True)

# select lasso method1 significant predictor variables
lasso_signi_vars = lasso_method1[lasso_method1['Lasso_Coef'] != 0]
print('Number of significant variables in Lasso method 1:',lasso_signi_vars.shape[0])

Number of significant variables in Lasso method 1: 69


**Result**  
Lasso method 1 selected 69 predictor variables which are significant for modeling. 
We will compare the model performance later.

### **Method 2** - 5 fold CV + test set  
**About**
Cross Validation method with k fold is used for Hyperparameter tuning and better learning. A test set is still held out for final evaluation. The method performs CV where the training set is split into k smaller sets(in this case 5), training is done on training sets which is k-1 folds and then validation is done on remaining data to compute the performance measure(MSE) which is averaged over the loops of each hyperparameter set. Further refinement is done for further tuning the best hyperparamters. Final evaluation is done on the test set.  

**Approach:**  
1. Divide train and test set. (Train/Test partition of method 1 used with same hold out set).
2. Setup Pipeline to  
    a. Scale the data  
    b. Lasso algorithm for hyperparamter selection
3. Set parameters for each item in the pipeline
4. Perform CV 5 fold through a sparse Grid Search to find best hyperparameters.
5. Refine the grid search further for tuning the best hyperparameter and rerun the process.
6. Predict and find MSE on test set with refined hyperparameters.
7. Extract the final features, coefficients and their count.

In [8]:
#   ******** Method-2 : lasso regression with 5 fold cross validation ******** 

# 2. Set up the model pipeline
start2_1 = time.time() # record start time

estimator = Pipeline(steps = [('scale', StandardScaler()), # Scale the data
                     ('lasso', Lasso()) ]) # fit the scaled data using Lasso

# 3. Set up the parameters for each item in pipeline
parameters = {'lasso__alpha': np.logspace(-10,10,21),'lasso__max_iter': [50,60,70],
              'lasso__tol': [0.0001, 0.001, 0.01, 0.1]}

# 4. Instantiate gridsearch cross validation for the model in pipeline
reg2_1 = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 5, 
                   scoring = 'neg_mean_squared_error', n_jobs = -1) # Instantiate the gridsearch

# fit the model on train data
reg2_1.fit(X_train, y_train)

end2_1 = time.time() # record end time
print('Hyper parameter selection execution time in seconds:',end2_1-start2_1)
print('Best hyper parameter set :',reg2_1.best_params_)   # The best parameter from CV

print('MSE of CV5 on test set:',mean_squared_error(reg2_1.predict(X_test), y_test))

Hyper parameter selection execution time in seconds: 17.3836829662323
Best hyper parameter set : {'lasso__alpha': 10.0, 'lasso__max_iter': 50, 'lasso__tol': 0.0001}
MSE of CV5 on test set: 3589793.910006581


#### 5 fold CV Parameter Refinement

In [9]:
# 5. Best Hyperparameter set refinement

start2_2 = time.time() # record start time

# Set up the refined parameters
parameters = {'lasso__alpha': np.linspace(1,20,20),'lasso__max_iter': [60,65,70,75,80],
              'lasso__tol': [0.0001, 0.001, 0.01, 0.1]}

# Instantiate gridsearch cross validation
reg2_2 = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 5, 
                   scoring = 'neg_mean_squared_error', n_jobs = -1) 

# Fit the grid search, i.e. perform CV and grid search. 
reg2_2.fit(X_train, y_train) 

end2_2 = time.time() # record end time

print('Refined Hyperparameter selection execution time in seconds:', end2_2-start2_2)
print('Refined hyperparameter set :', reg2_2.best_params_) # The best parameter from CV

MSE_method2 = mean_squared_error(reg2_2.predict(X_test), y_test)
print('MSE of refined CV5 on test set:',MSE_method2)

Refined Hyperparameter selection execution time in seconds: 31.418373823165894
Refined hyperparameter set : {'lasso__alpha': 19.0, 'lasso__max_iter': 80, 'lasso__tol': 0.01}
MSE of refined CV5 on test set: 3696196.0844516433


In [10]:
# 6. Extract the final features, coefficients and their count. 
lasso_method2 = reg2_2.best_estimator_.named_steps['lasso'].coef_
lasso2 = pd.DataFrame(zip(X_train.columns.values,lasso_method2))
lasso2.columns = ['Predictor', 'LassoCV5_Coef']

# set index to predictor
lasso2.set_index('Predictor',inplace=True)

# select lasso method2 significant predictor variables
lasso_signi_vars2 =lasso2[lasso2['LassoCV5_Coef'] != 0]
print('Number of significant variables in Lasso method-2:', lasso_signi_vars2.shape[0])

Number of significant variables in Lasso method-2: 56


**Result**  
Lasso method 2 selected 56 predictor variables which are significant for modeling. 
We will compare the model performance later.

### **Method 3** - 10 fold CV + test set  
CV 10 fold is same as CV 5 fold in terms of approach. The only difference is the k folds will be 10 in this case. Also,Train/Test partition of method 1 used with same hold out set.

In [11]:
#   ******** Method-3 : lasso regression with 10 fold cross validation ********

# 2. Set up the model pipeline
start3_1 = time.time() # record start time

estimator = Pipeline(steps = [('scale', StandardScaler()), # Scale the data
                     ('lasso', Lasso()) ]) # regression model to use

# 3. Set up the parameters for each item in pipeline
parameters = {'lasso__alpha': np.logspace(-10,10,21),'lasso__max_iter': [50,60,70],'lasso__tol': [0.0001, 0.001, 0.01, 0.1]}

# 4. Instantiate gridsearch cross validation for the model in pipeline
reg3_1 = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 10, 
                   scoring = 'neg_mean_squared_error', n_jobs = -1) 

# Fit the grid search, i.e. perform CV and grid search. 
reg3_1.fit(X_train, y_train) 

end3_1 = time.time() # record end time

print('Hyper parameter selection execution time in seconds:',end3_1-start3_1)
print('Best hyper parameter pair :', reg3_1.best_params_) # The best parameter from CV
print('MSE of CV10 on test set:',mean_squared_error(reg3_1.predict(X_test), y_test))

Hyper parameter selection execution time in seconds: 32.18776512145996
Best hyper parameter pair : {'lasso__alpha': 10.0, 'lasso__max_iter': 50, 'lasso__tol': 0.0001}
MSE of CV10 on test set: 3589793.910006581


#### 10 fold CV Parameter Refinement

In [12]:
# 5. Best Hyperparameter set refinement

start3_2 = time.time() # record start time

# Set up the refined parameters
parameters = {'lasso__alpha': np.linspace(1,20,20),'lasso__max_iter': [60,65,70,75,80],'lasso__tol': [0.0001, 0.001, 0.01, 0.1]}

# Instantiate gridsearch cross validation
reg3_2 = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 10, 
                   scoring = 'neg_mean_squared_error', n_jobs = -1) 

# Fit the grid search, i.e. perform CV and grid search. 
reg3_2.fit(X_train, y_train) 

end3_2 = time.time()
print('Refined Hyperparameter selection execution time in seconds:',end3_2-start3_2) # set end time
print('Best hyper parameter pair :', reg3_2.best_params_) # The best parameter from CV

MSE_method3 = mean_squared_error(reg3_1.predict(X_test), y_test)
print('MSE of refined CV10 on test set:',MSE_method3)

Refined Hyperparameter selection execution time in seconds: 55.72141122817993
Best hyper parameter pair : {'lasso__alpha': 13.0, 'lasso__max_iter': 60, 'lasso__tol': 0.1}
MSE of refined CV10 on test set: 3589793.910006581


In [13]:
# 6. Extract the final features, coefficients and their count. 

# Extract regression coefficients of model
lasso_method3 = reg3_2.best_estimator_.named_steps['lasso'].coef_
lasso3=pd.DataFrame(zip(X_train.columns.values,lasso_method3))
lasso3.columns = ['Predictor', 'LassoCV10_Coef']
lasso3.set_index('Predictor',inplace=True)

# select lasso method3 significant predictor variables
lasso_signi_vars3 =lasso3[lasso3['LassoCV10_Coef'] != 0]
print('Number of significant variables in Lasso method-3:',lasso_signi_vars3.shape[0])

Number of significant variables in Lasso method-3: 66


**Result**  
Lasso method 3 selected 66 predictor variables which are significant for modeling. 
## Lets compare the model performances now.

### Metric Evaluation
Lets compare the performances across all the three methods

### Metric 1 - Time Comparison

In [14]:
print('Hyperparameter selection execution time in seconds:')
print('Time for train valid and test Method:',end1-start1)
print('Time for 5 fold CV Method:',end2_2-start2_2)
print('Time for 10 fold CV Method:',end3_2-start3_2)

Hyperparameter selection execution time in seconds:
Time for train valid and test Method: 2.747811794281006
Time for 5 fold CV Method: 31.418373823165894
Time for 10 fold CV Method: 55.72141122817993


**Result**
Computational time for 10 fold CV is highest and that for train-valid-split method is lowest.  

### Metric 2 - Prediction Error comparison

The comparison is demonstrated on test set for all the three methods

In [15]:
print('MSE on test set:',MSE_method1)
print('MSE of refined CV5 on test set:',MSE_method2)
print('MSE of refined CV10 on test set:',MSE_method3)

MSE on test set: 3590601.4151061825
MSE of refined CV5 on test set: 3696196.0844516433
MSE of refined CV10 on test set: 3589793.910006581


**Result**
The comparison shows the least prediction error (MSE) for CV 10 fold method.  

### Metric 3 - Comparison of predictor coefficient 

In [16]:
# The coefficients extracted by three methods are put together for comparision of coefficients values.
Coef_compare = pd.concat([lasso_method1, lasso2, lasso3], axis=1,join='outer')

Unnamed: 0_level_0,Lasso_Coef,LassoCV5_Coef,LassoCV10_Coef
Predictor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
population,0.0,0.0,0.0
householdsize,-152.807789,-107.140099,-217.70968
racepctblack,89.451908,118.645739,114.3447
racePctWhite,-0.0,-0.0,-0.0
racePctAsian,20.556429,3.855345,19.458797


Features selected by Lasso method 1 are 69, 56 by method2 and 66 by method 3. Thus Lasso method 2 (i.e. 5 fold cross validation) minimizes many coeffients to zero thus proving best in feature selection.

Also Lasso method 2 does maximum coefficient shrinkage for most of the coefficients.

In [17]:
# to get the best predictors in the model, we need to sort coefficients in descending order on its absolute value

# create new column with absolute values of coefficients from method-2 as we selected method 2 as best model
Coef_compare['abs_CV5_Coeff'] = abs(Coef_compare['LassoCV5_Coef'])

# sort on new column with absolute coefficient values
Coef_compare_sorted = Coef_compare.sort_values(by = 'abs_CV5_Coeff', ascending = False)

# delete column with absolute values
del Coef_compare_sorted['abs_CV5_Coeff']

Coef_compare_sorted.head(10)

Unnamed: 0_level_0,Lasso_Coef,LassoCV5_Coef,LassoCV10_Coef
Predictor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
PctKids2Par,-822.26726,-897.761317,-409.737149
pctWSocSec,529.006152,375.839726,419.204422
MalePctDivorce,394.396462,367.172051,501.053654
PopDens,-378.781304,-311.951733,-337.557425
PctPopUnderPov,361.63033,266.154889,477.237506
PctEmploy,486.053597,264.377057,379.02979
PctBornSameState,-250.985026,-254.108229,-268.803805
pctWRetire,-267.383353,-247.884818,-251.506784
PctForeignBorn,492.777624,205.096989,266.218797
OwnOccMedVal,-202.427118,-197.41075,-115.462639


Lasso 5-fold cross validation method selected PctKids2Par (percentage of kids in family housing with two parents) as best predictor of total number of non-violent crimes.

#### Coefficint Interpretation for first two predictors with method2:
1. If standerdize percentage of kids in family housing with two parents increases by 1 standard deviation then standerdize total number of non-violent crimes decreases by 897.76 standard deviation.
2. If standerdize percentage of households with social security income in 1989 increases by 1 standard deviation then standerdize total number of non-violent crimes increases by 375.83 standard deviations.

### Model Comparison Interpretation
Since the datasize is small and number of predictors are large, in terms of feature selection, CV method (in particular 5 fold) is the best for initial model building and feature selection. This is because even though Train Valid Test method is quick, it's model fitting is predominatly based on chance(by applying hyperparameter selection on only one validation set split). This is also why it takes less time for computation and not very relaible for providing least error on hold out set(test set).

Comparing between 5 Fold & 10 Fold execution time is less for 5 Fold. The CV 10 fold model performs best in error prediction as MSE is reduced by almost 3% from CV 5 to CV 10. But there is a trade off between prediction result and computational time (18 seconds vs almost 1 minute!) & power for hyperparameter selection. For initial model building even though, 10 fold CV provides the best model fitting out of the three methods, since it takes higher computational power and time, it is not a preffered method. 


### Pros and Cons of the Methods

**Method 1** 

PROS  :Least computational time so is a fast method

CONS  :However, by partitioning the data into three sets, the number of samples are reduced significantly which could have been used for learning the model, the results depend on a particular random choice for the pair of (train, validation) sets.

**Method 2 & 3**  
Pros  : There is less data wastage, less dependence on luck for fixing an arbitrary validation set. Therefore better model fitting.

Cons  : Computationally expensive.

**Method 2 vs method 3**  
Longer computational time and power in 10 fold cross validation.

### Conclusion
Due to the option of training on multiple train-test split (as k folds), cross validation is the preffered method predominantly. It also gives a probability of better performance on unseen data(hold-out set). Train Valid Test method, on the other hand, only trains on one validation set which makes the result dependent on chance or luck.
Method-1 is good for initial model building or with a very large dataset because it is much quicker compared to CV method that takes more computational power and time to run.
The CV method is a more accurate representation of how the model will perform on unseen data than Train Valid Test method.

 