# Regression Block: A modularized approach to test and tune multiple regression algorithms with minimal manual intervention!

_Before you read: Note that this module is intended to facilitate quicker experimentation for users with a moderate understanding of regression models and Python programming._

## Introduction
While working on one of my pet projects, I realised that it often pays to test several model forms to find the one that offers a good balance of accuracy, complexity and execution efficiency for the problem at hand. Some software products, such as RapidMiner, provide this functionality. However, using a software product for this purpose results in a black-box approach, which makes it hard to tune the model or explore its intricacies. Hence I decided to create a simple Python script with just enough modularization and parameterization to enable testing and tuning many of the widely used regression algorithms with minimal changes to the code.
The summary of this notebook is as follows:

#### Objective:
To test, tune and compare various regression models with minimal manual intervention in Python.
The models included in this module are:
* Linear Regression
* Ridge Regression
* Lasso Regression
* K Nearest Neighbors
* Bayesian Ridge
* Decision Tree Regression
* Random Forest
* Bagging (Using decision tree by default)
* Gradient boosting
* XGBoost
* Support Vector Machines

#### User Proficiency:
The user should have an intuitive understanding of how each of these algorithms works, along with a good understanding of how changing a particular hyper-parameter might impact the outcome. A basic understanding of Python is required to use the code effectively and to customize it further.

#### Key Modifiable Inputs:
Below are the key inputs (More details are provided for each input in the inline comments). These sections have been highlighted in the code with a note '__MAKE MODIFICATIONS HERE__':
* Input dataset for regression analysis: In this example, I have used the 'diabetes' dataset from scikit-learn's built-in datasets
* Test data proportion: Between 0 and 1, default 0.3 (or 30%)
* Normalization: 0 - No Normalization, 1 - Min-max scaling, 2 - Z-score scaling
* List of model objects to test
* Number of folds for grid-search (hyper-parameter tuning)
* Scoring criteria to determine the best model (e.g. Mean squared error) - more details are provided in the code comments
* Flag to see the level of detail on the terminal during model fit: 0 - No output, 1 - All details, 2 - Progress bar
* Hyper-parameter library: A global dictionary in the code that provides a set of hyper-parameter values to tune over for each model form
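
As a rough illustration of the two normalization options listed above, the sketch below applies min-max and z-score scaling to a small made-up column using only the standard library. The helper names `min_max_scale` and `z_score_scale` are hypothetical and not part of this module; the script itself delegates this step to scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Center values on 0 and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

column = [2.0, 4.0, 6.0, 8.0]  # made-up feature values, for illustration only
scaled_01 = min_max_scale(column)
scaled_z = z_score_scale(column)
print(scaled_01)
print(scaled_z)
```

Both helpers assume the column is not constant, since a constant column would lead to a division by zero.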

#### General Execution Steps:
After taking these inputs, the following actions are performed for __each__ model form under consideration:
* Forward feature selection
* Normalization
* Grid search for hyper-parameter tuning
* Metric calculation for the best model

#### Output
A pandas dataframe 'results' is created, which provides the following metrics for each of the model forms you are testing:
* Model details with the best hyper-parameters found
* Train and test root mean squared errors
* Train and test mean absolute percentage errors

This table helps compare the model forms, while the gap between the train and test metrics can be a good indicator of overfitting.
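
One simple way to read such a table for overfitting is to look at the ratio of test error to train error for each row. The sketch below does this on made-up numbers; the model names, the error values, and the 1.5 threshold are purely illustrative assumptions, not results from this notebook.

```python
# Made-up (train RMSE, test RMSE) pairs, for illustration only
example_rows = {
    'ModelA': (54.0, 56.0),
    'ModelB': (20.0, 62.0),
}

def overfit_ratio(train_rmse, test_rmse):
    """Test error divided by train error; values well above 1 hint at overfitting."""
    return test_rmse / train_rmse

THRESHOLD = 1.5  # arbitrary cut-off, chosen only for this example

flagged = [name for name, (train, test) in example_rows.items()
           if overfit_ratio(train, test) > THRESHOLD]
print(flagged)
```

Here only the hypothetical `ModelB` is flagged, because its test error is roughly three times its train error.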

#### Important Note
This module does not deal with feature engineering in any way; it only performs feature selection on the input data. Effective feature engineering is essential for improving results with any model. A user might observe one model form giving better results than another, but the overall performance of any model can be improved significantly by improving the predictor variables.

# Script

#### Environment Setup
This section imports all the required packages.

In [None]:
# importing general purpose libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import dfply as dp
import math
import random
import warnings
from sklearn import datasets

# importing model selection and evaluation libraries

# train-test-validation dataset creation
from sklearn.model_selection import train_test_split

# data normalization
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Pipeline
from sklearn.pipeline import Pipeline

# feature selection
from mlxtend.feature_selection import SequentialFeatureSelector
from mlxtend.plotting import plot_sequential_feature_selection

# hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# crossvalidation
from sklearn.model_selection import cross_val_score, KFold

# accuracy testing
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

# Importing models

# linear models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.linear_model import BayesianRidge

# non-parametric models
from sklearn.neighbors import KNeighborsRegressor

# Decision tree
from sklearn.tree import DecisionTreeRegressor

# Support vector machine
from sklearn.svm import SVR

# ensemble models

# bagging
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# tree based boosting
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

# stacking
from mlxtend.regressor import StackingRegressor


#### Modules for various tasks

The first function creates the pipeline for normalization and grid search based on conditions specified by the user in the control panel.

In [None]:
def create_pipeline(norm, model):
    # norm: 0 - no normalization, 1 - min-max scaling, 2 - z-score scaling
    if norm == 1:
        scale = MinMaxScaler()
        pipe = Pipeline([('norm', scale), ('reg', model)])
    elif norm == 2:
        scale = StandardScaler()
        pipe = Pipeline([('norm', scale), ('reg', model)])
    else:
        pipe = Pipeline([('reg', model)])
    return pipe
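
Because the model step is always named `'reg'`, its hyper-parameters are addressed through the pipeline as `reg__<parameter>`, which is the naming used throughout the hyper-parameter dictionary further down. A minimal sketch of that convention, using `Ridge` and `StandardScaler` as arbitrary stand-ins:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as create_pipeline: 'norm' for the scaler, 'reg' for the model
pipe = Pipeline([('norm', StandardScaler()), ('reg', Ridge())])

# Parameters of a step are exposed as '<step name>__<parameter name>'
has_alpha = 'reg__alpha' in pipe.get_params()
print(has_alpha)

# Setting the prefixed name updates the wrapped estimator
pipe.set_params(reg__alpha=10.0)
print(pipe.named_steps['reg'].alpha)
```

This is also why the grid keys have to change if the step is ever renamed from `'reg'`.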

The second function performs sequential feature selection (forward when `selection` is True) and returns the indices of the best features.

In [None]:
def select_features(model, X_train, Y_train, selection,
                    score_criteria, see_details, norm=0):
    pipe = create_pipeline(norm, model)
    sfs = SequentialFeatureSelector(pipe,
                                    forward=selection,
                                    k_features='best',
                                    scoring=score_criteria,
                                    verbose=see_details)
    sfs = sfs.fit(X_train, Y_train)
    return list(sfs.k_feature_idx_)
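
To make the idea concrete, here is a simplified, library-free sketch of greedy forward selection. It is not mlxtend's implementation: the function `forward_select` and the toy scoring function are hypothetical, whereas the real selector scores each candidate subset by cross-validating the pipeline.

```python
def forward_select(n_features, score_fn):
    """Greedily add the feature that most improves score_fn (higher is better).

    Returns the best-scoring subset seen across all subset sizes.
    """
    selected = []
    remaining = list(range(n_features))
    best_subset, best_score = [], float('-inf')

    while remaining:
        # Score every candidate obtained by adding one more feature
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        step_score, step_feature = max(scored)
        selected.append(step_feature)
        remaining.remove(step_feature)
        # Remember the best subset over all sizes
        if step_score > best_score:
            best_score, best_subset = step_score, list(selected)

    return sorted(best_subset)

# Toy score: features 0 and 2 are useful, every extra feature costs a little
def toy_score(subset):
    useful = {0: 3.0, 2: 2.0}
    return sum(useful.get(f, 0.0) for f in subset) - 0.5 * len(subset)

chosen = forward_select(4, toy_score)
print(chosen)  # [0, 2]
```

The loop keeps adding features until none remain and then reports the best subset seen along the way, which mirrors the intent of `k_features='best'` in the call above.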

The third function performs a grid search over the provided parameter grid and returns the best model object.

In [None]:
def run_model(model, param_grid, X_train, Y_train,
              X, Y, score_criteria, folds,
              see_details, norm=0):
    pipe = create_pipeline(norm, model)
    model_grid = GridSearchCV(pipe,
                              param_grid,
                              cv=folds,
                              scoring=score_criteria,
                              verbose=see_details)
    model_grid.fit(X_train, Y_train)

    return model_grid.best_estimator_

The last function calculates all the relevant metrics for the best hyper-parameter combination and returns a pandas series of these metrics.

In [None]:
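def get_model_eval(model, X_train, Y_train, X_test, Y_test):
    # Predict once per split and reuse the predictions for both metrics
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    return pd.Series([model,
                      np.sqrt(mean_squared_error(Y_train, pred_train)),  # train RMSE
                      np.sqrt(mean_squared_error(Y_test, pred_test)),    # test RMSE
                      (abs(pred_train - Y_train) / Y_train).mean(),      # train MAPE
                      (abs(pred_test - Y_test) / Y_test).mean()])        # test MAPE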
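
For reference, the two metric families reported in the results table can be worked through by hand on a tiny example. The sketch below uses made-up targets and predictions and plain Python; the helper names `rmse` and `mape` are illustrative and not part of the module. Note that MAPE divides by the actual values, so it is undefined when any target is zero.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)

actual = [100.0, 200.0, 400.0]     # made-up targets
predicted = [110.0, 180.0, 400.0]  # made-up predictions

print(round(rmse(actual, predicted), 2))  # 12.91
print(round(mape(actual, predicted), 4))  # 0.0667
```

The squared residuals are 100, 400 and 0, so the mean squared error is about 166.67 and its square root is about 12.91, while the absolute percentage errors are 10%, 10% and 0%.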

#### Global Hyper-parameter Dictionary (__MAKE MODIFICATIONS HERE__)
This is the global dictionary of hyper-parameters for all the models in this module. Default values covering typical ranges for the diabetes dataset have been populated in the code. The dictionary contains some of the key hyper-parameters for each model and is not exhaustive. Users are encouraged to consult the scikit-learn documentation for the full list of parameters and to extend the dictionary below according to their requirements.

In [None]:
PARAM_DICT = {
              LinearRegression: {'reg__copy_X': [True, False],
                                 'reg__fit_intercept': [True, False],
                                 'reg__n_jobs': [10, 20]},
              Ridge: {'reg__alpha': [0.1, 1, 100],
                      'reg__copy_X': [True, False],
                      'reg__fit_intercept': [True, False],
                      'reg__tol': [0.1, 1],
                      'reg__solver': ['auto', 'svd', 'cholesky', 'lsqr',
                                      'sparse_cg', 'sag', 'saga']},
              Lasso: {'reg__alpha': [0.1, 1, 100],
                      'reg__copy_X': [True, False],
                      'reg__fit_intercept': [True, False],
                      'reg__tol': [0.1, 1]},

              KNeighborsRegressor: {'reg__n_neighbors': [5, 30, 100]},
              BayesianRidge: {'reg__alpha_1': [10**-6, 10**-3],
                              'reg__alpha_2': [10**-6, 10**-3],
                              'reg__copy_X': [True, False],
                              'reg__fit_intercept': [True, False],
                              'reg__lambda_1': [10**-6, 10**-3],
                              'reg__lambda_2': [10**-6, 10**-3],
                              # 'max_iter' was named 'n_iter' before scikit-learn 1.3
                              'reg__max_iter': [300, 500, 1000],
                              'reg__tol': [0.001, 0.01, 0.1]},

              DecisionTreeRegressor: {'reg__max_depth': [5, 10, 20],
                                      'reg__max_features': [0.3, 0.7, 1.0],
                                      'reg__max_leaf_nodes': [10, 50, 100],
                                      'reg__splitter': ['best', 'random']},

              BaggingRegressor: {
                                 'reg__bootstrap': [True, False],
                                 'reg__bootstrap_features': [True, False],
                                 'reg__max_features': [0.3, 0.7, 1.0],
                                 'reg__max_samples': [0.3, 0.7, 1.0],
                                 'reg__n_estimators': [10, 50, 100]},
              RandomForestRegressor: {'reg__bootstrap': [True, False],
                                      'reg__max_depth': [5, 10, 20],
                                      'reg__max_features': [0.3, 0.7, 1.0],
                                      'reg__max_leaf_nodes': [10, 50, 100],
                                      'reg__min_impurity_decrease': [0, 0.1, 0.2],
                                      'reg__n_estimators': [10, 50, 100]},

              SVR: {'reg__C': [10**-3, 1, 1000],
                    'reg__kernel': ['linear', 'poly', 'rbf'],
                    'reg__shrinking': [True, False]},

              GradientBoostingRegressor: {'reg__learning_rate': [0.1, 0.2, 0.5],
                                          # 'ls' and 'lad' were renamed in scikit-learn 1.0
                                          'reg__loss': ['squared_error', 'absolute_error',
                                                        'huber', 'quantile'],
                                          'reg__max_depth': [10, 20, 50],
                                          'reg__max_features': [0.5, 0.8, 1.0],
                                          'reg__max_leaf_nodes': [10, 50, 100],
                                          'reg__min_impurity_decrease': [0, 0.1, 0.2],
                                          'reg__min_samples_leaf': [5, 10, 20],
                                          'reg__min_samples_split': [5, 10, 20],
                                          'reg__n_estimators': [10, 50, 100]},
              XGBRegressor: {'reg__booster': ['gbtree', 'gblinear', 'dart'],
                             'reg__learning_rate': [0.2, 0.5, 0.8],
                             'reg__max_depth': [5, 10, 20],
                             'reg__n_estimators': [10, 50, 100],
                             'reg__reg_alpha': [0.1, 1, 10],
                             'reg__reg_lambda': [0.1, 1, 10],
                             'reg__subsample': [0.3, 0.5, 0.8]},

              }
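
Since grid search fits one model per parameter combination per fold, the size of each grid drives the run time. The sketch below counts fits for a grid shaped like the `Ridge` entry above, using only the standard library; the helper `count_fits` is hypothetical, and the count ignores the final refit on the full training data that `GridSearchCV` performs by default.

```python
from math import prod

def count_fits(param_grid, folds):
    """Number of cross-validation fits: combinations in the grid times folds."""
    combinations = prod(len(values) for values in param_grid.values())
    return combinations * folds

ridge_grid = {
    'reg__alpha': [0.1, 1, 100],
    'reg__copy_X': [True, False],
    'reg__fit_intercept': [True, False],
    'reg__tol': [0.1, 1],
    'reg__solver': ['auto', 'svd', 'cholesky', 'lsqr',
                    'sparse_cg', 'sag', 'saga'],
}

# 3 * 2 * 2 * 2 * 7 = 168 combinations, times 5 folds
print(count_fits(ridge_grid, folds=5))  # 840
```

By the same arithmetic, the `GradientBoostingRegressor` entry has 26,244 combinations, so trimming the value lists there has a large effect on run time.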

#### User Control Panel For Key Inputs (__MAKE MODIFICATIONS HERE__)
The inputs to the modules can be changed here. This is the control panel for this script where all the variables mentioned in the introduction can be altered to test various scenarios. Please refer to the comments to understand the variables.

In [None]:
# --------------------------------------------------------------------------
# USER CONTROL PANEL, CHANGE THE VARIABLES, MODEL FORMS ETC. HERE

# Read data here, define X (features) and Y (Target variable)
data = datasets.load_diabetes()
X = pd.DataFrame(data['data'])
X.columns = data['feature_names']
Y = data['target']

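# Specify size of test data (proportion between 0 and 1)
size = 0.3

# Set random seeds for sampling consistency
# (scikit-learn draws from NumPy's global random state by default)
random.seed(100)
np.random.seed(100)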

# Set type of normalization you want to perform
# 0 - No Normalization, 1 - Min-max scaling, 2 - Zscore scaling
norm = 0

# Mention all model forms you want to run - Model Objects
to_run = [LinearRegression,
          Ridge,
          Lasso,
          KNeighborsRegressor,
          DecisionTreeRegressor,
          BaggingRegressor,
          SVR,
          XGBRegressor]

# Specify number of crossvalidation folds
folds = 5

# Specify model selection criteria
# Possible values are:
# 'explained_variance'
# 'neg_mean_absolute_error'
# 'neg_mean_squared_error'
# 'neg_mean_squared_log_error'
# 'neg_median_absolute_error'
# 'r2'
score_criteria = 'neg_mean_absolute_error'

# Specify details of terminal output you'd like to see
# 0 - No output, 1 - All details, 2 - Progress bar
# Outputs might vary based on individual functions
see_details = 1

# --------------------------------------------------------------------------

#### Model Execution
This section iteratively finds the best set of hyper-parameters for each of the models specified by the user, calculates the metrics and populates the results table for further analysis and experimentation.

In [None]:
# Model execution part, results will be stored in the dataframe 'results'
# Best model can be selected based on these criteria

results = pd.DataFrame(columns=['ModelForm', 'TrainRMSE', 'TestRMSE',
                                'TrainMAPE', 'TestMAPE'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=size)

for model_class in to_run:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        # Forward feature selection using the model's default hyper-parameters
        best_feat = select_features(model_class(), X_train, Y_train, True,
                                    score_criteria, see_details, norm)
        # Grid search restricted to the selected features
        best_model = run_model(model_class(), PARAM_DICT[model_class],
                               X_train.iloc[:, best_feat],
                               Y_train,
                               X.iloc[:, best_feat], Y,
                               score_criteria, folds, see_details, norm)
        stats = get_model_eval(best_model, X_train.iloc[:, best_feat], Y_train,
                               X_test.iloc[:, best_feat], Y_test)
        # DataFrame.append was removed in pandas 2.0; add the row by label instead
        results.loc[len(results)] = stats.tolist()

print(results)


## Conclusion
It can be observed from the results table that the most basic linear regression model provides the best and most consistent performance among all the model forms tested in this scenario. This highlights the importance of feature engineering, since we would generally expect ensemble models to perform better. On the other hand, the XGB Regressor shows signs of overfitting based on its train and test metrics. All other models provide comparable performance. This indicates the need for testing a different range of hyper-parameters as well.
I hope this module enables faster experimentation and provides an opportunity to build further customizations on top of it based on your needs!