## Regularization and base learners in XGBoost
Regularization is an important technique in machine learning for preventing overfitting. Mathematically speaking, it adds a penalty term to the loss function so that the coefficients cannot grow large enough to fit the noise in the training data. The difference between L1 and L2 lies in how they measure the size of the weights: L2 is the sum of the squared weights, while L1 is the sum of the absolute values of the weights.

[Regularization](http://enhancedatascience.com/2017/07/04/machine-learning-explained-regularization/), in the context of machine learning, refers to the process of modifying a learning algorithm so as to prevent overfitting. This generally involves imposing some sort of smoothness constraint on the learned model. This smoothness may be enforced explicitly, by fixing the number of parameters in the model, or by augmenting the cost function.
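
To make the two penalty terms concrete, here is a minimal sketch that computes them for a small weight vector. The weights are hypothetical values chosen only for illustration; they do not come from the housing model.

```python
import numpy as np

# Hypothetical weight vector, chosen only for illustration
weights = np.array([0.5, -2.0, 0.0, 3.0])

# L1 penalty: sum of the absolute values of the weights
l1_penalty = np.sum(np.abs(weights))

# L2 penalty: sum of the squared weights
l2_penalty = np.sum(weights ** 2)

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
```

In a regularized model, one of these quantities (scaled by a strength parameter) is added to the training loss, so larger weights cost more.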

In [1]:
# May be required on Windows if importing xgboost throws errors
# import os
# mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-7.2.0-posix-seh-rt_v5-rev1\\mingw64\\bin'
# os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
# Load the Ames, Iowa housing dataset from DataCamp's AWS URL
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

In [4]:
# Split into features (all columns but the last) and target (last column): X, y
X, y = housing_data.iloc[:,:-1], housing_data.iloc[:,-1]

### L1 Regularization Penalty ('alpha')
L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients. It will ___shrink some parameters to zero___, so some variables play no role in the model; L1 regularization can therefore be seen as a way to select features. A regression model that uses L1 regularization is called Lasso Regression. Lasso (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of each coefficient as a penalty term to the loss function. If the penalty strength (usually written lambda for Lasso; the corresponding XGBoost parameter is `alpha`) is zero, we get back OLS, whereas a very large value drives the coefficients to zero, so the model under-fits. In XGBoost's tree booster, the penalty is applied to the leaf weights rather than to regression coefficients.
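
The "shrink to zero" behaviour can be sketched with soft-thresholding. Under the simplifying assumption of orthonormal features, the Lasso solution is the OLS coefficient vector with each entry pulled toward zero by the penalty strength and clipped at zero. The coefficients and penalty values below are hypothetical, and the function name `soft_threshold` is introduced here only for illustration.

```python
import numpy as np

def soft_threshold(coefs, penalty):
    """Shrink each coefficient toward zero by `penalty`, clipping at zero."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - penalty, 0.0)

# Hypothetical OLS coefficients, for illustration only
ols_coefs = np.array([4.0, -0.3, 0.8, -2.5])

# Penalty 0 leaves the OLS solution unchanged; larger penalties zero out
# the small coefficients first, and a very large penalty zeroes all of them
for penalty in [0.0, 1.0, 5.0]:
    print(penalty, soft_threshold(ols_coefs, penalty))
```

With a penalty of 1.0, the two small coefficients become exactly zero while the two large ones survive in shrunken form, which is the feature-selection effect described above.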

Effect on overall model performance on the Ames housing dataset

In [5]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l1 strength: params
# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
params = {"objective": "reg:squarederror", "max_depth": 4}

# Create an empty list for storing rmses as a function of l1 complexity
rmses_l1 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l1 strength
    params["alpha"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, 
                             num_boost_round=10, metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final-round test RMSE to rmses_l1
    rmses_l1.append(cv_results_rmse["test-rmse-mean"].iloc[-1])

# Look at best rmse per l1 param
print("Best rmse as a function of L1:")
print(pd.DataFrame(list(zip(reg_params, rmses_l1)), columns=["L1", "rmse"]))

Best rmse as a function of L1:
    L1          rmse
0    1  35572.514160
1   10  35571.970215
2  100  35572.370605


Observation: over this range (1 to 100), alpha (L1 regularization) has little to no effect on the cross-validated RMSE

### L2 Regularization Penalty ('lambda')
A regression model that uses L2 regularization is called Ridge Regression.
Ridge regression adds the “squared magnitude” of each coefficient as a penalty term to the loss function, so the penalty equals the sum of the squared coefficients. L2 regularization forces the parameters to be relatively small: the bigger the penalty, the smaller (and the more robust) the coefficients.

Effect on overall model performance on the Ames housing dataset

In [7]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
params = {"objective": "reg:squarederror", "max_depth": 3}

# Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, 
                             num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final-round test RMSE to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].iloc[-1])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2", "rmse"]))

Best rmse as a function of l2:
    l2          rmse
0    1  52275.357421
1   10  57746.064453
2  100  76624.625000


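Observation: RMSE increases with lambda. Note that this run uses `max_depth` 3, 2 folds, and 5 boosting rounds (versus 4, 4, and 10 in the L1 run), so the RMSE values in the two tables are not directly comparable.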
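
One plausible reason RMSE grows with lambda in so few boosting rounds is that a larger L2 penalty shrinks every leaf weight, so each tree takes a smaller step toward the target. The sketch below uses the leaf-weight formula from the XGBoost paper and tutorial, w* = -G / (H + lambda), for squared-error loss, and ignores alpha, the learning rate, and all other settings. The residuals and the function name `leaf_weight` are hypothetical.

```python
import numpy as np

def leaf_weight(residuals, reg_lambda):
    """Optimal weight of one leaf for squared-error loss with an L2 penalty."""
    grad_sum = -np.sum(residuals)  # G: each gradient is (prediction - label)
    hess_sum = len(residuals)      # H: each Hessian is 1 for squared error
    return -grad_sum / (hess_sum + reg_lambda)

# Hypothetical residuals (label - current prediction) of the rows in one leaf
residuals = np.array([10.0, 12.0, 8.0, 10.0])

# Without a penalty the leaf predicts the mean residual (10.0);
# larger lambda values pull the leaf weight toward zero
for reg_lambda in [0, 1, 10, 100]:
    print(reg_lambda, leaf_weight(residuals, reg_lambda))
```

With only four rows in the leaf, lambda=100 reduces the correction from 10.0 to well under 1.0, which is consistent with a model that under-fits when it is given only a handful of boosting rounds.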

The key difference between these techniques is that Lasso shrinks the coefficients of less important features to zero, removing those features altogether. It therefore works well for feature selection when we have a huge number of features.

### Linear Base Learner
- Sum of linear terms
- Boosted model is weighted sum of linear models (thus is itself linear)
- Rarely used

### Tree Base Learner
- Decision tree
- Boosted model is weighted sum of decision trees (nonlinear)
- Almost exclusively used in XGBoost
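
In XGBoost the base learner is chosen with the `booster` parameter (`"gbtree"` for trees, `"gblinear"` for linear models). The claim that a boosted sum of linear models is itself linear can be checked numerically. This is a minimal sketch with hypothetical coefficients, intercepts, and learning rate; it does not call XGBoost.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 rows, 3 features of synthetic data

# Three hypothetical linear base learners (coefficients and intercepts)
coefs = [np.array([1.0, 0.0, -2.0]),
         np.array([0.5, 0.5, 0.0]),
         np.array([-1.0, 2.0, 1.0])]
intercepts = [0.1, -0.3, 0.7]
learning_rate = 0.3  # weight applied to every base learner

# Boosted prediction: weighted sum of the individual linear predictions
boosted_pred = sum(learning_rate * (X @ w + b) for w, b in zip(coefs, intercepts))

# The same prediction from a single linear model with summed parameters
combined_coef = learning_rate * np.sum(coefs, axis=0)
combined_intercept = learning_rate * sum(intercepts)
single_pred = X @ combined_coef + combined_intercept

print(np.allclose(boosted_pred, single_pred))  # True
```

Because the ensemble collapses to one linear model, boosting linear learners adds no nonlinear capacity, which helps explain why tree base learners are used almost exclusively.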