# Regularization for Regression

In [None]:
Before we move into regularization, let us see what regression is.
Regression analyzes the relationship between two or more variables.
It identifies significant relationships between a dependent variable and one or more independent variables, and it indicates how strongly each independent variable affects the dependent variable.

# Linear Regression

In [None]:
Linear regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).
In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.
It is represented by the equation Y = a1X1 + a2X2 + a3X3 + … + anXn + e
Here, Y is the dependent variable, the X’s are the independent variables, and the “a” terms are the coefficients.
Coefficients are the weights assigned to the features, reflecting each feature's contribution to the prediction, and “e” is the error term.
This equation can be used to predict the value of the target variable from the given predictor variable(s).
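
As a minimal sketch of how the coefficients in this equation can be estimated, the snippet below fits them by ordinary least squares with NumPy. The synthetic data, the variable names and the "true" coefficient values are assumptions made up for illustration; they are not part of the mtcars example. An intercept column is added because fitted models usually include a constant term.

```python
import numpy as np

# Synthetic data (illustrative assumption): Y = 5 + 2*X1 - 3*X2 + e
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_intercept = 5.0
true_coefs = np.array([2.0, -3.0])
y = true_intercept + X @ true_coefs + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the intercept is estimated with the coefficients
X_design = np.column_stack([np.ones(n_samples), X])

# Least squares: choose the weights that minimize the sum of squared errors
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = weights[0], weights[1:]

# Use the fitted equation to predict the target from the predictors
y_pred = X_design @ weights

print("intercept:", round(float(intercept), 2))
print("coefficients:", np.round(coefs, 2))
```

The estimated values should land close to the ones used to generate the data, because the noise term is small.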

In [None]:
####### Linear Regression with Example ############
I use the simple “mtcars” data set as an example.
“mpg” [miles per gallon, i.e. mileage] is first predicted from a single independent variable, “wt” [weight].
Then more independent variables are added to try to improve accuracy.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import statsmodels.formula.api as smf

mtcars = pd.read_csv("data/mtcars.csv")

############### Simple Linear Regression with one independent variable #############################################
# DV & IDV Identification
# DV: mpg
# IDV: wt

# Visualize the relationship between DV and IDV
plt.scatter("wt","mpg",data = mtcars)
plt.xlabel("wt")
plt.ylabel("mpg")

# Correlation analysis between DV and IDV
mtcars["mpg"].corr(mtcars["wt"]) # -0.867 Strong negative correlation

# Simple linear Regression Model Building
cars_model = smf.ols(formula = "mpg ~ wt", data = mtcars).fit()
cars_model.summary()

# Coefficient Check
# mpg = -5.3445*wt + 37.28
# For a 1-unit increase in weight, mpg decreases by about 5.34 units
# R-Squared: 0.753 # Decent model

################ MultiLinear Regression Model Building - Adding more independent variables ##################################
# Adding more variables to improve accuracy
# DV & IDV Identification
mtcars.columns
# DV: mpg
# IDV: 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear','carb'

# Visualize the relationship between DV and IDV
f, ((ax1,ax2),(ax3,ax4),(ax5,ax6),(ax7,ax8),(ax9,ax10)) = plt.subplots(5,2)
ax1.scatter("cyl","mpg",data = mtcars)
ax2.scatter("disp","mpg",data = mtcars)
ax3.scatter("hp","mpg",data = mtcars)
ax4.scatter("drat","mpg",data = mtcars)
ax5.scatter("wt","mpg",data = mtcars)
ax6.scatter("qsec","mpg",data = mtcars)
ax7.scatter("vs","mpg",data = mtcars)
ax8.scatter("am","mpg",data = mtcars)
ax9.scatter("gear","mpg",data = mtcars)
ax10.scatter("carb","mpg",data = mtcars)

# Correlation analysis between DV and IDV
mtcars_corr_matrix = mtcars.corr()

# Multilinear model building
cars_model = smf.ols(formula = "mpg ~ cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb", data = mtcars).fit()
cars_model.summary()

# Coefficient Check
# mpg = -0.130cyl + 0.024disp -0.0232hp + 0.029drat - 6.917wt + 1.855qsec - 1.898vs + 0.879am + 1.283gear - 0.636carb + 1.454
# Adj R-Squared : 0.807

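# training-test split
# Note: np.arange(31) covers row positions 0 to 30 only, so if the data has the usual 32 mtcars rows,
# the last row is never sampled and always lands in the test set
all_samples = np.arange(31)
np.random.seed(10)
# 0.7*31 = 21.7, so about 22 rows (70%) are used for training
tr_samples = np.random.choice(all_samples,22,replace=False)
mtcars_training_data = mtcars.iloc[tr_samples,:]
# drop() removes rows by index label; this matches the positions above for the default integer index
mtcars_test_data = mtcars.drop(tr_samples)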

# Linear Regression Model Building on Training data
cars_model = smf.ols(formula = "mpg ~ cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb", data = mtcars_training_data).fit()
cars_model.summary()

# Coefficient Check
# mpg = -0.111cyl + 0.013disp -0.0215hp + 0.7871drat - 3.715wt + 0.8210qsec + 0.317vs + 2.5202am + 0.655gear - 0.199carb + 12.303
# Adj R-Squared : 0.807

pred_mpg = cars_model.predict(mtcars_test_data)

mpg_compare = pd.DataFrame({"Actual_mpg":mtcars_test_data["mpg"],"Predicted_mpg":pred_mpg})

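# Mean/Median Absolute Percentage Error
def MAPE(actual,predicted):    
    # absolute percentage error for each observation
    ape = np.abs(predicted - actual)/actual
    # an actual value of zero gives inf; treat those observations as missing
    ape = ape.replace(np.inf,np.nan)
    # returns [mean APE in %, median APE in %]
    return([np.mean(ape)*100,np.nanmedian(ape)*100])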
    
MAPE(mtcars_test_data["mpg"],pred_mpg)
# 13.3 % error , 86.7% Accuracy

# Feature Selection

In [None]:
When we have a high-dimensional data set, it is inefficient to use all the variables, since some of them may carry redundant information.
We need to select the set of variables that gives an accurate model and also explains the dependent variable well.
There are multiple ways to select the right set of variables for the model.
The first is business understanding and domain knowledge.
We should also take care that the variables we select are not strongly correlated among themselves.
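
One simple way to check the last point is to scan the correlation matrix of the predictors for strongly correlated pairs. The sketch below does this with NumPy on made-up data; the helper name `correlated_pairs`, the column names and the 0.8 threshold are illustrative assumptions, not values from the mtcars analysis.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, correlation) for every pair of columns
    whose absolute Pearson correlation is at least `threshold`."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Made-up predictors: "b" is almost a rescaled copy of "a", "c" is independent
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(scale=0.1, size=100)
c = rng.normal(size=100)
X = np.column_stack([a, b, c])

pairs = correlated_pairs(X, ["a", "b", "c"])
for first, second, value in pairs:
    print(f"{first} and {second} are strongly correlated (r = {value:.3f})")
```

When such a pair shows up, keeping only one of the two variables is usually enough.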

# Stepwise Regression

In [None]:
Instead of manually selecting the variables, we can automate this process by using forward selection or backward elimination.
Forward selection starts with the most significant predictor in the model and adds the most significant remaining variable at each step.
Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
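
The forward variant can be sketched in a few lines of NumPy. Note that this sketch ranks candidates by adjusted R-squared instead of by p-value significance; that criterion is swapped in to keep the example short. The function names and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R-squared of an ordinary least squares fit with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_selection(X, y):
    """Greedily add the column that improves adjusted R-squared the most;
    stop as soon as no remaining column improves it."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining:
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, best_j = max(scores)
        if score <= best_score:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = score
    return selected, best_score

# Synthetic data: only columns 0 and 3 actually drive y
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

selected, score = forward_selection(X, y)
print("selected columns (in order added):", selected)
print("adjusted R-squared:", round(score, 3))
```

Backward elimination follows the same pattern in reverse: start with every column and drop the one whose removal hurts the score the least.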

# Bias and Variance in regression models

In [None]:
Bias is error due to overly simplistic assumptions in the learning algorithm [for example, using too few independent variables].
This can lead to the model underfitting the data, making it hard for it to reach high predictive accuracy.

Variance is error due to too much complexity in the learning algorithm [for example, adding too many independent variables].
This makes the algorithm highly sensitive to variation in the training data, which can lead the model to overfit. The model then carries too much noise from the training data to be very useful on the test data.
The bias-variance decomposition expresses the expected prediction error of any algorithm as the sum of the squared bias, the variance, and an irreducible error due to noise in the underlying data set.

Essentially, if you make the model more complex and add more variables, you reduce bias but gain variance. To get the lowest total error, you have to trade off bias against variance.
You don’t want either high bias or high variance in your model.
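
The trade-off can be illustrated with a small experiment. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve and compares training and test error. Polynomial degree stands in for "number of variables" here, and the data, noise level and degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    # Least squares polynomial fit on the training data only
    poly = np.polyfit(x_train, y_train, degree)
    train_error = mse(y_train, np.polyval(poly, x_train))
    test_error = mse(y_test, np.polyval(poly, x_test))
    results[degree] = (train_error, test_error)
    print(f"degree {degree}: train MSE = {train_error:.3f}, test MSE = {test_error:.3f}")
```

Training error can only go down as the degree grows, so it says nothing about overfitting on its own. The straight line (degree 1) underfits and has a high test error; comparing the test error of the higher degrees shows whether the extra complexity still pays off.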

# Regularization

In [None]:
Regularization is a mechanism for avoiding overfitting of the model.
Sometimes a few variables dominate the fit and take on very large coefficients.
In regularization, we normally keep the same number of features but reduce the magnitude of their coefficients.
Regularization adds a penalty that grows as model complexity increases.
The regularization parameter (alpha) penalizes all the parameters except the intercept, so that the model generalizes and does not overfit.

There are two common types of regularization: L1 [Lasso] and L2 [Ridge].
Let us first define L2 [Ridge] regularization.

# L2 [Ridge] Regularization

In [None]:
L2 regularization adds a penalty equal to the sum of the squared coefficients.
• Minimization objective = LS Obj + α * (sum of squares of coefficients)

L2 regularization retains all the independent variables while shrinking their coefficients to avoid overfitting.

The penalty value [alpha] can be any non-negative number; the examples here use values between 0 and 1, and alpha = 0 gives ordinary least squares.
As alpha increases, the coefficients shrink towards zero but do not reach exactly zero.

The best alpha is not known in advance.
Iterate over a range of alpha values, calculate the error (or R-squared) on held-out data for each one, and use the alpha that gives the lowest error.
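
A minimal sketch of this search, assuming the objective above with LS Obj taken as the sum of squared errors: the closed-form ridge solution is computed with NumPy for several alpha values and scored on a held-out split. The helper `ridge_fit`, the synthetic data and the alpha grid are illustrative assumptions; this is not the scikit-learn implementation used in the next cell.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit. Centering X and y keeps the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_centered, y_centered = X - x_mean, y - y_mean
    penalty = alpha * np.eye(X.shape[1])
    coefs = np.linalg.solve(X_centered.T @ X_centered + penalty,
                            X_centered.T @ y_centered)
    intercept = y_mean - x_mean @ coefs
    return intercept, coefs

# Synthetic data with a simple train/validation split
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=1.0, size=60)
X_train, X_valid = X[:40], X[40:]
y_train, y_valid = y[:40], y[40:]

alphas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
valid_errors, coef_norms, coefs_by_alpha = [], [], {}
for alpha in alphas:
    intercept, coefs = ridge_fit(X_train, y_train, alpha)
    predictions = intercept + X_valid @ coefs
    valid_errors.append(float(np.mean((y_valid - predictions) ** 2)))
    coef_norms.append(float(np.linalg.norm(coefs)))
    coefs_by_alpha[alpha] = coefs
    print(f"alpha={alpha:<6} validation MSE={valid_errors[-1]:.3f} "
          f"coefficient norm={coef_norms[-1]:.3f}")

# Keep the alpha with the lowest held-out error
best_alpha = alphas[int(np.argmin(valid_errors))]
print("best alpha:", best_alpha)
```

The printed coefficient norm shrinks as alpha grows, while every individual coefficient stays non-zero, which is the behaviour described above.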

In [None]:
####################################### Ridge Regression #######################################
X_mtcars_train, X_mtcars_test, y_mtcars_train, y_mtcars_test = \
    train_test_split(mtcars.iloc[:,1:11], 
                     mtcars.iloc[:,0], 
                     test_size=0.3, random_state=100)

from sklearn.linear_model import Ridge

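# Note: the normalize argument was removed in scikit-learn 1.2. On newer versions, drop it and
# scale the features yourself (e.g. StandardScaler in a Pipeline), which will change the numbers noted below.
ridgeReg = Ridge(alpha=0.6, normalize = True)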

r = ridgeReg.fit(X_mtcars_train,y_mtcars_train)

# Coefficient Check
ridgeReg.coef_
pred_ridge = ridgeReg.predict(X_mtcars_test)

# Mean Square Error Calculation
mse = np.mean((pred_ridge - y_mtcars_test)**2) # 6.22

ridgeReg.score(X_mtcars_test,y_mtcars_test)# 0.828

MAPE(y_mtcars_test,pred_ridge)
# 8.3 % error , 91.7% Accuracy

# Checking the magnitude of coefficients

predictors = X_mtcars_train.columns

coef = pd.Series(ridgeReg.coef_,predictors).sort_values()

coef.plot(kind='bar', title='Model Coefficients')
# The magnitudes of the coefficients of all the IDVs have been reduced; 'hp' and 'disp' are almost zero.


# L1 [Lasso] Regularization

In [None]:
L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients.
• Minimization objective = LS Obj + α * (sum of absolute values of coefficients)

L1 regularization keeps only some features and reduces the coefficients of the others to exactly zero, so it also acts as a feature selector.

The penalty value [alpha] can be any non-negative number; the examples here use values between 0 and 1.
Even for small alpha, the coefficients of some variables become exactly zero while the coefficients of the others move towards zero.

As with Ridge, the best alpha is not known in advance.
Iterate over a range of alpha values, calculate the error (or R-squared) on held-out data for each one, and use the alpha that gives the lowest error.
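
To see why some coefficients become exactly zero, here is a minimal coordinate descent sketch in NumPy. It assumes the objective is scaled as (1 / (2 * n_samples)) * (sum of squared errors) + α * (sum of absolute values of coefficients), which is one common convention. The helper names, the synthetic data and the alpha values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return exactly 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(X, y, alpha, n_iter=200):
    """Lasso coefficients by cyclic coordinate descent on centered data."""
    n_samples, n_features = X.shape
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    coefs = np.zeros(n_features)
    column_scale = (X_centered ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back
            partial_residual = (y_centered - X_centered @ coefs
                                + X_centered[:, j] * coefs[j])
            rho = X_centered[:, j] @ partial_residual / n_samples
            coefs[j] = soft_threshold(rho, alpha) / column_scale[j]
    return coefs

# Synthetic data: only features 0 and 3 have a real effect
rng = np.random.default_rng(11)
X = rng.normal(size=(100, 6))
true_coefs = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=100)

lasso_coefs = {}
for alpha in (0.01, 0.1, 0.5):
    lasso_coefs[alpha] = lasso_fit(X, y, alpha)
    print(f"alpha={alpha}: {np.round(lasso_coefs[alpha], 3)}")
```

The soft-threshold step is what sets a coefficient to exactly zero whenever its correlation with the residual is smaller than alpha; the ridge penalty has no such step, so its coefficients only shrink.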

In [None]:
####################################### Lasso Regression ########################################
from sklearn.linear_model import Lasso

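# Note: the normalize argument was removed in scikit-learn 1.2. On newer versions, drop it and
# scale the features yourself (e.g. StandardScaler in a Pipeline), which will change the numbers noted below.
lassoReg = Lasso(alpha=0.6, normalize = True)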

l = lassoReg.fit(X_mtcars_train,y_mtcars_train)

# Coefficient Check
lassoReg.coef_

pred_lasso = lassoReg.predict(X_mtcars_test)

# Mean Square Error Calculation
mse = np.mean((pred_lasso - y_mtcars_test)**2) # 20.79

lassoReg.score(X_mtcars_test,y_mtcars_test) # 0.425

MAPE(y_mtcars_test,pred_lasso)
# 15.84% error, 84% Accuracy

# Checking the magnitude of coefficients

predictors = X_mtcars_train.columns

coef = pd.Series(lassoReg.coef_,predictors).sort_values()

coef.plot(kind='bar', title='Model Coefficients')
# 'wt' and 'cyl' are enough to build the model; the coefficients of the rest of the IDVs have become zero.

# Difference between Regularization & Dimensionality Reduction [PCA]

In [None]:
Regularization is the process of penalizing complexity in a model so as to prevent overfitting and improve generalization.
It places constraints on machine learning models to induce (typically) sparseness or robustness.

Dimensionality reduction refers to unsupervised learning on a data set to find a low-dimensional space that adequately captures the data.
It reduces the number of variables under consideration and is related to feature extraction.
It is useful when the data set contains redundant measurements, for example the same quantity recorded in different units such as meters and centimeters.
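
The meters/centimeters case can be sketched with a small PCA computed through the singular value decomposition in NumPy. The column names and the made-up data are assumptions for illustration; the mtcars version in the next cell uses scikit-learn.

```python
import numpy as np

# Made-up data: the same height recorded in two units, plus an unrelated weight
rng = np.random.default_rng(3)
n_samples = 100
height_m = rng.normal(loc=1.7, scale=0.1, size=n_samples)
height_cm = 100.0 * height_m
weight_kg = rng.normal(loc=70.0, scale=10.0, size=n_samples)
X = np.column_stack([height_m, height_cm, weight_kg])

# Standardize each column so the units do not matter
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of `components` are the principal directions
_, singular_values, components = np.linalg.svd(Z, full_matrices=False)
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_variance_ratio)

# Project onto the first two components; the third carries no information
projected = Z @ components[:2].T

print("explained variance ratio:", np.round(explained_variance_ratio, 4))
print("cumulative:", np.round(cumulative, 4))
```

Because the two height columns are perfectly correlated after scaling, two components capture essentially all the variance of the three original variables.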

In [None]:
############################# Dimensionality Reduction using PCA #################################

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

newmtcars = mtcars.iloc[:,1:11]
mtcars_corr = newmtcars.corr()

# Scaling

mtcars_scaled= pd.DataFrame(scale(newmtcars))
mtcars_scaled.columns = newmtcars.columns

# Correlation comparison between scaled and unscaled data:
corrmatrix_mtcarsscaled =  mtcars_scaled.corr()
corrmatrix_mtcars = newmtcars.corr()
# scaling does not change the correlations

# PCA
mtcarspca = PCA().fit(mtcars_scaled)
mtcars_projected = pd.DataFrame(mtcarspca.transform(mtcars_scaled))
# Dim1, Dim2, Dim3, ..., Dim10
mtcars_projected.columns = ["Dim" + str(i) for i in range(1,11)]

# Explained Variance Ratio
sum(mtcarspca.explained_variance_ratio_)
np.cumsum(mtcarspca.explained_variance_ratio_)

# About 92% of the variance is explained by just the first 4 principal components