## Overfitting

One problem that often arises with Linear Regression (and all machine learning algorithms) is overfitting on training data. The workhorse optimization function converges on weights that minimize the cost function *too* much. To deal with this, we add a term to the cost function which penalizes the weights by adding a metric onto the sum of squared error terms. This metric is commonly the *L-1* or *L-2* norm, named **Lasso** and **Ridge** respectively. We now present a class that updates the simple regression class with a choice of regularization to prevent overfitting. 

A quick note, with Ridge regression, we shrink the coefficients and it helps to reduce the model complexity and multi-collinearity using the $L_2$ norm of the weights times a $\lambda$. So if we set $\lambda=0$, we have the same cost function as in normal linear regression. As $\lambda$ gets bigger, the weights are penalized more. In Lasso regression, the process is identical but uses the $L_1$ norm. The important distinction is that Lasso tends to zero out non relevant features much more than ridge, so it can be useful in feature selection.

In this algorithm, we provide a new feature for the model which evaluates the predictive power of the linear regression on unseen test data. 

#### The algorithm

In [1]:
import numpy as np
import pandas as pd

In [2]:
class LinearRegressionRegularization():
    
    def __init__(self, learning_rate):
        '''Initializes the class with a learning rate for the optimization of weights.'''
        self.learning_rate = learning_rate
          
    def fit(self, train_data, train_target, regularization, lambda_):
        '''Input the training data and its respective target values'''
        
        #Convert data to numpy arrays
        X = np.concatenate((np.array([np.zeros(len(train_data))+1]).T, train_data), axis = 1)
        y = np.array(train_target)
        
        #initialize coefficients
        coefficients = np.random.normal(0, 1, X.shape[1])
        self.coefficients = coefficients
        
        #Keep a list of SSE and MSE for each iteration if you desire to analyze
        
        self.SSE_list = []
        self.MSE_list = []
            
        #Define different regularizations
        def regularization_function(weights):
            if regularization == 'None':
                return 0
            if regularization == 'Lasso':
                sign_func = lambda x: 1 if x >= 0 else -1
                return lambda_ * np.array([sign_func(wi) for wi in weights])
            if regularization == 'Ridge':
                return lambda_ * 2 * weights
                
        #iteratively improve coefficients:
        i = 0
        while i < 1000:
            predict = np.matmul(X, self.coefficients)
            SSE = np.linalg.norm(y-predict)
            MSE = SSE/X.shape[0]
            self.SSE_list.append(SSE)
            self.MSE_list.append(MSE)
            
            #Calculate the gradient of SSE with regularization and apply some randomness
            gradient = -(1/SSE) * np.matmul((y-predict).T, X)+regularization_function(self.coefficients) + np.random.normal(0, 1, self.coefficients.shape)
            self.coefficients = self.coefficients - (1/(i+1))*self.learning_rate * gradient
            i += 1
        SStotal = np.sum((y - np.mean(y))**2)
        self.intercept = self.coefficients[0]
        self.weights = self.coefficients[1:]
        self.error_analysis = pd.DataFrame({'Sum of Squared Errors': [self.SSE_list[-1]],
                                           'Mean Squared Error': [self.MSE_list[-1]],
                                           'Root Mean Squared Error': [np.sqrt(self.MSE_list[-1])],
                                           'R-squared': [1-self.SSE_list[-1]/SStotal]})
        return self
        
    def predict(self, X_test,y_test):
        '''Inputs are unseen test data and respective targets.'''
        X = np.concatenate((np.array([np.zeros(len(X_test))+1]).T, X_test), axis = 1)
        y = np.array(y_test)
        
        self.predictions = np.matmul(X, self.coefficients)
        SStotal = np.sum((y - np.mean(y))**2)
        self.SSE = np.linalg.norm(y-self.predictions)
        self.MSE = MSE = self.SSE/X.shape[0]
        self.test_error = pd.DataFrame({'Sum of Squared Errors': [self.SSE],
                                   'Mean Squared Error': [self.MSE],
                                   'Root Mean Squared Error': [np.sqrt(self.MSE)],
                                   'R-squared': [1-self.SSE/SStotal]}, index = ["Test Data Metrics"])
        return self

#### Test it out

We're going to fit a linear regression to our model. Note that the column we defined does not effect our target at all. We will test the different regression functions.  

In [58]:
sample_data = pd.DataFrame({'a': 10*np.random.normal(0,1, 1000), 'b': 5*np.random.normal(0,1, 1000), 
                           'c': 2*np.random.normal(0,1, 1000)})
sample_data['target'] = 2*(sample_data['a']+sample_data['b']) + 10 + 30*np.random.normal(0,1, 1000)

In [59]:
model = LinearRegressionRegularization(0.2)

In [60]:
model.fit(sample_data[['a','b', 'c']], sample_data['target'], 'None', 0.5)

<__main__.LinearRegressionRegularization at 0x7fe8e3737b50>

In [61]:
model.weights

array([ 1.93405345,  1.97667421, -0.07084178])

In [62]:
model.intercept

6.2837403600485855

In [63]:
model.error_analysis

Unnamed: 0,Sum of Squared Errors,Mean Squared Error,Root Mean Squared Error,R-squared
0,955.474127,0.955474,0.977484,0.999292


In [64]:
ridge_model = LinearRegressionRegularization(0.2)
ridge_model.fit(sample_data[['a','b', 'c']], sample_data['target'], 'Ridge', 0.7)

<__main__.LinearRegressionRegularization at 0x7fe8e37be210>

In [65]:
ridge_model.weights

array([ 1.90651198,  1.85566253, -0.0380647 ])

In [66]:
ridge_model.intercept

3.7645150561860645

In [67]:
ridge_model.error_analysis

Unnamed: 0,Sum of Squared Errors,Mean Squared Error,Root Mean Squared Error,R-squared
0,966.524918,0.966525,0.98312,0.999284


In [68]:
lasso_model = LinearRegressionRegularization(0.2)
lasso_model.fit(sample_data[['a','b', 'c']], sample_data['target'], 'Lasso', 0.7)

<__main__.LinearRegressionRegularization at 0x7fe8e3891290>

In [69]:
lasso_model.weights

array([ 1.92521770e+00,  1.94289142e+00, -2.82950403e-05])

In [70]:
lasso_model.intercept

6.113898144897629

In [71]:
lasso_model.error_analysis

Unnamed: 0,Sum of Squared Errors,Mean Squared Error,Root Mean Squared Error,R-squared
0,956.035073,0.956035,0.97777,0.999292


So although the model was best using standard linear regression, we should note that the whole point is to make sure that we aren't overfitting the data. That means the training data obviously is going to have the best score when we use normal linear regression. Note one cool thing however, Lasso did exactly what we thought it would with the feature c, which we purposefully left out. It correctly sent it the closest to zero by far. In this sense, Lasso helped us choose only the features that were important.