### Elastic Net

&nbsp;

Elastic Net is essentially a fancy name for a regularized least squares with both L1 norm and L2 norm penalties. Regularized least squares can tackle the multicollinearity problem to certain degree in the setting of linear regression. Usually regularized least squares with L1 penalty is called Lasso Regression and regularized least squares with L2 penalty is called Ridge Regression. Elastic Net is attempting to have it both ways. 

Assuming we have an equation $y=x\theta+\epsilon$, where $y,\epsilon \in \mathbb{R}^m$, $x \in \mathbb{R}^{m \times n}$, $I \in \mathbb{R}^{n \times n}$ and $\theta \in \mathbb{R}^{n}$.

* Ordinary Least Squares solves the problem by minimizing $J(\theta)=(y-x\theta)^T(y-x\theta)$ with respect to $\theta$. By setting the partial derivative $\frac{\partial J(\theta)}{\partial \theta}=0$, we shall obtain $\theta=(x^Tx)^{-1}x^Ty$. 


* Ridge Regression solves the problem by minimizing $J(\theta)=(y-x\theta)^T(y-x\theta)+\lambda\theta^T\theta$ with respect to $\theta$. By setting the partial derivative $\frac{\partial J(\theta)}{\partial \theta}=0$, we shall obtain $\theta=(x^Tx+\lambda I)^{-1}x^Ty$.


* Lasso Regression solves the problem by minimizing $J(\theta)=(y-x\theta)^T(y-x\theta)+\lambda|\theta|$ with respect to $\theta$. By setting the partial derivative $\frac{\partial J(\theta)}{\partial \theta}=0$, we shall obtain $\theta=(x^Tx)^{-1}(x^Ty-\frac {\lambda} {2} sign(\theta))$. Nah, that's wrong, the problem is L1 norm is not differentiable. The proper way to do it is via coordinate descent (Friedman, Hastie and Tibshirani, 2008).


* Elastic Net Regression solves the problem by minimizing $J(\theta)=(y-x\theta)^T(y-x\theta)+\lambda_1\theta^T\theta+\lambda_2|\theta|$ with respect to $\theta$. Same as Lasso, we cannot solve the equation in closed form. In some cases (particularly sklearn), the Lagrangian form of loss function is written as $J(\theta)=(y-x\theta)^T(y-x\theta)+\lambda [\alpha\theta^T\theta+(1-\alpha)|\theta|]$ whereas $\lambda=\lambda_1+\lambda_2$ and $\alpha=\frac {\lambda_1} {\lambda_1+\lambda_2}$. The raison d'être of this compound form is to treat L1 and L2 as a single constraint. When $\lambda=0$, the equation reverts to OLS. 

More details of how to implement Lasso can be found in the link below.

https://xavierbourretsicotte.github.io/lasso_implementation.html

More details of how to implement Elastic Net can be found in the link below.

https://towardsdatascience.com/regularized-linear-regression-models-dcf5aa662ab9

### Coordinate Descent

&nbsp;

Coordinate descent can be thought as an alternative gradient descent to solve constrained optimization. At each iteration, the algorithm selects a feature $\theta_j$, which can be cyclic from the default order of features or stochastic from picking randomly. Then the algorithm computes partial residual $r_{i,j}$ without $\theta_j$ and updates $\theta_j$ with $r_{i,j}$.

$$ r_{i,j} = y_i - \sum_{k \neq j} x_{i, k}\theta_k$$

$$\theta_j^* = \frac{1}{n}\sum_{i=1}^{n} x_{i, j}r_{i, j}$$

Check the original paper of coordinate descent on regularized least squares.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model
import sklearn.datasets
import sklearn.decomposition
import sklearn.model_selection
import scipy.stats

In [2]:
np.seterr(divide='raise')

{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

### Functions

In [3]:
#simple ols in linear regression
def ols(X,Y):
    return np.linalg.inv(X.T@X)@X.T@Y

In [4]:
#ridge regression
def ridge(X,Y,lambda_=0.5):
    I=np.ones((X.shape[1],X.shape[1]))
    return np.linalg.inv(X.T@X+lambda_*I)@X.T@Y

In [5]:
#lasso regression
def lasso(X,Y,lambda_=0.5,max_itr=1000,tol=0.01,
         diagnosis=False):
    
    #normalize
    #as l1 norm depends on the variable
    #we need to remove the scale of each variable to be robust
    X_norm=X/np.linalg.norm(X,axis=0)
    
    #initialize with one
    theta=np.ones((X_norm.shape[1],1))
    
    #coordinate descent
    stop=False
    counter=0
    while not stop:
        theta_prev=theta.copy()
        counter+=1
        
        #cyclic coordinate descent
        for i in range(X_norm.shape[1]):
            
            #coordinate descent keeps one feature fixed
            theta[i]=0.0
            
            #no need to do partial residual on intercept
            if i==0:
                theta[0]=(Y-X_norm@theta).mean()
                continue
                
            #partial residual                
            ols_partial_residual=X_norm[:,i]@(Y-X_norm@theta)    

            #soft thresholding
            theta[i]=np.sign(ols_partial_residual)*max(
                abs(ols_partial_residual)-lambda_,0)/X.shape[0]
        
        #convergence check
        if np.all(abs(theta/theta_prev-1)<tol):
            if diagnosis:
                print(f'Converged after {counter} iterations')
            stop=True
        
        #maximum iteration check
        if counter==max_itr:
            stop=True
            
    return theta

In [6]:
#elastic net regression
#note that alpha has different definitions in sklearn
#which is equivalent to lambda_ in this function
#our alpha is also different from the original paper 
#which is equivalent to (1-alpha) in the paper
def elastic_net(X,Y,lambda_=0.5,alpha=0.5,
                max_itr=1000,tol=0.01,diagnosis=False):
    
    #normalize
    #as l1 norm depends on the variable
    #we need to remove the scale of each variable to be robust
    X_norm=X/np.linalg.norm(X,axis=0)
    
    #initialize with one
    theta=np.ones((X_norm.shape[1],1))
    
    #coordinate descent
    stop=False
    counter=0
    while not stop:
        theta_prev=theta.copy()
        counter+=1
        
        #cyclic coordinate descent
        for i in range(X_norm.shape[1]):
            
            #coordinate descent keeps one feature fixed
            theta[i]=0.0
            
            #no need to do partial residual on intercept
            if i==0:
                theta[0]=(Y-X_norm@theta).mean()
                continue
                
            #partial residual                
            ols_partial_residual=X_norm[:,i]@(Y-X_norm@theta)    

            #soft thresholding
            theta[i]=np.sign(
                ols_partial_residual)*max(
                abs(ols_partial_residual)-lambda_*(
                    1-alpha),0)/(1+lambda_*alpha)/X.shape[0]
        
        #convergence check
        if np.all(abs(theta/theta_prev-1)<tol):
            if diagnosis:
                print(f'Converged after {counter} iterations')
            stop=True
        
        #maximum iteration check
        if counter==max_itr:
            stop=True
            
    return theta

### ETL

In [7]:
#iris dataset is the default choice in this repository
#plz refer to the website in the following link for original data
# https://archive.ics.uci.edu/ml/datasets/iris
iris=sklearn.datasets.load_iris()

In [8]:
#to make life easier
#we start with a binary classification problem
dataset=iris.data[iris.target!=2]
Y=iris.target[iris.target!=2]

In [9]:
#use principal component analysis to reduce dimensions for viz
#check link below for details on pca
# https://github.com/je-suis-tm/machine-learning/blob/master/principal%20component%20analysis.ipynb
X=sklearn.decomposition.PCA(n_components=2).fit_transform(dataset)

In [10]:
#make sure its in the vector form
Y=Y.reshape(-1,1)

#adding intercept
#you can also do 
#import statsmodels.api as sm
#X=sm.add_constant(X
constant=np.ones((X.shape[0],1))
X=np.concatenate([constant,X],axis=1)

### Run

In [11]:
#ols
#our linear algebra yields the same result as sklearn
clf=sklearn.linear_model.LinearRegression(fit_intercept=False)
clf.fit(X,Y)

theta_ols_skl=clf.coef_
theta_ols_self=ols(X,Y)

print('OLS Comparison')
print(theta_ols_skl,theta_ols_self)

OLS Comparison
[[ 0.5         0.29194508 -0.1693774 ]] [[ 0.5       ]
 [ 0.29194508]
 [-0.1693774 ]]


In [12]:
#rls l2 penalty
#our linear algebra is slightly different from sklearn
clf=sklearn.linear_model.Ridge(fit_intercept=False,solver='svd')
clf.fit(X,Y)

theta_l2_skl=clf.coef_
theta_l2_self=ridge(X,Y,lambda_=1)

print('Ridge Comparison')
print(theta_l2_skl,theta_l2_self)

Ridge Comparison
[[ 0.4950495   0.29088508 -0.16219036]] [[ 0.49411538]
 [ 0.28980069]
 [-0.19545357]]


In [13]:
#rls l1 penalty
#our implementation is very different from sklearn
#it seems lasso is not doing so well compared to ridge
clf=sklearn.linear_model.Lasso(fit_intercept=False,
                              max_iter=1000,
                             tol=0.0001,
                              alpha=0.5)
clf.fit(X,Y)

theta_l1_skl=clf.coef_
theta_l1_self=lasso(X,Y,lambda_=0.5,max_itr=1000,
                   tol=0.0001,diagnosis=True)

print('Lasso Comparison')
print(theta_l1_skl,theta_l1_self)

Converged after 2 iterations
Lasso Comparison
[ 7.10542736e-17  1.09742072e-01 -0.00000000e+00] [[ 0.5       ]
 [ 0.04336246]
 [-0.00304624]]


In [14]:
#rls l1+l2 penalty
#our implementation is very different from sklearn
#severely affected by l1 penalty
clf=sklearn.linear_model.ElasticNet(fit_intercept=False,
                                   alpha=1,l1_ratio=0.5)
clf.fit(X,Y)

theta_en_skl=clf.coef_
theta_en_self=elastic_net(X,Y,alpha=0.5,
                          lambda_=1,max_itr=1000,
                        tol=0.0001,diagnosis=True)

print('Elastic Net Comparison')
print(theta_en_skl,theta_en_self)

Converged after 2 iterations
Elastic Net Comparison
[ 4.73695157e-17  9.28284491e-02 -0.00000000e+00] [[ 0.5       ]
 [ 0.02890831]
 [-0.00203083]]


In [15]:
#compute sum of squared errors
sse_ols_self=((X@theta_ols_self-Y).T@(X@theta_ols_self-Y)).ravel()[0]
sse_l2_self=((X@theta_l2_self-Y).T@(X@theta_l2_self-Y)).ravel()[0]
sse_l1_self=((X@theta_l1_self-Y).T@(X@theta_l1_self-Y)).ravel()[0]
sse_en_self=((X@theta_en_self-Y).T@(X@theta_en_self-Y)).ravel()[0]

In [16]:
#bias variance tradeoff
print('Sum of squared errors from OLS\n',sse_ols_self)
print('Sum of squared errors from Ridge\n',sse_l2_self)
print('Sum of squared errors from Lasso\n',sse_l1_self)
print('Sum of squared errors from Elastic Net\n',sse_en_self)

Sum of squared errors from OLS
 0.9633040783278697
Sum of squared errors from Ridge
 0.9833737003537734
Sum of squared errors from Lasso
 18.544917363017788
Sum of squared errors from Elastic Net
 20.581900325351896


In [17]:
print('Forecast accuracy of OLS',
      len(Y[np.round(X@theta_ols_self,0)==Y])/len(Y))
print('Forecast accuracy of Ridge',
      len(Y[np.round(X@theta_l2_self,0)==Y])/len(Y))
print('Forecast accuracy of Lasso',
      len(Y[np.round(X@theta_l1_self,0)==Y])/len(Y))
print('Forecast accuracy of Elastic Net',
      len(Y[np.round(X@theta_en_self,0)==Y])/len(Y))

Forecast accuracy of OLS 1.0
Forecast accuracy of Ridge 1.0
Forecast accuracy of Lasso 1.0
Forecast accuracy of Elastic Net 1.0


### Optimal Choice of λ

&nbsp;

Unfortunately, there is no smart way to pick the optimal λ for L1 or L2 penalty. Like any other hyperparameter tuning in machine learning, the optima can only be obtained via brute force. And the result is always arbitrary. Traditionally, there are three different methods. Multiple methods are recommended to use for comparison.

* Cross Validation
* Akaike Information Criterion
* Bayesian Information Criterion

You can check this example of Lasso λ tuning from sklearn.

https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html

Please note that this script computes conventional AIC and BIC where the log likelihood is based upon model error $\epsilon=(y-x\theta)$. Albeit AIC-Lasso shrinkage and BIC-Lasso shrinkage is a more common approach for Lasso (Zou, Hastie and Tibshirani, 2007). Check the link below for the original paper.

https://www.stanford.edu/~hastie/Papers/dflasso.pdf

###### Cross Validation

In [18]:
#our k fold cross validation is purely based upon sum of squared error
#as we shuffle the original dataset
#the result could be different from each run
#in sklearn,this can be done via
#sklearn.linear_model.LassoCV and sklearn.linear_model.ElasticNetCV
def kfold_cross_validation(X,Y,method,
                           n_splits=10,**kwargs):

    #k fold
    kf=sklearn.model_selection.KFold(n_splits=n_splits,shuffle=True)

    arr=[]
    for train_index,test_index in kf.split(X,Y):
        X_train, X_test = X[train_index], X[test_index]
        Y_train, Y_test = Y[train_index], Y[test_index]
        
        #regression
        theta=method(X_train,Y_train,**kwargs)
        sse=((X_test@theta-Y_test).T@(X_test@theta-Y_test)).ravel()[0]
        arr.append(sse)
            
    return np.mean(arr)

In [19]:
#l2 cv
arr=[]
for i in np.arange(0,1,0.1):
    if i==0:
        continue
    arr.append(kfold_cross_validation(X,Y,ridge,lambda_=i))
    
print('optimal λ for Ridge via CV')
print(np.arange(0,1,0.1)[1:][arr.index(min(arr))])

optimal λ for Ridge via CV
0.6000000000000001


In [20]:
#l1 cv
arr=[]
for i in np.arange(0,1,0.1):
    if i==0:
        continue
    arr.append(kfold_cross_validation(X,Y,lasso,lambda_=i))
    
print('optimal λ for Lasso via CV')
print(np.arange(0,1,0.1)[1:][arr.index(min(arr))])



optimal λ for Lasso via CV
0.1


In [21]:
#l1+l2 cv
D={}
for i in np.arange(0,1,0.1):
    for j in np.arange(0,1,0.1):
        if i==0 or j==0:
            continue
        D[(i,j)]=kfold_cross_validation(X,Y,elastic_net,lambda_=i,alpha=j)

print('optimal λ,α for Elastic Net via CV')
print(sorted(D.items(),key=lambda x:x[1])[0][0])



optimal λ,α for Elastic Net via CV
(0.1, 0.1)


###### AIC

In [22]:
#details of aic can be found in the following link
# http://www.modelselection.org/aic/
#in sklearn,this can be done via
#sklearn.linear_model.LassoLarsIC(criterion='aic')
def compute_aic(X,Y,method,**kwargs):

    #regression
    theta=method(X,Y,**kwargs)
    epsilon=Y-X@theta
    
    #compute log likelihood
    miu=epsilon.mean(axis=0)
    sigma=epsilon.std(axis=0)
    log_likelihood=np.log(
        scipy.stats.multivariate_normal(
            mean=miu,cov=sigma).pdf(epsilon)).sum()
    
    return -2*log_likelihood+X.shape[1]*2

In [23]:
#l2 aic
arr=[]
for i in np.arange(0,1,0.1):
    if i==0:
        continue
    arr.append(compute_aic(X,Y,ridge,lambda_=i))
    
print('optimal λ for Ridge via AIC')
print(np.arange(0,1,0.1)[1:][arr.index(min(arr))])

optimal λ for Ridge via AIC
0.1


In [24]:
#l1 aic
arr=[]
for i in np.arange(0,1,0.1):
    if i==0:
        continue
    arr.append(compute_aic(X,Y,lasso,lambda_=i))
    
print('optimal λ for Lasso via AIC')
print(np.arange(0,1,0.1)[1:][arr.index(min(arr))])

optimal λ for Lasso via AIC
0.1




In [25]:
#l1+l2 aic
D={}
for i in np.arange(0,1,0.1):
    for j in np.arange(0,1,0.1):
        if i==0 or j==0:
            continue
        D[(i,j)]=compute_aic(X,Y,elastic_net,lambda_=i,alpha=j)

print('optimal λ,α for Elastic Net via AIC')
print(sorted(D.items(),key=lambda x:x[1])[0][0])

optimal λ,α for Elastic Net via AIC
(0.1, 0.1)




###### BIC

In [26]:
#details of bic can be found in the following link
# http://www.modelselection.org/bic/
#in sklearn,this can be done via
#sklearn.linear_model.LassoLarsIC(criterion='bic')
def compute_bic(X,Y,method,**kwargs):

    #regression
    theta=method(X,Y,**kwargs)
    epsilon=Y-X@theta
    miu=epsilon.mean(axis=0)
    sigma=epsilon.std(axis=0)
    
    #compute log likelihood
    log_likelihood=np.log(
        scipy.stats.multivariate_normal(
            mean=miu,cov=sigma).pdf(epsilon)).sum()
    
    return -2*log_likelihood+X.shape[1]*np.log(X.shape[0])

In [27]:
#l2 bic
arr=[]
for i in np.arange(0,1,0.1):
    if i==0:
        continue
    arr.append(compute_bic(X,Y,ridge,lambda_=i))
    
print('optimal λ for Ridge via BIC')
print(np.arange(0,1,0.1)[1:][arr.index(min(arr))])

optimal λ for Ridge via BIC
0.1


In [28]:
#l1 bic
arr=[]
for i in np.arange(0,1,0.1):
    if i==0:
        continue
    arr.append(compute_bic(X,Y,lasso,lambda_=i))
    
print('optimal λ for Lasso via BIC')
print(np.arange(0,1,0.1)[1:][arr.index(min(arr))])

optimal λ for Lasso via BIC
0.1




In [29]:
#l1+l2 bic
D={}
for i in np.arange(0,1,0.1):
    for j in np.arange(0,1,0.1):
        if i==0 or j==0:
            continue
        D[(i,j)]=compute_bic(X,Y,elastic_net,lambda_=i,alpha=j)

print('optimal λ,α for Elastic Net via BIC')
print(sorted(D.items(),key=lambda x:x[1])[0][0])



optimal λ,α for Elastic Net via BIC
(0.1, 0.1)
