### Learning Hyperparameters Online

In the L1-penalized regression notebook, the hyperparameter $\lambda$ controls the amount of regularization. In this notebook, we outline an algorithm for updating this parameter adaptively online.

In [1]:
import numpy as np

Given our observations, we wish to fit a sparse linear model of the form

$$y = X \beta + \epsilon$$

where $X$ is an $n \times p$ matrix of observations, $y$ is an $n$-vector of responses, $\beta$ is a $p$-vector of coefficients, and $\epsilon$ is an $n$-vector of noise. We assume that the noise terms are independent and identically distributed (i.i.d.) with mean zero and variance $\sigma^2$.

Under the sparsity assumption, we estimate $\beta$ by minimizing the Lasso objective, where $x_i$ denotes the $i$-th row of $X$:
$$ \min_{\beta} \frac{1}{2} \sum_{i=1}^n (y_i - \beta^T x_i)^2 + \lambda \sum_{j=1}^p |\beta_j| $$
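As a minimal sketch, the objective can be evaluated directly with NumPy. The function name `lasso_objective`, the synthetic data, and the chosen values of `n`, `p`, and `lam` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """Evaluate 0.5 * sum_i (y_i - beta^T x_i)^2 + lam * sum_j |beta_j|."""
    residual = y - X @ beta
    return 0.5 * residual @ residual + lam * np.abs(beta).sum()

# Synthetic sparse problem (illustrative values only)
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]          # only three non-zero coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(n)

obj_sparse = lasso_objective(X, y, beta_true, lam=1.0)
obj_zero = lasso_objective(X, y, np.zeros(p), lam=1.0)
```

With this data, the sparse coefficient vector pays a small penalty but fits the responses far better than the all-zero vector, so `obj_sparse` is the smaller of the two.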



Ref:  Adaptive regularization for Lasso models in the context of non-stationary data streams  https://arxiv.org/abs/1610.09127

### Update Rule

Defining $C(X_{t+1})$ as the cost function for the next observations $X_{t+1}$, we can update the regularization parameter as

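$$ \lambda_{t+1} = \lambda_t - \epsilon \ \frac{\partial{C (X_{t+1})}}{\partial{\hat{\beta_t}}}\frac{\partial{\hat{\beta_t}}}{\partial{\lambda_t}} $$

where $\epsilon$ is the learning rate (distinct from the noise term in the model above). The product of the two partial derivatives is the chain-rule expansion of $\partial C(X_{t+1}) / \partial \lambda_t$, and the minus sign makes the update a gradient descent step on the cost.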
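
If the cost is taken to be the squared prediction error on the new batch (an assumption here, though it matches the gradient computed in `update_lambda` below), the first factor has a closed form:

$$ C(X_{t+1}) = \lVert y_{t+1} - X_{t+1} \hat{\beta_t} \rVert_2^2 \quad \Rightarrow \quad \frac{\partial{C (X_{t+1})}}{\partial{\hat{\beta_t}}} = -2 X_{t+1}^T \left( y_{t+1} - X_{t+1} \hat{\beta_t} \right) $$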

Using the piecewise-linear solution path of the Lasso, we can expand the second partial derivative as

$$ \frac{\partial{\hat{\beta_t}}}{\partial{\lambda_t}} = -(S_t)^{-1} \operatorname{sign}(\hat{\beta_t}) $$

where $S_t$ is the covariance matrix of the observations $X_t$
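
The derivative above can be checked numerically with a finite difference. The sketch below makes two assumptions that are not in the original notebook: it uses a small coordinate-descent solver (the hypothetical helpers `soft_threshold` and `lasso_cd`) for the objective $\frac{1}{2}\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1$, and it takes $S_t$ to be the unnormalized Gram matrix $X_A^T X_A$ restricted to the active (non-zero) coefficients, which is the scaling implied by that objective rather than a centered, normalized covariance.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1 (hypothetical helper)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with the contribution of feature j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

# Synthetic sparse problem (illustrative values only)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)

lam, delta = 5.0, 1e-3
beta_hat = lasso_cd(X, y, lam)
active = beta_hat != 0

# finite-difference estimate of d(beta_hat)/d(lambda)
fd = (lasso_cd(X, y, lam + delta) - beta_hat) / delta

# analytic derivative: -(X_A^T X_A)^{-1} sign(beta_A) on the active set, zero elsewhere
G = X[:, active].T @ X[:, active]
analytic = np.zeros(X.shape[1])
analytic[active] = -np.linalg.solve(G, np.sign(beta_hat[active]))
```

Because the Lasso solution is piecewise linear in $\lambda$, `fd` and `analytic` should agree closely as long as the active set does not change between `lam` and `lam + delta`.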

In [2]:
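def update_lambda(X, y, beta, lambda_t, epsilon=0.1):
    """
    Update the Lasso regularization parameter with one gradient descent step.

    X, y are the new batch of observations, beta is the current coefficient
    estimate, lambda_t is the current regularization parameter, and epsilon
    is the learning rate. Returns the updated (scalar) parameter.
    """
    beta = np.asarray(beta, dtype=float)
    signs = np.sign(beta)

    # if the active set is empty, use the sign of the most correlated predictor (as in LARS)
    # as the step direction, so the update is non-zero and beta is not modified in place
    if not np.any(signs):
        corr = np.dot(X.T, y)
        j = np.argmax(np.abs(corr))
        signs[j] = np.sign(corr[j])

    # gradient of the squared-error cost with respect to the parameter estimate
    dc_dbt = -2 * np.dot(X.T, y) + 2 * np.dot(np.dot(X.T, X), beta)

    # sample covariance matrix of the predictors (columns of X are variables)
    S = np.atleast_2d(np.cov(X, rowvar=False))

    # gradient of the parameter estimate with respect to the regularization parameter;
    # solving the linear system avoids forming the inverse explicitly
    dbt_dlambda_t = -np.linalg.solve(S, signs)

    # chain rule: the inner product gives the scalar derivative dC/dlambda
    dc_dlambda_t = np.dot(dc_dbt, dbt_dlambda_t)

    # gradient descent step on the regularization parameter
    lambda_t_1 = lambda_t - epsilon * dc_dlambda_t

    return lambda_t_1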