# 1. Introduction
In the previous notebook, we explored gradient descent, a fundamental optimization technique used to find the optimal parameters for linear regression models. Gradient descent iteratively adjusts model parameters to minimize the cost function, which measures the difference between the predicted values and the actual outcomes.

While gradient descent is effective for fitting linear models, it does not inherently address the issue of overfitting. Overfitting occurs when a model learns not only the underlying pattern but also the noise in the training data, resulting in poor generalization to new data. To combat overfitting, we use regularization techniques that add a penalty to the cost function, discouraging overly complex models.


### 1.1. Ridge and Lasso Regression
Ridge and Lasso regression are two such regularization techniques. They modify the linear regression cost function by introducing additional terms that penalize large coefficients. These techniques help to prevent overfitting by controlling the magnitude of the model parameters, leading to better generalization on unseen data. Note that we do not regularize the intercept term $\theta_0$.

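- **Ridge Regression**: Also known as L2 regularization, Ridge regression adds a penalty proportional to the sum of the squared coefficients. This results in the following modified cost function:

  $$
  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
  $$

  Here, $f(x_i)$ is the model's prediction for the $i$-th example, $m$ is the number of training examples, $n$ is the number of features, and $\lambda$ is the regularization parameter that controls the strength of the penalty. The penalty is scaled by $\frac{1}{2m}$ so that its gradient, $\frac{\lambda}{m}\theta_j$, matches the Ridge update implemented in Section 2. Ridge regression tends to shrink all coefficients towards zero but generally does not set any coefficients exactly to zero.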

- **Lasso Regression**: Also known as L1 regularization, Lasso regression adds a penalty proportional to the absolute value of the coefficients. The cost function for Lasso regression is:

  $$
  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \lambda \sum_{j=1}^{n} |\theta_j|
  $$

  In this case, $\lambda$ again controls the penalty strength. Lasso regression can drive some coefficients exactly to zero, effectively performing feature selection by ignoring the corresponding features.
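
The two cost functions above can be evaluated directly with NumPy. The following is a minimal sketch, not part of the original notebook: the helper names `ridge_cost` and `lasso_cost` are hypothetical, `X` is assumed to already contain a leading column of ones, and the Ridge penalty is scaled by $1/(2m)$, an assumption chosen to match the gradient used in the implementation in Section 2.

```python
import numpy as np

def ridge_cost(X, y, theta, lambda_reg):
    """Squared-error cost plus an L2 penalty on theta[1:] (intercept excluded)."""
    m = X.shape[0]
    errors = X @ theta - y
    error_term = (errors ** 2).sum() / (2 * m)
    # Assumption: penalty scaled by 1/(2m) to match the Ridge gradient in Section 2
    penalty = lambda_reg / (2 * m) * (theta[1:] ** 2).sum()
    return error_term + penalty

def lasso_cost(X, y, theta, lambda_reg):
    """Squared-error cost plus an L1 penalty on theta[1:] (intercept excluded)."""
    m = X.shape[0]
    errors = X @ theta - y
    error_term = (errors ** 2).sum() / (2 * m)
    penalty = lambda_reg * np.abs(theta[1:]).sum()
    return error_term + penalty

# Tiny example: X has a leading column of ones, and theta fits y exactly,
# so the error term is zero and only the penalties remain.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[2.0], [4.0], [6.0]])
theta = np.array([[0.0], [2.0]])

ridge_value = ridge_cost(X, y, theta, lambda_reg=3.0)  # 3 / (2 * 3) * 2**2 = 2.0
lasso_value = lasso_cost(X, y, theta, lambda_reg=3.0)  # 3 * |2| = 6.0
print(ridge_value, lasso_value)
```

Because the fit is exact in this toy example, the printed values isolate the two penalty terms and make the different scaling of the L1 and L2 penalties visible.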

By implementing Ridge and Lasso regression from scratch, we will see how these regularization techniques influence the model's coefficients and performance. We'll use the Diabetes dataset from `sklearn.datasets` to compare the results of Ridge and Lasso regression. This will help us understand how regularization affects the complexity and interpretability of linear models.


# 2. Implementation
We will now extend the `gradientDescentVec` function from the previous notebook. We only show the vectorized implementation of Ridge and Lasso regression, as it is computationally more efficient than a loop-based version.

In [18]:
import numpy as np

def gradientDescentVec(X, y, alpha, iterations, lambda_reg=0, regularization=0):
    """
    Perform gradient descent for linear regression with optional Ridge or Lasso regularization.

    Parameters:
    - X: Input feature matrix of shape (m, n), without the bias column
    - y: Target vector of shape (m, 1)
    - alpha: Learning rate
    - iterations: Number of iterations for gradient descent
    - lambda_reg: Regularization parameter (for Ridge or Lasso)
    - regularization: Type of regularization (0 for none, 1 for Lasso, 2 for Ridge)

    Returns:
    - theta: Optimized parameters
    """
    
    # Add a column of ones to the input matrix X for the bias term (theta_0)
    app_X = np.ones((X.shape[0], 1))
    X = np.hstack((app_X, X))
    
    # Initialize the parameter vector theta with zeros
    theta = np.zeros((X.shape[1], 1))
    
    # Perform gradient descent for the specified number of iterations
    for i in range(iterations):
        # Compute the predictions
        predictions = np.dot(X, theta)
        errors = predictions - y
        
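        if regularization == 1:
            # Lasso regularization: np.sign(theta) is a subgradient of the L1 penalty.
            # Note: as written, this term is applied to every entry of theta, including theta_0.
            gradients = (1/X.shape[0]) * np.dot(X.T, errors) + lambda_reg * np.sign(theta)
        elif regularization == 2:
            # Ridge regularization: the leading 0 leaves the intercept theta_0 unpenalized
            gradients = (1/X.shape[0]) * np.dot(X.T, errors) + (lambda_reg / X.shape[0]) * np.concatenate(([0], theta[1:].flatten()), axis=0).reshape(-1, 1)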
        else:
            # No regularization
            gradients = (1/X.shape[0]) * np.dot(X.T, errors)
        
        # Update theta using the gradient descent formula
        theta -= alpha * gradients
    
    return theta

Additional notes on the regularization terms in the implementation:

- **Lasso**:
    - `(1/X.shape[0])`: Normalizes the gradient by the number of samples (`X.shape[0]`), averaging it over all training examples.
    - `np.dot(X.T, errors)`: Computes the gradient of the unregularized cost function with respect to `theta`. Here, `X.T` is the transpose of the feature matrix `X`, and `errors` is the vector of differences between the predicted and the actual target values.
    - `lambda_reg * np.sign(theta)`: Adds a subgradient of the L1 penalty; `np.sign` returns -1, 0, or 1 for each entry of `theta`. As written, this term covers every entry of `theta`, so the Lasso branch also penalizes the intercept $\theta_0$, unlike the Ridge branch.

- **Ridge**:
    - `lambda_reg / X.shape[0]`: Scales the regularization term by the number of samples. The Lasso term has no such scaling, so the same `lambda_reg` acts far more weakly in the Ridge branch.
    - `np.concatenate(([0], theta[1:].flatten()), axis=0).reshape(-1, 1)`: Builds a column vector whose first element is 0 (for the intercept term) and whose remaining elements are the coefficients of `theta` from the second element onward. This ensures that the intercept is not regularized, while the other coefficients are penalized.
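
The intercept handling described above can be factored into a small helper. This is a sketch under stated assumptions rather than the notebook's implementation: `penalty_gradient` is a hypothetical name, both penalties skip $\theta_0$, the Ridge term is divided by the number of samples `m` as in the code above, and the Lasso term uses `np.sign` as a subgradient.

```python
import numpy as np

def penalty_gradient(theta, lambda_reg, m, regularization):
    """
    (Sub)gradient of the penalty term with respect to theta, shape (n + 1, 1).
    The first entry (intercept) is always left at zero, i.e. unpenalized.
    regularization: 0 for none, 1 for Lasso, 2 for Ridge (same codes as gradientDescentVec).
    """
    grad = np.zeros_like(theta, dtype=float)
    if regularization == 1:
        # L1 penalty: lambda * sign(theta_j) for j >= 1
        grad[1:] = lambda_reg * np.sign(theta[1:])
    elif regularization == 2:
        # L2 penalty scaled by 1/m: (lambda / m) * theta_j for j >= 1
        grad[1:] = (lambda_reg / m) * theta[1:]
    return grad

# Hypothetical parameter vector: intercept 5.0 followed by three coefficients
theta = np.array([[5.0], [-2.0], [0.0], [3.0]])
lasso_grad = penalty_gradient(theta, lambda_reg=0.5, m=4, regularization=1)
ridge_grad = penalty_gradient(theta, lambda_reg=0.5, m=4, regularization=2)
print(lasso_grad.ravel())
print(ridge_grad.ravel())
```

Adding the result of such a helper to the unregularized gradient would give a single update rule for all three cases, with the intercept treated the same way under both penalties.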

# 3. Comparison on Example Data

We will now use the Diabetes dataset to compare the parameter vectors `theta` produced by unregularized, L1-regularized (Lasso), and L2-regularized (Ridge) gradient descent.

In [31]:
# Imports
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

# Load and preprocess the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target.reshape(-1, 1)

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Set parameters for gradient descent
alpha = 0.01  # Learning rate
iterations = 1000  # Number of iterations
lambda_reg = 0.9  # Regularization parameter - high value selected for illustration

# Perform gradient descent with no regularization
theta_no_reg = gradientDescentVec(X, y, alpha, iterations, lambda_reg=0, regularization=0)

# Perform gradient descent with Ridge regularization
theta_ridge = gradientDescentVec(X, y, alpha, iterations, lambda_reg=lambda_reg, regularization=2)

# Perform gradient descent with Lasso regularization
theta_lasso = gradientDescentVec(X, y, alpha, iterations, lambda_reg=lambda_reg, regularization=1)

print("Theta with No Regularization:\n", theta_no_reg)
print("\nTheta with Ridge Regularization:\n", theta_ridge)
print("\nTheta with Lasso Regularization:\n", theta_lasso)

Theta with No Regularization:
 [[152.12691637]
 [ -0.31310953]
 [-11.25407205]
 [ 25.16330292]
 [ 15.32275063]
 [ -4.43042964]
 [ -4.24852822]
 [ -9.43327635]
 [  5.25839603]
 [ 23.01461081]
 [  3.35416841]]

Theta with Ridge Regularization:
 [[152.12691637]
 [ -0.30437793]
 [-11.22269819]
 [ 25.12210248]
 [ 15.30138751]
 [ -4.39374972]
 [ -4.25247927]
 [ -9.43217031]
 [  5.266604  ]
 [ 22.95904187]
 [  3.37400217]]

Theta with Lasso Regularization:
 [[ 1.51226956e+02]
 [-6.07332225e-04]
 [-9.54719735e+00]
 [ 2.49888211e+01]
 [ 1.43133383e+01]
 [-2.76998959e+00]
 [-2.80660436e+00]
 [-1.02610432e+01]
 [ 2.23401857e+00]
 [ 2.27969316e+01]
 [ 2.64133785e+00]]
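
One way to summarize the printed results is to compare the size of the three coefficient vectors. The sketch below copies the values from the output above (rounded to four decimals, intercepts omitted) and computes their L1 and L2 norms. The variable names and the `1e-2` cutoff for "near zero" are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Coefficients copied from the printed output above, rounded to four decimals.
# The intercepts (first entries) are omitted: the comparison concerns the feature weights.
coefficients = {
    "none": np.array([-0.3131, -11.2541, 25.1633, 15.3228, -4.4304,
                      -4.2485, -9.4333, 5.2584, 23.0146, 3.3542]),
    "ridge": np.array([-0.3044, -11.2227, 25.1221, 15.3014, -4.3937,
                       -4.2525, -9.4322, 5.2666, 22.9590, 3.3740]),
    "lasso": np.array([-0.0006, -9.5472, 24.9888, 14.3133, -2.7700,
                       -2.8066, -10.2610, 2.2340, 22.7969, 2.6413]),
}

near_zero_threshold = 1e-2  # illustrative cutoff (assumption)

summary = {}
for name, coef in coefficients.items():
    summary[name] = {
        "l1_norm": np.abs(coef).sum(),
        "l2_norm": np.sqrt((coef ** 2).sum()),
        "near_zero": int((np.abs(coef) < near_zero_threshold).sum()),
    }
    print(name, summary[name])
```

With these values, the Lasso vector has the smallest norms and the only near-zero coefficient, while the Ridge vector is only slightly smaller than the unregularized one.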
