# Using elastic net in a linear regression to predict a possum's length

The routine **[handle_solve_nldf](https://support.nag.com/numeric/py/nagdoc_latest/naginterfaces.library.opt.handle_solve_nldf.html)** is a general nonlinear data-fitting solver in the [NAG® Library](https://nag.com/nag-library/) that supports a variety of different loss functions and regularization options - including elastic net.

**Elastic net** regularization is a combination of L1 (lasso) and L2 (ridge) regularization that is ideally suited for high-dimensional and noisy data. In these cases, it can be used for **feature selection** by setting the coefficients of irrelevant features to zero. Further, it is particularly useful when dealing with **multicollinear** features - where two or more features are highly correlated. It achieves this by shrinking the coefficients of correlated features towards each other. It also helps to **reduce overfitting** by penalizing large coefficients, which can lead to better generalization performance.

To demonstrate the use of elastic net regularization, we will build a linear regression model to predict a possum's total length based upon several features, such as, capture site, age, and head length.

Note, the purpose of this notebook is to illustrate the use of handle_solve_nldf which is a general data-fitting framework that utilises nonlinear programming algorithms, such as sequential quadratic programming and interior point method. Therefore, it may not be as performant as one of our dedicated linear regression solvers.


**Reference:** \
Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among columns of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupiala). Australian Journal of Zoology 43: 449-458. \
Dataset source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from naginterfaces.library import opt
from naginterfaces.base import utils

# Set a random seed
np.random.seed(0)

## 1. Load and preprocess the data
This dataset has 13 features and 101 observations.

In [2]:
df = pd.read_csv('possum.csv', usecols=range(1,14))
df.head()

Unnamed: 0,site,Pop,sex,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,1,Vic,m,8.0,94.1,60.4,89.0,36.0,74.5,54.5,15.2,28.0,36.0
1,1,Vic,f,6.0,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0
2,1,Vic,f,6.0,94.0,60.0,95.5,39.0,75.4,51.9,15.5,30.0,34.0
3,1,Vic,f,6.0,93.2,57.1,92.0,38.0,76.1,52.2,15.2,28.0,34.0
4,1,Vic,f,2.0,91.5,56.3,85.5,36.0,71.0,53.2,15.1,28.5,33.0


In [3]:
# Remove NAN data
df.dropna(axis=0,inplace=True)

# Categorical features (site, population, sex) need to be one-hot encoded
df_encoded = pd.get_dummies(df, columns=['site','Pop','sex'], dtype=float, drop_first=True)

# Extract total length (y), which is the variable to be predicted
y = df_encoded[["totlngth"]].values
X = df_encoded.drop(columns=["totlngth"]).values

## 2. Split the data into training and testing sets

In [4]:
def train_test(X, y, test_size=0.2):
    """
    Split dataset into training and testing sets.

    Parameters:
    X (numpy array): Features
    y (numpy array): Observations
    test_size (float, optional): Proportion of data to use for testing

    Returns:
    X_train, y_train, X_test, y_test
    """
    # Get total number of samples
    num_samples = X.shape[0]

    # Calculate number of test samples
    num_test_samples = int(num_samples * test_size)

    # Generate random indices for training set
    train_indices = np.random.choice(num_samples, num_samples - num_test_samples, replace=False)

    # Create training sets
    X_train = X[train_indices]
    y_train = y[train_indices]

    # Create testing sets
    test_indices = np.setdiff1d(np.arange(num_samples), train_indices)
    X_test = X[test_indices]
    y_test = y[test_indices]

    return X_train, y_train, X_test, y_test

In [5]:
def scale_data(X_train, y_train, X_test, y_test):
    """
    Scale the training and testing datasets.

    Returns:
    X_train, y_train, X_test, y_test
    """
    mu = X_train.mean(0)
    sigma = X_train.std(0)
    for j in range(X_train.shape[-1]):
        xs = X_train[:,j]
        is_categorical = np.logical_or(np.isclose(xs, 1.), np.isclose(xs, 0.)).all()
        if not is_categorical:
            X_train[:,j] = (X_train[:,j] - mu[j]) / sigma[j]
            X_test[:,j] = (X_test[:,j] - mu[j]) / sigma[j]
    
    y_test = (y_test - y_train.mean()) / y_train.std()
    y_train = (y_train - y_train.mean()) / y_train.std()

    return X_train, y_train, X_test, y_test
    

In [6]:
# Split data into training and testing sets
X_train, y_train, X_test, y_test = train_test(X, y)

# Scale the data
X_train, y_train, X_test, y_test = scale_data(X_train, y_train, X_test, y_test)

print("Training set shapes:", X_train.shape, y_train.shape)
print("Testing set shapes:", X_test.shape, y_test.shape)

Training set shapes: (81, 17) (81, 1)
Testing set shapes: (20, 17) (20, 1)


## 3. Fit a linear regression with least squares loss and elastic net regularization

In [7]:
# Number of variables = number of features + bias term
nvar = X_train.shape[1] + 1

# Create a handle for the model
handle = opt.handle_init(nvar=nvar)

# Register residual structure
nres =  X_train.shape[0]
opt.handle_set_nlnls(handle, nres)

# Create the data structure to be passed to the solver
data = {}
data["X_train"] = X_train
data["y_train"] = y_train

# Define the residual callback function and its gradient
def lsqfun(x, nres, inform, data):
    rx = np.zeros(nres, dtype=float)
    X_train = data["X_train"]
    y_train = data["y_train"].squeeze()
    
    # Fit a linear regression to the data
    r_full = y_train - (x[0] + X_train @ x[1:]) 
    for i in range(nres):
        rx[i] = r_full[i]
 
    return rx, inform
    
def lsqgrd(x, nres, rdx, inform, data):
    X_train = data["X_train"]

    for i in range(nres):
        for j in range(nvar):
            if j==0:
                rdx[i*nvar] = -1               
            else:
                rdx[i*nvar + j] = -X_train[i, j-1]
            
    return inform

# Set loss function to l2-norm, elastic net regularization, and printing options
for option in [
    'NLDF Loss Function Type = L2',
    'Print Level = 1',
    'Print Options = No',
    'Reg Term Type = Elastic Net',
    ]:
    opt.handle_opt_set (handle, option)

# Use an explicit I/O manager for abbreviated iteration output
iom = utils.FileObjManager(locus_in_output=False)

# Set initial guess and solve
x = np.array([np.random.rand() for _ in range(nvar)])

sol_en = opt.handle_solve_nldf(handle, lsqfun, lsqgrd, x, nres, data=data, io_manager=iom)

 E04GN, Nonlinear Data-Fitting
 Status: converged, an optimal solution found
 Final objective value  1.652821E+01


In [8]:
# Resolve the problem with no regularization
opt.handle_opt_set(handle, 'Reg Term Type = Off')
sol_noreg = opt.handle_solve_nldf(handle, lsqfun, lsqgrd, x, nres, data=data, io_manager=iom)

# Destroy the handle
opt.handle_free(handle)

 E04GN, Nonlinear Data-Fitting
 Status: converged, an optimal solution found
 Final objective value  1.362383E+01


## 4. Compute root mean square error (RMSE)

In [9]:
def calculate_rmse(y_actual, y_pred):
    """
    Calculate the Root Mean Square Error (RMSE) between two lists of numbers

    Args:
        y_actual (list): The actual values
        y_pred (list): The predicted values

    Returns:
        float: The Root Mean Squared Error
    """
    return np.sqrt(np.square(y_actual - y_pred).mean())

In [10]:
# Calculate predicted values for y with elastic net regularization and find RMSE
y_pred_en = sol_en.x[0] + X_test @ sol_en.x[1:]
rmse_elastic_net = calculate_rmse(y_test, y_pred_en)

# Calculate predicted values for y with no regularization and find RMSE
y_pred_noreg = sol_noreg.x[0] + X_test @ sol_noreg.x[1:]
rmse_noreg = calculate_rmse(y_test, y_pred_noreg)

# Report the difference in RMSE
print(f"Using elastic net regularization decreased the RMSE by {round(rmse_noreg - rmse_elastic_net, 4)} compared to using no regularization.")

Using elastic net regularization decreased the RMSE by 0.034 compared to using no regularization.


For more information on the NAG® Library and our [Optimization Modelling Suite](https://nag.com/mathematical-optimization/) or to try it for yourself, visit [‘Getting Started with the NAG Library’](https://support.nag.com/content/getting-started-nag-library?_gl=1*xmlppm*_gcl_au*MTEwNDczODM2NS4xNzIyMDAyNzkz*_ga*MjA2NzgxMjY0NS4xNzIyMDAyNzk0*_ga_6MCQDQP46G*MTcyMzEzNDUxNi41LjAuMTcyMzEzNDUzNy4zOS4wLjA.), select your configuration and language, download the software, request a trial key and experiment for yourself.