# Sparse Optimization with Lasso Regression
# Feature Selection for High-Dimensional Data

## Javier Fernández Ramos

This notebook demonstrates how to use Lasso regression for feature selection in high-dimensional datasets. Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that includes a regularization term to promote sparsity in the model coefficients, effectively selecting a subset of features.
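For reference, the Lasso objective in the scaling used by scikit-learn can be written as below. The symbols are notation chosen here rather than taken from the notebook: $n$ is the number of samples, $X$ the feature matrix, $y$ the target, $w$ the coefficient vector and $\alpha$ the penalty strength. The intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The $\ell_1$ term is what drives individual coefficients exactly to zero: the larger $\alpha$ is, the fewer features survive.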

The data comes from the public Kaggle dataset "TCGA - LUSC | Lung Cancer Gene Expression Dataset" (https://www.kaggle.com/datasets/noepinefrin/tcga-lusc-lung-cell-squamous-carcinoma-gene-exp/data).

The dataset consists of 551 patient samples, each with 56970 different transcripts (expressed genes).

Lung squamous cell carcinoma (LUSC) is a cancer that arises in the lungs. Detecting and predicting it with machine learning is a challenging task that could be very helpful.

## Modelling

### Libraries

In [1]:
import numpy as np
import joblib
import pandas as pd

### Import needed data

In [14]:
X = np.load("../data/processed/X_preprocess.npy")
y = np.load("../data/processed/y.npy")
gene_names = joblib.load("../data/processed/gene_names.pkl")

print(X.shape)   # (551, N)
print(y.shape)          # (551,)
print(len(gene_names))  # N

(551, 24763)
(551,)
24763


### Data splitting

Before fitting any models, we must split our preprocessed data into training and test sets.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, roc_auc_score

from scipy.sparse import csr_matrix, issparse


In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print("Train samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])

Train samples: 413
Test samples: 138


### Baseline modelling

Before diving into Lasso, we run some baseline models so that we have metrics to compare against.

### Ridge Regression 

We now fit Ridge regression as a baseline, using cross-validation to find the optimal penalty strength.

Ridge applies an L2 penalty, which shrinks the coefficients but does not force them to zero (it does not induce sparsity).

##### It shows the benefit of regularization on generalization error (lowering MSE and improving AUC) without the added constraint of sparsity.

The performance difference between Ridge and Lasso will help us understand the trade-off between accuracy and sparsity.
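The difference between the two penalties is easiest to see in an idealized one-coefficient setting. The sketch below assumes a single unpenalized estimate `z` and compares the closed-form minimizers of the two penalized problems. The function names and the values of `z` and `lam` are made up for illustration; this is not how scikit-learn solves the full problem.

```python
import numpy as np

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: rescales z, zero only if z is zero."""
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding): exactly zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unpenalized coefficients, chosen only for illustration
z = np.array([2.0, -1.0, 0.3, -0.1, 0.0])
lam = 0.5

w_ridge = ridge_shrink(z, lam)
w_lasso = lasso_shrink(z, lam)

print("ridge:", w_ridge)
print("lasso:", w_lasso)
print("non-zero (ridge):", np.count_nonzero(w_ridge))
print("non-zero (lasso):", np.count_nonzero(w_lasso))
```

Ridge rescales every coefficient and keeps all originally non-zero entries, whereas the L1 solution sets every entry with magnitude below `lam` exactly to zero.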

In [44]:
from sklearn.linear_model import RidgeCV

In [None]:
# Define a range of alpha values for the cross-validation
alphas = np.logspace(-2, 6, 50) # 50 log-spaced alpha values from 10^-2 to 10^6 (0.01 to 1,000,000)
# The RidgeCV model will automatically search these alphas
ridge_cv = RidgeCV(
    alphas=alphas, # search for the best-performing alpha
    cv=5, # 5-fold cross-validation: data is split into 5 parts, train on 4, validate on 1, repeat 5 times
    scoring='neg_mean_squared_error' # Optimize for lowest MSE
)
ridge_cv.fit(X_train, y_train)

ridge_y_pred = ridge_cv.predict(X_test)

print(f"Optimal Alpha (Ridge): {ridge_cv.alpha_:.4f}")
print(f"Ridge MSE on Test Set: {mean_squared_error(y_test, ridge_y_pred):.4f}")
print(f"Ridge AUC on Test Set: {roc_auc_score(y_test, ridge_y_pred):.4f}") # How well the model ranks positive vs negative samples
print(f"Number of non-zero coefficients: {np.sum(ridge_cv.coef_ != 0)}")

Optimal Alpha (Ridge): 5179.4747
Ridge MSE on Test Set: 0.0094
Ridge AUC on Test Set: 1.0000
Number of non-zero coefficients: 24763


The optimal alpha is about 5179, and we obtain a very small MSE and an AUC of 1, which means very good performance on the test set. There are several possible explanations:

1. The dataset is linearly separable, allowing linear models such as Ridge and Lasso to achieve high performance.
2. The features are highly informative, enabling effective classification even under strong regularization.
3. Data leakage.

In any case, this gives us a good dense baseline against which to compare Lasso regression.
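An AUC of exactly 1 alongside a non-zero MSE can look contradictory, so it helps to recall what AUC measures when it is fed continuous regression outputs. The sketch below assumes binary 0/1 labels and computes AUC as the fraction of (positive, negative) pairs that are ranked correctly. `pairwise_auc` is a hypothetical helper written for this illustration and the toy scores are invented.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy labels, invented for illustration
y_toy = np.array([0, 0, 1, 1])

# Imperfect ranking: one negative (0.4) outranks one positive (0.35)
auc_mixed = pairwise_auc(y_toy, [0.1, 0.4, 0.35, 0.8])

# Perfect ranking, even though the scores are far from the 0/1 labels
scores_sep = np.array([0.40, 0.45, 0.55, 0.60])
auc_sep = pairwise_auc(y_toy, scores_sep)
mse_sep = float(np.mean((y_toy - scores_sep) ** 2))

print(auc_mixed, auc_sep, mse_sep)
```

AUC depends only on the ordering of the scores, so a perfect ranking is compatible with predictions that are not close to 0 or 1.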

In [47]:
if len(gene_names) != len(ridge_cv.coef_):
    print("Length mismatch")
    print(f"List has {len(gene_names)} names, but Model has {len(ridge_cv.coef_)} coefficients.")
else:
    print("Sanity Check Passed: gene names match the model dimensions.")
    print(f"List has {len(gene_names)} names and the model has {len(ridge_cv.coef_)} coefficients.")

    # 2. Re-attach names to the coefficients
    # We create a pandas Series where the data is the coefficients and the index is the gene names
    ridge_series = pd.Series(ridge_cv.coef_, index=gene_names)

    # 3. Rank the non-zero coefficients by absolute value (sign ignored)
    selected_genes = ridge_series[ridge_series != 0].abs().sort_values(ascending=False)
    # Separate file name, so the Lasso results saved later are not mixed up with these
    selected_genes.to_csv("final_ridge_genes.csv")

    print("\nRIDGE INFLUENTIAL DISCOVERED GENES")
    print(selected_genes.head(10))

    # Signed values in descending order: this lists the largest POSITIVE coefficients
    selected_genes = ridge_series[ridge_series != 0].sort_values(ascending=False)

    print("\nRIDGE POSITIVE INFLUENTIAL DISCOVERED GENES")
    print(selected_genes.head(10))

Sanity Check Passed: gene names match the model dimensions.
List has 24763 names and the model has 24763 coefficients.

RIDGE INFLUENTIAL DISCOVERED GENES
SPP1          0.000620
AL035665.1    0.000580
HBA1          0.000553
R3HDML        0.000538
TUBB1         0.000524
ITLN1         0.000517
PKHD1L1       0.000510
MIR4530       0.000496
LRRN4CL       0.000482
SDS           0.000479
dtype: float64

RIDGE POSITIVE INFLUENTIAL DISCOVERED GENES
SPP1          0.000620
AL035665.1    0.000580
R3HDML        0.000538
MIR4530       0.000496
SDS           0.000479
MMP9          0.000474
MYBPH         0.000473
MIR6774       0.000444
DSP           0.000428
AL117382.1    0.000427
dtype: float64


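No single gene dominates the Ridge coefficients (the largest has a magnitude of about 0.0006), so there is no obvious sign of one feature leaking the label. We therefore have a very good dense baseline.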
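A complementary check is to scan for any single column that is almost perfectly correlated with the label. The sketch below assumes leakage would show up that way, which covers only one form of leakage. It runs on synthetic stand-in data with a deliberately planted leak. The helper name and the 0.99 threshold are arbitrary choices for illustration.

```python
import numpy as np

def abs_label_correlation(X, y):
    """Absolute Pearson correlation of each column of X with y (constant columns give 0)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.zeros(X.shape[1])
    nonconstant = denom > 0
    corr[nonconstant] = (Xc[:, nonconstant].T @ yc) / denom[nonconstant]
    return np.abs(corr)

# Synthetic stand-in data (not the TCGA matrix): 200 samples, 50 random features
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200).astype(float)
X_demo = rng.normal(size=(200, 50))
X_demo[:, 7] = 3.0 * y_demo + 1.0   # planted leak: column 7 is a rescaled copy of the label

abs_corr = abs_label_correlation(X_demo, y_demo)
suspects = np.flatnonzero(abs_corr > 0.99)   # 0.99 is an arbitrary illustrative threshold
print("Suspicious columns:", suspects)
```

On the real data the same function could be applied to `X_train` and `y_train`; an empty result would be consistent with the conclusion above, without proving it.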

## Lasso Challenge

With a solid dense baseline in place (Ridge, OLS and Dummy Regressor), we now test whether Lasso can reduce the feature set to the small group of genes the problem actually needs, adding sparsity and lowering the complexity of the model.

### Feature selection

We now want to see the number of non-zero coefficients, i.e. the number of selected features, drop from 24763 to something much smaller.

Performance might drop slightly, but that is the trade-off we accept for a much simpler model.
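How many features survive depends entirely on the penalty strength. When no explicit `alphas` are passed, `LassoCV` searches a log-spaced grid that starts at the smallest alpha for which every coefficient is zero. The sketch below shows how such a grid can be derived. It assumes scikit-learn's objective scaling and what we understand to be its defaults (100 values, smallest alpha equal to 1e-3 times the largest). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def lasso_alpha_grid(X, y, n_alphas=100, eps=1e-3):
    """Log-spaced alphas from alpha_max (all coefficients zero) down to eps * alpha_max."""
    Xc = X - X.mean(axis=0)          # centering accounts for the intercept
    yc = y - y.mean()
    alpha_max = np.max(np.abs(Xc.T @ yc)) / X.shape[0]
    return np.geomspace(alpha_max, alpha_max * eps, num=n_alphas)

# Synthetic stand-in data (not the TCGA matrix): more features than samples
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 300))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(float)

alphas_demo = lasso_alpha_grid(X_demo, y_demo)
print(f"alpha_max = {alphas_demo[0]:.4f}, smallest alpha = {alphas_demo[-1]:.6f}")
```

Moving down the grid from `alpha_max`, features enter the model progressively, and cross-validation picks the point with the best validation error.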

In [None]:
from sklearn.linear_model import LassoCV
import numpy as np
import pandas as pd


lasso_cv = LassoCV(
    cv=5, 
    random_state=42, 
    max_iter=10000,  # Increased from default 1000 to ensure convergence
    n_jobs=-1        # Use all CPU cores
)

# Fit the Model
lasso_cv.fit(X_train, y_train)

# Predict
lasso_y_pred = lasso_cv.predict(X_test)

# Evaluate sparsity
# Count how many coefficients are NOT zero
n_features = len(lasso_cv.coef_)
n_nonzero = np.sum(lasso_cv.coef_ != 0)
sparsity_ratio = (1 - (n_nonzero / n_features)) * 100

print("\n   LASSO RESULTS (Sparse Model)")
print(f"Optimal Alpha: {lasso_cv.alpha_:.6f}")
print(f"Lasso MSE: {mean_squared_error(y_test, lasso_y_pred):.4f}")
print(f"Lasso AUC: {roc_auc_score(y_test, lasso_y_pred):.4f}")
print("\n")
print(f"Original Features: {n_features}")
print(f"Features Selected (Non-Zero): {n_nonzero}")
print(f"Sparsity Achieved: {sparsity_ratio:.2f}% of features removed")

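# Inspect the survivors (the features Lasso kept)
if hasattr(X_train, 'columns'):
    lasso_coefs = pd.Series(lasso_cv.coef_, index=X_train.columns)
else:
    # X_train is a NumPy array here, so features are labelled by column index
    lasso_coefs = pd.Series(lasso_cv.coef_)

# Show the five largest coefficients by absolute value
print("\nTop 5 Features selected:")
print(lasso_coefs.abs().sort_values(ascending=False).head(5))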



   LASSO RESULTS (Sparse Model)
Optimal Alpha: 0.008514
Lasso MSE: 0.0102
Lasso AUC: 1.0000


Original Features: 24763
Features Selected (Non-Zero): 95
Sparsity Achieved: 99.62% of features removed

Top 5 Features selected:
19608    0.060168
14453    0.025671
13138    0.016630
21186    0.015327
15235    0.012584
dtype: float64


In [None]:
import pandas as pd
import numpy as np

# 1. Sanity Check
# The length of your name list MUST match the number of features in the model (24763)
# If this fails, the list represents the wrong step in your pipeline.
if len(gene_names) != len(lasso_cv.coef_):
    print("Length mismatch")
    print(f"List has {len(gene_names)} names, but Model has {len(lasso_cv.coef_)} coefficients.")
else:
    print("Sanity Check Passed: gene names match the model dimensions.")
    print(f"List has {len(gene_names)} names and the model has {len(lasso_cv.coef_)} coefficients.")

    # 2. Re-attach names to the coefficients
    # We create a pandas Series where the data is the coefficients and the index is the gene names
    lasso_series = pd.Series(lasso_cv.coef_, index=gene_names)

    # 3. Filter for the non-zero survivors, ranked by absolute value (sign ignored)
    selected_genes = lasso_series[lasso_series != 0].abs().sort_values(ascending=False)
    selected_genes.to_csv("final_lasso_genes.csv")

    print("\nTOP INFLUENTIAL DISCOVERED GENES")
    print(selected_genes.head(10))

    # Signed values in descending order: this lists the largest POSITIVE coefficients
    selected_genes = lasso_series[lasso_series != 0].sort_values(ascending=False)

    print("\nTOP POSITIVE INFLUENTIAL DISCOVERED GENES")
    print(selected_genes.head(10))


Sanity Check Passed: gene names match the model dimensions.
List has 24763 names and the model has 24763 coefficients.

TOP INFLUENTIAL DISCOVERED GENES
RS1          0.060168
MIR3945HG    0.025671
LINC00702    0.016630
SPP1         0.015327
MYRIP        0.012584
MYBL2        0.011655
DSP          0.011527
TUBB1        0.010998
PKHD1L1      0.010713
HBA1         0.009160
dtype: float64

TOP POSITIVE INFLUENTIAL DISCOVERED GENES
SPP1          0.015327
MYBL2         0.011655
DSP           0.011527
HMGA1         0.008613
AL035665.1    0.007960
FBXO32        0.006323
TNFRSF13C     0.005843
DDIT4L        0.004779
C16orf70      0.004028
CXCL13        0.003990
dtype: float64


## Results

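Among the most influential genes selected by Lasso, several have known links to tumors. SPP1 and MYBL2 have positive coefficients, so, assuming the positive class encodes tumor samples, higher expression pushes the prediction toward tumor. LINC00702 appears only in the absolute-value ranking, which means its coefficient is negative.

- LINC00702, which is associated with meningioma (meningeal cell tumor).

- MYBL2, which is associated with myeloma. It is present in patients with different cancers, such as gastric cancer, breast cancer or neuroblastoma, and in cerebellar degeneration.

- SPP1, which is associated with cholangiocarcinoma, with tumors and cancer proliferation in the liver and skin, and with nephrological diseases.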
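When interpreting selected genes, the sign of each coefficient matters as much as its magnitude. The sketch below shows one way to list the strongest coefficients while keeping their sign, and to separate positive from negative ones. The gene names and values are invented for illustration and are not the notebook's results.

```python
import pandas as pd

# Hypothetical coefficients, invented for illustration (not the notebook's values)
coefs = pd.Series(
    {"GENE_A": 0.05, "GENE_B": -0.08, "GENE_C": 0.0, "GENE_D": 0.02, "GENE_E": -0.01}
)
nonzero = coefs[coefs != 0]

# Largest magnitude first, keeping the sign visible
by_magnitude = nonzero.reindex(nonzero.abs().sort_values(ascending=False).index)

top_positive = nonzero[nonzero > 0].nlargest(2)    # push predictions up
top_negative = nonzero[nonzero < 0].nsmallest(2)   # push predictions down

print(by_magnitude)
print(top_positive)
print(top_negative)
```

Applied to `lasso_series`, the same pattern would show directly which selected genes push the prediction up and which push it down.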