# Variable selection: an introduction

Lino Galiana  
2025-10-07

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/modelisation/4_featureselection.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«4_featureselection»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/modelisation%204_featureselection%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«4_featureselection»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/modelisation%204_featureselection%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/modelisation/4_featureselection.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

This chapter uses the same dataset as the previous ones, presented in the [introduction
to this section](index.qmd): US presidential election voting data
combined with sociodemographic variables.
The code
is available [on GitHub](https://github.com/linogaliana/python-datascientist/blob/main/content/modelisation/get_data.py).

In [None]:
!pip install --upgrade xlrd  # works around a Colab bug with the xlrd version
!pip install geopandas

In [None]:
import requests

url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
# Save the script locally so it can be imported as a module
with open('getdata.py', 'wb') as f:
    f.write(r.content)

import getdata
votes = getdata.create_votes_dataframes()

So far, we have assumed that the variables useful for predicting the Republican
vote were known to the modeler. Thus, we have only used a limited portion of the
available variables in our data. However, beyond the computational burden of building
a model with a large number of variables, choosing a limited number of variables
(a parsimonious model) reduces the risk of overfitting.

How, then, can we choose the right number of variables and the best combination of these variables? There are multiple methods, including:

-   Relying on statistical performance criteria that penalize non-parsimonious models, such as the BIC (Bayesian information criterion).
-   *Backward elimination* techniques.
-   Building models whose objective function penalizes the lack of parsimony (which is what we aim to do here).
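
To make the first approach concrete, here is a minimal sketch of a BIC comparison between two nested linear models. It assumes Gaussian errors and uses the form $n \log(\text{RSS}/n) + k \log(n)$, which is the BIC up to an additive constant. The data are simulated and the helper names (`ols_rss`, `bic`) are illustrative, not part of the tutorial's code.

```python
import numpy as np

def ols_rss(X: np.ndarray, y: np.ndarray) -> float:
    """Residual sum of squares of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def bic(X: np.ndarray, y: np.ndarray) -> float:
    """Gaussian BIC up to an additive constant: n*log(RSS/n) + k*log(n)."""
    n = len(y)
    k = X.shape[1] + 1  # slopes + intercept
    return n * np.log(ols_rss(X, y) / n) + k * np.log(n)

# Simulated data: only the first two of ten predictors matter
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

rss_small, rss_full = ols_rss(X[:, :2], y), ols_rss(X, y)
bic_small, bic_full = bic(X[:, :2], y), bic(X, y)

print(f"RSS  small: {rss_small:.1f}  full: {rss_full:.1f}")
print(f"BIC  small: {bic_small:.1f}  full: {bic_full:.1f}")
```

Adding predictors can only lower the RSS, so the in-sample fit alone always favours the larger model. The $k \log(n)$ term is what lets a smaller model win when the extra predictors bring too little.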

In this chapter, we will present the main challenges of variable selection through LASSO.

We will subsequently use the following functions or packages:

In [None]:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
import sklearn.metrics
from sklearn.linear_model import LinearRegression
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
import seaborn as sns

# 1. The Principle of LASSO

## 1.1 General Principle

The class of *feature selection* models is very broad and includes
a diverse range of models. We will focus on LASSO
(*Least Absolute Shrinkage and Selection Operator*),
which is an extension of linear regression aimed at selecting
*sparse* models. This type of model is central to the field of
*compressed sensing* (where the term *L1-regularization* is more commonly used than LASSO). LASSO is a special case of
elastic-net regression, another well-known special case being *ridge regression*.
Unlike classical linear regression, these methods also work
in a setting where $p>N$, i.e., where the number of predictors is larger than
the number of observations.
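
As an illustration of the $p>N$ case, the sketch below fits an ordinary linear regression and a LASSO on simulated data with 200 predictors and only 50 observations. Everything here is an assumption made for the example (the data-generating process, the value `alpha=0.5`, the variable names); it is not the election dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n_obs, n_feat = 50, 200  # more predictors than observations

X_toy = rng.normal(size=(n_obs, n_feat))
beta_true = np.zeros(n_feat)
beta_true[:5] = 3.0  # only 5 predictors truly matter
y_toy = X_toy @ beta_true + 0.1 * rng.normal(size=n_obs)

# OLS interpolates the training data when p > N: the fit is perfect
# in sample, which says nothing about predictive quality
ols = LinearRegression().fit(X_toy, y_toy)
r2_ols_train = ols.score(X_toy, y_toy)

# LASSO keeps only a subset of the predictors
lasso_toy = Lasso(alpha=0.5).fit(X_toy, y_toy)
n_selected = int(np.count_nonzero(lasso_toy.coef_))

print(f"OLS in-sample R2: {r2_ols_train:.3f}")
print(f"LASSO keeps {n_selected} predictors out of {n_feat}")
```

The in-sample $R^2$ of the unpenalized regression is uninformative here, since there are enough predictors to reproduce any training target exactly.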

## 1.2 Regularization

By adding a penalty term to the objective function,
LASSO shrinks the coefficients and sets some of them exactly to 0.
The variables whose coefficients remain non-zero are the ones that pass the selection step.
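
The following sketch contrasts this behaviour with ridge regression, which shrinks coefficients without setting them exactly to zero. The data are simulated (one informative predictor among ten) and the penalty values are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 10))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(size=200)  # only column 0 matters

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("LASSO coefficients:", np.round(lasso_demo.coef_, 2))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 2))
print(f"Exact zeros: LASSO={n_zero_lasso}, ridge={n_zero_ridge}")
```

Note that the `alpha` argument does not mean the same thing in `Lasso` and `Ridge`, so the two values are not directly comparable; the point is only the presence or absence of exact zeros.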

> **Tip**
>
> LASSO is a constrained optimization problem. It seeks to find the estimator $\beta$ that minimizes the quadratic error (linear regression) under an additional constraint regularizing the parameters:
> $$
> \begin{aligned}
> \min_{\beta} \quad & \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y  \big)^2 \bigg) \\
> \text{s.t.} \quad & \sum_{j=1}^p |\beta_j| \leq t
> \end{aligned}
> $$
>
> This program is reformulated using the Lagrangian, allowing for a more tractable minimization program:
>
> $$
> \beta^{\text{LASSO}} = \arg \min_{\beta} \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y  \big)^2 \bigg) + \alpha \sum_{j=1}^p |\beta_j| = \arg \min_{\beta} ||y-X\beta||_{2}^{2} + \lambda ||\beta||_1
> $$
>
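> where $\lambda$ is a rescaled version of the penalty parameter $\alpha$: when the expectation is replaced by its empirical mean over $N$ observations, the two objectives differ only by the factor $2N$, so that $\lambda = 2N\alpha$ and the minimizer is unchanged. The larger this parameter, the more strongly non-parsimonious models are penalized.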

## 1.3 First LASSO Regression

Since we want to find the
best predictors of the Republican vote,
we first remove the variables
from which it can be directly derived: the competitors’ scores!

In [None]:
import pandas as pd

df2 = pd.DataFrame(votes.drop(columns='geometry'))
df2 = df2.loc[
  :,
  ~df2.columns.str.endswith(
    ('_democrat','_green','_other', 'winner', 'per_point_diff', 'per_dem')
    )
  ]


df2 = df2.loc[:,~df2.columns.duplicated()]
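
The two filtering steps above can be checked on a toy table. This is a minimal sketch with made-up column names and values; it assumes a recent version of pandas, in which `str.endswith` accepts a tuple of suffixes.

```python
import numpy as np
import pandas as pd

# Toy table with a duplicated column name and two columns to drop
toy = pd.DataFrame(
    np.arange(10).reshape(2, 5),
    columns=[
        "per_gop", "share_2016_democrat", "share_2016_republican",
        "winner", "per_gop",
    ],
)

suffixes = ("_democrat", "winner")

# Step 1: drop every column whose name ends with one of the suffixes
kept = toy.loc[:, ~toy.columns.str.endswith(suffixes)]

# Step 2: keep only the first occurrence of each column name
kept = kept.loc[:, ~kept.columns.duplicated()]

print(kept.columns.tolist())
```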

In this exercise, we will also use
a helper function that extracts
the variables selected by LASSO:

In [None]:
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

def extract_features_selected(lasso: Pipeline, preprocessing_step_name: str = 'preprocess') -> pd.Series:
    """
    Extracts selected features based on the coefficients obtained from Lasso regression.

    Parameters:
    - lasso (Pipeline): The scikit-learn pipeline containing a trained Lasso regression model.
    - preprocessing_step_name (str): The name of the preprocessing step in the pipeline. Default is 'preprocess'.

    Returns:
    - pd.Series: A Pandas Series containing selected features with non-zero coefficients.
    """
    # Check if lasso object is provided
    if not isinstance(lasso, Pipeline):
        raise ValueError("The provided lasso object is not a scikit-learn pipeline.")

    # Extract the final step of the pipeline (the estimator)
    lasso_model = lasso[-1]

    # Check if lasso_model is a Lasso regression model
    if not isinstance(lasso_model, Lasso):
        raise ValueError("The final step of the pipeline is not a Lasso regression model.")

    # Check if lasso model has 'coef_' attribute
    if not hasattr(lasso_model, 'coef_'):
        raise ValueError("The provided Lasso regression model does not have 'coef_' attribute. "
                         "Make sure it is a trained Lasso regression model.")

    # Get feature names from the preprocessing step
    features_preprocessing = lasso[preprocessing_step_name].get_feature_names_out()

    # Extract selected features based on non-zero coefficients
    features_selec = pd.Series(features_preprocessing[np.abs(lasso_model.coef_) > 0])

    return features_selec

> **Exercise 1: First LASSO**
>
> We are still trying to predict the variable `per_gop`. Before making our estimation, we will create certain intermediate objects to define our *pipeline*:
>
> 1.  In our `DataFrame`, replace infinite values with `NaN`.
>
> 2.  Create a training sample and a test sample.
>
> Now we can move on to defining our *pipeline*.
> [This example](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) might serve as inspiration, as well as [this one](https://www.kaggle.com/code/bextuychiev/lasso-regression-with-pipelines-tutorial).
>
> 3.  First, create the *preprocessing* steps for our model.
>     For this, it is common to separate the steps applied to continuous numerical variables from those applied to categorical variables.
>
>     -   For **numerical variables**, impute with the mean and then standardize;
>     -   For **categorical variables**, linear regression techniques require using one-hot encoding. Before performing one-hot encoding, impute with the most frequent value.
>
> 4.  Finalize the *pipeline* by adding the estimation step and then estimate a LASSO model penalized with $\alpha = 0.1$.
>
> Assuming your *pipeline* is stored in an object named `pipeline` and the last step is named `model`, you can directly access this step using the object `pipeline['model']`.
>
> 5.  Display the coefficient values. Which variables have a non-zero value?
> 6.  Show that the selected variables are sometimes highly correlated.
> 7.  Compare the performance of this parsimonious model with that of a model with more variables.
>
> <details>
>
> <summary>
>
> Help for Question 1
>
> </summary>
>
> ``` python
> # Replace infinities with NaN
> df2.replace([np.inf, -np.inf], np.nan, inplace=True)
> ```
>
> </details>
>
> <details>
>
> <summary>
>
> Help for Question 3
>
> </summary>
>
> The pipeline definition follows this structure:
>
> ``` python
> numeric_pipeline = Pipeline(steps=[
>     ('impute', # define the imputation method here
>      ),
>     ('scale', # define the standardization method here
>     )
> ])
>
> categorical_pipeline = # adapt the template
>
> # Define numerical_features and categorical_features beforehand
> preprocessor = ColumnTransformer(transformers=[
>     ('number', numeric_pipeline, numerical_features),
>     ('category', categorical_pipeline, categorical_features)
> ])
> ```
>
> </details>

The *preprocessing* pipeline (question 3) takes the following form:

The *pipeline* takes the following form once finalized (question 4):

At the end of question 5, the selected variables are:

The model is quite parsimonious as it uses a subset of our initial variables (especially since our categorical variables have been split into numerous variables through *one hot encoding*).

Some variables make sense, such as education-related variables. Notably, one of the best predictors for the Republican score in 2020 is… the Republican score (and mechanically the Democratic score) in 2016 and 2012.

Additionally, redundant variables are being selected. A more thorough data cleaning phase would actually be necessary.

The parsimonious model performs (slightly) better:

Moreover, it can already be noted that regressing the 2020 score on the 2016 score results in very good explanatory performance, suggesting that voting behaves like an autoregressive process:

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
smf.ols("per_gop ~ share_2016_republican", data = df2).fit().summary()

# 2. Role of the Penalty $\alpha$ in Variable Selection

So far, we have taken the hyperparameter $\alpha$
as given. What role does it play in the conclusions of
our modeling? To investigate this, we can explore the effect
of its value on the number of variables passing the selection step.

For the next exercise, we will consider exclusively
quantitative variables to speed up the computations.
Indeed, with non-parsimonious models, the multiple
categories of our categorical variables make the optimization problem
difficult to handle.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df2.replace([np.inf, -np.inf], np.nan, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(
    df2.drop(["per_gop"], axis = 1),
    100*df2[['per_gop']], test_size=0.2, random_state=0
)

numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
preprocessed_features = pd.DataFrame(
  numeric_pipeline.fit_transform(
    X_train.drop(columns = categorical_features)
  )
)


Skipping features without any observed values: ['POV04_2021' 'CI90LB04_2021' 'CI90UB04_2021' 'PCTPOV04_2021'
 'CI90LB04P_2021' 'CI90UB04P_2021']. At least one non-missing value is needed for imputation with strategy='mean'.


> **Exercise 2: Role of the Penalty Parameter**
>
> Use the `lasso_path` function to evaluate the number of parameters selected by LASSO as $\alpha$
> varies (explore $\alpha \in \{0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0\}$).

The relationship you should obtain between $\alpha$ and
the number of parameters is as follows:

We see that the higher $\alpha$ is, the fewer variables the model selects.
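
The mechanics can be sketched on simulated data (not the election dataset, so the numbers differ from those of the exercise). The grid below is the one from the exercise plus one extra value, `2 * alpha_max`, added as an assumption for the illustration: for centered data and no intercept, any penalty above $\max_j |X_j^\top y| / N$ sets every coefficient to zero.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n_obs, n_feat = 300, 20

# Standardized predictors and centered target (lasso_path fits no intercept)
X_path = rng.normal(size=(n_obs, n_feat))
X_path = (X_path - X_path.mean(axis=0)) / X_path.std(axis=0)
beta = np.concatenate([[4.0, -3.0, 2.0, 1.0, 0.5], np.zeros(n_feat - 5)])
y_path = X_path @ beta + rng.normal(size=n_obs)
y_path = y_path - y_path.mean()

# Smallest penalty for which all coefficients are zero
alpha_max = np.max(np.abs(X_path.T @ y_path)) / n_obs
grid = np.array(
    [0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0, 2 * alpha_max]
)

# coefs has one column per alpha, in the order given by alphas_out
alphas_out, coefs, _ = lasso_path(X_path, y_path, alphas=grid)
n_selected = np.count_nonzero(coefs, axis=0)

for a, k in zip(alphas_out, n_selected):
    print(f"alpha={a:.3f} -> {k} variables selected")
```

Always read the counts against `alphas_out` rather than against the grid that was passed in, since `lasso_path` may reorder the penalties it returns.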

# 3. Cross-Validation to Select the Model

Which $\alpha$ should be preferred? To answer this,
we use cross-validation to choose the value of $\alpha$
for which the variables passing the selection phase best predict
the Republican score out of sample.

In [None]:
from sklearn.linear_model import LassoCV

my_alphas = np.array([0.001,0.01,0.02,0.025,0.05,0.1,0.25,0.5,0.8,1.0])

lcv = (
  LassoCV(
    alphas=my_alphas,
    fit_intercept=False,
    random_state=0,
    cv=5
    ).fit(
      preprocessed_features, y_train
    )
)

The *“best”* $\alpha$ can be retrieved as follows:

In [None]:
print("Optimal alpha:", lcv.alpha_)

Optimal alpha: 1.0
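
The sketch below shows, on simulated data, how to inspect the cross-validated error behind this choice and how to check whether the selected value lies at an edge of the grid. That is the case for the value 1.0 reported above, which is the largest value explored, and it usually suggests trying a wider grid. The data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(300, 20))
y_cv = 4.0 * X_cv[:, 0] - 3.0 * X_cv[:, 1] + rng.normal(size=300)

grid = np.array([0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0])
lcv_toy = LassoCV(alphas=grid, cv=5).fit(X_cv, y_cv)

# mse_path_ has one row per alpha (ordered as in lcv_toy.alphas_)
# and one column per fold
mean_mse = lcv_toy.mse_path_.mean(axis=1)
for a, m in zip(lcv_toy.alphas_, mean_mse):
    print(f"alpha={a:.3f}  mean CV MSE={m:.3f}")

# Is the selected penalty on the boundary of the explored grid?
on_boundary = bool(
    np.isclose(lcv_toy.alpha_, [grid.min(), grid.max()]).any()
)
print("Selected alpha:", lcv_toy.alpha_, "| on grid boundary:", on_boundary)
```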

This can be used to run a new *pipeline*:

In [None]:
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])

model = Lasso(
  fit_intercept=False, 
  alpha = lcv.alpha_
)  

lasso_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)
])

lasso_optimal = lasso_pipeline.fit(X_train,y_train)

features_selec2 = extract_features_selected(lasso_optimal)

The selected variables are:

This corresponds to a model with 13 selected variables.

> **Tip**
>
> If the model does not appear parsimonious enough, revisit the variable definition phase to check whether a different scale would be more appropriate for some variables (e.g., a `log` transformation).