# Bayesian Sparse Feature Selection Demo

This notebook demonstrates the Bayesian Gaussian Mixture Feature Selection algorithm on artificial data.

**Features:**
- Modular prior (spike-and-slab, easily replaceable)
- Flexible mixture model
- Variational inference with PyTorch
- Interactive diagnostics with Plotly/Seaborn

> **Note:** Ensure that the `feature_selection` package and dependencies are installed.

In [1]:
# %pip install -q -e ..

In [2]:
# Add parent directory to Python path to find our package
import sys
import os
from pathlib import Path

parent_dir = str(Path(os.path.dirname(os.getcwd())))
if parent_dir not in sys.path:  # sys.path holds strings, so compare as str
    sys.path.insert(0, parent_dir)

In [3]:
from typing import Dict, List
from IPython.display import display, Markdown
import numpy as np
import pandas as pd
import torch
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go

import feature_selection.config as C
from feature_selection.utils import (
    solve_with_logistic_regression,
    solve_with_linear_regression,
    recover_solutions,
    display_features_overview,
    show_regression_results_for_solutions,
    show_algorithm_progress,
)
from feature_selection.inference import BayesianFeatureSelector
from feature_selection.visualizations import (
    plot_elbo,
    plot_mu,
    plot_alpha,
    show_correlation_matrix,
    show_correlations_with_response,
    show_label_histogram,
    show_features_in_components,
    compare_parameters,
)
from feature_selection.generate_artificial_dataset import (
    generate_artificial_dataset,
)

pio.renderers.default = "notebook_connected"  # Ensures plotly plots show in notebooks

# Import parameters and settings

# Generate Artificial Dataset

In [4]:
# Generate dataset
df, y, generating_solutions, parameters = generate_artificial_dataset(
    n_samples=C.NSAMPLES,
    n_features=C.NFEATURES,
    n_solutions=C.NSOLUTIONS,
    sparsity=C.SPARSITY,
    noise_data_std=C.NOISE_STD,
    binarize=C.BINARIZE,
    binary_response_ratio=C.BINARY_RESPONSE_RATIO,
    random_seed=C.RANDOM_SEED,
    save_to_csv=False,
    print_data_overview=True,
)

support_indices = parameters["support_indices"].sum()
true_support_features = [f"feature_{i}" for i in set(support_indices)]

### Artificial dataset

- **Number of samples:** 40

- **Number of features:** 80

- **Number of generating solutions:** 3

- **Sparsity (nonzero dimensions per component):** 4

- **Support features:** 12<br>['feature_10', 'feature_15', 'feature_35', 'feature_40', 'feature_42', 'feature_51', 'feature_53', 'feature_55', 'feature_6', 'feature_60', 'feature_61', 'feature_7']

- **Parameters of the mixture components:**

| | solution_index | support_indices | weights | full_weights | sparsity |
|---|---|---|---|---|---|
| 0 | 0 | [35, 6, 51, 60] | [9.414119910788814, 7.150920960645316, 8.58209... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.1509209606453... | 4 |
| 1 | 1 | [53, 15, 7, 42] | [-1.1492209573156602, -0.10472150375732858, -0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.8359565... | 4 |
| 2 | 2 | [55, 40, 61, 10] | [-0.5789199035899424, -0.35300806808508334, 0.... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | 4 |


- **Distribution of binary labels:**

binary_response
0    0.5
1    0.5
Name: proportion, dtype: float64

### Features' Correlation with Binary Response

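# Classical approach

Solve the classification problem using logistic regression with an L1 penalty. When `C.BINARIZE` is false, the response is continuous and L1-penalized linear regression is used instead.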
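
To illustrate why an L1 penalty acts as a feature selector, here is a minimal sketch of L1-penalized logistic regression fitted by proximal gradient descent. It is not the implementation behind `solve_with_logistic_regression`, which is not shown in this notebook; the function names, the toy data, and the values of `lam`, `lr`, and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink every entry towards zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_logistic(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Minimise mean log-loss + lam * ||w||_1 (no intercept) by proximal gradient."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        grad = X.T @ (p - y) / n_samples    # gradient of the mean log-loss
        w = soft_threshold(w - lr * grad, lr * lam)
    return w


# Toy data (illustrative only): labels depend on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

w = fit_l1_logistic(X, y)
print(np.round(w, 3))
```

The soft-thresholding step sets small coefficients exactly to zero, which is why the fitted model below keeps only a handful of the 80 features. It also returns a single sparse solution, even when several distinct feature subsets explain the labels.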

In [5]:
# Solve the full problem (all features) with l2 penalty
#
# if C.BINARIZE:
#     solve_with_logistic_regression(X=df, y=y, penalty="l2")
# else:
#     solve_with_linear_regression(X=df, y=y, penalty="l2")

In [6]:
# Solve the full problem (all features) with l1 penalty
if C.BINARIZE:
    solve_with_logistic_regression(X=df, y=y, penalty="l1")
else:
    solve_with_linear_regression(X=df, y=y, penalty="l1")

### Logistic regression with L1 penalty - Performance on training set

**Accuracy:** 0.975

**ROC-AUC:** 1.0

**Coefficients of the logistic regression model:**

feature_10    0.759737
feature_32    0.464510
feature_30    0.203492
feature_40   -0.076762
feature_3    -0.183644
feature_7    -1.238947
Name: Coefficient, dtype: float64

In [7]:
# Show regression results only for the generating solutions
show_regression_results_for_solutions(
    solutions=generating_solutions,
    df=df,
    y=y,
    penalty="l1",
)

## Features of **solution_0**

### Logistic regression with L1 penalty - Performance on training set

**Accuracy:** 1.0

**ROC-AUC:** 1.0

**Coefficients of the logistic regression model:**

feature_35    3.475026
feature_51    2.978935
feature_6     2.836392
feature_60    2.746666
Name: Coefficient, dtype: float64

------------------

## Features of **solution_1**

### Logistic regression with L1 penalty - Performance on training set

**Accuracy:** 0.925

**ROC-AUC:** 1.0

**Coefficients of the logistic regression model:**

feature_53   -0.623887
feature_7    -1.814898
Name: Coefficient, dtype: float64

------------------

## Features of **solution_2**

### Logistic regression with L1 penalty - Performance on training set

**Accuracy:** 1.0

**ROC-AUC:** 1.0

**Coefficients of the logistic regression model:**

feature_10    13.864407
feature_61     3.610115
feature_40    -2.468969
feature_55    -9.393171
Name: Coefficient, dtype: float64

------------------

In [8]:
# Solve for arbitrary combinations of features
combinations = [
    # ["3", "17", "30", "39"],
]

for combination in combinations:
    display(Markdown(f"## Regression model for feature combination {combination}"))
    features = [f"feature_{i}" for i in combination]
    display(Markdown(f"- Support features: {features}"))
    if C.BINARIZE:
        solve_with_logistic_regression(X=df[features], y=y, penalty="l1")
    else:
        solve_with_linear_regression(X=df[features], y=y, penalty="l1")
    display(Markdown("------------------"))

# Bayesian Feature Selector

In [None]:
selector = BayesianFeatureSelector(
    n_features=C.NFEATURES,
    n_components=C.N_COMPONENTS,
    X=df.values,
    y=y,
    prior=C.PRIOR_TYPE,
    sss_sparsity=C.SPARSITY,
    var_slab=C.VAR_SLAB,
    var_spike=C.VAR_SPIKE,
    weight_slab=C.WEIGHT_SLAB,
    weight_spike=C.WEIGHT_SPIKE,
    student_df=C.STUDENT_DF,
    student_scale=C.STUDENT_SCALE,
    lr=C.LEARNING_RATE,
    batch_size=C.BATCH_SIZE,
    n_iter=C.N_ITER,
)

history = selector.optimize(
    regularize=C.IS_REGULARIZED,
    lambda_jaccard=C.LAMBDA_JACCARD,
    verbose=True,
)

show_algorithm_progress(history)

#### Running Bayesian Feature Selector:

- desired number of solutions: 6

- desired sparsity: 4

- number of iterations: 4000

- regularization parameters:

 - Jaccard penalization: 100.0

 - threshold for support: 0.001
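
The Jaccard penalization and support threshold listed above suggest that components are discouraged from selecting overlapping feature sets. The package's actual penalty is not shown in this notebook, so the following is only a sketch of the underlying quantity: the mean pairwise Jaccard similarity between component supports, where a feature counts as part of a support if the absolute value of its mean exceeds the threshold. The function names and example values are hypothetical.

```python
from itertools import combinations


def support(mu, threshold=0.001):
    """Indices of features whose absolute mean exceeds the threshold."""
    return {j for j, m in enumerate(mu) if abs(m) > threshold}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (defined as 0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(mus, threshold=0.001):
    """Average Jaccard similarity over all pairs of component supports."""
    supports = [support(mu, threshold) for mu in mus]
    pairs = list(combinations(supports, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three toy components over four features (illustrative values).
mus = [
    [1.0, 0.5, 0.0, 0.0],  # support {0, 1}
    [0.9, 0.0, 0.3, 0.0],  # support {0, 2}
    [0.0, 0.0, 0.0, 2.0],  # support {3}
]
overlap = mean_pairwise_jaccard(mus)
print(overlap)
```

A hard threshold like this is not differentiable, so a gradient-based optimizer would need a smooth surrogate; this sketch only shows what the penalty is meant to measure.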

##### Algorithm settings:

 - prior name: StructuredSpikeAndSlabPrior

 - var_slab: 100.0

 - var_spike: 0.1

 - weight_slab: None

 - weight_spike: None
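
To show what `var_slab` and `var_spike` control, here is a minimal sketch of a plain two-component spike-and-slab prior on a single coefficient. It is not the `StructuredSpikeAndSlabPrior` used above, whose structure is not shown in this notebook. Both components are assumed to be zero-mean Gaussians, and the equal mixing weight `weight_slab=0.5` is an assumption, since the settings report the weights as `None`.

```python
import math


def log_normal_pdf(x, var):
    """Log-density of a zero-mean Gaussian with variance var."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def component_logs(x, var_slab=100.0, var_spike=0.1, weight_slab=0.5):
    """Weighted log-densities of the slab and spike components at x."""
    log_slab = math.log(weight_slab) + log_normal_pdf(x, var_slab)
    log_spike = math.log(1.0 - weight_slab) + log_normal_pdf(x, var_spike)
    return log_slab, log_spike


def spike_and_slab_log_prior(x, **kwargs):
    """Log of the mixture density, computed with the log-sum-exp trick."""
    log_slab, log_spike = component_logs(x, **kwargs)
    m = max(log_slab, log_spike)
    return m + math.log(math.exp(log_slab - m) + math.exp(log_spike - m))


def slab_probability(x, **kwargs):
    """Posterior probability that x was drawn from the slab component."""
    log_slab, _ = component_logs(x, **kwargs)
    return math.exp(log_slab - spike_and_slab_log_prior(x, **kwargs))


for x in (0.0, 0.5, 5.0):
    print(x, round(slab_probability(x), 4))
```

With a narrow spike and a wide slab, coefficients near zero are attributed to the spike and large coefficients to the slab, which is what pushes each component towards a sparse support.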

## Solution quality assessment

In [None]:
solutions, final_parameters, full_nonzero_solutions = recover_solutions(
    search_history=history,
    desired_sparsity=C.SPARSITY,
    min_mu_threshold=C.MIN_MU_THRESHOLD,
    verbose=True,
)
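
The arguments `desired_sparsity` and `min_mu_threshold` suggest that each solution is read off from a component's mean vector by keeping its largest entries. The body of `recover_solutions` is not shown in this notebook, so the following is a hypothetical sketch of one such rule: keep at most `sparsity` features, ranked by absolute mean, and drop any whose absolute mean falls below the threshold. The function name and the threshold values are assumptions.

```python
def select_support(mu, sparsity=4, min_mu_threshold=0.1):
    """Return up to `sparsity` feature names with the largest |mu| above the threshold."""
    ranked = sorted(range(len(mu)), key=lambda j: abs(mu[j]), reverse=True)
    return [
        f"feature_{j}"
        for j in ranked[:sparsity]
        if abs(mu[j]) >= min_mu_threshold
    ]


# Toy mean vector for one component (illustrative values).
mu = [0.0, 2.5, -0.05, 1.2, -3.1, 0.4, 0.02]

loose = select_support(mu, sparsity=4, min_mu_threshold=0.1)
strict = select_support(mu, sparsity=4, min_mu_threshold=0.5)
print(loose)
print(strict)
```

Raising the threshold can return fewer features than the desired sparsity, which is a useful signal that a component has not converged to a full solution.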

## Overview of discovered features

In [None]:
# Unique features occurring in any component
features_found = set().union(*solutions.values())


display_features_overview(
    features_found=features_found,
    true_support_features=true_support_features,
    n_total_features=len(df.columns),
)
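
One way to summarize such an overview numerically is with precision and recall of the discovered features against the true support. The sketch below is not what `display_features_overview` does internally, which is not shown here. The function name is hypothetical and the "found" set is invented for illustration; it is not a result of this notebook.

```python
def support_recovery_metrics(found, true_support):
    """Precision and recall of a discovered feature set against the true support."""
    found, true_support = set(found), set(true_support)
    hits = found & true_support
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(true_support) if true_support else 0.0
    return {
        "hits": sorted(hits),
        "false_positives": sorted(found - true_support),
        "missed": sorted(true_support - found),
        "precision": precision,
        "recall": recall,
    }


# Support of solution_0 from the dataset overview, and a made-up discovered set.
true_support = ["feature_35", "feature_6", "feature_51", "feature_60"]
found = ["feature_35", "feature_6", "feature_3"]

metrics = support_recovery_metrics(found, true_support)
print(metrics)
```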

## Comparison of the solutions

In [None]:
display(pd.DataFrame.from_dict(solutions, orient="index").T)

In [None]:
features_to_show = list(set(true_support_features).union(set(features_found)))
show_features_in_components(solutions, features_to_show=features_to_show)

show_correlation_matrix(df[sorted(features_to_show)], width=600, height=600)

## Mixture means vs true means

In [None]:
# Show final mixture means and weights
compare_parameters(parameters, final_parameters["final mu"])

display(Markdown("### Final mixture weights (alpha):"))
for i, alpha in enumerate(final_parameters["final alpha"]):
    display(Markdown(f"- **Component {i}:** {alpha:.3f}"))