# Bayesian Sparse Feature Selection Demo

This notebook demonstrates the Bayesian Gaussian Mixture Feature Selection algorithm on artificial data.

**Features:**
- Modular prior (spike-and-slab, easily replaceable)
- Flexible mixture model
- Variational inference with PyTorch
- Interactive diagnostics with Plotly/Seaborn

> **Note:** Ensure that the `gemss` package and dependencies are installed.
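
To make the spike-and-slab idea concrete, here is a minimal sketch of a two-Gaussian spike-and-slab log-density for a single weight. It is not the `gemss` implementation: the function names and the mixing weights (0.9 and 0.1) are assumptions for illustration, and only the variances 0.1 and 100 match the settings printed later in this notebook.

```python
import math


def log_normal_pdf(w, var):
    """Log density of a zero-mean Gaussian with variance `var`, evaluated at `w`."""
    return -0.5 * (math.log(2.0 * math.pi * var) + w * w / var)


def spike_and_slab_terms(w, var_spike=0.1, var_slab=100.0,
                         weight_spike=0.9, weight_slab=0.1):
    """Return the weighted log densities of the spike and the slab at `w`.

    The mixing weights are hypothetical defaults, not values taken from gemss.
    """
    log_spike = math.log(weight_spike) + log_normal_pdf(w, var_spike)
    log_slab = math.log(weight_slab) + log_normal_pdf(w, var_slab)
    return log_spike, log_slab


def log_spike_and_slab(w, **kwargs):
    """Log density of the mixture, computed with a stable log-sum-exp."""
    log_spike, log_slab = spike_and_slab_terms(w, **kwargs)
    m = max(log_spike, log_slab)
    return m + math.log(math.exp(log_spike - m) + math.exp(log_slab - m))


def slab_responsibility(w, **kwargs):
    """Posterior probability that `w` was drawn from the slab rather than the spike."""
    _, log_slab = spike_and_slab_terms(w, **kwargs)
    return math.exp(log_slab - log_spike_and_slab(w, **kwargs))


if __name__ == "__main__":
    for w in (0.0, 0.5, 5.0):
        print(w, round(slab_responsibility(w), 4))
```

Weights near zero are attributed almost entirely to the narrow spike, while large weights are attributed to the wide slab. This is how such a prior pushes most coefficients toward zero without shrinking the few that matter.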

In [1]:
# %pip install -q -e ..

In [2]:
# Add parent directory to Python path to find our package
import sys
import os
from pathlib import Path

# sys.path holds strings, so compare and insert the path as a string
parent_dir = str(Path(os.path.dirname(os.getcwd())))
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

In [3]:
import numpy as np
from IPython.display import display, Markdown
import pandas as pd
import plotly.io as pio

import gemss.config as C
from gemss.generate_artificial_dataset import (
    generate_artificial_dataset,
)
from gemss.feature_selection.inference import BayesianFeatureSelector
from gemss.diagnostics.visualizations import (
    show_correlation_matrix,
    show_features_in_components,
    compare_parameters,
)
from gemss.diagnostics.performance_tests import run_performance_diagnostics
from gemss.diagnostics.recommendations import display_recommendations
from gemss.diagnostics.result_postprocessing import (
    recover_solutions,
    show_algorithm_progress,
    show_regression_results_for_solutions,
    display_features_overview,
    get_long_solutions_df,
)
from gemss.diagnostics.simple_regressions import (
    solve_with_logistic_regression,
    solve_with_linear_regression,
)
from gemss.diagnostics.visualizations import show_final_alphas

pio.renderers.default = "notebook_connected"  # Ensures plotly plots show in notebooks

# Generate Artificial Dataset

In [4]:
# Generate dataset
df, y, generating_solutions, parameters = generate_artificial_dataset(
    n_samples=C.N_SAMPLES,
    n_features=C.N_FEATURES,
    n_solutions=C.N_GENERATING_SOLUTIONS,
    sparsity=C.SPARSITY,
    noise_data_std=C.NOISE_STD,
    binarize=C.BINARIZE,
    binary_response_ratio=C.BINARY_RESPONSE_RATIO,
    random_seed=C.DATASET_SEED,
    save_to_csv=False,
    print_data_overview=True,
)

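# Summing a column of lists concatenates them into one flat list of indices
support_indices = parameters["support_indices"].sum()
# Deduplicate, since a feature can belong to several generating solutions
true_support_features = [f"feature_{i}" for i in set(support_indices)]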

### Artificial dataset

- **Number of samples:** 30

- **Number of features:** 60

- **Number of generating solutions:** 3

- **Sparsity (nonzero dimensions per component):** 2

- **Support features:** 6<br>['feature_25', 'feature_41', 'feature_46', 'feature_5', 'feature_5', 'feature_59']

- **Parameters of the mixture components:**

| | solution_index | support_indices | weights | full_weights | sparsity |
|---|---|---|---|---|---|
| 0 | 0 | [5, 46] | [9.804978813094047, 8.089117615922824] | [0.0, 0.0, 0.0, 0.0, 0.0, 9.804978813094047, 0... | 2 |
| 1 | 1 | [25, 59] | [-1.9535794192164646, 0.24954462247334064] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | 2 |
| 2 | 2 | [41, 5] | [75.26801244822697, -45.97609622291779] | [0.0, 0.0, 0.0, 0.0, 0.0, -45.97609622291779, ... | 2 |


- **Distribution of binary labels:**

binary_response
0    0.5
1    0.5
Name: proportion, dtype: float64

### Features' Correlation with Binary Response

- **Correlation matrix of features:**
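
The dataset overview reports six support features but lists `feature_5` twice, because index 5 belongs to both solution 0 and solution 2. The short sketch below reproduces that bookkeeping with plain Python lists, using the `support_indices` values printed in the parameter table. The variable names are illustrative only.

```python
from itertools import chain

# Support indices of the three generating solutions, copied from the table above
support_per_solution = [[5, 46], [25, 59], [41, 5]]

# Flattening keeps duplicates: one entry per (solution, feature) pair
flat_indices = list(chain.from_iterable(support_per_solution))

# Deduplicating gives the distinct features that carry signal
unique_indices = sorted(set(flat_indices))
unique_names = [f"feature_{i}" for i in unique_indices]

print(len(flat_indices), "support entries,", len(unique_indices), "distinct features")
print(unique_names)
```

Any recovery metric computed later should be read against the five distinct features, not against the six entries shown in the overview.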

# Classical approach
Solve the problem using logistic/linear regression with regularization.

In [5]:
# Solve the full problem (all features) with l2 penalty
#
# if C.BINARIZE:
#     solve_with_logistic_regression(X=df, y=y, penalty="l2")
# else:
#     solve_with_linear_regression(X=df, y=y, penalty="l2")

In [6]:
# Solve the full problem (all features) with l1 penalty
if C.BINARIZE:
    solve_with_logistic_regression(X=df, y=y, penalty="l1")
else:
    solve_with_linear_regression(X=df, y=y, penalty="l1")

### Logistic regression with L1 penalty - performance on training set

**Accuracy:** 0.967

**Balanced Accuracy:** 0.967

**ROC-AUC:** 1.0

**Balanced F1 Score:** 0.967

**Recall on class 0:** 1.0

**Recall on class 1:** 0.933

**Coefficients of the logistic regression model:**

feature_10    0.066624
feature_25   -2.114147
Name: Coefficient, dtype: float64
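
The coefficient listing above shows only two features, which is the kind of sparsity an L1 penalty tends to produce. One way to see why is the soft-thresholding operator, the proximal step used by many L1 solvers such as coordinate descent and proximal gradient. The sketch below is a generic illustration: the coefficient values and feature names are made up, and nothing here is claimed about the solver behind `solve_with_logistic_regression`.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; values inside the band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


# Hypothetical unpenalized coefficients, for illustration only
raw_coefficients = {"feature_a": 0.04, "feature_b": -2.3, "feature_c": 0.25}

# One proximal step with threshold 0.2 (step size times L1 strength)
shrunk = {name: soft_threshold(v, 0.2) for name, v in raw_coefficients.items()}
nonzero = {name: v for name, v in shrunk.items() if v != 0.0}

print(shrunk)
print("features kept:", sorted(nonzero))
```

Small coefficients are set to exactly zero rather than merely reduced, which is why an L1 fit reports a short list of features, unlike an L2 fit, which shrinks every coefficient but rarely zeroes any.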

In [7]:
# Show regression results only for the generating solutions
# show_regression_results_for_solutions(
#     solutions=generating_solutions,
#     df=df,
#     y=y,
#     penalty="l1",
# )

# Bayesian Feature Selector

In [8]:
selector = BayesianFeatureSelector(
    n_features=C.N_FEATURES,
    n_components=C.N_CANDIDATE_SOLUTIONS,
    X=df.values,
    y=y,
    prior=C.PRIOR_TYPE,
    sss_sparsity=C.PRIOR_SPARSITY,
    sample_more_priors_coeff=C.SAMPLE_MORE_PRIORS_COEFF,
    var_slab=C.VAR_SLAB,
    var_spike=C.VAR_SPIKE,
    weight_slab=C.WEIGHT_SLAB,
    weight_spike=C.WEIGHT_SPIKE,
    student_df=C.STUDENT_DF,
    student_scale=C.STUDENT_SCALE,
    lr=C.LEARNING_RATE,
    batch_size=C.BATCH_SIZE,
    n_iter=C.N_ITER,
)

history = selector.optimize(
    regularize=C.IS_REGULARIZED,
    lambda_jaccard=C.LAMBDA_JACCARD,
    verbose=True,
)

show_algorithm_progress(history)

#### Running Bayesian Feature Selector:

- desired number of solutions: 6

- number of iterations: 4000

- regularization parameters:

  - Jaccard penalization: 100

  - threshold for support: 0.001

##### Algorithm settings:

 - prior name: StructuredSpikeAndSlabPrior

 - prior_sparsity: 2

 - var_slab: 100

 - var_spike: 0.1

KeyboardInterrupt: 
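
The settings above mention a Jaccard penalization of 100 and a support threshold of 0.001. As a rough sketch of what such a term measures, the code below computes the mean pairwise Jaccard similarity between the supports of the component means and scales it by a penalty weight. This is an assumption about the idea, not the `gemss` code: the function names are hypothetical, and a real optimizer would probably need a differentiable surrogate in place of the hard threshold used here.

```python
from itertools import combinations


def support(mu, threshold=0.001):
    """Indices whose absolute mean exceeds the support threshold."""
    return {i for i, m in enumerate(mu) if abs(m) > threshold}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two index sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_penalty(mus, lam=100.0, threshold=0.001):
    """Penalty weight times the mean pairwise support overlap between components."""
    supports = [support(mu, threshold) for mu in mus]
    pairs = list(combinations(supports, 2))
    if not pairs:
        return 0.0
    return lam * sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Three toy components over four features; the first two share feature 1
    toy_mus = [
        [1.0, 0.5, 0.0, 0.0],
        [0.0, 0.5, 2.0, 0.0],
        [0.0, 0.0, 0.0, 3.0],
    ]
    print(jaccard_penalty(toy_mus))
```

Under this reading, the penalty is zero when all components select disjoint feature sets and grows as components collapse onto the same features, which encourages the mixture to return distinct solutions.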

## Solution quality assessment

In [None]:
solutions, final_parameters, full_nonzero_solutions = recover_solutions(
    search_history=history,
    desired_sparsity=C.DESIRED_SPARSITY,
    min_mu_threshold=C.MIN_MU_THRESHOLD,
    verbose=False,
)
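
The call above takes a desired sparsity and a minimum threshold on the component means. A plausible reading, sketched below, is that each solution keeps the features with the largest absolute mean, provided they clear the threshold. The helper `top_features` and its toy inputs are hypothetical; the actual logic of `recover_solutions` may differ, for example by using the optimization history rather than only the final values.

```python
def top_features(mu, names, desired_sparsity, min_mu_threshold):
    """Return up to `desired_sparsity` feature names ranked by |mu|, dropping small means."""
    ranked = sorted(zip(names, mu), key=lambda pair: abs(pair[1]), reverse=True)
    kept = [(name, m) for name, m in ranked if abs(m) >= min_mu_threshold]
    return [name for name, _ in kept[:desired_sparsity]]


if __name__ == "__main__":
    # Toy final means of one mixture component (made-up values)
    toy_mu = [0.0, 9.8, 0.02, -8.1, 0.3]
    toy_names = [f"feature_{i}" for i in range(len(toy_mu))]

    # Sparse solution: at most two features
    print(top_features(toy_mu, toy_names, desired_sparsity=2, min_mu_threshold=0.25))
    # "Full" solution: every feature above the threshold
    print(top_features(toy_mu, toy_names, desired_sparsity=len(toy_mu), min_mu_threshold=0.25))
```

The two calls mirror the distinction between `solutions` (truncated to the desired sparsity) and `full_nonzero_solutions` (everything above the threshold), again under the assumption stated above.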

In [None]:
show_final_alphas(
    history,
    show_bar_plot=False,
    show_pie_chart=True,
)

## Overview of discovered features

In [None]:
features_found = set().union(
    *solutions.values()
)  # unique features occurring in any component

display_features_overview(
    features_found=features_found,
    true_support_features=true_support_features,
    n_total_features=len(df.columns),
)
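
Comparing discovered features with the true support is a matter of set arithmetic. The sketch below shows one way to summarize it with hits, misses, extras, precision, and recall. The helper `compare_feature_sets` and the `found` set are hypothetical; only the true support is taken from the dataset overview, and the output format of `display_features_overview` is not reproduced here.

```python
def compare_feature_sets(found, true_support):
    """Summarize the overlap between discovered features and the true support."""
    found, true_support = set(found), set(true_support)
    hits = found & true_support
    return {
        "correct": sorted(hits),
        "missed": sorted(true_support - found),
        "extra": sorted(found - true_support),
        "precision": len(hits) / len(found) if found else 0.0,
        "recall": len(hits) / len(true_support) if true_support else 0.0,
    }


if __name__ == "__main__":
    # Distinct support features of the artificial dataset shown earlier
    true_support = {"feature_5", "feature_25", "feature_41", "feature_46", "feature_59"}
    # Hypothetical result of a feature-selection run
    found = {"feature_5", "feature_25", "feature_10", "feature_46"}

    for key, value in compare_feature_sets(found, true_support).items():
        print(f"{key}: {value}")
```

In this toy case, three of the four discovered features are correct (precision 0.75) and three of the five true features are recovered (recall 0.6).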

## Comparison of the solutions

In [None]:
display(pd.DataFrame.from_dict(solutions, orient="index").T)

In [None]:
features_to_show = list(set(true_support_features).union(set(features_found)))
show_features_in_components(solutions, features_to_show=features_to_show)

show_correlation_matrix(df[sorted(features_to_show)], width=600, height=600)

## Full long solutions

In [None]:
df_full_solutions = get_long_solutions_df(full_nonzero_solutions)
display(df_full_solutions)

## Mixture means vs. true means

In [None]:
# Show final mixture means and weights
compare_parameters(parameters, final_parameters["final mu"])

display(Markdown("### Final mixture weights (alpha):"))
for i, alpha in enumerate(final_parameters["final alpha"]):
    display(Markdown(f"- **Component {i}:** {alpha:.3f}"))

# Compute regression using features in solutions

In [None]:
# Fit an L1-penalized regression on the features of each recovered solution
show_regression_results_for_solutions(
    solutions=solutions,
    df=df,
    y=y,
    penalty="l1",
    verbose=False,
)