# Bayesian Sparse Feature Selection on an Unknown Dataset

This notebook demonstrates how to apply the Bayesian Sparse Feature Selector to an arbitrary dataset with unknown ground truth. It mirrors the workflow of demo.ipynb, but assumes no knowledge of the true supports. Feature names are anonymized during analysis and mapped back for interpretability.


In [None]:
# %pip install -e ..

In [1]:
from IPython.display import display, Markdown
import pandas as pd
import numpy as np
from plotly import io as pio
from sklearn.preprocessing import StandardScaler

import gemss.config as C
from gemss.feature_selection.inference import BayesianFeatureSelector
from gemss.diagnostics.visualizations import show_label_histogram, show_final_alphas
from gemss.diagnostics.outliers import show_outlier_features_by_component
from gemss.diagnostics.result_postprocessing import (
    recover_solutions,
    show_algorithm_progress,
    show_unique_features,
    show_features_in_solutions,
)
from gemss.diagnostics.simple_regressions import show_regression_results_for_solutions
from gemss.diagnostics.performance_tests import run_performance_diagnostics
from gemss.diagnostics.recommendations import display_recommendations
from gemss.utils import show_solution_summary

pio.renderers.default = "notebook_connected"  # Ensures plotly plots show in notebooks

from gemss.data_handling.data_processing import (
    load_data,
    preprocess_features,
    get_feature_name_mapping,
)

# Your setup

In this section, define the specifics for your data and parameters for the feature selection algorithm.

## Control verbosity and output

In [None]:
# Show plots of algorithm progress over iterations
show_search_history = True

# Detect outlier features (in each component) by statistical means
# Ideally, outliers == solutions
outlier_analysis = True

# Choose overall verbosity for various outputs
verbose = True

# Whether to run regressions for the selected solutions and show results
run_regressions_for_solutions = True

# Whether to run performance diagnostics
# Diagnostics are required for showing recommendations
# but can be run quietly, if desired
run_diagnostics = True
verbose_diagnostics = True

# Whether to show recommendations based on diagnostics
# requires `run_diagnostics = True` too
show_recommendations = True

## Set parameters to handle your dataset

In [None]:
# dataset parameters
# the CSV file should be in the ../data/ directory
# the index and label column names must be included in the dataset
csv_dataset_name = "shelflife_data_all_preprocessed.csv"
index_column_name = "sample ID"
label_column_name = "label"

# NA handling options
# Options are:
# - "response": drop rows with NA in the response column only (default).
# - "all": drop rows with NA in any column.
# - "none": do not drop any rows.
dropna_columns = "response"

# Apply standard scaling to features
apply_standard_scaling = False
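
The three NA-handling options can be pictured with a small pandas sketch. The helper `drop_na_rows` and the toy data are hypothetical and only illustrate the semantics described in the comments above; the actual behavior of `preprocess_features` may differ in detail.

```python
import numpy as np
import pandas as pd


def drop_na_rows(features: pd.DataFrame, response: pd.Series, mode: str = "response"):
    """Hypothetical helper mirroring the three NA-handling options."""
    if mode == "response":
        # Keep rows whose response is present, regardless of feature NAs
        keep = response.notna()
    elif mode == "all":
        # Keep rows with no NA in the response or in any feature
        keep = response.notna() & features.notna().all(axis=1)
    elif mode == "none":
        # Keep every row
        keep = pd.Series(True, index=features.index)
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return features.loc[keep], response.loc[keep]


# Toy data: sample s2 has a missing feature, sample s3 has a missing label
features = pd.DataFrame(
    {"f1": [1.0, np.nan, 3.0, 4.0], "f2": [0.5, 0.1, 0.2, 0.3]},
    index=["s1", "s2", "s3", "s4"],
)
response = pd.Series([0, 1, np.nan, 1], index=features.index, name="label")

kept_rows = {}
for mode in ("response", "all", "none"):
    X_kept, y_kept = drop_na_rows(features, response, mode)
    kept_rows[mode] = list(X_kept.index)
    print(mode, kept_rows[mode])
```

With `"response"`, rows that still contain missing feature values remain in the data, so any downstream step must be able to handle them.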

### Load your data

In [None]:
if verbose:
    display(Markdown("#### Loading your data"))

df, response = load_data(
    csv_dataset_name,
    index_column_name,
    label_column_name,
)
X, y = preprocess_features(
    df,
    response,
    dropna=dropna_columns,
    apply_standard_scaling=apply_standard_scaling,
    verbose=verbose,
)
feature_to_name = get_feature_name_mapping(df)

if verbose:
    display(Markdown("Your data:"))
    display(Markdown(f"- Number of labels: **{len(np.unique(y))}**"))
    display(Markdown(f"- Number of samples: **{X.shape[0]}**"))
    display(Markdown(f"- Number of features: **{X.shape[1]}**"))
    display(Markdown(f"{list(feature_to_name.values())}"))
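
Mapping anonymized feature names back to the original column names can be sketched as a plain dictionary lookup. The `feature_<i>` key format and the column names below are assumptions made for illustration; the mapping returned by `get_feature_name_mapping` may use different keys.

```python
# Made-up original column names, for illustration only
original_columns = ["pH", "storage temperature", "water activity"]

# Anonymized key -> original name; the "feature_<i>" key format is an assumption
feature_to_name_demo = {
    f"feature_{i}": name for i, name in enumerate(original_columns)
}

# Pretend the selector reported one solution in anonymized form
anonymized_solution = ["feature_2", "feature_0"]

# Translate back to readable names for interpretation
readable_solution = [feature_to_name_demo[key] for key in anonymized_solution]
print(readable_solution)
```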

## Set parameters for the feature selection algorithm

- First, default constant values are loaded by the config module.
- Then, override the settings of selected parameters as needed.
- On CPU, the algorithm usually takes a minute or more per 1000 training iterations with the default 'sss' prior. The 'student' prior is faster.

In [None]:
# DO NOT CHANGE THIS CELL unless you know what you're doing

# First load the default constants defined by the config module (some defaults will not make sense for your data)
constants = C.as_dict()

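# Override the settings for this specific dataset
# Use the preprocessed matrix X, since it is what the selector receives
# (preprocessing may have dropped rows)
constants["N_SAMPLES"] = X.shape[0]
constants["N_FEATURES"] = X.shape[1]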

In [None]:
# SET YOUR PARAMETERS HERE

# Algorithm settings
constants["PRIOR_TYPE"] = "sss"

### PRIOR_SPARSITY = number of supporting dimensions for the 'sss' prior
### used only with the 'sss' prior
constants["PRIOR_SPARSITY"] = constants["DESIRED_SPARSITY"]

## Prior hyperparameters
### VAR_SPIKE and VAR_SLAB are only used with the 'ss' and 'sss' priors
### ...VAR_SPIKE: parameter with the most influence on solution quality
### ...smaller VAR_SPIKE => more sparsity, i.e. fewer nonzero features
### ...increase VAR_SPIKE when all features converge to 0, typically in a uniform manner
### ...decrease VAR_SPIKE when there are too many nonzero features at the end of the run
constants["VAR_SPIKE"] = 0.0005
constants["VAR_SLAB"] = 100.0

### WEIGHT_SLAB and WEIGHT_SPIKE are only used with 'ss' prior
constants["WEIGHT_SLAB"] = 0.9  # not used with 'sss' prior
constants["WEIGHT_SPIKE"] = 0.1  # not used with 'sss' prior
constants["STUDENT_DF"] = 1
constants["STUDENT_SCALE"] = 1.0

## SGD optimization settings
### N_ITER = number of training iterations (runtime roughly a minute or more per 1000 iterations)
### increase/decrease depending on the convergence behavior of both the ELBO and the mus
constants["N_ITER"] = 3000

## Regularization settings to make solutions more distinct
### IS_REGULARIZED = whether to use penalization based on average Jaccard similarity among solutions
### ...preferably False, or True with a small LAMBDA_JACCARD, for now
constants["IS_REGULARIZED"] = False
### LAMBDA_JACCARD: larger values => more differentiated feature sets
### ...it is safer to use smaller values and increase only if needed
### ...if the ELBO converges to a value that is too large in absolute terms, your lambda is probably too large
constants["LAMBDA_JACCARD"] = 500.0

# Properties of the desired solutions
# N_CANDIDATE_SOLUTIONS = number of candidate solutions
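#                       = number of components of the Gaussian mixture that approximates the posterior
constants["N_CANDIDATE_SOLUTIONS"] = 8
constants["DESIRED_SPARSITY"] = 5  # expected number of features per solution
# Re-sync PRIOR_SPARSITY: it was assigned earlier in this cell, before DESIRED_SPARSITY was overridden
constants["PRIOR_SPARSITY"] = constants["DESIRED_SPARSITY"]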
constants["MIN_MU_THRESHOLD"] = 0.25  # minimum |μ| to consider a feature nonzero
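
To build intuition for `VAR_SPIKE` and `VAR_SLAB`, the sketch below evaluates a textbook two-component spike-and-slab mixture of zero-mean Gaussians. This is an assumption made for illustration: the exact prior used by the selector (especially the 'sss' variant) is not shown in this notebook and may differ. The function `slab_probability` is a hypothetical helper.

```python
import math


def normal_pdf(x: float, variance: float) -> float:
    """Density of a zero-mean Gaussian with the given variance."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def slab_probability(beta, var_spike, var_slab, weight_slab=0.9):
    """Probability that a coefficient of size beta is attributed to the slab
    component of the mixture (textbook form, assumed here for illustration)."""
    slab = weight_slab * normal_pdf(beta, var_slab)
    spike = (1.0 - weight_slab) * normal_pdf(beta, var_spike)
    return slab / (slab + spike)


# Compare a narrow spike with a wider one at a few coefficient magnitudes
for var_spike in (0.0005, 0.05):
    probs = [
        round(slab_probability(b, var_spike, var_slab=100.0), 3)
        for b in (0.0, 0.1, 0.25, 0.5)
    ]
    print(f"VAR_SPIKE={var_spike}: {probs}")
```

In this simplified picture, a narrower spike pins spike-assigned coefficients closer to zero, while a wider spike tolerates larger coefficients without attributing them to the slab.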

# Run the feature selector on your data

There is no need to touch any code below this cell.

In [None]:
selector = BayesianFeatureSelector(
    n_features=constants["N_FEATURES"],
    n_components=constants["N_CANDIDATE_SOLUTIONS"],
    X=X,
    y=y,
    prior=constants["PRIOR_TYPE"],
    sss_sparsity=constants["PRIOR_SPARSITY"],
    var_slab=constants["VAR_SLAB"],
    var_spike=constants["VAR_SPIKE"],
    weight_slab=constants["WEIGHT_SLAB"],
    weight_spike=constants["WEIGHT_SPIKE"],
    student_df=constants["STUDENT_DF"],
    student_scale=constants["STUDENT_SCALE"],
    lr=constants["LEARNING_RATE"],
    batch_size=constants["BATCH_SIZE"],
    n_iter=constants["N_ITER"],
)

history = selector.optimize(
    regularize=constants["IS_REGULARIZED"],
    lambda_jaccard=constants["LAMBDA_JACCARD"],
    verbose=verbose,
)

In [None]:
if show_search_history:
    show_algorithm_progress(
        history,
        original_feature_names_mapping=feature_to_name,
    )

    show_final_alphas(
        history,
        show_bar_plot=False,
        show_pie_chart=True,
    )

## Outlier Analysis

The outlier analysis helps identify features with unusually high or low importance values in each component.

Ideally, the detected outliers are identical to the final solutions.
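
One common way to flag such outliers is a z-score style rule: a feature is an outlier when its value lies more than a chosen number of scale units from the center of the component's values. The sketch below is an assumption about how a threshold coefficient and a mean/median switch could work; `find_outlier_features` is a hypothetical helper and not the implementation behind `show_outlier_features_by_component`.

```python
import numpy as np


def find_outlier_features(values, threshold_coeff=3.0, use_median=False):
    """Return indices of values lying more than threshold_coeff scale units
    from the center (hypothetical rule, for illustration)."""
    values = np.asarray(values, dtype=float)
    if use_median:
        center = np.median(values)
        # 1.4826 * MAD estimates the standard deviation for normal data
        scale = 1.4826 * np.median(np.abs(values - center))
    else:
        center = values.mean()
        scale = values.std()
    if scale == 0:
        # All values identical: nothing can be an outlier
        return np.array([], dtype=int)
    scores = np.abs(values - center) / scale
    return np.flatnonzero(scores > threshold_coeff)


# Toy component: 19 near-zero importances and one clearly large one (index 19)
component_mus = np.array([0.01, -0.01] * 9 + [0.01, 2.0])
for coeff in (2.5, 3.0, 3.5):
    print(coeff, find_outlier_features(component_mus, threshold_coeff=coeff))
```

Trying several thresholds, as the next cell does, shows how stable the set of flagged features is.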

In [None]:
if outlier_analysis:
    for outlier_threshold_coeff in [2.5, 3.0, 3.5]:
        show_outlier_features_by_component(
            history=history,
            use_median=False,
            outlier_threshold_coeff=outlier_threshold_coeff,
            use_markdown=True,
        )

# Show the results

In [None]:
solutions, final_parameters, full_nonzero_solutions = recover_solutions(
    search_history=history,
    desired_sparsity=constants["DESIRED_SPARSITY"],
    min_mu_threshold=constants["MIN_MU_THRESHOLD"],
    verbose=verbose,
    original_feature_names_mapping=feature_to_name,
)

## Overview of full (long) solutions

The 'long' solutions are the actual solutions (full sets of features) found by the algorithm. They may contain more features than the desired sparsity.

In [None]:
show_solution_summary(
    solution_data=full_nonzero_solutions,
    title="Full solutions found by the feature selector, ordered by importance",
    value_column="Feature",
)

## Final (short) candidate solutions

The long solutions are further shortened to the required number of features based on the final mu values.
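
The relation between long and short solutions can be sketched as thresholding followed by a top-k cut. The helper `split_solutions` and the toy values are hypothetical; `recover_solutions` may rank or filter features differently.

```python
def split_solutions(mus, min_mu_threshold, desired_sparsity):
    """Return (long, short) feature lists ordered by decreasing |mu|.

    Hypothetical helper: 'long' keeps every feature with |mu| at or above
    the threshold, 'short' keeps only the first desired_sparsity of those.
    """
    nonzero = {name: mu for name, mu in mus.items() if abs(mu) >= min_mu_threshold}
    long_solution = sorted(nonzero, key=lambda name: abs(nonzero[name]), reverse=True)
    short_solution = long_solution[:desired_sparsity]
    return long_solution, short_solution


# Toy final mu values for a single component
final_mus = {"f0": 0.1, "f1": -0.9, "f2": 0.5, "f3": 0.3, "f4": -0.26, "f5": 0.0}
long_solution, short_solution = split_solutions(
    final_mus, min_mu_threshold=0.25, desired_sparsity=2
)
print(long_solution)
print(short_solution)
```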

In [None]:
show_features_in_solutions(
    solutions=solutions,
    history=history,
    constants=constants,
    use_markdown=True,
)

show_unique_features(
    solutions=solutions,
    use_markdown=True,
)
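
To gauge how distinct the candidate solutions are, one can compute the average pairwise Jaccard similarity of their feature sets, which is the quantity the `IS_REGULARIZED` option is described as penalizing. The sketch uses the plain set-based definition, which is not necessarily the form used inside the optimizer; the structure of `solutions_demo` (component name mapped to a feature list) is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two feature collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_jaccard(solutions):
    """Mean Jaccard similarity over all pairs of solutions."""
    pairs = list(combinations(solutions.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy solutions: two overlapping feature sets and one disjoint set
solutions_demo = {
    "component_0": ["f1", "f2", "f3"],
    "component_1": ["f1", "f2", "f4"],
    "component_2": ["f5", "f6", "f7"],
}
avg_similarity = average_jaccard(solutions_demo)
print(avg_similarity)
```

A value near 0 means the solutions share few features, and a value near 1 means they are almost identical.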

In [None]:
if run_regressions_for_solutions:
    show_regression_results_for_solutions(
        solutions,
        df=df,
        y=y,
        penalty="l1",
        verbose=True,
    )

# Performance diagnostics and recommendations

Let us assess the feature selector's progress history to evaluate the reliability of the results. Based on the diagnostics, hyperparameter tuning might be recommended.

In [None]:
if run_diagnostics:
    diagnostics = run_performance_diagnostics(
        history,
        desired_sparsity=constants["DESIRED_SPARSITY"],
        verbose=verbose_diagnostics,
    )

In [None]:
if run_diagnostics and show_recommendations:
    display_recommendations(diagnostics=diagnostics, constants=constants)