# GEMSS Demo

This notebook demonstrates the GEMSS algorithm on artificially generated data.

**Features:**
- Modular prior (structured spike-and-slab by default)
- Flexible mixture model
- Variational inference with PyTorch
- Performance diagnostics with Plotly
- Reliability and tuning recommendations based on performance diagnostics

> **Note:** Ensure that the `gemss` package and dependencies are installed.

In [None]:
# %pip install -q -e ..

In [None]:
# Add the parent directory to the Python path so the local package can be imported
import sys
from pathlib import Path

# sys.path holds strings, so compare and insert the string form of the path
parent_dir = str(Path.cwd().parent)
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

In [None]:
import numpy as np
from IPython.display import display, Markdown
import pandas as pd
import plotly.io as pio

import gemss.config as C
from gemss.data_handling.generate_artificial_dataset import (
    generate_artificial_dataset,
)
from gemss.feature_selection.inference import BayesianFeatureSelector
from gemss.diagnostics.visualizations import (
    show_correlation_matrix,
    show_features_in_components,
    show_final_alphas,
)
from gemss.diagnostics.performance_tests import run_performance_diagnostics
from gemss.diagnostics.recommendations import display_recommendations
from gemss.diagnostics.result_postprocessing import (
    compare_true_and_found_features,
    get_unique_features,
    recover_solutions,
    show_final_parameter_comparison,
    show_algorithm_progress,
    show_features_in_solutions,
)
from gemss.utils import show_solution_summary
from gemss.diagnostics.simple_regressions import (
    solve_with_logistic_regression,
    solve_with_linear_regression,
    show_regression_results_for_solutions,
)
from gemss.diagnostics.outliers import (
    get_outlier_summary_from_history,
    show_outlier_summary,
    show_outlier_features_by_component,
)

pio.renderers.default = "notebook_connected"  # Ensures plotly plots show in notebooks

# Generate artificial dataset

In [None]:
# Generate dataset
df, y, generating_solutions, parameters = generate_artificial_dataset(
    n_samples=C.N_SAMPLES,
    n_features=C.N_FEATURES,
    n_solutions=C.N_GENERATING_SOLUTIONS,
    sparsity=C.SPARSITY,
    noise_data_std=C.NOISE_STD,
    nan_ratio=C.NAN_RATIO,
    binarize=C.BINARIZE,
    binary_response_ratio=C.BINARY_RESPONSE_RATIO,
    random_seed=C.DATASET_SEED,
    save_to_csv=False,
    print_data_overview=True,
    show_feature_correlations=False,
)

support_indices = parameters["support_indices"].sum()
true_support_features = [f"feature_{i}" for i in set(support_indices)]

# Classical approach

If the proportion of missing values is acceptable, the problem can be solved directly with regularized logistic or linear regression.
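
As rough intuition for how L1 and L2 regularization differ, consider an idealized design with orthonormal columns: the L1 penalty soft-thresholds each unpenalized coefficient, while the L2 penalty only rescales it. The sketch below illustrates that textbook special case. It is not what the `gemss` regression helpers do internally, and the coefficient values and `lam` are hypothetical.

```python
def soft_threshold(value: float, lam: float) -> float:
    """L1 (lasso) solution for an orthonormal design: shrink by lam, clip at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def ridge_shrink(value: float, lam: float) -> float:
    """L2 (ridge) solution for an orthonormal design: rescale toward zero."""
    return value / (1.0 + lam)


# Hypothetical unpenalized coefficients: two strong features and two weak ones.
ols_coefficients = [2.0, -1.5, 0.25, -0.125]
lam = 0.5  # hypothetical regularization strength

l1_coefficients = [soft_threshold(b, lam) for b in ols_coefficients]
l2_coefficients = [ridge_shrink(b, lam) for b in ols_coefficients]

print("l1:", l1_coefficients)
print("l2:", l2_coefficients)
```

In this toy setting the L1 penalty sets the weak coefficients exactly to zero, whereas the L2 penalty keeps every coefficient nonzero.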

In [None]:
# Show regression results for the generating solutions and for the full feature set.
# Work on a copy so that the original generating_solutions is not modified.
generating_solutions_expanded = generating_solutions.copy()
generating_solutions_expanded["original feature set"] = df.columns.to_list()

for penalty in ["l1", "l2"]:
    show_regression_results_for_solutions(
        solutions=generating_solutions_expanded,
        df=df,
        response=y,
        penalty=penalty,
        verbose=False,  # if True, show detailed results for each solution
    )

# GEMSS: running a Bayesian feature selector

In [None]:
selector = BayesianFeatureSelector(
    n_features=C.N_FEATURES,
    n_components=C.N_CANDIDATE_SOLUTIONS,
    X=df.values,
    y=y,
    prior=C.PRIOR_TYPE,
    sss_sparsity=C.PRIOR_SPARSITY,
    sample_more_priors_coeff=C.SAMPLE_MORE_PRIORS_COEFF,
    var_slab=C.VAR_SLAB,
    var_spike=C.VAR_SPIKE,
    weight_slab=C.WEIGHT_SLAB,
    weight_spike=C.WEIGHT_SPIKE,
    student_df=C.STUDENT_DF,
    student_scale=C.STUDENT_SCALE,
    lr=C.LEARNING_RATE,
    batch_size=C.BATCH_SIZE,
    n_iter=C.N_ITER,
)

history = selector.optimize(
    regularize=C.IS_REGULARIZED,
    lambda_jaccard=C.LAMBDA_JACCARD,
    verbose=True,
)

In [None]:
# show_algorithm_progress plots how the key metrics evolve over the iterations.
# Inspect it first: it is essential for understanding model behavior and spotting issues.
show_algorithm_progress(history)

# Compare the relative weights of the candidate solutions
show_final_alphas(
    history,
    show_bar_plot=False,
    show_pie_chart=True,
)

## Outlier Analysis

The outlier analysis identifies features whose importance in a component is unusually large in magnitude (strongly positive or strongly negative mu values).

Ideally, the detected outliers coincide with the features in the final solutions.
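
To illustrate the idea behind a threshold coefficient, here is a minimal sketch of one common outlier rule: flag a value when its distance from the center exceeds `threshold_coeff` times the spread, using either mean and standard deviation or median and median absolute deviation. This rule, the helper `find_outliers`, and the mu values are assumptions for illustration only; the actual logic in `gemss.diagnostics.outliers` may differ.

```python
import statistics


def find_outliers(values, threshold_coeff=3.0, use_median=False):
    """Return indices whose deviation from the center exceeds threshold_coeff * spread."""
    if use_median:
        center = statistics.median(values)
        # Median absolute deviation: a spread estimate that is robust to outliers.
        spread = statistics.median([abs(v - center) for v in values])
    else:
        center = statistics.mean(values)
        spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - center) > threshold_coeff * spread]


# Hypothetical mu values of one component: mostly near zero, two large entries.
mus = [0.01, -0.02, 0.03, 0.0, -0.01, 0.02, 1.8, -0.03, 0.01, 0.0,
       -0.02, 0.02, -1.7, 0.01, 0.0, -0.01, 0.03, -0.02, 0.01, 0.0]

for coeff in [2.5, 3.0, 3.5]:
    print(coeff, find_outliers(mus, coeff), find_outliers(mus, coeff, use_median=True))
```

Because large values inflate the standard deviation, the mean-based rule can stop flagging them at stricter thresholds, which is one reason to compare several coefficients.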


In [None]:
# Show outlier features
for outlier_threshold_coeff in [2.5, 3.0, 3.5]:
    show_outlier_features_by_component(
        history=history,
        use_median=False,
        outlier_threshold_coeff=outlier_threshold_coeff,
        use_markdown=True,
    )

# Results

## Overview of full (long) solutions

The 'long' solutions are the full feature sets found by the algorithm. They may be less sparse than desired.

In [None]:
solutions, final_parameters, full_nonzero_solutions = recover_solutions(
    search_history=history,
    desired_sparsity=C.DESIRED_SPARSITY,
    min_mu_threshold=C.MIN_MU_THRESHOLD,
    verbose=False,
)

In [None]:
show_solution_summary(
    solution_data=full_nonzero_solutions,
    title="Full solutions found by the feature selector, ordered by importance",
    value_column="Feature",
)

## Assessment of (short) solutions

The short solutions contain only the most important features of the long solutions. The number of features kept is set by the desired sparsity parameter.
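
The truncation step can be sketched as a top-k selection. The example assumes that importance is measured by the absolute value of a feature's mu; the helper `shorten_solution` and the feature values are hypothetical and only illustrate the idea, not the implementation of `recover_solutions`.

```python
def shorten_solution(feature_importances: dict, desired_sparsity: int) -> list:
    """Keep the desired_sparsity features with the largest absolute importance."""
    ranked = sorted(
        feature_importances,
        key=lambda name: abs(feature_importances[name]),
        reverse=True,
    )
    return ranked[:desired_sparsity]


# Hypothetical long solution: feature name -> mu value.
long_solution = {
    "feature_3": 1.9,
    "feature_17": -1.4,
    "feature_8": 0.6,
    "feature_21": -0.05,
    "feature_2": 0.02,
}

short_solution = shorten_solution(long_solution, desired_sparsity=3)
print(short_solution)
```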

In [None]:
show_features_in_solutions(
    solutions=solutions,
    history=history,
    constants=C.as_dict(),
    use_markdown=True,
)

## Overview of discovered features in the short solutions

Compare the ground-truth features that support the generating solutions with the features discovered across all (short) candidate solutions.
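
Such a comparison boils down to set operations. The sketch below shows one plausible way to summarize it with hits, misses, extra features, precision, and recall. The helper `compare_feature_sets` and the feature names are hypothetical, and the output of `compare_true_and_found_features` may be organized differently.

```python
def compare_feature_sets(true_features, found_features):
    """Summarize the overlap between ground-truth and discovered features."""
    true_set, found_set = set(true_features), set(found_features)
    hits = true_set & found_set
    return {
        "hits": sorted(hits),
        "missed": sorted(true_set - found_set),
        "extra": sorted(found_set - true_set),
        # Share of discovered features that are truly in the support.
        "precision": len(hits) / len(found_set) if found_set else 0.0,
        # Share of true support features that were discovered.
        "recall": len(hits) / len(true_set) if true_set else 0.0,
    }


# Hypothetical feature sets.
true_support = ["feature_1", "feature_4", "feature_7", "feature_9"]
discovered = ["feature_1", "feature_4", "feature_9", "feature_12", "feature_20"]

summary = compare_feature_sets(true_support, discovered)
print(summary)
```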

In [None]:
features_found = get_unique_features(solutions)

compare_true_and_found_features(
    features_found=features_found,
    true_support_features=true_support_features,
    n_total_features=len(df.columns),
)

## Mutual comparison of the short solutions

In [None]:
display(pd.DataFrame.from_dict(solutions, orient="index").T)

features_to_show = list(set(true_support_features).union(set(features_found)))
show_features_in_components(solutions, features_to_show=features_to_show)
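
One simple way to quantify how much the short solutions overlap is the pairwise Jaccard similarity (size of the intersection divided by size of the union). The sketch below assumes that solutions are stored as a dictionary mapping a component name to a list of feature names; the component and feature names are hypothetical.

```python
from itertools import combinations


def jaccard(first, second):
    """Jaccard similarity of two feature collections (0.0 if both are empty)."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 0.0


# Hypothetical short solutions: component name -> selected features.
example_solutions = {
    "component_0": ["feature_1", "feature_4", "feature_7"],
    "component_1": ["feature_1", "feature_4", "feature_9"],
    "component_2": ["feature_12", "feature_15", "feature_20"],
}

pairwise_similarity = {
    (a, b): jaccard(example_solutions[a], example_solutions[b])
    for a, b in combinations(example_solutions, 2)
}

for pair, similarity in pairwise_similarity.items():
    print(pair, similarity)
```

Values close to 1 indicate nearly identical solutions, while values close to 0 indicate that the components picked different features.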

In [None]:
show_correlation_matrix(df[sorted(features_to_show)], width=600, height=600)

In [None]:
# display(Markdown("### Mixture means vs. true means"))
# show_final_parameter_comparison(
#     true_parameters=parameters,
#     final_parameters=final_parameters,
# )

# Compute regression using features in short solutions

See how the short candidate solutions perform when a simple logistic/linear regression is fitted on their features.

In [None]:
for penalty in ["l1", "l2"]:
    show_regression_results_for_solutions(
        solutions=solutions,
        df=df,
        response=y,
        penalty=penalty,
        verbose=False,  # if True, show detailed results for each solution
    )

# Performance diagnostics (work in progress)

Let us assess the feature selector's progress history to evaluate the reliability of the results.

In [None]:
diagnostics = run_performance_diagnostics(
    history,
    desired_sparsity=C.DESIRED_SPARSITY,
    verbose=True,
)

## Recommendations (work in progress)

Depending on the diagnostics, changes to the hyperparameters may be recommended.

In [None]:
display_recommendations(
    diagnostics=diagnostics,
    constants=C.as_dict(),
)