# GEMSS Demo

This notebook demonstrates the GEMSS algorithm on artificially generated data.

> **Note:**
Ensure that the `gemss` package and dependencies are installed.

> **Configuration:**
All parameters can be configured in JSON files in `gemss/config`.

**The artificial data:**
- Are generated randomly from a given random seed.
- Have the prescribed dimensions (sample size, number of features), noise level, ratio of NaNs (evenly distributed), response type (continuous or binarized), and, for binary classification problems, prevalence of the minority class.
- Have a predefined number of "generating" solutions, each of the same sparsity. The generating features are picked at random and have random weights.
- Can optionally be saved.
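
As a rough illustration of the properties listed above, the following standard-library sketch picks the random supports and weights of a few sparse "generating" solutions and injects NaNs at a given ratio. It is not the `gemss` generator: the helper names `make_toy_solutions` and `inject_nans`, their parameters, and the uniform placement of the NaNs are assumptions made for illustration only.

```python
import random


def make_toy_solutions(n_features, n_solutions, sparsity, seed):
    """Pick `sparsity` random features with random weights for each solution."""
    rng = random.Random(seed)  # seeded generator, so the draws are reproducible
    toy_solutions = {}
    for k in range(n_solutions):
        support = sorted(rng.sample(range(n_features), sparsity))
        toy_solutions[f"solution_{k}"] = {
            f"feature_{i}": rng.uniform(-1.0, 1.0) for i in support
        }
    return toy_solutions


def inject_nans(rows, nan_ratio, seed):
    """Replace a fixed share of cells, chosen uniformly at random, with NaN."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(rows)) for c in range(len(rows[0]))]
    n_missing = round(nan_ratio * len(cells))
    out = [list(row) for row in rows]  # leave the input untouched
    for r, c in rng.sample(cells, n_missing):
        out[r][c] = float("nan")
    return out


toy_solutions = make_toy_solutions(n_features=20, n_solutions=3, sparsity=4, seed=42)

data_rng = random.Random(0)
toy_X = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(10)]
toy_X_missing = inject_nans(toy_X, nan_ratio=0.1, seed=1)
```

Calling `make_toy_solutions` again with the same seed returns the same supports and weights, which is the role the dataset seed plays in this notebook.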

**Using GEMSS:**
- Define hyperparameters as instructed below and run the feature selector.
- Assess the convergence. If needed, adjust hyperparameters and rerun.
- Inspect the three types of solutions ("outlier", "full", "top"), their properties, and their modeling potential in simple linear/logistic models.


In [None]:
from IPython.display import display, Markdown
import plotly.io as pio

import gemss.config as C
from gemss.config import display_current_config
from gemss.data_handling.generate_artificial_dataset import (
    generate_artificial_dataset,
)
from gemss.feature_selection.inference import BayesianFeatureSelector
from gemss.utils.visualizations import (
    show_correlation_matrix,
    show_features_in_components,
)
from gemss.diagnostics.recommendations import display_parameter_adjustment_summary
from gemss.postprocessing.result_postprocessing import (
    get_features_from_solutions,
    get_unique_features,
    recover_solutions,
    show_algorithm_progress,
    show_unique_features_from_full_solutions,
    compare_true_and_found_features,
)
from gemss.utils.utils import show_solution_summary
from gemss.postprocessing.simple_regressions import (
    solve_any_regression,
    show_regression_metrics,
)
from gemss.utils.visualizations import show_final_alphas

pio.renderers.default = "notebook_connected"  # Ensures plotly plots show in notebooks

# Hyperparameter overview

Review the default hyperparameter settings. They can be overridden as needed in the JSON configuration files in `gemss/config`.

In [None]:
display_current_config(
    constants=C.as_dict(),
    constant_type="all",
)
display_parameter_adjustment_summary()

# Generate artificial dataset

In [None]:
# Generate dataset
df, y, generating_solutions, parameters = generate_artificial_dataset(
    n_samples=C.N_SAMPLES,
    n_features=C.N_FEATURES,
    n_solutions=C.N_GENERATING_SOLUTIONS,
    sparsity=C.SPARSITY,
    noise_data_std=C.NOISE_STD,
    nan_ratio=C.NAN_RATIO,
    binarize=C.BINARIZE,
    binary_response_ratio=C.BINARY_RESPONSE_RATIO,
    random_seed=C.DATASET_SEED,
    save_to_csv=False,  # Set to True to save generated data
    print_data_overview=True,
    show_feature_correlations=False,
)

support_indices = parameters["support_indices"].sum()
true_support_features = [f"feature_{i}" for i in set(support_indices)]

# Classical approach: simple regressions

If the amount of missing values is acceptable, solve the problem using regularized logistic/linear regression.

We run l2- and l1-regularized regression on the known sets of generating features (*solution_0*, *solution_1*, ...) and on the whole original feature set (including all the noise) to see the "baseline" performance of such models.
Observe the number of nonzero coefficients (*n_nonzero_coefficients*) for the model with the original feature set: even the optimal l1-regularized model ($\lambda$ optimized by inner CV) does not achieve perfect sparsity.
Moreover, even models that are given the perfect feature sets do not reach the perfect score of 1.0 on some performance metrics.
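
To see why the number of nonzero coefficients differs between the two penalties, consider the textbook special case of an orthonormal design, where both penalized estimates have closed forms: the l1 penalty soft-thresholds each least-squares coefficient (setting small ones exactly to zero), while the l2 penalty only rescales it. The sketch below uses made-up coefficients and a made-up $\lambda$; it is independent of `solve_any_regression` and only illustrates the mechanism.

```python
def soft_threshold(b, lam):
    """l1 (lasso) estimate of one coefficient under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # coefficients within [-lam, lam] are set exactly to zero


def ridge_shrink(b, lam):
    """l2 (ridge) estimate of one coefficient under an orthonormal design."""
    return b / (1.0 + lam)  # shrinks towards zero but never reaches it


# Made-up least-squares coefficients: two strong signals and three weak ones.
ols_coefficients = [2.0, -1.5, 0.2, -0.1, 0.05]
lam = 0.5

lasso_coefficients = [soft_threshold(b, lam) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, lam) for b in ols_coefficients]

n_nonzero_l1 = sum(b != 0.0 for b in lasso_coefficients)
n_nonzero_l2 = sum(b != 0.0 for b in ridge_coefficients)
print(n_nonzero_l1, n_nonzero_l2)  # prints "2 5"
```

With correlated or noisy features, some irrelevant coefficients can still survive the threshold, which is consistent with the imperfect sparsity of the l1 model on the original feature set.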


In [None]:
# Show regression results only for the generating solutions and the full feature set.
# Copy first so that adding the extra entry does not modify generating_solutions itself.
generating_solutions_expanded = generating_solutions.copy()
generating_solutions_expanded["original feature set"] = df.columns.to_list()

for penalty in ["l1", "l2"]:
    metrics = solve_any_regression(
        solutions=generating_solutions_expanded,
        df=df,
        response=y,
        apply_scaling="standard",
        penalty=penalty,
        verbose=False,  # if True, shows detailed results for each solution
    )
    show_regression_metrics(
        metrics_df=metrics,
        title=f"Results on training data (penalty={penalty})",
        use_markdown=True,
    )

# GEMSS: running a Bayesian feature selector

Find multiple solutions simultaneously. It is recommended to set *N_CANDIDATE_SOLUTIONS* to 2-3 times the number of expected "true" solutions, because of the properties of solutions in linear vector spaces.

In [None]:
selector = BayesianFeatureSelector(
    n_features=C.N_FEATURES,
    n_components=C.N_CANDIDATE_SOLUTIONS,
    X=df.values,
    y=y,
    prior=C.PRIOR_TYPE,
    sss_sparsity=C.PRIOR_SPARSITY,
    sample_more_priors_coeff=C.SAMPLE_MORE_PRIORS_COEFF,
    var_slab=C.VAR_SLAB,
    var_spike=C.VAR_SPIKE,
    weight_slab=C.WEIGHT_SLAB,
    weight_spike=C.WEIGHT_SPIKE,
    student_df=C.STUDENT_DF,
    student_scale=C.STUDENT_SCALE,
    lr=C.LEARNING_RATE,
    batch_size=C.BATCH_SIZE,
    n_iter=C.N_ITER,
)

history = selector.optimize(
    regularize=C.IS_REGULARIZED,
    lambda_jaccard=C.LAMBDA_JACCARD,
    verbose=True,
)

## Assess algorithm convergence



In [None]:
# show_algorithm_progress is the most important diagnostic function:
# it visualizes the optimization process by plotting how key metrics evolve over iterations,
# which is essential for understanding model behavior and spotting issues.
show_algorithm_progress(history, subsample_history_for_plotting=True)

In [None]:
# Compare the relative weights of the candidate solutions
show_final_alphas(
    history,
    show_bar_plot=False,
    show_pie_chart=True,
)

# Extract solutions from the components

Analyze the feature selection process to extract solutions. There are three basic solution types (under ideal conditions, all should be equivalent):

- **top** solutions recover strictly the predefined number of features with the greatest (absolute) $\mu$ values.

- **full** solutions recover all features whose $\mu$ values exceed a hard, predefined threshold $\epsilon$. If fewer such features exist than the required sparsity, the solution is completed by backtracking to the features that were zeroed out last.

- **outlier** solutions use statistics to detect all features whose $\mu$ values deviate by more than a given threshold, expressed in standard deviations (STD) or median absolute deviations (MAD). They are suited to flexible solution recovery when the sparsity is not known a priori. The downside can be a sizeable overlap between candidate solutions (if the threshold is too small) or too few features found (if the threshold is too large); using multiple thresholds mitigates this. Recommended threshold values range from 2.0 to 4.0.
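
The difference between the **top** and **full** rules can be sketched on the final $\mu$ values of a single component. This is a simplified illustration, not the logic of `recover_solutions`: the helper names `select_top` and `select_full` and the example values are hypothetical, the threshold is assumed to apply to the absolute value of $\mu$, and the backtracking step of the full solutions is omitted.

```python
def select_top(mu, sparsity):
    """Keep exactly `sparsity` features with the largest absolute mu."""
    ranked = sorted(mu, key=lambda name: abs(mu[name]), reverse=True)
    return ranked[:sparsity]


def select_full(mu, min_mu_threshold):
    """Keep every feature whose absolute mu exceeds the threshold."""
    return [name for name, value in mu.items() if abs(value) > min_mu_threshold]


# Hypothetical final mu values of one component.
component_mu = {
    "feature_0": 0.01,
    "feature_1": -1.20,
    "feature_2": 0.85,
    "feature_3": 0.30,
    "feature_4": -0.02,
}

print(select_top(component_mu, sparsity=2))
# ['feature_1', 'feature_2']
print(select_full(component_mu, min_mu_threshold=0.25))
# ['feature_1', 'feature_2', 'feature_3']
```

Here the top rule returns a fixed number of features regardless of their magnitude, whereas the full rule returns however many features clear the threshold.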

In [None]:
full_solutions, top_solutions, outlier_solutions, final_parameters = recover_solutions(
    search_history=history,
    desired_sparsity=C.DESIRED_SPARSITY,
    min_mu_threshold=C.MIN_MU_THRESHOLD,
    use_median_for_outlier_detection=C.USE_MEDIAN_FOR_OUTLIER_DETECTION,
    outlier_deviation_thresholds=C.OUTLIER_DEVIATION_THRESHOLDS,
)

## Solutions by outlier detection

The outlier analysis helps identify features with unusually large importance values ($\mu$, either positive or negative) in each component.
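
A minimal sketch of such a rule, using only the standard library, is shown below. It flags features whose $\mu$ lies more than a chosen number of STDs (around the mean) or MADs (around the median) from the centre of the component's values. The function `detect_outlier_features`, its parameters, and the example values are assumptions for illustration and may differ from what `recover_solutions` does internally.

```python
import statistics


def detect_outlier_features(mu, threshold, use_median=False):
    """Return features whose mu is more than `threshold` deviations from the centre."""
    values = list(mu.values())
    if use_median:
        centre = statistics.median(values)
        spread = statistics.median(abs(v - centre) for v in values)  # MAD
    else:
        centre = statistics.mean(values)
        spread = statistics.pstdev(values)  # population STD
    if spread == 0:
        return []  # all values coincide, so nothing stands out
    return [name for name, v in mu.items() if abs(v - centre) > threshold * spread]


# Hypothetical component: 18 near-zero values and two large ones.
toy_mu = {f"feature_{i}": 0.01 * ((i % 5) - 2) for i in range(20)}
toy_mu["feature_3"] = 1.5
toy_mu["feature_7"] = -1.2

print(detect_outlier_features(toy_mu, threshold=2.0))
# ['feature_3', 'feature_7']
print(detect_outlier_features(toy_mu, threshold=3.0))
# ['feature_3']
print(detect_outlier_features(toy_mu, threshold=3.0, use_median=True))
# ['feature_3', 'feature_7']
```

In this toy example the two large values inflate the STD, so a stricter STD threshold drops one of them, while the MAD is barely affected by them. This is one reason to inspect several thresholds side by side.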

In [None]:
outlier_features = {}
for deviation, df_outlier in outlier_solutions.items():
    outlier_features[f"{deviation}"] = get_features_from_solutions(df_outlier)
    features_found = get_unique_features(outlier_features[f"{deviation}"])

    show_solution_summary(
        solution_data=df_outlier,
        title=f"Outlier solutions ({deviation})",
        value_column="Feature",
    )

    show_unique_features_from_full_solutions(df_outlier)

    show_features_in_components(
        solutions=outlier_features[f"{deviation}"],
        features_to_show=features_found,  # computed above for this threshold
    )

    display(Markdown(f"## Regression on discovered outlier ({deviation}) features"))
    penalty = "l2"
    metrics = solve_any_regression(
        solutions=outlier_features[f"{deviation}"],
        df=df,
        response=y,
        apply_scaling="standard",
        penalty=penalty,
        verbose=False,  # if True, shows detailed results for each solution
    )
    show_regression_metrics(
        metrics_df=metrics,
        title=f"Outlier ({deviation}) features - regression results on training data (penalty = {penalty})",
        use_markdown=True,
    )

    display(
        Markdown(
            f"## Compare discovered outlier ({deviation}) features with ground truth"
        )
    )
    compare_true_and_found_features(
        features_found=features_found,
        true_support_features=true_support_features,
        n_total_features=len(df.columns),
    )

    display(Markdown("------------------------------------------------------------"))

## Full solutions

The "full" solutions contain all features that are not "negligible", i.e. those whose $\mu$ values exceed a manually set threshold.

In [None]:
full_features = get_features_from_solutions(full_solutions)
features_found = get_unique_features(full_features)

show_solution_summary(
    solution_data=full_solutions,
    title="Full solutions found by the feature selector, ordered by importance",
    value_column="Feature",
)

show_unique_features_from_full_solutions(full_solutions)

show_features_in_components(
    solutions=full_features,
    features_to_show=features_found,
)

display(Markdown("## Regression on discovered full features"))
for penalty in ["l2", "l1"]:
    metrics = solve_any_regression(
        solutions=full_features,
        df=df,
        response=y,
        apply_scaling="standard",
        penalty=penalty,
        verbose=False,  # if True, shows detailed results for each solution
    )
    show_regression_metrics(
        metrics_df=metrics,
        title=f"Full features - regression results on training data (penalty={penalty})",
        use_markdown=True,
    )

display(Markdown("## Compare discovered full features with ground truth"))
compare_true_and_found_features(
    features_found=features_found,
    true_support_features=true_support_features,
    n_total_features=len(df.columns),
)

display(Markdown("------------------------------------------------------------"))

## Top solutions

The top solutions contain strictly the predefined number of features.

In [None]:
top_features = get_features_from_solutions(top_solutions)
features_found = get_unique_features(top_features)

display(Markdown(f"**Required sparsity:** {C.DESIRED_SPARSITY}"))

show_solution_summary(
    solution_data=top_solutions,
    title=f"Top solutions with required sparsity = {C.DESIRED_SPARSITY}",
    value_column="Feature",
)

show_unique_features_from_full_solutions(top_solutions)

show_features_in_components(
    solutions=top_features,
    features_to_show=features_found,
)

show_correlation_matrix(df[sorted(features_found)], width=600, height=600)

display(Markdown("## Regression on discovered top features"))
for penalty in ["l2", "l1"]:
    metrics = solve_any_regression(
        solutions=top_features,
        df=df,
        response=y,
        apply_scaling="standard",
        penalty=penalty,
        verbose=False,  # if True, shows detailed results for each solution
    )
    show_regression_metrics(
        metrics_df=metrics,
        title=f"Top features - regression results on training data (penalty = {penalty})",
        use_markdown=True,
    )

display(Markdown("## Compare discovered top features with ground truth"))
compare_true_and_found_features(
    features_found=features_found,
    true_support_features=true_support_features,
    n_total_features=len(df.columns),
)
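
The ground-truth comparisons in the cells above can be thought of as plain set operations between the discovered features and the true support. The sketch below is a minimal illustration of that idea; `compare_feature_sets` and the example feature names are hypothetical and do not reproduce the output of `compare_true_and_found_features`.

```python
def compare_feature_sets(found, truth):
    """Summarize the overlap between discovered and true features."""
    found, truth = set(found), set(truth)
    hits = found & truth
    return {
        "correct": sorted(hits),          # true features that were found
        "missed": sorted(truth - found),  # true features that were not found
        "extra": sorted(found - truth),   # found features outside the true support
        "precision": len(hits) / len(found) if found else 0.0,
        "recall": len(hits) / len(truth) if truth else 0.0,
    }


report = compare_feature_sets(
    found=["feature_1", "feature_2", "feature_9"],
    truth=["feature_1", "feature_2", "feature_5", "feature_7"],
)
print(report["correct"])  # ['feature_1', 'feature_2']
print(report["missed"])   # ['feature_5', 'feature_7']
print(report["extra"])    # ['feature_9']
print(report["recall"])   # 0.5
```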