# GEMSS on your custom dataset

This notebook demonstrates how to apply GEMSS to your own dataset with unknown ground truth. It largely mirrors the workflow of demo.ipynb, where applicable.

Use this notebook to try out GEMSS on your custom data. Follow the instructions below to set up the algorithm's parameters.

In [None]:
# %pip install -e ..

In [None]:
from IPython.display import display, Markdown
import pandas as pd
import numpy as np
from plotly import io as pio

import gemss.config as C
from gemss.feature_selection.inference import BayesianFeatureSelector
from gemss.diagnostics.visualizations import (
    show_label_histogram,
    show_final_alphas,
    show_features_in_components,
)
from gemss.diagnostics.outliers import show_outlier_features_by_component
from gemss.diagnostics.result_postprocessing import *
from gemss.diagnostics.simple_regressions import show_regression_results_for_solutions
from gemss.diagnostics.performance_tests import run_performance_diagnostics
from gemss.diagnostics.recommendations import display_recommendations
from gemss.utils import show_solution_summary

from gemss.data_handling.data_processing import (
    load_data,
    preprocess_non_numeric_features,
    preprocess_features,
)

pio.renderers.default = "notebook_connected"  # Ensures plotly plots show in notebooks

# Your setup

In this section, define the specifics for your data and parameters for the feature selection algorithm.

## Control verbosity and output

In [None]:
# Show plots of algorithm progress over iterations
show_search_history = True

# Choose overall verbosity for various outputs
verbose = True

# Whether to run regressions for every set of solutions and show (only) the summary metrics
run_regressions_for_solutions = True

# The set of candidate solutions to analyze in greater detail.
# Can be one of these:
# None   ... no detailed analysis
# 'full' ... solutions consisting of all features above the provided minimal |mu| threshold
# 'top'  ... the top few features according to |mu| values; their number is given by DESIRED_SPARSITY
# 'outlier_DEVIATION', e.g. "outlier_2.5"
#        ... solutions given by outlier detection
#        ... DEVIATION must be one of the OUTLIER_DEVIATION_THRESHOLDS
chosen_solution_set = "full"

# Whether to run performance diagnostics
# Diagnostics are required for showing recommendations
# but can be run quietly, if desired
run_diagnostics = False
verbose_diagnostics = True

# Whether to show recommendations based on diagnostics
# requires `run_diagnostics = True` too
show_recommendations = True
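
The accepted values of `chosen_solution_set` listed above can be checked before running the rest of the notebook. The following is a minimal sketch; `validate_solution_set` is a hypothetical helper, not part of GEMSS, and it assumes that the deviation suffix must match the textual form of one of the thresholds (for example `"outlier_3.0"` rather than `"outlier_3"`), because the notebook later builds dictionary keys from that text.

```python
# Hypothetical helper (not part of GEMSS): sanity-check a chosen_solution_set value.
def validate_solution_set(name, thresholds):
    """Return True if `name` is one of the identifiers described in the comments above."""
    if name is None or name in ("full", "top"):
        return True
    if isinstance(name, str) and name.startswith("outlier_"):
        # Compare as text: the suffix is later used verbatim to build a lookup key.
        deviation_text = name.split("_", 1)[1]
        return deviation_text in {str(t) for t in thresholds}
    return False


# Example thresholds, assumed to mirror OUTLIER_DEVIATION_THRESHOLDS
example_thresholds = [2.5, 3.0, 3.5]
checks = {
    name: validate_solution_set(name, example_thresholds)
    for name in ("full", "top", "outlier_2.5", "outlier_3", "outlier_4.0", "best")
}
print(checks)
```

A check like this fails fast on a typo instead of surfacing as a `KeyError` after a long optimization run.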

## Set parameters to handle your dataset

In [None]:
# dataset parameters
# the CSV file should be in the ../data/ directory
# the index and label column names must be included in the dataset
# csv_dataset_name = "shelflife_data_all_preprocessed.csv"
csv_dataset_name = "shelflife_data_all.csv"
index_column_name = "sample ID"
label_column_name = "label"

# NA handling options
# Options are:
# - "response": drop rows with NA in the response column only (default).
# - "all": drop rows with NA in any column.
# - "none": do not drop any rows.
dropna_columns = "response"

# If True, keep only numerical features
drop_non_numeric_features = True

# Apply standard scaling to features
# Used in both the feature selection algorithm
# and the simple regression models run on the results
apply_standard_scaling = False
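
To make the three NA-handling options concrete, here is a minimal pandas sketch of what each one is described as doing. `drop_na_rows` and the toy data are hypothetical and only illustrate the option semantics; the actual logic lives in `preprocess_features` and may differ in detail.

```python
import numpy as np
import pandas as pd


def drop_na_rows(features, response, how="response"):
    """Hypothetical helper illustrating the three NA-handling modes."""
    if how == "response":
        # Keep rows whose response is present, regardless of feature NAs
        keep = response.notna()
    elif how == "all":
        # Keep only rows with no NA in the response or in any feature
        keep = response.notna() & features.notna().all(axis=1)
    elif how == "none":
        # Keep every row
        keep = pd.Series(True, index=features.index)
    else:
        raise ValueError(f"Unknown option: {how!r}")
    return features[keep], response[keep]


toy_features = pd.DataFrame(
    {"f1": [1.0, np.nan, 3.0, 4.0], "f2": [0.5, 0.1, 0.2, np.nan]}
)
toy_response = pd.Series([1.0, 0.0, np.nan, 1.0])

# Number of rows that survive under each option
sizes = {
    how: len(drop_na_rows(toy_features, toy_response, how)[0])
    for how in ("response", "all", "none")
}
print(sizes)
```

Note that with `"response"` or `"none"`, missing feature values remain in the data and have to be handled elsewhere.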

### Load your data

In [None]:
if verbose:
    display(Markdown("#### Loading your data"))

df, response = load_data(
    csv_dataset_name,
    index_column_name,
    label_column_name,
)

if drop_non_numeric_features:
    df = preprocess_non_numeric_features(df, how="drop", verbose=verbose)

X, y, feature_to_name = preprocess_features(
    df,
    response,
    dropna=dropna_columns,
    apply_standard_scaling=apply_standard_scaling,
    verbose=verbose,
)

if verbose:
    display(Markdown("#### Your data:"))
    display(Markdown(f"- Number of samples: **{X.shape[0]}**"))
    display(Markdown(f"- Number of features: **{X.shape[1]}**"))
    display(Markdown(f"{list(feature_to_name.values())}"))
    display(Markdown(f"- Number of labels: **{len(np.unique(y))}**"))
    show_label_histogram(y)
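
The `apply_standard_scaling` option matters because thresholds on |μ| depend on the scale of the features. As a rough illustration, the sketch below shows conventional z-score standardization in NumPy. It assumes the usual definition (subtract the column mean, divide by the column standard deviation); `standardize` is a hypothetical helper, and the scaling applied inside `preprocess_features` may be implemented differently.

```python
import numpy as np


def standardize(X):
    """Hypothetical helper: z-score each column of X."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero standard deviation; leave them centered at 0
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


toy_X = np.array(
    [
        [1.0, 10.0, 5.0],
        [2.0, 20.0, 5.0],
        [3.0, 60.0, 5.0],
    ]
)
scaled = standardize(toy_X)
print(np.round(scaled, 3))
```

After scaling, the first two columns have zero mean and unit variance even though their raw ranges differ widely, while the constant third column becomes all zeros.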

## Set parameters for the feature selection algorithm

- First, default constant values are loaded by the config module.
- Then, override the settings of selected parameters as needed.
- The algorithm usually takes a minute or more per 1000 training iterations on a CPU with the default 'sss' prior. The 'student' prior is faster.

In [None]:
# DO NOT CHANGE THIS CELL unless you know what you're doing

# First load the default constants defined by the config module (including nonsensical values for your data)
constants = C.as_dict()

# Automated
constants["N_SAMPLES"] = X.shape[0]
constants["N_FEATURES"] = X.shape[1]

In [None]:
# SET YOUR PARAMETERS HERE

# Algorithm settings

### PRIOR_TYPE
### ...do not change unless you know what you are doing or nothing else works
### ...SSS = structured spike and slab priors
### ...method of 2nd recourse: 'student'
constants["PRIOR_TYPE"] = "sss"

### PRIOR_SPARSITY = number of supporting dimensions for the 'sss' prior
### used only with the 'sss' prior
constants["PRIOR_SPARSITY"] = constants["DESIRED_SPARSITY"]

## Prior hyperparameters
### VAR_SPIKE and VAR_SLAB are only used with the 'ss' and 'sss' priors
### ...VAR_SPIKE: parameter with the most influence on solution quality
### ...smaller VAR_SPIKE => more sparsity, i.e. fewer nonzero features
### ...increase VAR_SPIKE when all features converge to 0, typically in a uniform manner
### ...decrease VAR_SPIKE when there are too many nonzero features at the end of the run
constants["VAR_SPIKE"] = 0.075
constants["VAR_SLAB"] = 100.0

### WEIGHT_SLAB and WEIGHT_SPIKE are only used with the 'ss' prior
constants["WEIGHT_SLAB"] = 0.9
constants["WEIGHT_SPIKE"] = 0.1

### STUDENT_DF and STUDENT_SCALE are only used with the Student prior
constants["STUDENT_DF"] = 1
constants["STUDENT_SCALE"] = 1.0

## SGD optimization settings
### N_ITER = number of training iterations (runtime approximately 1+ minutes/1000 iterations)
### increase/decrease depending on both the ELBO and the convergence behavior of μs
constants["N_ITER"] = 800

## Regularization settings to make solutions more distinct
### IS_REGULARIZED = whether to use penalization based on average Jaccard similarity among solutions
### ...preferably False, or True with a small LAMBDA_JACCARD, at this moment
constants["IS_REGULARIZED"] = True
### LAMBDA_JACCARD: larger values => more differentiated feature sets
### ...it is safer to use smaller values and increase only if needed
### ...if the ELBO converges to too large a value (in absolute terms), your lambda is probably too large
constants["LAMBDA_JACCARD"] = 500.0

## Optimizer settings
## ...smaller learning rates require more iterations to converge but may yield a more stable optimization
## ...batch size can be increased for larger datasets to speed up convergence
constants["LEARNING_RATE"] = 0.004  # Learning rate for the Adam optimizer
constants["BATCH_SIZE"] = 16  # Mini-batch size used in each optimization step


# Properties of the desired solutions

### N_CANDIDATE_SOLUTIONS = number of candidate solutions
###                       = no. components of Gaussian mixture that approximate the posterior
constants["N_CANDIDATE_SOLUTIONS"] = 6
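constants["DESIRED_SPARSITY"] = 8  # Expected number of features per solution
# Re-sync PRIOR_SPARSITY: it was assigned earlier in this cell, before DESIRED_SPARSITY was overridden
constants["PRIOR_SPARSITY"] = constants["DESIRED_SPARSITY"]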
### MIN_MU_THRESHOLD = minimum |μ| to consider a feature nonzero
### ...adjust based on the scale of your features and the convergence behavior
constants["MIN_MU_THRESHOLD"] = 0.25

# Settings for outlier detection
# ...affect only outlier-based solution sets
constants["USE_MEDIAN_FOR_OUTLIER_DETECTION"] = False
constants["OUTLIER_DEVIATION_THRESHOLDS"] = [2.5, 3.0, 3.5]

# -------------------- NO NEED TO TOUCH CODE BELOW THIS CELL ---------------------------#
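
To build intuition for `VAR_SPIKE`, `VAR_SLAB`, `WEIGHT_SPIKE` and `WEIGHT_SLAB`, the sketch below evaluates a generic two-component spike-and-slab density: a mixture of a narrow and a wide zero-mean Gaussian. This is the textbook form, assumed here purely for illustration. The 'ss' and 'sss' priors in GEMSS may be defined differently, and `gaussian_pdf` and `spike_and_slab_pdf` are hypothetical helpers.

```python
import numpy as np


def gaussian_pdf(w, variance):
    """Density of a zero-mean Gaussian with the given variance."""
    return np.exp(-0.5 * w**2 / variance) / np.sqrt(2.0 * np.pi * variance)


def spike_and_slab_pdf(
    w, var_spike=0.075, var_slab=100.0, weight_spike=0.1, weight_slab=0.9
):
    """Textbook two-component mixture (an assumption, not the GEMSS definition)."""
    spike = weight_spike * gaussian_pdf(w, var_spike)
    slab = weight_slab * gaussian_pdf(w, var_slab)
    return spike + slab


w = np.array([0.0, 0.25, 1.0, 5.0])
density = spike_and_slab_pdf(w)

# Share of the density contributed by the spike at each coefficient value
spike_share = 0.1 * gaussian_pdf(w, 0.075) / density
print(np.round(spike_share, 3))
```

Under this form the spike dominates near zero and the slab dominates for large coefficients. A smaller `VAR_SPIKE` concentrates the spike more tightly around zero, which matches the guidance that smaller values lead to sparser solutions.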

# Run the feature selector on your data

There is no need to touch any code below this cell.

In [None]:
selector = BayesianFeatureSelector(
    n_features=constants["N_FEATURES"],
    n_components=constants["N_CANDIDATE_SOLUTIONS"],
    X=X,
    y=y,
    prior=constants["PRIOR_TYPE"],
    sss_sparsity=constants["PRIOR_SPARSITY"],
    var_slab=constants["VAR_SLAB"],
    var_spike=constants["VAR_SPIKE"],
    weight_slab=constants["WEIGHT_SLAB"],
    weight_spike=constants["WEIGHT_SPIKE"],
    student_df=constants["STUDENT_DF"],
    student_scale=constants["STUDENT_SCALE"],
    lr=constants["LEARNING_RATE"],
    batch_size=constants["BATCH_SIZE"],
    n_iter=constants["N_ITER"],
)

history = selector.optimize(
    regularize=constants["IS_REGULARIZED"],
    lambda_jaccard=constants["LAMBDA_JACCARD"],
    verbose=verbose,
)
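
The `regularize` and `lambda_jaccard` arguments relate to the average Jaccard similarity among candidate solutions. As a point of reference, the sketch below computes that quantity for discrete feature sets. It only illustrates the similarity measure itself; the penalty GEMSS applies during optimization is not shown here, and `jaccard` and `average_pairwise_jaccard` are hypothetical helpers.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_pairwise_jaccard(solutions):
    """Mean Jaccard similarity over all unordered pairs of solutions."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


toy_solutions = [
    {"f1", "f2", "f3"},
    {"f2", "f3", "f4"},
    {"f5", "f6"},
]
avg = average_pairwise_jaccard(toy_solutions)
print(f"average pairwise Jaccard similarity: {avg:.4f}")
```

A value near 0 means the solutions share few features, and a value near 1 means they are nearly identical. A larger `LAMBDA_JACCARD` pushes the solutions toward the former.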

In [None]:
if show_search_history:
    show_algorithm_progress(
        history,
        original_feature_names_mapping=feature_to_name,
    )

    show_final_alphas(
        history,
        show_bar_plot=False,
        show_pie_chart=True,
    )

# Results

Analyze the feature selection process to extract solutions.

In [None]:
full_solutions, top_solutions, outlier_solutions, final_parameters = recover_solutions(
    search_history=history,
    desired_sparsity=constants["DESIRED_SPARSITY"],
    min_mu_threshold=constants["MIN_MU_THRESHOLD"],
    original_feature_names_mapping=feature_to_name,
    use_median_for_outlier_detection=constants["USE_MEDIAN_FOR_OUTLIER_DETECTION"],
    outlier_deviation_thresholds=constants["OUTLIER_DEVIATION_THRESHOLDS"],
)
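
The difference between 'full' and 'top' solutions can be illustrated on a single component. The sketch below follows the descriptions given in this notebook: a full solution keeps every feature whose |μ| exceeds `MIN_MU_THRESHOLD`, ordered by importance, and a top solution keeps the first `DESIRED_SPARSITY` of those. `full_solution` and `top_solution` are hypothetical helpers written under that assumption; `recover_solutions` may apply additional logic.

```python
import numpy as np


def full_solution(mu, names, min_mu_threshold):
    """Features with |mu| above the threshold, ordered by decreasing |mu|."""
    order = np.argsort(-np.abs(mu))
    return [names[i] for i in order if abs(mu[i]) > min_mu_threshold]


def top_solution(mu, names, min_mu_threshold, desired_sparsity):
    """The first `desired_sparsity` features of the full solution."""
    return full_solution(mu, names, min_mu_threshold)[:desired_sparsity]


# Toy final mu values for one mixture component
toy_mu = np.array([0.9, -0.05, -1.4, 0.3, 0.1])
toy_names = ["f0", "f1", "f2", "f3", "f4"]

full = full_solution(toy_mu, toy_names, min_mu_threshold=0.25)
top = top_solution(toy_mu, toy_names, min_mu_threshold=0.25, desired_sparsity=2)
print("full:", full)
print("top: ", top)
```

The sign of μ is ignored when ranking, so a strongly negative coefficient counts as just as important as a strongly positive one.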

## Solutions by outlier detection

The outlier analysis helps identify features with unusually high importance values (mu, either positive or negative) in each component.

Ideally, the detected outliers are identical to the top solutions.

In [None]:
outlier_features = {}
for deviation, df_outlier in outlier_solutions.items():
    show_solution_summary(
        solution_data=df_outlier,
        title=f"Outlier solutions for {deviation}",
        value_column="Feature",
    )

    show_unique_features_from_full_solutions(df_outlier)

    outlier_features[f"{deviation}"] = get_features_from_solutions(df_outlier)
    show_features_in_components(
        solutions=outlier_features[f"{deviation}"],
        features_to_show=get_unique_features(outlier_features[f"{deviation}"]),
    )

    if run_regressions_for_solutions:
        show_regression_results_for_solutions(
            solutions=outlier_features[f"{deviation}"],
            df=df,
            response=response,
            use_standard_scaler=apply_standard_scaling,
            penalty="l2",
            verbose=False,
        )

    display(Markdown("------------------------------------------------------------"))
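
For intuition about the deviation thresholds and the `USE_MEDIAN_FOR_OUTLIER_DETECTION` switch, the sketch below flags features whose μ lies more than a given number of scale units from the center of the component. It assumes mean and standard deviation for the STD variant and median and unscaled median absolute deviation for the MAD variant. `find_outlier_features` is a hypothetical helper, and the detection used by GEMSS may differ (for example, by rescaling the MAD).

```python
import numpy as np


def find_outlier_features(mu, names, threshold, use_median=False):
    """Hypothetical helper: features whose mu deviates strongly from the rest."""
    mu = np.asarray(mu, dtype=float)
    if use_median:
        center = np.median(mu)
        scale = np.median(np.abs(mu - center))  # unscaled MAD (assumption)
    else:
        center = mu.mean()
        scale = mu.std()
    if scale == 0:
        return []  # all values identical: nothing stands out
    deviation = np.abs(mu - center) / scale
    return [name for name, d in zip(names, deviation) if d > threshold]


# Toy component: nine coefficients near zero and one clearly large one
toy_mu = np.array([0.02, -0.01, 0.03, 0.0, -0.02, 0.01, 2.0, -0.03, 0.02, -0.01])
toy_names = [f"f{i}" for i in range(len(toy_mu))]

std_outliers = find_outlier_features(toy_mu, toy_names, threshold=2.5)
mad_outliers = find_outlier_features(toy_mu, toy_names, threshold=2.5, use_median=True)
print("STD-based:", std_outliers)
print("MAD-based:", mad_outliers)
```

The median-based variant is less affected by the outliers themselves, since a few large coefficients inflate the standard deviation but barely move the median.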

## Full solutions

The 'full' solutions are the actual solutions (full sets of features) found by the algorithm. Their sparsity may not be as strong as desired.

In [None]:
show_solution_summary(
    solution_data=full_solutions,
    title="Full solutions found by the feature selector, ordered by importance",
    value_column="Feature",
)

show_unique_features_from_full_solutions(full_solutions)

full_features = get_features_from_solutions(full_solutions)
show_features_in_components(
    solutions=full_features,
    features_to_show=get_unique_features(full_features),
)

if run_regressions_for_solutions:
    show_regression_results_for_solutions(
        solutions=full_features,
        df=df,
        response=response,
        use_standard_scaler=apply_standard_scaling,
        penalty="l2",
        verbose=False,
    )

In [None]:
if chosen_solution_set is not None:
    display(
        Markdown(
            f"## Regression for the chosen solution type **{chosen_solution_set}**"
        )
    )

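    if chosen_solution_set == "full":
        chosen_solutions = full_features
    elif chosen_solution_set == "top":
        # top_features is defined only in the later "Top solutions" cell,
        # so derive the feature lists directly from top_solutions here
        chosen_solutions = get_features_from_solutions(top_solutions)
    elif chosen_solution_set.startswith("outlier"):
        deviation = chosen_solution_set.split("_", 1)[1]
        deviation_str = (
            f"MAD_{deviation}"
            if constants["USE_MEDIAN_FOR_OUTLIER_DETECTION"]
            else f"STD_{deviation}"
        )
        chosen_solutions = outlier_features[deviation_str]
    else:
        raise ValueError(f"Unknown chosen_solution_set: {chosen_solution_set!r}")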

    show_regression_results_for_solutions(
        solutions=chosen_solutions,
        df=df,
        response=response,
        use_standard_scaler=apply_standard_scaling,
        penalty="l2",
        verbose=True,
    )

## Top solutions

The top solutions contain only the most important features from the full solutions. The number of features selected is defined by the desired sparsity parameter.

In [None]:
display(Markdown(f"### Required sparsity: {constants['DESIRED_SPARSITY']}"))

show_solution_summary(
    solution_data=top_solutions,
    title=f"Top solutions with required sparsity {constants['DESIRED_SPARSITY']}",
    value_column="Feature",
)

show_unique_features_from_full_solutions(top_solutions)

top_features = get_features_from_solutions(top_solutions)
show_features_in_components(
    solutions=top_features,
    features_to_show=get_unique_features(top_features),
)

if run_regressions_for_solutions:
    show_regression_results_for_solutions(
        solutions=top_features,
        df=df,
        response=response,
        use_standard_scaler=apply_standard_scaling,
        penalty="l2",
        verbose=False,
    )

display(Markdown("------------------------------------------------------------"))

# Performance diagnostics and recommendations (work in progress)

Let us assess the feature selector's progress history to evaluate the reliability of the results. Based on the diagnostics, hyperparameter tuning might be recommended.

In [None]:
if run_diagnostics:
    diagnostics = run_performance_diagnostics(
        history,
        desired_sparsity=constants["DESIRED_SPARSITY"],
        verbose=verbose_diagnostics,
    )
else:
    display(Markdown("**Performance diagnostics and recommendations are disabled.**"))

In [None]:
if run_diagnostics and show_recommendations:
    display_recommendations(diagnostics=diagnostics, constants=constants)