# GEMSS Tier Results Analysis

This notebook loads and analyzes the aggregated results from the tiered experiments. 
It reads the `tier_summary_metrics.csv` file generated by the experiment runner and provides visualizations to assess algorithm performance across different parameter configurations.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
from IPython.display import display, Markdown
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed

from gemss.diagnostics.experiment_results_visualizations import (
    plot_solution_grouped,
    plot_solution_comparison,
    plot_si_asi_scatter,
    analyze_metric_results,
    plot_heatmap,
    plot_metric_vs_hyperparam,
)

## Select tiers and load data

Specify the Tier IDs you want to analyze. The code assumes your results are stored in `../scripts/results/tier{ID}/tier_summary_metrics.csv`.
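
Before loading, it can help to see which of the requested tiers actually have a summary file on disk. The following is a minimal sketch under the path layout stated above; `find_tier_results` and its `results_root` argument are hypothetical helpers introduced here for illustration, not part of `gemss`.

```python
from pathlib import Path


def find_tier_results(tier_ids, results_root="../scripts/results"):
    """Split tier IDs into those with and without a summary CSV on disk.

    Hypothetical helper: assumes the layout
    {results_root}/tier{ID}/tier_summary_metrics.csv described above.
    """
    found, missing = {}, []
    for tier_id in tier_ids:
        path = Path(results_root) / f"tier{tier_id}" / "tier_summary_metrics.csv"
        if path.is_file():
            found[tier_id] = path
        else:
            missing.append(tier_id)
    return found, missing


found, missing = find_tier_results([1, 2, 3])
print("available tiers:", sorted(found))
print("missing tiers:", missing)
```

Tiers reported as missing can then be dropped from `tier_id_list`, or their experiments run first.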

In [None]:
# tier_id_list = [3]
# tier_id_list = [7]
# tier_id_list = [1, 2, 3, 4]
tier_id_list = [1, 2, 3, 4, 5, 6, 7]

In [None]:
# Identify metric columns (those containing the base name of coverage metrics)
# List synchronized with keys returned by calculate_coverage_metrics in run_experiment.py
# All coverage metrics are numeric (possibly None)
COVERAGE_METRICS = [
    "Recall",
    "Precision",
    "F1_Score",
    "Jaccard",
    "Miss_Rate",
    "FDR",
    "Global_Miss_Rate",
    "Global_FDR",
    "Success_Index",
    "Adjusted_Success_Index",
]

SOLUTION_OPTIONS = [
    "full",
    "top",
    "outlier_STD_2.0",
    "outlier_STD_2.5",
    "outlier_STD_3.0",
]

potential_params = [
    "N_SAMPLES",
    "N_FEATURES",
    "SAMPLE_VS_FEATURE_RATIO",
    "SPARSITY",
    "N_GENERATING_SOLUTIONS",
    "N_CANDIDATE_SOLUTIONS",
    "NOISE_STD",
    "NAN_RATIO",
    "LAMBDA_JACCARD",
    "BINARY_RESPONSE_RATIO",
    "BATCH_SIZE",
]

In [None]:
tier_frames = []
metric_cols = []  # metric columns seen in any loaded tier (defined even if no file is found)
for tier_id in tier_id_list:

    results_path = f"../scripts/results/tier{tier_id}/tier_summary_metrics.csv"

    if not os.path.exists(results_path):
        print(f"ERROR: File not found at {results_path}")
        print("Please run the experiments for this tier first, or check the path.")
        continue

    df_tier = pd.read_csv(results_path)
    print(f"Successfully loaded {len(df_tier)} experiment records from Tier {tier_id}.")

    # Ensure metric columns are actually numeric
    tier_metric_cols = [
        c for c in df_tier.columns if any(x in c for x in COVERAGE_METRICS)
    ]
    for col in tier_metric_cols:
        df_tier[col] = pd.to_numeric(df_tier[col], errors="coerce")
    # Keep the union of metric columns across tiers (order preserved, no duplicates)
    metric_cols += [c for c in tier_metric_cols if c not in metric_cols]

    # Add TIER_ID column
    df_tier["TIER_ID"] = int(tier_id)

    # Add EXPERIMENT_ID column: {tier_id}.{experiment_number_in_tier}
    df_tier["EXPERIMENT_ID"] = str(tier_id) + "." + (df_tier.index + 1).astype(str)

    tier_frames.append(df_tier)

# Concatenate once instead of growing the DataFrame inside the loop
df = pd.concat(tier_frames, ignore_index=True) if tier_frames else pd.DataFrame()

# Add the "SAMPLE_VS_FEATURE_RATIO" column (skipped if nothing was loaded)
if {"N_SAMPLES", "N_FEATURES"}.issubset(df.columns):
    df["SAMPLE_VS_FEATURE_RATIO"] = df["N_SAMPLES"] / df["N_FEATURES"]

## Data Preview

In [None]:
display(Markdown(f"### All results for tiers: {tier_id_list}"))
display(df)

display(Markdown("### Available Metrics"))
print(f"Found {len(metric_cols)} metric columns.")

display(Markdown("### Varied Parameters"))

# Identify which parameters actually vary in this dataset
varied_params = [p for p in potential_params if p in df.columns and df[p].nunique() > 1]
unvaried_params = [
    p for p in potential_params if p in df.columns and p not in varied_params
]
display(Markdown(f"Parameters that vary across the selected tiers:\n {varied_params}"))

In [None]:
# get the df pivoted by solution type
df_pivot_solution = pd.DataFrame()
for solution in SOLUTION_OPTIONS:
    solution_cols = [col for col in df.columns if solution in col]
    df_solution = df[["TIER_ID"] + varied_params + solution_cols].copy()
    df_solution.rename(
        columns={col: col.replace(f"{solution}_", "") for col in solution_cols},
        inplace=True,
    )
    df_solution["solution_type"] = solution
    df_pivot_solution = pd.concat([df_pivot_solution, df_solution], ignore_index=True)

## 1. Quick performance summary

In [None]:
# Thresholds for performance evaluation:

from gemss.diagnostics.experiment_results_visualizations import THRESHOLDS_FOR_METRIC

df_thresholds = pd.DataFrame()
for metric, thresholds in THRESHOLDS_FOR_METRIC.items():
    if thresholds is not None:
        df_thresholds[metric] = pd.Series(thresholds)

display(Markdown("#### Performance thresholds for selected metrics"))
display(df_thresholds)

In [None]:
interact(
    analyze_metric_results,
    df=fixed(df),
    tier=widgets.SelectMultiple(
        options=tier_id_list,
        value=tier_id_list,
        description="Tier:",
    ),
    solution_type=widgets.Dropdown(
        options=sorted(SOLUTION_OPTIONS),
        value="outlier_STD_2.0",
        description="Solution:",
    ),
    metric_name=widgets.Dropdown(
        options=sorted(["Recall", "Precision", "F1_Score"]),
        value="Recall",
        description="Metric:",
    ),
    thresholds=fixed(None),
)

## 2. Comparison of solution types

In [None]:
display(Markdown("### Average values for selected metrics"))
display(
    df_pivot_solution[["TIER_ID", "solution_type"] + COVERAGE_METRICS]
    .groupby(["TIER_ID", "solution_type"])
    .mean()
)

Explore how different experimental parameters affect the algorithm's performance, with all solution types shown in a single plot.

**Instructions:**
1. Select the **Metric** (e.g., `Success_Index` or `Recall`).
2. Select the **X-Axis** parameter (e.g., `N_SAMPLES` or `NOISE_STD`).

In [None]:
interact(
    plot_solution_comparison,
    df=fixed(df),
    tier=widgets.SelectMultiple(
        options=tier_id_list,
        value=tier_id_list,
        description="Tier:",
    ),
    solution_types=fixed(SOLUTION_OPTIONS),
    metric_name=widgets.Dropdown(
        options=COVERAGE_METRICS, value="Success_Index", description="Metric:"
    ),
    x_axis=widgets.Dropdown(
        options=varied_params,
        value="N_FEATURES" if "N_FEATURES" in varied_params else varied_params[0],
        description="X-Axis:",
    ),
    hover_params=fixed(varied_params + unvaried_params),
)

## 3. Comparison with grouping

Explore how different experimental parameters affect the algorithm's performance.

**Instructions:**
1. Select the **Solution Type** (e.g., `outlier_STD_2.5` is usually recommended for unknown sparsity).
2. Select the **Metric** (e.g., `Success_Index` or `Recall`).
3. Select the **X-Axis** parameter (e.g., `N_SAMPLES` or `NOISE_STD`).
4. Select a **Grouping** parameter (color) to see interaction effects (e.g., `SPARSITY`).

In [None]:
interact(
    plot_solution_grouped,
    df=fixed(df),
    tier=widgets.SelectMultiple(
        options=tier_id_list,
        value=tier_id_list,
        description="Tier:",
    ),
    solution_type=widgets.Dropdown(
        options=SOLUTION_OPTIONS,
        value="outlier_STD_2.5",
        description="Solution:",
    ),
    metric_name=widgets.Dropdown(
        options=COVERAGE_METRICS, value="Success_Index", description="Metric:"
    ),
    x_axis=widgets.Dropdown(
        options=varied_params,
        value="N_FEATURES" if "N_FEATURES" in varied_params else varied_params[0],
        description="X-Axis:",
    ),
    color_by=widgets.Dropdown(
        options=["None"] + varied_params,
        value="NAN_RATIO" if "NAN_RATIO" in varied_params else "None",
        description="Group By:",
    ),
    hover_params=fixed(varied_params + unvaried_params),
)

## 4. Heatmap of a metric vs. two parameters

Heatmap visualizations that compare a metric with respect to two experimental parameters.
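
To make the idea concrete, the sketch below builds the kind of two-parameter grid a heatmap displays, using `pandas.DataFrame.pivot_table` on a small made-up table. This is only an illustration of the aggregation and is not necessarily how `plot_heatmap` works internally. The toy values and the bare column name `Success_Index` are assumptions; the real result columns also carry the solution type in their names.

```python
import pandas as pd

# Toy data (invented for illustration, not experiment results)
toy = pd.DataFrame(
    {
        "N_FEATURES": [100, 100, 200, 200, 200],
        "SPARSITY": [0.05, 0.10, 0.05, 0.10, 0.10],
        "Success_Index": [16.0, 8.0, 12.0, 6.0, 4.0],
    }
)

# One cell per (SPARSITY, N_FEATURES) pair; repeated pairs are averaged
grid = toy.pivot_table(
    index="SPARSITY",
    columns="N_FEATURES",
    values="Success_Index",
    aggfunc="mean",
)
print(grid)
```

Each cell of `grid` holds the mean metric value for one combination of the two parameters, which is what the color scale of a heatmap would encode.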

In [None]:
interact(
    plot_heatmap,
    df=fixed(df),
    tier=widgets.SelectMultiple(
        options=tier_id_list,
        value=tier_id_list,
        description="Tier:",
    ),
    solution_type=widgets.Dropdown(
        options=sorted(SOLUTION_OPTIONS),
        value="outlier_STD_2.0",
        description="Solution:",
    ),
    metric_name=widgets.Dropdown(
        options=sorted(COVERAGE_METRICS),
        value="Success_Index",
        description="Metric:",
    ),
    x_axis=widgets.Dropdown(
        options=varied_params + unvaried_params,
        value="N_FEATURES" if "N_FEATURES" in varied_params else varied_params[0],
        description="X-Axis:",
    ),
    y_axis=widgets.Dropdown(
        options=varied_params,
        value="SPARSITY" if "SPARSITY" in varied_params else varied_params[0],
        description="Y-Axis:",
    ),
)

## 5. Effect of hyperparameters

Compare the results of varying the hyperparameters `LAMBDA_JACCARD` and `BATCH_SIZE` within the relevant tiers.

In [None]:
hyperparam_list = ["LAMBDA_JACCARD", "BATCH_SIZE"]
select_metrics = [
    "Recall",
    "Precision",
    "F1_Score",
    "Jaccard",
]
# select only those columns that contain one of the select_metrics
select_metric_cols = [
    col for col in metric_cols if any(m in col for m in select_metrics)
]
for hyperparam in hyperparam_list:
    if hyperparam not in df.columns:
        continue  # hyperparameter was not recorded in the loaded tiers
    for tier in tier_id_list:
        if df[df["TIER_ID"] == tier][hyperparam].nunique() > 1:
            df_grouped = (
                df[df["TIER_ID"] == tier].groupby(hyperparam)[select_metric_cols].mean()
            )
            display(Markdown(f"### Effect of **{hyperparam}**: Tier {tier}"))
            display(df_grouped)

            plot_metric_vs_hyperparam(
                df_grouped=df_grouped,
                hyperparam=hyperparam,
                solution_options=SOLUTION_OPTIONS,
            )

## 6. Trade-off Analysis: Success Index (Quantity) vs. Adjusted SI (Quality)

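Assess the "selectivity" of your solution. A high SI with a low ASI means the solution is noisy (it selects too many features). Because ASI = Precision × SI, every point satisfies ASI <= SI, so points lie on or below the diagonal; a solution with perfect precision lies exactly on the diagonal.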

### Success Index (SI)

The Success Index rewards recall but normalizes it by the difficulty of the problem (sparsity). It is defined as:
$$
\text{SI} = \frac{\text{Recall}}{\text{Problem Sparsity}} = \frac{f_{\text{correct}} / p_{\text{generating}}}{p_{\text{generating}} / p} = \frac{p \times f_{\text{correct}}}{p_{\text{generating}}^2}
$$

- $f_{\text{correct}}$: Number of correctly identified generating features (True Positives).
- $p_{\text{generating}}$: Total number of generating features (Ground Truth size).
- $p$: Total number of features in the dataset (Search space size).
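
As a worked example of the definition above, the following sketch evaluates SI for one hypothetical experiment. The function name `success_index` and the example numbers are illustrative assumptions; the formula is the closed form given above.

```python
def success_index(f_correct, p_generating, p):
    """SI = Recall / sparsity = p * f_correct / p_generating**2."""
    return p * f_correct / p_generating**2


# Hypothetical experiment: 100 features, 5 generating features, 4 recovered
si = success_index(f_correct=4, p_generating=5, p=100)
print(f"SI = {si:.2f}")  # recall 0.8 divided by sparsity 0.05 gives 16.00
```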

#### Interpretation & Values

The SI value scales linearly with the number of correctly identified features ($f_{\text{correct}}$).
**It effectively answers:** "How many times better did we perform compared to random guessing in a sparse environment?"

**Maximum Value (Perfect Recall):** When $f_{\text{correct}} = p_{\text{generating}}$ (Recall = 1.0), SI equals the inverse of the problem's sparsity ratio:
$$\text{SI}_{\text{max}} = \frac{p}{p_{\text{generating}}}$$

**Takeaway:** A higher maximum achievable SI indicates a harder problem. Achieving a high SI in a high-dimensional problem is a strong signal of success.

**Minimum Value:** When $f_{\text{correct}} = 0$ (Recall = 0.0):
$$\text{SI} = 0$$

#### Performance Thresholds ("Good" vs. "Fail")

Because the raw value of SI depends heavily on the dimensionality $p$, absolute thresholds (e.g., "SI > 5 is good") are misleading across different experiments. Evaluate SI relative to the maximum possible SI for that specific experiment.

**Good Performance:** Recall $\ge 0.8$.

$$\text{SI} \ge 0.8 \times \frac{p}{p_{\text{generating}}}$$

**Acceptable / Moderate:** Recall $\approx 0.5$.

$$\text{SI} \approx 0.5 \times \frac{p}{p_{\text{generating}}}$$

**Fail:** Recall $< 0.2$.

$$\text{SI} < 0.2 \times \frac{p}{p_{\text{generating}}}$$
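
The relative evaluation can be sketched as follows: dividing SI by its maximum $p / p_{\text{generating}}$ recovers the recall, which is then compared against the cut-offs above. The helper `classify_si` is hypothetical, and labeling the whole band between 0.2 and 0.8 as "moderate" is an assumption, since the text only names the value 0.5 for that category.

```python
def classify_si(si, p_generating, p):
    """Label an SI value relative to the maximum achievable SI."""
    si_max = p / p_generating
    relative = si / si_max  # equals the recall
    if relative >= 0.8:
        return "good"
    if relative < 0.2:
        return "fail"
    return "moderate"  # assumption: everything in between counts as moderate


# Hypothetical experiment with p = 100 and p_generating = 5 (SI_max = 20)
for si in (18.0, 10.0, 2.0):
    print(si, classify_si(si, p_generating=5, p=100))
```

The same SI of 10 would be labeled differently in an experiment with another $p$ or $p_{\text{generating}}$, which is why the thresholds are stated relative to the maximum.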



### Adjusted Success Index (ASI)

The Adjusted Success Index adds a penalty for low precision. It prevents the algorithm from "cheating" the SI by simply selecting every single feature.
$$
\text{ASI} = \text{Precision} \times \text{SI} = \frac{f_{\text{correct}}}{f} \times \frac{p \times f_{\text{correct}}}{p_{\text{generating}}^2} = \frac{p \times f_{\text{correct}}^2}{f \times p_{\text{generating}}^2}
$$
- $f$: Total number of features selected by the algorithm.

#### Interpretation & Values

ASI balances the difficulty of finding the needle in the haystack (SI) with the efficiency of the search (Precision).

**Maximum Value (Perfect Recall & Precision):** When Recall = 1.0 AND Precision = 1.0 (perfect recovery of exactly the generating set):
$$\text{ASI}_{\text{max}} = \text{SI}_{\text{max}} = \frac{p}{p_{\text{generating}}}$$

**Impact of False Positives:** If the algorithm achieves perfect recall but selects too many extra features (poor precision), ASI drops significantly compared to SI.
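
The effect of false positives can be illustrated numerically. The sketch below keeps recall perfect and only increases the number of selected features. The function name, the parameter name `f_selected` (standing in for $f$), and the example numbers are illustrative assumptions; returning 0 for an empty selection is also an assumption, since the formula is undefined when $f = 0$.

```python
def adjusted_success_index(f_correct, f_selected, p_generating, p):
    """ASI = Precision * SI, with SI = p * f_correct / p_generating**2."""
    if f_selected == 0:
        return 0.0  # assumption: an empty selection scores zero
    precision = f_correct / f_selected
    si = p * f_correct / p_generating**2
    return precision * si


# Hypothetical experiment: p = 100, p_generating = 5, all 5 recovered (SI = 20)
p, p_generating = 100, 5
for f_selected in (5, 10, 50):
    asi = adjusted_success_index(5, f_selected, p_generating, p)
    print(f"selected={f_selected:3d}  ASI={asi:.1f}")
```

With SI fixed at 20, ASI falls in proportion to precision as the selection grows from 5 to 10 to 50 features.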


#### Performance Thresholds

**Excellent (Perfect Recovery):** $\text{ASI} \approx \text{SI}_{\text{max}}$ (Recall $\approx 1.0$ and Precision $\approx 1.0$).

**Good (High Recall, Acceptable Noise):** $\text{ASI} \approx 0.5 \times \text{SI}_{\text{max}}$ (e.g., Recall=1.0, Precision=0.5 OR Recall=0.8, Precision=0.6).

**Fail (Noisy or Missed):** $\text{ASI} < 0.1 \times \text{SI}_{\text{max}}$.

In [None]:
interact(
    plot_si_asi_scatter,
    df=fixed(df),
    tier=widgets.SelectMultiple(
        options=tier_id_list,
        value=tier_id_list,
        description="Tier:",
    ),
    solution_type=widgets.Dropdown(
        options=sorted(SOLUTION_OPTIONS),
        value="outlier_STD_2.0",
        description="Solution:",
    ),
    color_by=widgets.Dropdown(
        options=["None"] + varied_params,
        value="NOISE_STD" if "NOISE_STD" in varied_params else "None",
        description="Color By:",
    ),
    hover_params=fixed(varied_params),
)