# GEMSS tier results analysis: experiment evaluation

This notebook loads and analyzes the aggregated results from the tiered experiments. 
It reads the `tier_summary_metrics.csv` file generated by the experiment runner and provides visualizations to assess algorithm performance across different parameter configurations.

In [None]:
from gemss.experiment_assessment.experiment_results_interactive import (
    show_interactive_performance_overview,
    show_interactive_solution_comparison,
    show_interactive_comparison_with_grouping,
    show_interactive_heatmap,
    show_interactive_si_asi_comparison,
)
from gemss.experiment_assessment.experiment_results_analysis import (
    DEFAULT_AGGREGATION_FUNC,
    DEFAULT_METRIC,
    get_all_experiment_results,
    choose_best_solution_per_group,
)

## Select tiers and load data

Specify the Tier IDs you want to analyze. The code assumes your results are stored in `../../scripts/results/tier{ID}/tier_summary_metrics.csv`.
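
Before loading, it can help to confirm that each tier's summary file actually exists. The following is a minimal sketch built only on the path pattern described above; `tier_summary_path` and `find_missing_tiers` are hypothetical helpers and not part of `gemss`, and the results root is assumed to be relative to the notebook's working directory.

```python
from pathlib import Path

# Assumption: the results root is resolved relative to the notebook's
# working directory, following the path pattern described above.
RESULTS_ROOT = Path("../../scripts/results")


def tier_summary_path(tier_id: int, root: Path = RESULTS_ROOT) -> Path:
    """Build the expected path of one tier's summary file (hypothetical helper)."""
    return root / f"tier{tier_id}" / "tier_summary_metrics.csv"


def find_missing_tiers(tier_ids, root: Path = RESULTS_ROOT) -> list:
    """Return the tier IDs whose summary file is not present on disk."""
    return [t for t in tier_ids if not tier_summary_path(t, root).is_file()]


missing = find_missing_tiers([1, 2, 3, 4, 5, 6, 7])
if missing:
    print(f"No summary file found for tiers: {missing}")
```

Any tier reported here can be dropped from `tier_id_list` or regenerated with the experiment runner before continuing.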

In [None]:
tier_id_list = [1, 2, 3, 4, 5, 6, 7]
df = get_all_experiment_results(tier_id_list, verbose=True)

## 1. Quick performance summary

In [None]:
show_interactive_performance_overview(
    df,
    group_identifier="TIER_ID",
    show_metric_thresholds=True,
)

## 2. Comparison of solution types

In [None]:
best_solutions_per_tier = choose_best_solution_per_group(
    df,
    group_identifier="TIER_ID",
    metric=DEFAULT_METRIC,
    aggregation_func=DEFAULT_AGGREGATION_FUNC,
    verbose=True,
)

Explore how different experimental parameters affect the algorithm's performance, with all solution types shown in one plot.

**Instructions:**
1. Select the **Metric** (e.g., `Success_Index` or `Recall`).
2. Select the **X-Axis** parameter (e.g., `N_SAMPLES` or `NOISE_STD`).

In [None]:
show_interactive_solution_comparison(df)

## 3. Comparison with grouping

Explore how different experimental parameters affect the performance of a single solution type, optionally split by a grouping parameter.

**Instructions:**
1. Select the **Solution Type** (e.g., `outlier_STD_2.5` is usually recommended for unknown sparsity).
2. Select the **Metric** (e.g., `Success_Index` or `Recall`).
3. Select the **X-Axis** parameter (e.g., `N_SAMPLES` or `NOISE_STD`).
4. Select a **Grouping** parameter (color) to see interaction effects (e.g., `SPARSITY`).

In [None]:
show_interactive_comparison_with_grouping(df)

## 4. Heatmap of a metric vs. 2 features

Use the heatmap to compare a metric with respect to two parameters at once.

In [None]:
show_interactive_heatmap(df)

## 5. Trade-off analysis: Success Index (quantity) vs. Adjusted SI (quality)

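Assess the "selectivity" of your solution. A high SI combined with a low ASI means the solution is noisy (it selects too many features). Because $\text{ASI} = \text{Precision} \times \text{SI} \le \text{SI}$, no solution can lie above the diagonal, and a solution with perfect precision lies exactly on it.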

### Success Index (SI)

The Success Index rewards recall but normalizes it by the difficulty of the problem (sparsity). It is defined as:
$$
\text{SI} = \frac{\text{Recall}}{\text{Problem Sparsity}} = \frac{f_{\text{correct}} / p_{\text{generating}}}{p_{\text{generating}} / p} = \frac{p \times f_{\text{correct}}}{p_{\text{generating}}^2}
$$

- $f_{\text{correct}}$: Number of correctly identified generating features (True Positives).
- $p_{\text{generating}}$: Total number of generating features (Ground Truth size).
- $p$: Total number of features in the dataset (Search space size).
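
The definition above can be sketched as a small function. This is an illustrative helper, not the `gemss` implementation; the function name and the example numbers (100 features, 5 of them generating) are assumptions made for demonstration only.

```python
def success_index(f_correct: int, p_generating: int, p: int) -> float:
    """Compute SI = p * f_correct / p_generating**2 (recall divided by sparsity)."""
    if p_generating <= 0 or p <= 0:
        raise ValueError("p_generating and p must be positive")
    if not 0 <= f_correct <= p_generating:
        raise ValueError("f_correct must lie between 0 and p_generating")
    return p * f_correct / p_generating**2


# Hypothetical experiment: 100 features in total, 5 of them generating.
print(success_index(f_correct=5, p_generating=5, p=100))  # perfect recall
print(success_index(f_correct=2, p_generating=5, p=100))  # partial recall
print(success_index(f_correct=0, p_generating=5, p=100))  # nothing recovered
```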

#### Interpretation & Values

The SI value scales linearly with the number of correctly identified features ($f_{\text{correct}}$).
**It effectively answers:** "How many times better did we perform compared to random guessing in a sparse environment?"

**Maximum Value (Perfect Recall):** When $f_{\text{correct}} = p_{\text{generating}}$ (i.e. $\text{Recall} = 1.0$), SI equals the inverse of the problem's sparsity ratio: $\text{SI}_{\text{max}} = p / p_{\text{generating}}$.

**Minimum Value:** When $f_{\text{correct}} = 0$ (i.e. $\text{Recall} = 0.0$), $\text{SI} = 0$.

**Takeaway:** A higher maximum achievable SI indicates a harder problem. Achieving a high SI in a high-dimensional problem is a strong signal of success.

#### Performance Thresholds ("Good" vs. "Fail")

Because the raw value of SI depends heavily on the dimensionality $p$, absolute thresholds (e.g., "SI > 5 is good") are misleading across different experiments. Evaluate SI relative to the maximum possible SI for that specific experiment.
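
As a rough illustration of why a relative view is needed, the sketch below divides SI by its maximum $p / p_{\text{generating}}$ for two made-up experiments that share the same raw SI. The helper names and the numbers are assumptions for demonstration and do not come from the experiment results.

```python
def max_success_index(p: int, p_generating: int) -> float:
    """Largest achievable SI for an experiment: the inverse sparsity ratio."""
    return p / p_generating


def relative_success_index(si: float, p: int, p_generating: int) -> float:
    """SI expressed as a fraction of the maximum possible SI."""
    return si / max_success_index(p, p_generating)


# Two hypothetical experiments that both report a raw SI of 5.0.
print(relative_success_index(5.0, p=100, p_generating=5))  # 0.25
print(relative_success_index(5.0, p=20, p_generating=4))   # 1.0
```

The same raw value is a quarter of the achievable maximum in the first experiment and a perfect score in the second. Algebraically, the relative SI is simply the recall.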

### Adjusted Success Index (ASI)

The Adjusted Success Index penalizes low precision. It prevents the algorithm from "cheating" the SI by simply selecting every single feature.
$$
\text{ASI} = \text{Precision} \times \text{SI} = \frac{f_{\text{correct}}}{f} \times \frac{p \times f_{\text{correct}}}{p_{\text{generating}}^2} = \frac{p \times f_{\text{correct}}^2}{f \times p_{\text{generating}}^2}
$$

- $f$: Total number of features selected by the algorithm.
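
A minimal sketch of the ASI formula follows, including the "select everything" case that ASI is meant to penalize. The function is an illustrative helper rather than the `gemss` implementation, the example numbers are made up, and returning 0.0 when no feature is selected is an assumption of this sketch.

```python
def adjusted_success_index(f_correct: int, f: int, p_generating: int, p: int) -> float:
    """Compute ASI = precision * SI from raw feature counts."""
    if p_generating <= 0 or p <= 0:
        raise ValueError("p_generating and p must be positive")
    if f == 0:
        # Assumption of this sketch: an empty selection scores 0.
        return 0.0
    precision = f_correct / f
    si = p * f_correct / p_generating**2
    return precision * si


# Hypothetical experiment: 100 features in total, 5 of them generating.
exact = adjusted_success_index(f_correct=5, f=5, p_generating=5, p=100)
everything = adjusted_success_index(f_correct=5, f=100, p_generating=5, p=100)
print(exact, everything)
```

Selecting all 100 features keeps recall perfect, so SI stays at its maximum, but the precision of 0.05 pulls ASI down to about 1.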

#### Interpretation & Values

ASI balances the difficulty of finding the needle in the haystack (SI) with the efficiency of the search (Precision).

**Maximum Value (Perfect Recall & Precision):** When Recall = 1.0 AND Precision = 1.0 (perfect recovery of exactly the generating set):
$$\text{ASI}_{\text{max}} = \text{SI}_{\text{max}} = \frac{p}{p_{\text{generating}}}$$

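**Impact of False Positives:** If the algorithm achieves perfect recall but selects extra features (poor precision), ASI falls below SI by exactly the precision factor. For example, selecting twice as many features as the generating set with perfect recall gives Precision = 0.5 and therefore halves ASI relative to SI.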


#### Performance Thresholds

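**Excellent (Perfect Recovery):** $\text{ASI} \approx \text{SI}_{\text{max}}$ (Recall $\approx 1.0$ and Precision $\approx 1.0$).

**Good (High Recall, Acceptable Noise):** $\text{ASI} \approx 0.5 \times \text{SI}_{\text{max}}$. Since $\text{ASI} / \text{SI}_{\text{max}} = \text{Precision} \times \text{Recall}$, this corresponds to, e.g., Recall = 1.0 with Precision = 0.5, or Recall = 0.8 with Precision = 0.6 (product 0.48).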

**Fail (Noisy or Missed):** $\text{ASI} < 0.1 \times \text{SI}_{\text{max}}$.
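
One possible way to turn these bands into labels is sketched below. Only the 0.1 "fail" limit is taken directly from the text above. The cutoffs of 0.9 for "excellent" and 0.45 for "good" are assumptions chosen to interpret the "$\approx$" signs, and the "intermediate" label for the remaining range is also an assumption, since that range is not named above.

```python
def classify_asi(
    asi: float,
    si_max: float,
    excellent_cutoff: float = 0.9,  # assumption: reading of "ASI ~ SI_max"
    good_cutoff: float = 0.45,      # assumption: reading of "ASI ~ 0.5 * SI_max"
    fail_cutoff: float = 0.1,       # from the thresholds above
) -> str:
    """Label a result by its ASI relative to the maximum achievable SI."""
    if si_max <= 0:
        raise ValueError("si_max must be positive")
    ratio = asi / si_max  # equals precision * recall
    if ratio < fail_cutoff:
        return "fail"
    if ratio >= excellent_cutoff:
        return "excellent"
    if ratio >= good_cutoff:
        return "good"
    return "intermediate"  # assumption: band not named in the thresholds


# Hypothetical results for an experiment with SI_max = 20.
for asi in (20.0, 9.6, 5.0, 1.0):
    print(asi, classify_asi(asi, si_max=20.0))
```

The cutoffs are exposed as parameters so they can be adjusted to whatever tolerance suits a given set of experiments.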

In [None]:
show_interactive_si_asi_comparison(df)