# Group plot algorithm — methods reference

This notebook is a **standalone narrative description** of the algorithm used for the Grouped plot type. It reproduces the exact logic in `group_plot.py` using pandas and numpy only. The pipeline:

1. **Pre-filter** the master (raw) dataframe by column selections.
2. **Extract the y column** with optional transformations (coercion to numeric, optional absolute value, optional remove-values filter).
3. **Group by a group column** and compute the chosen y-stat (e.g. mean) per group.
4. **Full stats table**: for one group column and one y column, compute count, min, max, mean, std, sem, and CV per group.

All intermediate results are printed so you can follow the algorithm step by step.

## Setup

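Imports, path to the example CSV, and algorithm parameters. The notebook is meant to be run with the current working directory set to the **nicewidgets project root** (the directory containing `data/` and `notebooks/`). If `data/kym_event_report.csv` is not found there, the cell falls back to `../data`, so running from inside `notebooks/` also works.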

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
from typing import Any, Optional

# Path to example data (run from nicewidgets project root)
DATA_DIR = Path("data")
if not (DATA_DIR / "kym_event_report.csv").exists():
    DATA_DIR = Path("../data")
CSV_PATH = DATA_DIR / "kym_event_report.csv"
assert CSV_PATH.exists(), f"Example CSV not found: {CSV_PATH}"

# Sentinel for "no filter" (matches group_plot.PRE_FILTER_NONE)
PRE_FILTER_NONE = "(none)"

# Algorithm parameters (same as group_plot.py __main__)
PRE_FILTER_COLUMNS = ["roi_id"]
UNIQUE_ROW_ID_COL = "kym_event_id"
PRE_FILTER = {"roi_id": PRE_FILTER_NONE}  # no filter = use all rows
GROUP_COL = "event_type"
YCOL = "score_peak"
YSTAT = "mean"

## Algorithm functions (from group_plot.py)

The following cells define the same functions as in `group_plot.py`, so this notebook is self-contained and does not require importing the package.

### Step 1: Pre-filter

Filter the master dataframe by pre-filter column selections. For each column in `pre_filter_columns`, if the selection is not `PRE_FILTER_NONE`, keep only rows where `df[col].astype(str) == str(selection)`. Selections are ANDed across columns. Rows with missing `unique_row_id_col` are then dropped. (Same logic as `DataFrameProcessor.filter_by_pre_filters()`.)

In [None]:
def filter_by_pre_filters(
    df: pd.DataFrame,
    pre_filter_columns: list[str],
    selections: dict[str, Any],
    unique_row_id_col: str,
) -> pd.DataFrame:
    df_f = df.copy()
    for col in pre_filter_columns:
        val = selections.get(col, PRE_FILTER_NONE)
        if val is None or val == PRE_FILTER_NONE:
            continue
        df_f = df_f[df_f[col].astype(str) == str(val)]
    df_f = df_f.dropna(subset=[unique_row_id_col])
    return df_f
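
To make the string comparison and the AND across columns concrete, here is a minimal sketch on toy data. The dataframe values and the `selections` dict are hypothetical and are not taken from `kym_event_report.csv`; the loop mirrors the logic of the function above rather than calling it.

```python
import pandas as pd

PRE_FILTER_NONE = "(none)"  # same sentinel as in the Setup cell

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "roi_id": [1, 1, 2, 2],
    "event_type": ["a", "b", "a", "b"],
    "kym_event_id": ["e1", None, "e3", "e4"],
})

# Filter on roi_id only; event_type is left unfiltered
selections = {"roi_id": "1", "event_type": PRE_FILTER_NONE}

mask = pd.Series(True, index=toy.index)
for col, val in selections.items():
    if val is None or val == PRE_FILTER_NONE:
        continue  # sentinel means "no filter on this column"
    # Both sides are compared as strings, so integer 1 matches "1"
    mask &= toy[col].astype(str) == str(val)

# Drop rows whose unique row ID is missing
toy_f = toy[mask].dropna(subset=["kym_event_id"])
print(toy_f)
```

Two rows have `roi_id` 1, but one of them has no `kym_event_id`, so a single row remains.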

### Step 2: Extract y values

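Get the y column as a numeric series. The function converts the column to numeric (entries that cannot be parsed become NaN), optionally applies `abs()`, and optionally sets values outside `[-threshold, +threshold]` to NaN. When both options are enabled, the absolute value is taken first, so the threshold acts as an upper bound on the magnitude. (Same logic as `DataFrameProcessor.get_y_values()`.)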

In [None]:
def get_y_values(
    df_f: pd.DataFrame,
    ycol: str,
    use_absolute: bool = False,
    use_remove_values: bool = False,
    remove_values_threshold: Optional[float] = None,
) -> pd.Series:
    y = pd.to_numeric(df_f[ycol], errors="coerce")
    if use_absolute:
        y = y.abs()
    if use_remove_values and remove_values_threshold is not None:
        # mask() returns a new Series, so the caller's dataframe is never edited in place
        y = y.mask((y < -remove_values_threshold) | (y > remove_values_threshold))
    return y
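
The three transformations can be traced on a small series. This is a sketch under assumed inputs: the raw values and the threshold of 5.0 are hypothetical, and the steps are written inline instead of calling `get_y_values`.

```python
import pandas as pd

# Hypothetical raw column with a non-numeric entry and a missing value
raw = pd.Series(["1.5", "-4", "oops", None, "12"])

y = pd.to_numeric(raw, errors="coerce")    # "oops" and None become NaN
y_abs = y.abs()                            # optional absolute value
threshold = 5.0                            # hypothetical remove-values threshold
y_kept = y_abs.where(y_abs <= threshold)   # magnitudes above 5 become NaN

print(pd.DataFrame({"raw": raw, "numeric": y, "abs": y_abs, "kept": y_kept}))
```

Only `1.5` and `4.0` survive all three steps; the unparseable entry, the missing entry, and the out-of-range `12` all end up as NaN and are ignored by the later group statistics.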

### Step 3: Group and aggregate (single stat)

Compute the aggregated stat per group: build a temporary frame with columns `[group, y]`, then group by `group` and apply the chosen y-stat. `count`, `cv`, and `sem` are handled explicitly; any other `ystat` (e.g. `mean`, `std`) is looked up by name as a method on the groupby object. Returns a Series with index = group labels, values = aggregated y. (Same logic as `_figure_grouped()`.)

In [None]:
def grouped_aggregate(
    df_f: pd.DataFrame,
    group_col: str,
    ycol: str,
    ystat: str,
    use_absolute: bool = False,
    use_remove_values: bool = False,
    remove_values_threshold: Optional[float] = None,
    cv_epsilon: float = 1e-10,
) -> pd.Series:
    g = df_f[group_col].astype(str)
    y = get_y_values(df_f, ycol, use_absolute=use_absolute,
                     use_remove_values=use_remove_values,
                     remove_values_threshold=remove_values_threshold)
    tmp = pd.DataFrame({"group": g, "y": y}).dropna(subset=["group"])
    if ystat == "count":
        return tmp.groupby("group", dropna=False)["y"].count()
    tmp["y"] = pd.to_numeric(tmp["y"], errors="coerce")
    if ystat == "cv":
        grp = tmp.groupby("group", dropna=False)["y"]
        mean_ = grp.mean()
        std_ = grp.std(ddof=1)
        cv = std_ / mean_
        return cv.where(np.abs(mean_) >= cv_epsilon, np.nan)
    if ystat == "sem":
        return tmp.groupby("group", dropna=False)["y"].sem(ddof=1)
    return getattr(tmp.groupby("group", dropna=False)["y"], ystat)()
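
The `cv` branch guards against division by a near-zero mean. A minimal sketch on hypothetical toy data shows the three cases: a regular group, a group with mean zero, and a group with a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "b" has mean 0, group "c" has one value
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c"],
    "y": [1.0, 2.0, 3.0, -1.0, 1.0, 5.0],
})
grp = toy.groupby("group")["y"]
mean_ = grp.mean()
std_ = grp.std(ddof=1)   # sample standard deviation; NaN for a single value

cv_epsilon = 1e-10
# Keep std/mean only where |mean| is at least cv_epsilon, otherwise NaN
cv = (std_ / mean_).where(mean_.abs() >= cv_epsilon, np.nan)
print(cv)
```

Group `a` gets a CV of 0.5 (std 1.0 over mean 2.0). Group `b` would otherwise produce an infinite ratio and is replaced by NaN, and group `c` is NaN because the sample standard deviation of one value is undefined.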

### Step 4: Full stats table

For one group column and one y column, compute a full stats table per group: **count, min, max, mean, std, sem, CV**. Preprocessing is the same as in `get_y_values`. Both std and sem use `ddof=1`; CV is std/mean and is set to NaN when |mean| < `cv_epsilon`.

In [None]:
def grouped_full_stats_table(
    df_f: pd.DataFrame,
    group_col: str,
    ycol: str,
    use_absolute: bool = False,
    use_remove_values: bool = False,
    remove_values_threshold: Optional[float] = None,
    cv_epsilon: float = 1e-10,
) -> pd.DataFrame:
    g = df_f[group_col].astype(str)
    y = get_y_values(df_f, ycol, use_absolute=use_absolute,
                     use_remove_values=use_remove_values,
                     remove_values_threshold=remove_values_threshold)
    tmp = pd.DataFrame({"group": g, "y": y}).dropna(subset=["group"])
    tmp["y"] = pd.to_numeric(tmp["y"], errors="coerce")
    grp = tmp.groupby("group", dropna=False)["y"]
    count = grp.count()
    min_ = grp.min()
    max_ = grp.max()
    mean_ = grp.mean()
    std_ = grp.std(ddof=1)
    sem_ = grp.sem(ddof=1)
    cv_ = (std_ / mean_).where(np.abs(mean_) >= cv_epsilon, np.nan)
    return pd.DataFrame({"count": count, "min": min_, "max": max_, "mean": mean_, "std": std_, "sem": sem_, "cv": cv_})
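
As a sanity check on how the columns relate, the sketch below verifies on hypothetical toy data that pandas' `sem` equals `std / sqrt(count)`, where `count` excludes NaN values. It builds the statistics inline and does not call the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the NaN in group "a" is excluded from every stat
toy = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b"],
    "y": [2.0, 4.0, 4.0, np.nan, 1.0, 3.0],
})
grp = toy.groupby("group")["y"]

stats = grp.agg(["count", "min", "max", "mean"])
stats["std"] = grp.std(ddof=1)
stats["sem"] = grp.sem(ddof=1)

# Standard error of the mean recomputed by hand from std and count
manual_sem = stats["std"] / np.sqrt(stats["count"])
print(stats)
print("sem matches std / sqrt(count):", np.allclose(stats["sem"], manual_sem))
```

Because `count` only counts non-NaN values, rows removed by coercion or by the remove-values filter reduce the effective sample size of their group.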

---
## Step 0: Master dataframe (raw)

Load the CSV. If the unique row ID column is missing but both `path` and `roi_id` are present, build it as `path|roi_id` (the schema may expect this, e.g. for radon_report_db). Below we show the shape and the first 10 rows of the columns used later (group column, y column, unique row ID).

In [None]:
df_master = pd.read_csv(CSV_PATH)
if UNIQUE_ROW_ID_COL not in df_master.columns and "path" in df_master.columns and "roi_id" in df_master.columns:
    df_master[UNIQUE_ROW_ID_COL] = df_master["path"].astype(str) + "|" + df_master["roi_id"].astype(str)

print("--- Step 0: Master dataframe (raw) ---")
print(f"Shape: {df_master.shape}")
cols = [c for c in [GROUP_COL, YCOL, UNIQUE_ROW_ID_COL] if c in df_master.columns]
print(df_master[cols].head(10).to_string(index=False))

## Step 1: After pre-filter

Apply the pre-filter: restrict rows by the selected values for each pre-filter column (here we use no filter), then drop rows with missing unique row ID. Below: shape and first rows of the key columns.

In [None]:
df_f = filter_by_pre_filters(df_master, PRE_FILTER_COLUMNS, PRE_FILTER, UNIQUE_ROW_ID_COL)

print("--- Step 1: After pre-filter ---")
print(f"Shape: {df_f.shape}")
cols = [c for c in [GROUP_COL, YCOL, UNIQUE_ROW_ID_COL] if c in df_f.columns]
print(df_f[cols].head(20).to_string(index=False))

## Step 2: Y values (raw column + computed series)

Extract the y column as a numeric series (coerced to numeric; the optional abs and remove-values steps are not used here). The output shows the raw column next to the computed series used in the rest of the pipeline.

In [None]:
y_series = get_y_values(df_f, YCOL)
step2_display = pd.DataFrame({
    GROUP_COL: df_f[GROUP_COL].values,
    YCOL: df_f[YCOL].values,
    "y (computed)": y_series.values,
})
print("--- Step 2: Y values (raw column + computed series) ---")
print(step2_display.head(20).to_string(index=False))

## Step 3: Per-row (group, y) before aggregation

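Build the temporary frame used inside the aggregation: one row per data row, with columns `group` (the group column cast to string) and `y` (numeric). The frame is then passed through `dropna(subset=["group"])`. Because the cast to string happens first, a missing group may already have become the literal string `"nan"` (this depends on the pandas version), in which case it is kept as its own group rather than dropped. This frame is the input to the groupby that produces the single-stat result and the full stats table.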

In [None]:
g = df_f[GROUP_COL].astype(str)
tmp_display = pd.DataFrame({"group": g, "y": y_series}).dropna(subset=["group"])
tmp_display["y"] = pd.to_numeric(tmp_display["y"], errors="coerce")
print("--- Step 3: Per-row (group, y) before aggregation ---")
print(tmp_display.head(20).to_string(index=False))
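
In pandas versions where `astype(str)` renders a missing value as the literal string `"nan"`, a later `dropna` on the group column no longer removes that row. If missing groups really must be excluded, one option is to select rows with a real label before converting. The sketch below is an alternative under that assumption, not what `group_plot.py` does, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one missing group label
toy = pd.DataFrame({
    "event_type": ["a", None, "b", "a"],
    "score_peak": [1.0, 2.0, 3.0, 5.0],
})

# Select rows with a real group label first, then convert to string
has_group = toy["event_type"].notna()
tmp = pd.DataFrame({
    "group": toy.loc[has_group, "event_type"].astype(str),
    "y": toy.loc[has_group, "score_peak"],
})

means = tmp.groupby("group")["y"].mean()
print(means)
```

With this ordering the row without a label never reaches the groupby, so the result contains only the groups `a` and `b`.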

## Final grouped plot data (single stat)

Complete the pipeline: run the grouped aggregation on the pre-filtered dataframe `df_f` with the chosen y-stat (e.g. mean). The result is one value per group, used for the grouped plot (x = group, y = value).

In [None]:
agg = grouped_aggregate(df_f, group_col=GROUP_COL, ycol=YCOL, ystat=YSTAT)
result_table = pd.DataFrame({"group": agg.index.astype(str), "value": agg.values})

print("--- Final grouped plot data ---")
print(f"Parameters used: group_col = {GROUP_COL!r}, ycol = {YCOL!r}, ystat = {YSTAT!r}")
print("Table (x = group, y = value):")
print(result_table.to_string(index=False))

## Full stats table

For the same group column and y column, compute all summary statistics per group: **count, min, max, mean, std, sem, CV**. This is the final methods output: one row per group, one column per stat.

Below we also print a **bulleted list of the filters** applied to the master dataframe before this analysis (one entry per pre-filter column, e.g. `roi_id`), so that the full stats table is fully specified.

In [None]:
print("Filtering applied to the master dataframe (before analysis):")
for col in PRE_FILTER_COLUMNS:
    val = PRE_FILTER.get(col, PRE_FILTER_NONE)
    if val is None or val == PRE_FILTER_NONE:
        print(f"  - {col}: (none) [no filter]")
    else:
        print(f"  - {col}: {val}")
print()

full_stats = grouped_full_stats_table(df_f, group_col=GROUP_COL, ycol=YCOL)
print(f"--- Full stats table (group_col = {GROUP_COL!r}, ycol = {YCOL!r}) ---")
print(full_stats.to_string())