feat: add probability-aware metric support across search and evaluation (#187)

marcellodebernardi merged 13 commits into `main`
Conversation
Pull request overview
This PR adds probability-aware metric support across the search, baseline validation, and final evaluation phases. It enables probability-based primary metrics (roc_auc, roc_auc_ovr, roc_auc_ovo, log_loss) by: (1) routing evaluation through predict_proba() instead of predict() when these metrics are selected, (2) making search and fallback selection direction-correct for both higher- and lower-is-better metrics, and (3) enforcing clear predictor semantics for probability outputs across all inference templates.
Changes:
- Adds `metric_requires_probabilities()` and `normalize_probability_predictions()` helpers and updates all evaluation paths (baseline validation, `_evaluate_predictor`, `BaselineBuilderAgent`) to use `predict_proba()` for probability metrics
- Adds `predict_proba()` to all five inference templates (XGBoost, LightGBM, CatBoost, PyTorch, Keras), with logit-to-probability conversion logic for Keras
- Makes `SearchJournal` direction-aware via `optimization_direction`, `selection_score()`, `sort_key()`, and `is_better()` methods; updates all sorting/selection in `TreeSearchPolicy` and `EvolutionarySearchPolicy` to respect metric direction
- Adds comprehensive unit tests covering helper routing, search direction behavior, predictor probability semantics, baseline validation, and evaluator prompt guidance
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.
Summary per file:

| File | Description |
|---|---|
| `plexe/helpers.py` | Adds `PROBABILITY_METRICS` set, `metric_requires_probabilities()`, `normalize_probability_predictions()`, and updates `_evaluate_predictor()` to use `predict_proba()` for probability metrics |
| `plexe/search/journal.py` | Adds `optimization_direction` parameter to `SearchJournal`, plus `selection_score()`, `is_better()`, `sort_key()` methods; updates serialization/deserialization |
| `plexe/search/tree_policy.py` | Updates sorting and softmax sampling to use `journal.sort_key` and `journal.selection_score` |
| `plexe/search/evolutionary_search_policy.py` | Updates all performance comparisons to use direction-aware methods; fixes `should_stop` for lower-is-better metrics |
| `plexe/workflow.py` | Sets `journal.optimization_direction` from `context.metric` during checkpoint restore and fresh journal creation; fixes fallback sort to use `journal.sort_key` |
| `plexe/tools/submission.py` | Updates `validate_baseline_predictor` and `evaluate_baseline_performance` to use `predict_proba()` for probability metrics |
| `plexe/agents/baseline_builder.py` | Adds `proba_note` and `proba_requirement` prompt guidance; updates `_evaluate_performance` for probability metrics |
| `plexe/agents/model_evaluator.py` | Updates `_build_agent` and `_get_phase_1_prompt` with probability metric guidance |
| `plexe/templates/inference/*.py` | Adds `predict_proba()` to XGBoost, LightGBM, CatBoost, PyTorch templates; refactors Keras predictor to use `_probabilities_from_raw()` shared helper |
| `tests/unit/...` | New tests for all of the above plus direction-aware journal/policy behavior |
| `Makefile` | Adds `run-titanic-proba` example target |
| `plexe/CODE_INDEX.md`, `tests/CODE_INDEX.md` | Updated documentation indexes |
@greptileai please review again with latest changes
Greptile Summary

This PR adds end-to-end probability-aware metric support (ROC-AUC variants, log-loss) across the search, baseline validation, and final evaluation pipeline.

Key changes:
Minor observations:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| `plexe/helpers.py` | Adds `metric_requires_probabilities` and `normalize_probability_predictions` helpers; updates `_evaluate_predictor` to route through `predict_proba` for probability metrics. Direction logic and column-count checks are correct; all prior review-thread issues addressed. |
| `plexe/search/journal.py` | Adds validated `optimization_direction` property, `selection_score`, `is_better`, and `sort_key` helpers; direction-aware `best_node`, `summary`, and history computations are all correct. Serialisation round-trips correctly with a backward-compatible default. |
| `plexe/search/evolutionary_search_policy.py` | All `n.performance` comparisons replaced with `journal.sort_key` / `selection_score`. Early-stopping thresholds now handle both directions, and the no-baseline edge case for lower-direction metrics is correctly fixed. |
| `plexe/templates/inference/keras_predictor.py` | Adds `_uses_logits_output` and `_probabilities_from_raw` with NaN/Inf guard, legacy logit heuristic, and missing-metadata fallback. All prior review-thread issues addressed; logit sigmoid/softmax paths are correct. |
| `plexe/tools/submission.py` | Extends `get_save_split_uris_tool` with Spark-level split validation and test-split enforcement; baseline validation tools updated to route through `predict_proba` for probability metrics. Logic is correct. |
| `plexe/workflow.py` | Sets `journal.optimization_direction` (now validated by the property setter) in all journal creation and restore paths; fallback selection uses `journal.sort_key`; split-ratio canonicalization and a 3-way guard added for the final-evaluation path. |
| `plexe/agents/dataset_splitter.py` | Passes `split_ratios` through to `get_save_split_uris_tool` for Spark-level validation; two TODO comments flag that prompts still instruct 3-way output even for 2-way modes, which could cause accidental holdout splits. |
Last reviewed commit: 2aac28d
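The Keras logit-to-probability conversion flagged above can be sketched as below. The helper name and the NaN/Inf guard come from the review summary; the body here is an assumption for illustration, not the PR's actual code:

```python
import numpy as np

# Assumed sketch of a _probabilities_from_raw-style helper: guard against
# non-finite outputs, then map logits to probabilities via sigmoid/softmax.
def probabilities_from_raw(raw: np.ndarray, uses_logits: bool) -> np.ndarray:
    raw = np.asarray(raw, dtype=float)
    if not np.all(np.isfinite(raw)):
        raise ValueError("raw model output contains NaN/Inf")
    if not uses_logits:
        return raw  # model already outputs probabilities
    if raw.ndim == 1 or raw.shape[-1] == 1:
        return 1.0 / (1.0 + np.exp(-raw))  # sigmoid for binary logits
    shifted = raw - raw.max(axis=-1, keepdims=True)  # numerically stable softmax
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

probs = probabilities_from_raw(np.array([[2.0, 0.0], [0.0, 2.0]]), uses_logits=True)
assert np.allclose(probs.sum(axis=-1), 1.0)  # each row is a distribution
```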
Additional Comments (2)
Path: plexe/helpers.py
Line: 264
Comment:
**`n_classes` inferred from a potentially-incomplete sample**
`n_classes = len(np.unique(y_true))` counts only the classes that happen to appear in the supplied validation/evaluation sample. If even one class is absent from the sample (e.g., a rare class in a 3-class problem with a small validation split), the function can return a 1-D array instead of the required full probability matrix, silently feeding the wrong shape into `roc_auc_ovr` / `roc_auc_ovo` / `log_loss`.
A safer approach is to pass the expected number of classes explicitly and fall back only when the caller cannot supply it:
```python
def normalize_probability_predictions(
y_true: np.ndarray,
y_pred_proba: Any,
metric_name: str,
n_classes: int | None = None,
) -> np.ndarray:
...
if n_classes is None:
n_classes = len(np.unique(y_true))
```
Alternatively, at the call-sites where `task_analysis["num_classes"]` is already available, pass it through so the function has a reliable class count.
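As a hedged illustration of the failure mode (sample data invented for this sketch), inferring the class count from a small split can undercount:

```python
import numpy as np

# Invented sample: a 3-class problem whose validation split happens to
# contain only classes 0 and 1 — class 2 never appears in this sample.
y_true_sample = np.array([0, 1, 0, 1, 1])
inferred = len(np.unique(y_true_sample))
assert inferred == 2  # undercounts the true 3 classes

# Passing the known class count explicitly avoids the silent mismatch
n_classes = 3  # e.g. from task_analysis["num_classes"]
```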
Path: plexe/search/journal.py
Line: 481-483
Comment:
**`improvement` percentage sign is wrong for `lower`-direction improvements**
The improvement percentage shown in the summary string is:
```python
score_delta = self.selection_score(best.performance) - self.selection_score(self.baseline_performance)
```
For `lower` direction: `score_delta = -best - (-baseline) = baseline - best`.
When `best < baseline` (genuine improvement), `score_delta > 0`, so the displayed `+X%` is correct.
However, the displayed raw value is always `best.performance` (the raw metric), while the sign of the percentage is computed via the flipped score. This is correct behaviour, but can be confusing at first read. Consider adding a comment so the intent is clear:
```python
# For lower-direction metrics, score_delta is positive when performance improves
score_delta = self.selection_score(best.performance) - self.selection_score(self.baseline_performance)
```
@greptileai please review again with latest changes
Additional Comments (1)
Path: plexe/workflow.py
Line: 984-985
Comment:
**Redundant `optimization_direction` assignment on restored journals**
`optimization_direction` is already set correctly by `SearchJournal.from_dict` (via `d.get("optimization_direction", "higher")`). Setting it again here, and then a third time inside `search_models` (line 1002), adds noise without benefit. If the intent is "always trust the live metric over the checkpoint value," a single authoritative assignment in `search_models` (where the journal is actually handed to the policy) is sufficient, and this extra write can be removed.
…rics-support
# Conflicts:
#   pyproject.toml
@greptileai please review again with latest changes
Additional Comments (2)
Path: plexe/validation/validators.py
Line: 387-395
Comment:
**Test-row count inflates `total` when `test_uri` is provided but not expected**
When the agent calls `save_split_uris` with a `test_uri` even though the expected ratios don't include a test split (e.g. `expected_ratios = {"train": 0.7, "val": 0.3}`), the block below reads and counts the test Parquet:
```python
if test_uri:
test_count = spark.read.parquet(test_uri).count()
total = train_count + val_count + test_count
```
That inflates `total`, so the ratio check computes:
```
actual_ratios["val"] = val_count / (train + val + test) # e.g. 0.15 instead of 0.30
```
which is 15 percentage points below the expected 0.30, triggering a spurious ">10% off" warning (and, once the `TODO` escalation lands, a hard rejection). The `test_count` rows also push `actual_ratios["test"]` above 0, but `"test"` is skipped in the loop because it's absent from `normalized_expected`.
The simplest fix is to only accumulate and count the test split when it is actually expected:
```python
if test_uri and expects_test_split:
test_count = spark.read.parquet(test_uri).count()
total = train_count + val_count + test_count
else:
test_count = 0
total = train_count + val_count
```
This aligns `total` with the set of splits that `expected_ratios` describes.
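The inflation is easy to see with invented counts (a 70/30 train/val split of 1,000 rows plus an unexpected 1,000-row test split):

```python
# Invented row counts illustrating the denominator inflation
train_count, val_count, test_count = 700, 300, 1000

# Counting the unexpected test split inflates the denominator
inflated = val_count / (train_count + val_count + test_count)
assert abs(inflated - 0.15) < 1e-9  # reads as 0.15, not the expected 0.30

# Excluding it restores the ratio that expected_ratios describes
correct = val_count / (train_count + val_count)
assert abs(correct - 0.30) < 1e-9
```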
Path: plexe/helpers.py
Line: 320-334
Comment:
**`metric` variable computed but used only conditionally — and `metric_name` is used in the 1-D branch**
`metric` (the lowercase-stripped form) is assigned early but is only referenced in the multiclass column-count guard much later. The 1-D error message and the "not 2D" error message use the raw `metric_name` argument instead:
```python
raise ValueError(f"Metric '{metric_name}' requires per-class probabilities ...")
raise ValueError(f"Expected probability outputs to be 1D or 2D, got shape {probabilities.shape}")
```
Using the normalised `metric` in the first message would be consistent with the rest of the function (and would ensure casing quirks in caller-supplied names don't leak into error text). Minor but worth aligning:
```suggestion
if probabilities.ndim == 1:
if n_classes > 2:
raise ValueError(f"Metric '{metric}' requires per-class probabilities for multiclass tasks.")
return probabilities
```
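For illustration, the normalisation in question (assumed here to be a standard strip/lowercase pass) keeps caller-supplied casing quirks out of error text:

```python
# Assumed normalisation: strip whitespace and lowercase the metric name
metric_name = "  ROC_AUC_OVR "       # caller-supplied, quirky casing
metric = metric_name.strip().lower()  # normalised form used in checks
assert metric == "roc_auc_ovr"

# Error messages built from `metric` stay consistent regardless of input
message = f"Metric '{metric}' requires per-class probabilities for multiclass tasks."
assert "ROC" not in message
```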
@greptileai please review again with latest changes
@greptileai we have addressed all issues raised on this PR, all current checks are passing, and this should now be ready to merge. |
Additional Comments (2)
`roc_auc_score(y_true, y_pred, multi_class="ovo")` fails for binary tasks: sklearn rejects a 1-D score array when `multi_class="ovo"` is set. Falling back to standard binary AUC handles this:

```python
elif metric == StandardMetric.ROC_AUC_OVO.value:
    n_classes = len(np.unique(y_true))
    if n_classes == 2:
        # OVO is undefined for binary; fall back to standard binary AUC
        return float(roc_auc_score(y_true, y_pred))
    return float(roc_auc_score(y_true, y_pred, multi_class="ovo"))
```
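A quick check of the fallback path, assuming scikit-learn is available (toy data invented for this sketch):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1])
y_score = np.array([0.1, 0.9, 0.2, 0.8])  # 1-D binary probabilities

# Binary AUC accepts a 1-D score array without any multi_class argument
auc = roc_auc_score(y_true, y_score)
assert auc == 1.0  # positives perfectly ranked above negatives
```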
Two TODO comments were left in the production code. If these aren't being fixed in this PR, consider tracking them in an issue rather than leaving them as inline TODOs, since the prompts are live agent instructions that can affect correctness of the split phase.
This PR adds probability-aware metric support across search, baseline validation, and final evaluation while keeping the existing workflow structure intact. The aim is to enable probability-based primary metrics (roc_auc, roc_auc_ovr, roc_auc_ovo, log_loss), make search/fallback selection direction-correct for both higher-is-better and lower-is-better metrics, and enforce clear predictor semantics for probability outputs.
It also adds focused unit coverage for helper routing, search direction behavior, predictor probability semantics, baseline validation, and evaluator prompt guidance.
Testing