Conversation

@bzorn (Contributor) commented Jun 5, 2025

No description provided.

pelikhan and others added 5 commits June 2, 2025 23:45
* 🎛️ feat: switch evalModel to evalModelSet for test evaluation

Replaces evalModel with evalModelSet, allowing multiple evaluation models.

* ✨ feat: add multi-model evaluation to metrics and compliance

Support evaluating metrics and compliance with multiple models via evalModelSet.

* ✨ Refine evaluation model handling and debug logging

Improved evalModelSet defaults, header levels, and debugging output.

* ✨ Enhance evalModelSet sourcing and logging in promptpex

Now supports sourcing evalModelSet from env var, adds validation, and logging.

* ✨ refactor test metric evaluation and overview model handling

Refined evalModelSet parsing and updated test metric iteration logic.

* ✨ feat: Add combined avg metric across eval models

Compute and store average metric score for all evaluation models used.

* ✨ Enhance promptpex test evaluation and script logic

Added separate eval-only/test-run modes, improved metric evaluations

* ♻️ Rename evalModelSet to evalModel throughout codebase (…#141)

Standardizes config and variable naming from evalModelSet to evalModel.

* ✨ Enhance test results saving and eval metrics workflow

Improved control of results file writing and evaluation metrics assignment.

* ✨ Add evals config flag to control evaluation execution

Introduces evals boolean for toggling evaluation of test results.

* ✨ Enable direct context-loading from JSON files

Refactored CLI to load PromptPexContext from JSON, updating file flow.

* ✨ Add scripts and logic for multi-stage sample evaluations

Introduces zx scripts for gen/run/eval sample tests and conditional test executions.

* 🔀 rename: Samples scripts renamed to .zx.mjs extensions

All run-samples-*.mjs scripts updated to .zx.mjs for zx compatibility.

* ♻️ refactor: Rename sample scripts to .zx.mjs extensions

Updated script names in package.json and renamed a sample file for zx compatibility
@bzorn bzorn requested review from Copilot and pelikhan June 5, 2025 20:22
Copilot AI left a comment


Pull Request Overview

This PR enhances ground truth generation by revising evaluation parameters and updating how evaluation models are handled throughout the code.

  • Introduces a new evaluation flag (evals) and updates the evalModel type in types.mts.
  • Adjusts test and metric evaluation functions to operate on evalModel as an array, and modifies related scripts and configurations.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Summary of changes per file:

src/genaisrc/src/types.mts — Added the new evals flag and updated the evalModel type signature.
src/genaisrc/src/testrun.mts — Modified test run logic to extract the first element from evalModel for evaluation.
src/genaisrc/src/testresulteval.mts — Updated the evaluateTestResult signature to accept an explicit evalModel parameter.
src/genaisrc/src/testquality.mts — Changed how evalModel is accessed (using index 0) for quality evaluation.
src/genaisrc/src/testevalmetric.mts — Iterates over options.evalModel when computing metrics.
src/genaisrc/src/rulesgroundeness.mts — Uses the first element of evalModel for rule grounding.
src/genaisrc/promptpex.genai.mts — Adjusted option parsing and ensured evalModel is processed as an array.
Other files (samples, package.json, GitHub workflows) — Updated scripts and configuration settings to be consistent with the evaluation changes.
Comments suppressed due to low confidence (1)

src/genaisrc/src/types.mts:132

  • The evalModel property now accepts an array or the literal 'eval'. Ensure that its usage is consistent throughout the codebase, as some modules assume it is always an array.
evalModel?: ModelType[] | "eval"

Comment on lines +234 to 238
+const eModel = options.evalModel[0]
 testRes.compliance = undefined
-const compliance = await evaluateTestResult(files, testRes, options)
+const compliance = await evaluateTestResult(files, eModel, testRes, options)
 testRes.complianceText = compliance.content
 updateTestResultCompliant(testRes)

Copilot AI Jun 5, 2025


Indexing evalModel without verifying its type assumes it is an array. Add a type check or normalization to prevent unexpected behavior when evalModel is a string.
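The normalization the reviewer asks for could be sketched as follows. This is a hypothetical helper, not code from the PR; the name normalizeEvalModel is invented here, and the semicolon separator is an assumption borrowed from the reviewer's later suggested change:

```typescript
// Hypothetical sketch (not in the PR): coerce evalModel, which may arrive as a
// semicolon-separated string or as an array, into a string[] before indexing.
function normalizeEvalModel(evalModel?: string[] | string): string[] {
    if (Array.isArray(evalModel)) return evalModel
    if (typeof evalModel === "string")
        return evalModel
            .split(";")
            .map((s) => s.trim())
            .filter(Boolean)
    // undefined or any other shape: no evaluation models configured
    return []
}

console.log(normalizeEvalModel("gpt-4o; gpt-4o-mini")) // ["gpt-4o", "gpt-4o-mini"]
console.log(normalizeEvalModel(undefined)) // []
```

Callers could then safely write normalizeEvalModel(options.evalModel)[0] or iterate the result without risking iterating over a string's characters.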

-const { force, evalModel = MODEL_ALIAS_EVAL } = options || {}
+const { force } = options || {}
+const evalModel = options.evalModel[0]

Copilot AI Jun 5, 2025


Accessing the first element of evalModel assumes it is always an array. Ensure that evalModel is normalized to an array type or add a safeguard before indexing.

// Remove all previous metrics before computing new ones
testResult.metrics = {}

for (const eModel of options.evalModel) {

Copilot AI Jun 5, 2025


Iterating over evalModel assumes that it is an array. If evalModel might be a string, ensure it is converted to an array to avoid iterating over its characters.

Suggested change:

-for (const eModel of options.evalModel) {
+let evalModelArr: string[] = []
+if (typeof options.evalModel === "string") {
+    evalModelArr = options.evalModel.split(";").map(s => s.trim()).filter(Boolean)
+} else if (Array.isArray(options.evalModel)) {
+    evalModelArr = options.evalModel
+}
+for (const eModel of evalModelArr) {

 {
-    ...modelOptions(evalModel, options),
+    // assume first evalModel is the one to use for rule grounding
+    ...modelOptions(evalModel[0], options),

Copilot AI Jun 5, 2025


Using evalModel[0] without ensuring evalModel is an array can lead to errors if a string is passed. Ensure consistent type normalization for evalModel.

Suggested change:

-    ...modelOptions(evalModel[0], options),
+    ...modelOptions(normalizedEvalModel[0], options),

@bzorn bzorn closed this Jun 5, 2025
@bzorn (Contributor, Author) commented Jun 5, 2025

Meant to open this PR against dev.

@bzorn bzorn deleted the ground-truth branch June 5, 2025 21:11