feat: promptfoo eval --filter-failing outputFile.json #742
const {results} = await readOutput(outputPath);

if (results.version < 2) {
  throw new Error(`Unsupported output version: ${results.version}`);
I put this here just because I'm not familiar with the differences between versions.
Should be safe to remove this check 👍
src/main.ts
Outdated
firstN: cmdObj.firstN,
pattern: cmdObj.pattern,
failing: cmdObj.failing,
I'm wondering if these command line params should be changed to be something like:
filterFirstN
filterPattern
filterFailing
I support this
Ok... I'll make the change in this PR
LGTM - responses to your comments inline. Let me know if you'd like to rename the params to filter... in this PR or a separate one.
@@ -1228,3 +1244,26 @@ export function getStandaloneEvals(): StandaloneEval[] {
  });
  return flatResults;
}

export function providerToIdentifier(provider: TestCase['provider']): string | undefined {
Really appreciate you adding these helper functions. I'm aware that ApiProvider vs ProviderOptions and similar variations are some of the ugliest parts of the code :(
No prob. I love the flexibility of promptfoo. You can use it many ways. But it also does bring on complexity. I suspect little utility functions to simplify logic can go a long way.
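For readers following along, here is a minimal sketch of what a helper like providerToIdentifier might do. Only the function name and its `string | undefined` return type come from the diff above; the `ApiProvider` and `ProviderOptions` shapes below are simplified, hypothetical stand-ins for the project's real types, and the branching logic is an assumption rather than the code in this PR.

```typescript
// Hypothetical, simplified stand-ins for the project's provider types.
interface ApiProvider {
  id: () => string; // assumed: class-like providers expose id as a method
}

interface ProviderOptions {
  id?: string; // assumed: config-style providers carry id as a plain string
}

type ProviderLike = string | ApiProvider | ProviderOptions | undefined;

// Type guard: true when the value exposes `id` as a function.
function isApiProvider(provider: ProviderLike): provider is ApiProvider {
  return (
    typeof provider === 'object' &&
    provider !== null &&
    typeof (provider as ApiProvider).id === 'function'
  );
}

// Collapse the different provider representations into one identifier string.
function providerToIdentifier(provider: ProviderLike): string | undefined {
  if (provider === undefined || typeof provider === 'string') {
    return provider; // a bare string is already the identifier
  }
  if (isApiProvider(provider)) {
    return provider.id(); // call the method form
  }
  return provider.id; // read the property form (may be undefined)
}
```

Funnelling every representation through one function means callers can compare providers by identifier without caring which form they were given.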
@@ -552,8 +552,9 @@ async function main() {
      'Run providers interactively, one at a time',
      defaultConfig?.evaluateOptions?.interactiveProviders,
    )
    .option('-n, --first-n <number>', 'Only run the first N tests')
    .option('--pattern <pattern>', 'Only run tests whose description matches the regular expression pattern')
    .option('-n, --filter-first-n <number>', 'Only run the first N tests')
This will be a breaking change, so we will need to do a version bump for the next release. I'm unsure if there's a mechanism you use, e.g. GitHub labels, to determine if a version bump is needed for the next release.
Should we keep -n? Should the other two filters have short forms also?
My selfish preference is to keep -n because it feels familiar, like head -n :). Open to other short forms but I don't think it's necessary. If we find ourselves getting tired of typing everything out let's add it separately.
I do version bumps manually - since we're pre-1.0 I've included breaking changes in minor versions, but generally trying to avoid them. I think this is acceptable though. I'll merge with feat! breaking change notation and make a note in the release notes.
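To make the two existing filters concrete, here is a rough sketch of how "only run the first N tests" and "only run tests whose description matches the regular expression pattern" could be applied to a list of tests. The `DescribedTest` and `FilterOptions` types and the `filterTests` function are hypothetical names for illustration, and applying the pattern before the first-N cut is an assumption, not something this PR confirms.

```typescript
// Hypothetical minimal test shape; the real project type has more fields.
interface DescribedTest {
  description?: string;
}

// Hypothetical options object mirroring the two filters in the help text.
interface FilterOptions {
  firstN?: number;
  pattern?: string;
}

function filterTests<T extends DescribedTest>(tests: T[], options: FilterOptions): T[] {
  let filtered = tests;
  if (options.pattern !== undefined) {
    // Keep only tests whose description matches the regular expression.
    const regex = new RegExp(options.pattern);
    filtered = filtered.filter((test) => regex.test(test.description ?? ''));
  }
  if (options.firstN !== undefined) {
    // Assumed order: the first-N cut applies to the already-filtered list.
    filtered = filtered.slice(0, options.firstN);
  }
  return filtered;
}
```

With this ordering, combining both options yields the first N tests that match the pattern, rather than the matches among the first N tests.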
Lets you iterate more quickly by rerunning only the tests that failed in a previous run, given that run's output file.
Example Flow
First run:
![Screenshot 2024-04-30 at 5 17 28 PM](https://private-user-images.githubusercontent.com/496903/326984674-a58f706d-42bc-4e6e-adfc-916319856b61.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE4NjU0NzAsIm5iZiI6MTcyMTg2NTE3MCwicGF0aCI6Ii80OTY5MDMvMzI2OTg0Njc0LWE1OGY3MDZkLTQyYmMtNGU2ZS1hZGZjLTkxNjMxOTg1NmI2MS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNFQyMzUyNTBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1iZjA2MGZiZmFhNmNiZTU0YTRmNWUzMjdmODIxMWNmYTNlOTU5MzUyNTRjNDRlNDZiMjRmNjFmMTMxM2NiMjJiJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.REPZhLFesxtupssA9P807nHd-qJSHG4o47puUfGJYMs)
promptfoo eval --output result.json
Next run:
![Screenshot 2024-04-30 at 5 18 41 PM](https://private-user-images.githubusercontent.com/496903/326984784-a0a88338-8ae2-4240-beda-d1417de16372.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE4NjU0NzAsIm5iZiI6MTcyMTg2NTE3MCwicGF0aCI6Ii80OTY5MDMvMzI2OTg0Nzg0LWEwYTg4MzM4LThhZTItNDI0MC1iZWRhLWQxNDE3ZGUxNjM3Mi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNFQyMzUyNTBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mMGNmNjgyMzk2YzUxZDM5ZmExMDNjYjAxNGU1ZThhNGRjNzQ0MmU2NWY1NDFhNjA4MmEwMjM2NzFkZDY2OGYyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.2bxWpM_iU521KOU5eso7rOSolFv45ZrdZLvitMUTCW4)
promptfoo eval --filter-failing result.json
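The flow above can be sketched roughly as follows: take the results from a previous run, collect the ones that failed, and keep only the matching tests for the next run. Everything here is an assumption for illustration: the `ResultRow` and `VarsTest` shapes, the `filterFailingTests` name, and in particular the idea of matching a test to a result by its `vars`. The real output file format and matching logic may differ.

```typescript
// Hypothetical, simplified shapes; the real output file format may differ.
interface ResultRow {
  success: boolean;
  vars: Record<string, string>;
}

interface VarsTest {
  vars: Record<string, string>;
}

// Build an order-independent key so {a, b} and {b, a} compare equal.
function varsKey(vars: Record<string, string>): string {
  const sortedEntries = Object.keys(vars)
    .sort()
    .map((key) => [key, vars[key]]);
  return JSON.stringify(sortedEntries);
}

// Keep only the tests whose vars match a failed result from the previous run.
function filterFailingTests(tests: VarsTest[], previousResults: ResultRow[]): VarsTest[] {
  const failingKeys = new Set(
    previousResults.filter((row) => !row.success).map((row) => varsKey(row.vars)),
  );
  return tests.filter((test) => failingKeys.has(varsKey(test.vars)));
}
```

Tests that were not part of the previous run are dropped here too, since they have no failing result to match against.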