test: tweak smoke test tool bodies to standardize response text #876

njhale · 2024-10-14T21:59:56Z

Smoke tests flake for gpt-4o because of non-determinism in how it interprets the test case instructions (e.g. it fails to interpolate a string variable consistently). This change reduces ambiguity in the tool instructions so that the model produces consistent results across smoke test runs.

Note: I regenerated golden files across all models and ran the tests 10 times per model to vet this change.

thedadams · 2024-10-14T22:43:47Z

pkg/tests/smoke/testdata/Bob/test.gpt

 args: question: The question to ask Bob.

-When asked how I am doing, respond with exactly "Thanks for asking "${QUESTION}", I'm doing great fellow friendly AI tool!"
+When asked how I am doing, respond with the following exactly: "Thanks for asking '${question}'! I'm doing great fellow friendly AI tool!" with ${question} replaced with the question text as given.


I thought we changed this such that question should be QUESTION?

yup, we did.

Fixed and pushed.

Curious if this makes any functional difference since this isn't a code tool. ${QUESTION} is being made to look like an environment variable here, but there isn't anything actually setting or reading env vars... it's just the LLM being a smarty pants.

Yeah, this is more for convention's sake than anything.

The extra text explaining how to "interpolate" the variable is there because 4o isn't actually that much of a smarty pants after all.
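For reference, the substitution the model is being asked to perform can be sketched deterministically. This is only an illustration in Python, assuming the response template from the diff above; `expected_response` and `matches_expected` are hypothetical helpers, not part of GPTScript or the smoke test harness, and the tool itself still relies on the LLM to do the replacement.

```python
from string import Template

# Response template copied from the tool body shown in the diff above.
# Using string.Template here is an assumption made for illustration only;
# GPTScript does not interpolate this text itself for non-code tools.
RESPONSE_TEMPLATE = Template(
    "Thanks for asking '${question}'! I'm doing great fellow friendly AI tool!"
)


def expected_response(question: str) -> str:
    """Return the text the model is expected to produce for a question."""
    return RESPONSE_TEMPLATE.substitute(question=question)


def matches_expected(actual: str, question: str) -> bool:
    """Hypothetical check: exact match after trimming outer whitespace."""
    return actual.strip() == expected_response(question)
```

A response that leaves the placeholder unreplaced, which is the kind of inconsistency described above, would fail `matches_expected`.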

Tweak the tool bodies for smoke test GPTScripts to reduce ambiguity in the response. This prevents models -- like gpt-4o -- from doing things like failing to interpolate strings consistently between runs.

Signed-off-by: Nick Hale <4175918+njhale@users.noreply.github.com>

njhale requested a review from thedadams October 14, 2024 21:59

thedadams approved these changes Oct 14, 2024

njhale requested review from StrongMonkey, drpebcak, g-linville, iwilltry42, ryanhopperlowe, thedadams and tylerslaton October 14, 2024 22:34

njhale changed the title ~~test/smoke tweak tc bodies~~ test: tweak smoke test tool bodies to standardize response text Oct 14, 2024

thedadams previously approved these changes Oct 14, 2024

njhale requested review from thedadams and removed request for iwilltry42 and ryanhopperlowe October 14, 2024 22:41

thedadams reviewed Oct 14, 2024

njhale added 2 commits October 14, 2024 19:06

test: regenerate golden files for smoke tests

1b6f172

Signed-off-by: Nick Hale <4175918+njhale@users.noreply.github.com>

njhale dismissed thedadams’s stale review via 1b6f172 October 14, 2024 23:09

njhale force-pushed the test/smoke-tweak-tc-bodies branch from 1c95f69 to 1b6f172 Compare October 14, 2024 23:09

njhale requested a review from thedadams October 14, 2024 23:18

thedadams approved these changes Oct 14, 2024

drpebcak approved these changes Oct 14, 2024

njhale merged commit b7d31f2 into gptscript-ai:main Oct 15, 2024
10 checks passed
