Structured data extraction/known results for test cases #15

Open
philpax opened this issue Jul 14, 2023 · 1 comment
philpax commented Jul 14, 2023

Hi there!

First off, thanks for this - it's great and as-is it's given me some ideas for prompt design 🙏

I'm working on extracting dates from arbitrary text as JSON: given an optimised system prompt, I want to pass GPT-3.5 an arbitrary string and have it produce an array matching this TypeScript schema:

```typescript
type Result =
    {"Millennium": {"year": number, "metadata"?: string}} |
    {"Century": {"year": number, "metadata"?: string}} |
    {"Decade": {"year": number, "metadata"?: string}} |
    {"Year": {"year": number, "metadata"?: string}} |
    {"Month": {"year": number, "month": number, "metadata"?: string}} |
    {"Day": {"year": number, "month": number, "day": number, "metadata"?: string}} |
    {"Range": {"start": Result, "end": Result, "metadata"?: string}} |
    {"Ambiguous": Result[]} |
    {"Present": {"metadata"?: string}}
```

My existing prompt is a many-shot prompt in which I specify how each given date string should be parsed into JSON. It works pretty well, but it's over a thousand tokens, making evaluation costly.

If you're curious, here's a subset of the examples:

2024: `[{"Year":{"year":2024}}]`
c. 2016: `[{"Year":{"year":2016}}]`
1930-1937 1942-1945: `[{"Range":{"start":{"Year":{"year":1930}},"end":{"Year":{"year":1937}}}},{"Range":{"start":{"Year":{"year":1942}},"end":{"Year":{"year":1945}}}}]`
7–12 June 1967: `[{"Range":{"start":{"Day":{"year":1967,"month":6,"day":7}},"end":{"Day":{"year":1967,"month":6,"day":12}}}}]`
16, 20-27 March 1924: `[{"Day":{"year":1924,"month":3,"day":16}},{"Range":{"start":{"Day":{"year":1924,"month":3,"day":20}},"end":{"Day":{"year":1924,"month":3,"day":27}}}}]`
12 June 1723 - 26 September 1726: `[{"Range":{"start":{"Day":{"year":1723,"month":6,"day":12}},"end":{"Day":{"year":1726,"month":9,"day":26}}}}]`
14th century: `[{"Century":{"year":1300}}]`
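For grading, matching a completion against these expected outputs would probably want to be structural rather than a raw string comparison, so that whitespace or key-ordering differences don't count as failures. A minimal sketch of what I mean (the function name is mine, not anything in GPE):

```python
import json

def matches_expected(completion: str, expected: str) -> bool:
    """Compare a model completion against the expected JSON structurally,
    so whitespace and formatting differences don't cause false failures."""
    try:
        return json.loads(completion) == json.loads(expected)
    except json.JSONDecodeError:
        # An unparseable completion is simply a failed test case.
        return False

# Extra whitespace in the completion still matches:
matches_expected('[ {"Year": {"year": 2024}} ]', '[{"Year":{"year":2024}}]')  # True
```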

I was hoping to use GPE to find a more optimised prompt: supply my existing many-shot examples as test cases with known expected outputs, and let GPE/its GPT instances search for a prompt that produces those same outputs without actually spelling out each example.

Unfortunately, at the time of writing, GPE only comes in two flavours: test cases with GPT-based evaluation, and classification with multiple-choice answers.

Using the former, I was able to find a slightly better prompt prelude, but the many-shot cases are still required. The classification flavour appears to be tightly coupled to multiple-choice evaluation, so it wouldn't work for me.

For my use case, I'd like a flavour in between: test cases with known solutions, where each prompt is graded in its ability to match the solution. I was considering hacking up the classification flavour, but I wasn't sure how best to adapt the prompt to handle this.

Is this something that you think would be feasible? I figure that this might come up in other contexts, too - being able to pass a set of input/output pairs to GPE and have it optimise for the best prompt would be wonderful!


More concretely: I'd like to pass in

```python
test_cases = [
    {'prompt': "The Bank at Burbank", 'output': '[]'},
    {'prompt': "Red Bull Studios, AWOLSTUDIO, Avatar Studios, Main and Market, Gymnasium, Fireside Sound Studio", 'output': '[]'},
    {'prompt': "1980s", 'output': '[{"Decade":{"year":1980}}]'},
    {'prompt': "3000 BC", 'output': '[{"Year":{"year":-3000}}]'},
    # ...
]
```

and have GPE optimise a prompt that produces the given output for each prompt.
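The grading itself could be as simple as an exact-match pass rate over the test cases. A rough sketch of what I imagine happening internally (`call_model` is a hypothetical stand-in for however GPE queries GPT-3.5):

```python
import json

def score_prompt(system_prompt, test_cases, call_model):
    """Return the fraction of test cases whose completion parses to the
    expected JSON. `call_model(system_prompt, user_input)` is a placeholder
    for the actual API call, not a real GPE function."""
    passed = 0
    for case in test_cases:
        completion = call_model(system_prompt, case['prompt'])
        try:
            if json.loads(completion) == json.loads(case['output']):
                passed += 1
        except json.JSONDecodeError:
            pass  # unparseable output counts as a failure
    return passed / len(test_cases)
```

GPE could then generate candidate prompts and keep whichever maximises this score.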

sudoaza commented Jul 20, 2023

I was hoping for something similar. In my case, only part of the completion needs to be a specific value; the rest can vary. I was thinking of having an evaluation function: each test case would have two fields, input and expected output, and one would define an evaluation function that takes the model's output and the expected output and returns true/false as a success value, or even a score if that's what you want.
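A sketch of that evaluation-function idea, with exact match as the default and a laxer substring scorer as an example of a user-supplied one (all names are hypothetical, not existing GPE API):

```python
import json

def exact_match(completion, expected):
    """Default scorer: structural exact match, scored 1.0 or 0.0."""
    try:
        return 1.0 if json.loads(completion) == json.loads(expected) else 0.0
    except json.JSONDecodeError:
        return 0.0

def contains_match(completion, expected):
    """Laxer scorer: pass if the expected value appears anywhere
    in the completion, letting the rest of the text vary."""
    return 1.0 if expected in completion else 0.0

def grade(completion, expected, eval_fn=exact_match):
    """Grade one test case with a pluggable evaluation function."""
    return eval_fn(completion, expected)
```

The same harness then covers both the strict known-solution case and the partial-match case, depending on which `eval_fn` the user supplies.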
