After reading https://aider.chat/2024/04/09/gpt-4-turbo.html I was wondering if there is a way to improve the new model's score in the refactoring benchmark. I executed a small number of the tests and they all failed, but not in the way I had expected (i.e. "# rest of the code remains unchanged"). They all failed in the same way. I created this minimal test case to reproduce the failure: given a class containing methods f2 and f3, gpt-4-turbo generates a valid diff that simply out-indents f2, ignoring the fact that f3 then no longer belongs to MyClass.
If the model is given the code and the task without aider's prompt, it generates the correct code, from which I concluded that the model understands the task but is unable to express it in the udiff format.
I then gave the model a diff generated by another model that produces the correct output, and asked it to revise the prompt. It did (essentially, it simplified aider's prompt), and with the new prompt gpt-4-turbo was able to correctly solve the minimal test case. It still failed on the larger cases, though. The simplified prompt in coders/udiff_prompts.py was:
system_reminder = """# File editing rules:
Return edits similar to unified diffs that `diff -U0` would produce, including:
- The first 2 lines with the file paths without timestamps.
- Start each hunk of changes with `@@ ... @@`.
- Mark all lines that need to be removed or changed with `-` and all new or modified lines with `+`.
- Indentation matters in the diffs.
- Start a new hunk for each section of the file that needs changes.
- Output hunks in whatever order makes the most sense.
- When editing a function, method, loop, etc., use a hunk to replace the entire code block.
- To move code within a file, use 2 hunks: one to delete it from its current location, one to insert it in the new location.
- To make a new file, show a diff from `--- /dev/null` to `+++ path/to/new/file.ext`.
"""
My questions:
Did you publish the raw benchmark results somewhere? If not, would you consider it? I'd like to look at more of the failures and check what their causes are.
Assuming a large part of the failures is caused by the problem described above, do you think it's worthwhile to explore different prompts to "fix" these failures? Maybe you already have an idea of which prompting technique to try to make the model better understand the udiff format.
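Beyond prompting, one complementary safeguard (a sketch, not anything aider actually does) would be to compare the class/method structure before and after applying the model's edit. A structural diff via `ast` catches exactly this kind of silent dedent; the `before`/`broken` snippets below are hypothetical reconstructions of the test case:

```python
import ast


def class_methods(source: str) -> set[tuple[str, str]]:
    """Return the {(class_name, method_name)} pairs found in source."""
    pairs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    pairs.add((node.name, item.name))
    return pairs


before = (
    "class MyClass:\n"
    "    def f1(self): pass\n"
    "    def f2(self): pass\n"
    "    def f3(self): pass\n"
)
# Faulty edit: f2 dedented, f3 left behind at its old indentation.
broken = (
    "class MyClass:\n"
    "    def f1(self): pass\n"
    "\n"
    "def f2(self):\n"
    "    pass\n"
    "\n"
    "    def f3(self):\n"
    "        pass\n"
)

lost = class_methods(before) - class_methods(broken)
# A correct "move f2 to module level" would only lose ("MyClass", "f2");
# f3 showing up here exposes the bug automatically.
assert lost == {("MyClass", "f2"), ("MyClass", "f3")}
```

A check like this could reject the edit (or trigger a retry) before the benchmark ever runs the refactored code.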
I'm going to close this for now, given that GPT-4o seems better across the board. But feel free to add a comment here and I will re-open or file a new issue any time.