
gpt-4-turbo refactoring benchmark questions #558

Closed
few opened this issue Apr 14, 2024 · 2 comments
Labels: question (Further information is requested)

Comments


few commented Apr 14, 2024

After reading https://aider.chat/2024/04/09/gpt-4-turbo.html I was wondering if there is a way to improve the new model's score on the
refactoring benchmark. I ran a small number of the tests and they all failed, but not in the way I had expected (i.e. "# rest of the code remains unchanged"). They all failed in the same way. I created this minimal test case to reproduce the failure:

Given a class that looks like this:

class MyClass:
    def f1(self):
        self.x = self.f2()
        self.y = self.f3()
    
    def f2(self):
        return 1

    def f3(self):
        return 2

gpt-4-turbo would generate a valid diff that results in this code:

class MyClass:
    def f1(self):
        self.x = f2()
        self.y = self.f3()
    
def f2(self):
    return 1

    def f3(self):
        return 2

It simply de-indents f2, ignoring the fact that f3 now no longer belongs to MyClass.
If the model is given the code and the task without aider's prompt, it generates the correct code, from which I concluded that the model understands the task but is unable to express it in the udiff format.
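
For reference, a correct result (taking the task to be moving f2 out of the class into a standalone function and dropping the now-unused self parameter; the exact wording of the task doesn't matter for the failure mode) would look something like this:

class MyClass:
    def f1(self):
        self.x = f2()
        self.y = self.f3()

    def f3(self):
        return 2


def f2():
    return 1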

I then gave the model a diff generated by another model that produces the correct output, and asked it to revise the prompt. It did (basically it simplified aider's prompt), and with the new prompt gpt-4-turbo was able to correctly solve the minimal test case. But it still failed on the larger cases. The simplified prompt in coders/udiff_prompts.py was:

system_reminder = """# File editing rules:
    
Return edits similar to unified diffs that `diff -U0` would produce, including:
- The first 2 lines with the file paths without timestamps.
- Start each hunk of changes with `@@ ... @@`.
- Mark all lines that need to be removed or changed with `-` and all new or modified lines with `+`.
- Indentation matters in the diffs.
- Start a new hunk for each section of the file that needs changes.
- Output hunks in whatever order makes the most sense.
- When editing a function, method, loop, etc., use a hunk to replace the entire code block.
- To move code within a file, use 2 hunks: one to delete it from its current location, one to insert it in the new location.
- To make a new file, show a diff from `--- /dev/null` to `+++ path/to/new/file.ext`.
"""

My questions:

  1. Did you publish the raw benchmark results somewhere? If not, would you consider it? I'd like to look at more of the failures and check what caused them.
  2. Assuming a large part of the failures are caused by the problem described above, do you think it's worthwhile to explore different prompts to "fix" these failures? Maybe you already have some idea what prompting technique to try to make the model better understand the udiff format.
paul-gauthier (Owner) commented

Thanks for trying aider and filing this issue.

I haven't published all the benchmark transcripts, but you can replicate them yourself; see the benchmark README:

https://github.com/paul-gauthier/aider/blob/main/benchmark/README.md

Yes, there is probably some prompt tuning that can help the newest GPT-4 turbo. I'm looking into it.

paul-gauthier added the question (Further information is requested) label on Apr 15, 2024
paul-gauthier (Owner) commented

I'm going to close this for now, given that GPT-4o seems better across the board. But feel free to add a comment here and I will re-open or file a new issue any time.
