**The following code only generates the patches and stores them in a file (preds.jsonl)**

Follow the steps below to evaluate the generated patches using SWE-bench:

1. *Clone the SWE-bench repository.*
2. *Install the required packages in editable mode (`pip install -e .`).*
3. *Start Docker.*
4. *Place this code file **outside** the cloned SWE-bench repository.*
5. *Open a terminal in the cloned SWE-bench repository.*
6. *Run the built-in SWE-bench evaluator using the following command:*

    ```bash
    python -m swebench.harness.run_evaluation \
        --dataset_name princeton-nlp/SWE-bench_Lite \
        --predictions_path ../preds.jsonl \
        --max_workers 8 \
        --run_id testing
    ```


In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.schema import HumanMessage
from typing import Dict
import json

import subprocess
import os



Create an LLM object using the Gemini 2.0 Flash model with the temperature set to 1e-3, which seemed to work best after trying multiple values.

In [None]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    google_api_key="",
    temperature=1e-3
)
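
The `google_api_key` argument above is left empty. One way to avoid pasting the key into the notebook is to read it from an environment variable. The sketch below is only an illustration: the helper name `load_api_key` and the variable name `GOOGLE_API_KEY` are assumptions, not part of the original code.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is unset or empty."""
    # Hypothetical helper: the variable name is an assumption for illustration.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the LLM.")
    return key
```

With this in place, the cell above could pass `google_api_key=load_api_key()` instead of a literal string.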

This function refines a generated patch. The prompt instructs the LLM to return the patch in proper unified-diff format, without unnecessary newlines or whitespace, to prevent "malformed patch" errors during evaluation. The function takes the patch and the relevant source code as input. If the patch already starts with `--- a/` and contains an `@@` hunk marker, it is returned unchanged; otherwise the LLM's corrected patch is returned.

In [3]:
def refine_patch_format(patch,code):
    """
    Ensures a patch is in valid unified-diff format. If malformed, uses an LLM to correct it.
    """
    if not patch.startswith('--- a/') or '@@' not in patch:
        prompt = f"""You are a software engineer fixing an automated code patch.

The following patch is malformed or incomplete: {patch}
Here is the original code which needs to be patched {code}

Please return a corrected patch using valid unified-diff format:
- Make sure the whitespaces match the original file
- Make sure the context lines of the patch point to the same lines in the code
- It must start with '--- a/...' and '+++ b/...'
- It must contain at least one valid hunk beginning with '@@'
- Do not include explanations, comments, markdown, or any text outside the patch.
- Your response must be a pure, corrected unified diff.
- It should be in raw unified-diff format, without any markdown wrapping

Be as minimalistic as possible, but do not let it affect correctness. Cross-verify with the input and try to remove redundant lines of code.
Return a valid patch only."""
        feedback = llm.invoke([HumanMessage(content=prompt)]).content
        return feedback
    return patch
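
The format check used above can be isolated and paired with a cheap pre-processing step. This is a minimal sketch, assuming the most common failure is the model wrapping its diff in a markdown fence; `strip_markdown_fence` and `looks_like_unified_diff` are hypothetical helpers, not part of the original pipeline.

```python
def strip_markdown_fence(text: str) -> str:
    """Remove a leading ```lang line and a trailing ``` line, if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    # A patch file should end with a newline
    return "\n".join(lines) + "\n"


def looks_like_unified_diff(patch: str) -> bool:
    """Same heuristic as refine_patch_format: a file header plus at least one hunk marker."""
    return patch.startswith("--- a/") and "@@" in patch


raw = "```diff\n--- a/x.py\n+++ b/x.py\n@@ -1 +1 @@\n-a\n+b\n```"
clean = strip_markdown_fence(raw)
print(looks_like_unified_diff(raw), looks_like_unified_diff(clean))  # False True
```

Stripping the fence first could save an LLM call whenever the wrapper is the only problem, although the heuristic still says nothing about whether the hunks actually apply.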


Function to verify that the line numbers being changed by the patch are accurate. The system prompt instructs the LLM to check the given patch against the original code and ensure that the code being changed by the patch is actually present in the file at the specified line numbers. This helps to ensure that the patch is properly applied during evaluation.

In [4]:
def check_line_numbers(instance: Dict,llm,patch,code):
    prompt=f"""
You are working with a group of software engineers. They have generated a patch {patch} for the problem statement {instance['problem_statement']}.

Your job is to make sure their code has the correct line numbers and the patch actually runs. Do not try to change the code or its logic, just make it so that patch runs.

All the relevant info for the problem statement is here {instance}.
Here is the code {code}. 
Do the above changes for the patch {patch}
    """
    return llm.invoke([HumanMessage(content=prompt)]).content
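
The line-number check can also be done deterministically, without an LLM. The sketch below is an illustration rather than the method used above: it assumes a single-file patch and verifies that every context line and removed line appears in the source at the position its hunk header claims. The function name `hunks_match_source` is hypothetical.

```python
import re

# Matches hunk headers such as "@@ -12,7 +12,8 @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")


def hunks_match_source(patch: str, source: str) -> bool:
    """Return True if every context/removed line sits where its hunk header says."""
    src = source.splitlines()
    pos = None  # 0-based index into src; None until the first hunk header is seen
    for line in patch.splitlines():
        m = HUNK_RE.match(line)
        if m:
            pos = int(m.group(1)) - 1
            continue
        if pos is None or line.startswith("\\"):
            continue  # file headers, or "\ No newline at end of file"
        if line.startswith("+"):
            continue  # added lines do not exist in the original file
        # Context (" ") and removed ("-") lines must match the original text
        if not 0 <= pos < len(src) or src[pos] != line[1:]:
            return False
        pos += 1
    return pos is not None  # a patch without any hunk is rejected


source = "a\nb\nc\n"
good = "--- a/f.py\n+++ b/f.py\n@@ -2,2 +2,2 @@\n-b\n+B\n c\n"
bad = "--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n-b\n+B\n c\n"
print(hunks_match_source(good, source), hunks_match_source(bad, source))  # True False
```

A check like this could decide whether another LLM pass is needed at all, instead of always running a fixed number of passes.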


This function returns a fully formed patch for the given instance. The initial prompt gives the LLM the problem statement, the original code, and the relevant test cases (`FAIL_TO_PASS` and `PASS_TO_PASS`), and instructs it to generate a patch satisfying all the specified conditions.
The initial patch is then passed through the reflection loop three times (`max_reflections`), with the output of each pass fed into the next. The reflection prompt gives the LLM the same information and asks it to recheck the patch for logical or formatting errors.
The output of the reflection loop is then passed through the check_line_numbers function three times in the same way.
Finally, the result is passed to the refine_patch_format function described above, and its output is returned as the final patch.

In [5]:
def generate_patch(instance: Dict, code, llm, max_reflections: int = 3):
    #Initial system prompt
    """
    Generate a patch with iterative self-reflection using an LLM.
    """

    problem = instance['problem_statement']
    failtopass = instance['FAIL_TO_PASS']
    passtopass = instance['PASS_TO_PASS']

    initial_prompt = f"""You are an expert software engineer.

Below is a problem description from a GitHub issue:
{problem}

Here is the code you have to edit {code}
These are the errors which you should check for: {failtopass}
These SHOULD NOT fail: {passtopass}
Check all of the above one by one.

Generate a fix for this issue in valid unified-diff format.

Make sure to fix each and every error. Go through all the lines pertaining to the problem and check. Try simulating a test case yourself to see if the code you have written gives correct output.
Make sure to check the file path and make sure it is correct to avoid context mismatch errors.

Your output must:
- Be in pure raw unified diff format. There should be no wrappers like the ```
- Begin with '--- a/...' and '+++ b/...'
- Contain at least one '@@' hunk
- NOT include explanations, comments, or markdown
- Be syntactically correct
- You may add code if it's absolutely necessary (like a missing write function for a read function)
- Make sure the code which you are changing (the - part in the code) is actually there
- Check if the indices of the line of code and the actual code in the repository actually match

If the above does not work, try again.
Here is the code you have to edit {code}


Output only the patch."""
    current_patch = llm.invoke([HumanMessage(content=initial_prompt)]).content

    #Reflection loop
    for i in range(max_reflections):
        reflection_prompt = f"""You are reviewing the following patch:
{current_patch}
The patch was intended to fix this issue: {problem}
These are the errors which you should check for: {failtopass}
These SHOULD NOT fail: {passtopass}
Check all of the above one by one.
Here is the code you have to edit {code}

Make sure to fix each and every error. Go through all the lines pertaining to the problem and check. Try simulating a test case yourself to see if the code you have written gives correct output.

Check:
- Be in pure raw unified diff format. There should be no wrappers like the ```
- Does it correctly address the issue?
- Is it syntactically valid unified-diff (starts with --- a/, contains @@)?
- Is every edit/newline leading with a + or -?
- Can it be applied cleanly (no formatting or logical errors)?
- Make sure the code which you are changing (the - part in the code) is actually there
- Check if the indices of the line of code and the actual code in the repository actually match
- You may add code if it's absolutely necessary (like a missing write function for a read function)

Make sure that when you generate a patch, the changed code module names, variable names, everything matches how it was before so that there is no error. Check for line numbers and everything very thoroughly.
Here is the code you have to edit {code}

These are the errors which you should check for: {failtopass}
These SHOULD NOT fail: {passtopass}
Check all of the above one by one.
Make sure to fix each and every error. Go through all the lines pertaining to the problem and check. Try simulating a test case yourself to see if the code you have written gives correct output.
Make sure to check the file path and make sure it is correct to avoid context mismatch errors.
If any issue is found, rewrite the patch and return only the corrected unified diff. Do not include any explanations or markdown."""
        current_patch = llm.invoke([HumanMessage(content=reflection_prompt)]).content

    #Context matching
    for i in range(max_reflections):
        current_patch = check_line_numbers(instance,llm,current_patch,code)

    #Patch refining
    current_patch = refine_patch_format(current_patch,code)

    #Final patch is returned
    return current_patch
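
To make the control flow of the pipeline easier to follow, here is a minimal sketch that replaces the real model with a stub and records the order of the calls. The names `run_pipeline` and `fake_llm` and the shortened prompt labels are hypothetical; only the sequence of stages mirrors generate_patch.

```python
from typing import Callable, List


def run_pipeline(ask: Callable[[str], str], rounds: int = 3) -> str:
    """Initial draft, `rounds` reflections, `rounds` line checks, optional refine."""
    patch = ask("initial")
    for _ in range(rounds):
        patch = ask(f"reflect:{patch}")
    for _ in range(rounds):
        patch = ask(f"check lines:{patch}")
    # Same condition as refine_patch_format: only call the model if malformed
    if not patch.startswith("--- a/") or "@@" not in patch:
        patch = ask(f"refine:{patch}")
    return patch


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stub model: records the stage label and always returns a well-formed diff."""
    calls.append(prompt.split(":")[0])
    return "--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-a\n+b\n"


result = run_pipeline(fake_llm)
print(len(calls))  # 7
```

With the default of three rounds, each instance therefore costs seven model calls when the final patch is well formed and eight when the refinement step is triggered.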




The file_extractor function receives an instance as input. It uses the "repo" field to locate the local clone of the repository, which is expected in the current working directory under the repository's name (with one hard-coded exception for pallets/flask). The name of the first `.py` file is extracted from the "patch" field using string slicing. These two components are joined to form the full path of the file the model needs to access.
The "base_commit" field specifies the commit the issue was reported against. The `git checkout` command switches the local clone to that commit so that the model sees the correct version of the file.
Once the checkout is complete, the file is read and its full contents are returned.

In [6]:
def file_extractor(instance):
    #Extracting repository name
    _, _, tail = instance['repo'].partition("/")
    repo_path = "./" + tail

    #Handling unusual repository name
    if(instance['repo'] == 'pallets/flask'):
        repo_path = "./my-flask-project"

    #Extracting commit version
    commit_ver  = instance["base_commit"]    

    #Extracting file name and forming file path
    s = instance['patch']
    substring_to_first_py = s[:s.find('.py') + 3]
    file_rel = substring_to_first_py.replace("diff --git a/","",1).strip()

    #Git checkout command
    subprocess.run(
        ["git", "checkout", commit_ver],
        cwd=repo_path,
        check=True,
    )

    #Reading and returning file contents
    file_path = os.path.join(repo_path, file_rel)
    with open(file_path, "r") as f:
        contents = f.read() 

    return contents
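
The slicing approach above relies on the first occurrence of `.py` in the patch text. As an alternative illustration, the path could be read from the `diff --git` header with a regular expression. This is a sketch, assuming the patch starts with a standard git header and the path contains no spaces; `first_changed_file` is a hypothetical helper.

```python
import re
from typing import Optional


def first_changed_file(patch: str) -> Optional[str]:
    """Return the path from the first 'diff --git a/<path> b/<path>' header, if any."""
    m = re.search(r"^diff --git a/(\S+) b/\S+", patch, flags=re.MULTILINE)
    return m.group(1) if m else None


example = (
    "diff --git a/src/flask/app.py b/src/flask/app.py\n"
    "--- a/src/flask/app.py\n"
    "+++ b/src/flask/app.py\n"
)
print(first_changed_file(example))  # src/flask/app.py
```

Unlike the slicing version, this does not depend on the file extension, and it returns `None` instead of a truncated string when no header is found.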

The SWE-bench Lite dataset is loaded using the Hugging Face `datasets` library. Only the test split is loaded; the dev split is not needed. The "hints_text" and "test_patch" columns are dropped because the model does not need them.
The remaining column names are printed for reference.

In [7]:
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
dataset = dataset.remove_columns("test_patch")
dataset = dataset.remove_columns("hints_text")

print(dataset.column_names)

['repo', 'instance_id', 'base_commit', 'patch', 'problem_statement', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit']


Each of the 300 instances is extracted and passed to the generate_patch function. Instances whose source file cannot be read are skipped.

In [8]:
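predictions = []
for instance in dataset:
    try:
        code = file_extractor(instance)
    except Exception as exc:
        # Skip instances whose source file cannot be checked out or read
        print(f"Skipping {instance['instance_id']}: {exc}")
        continue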

    patch = generate_patch(instance,code,llm)
    predictions.append({
        "instance_id": instance["instance_id"],
        "model_name_or_path": "repopotamus",
        "model_patch": patch
    })


Write all the patches, along with their instance IDs, to the prediction file to be evaluated.

In [None]:
with open("preds.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")
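
Before starting a long evaluation run, it may be worth confirming that the file is well formed. The sketch below assumes the only requirement is that each line is a JSON object with the three keys written above; `check_predictions` is a hypothetical helper and not part of SWE-bench.

```python
import json

REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def check_predictions(path: str) -> int:
    """Return the number of records; raise ValueError if a record lacks a required key."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - set(record)
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            count += 1
    return count
```

Calling `check_predictions("preds.jsonl")` after the cell above should return the number of instances that were not skipped.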