## Refact evaluation on Python HumanEvalFix with pass@1 metric

Let's evaluate the [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim) model, which used the [CommitPackFt](https://huggingface.co/datasets/bigcode/commitpackft) dataset in its training!

We'll do the evaluation in the following way:
1. Generate the fixed versions of the functions
2. Create checker scripts that import the generated function from the fixed versions
3. Run the scripts and count how many succeeded
4. Finally, look at the results!

### Some technicalities

First, some technicalities. Just importing some stuff...

In [2]:
# some imports

import datasets
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
import numpy as np
from tqdm import tqdm
import subprocess
from subprocess import TimeoutExpired
from constants import HUMAN_EVAL_PACK_DATASET, HUMAN_EVAL_PACK_LANG, PYTHON, REFACT_MODEL
from utils import get_refact_prompt, parse_refact_response, get_checke_code, get_id
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Let's download the Python subset of the [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) dataset, which is used for HumanEvalFix...

In [3]:
hep_python = datasets.load_dataset(HUMAN_EVAL_PACK_DATASET, HUMAN_EVAL_PACK_LANG[PYTHON])['test']
print(hep_python)

Dataset({
    features: ['task_id', 'prompt', 'declaration', 'canonical_solution', 'buggy_solution', 'bug_type', 'failure_symptoms', 'entry_point', 'import', 'test_setup', 'test', 'example_test', 'signature', 'docstring', 'instruction'],
    num_rows: 164
})


Now we need to get a model...

In [4]:
checkpoint = REFACT_MODEL
device = "cuda" if torch.cuda.is_available() else "cpu"
refact_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
refact_model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

### Generating

Alright, now that we have everything we need, let me describe the process of generation:

As was suggested, I use the **experimental** chat prompt provided by the Refact model team:

```
<empty_output>SYSTEM {system} 
<empty_output>USER {user} 
<empty_output>ASSISTANT
```

It has a system prompt and a user prompt.

As for the system prompt, I haven't found one recommended by the Refact model team. So I created my own:

>You are a programming assistant that generates bug fixes for Python functions. A user will provide you with a function and will provide some testcases in the following format: '\<function definition\>[whitespace]\<testcase definition\>[whitespace]\<instruction to fix bugs\>'. You need to generate code of the correct implementation of the function. 

It seems like the result will depend a lot on the system prompt. So we'll try another one later to see what happens.

As for the user prompt, I took the format from the paper:
```
<buggy function>

<testcases>

Fix bugs in <buggy function name>.
```
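
To make the two templates above concrete, here is a minimal sketch of how they could be combined into one prompt string. The real logic lives in `get_refact_prompt` in `utils.py`, which is not shown here, so the names `build_user_prompt` and `build_chat_prompt`, the exact whitespace, and the choice of the `test` field for the test cases are all assumptions made for illustration.

```python
# Hypothetical sketch; the notebook itself uses utils.get_refact_prompt.
CHAT_TEMPLATE = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {user}\n"
    "<empty_output>ASSISTANT"
)


def build_user_prompt(item: dict) -> str:
    """Buggy function, then the test cases, then the fix instruction."""
    buggy_function = (item["declaration"] + item["buggy_solution"]).strip()
    testcases = item["test"].strip()  # assumption: the 'test' field holds the test cases
    return f"{buggy_function}\n\n{testcases}\n\nFix bugs in {item['entry_point']}."


def build_chat_prompt(item: dict, system_prompt: str) -> str:
    """Wrap the user prompt into the experimental chat format."""
    return CHAT_TEMPLATE.format(system=system_prompt, user=build_user_prompt(item))
```

`str.format` only interprets braces in the template itself, so curly braces inside the buggy function or the tests are passed through untouched.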

I generate an output from the model and take the first function from it. Sometimes the model generates several versions and the last one is cut off by the token limit. The resulting file contains a valid function but does not run as a whole, so I keep only the first function from the output.
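
The "take the first function" step could look roughly like the sketch below. The actual implementation is `parse_refact_response` in `utils.py` (not shown), which also has to strip the prompt and special tokens; `first_function` is a hypothetical name, and the sketch assumes that every version starts with a top-level `def` at column 0.

```python
def first_function(generated: str) -> str:
    """Keep everything up to (not including) the second top-level `def`.

    Imports and comments that precede the first function are kept.
    """
    kept = []
    seen_def = False
    for line in generated.splitlines():
        if line.startswith("def "):
            if seen_def:
                break  # a second version starts here; drop it and the rest
            seen_def = True
        kept.append(line)
    return "\n".join(kept).rstrip() + "\n"
```

This does not help when the first function itself is cut off by the token limit; such a file still fails at import time and is counted as a failed fix.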

We put the generated code in `gen{id}.py` files in `EVAL_DATA_DIR`, which is defined below. Later we will put the checker scripts `check{id}.py` there as well. Just for the sake of convenience...

In [5]:
EVAL_DATA_DIR = "eval_data"
if not os.path.exists(EVAL_DATA_DIR):
    os.mkdir(EVAL_DATA_DIR)

Then, let's write a function that will generate the fixed code using a model.

In [6]:
def generate_fix(item: dict, model, tokenizer, output_dir: str):
    """
    A function that generates a fixed version of a function and saves it to output_dir under the name "gen{id}.py", where id is an int value from item['task_id']
    :param tokenizer: A tokenizer to tokenize a prompt with
    :param model: A model to generate with
    :param item: an item from HumanEvalPack dataset to generate a fix for
    :param output_dir: a directory, where the generated files are stored
    """
    fix_prompt = get_refact_prompt(item)

    device = "cuda" if torch.cuda.is_available() else "cpu"

    inputs = tokenizer.encode(fix_prompt, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_length=max(1024, 2 * inputs.shape[1]), temperature=0.2)

    code = parse_refact_response(tokenizer.decode(outputs[0]))
    filename = f"gen{get_id(item)}.py"
    with open(os.path.join(output_dir, filename), "w") as file:
        file.write(code)

Now we go to Colab and run the following to get generations... (I kept the `eval_data` directory with generations in the repo, so that you don't have to do it)

In [None]:
for item in hep_python:
    generate_fix(item, refact_model, refact_tokenizer, EVAL_DATA_DIR)

Alright, we have our proposed fixes. Let's write checkers for them. Checkers are not complicated; they are just:
```
from gen{id} import {function_name}

{testcases}
```
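
As an illustration of this template, a checker source could be assembled as in the sketch below. The notebook's real helpers are `get_checke_code` and `get_id` from `utils.py`, which are not shown; `get_task_number` and `build_checker_source` are hypothetical stand-ins, and the `"Python/17"` shape of `task_id` is an assumption.

```python
def get_task_number(item: dict) -> int:
    # Assumption: task_id looks like "Python/17"; the notebook uses utils.get_id instead.
    return int(item["task_id"].split("/")[-1])


def build_checker_source(item: dict) -> str:
    """Fill the checker template: import the generated function, then run the tests."""
    module = f"gen{get_task_number(item)}"
    return f"from {module} import {item['entry_point']}\n\n{item['test']}\n"
```

Because each checker is later run as a script, Python puts the script's own directory first on `sys.path`, so `from gen{id} import ...` resolves as long as `gen{id}.py` sits next to `check{id}.py`.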

In [7]:
def create_checker(item: dict, output_dir: str):
    """
    A function that creates a checker script for an item given a directory where to put the checker script. The file is saved by name "check{id}.py", where id is an int value from item['task_id']
    :param item: An item from HumanEvalPack dataset to generate a script for
    :param output_dir: A directory to save the script to. Should contain gen{id}.py for a checker to work properly
    """
    code = get_checke_code(item)

    filename = f"check{get_id(item)}.py"
    with open(os.path.join(output_dir, filename), "w") as file:
        file.write(code)

In [8]:
for item in hep_python:
    create_checker(item, EVAL_DATA_DIR)

Finally, we can look at the results. Let's run all checkers and store the result of their completion. 

A result is successful if a checker finishes with exit code 0 within two seconds. The two-second threshold is introduced to cut off infinite loops.

In [11]:
def get_results(eval_dir):
    """
    Returns a numpy array of shape 164 (the number of python items in HumanEvalFix task). An array consists of 0s and 1s. A value at index i is 1 iff the checker with id i has successfully passed.
    :param eval_dir: A directory with checkers. Should contain all 164 checkers with filenames in a format "check{id}.py"
    :return: An array with results
    """
    result = np.zeros(len(hep_python)).astype(np.bool_)
    for id in tqdm(range(len(hep_python))):
        checker_filename = f"check{id}.py"
        checker_path = os.path.join(eval_dir, checker_filename)

        try:
            cur_result = subprocess.run(['python', checker_path], capture_output=True, text=True, timeout=2)

            result[id] = cur_result.returncode == 0
        except TimeoutExpired:
            print(f"Checker#{id} timed out")

    return result

In [12]:
os.putenv("TOKENIZERS_PARALLELISM",
          "false")  # a line to remove transformers warnings complaining about creating subprocesses

results = get_results(EVAL_DATA_DIR)

 93%|█████████▎| 152/164 [00:01<00:00, 85.26it/s]

Checker#156 timed out


100%|██████████| 164/164 [00:06<00:00, 27.08it/s]

Checker#160 timed out





(threshold turned out to be useful)

Let's look at the results!

In [13]:
print(f"Pass@1 score: {np.count_nonzero(results) / results.size}")

Pass@1 score: 0.07317073170731707


So, 7% pass@1 (12 of 164 tasks)... Not that impressive, although the Refact team claims 18% on the same task. I assume it depends significantly on the system prompt.
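
A side note on the metric: with a single generation per task, pass@1 is simply the fraction of tasks whose checker passed, which is what the cell above computes. The sketch below shows the more general unbiased pass@k estimator that is commonly used with HumanEval-style benchmarks and how it reduces to that fraction. The function names are mine, and I am not claiming that the reported 18% was computed with the same sampling settings.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(passed) -> float:
    """Average pass@1 over tasks when there is exactly one sample per task."""
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)
```

Calling `mean_pass_at_1(results)` gives the same number as `np.count_nonzero(results) / results.size`.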

So why not try changing it? (But first, let's also print the scores for each bug type.)

In [14]:
item_bug_types = np.array([item['bug_type'] for item in hep_python])
bug_types, bug_type_counts = np.unique(item_bug_types, return_counts=True)
type_to_id = {val: i for i, val in enumerate(bug_types)}

In [15]:
results_by_types = np.zeros(len(bug_types))
for i, result in enumerate(results):
    results_by_types[type_to_id[item_bug_types[i]]] += int(result)

In [16]:
result_percents = results_by_types / bug_type_counts
print("Result across bug types:")
for i, tpe in enumerate(bug_types):
    print(f"{tpe}{' ' * (max([len(t) for t in bug_types]) - len(tpe))}: {result_percents[i]}")

Result across bug types:
excess logic   : 0.0967741935483871
function misuse: 0.0
missing logic  : 0.030303030303030304
operator misuse: 0.12
value misuse   : 0.06818181818181818
variable misuse: 0.08695652173913043


It handled "operator misuse" best, and it didn't fix any of the "function misuse" bugs. The models evaluated in the paper also have operator misuse among their best categories, but their worst performance is on excess logic, which is not the case for the Refact model with the system prompt that I wrote.
![picture](../resources/results_across_types.png)

In [17]:
print("Number of items across bug types:")
for i, tpe in enumerate(bug_types):
    print(f"{tpe}{' ' * (max([len(t) for t in bug_types]) - len(tpe))}: {bug_type_counts[i]}")

Number of items across bug types:
excess logic   : 31
function misuse: 8
missing logic  : 33
operator misuse: 25
value misuse   : 44
variable misuse: 23


### New prompt!

Alright, let's create a better system prompt and achieve better results! (hopefully)

The second version of the prompt specifies rigorously the input and output formats. I also ask the model to generate a comment on how to fix the bug before writing a function:
> You are a programming assistant that generates bug fixes for Python functions. A user will provide you with a bugged function and will provide some testcases in the following format: '<bugged_function>[whitespace]<check_function>[whitespace]<fix_instruction>', where "bugged_function" is the function you need to corred, "check_function" is the function with testcases that the correct solution must pass, "fix_instruction" is the instruction to fix a bug.  
> 
> Your response should be in the following format: '<imports>[whitespace]<description_comment>[whitespace]<correct implementation>', where "imports" are required imports, "description_comment" is a python comment that describes a bug and a way to fix it with natural language, "correct_implementation" is the correct implementation of a bugged function.
> 
> Before generating the correct function, you should describe a bug and how to fix it in a natural language Python comment.

Let's see how it performs this time! (I kept the `eval_data_new_prompt` directory with generations in the repo, so that you don't have to generate them)

In [18]:
EVAL_DATA_NEW_DIR = "eval_data_new_prompt"

In [19]:
for item in hep_python:
    create_checker(item, EVAL_DATA_NEW_DIR)

In [20]:
os.putenv("TOKENIZERS_PARALLELISM",
          "false")  # a line to remove transformers warnings complaining about creating subprocesses

new_results = get_results(EVAL_DATA_NEW_DIR)

 95%|█████████▍| 155/164 [00:02<00:00, 81.59it/s]

Checker#156 timed out


100%|██████████| 164/164 [00:06<00:00, 26.56it/s]

Checker#160 timed out





In [22]:
print(f"Pass@1 score: {np.count_nonzero(new_results) / new_results.size}")

Pass@1 score: 0.024390243902439025


Cool! With the new prompt the results are three times worse (4 of 164 tasks instead of 12)!

At least it proves that the results depend on the prompt a lot...
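
Since both runs cover the same 164 tasks, it can also be informative to see which tasks flipped between the two prompts rather than only the totals. Here is a minimal sketch; `compare_runs` is a hypothetical helper, not something from `utils.py`.

```python
def compare_runs(old, new) -> dict:
    """Count tasks by outcome under two prompts (True means the checker passed)."""
    pairs = list(zip(old, new))
    return {
        "both": sum(1 for o, n in pairs if o and n),
        "only_old": sum(1 for o, n in pairs if o and not n),
        "only_new": sum(1 for o, n in pairs if n and not o),
        "neither": sum(1 for o, n in pairs if not o and not n),
    }
```

It accepts plain lists as well as the boolean numpy arrays returned by `get_results`, for example `compare_runs(results, new_results)`.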

Here are the results with the new prompt across different bug types...

In [23]:
new_results_by_types = np.zeros(len(bug_types))
for i, result in enumerate(new_results):
    new_results_by_types[type_to_id[item_bug_types[i]]] += int(result)

In [24]:
new_result_percents = new_results_by_types / bug_type_counts
for i, tpe in enumerate(bug_types):
    print(f"{tpe}{' ' * (max([len(t) for t in bug_types]) - len(tpe))}: {new_result_percents[i]}")

excess logic   : 0.03225806451612903
function misuse: 0.0
missing logic  : 0.0
operator misuse: 0.08
value misuse   : 0.0
variable misuse: 0.043478260869565216


### Looking at generations

Let's look at the generations and try to understand what went wrong!

In [25]:
def get_generations(required_bug_type: str, required_result: bool, check_results: list, gen_dir: str,
                    return_indexes: bool = False):
    """
    A function that returns a list of strings: generations for the required bug type and required result.
    :param required_bug_type: A required bug type
    :param required_result: A required result
    :param check_results: A list of results
    :param gen_dir: A directory, where generations lie 
    :param return_indexes: If true, ids are returned as well
    :return: A list of matching generations
    """
    gens = []
    gen_ids = []
    for i in range(len(hep_python)):
        if check_results[i] == required_result and required_bug_type == item_bug_types[i]:
            with open(os.path.join(gen_dir, f"gen{i}.py"), "r") as file:
                gens.append(file.read())
            gen_ids.append(i)

    if return_indexes:
        return gens, gen_ids
    return gens


#### What went wrong with "function misuse"

First let's look at some generations for "function misuse" class:

In [26]:
fm_gens, fm_ids = get_generations("function misuse", False, results, EVAL_DATA_DIR, return_indexes=True)

In [27]:
print(("\n" + "=" * 100 + "\n").join(fm_gens[:3]))

def flip_case(string: str) -> str:
    return string.lower()
from typing import List



def filter_by_prefix(strings: List[str], prefix: str) -> List[str]:
    return [x for x in strings if x.endswith(prefix)]
def digitSum(s):
    if s == "": return 0
    return sum(ord(char) if char.islower() else 0 for char in s)


In [28]:
def get_buggy_function(item: dict) -> str:
    """
    Returns the buggy function (declaration plus buggy body) for an item from Python HumanEvalPack dataset
    :param item: An item to get a function from
    :return: A string representing a function definition
    """
    return item['declaration'] + item['buggy_solution']

In [29]:
fm_buggy = [get_buggy_function(hep_python[id]).strip() for id in fm_ids]

print(("\n" + "=" * 100 + "\n").join(fm_buggy[:3]))

def flip_case(string: str) -> str:
    return string.lower()
from typing import List


def filter_by_prefix(strings: List[str], prefix: str) -> List[str]:
    return [x for x in strings if x.endswith(prefix)]
def digitSum(s):
    if s == "": return 0
    return sum(ord(char) if char.islower() else 0 for char in s)


Seems like it kept the solutions the same. How many functions did it change?

In [69]:
def remove_comments(impl: str) -> str:
    """
    Removes trivial comments from the implementation
    :param impl: An implementation to remove the comments from
    :returns: An implementation cleared from trivial comments
    """
    lines = [line for line in impl.split("\n") if not line.strip().startswith("#")]
    docstring_begin = -1
    docstring_end = -1
    for i in range(len(lines)):
        if '"""' in lines[i]:
            docstring_end = i
            if docstring_begin == -1:
                docstring_begin = i
    if docstring_begin != -1:
        lines = lines[:docstring_begin] + lines[docstring_end + 1:]
    return "\n".join(lines).strip()


def is_same(impl1: str, impl2: str) -> bool:
    """
    A function that determines, whether two implementations are the same character-wise. It pays attention only to meaningful characters.
    
    :param impl1: One implementation
    :param impl2: Other implementation
    :returns: True if both implementations are the same
    """

    clear_impl1 = remove_comments(impl1)
    splitted_impl1 = clear_impl1.split("def")

    clear_impl2 = remove_comments(impl2)
    splitted_impl2 = clear_impl2.split("def")

    if len(splitted_impl1) < 2 or len(splitted_impl2) < 2:
        return False
    return splitted_impl1[1].split() == splitted_impl2[1].split()

In [66]:
print("Number of function that were left the same across bug types (only failed fixes are considered):")

total_same = 0
total_wrong = 0
for tpe in bug_types:
    tpe_gens, tpe_ids = get_generations(tpe, False, results, EVAL_DATA_DIR, return_indexes=True)
    equal = [is_same(tpe_gens[k], get_buggy_function(hep_python[tpe_ids[k]])) for k in range(len(tpe_gens))]
    total_same += np.count_nonzero(equal)
    total_wrong += len(tpe_gens)
    print(f"{tpe}{' ' * (max([len(t) for t in bug_types]) - len(tpe))}: {np.count_nonzero(equal)}/{len(tpe_gens)}")
print(f"TOTAL SAME: {total_same}/{total_wrong}")

Number of function that were left the same across bug types (only failed fixes are considered):
excess logic   : 24/28
function misuse: 7/8
missing logic  : 27/32
operator misuse: 16/22
value misuse   : 36/41
variable misuse: 19/21
TOTAL SAME: 129/152


So the main reason for failure is that the model left the functions unchanged (129 of the 152 failed fixes).
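
The text-based comparison above is sensitive to details such as spacing inside expressions (`x+1` versus `x + 1` split into different tokens). As a possible alternative, and not what the numbers above were computed with, the sketch below compares the syntax trees of the first top-level function in each source, ignoring comments, formatting and docstrings. The names `is_same_ast` and `first_function_dump` are hypothetical.

```python
import ast


def _strip_docstrings(tree: ast.AST) -> ast.AST:
    """Remove a leading string expression (docstring) from every function body."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                node.body = body[1:] or [ast.Pass()]
    return tree


def first_function_dump(source: str):
    """Return a structural dump of the first top-level function, or None."""
    try:
        tree = _strip_docstrings(ast.parse(source))
    except SyntaxError:
        return None  # e.g. a generation that was cut off by the token limit
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            return ast.dump(node)  # no line numbers by default
    return None


def is_same_ast(impl1: str, impl2: str) -> bool:
    dump1, dump2 = first_function_dump(impl1), first_function_dump(impl2)
    return dump1 is not None and dump1 == dump2
```

Sources that do not parse are treated as "not the same", so this variant may report slightly different counts than `is_same`.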

### What went wrong with a new prompt?

First, note that the new prompt changed the behaviour of the model.

It started generating docstrings (some of them look correct to me) for a lot of functions:

```
def encode_shift(s: str):
    """
    returns encoded string by shifting every character by 5 in the alphabet.
    """
    return "".join([chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s])
```

But this didn't lead to code changes: the functions are mostly kept the same as well (checked by eye).

Sometimes it even did what I asked it to do (generate a comment on how to fix a bug), but in a different order.
In the following example it didn't change the code, but its comment points at the condition that contains the bug:

```
def fizz_buzz(n: int):
    ns = []
    for i in range(n):
        if i % 11 == 0 and i % 13 == 0:
            ns.append(i)
    s = ''.join(list(map(str, ns)))
    ans = 0
    for c in s:
        ans += (c == '7')
    return ans

# The bug in fizz_buzz is that it does not correctly handle numbers that are multiples of both 11 and 13.
# To fix this, we need to add an additional condition to the if statement that checks if the current number is a multiple of both 11 and 13.
# We can do this by adding an else statement that appends the current number to the list if it is not a multiple of either 11 or 13.
```

With the original prompt, the model failed on this last example as well. But now it "sees" the bug, which suggests that with some smart prompt corrections the results may be much better.
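
For reference, and assuming this is the standard HumanEval `fizz_buzz` task (count the digit 7 in integers below `n` that are divisible by 11 or 13), the minimal repair is a one-token change in the condition. The sketch below is my own rewrite, not a model generation.

```python
def fizz_buzz(n: int) -> int:
    ns = []
    for i in range(n):
        # `or` instead of the buggy `and`: divisible by 11 or by 13
        if i % 11 == 0 or i % 13 == 0:
            ns.append(i)
    s = ''.join(map(str, ns))
    # count how many characters in the concatenated digits are '7'
    return sum(c == '7' for c in s)
```

The model's comment talks about the right condition, but the `else` branch it proposes is not this change.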

And, finally, just for the sake of interest, some functions that were correct with the original prompt and that are incorrect with the current one:

In [175]:
fake_results = np.logical_and(results, np.logical_not(new_results))

N = 3
count = 0

for tpe in bug_types:
    tpe_orig = get_generations(tpe, True, fake_results, EVAL_DATA_DIR)
    tpe_new = get_generations(tpe, True, fake_results, EVAL_DATA_NEW_DIR)
    for i in range(len(tpe_orig)):
        count += 1
        print(f"{'=' * 40} ORIGINAL PROMPT {'=' * 40}")
        print(tpe_orig[i])
        print(f"{'=' * 40} NEW PROMPT {'=' * 40}")
        print(tpe_new[i])
        print("=" * 100)
        if count == N:
            break
    if count == N:
        break

from typing import List



def all_prefixes(string: str) -> List[str]:
    result = []

    for i in range(len(string)):
        result.append(string[:i+1])
    return result
from typing import List



def all_prefixes(string: str) -> List[str]:
    """
    Returns a list of all possible prefixes of a given string.

    Example:
    >>> all_prefixes('')
    []
    >>> all_prefixes('asdfgh')
    ['a', 'as', 'asd', 'asdf', 'asdfg', 'asdfgh']
    >>> all_prefixes('WWW')
    ['W', 'WW', 'WWW']
    """
    result = []

    for i in range(len(string)-1):
        result.append(string[:i+1])
    return result
def digits(n):
    product = 1
    odd_count = 0
    for digit in str(n):
        int_digit = int(digit)
        if int_digit%2 == 1:
            product*= int_digit
            odd_count+=1
    if odd_count ==0:
        return 0
    else:
        return product
def digits(n):
    product = 1
    odd_count = 0
    for digit in str(n):
        int_digit = int(digit)
        if int_digit%2 

### Ok, I returned in a few hours, unsatisfied that the new prompt performed worse

Let's create a ***very*** new prompt!!!

Here's what changes (a lot, actually):
* Now it's a one-shot generation: I included an example that I wrote by hand.
* I don't specify the format that rigorously, hoping that the model will derive the format from the example.
* I explicitly tell the model not to forget to change the function and not to forget that the function should compile.

Here is the full prompt: (I noticed a typo in a "here is an example" sentence too late)
```
You're a bug fixing model. User will provide you a bugged python function and testcases function named "check". You need to generate a python file containing the corrected version of a python function. 
Before the correct function you need to reason why there is a bug in the function and how to correct it. The reasoning should be in a form of python comments.
Pay attention that you actually correct the function. Also pay attention to keeping all imports. The code must be able to compile.
Here is an example of your a conversation:

<empty_output>USER
from typing import List

def descending_sort(a: List[int]) -> List[int]:
    return sorted(a)

def check(descending_sort):
    assert descending_sort([1, 2, 3, 4, 5]) == [5, 4, 3, 2, 1]
    assert descending_sort([1]) == [1]
    assert descending_sort([2, 5, 1, 9]) == [9, 5, 2, 1]

check(descending_sort)

Fix bugs in descending_sort
<empty_output>ASSISTANT
from typing import List

# the bug in descending_sort is that the sorting happens in a wrong order
# the sort should be in a descendent order, but the bugged function sorts the list in an ascending order
# to fix it we need to slice the array inversely after sorting
# here is the correct implementation
def descending_sort(a: List[int]) -> List[int]:
    return sorted(a)[::-1]
<empty_output><end of sequence>
```

Let's see if the new prompt would be better!

In [40]:
EVAL_DATA_VERY_NEW_DIR = "eval_data_very_new_prompt"

In [41]:
for item in hep_python:
    create_checker(item, EVAL_DATA_VERY_NEW_DIR)

In [42]:
os.putenv("TOKENIZERS_PARALLELISM",
          "false")  # a line to remove transformers warnings complaining about creating subprocesses

very_new_results = get_results(EVAL_DATA_VERY_NEW_DIR)

 95%|█████████▌| 156/164 [00:02<00:00, 53.71it/s]

Checker#156 timed out


100%|██████████| 164/164 [00:07<00:00, 23.36it/s]

Checker#160 timed out





In [43]:
print(f"Pass@1 score: {np.count_nonzero(very_new_results) / very_new_results.size}")

Pass@1 score: 0.09146341463414634


### Nice! At least it's better than the very first prompt!

In [44]:
very_new_results_by_types = np.zeros(len(bug_types))
for i, result in enumerate(very_new_results):
    very_new_results_by_types[type_to_id[item_bug_types[i]]] += int(result)

In [45]:
very_new_result_percents = very_new_results_by_types / bug_type_counts
for i, tpe in enumerate(bug_types):
    print(f"{tpe}{' ' * (max([len(t) for t in bug_types]) - len(tpe))}: {very_new_result_percents[i]}")

excess logic   : 0.06451612903225806
function misuse: 0.25
missing logic  : 0.030303030303030304
operator misuse: 0.08
value misuse   : 0.09090909090909091
variable misuse: 0.17391304347826086


And the scores across bug types are a bit different as well. Here are the scores for the first prompt to compare:
```
Result across bug types:
excess logic   : 0.0967741935483871
function misuse: 0.0
missing logic  : 0.030303030303030304
operator misuse: 0.12
value misuse   : 0.06818181818181818
variable misuse: 0.08695652173913043
```

We lost a bit in some places, but then significantly improved the "misuse" types (apart from "operator").
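
To make the comparison easier to read, the two sets of numbers printed above can be put side by side with a few lines of plain Python. The values are copied from the outputs in this notebook; the dictionary names are just for illustration.

```python
first_prompt = {
    "excess logic": 0.0967741935483871,
    "function misuse": 0.0,
    "missing logic": 0.030303030303030304,
    "operator misuse": 0.12,
    "value misuse": 0.06818181818181818,
    "variable misuse": 0.08695652173913043,
}
very_new_prompt = {
    "excess logic": 0.06451612903225806,
    "function misuse": 0.25,
    "missing logic": 0.030303030303030304,
    "operator misuse": 0.08,
    "value misuse": 0.09090909090909091,
    "variable misuse": 0.17391304347826086,
}

# positive delta means the very new prompt did better on that bug type
deltas = {t: very_new_prompt[t] - first_prompt[t] for t in first_prompt}

print(f"{'bug type':<15}  first  very new  delta")
for bug_type, delta in deltas.items():
    print(f"{bug_type:<15}  {first_prompt[bug_type]:.3f}  {very_new_prompt[bug_type]:8.3f}  {delta:+.3f}")
```

Keep in mind that some categories are small ("function misuse" has only 8 items), so a single extra fix moves the score a lot.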

Let's look at how our instruction on making functions different affected the situation.

In [73]:
print("Number of function that were left the same across bug types (only failed fixes are considered):")

total_same = 0
total_wrong = 0
for tpe in bug_types:
    tpe_gens, tpe_ids = get_generations(tpe, False, very_new_results, EVAL_DATA_VERY_NEW_DIR, return_indexes=True)
    equal = [is_same(tpe_gens[k], get_buggy_function(hep_python[tpe_ids[k]])) for k in range(len(tpe_gens))]
    # print("="*100)
    # print(tpe_gens[0])
    # print("="*50)
    # print(get_buggy_function(hep_python[tpe_ids[0]]))
    # print("="*100)
    total_same += np.count_nonzero(equal)
    total_wrong += len(tpe_gens)
    print(f"{tpe}{' ' * (max([len(t) for t in bug_types]) - len(tpe))}: {np.count_nonzero(equal)}/{len(tpe_gens)}")
print(f"TOTAL SAME: {total_same}/{total_wrong}")

Number of function that were left the same across bug types (only failed fixes are considered):
excess logic   : 9/29
function misuse: 2/6
missing logic  : 12/32
operator misuse: 7/23
value misuse   : 16/40
variable misuse: 13/19
TOTAL SAME: 59/149


Very good! Now the model changes the functions far more often (59 of 149 failed fixes were left unchanged, down from 129 of 152)!

With this, keeping the function the same is no longer the major issue. Now the major issue is changing a function incorrectly.

Looking at the generations (the reader can find them at the end of this notebook), it seems like the model often misses the actual bug.
For example, here:
```
from typing import List

# the bug in has_close_elements is that it checks for the distance between each element and the next element
# it should check for the distance between each element and the previous element
# here is the correct implementation

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx!= idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False
```

The actual bug here is that the distance is not taken as an absolute value, so it can be negative. And the explanation in the comments is nonsense.
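
For comparison, here is a sketch of what a working fix looks like: the only change is wrapping the difference in `abs()`. This is my own correction, not a model output.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                # abs() makes the distance non-negative, so a large gap
                # in the "wrong" direction no longer counts as close
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False
```

Without `abs()`, a pair like `1.0` and `2.0` gives a distance of `-1.0`, which is below any positive threshold, so the buggy version returns `True` for almost every input.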

---

Sometimes (although very rarely) the model gets confused when generating comments. In this case it didn't generate a function at all:
```
from typing import List

# the bug in fib4 is that the function is not generating the correct sequence
# the function is generating the sequence [0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0
```

---

And sometimes (rarely, but it happens) the model doesn't generate any comments:
```
from typing import List


def triangle_area(a, h):
    return a * h / 2
```

P.S. With the "very new" prompt I also kept the raw generations in files "raw_gen{id}.py", just for the sake of interest.

P.P.S. If you want to change the prompt used for generation, replace the `REFACT_SYSTEM_PROMPT` constant with `REFACT_SYSTEM_VERY_NEW_PROMPT` in the `get_refact_prompt` function in `utils.py`.


### All generations with the last prompt listed!

In [72]:
for i in range(164):
    with open(os.path.join("eval_data_very_new_prompt", f"gen{i}.py"), "r") as file:
        code = file.read()
        print("=" * 100)
        print(code)

from typing import List

# the bug in has_close_elements is that it checks for the distance between each element and the next element
# it should check for the distance between each element and the previous element
# here is the correct implementation

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx!= idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False
from typing import List

# the bug in separate_paren_groups is that it does not handle nested parentheses correctly
# it only checks for the first level of nesting
# to fix it we need to use a stack to keep track of the current level of nesting
# here is the correct implementation

def separate_paren_groups(paren_string: str) -> List[str]:
    result = []
    current_string = []
    current_depth = 0
    stack = []

    for c