# Analyzing Results of Generation with Nuggets

### Meher Mankikar, 6/3/24

### Purpose

This notebook analyzes the initial results of performing generation with Nuggets. The goal is to see whether the tasks that benefit from Nuggets share any common patterns.

Note: when running cells that have long output (code examples), run the cell and then click "View as scrollable element" for easier viewing. 

### Setup

For all experiments, the following task was used as the one-shot example. The provided solution was varied to change the quality of the sample (good, decent, or bad). For example, to create a bad-quality sample, the solution was replaced with near gibberish. Some characteristics of this sample task to note:

- Solution requires some mathematical statements
- Solution requires iterating over some range
- Correct/High Quality solution uses a list comprehension with a conditional statement

```
def generate_integers(a, b): 
    """ Given two positive integers a and b, return the even digits between a and b, in ascending order. 
    For example: 
    generate_integers(2, 8) => [2, 4, 6, 8] 
    generate_integers(8, 2) => [2, 4, 6, 8] 
    generate_integers(10, 14) => [] 
    """
```

##### Good Quality Solution

```
lower = max(2, min(a, b)) 
upper = min(8, max(a, b)) 
return [i for i in range(lower, upper+1) if i % 2 == 0]
```

##### Decent Quality Solution

```
lower = 2
upper = 8
return [i if i % 2 = 0 for i in range(lower, upper)]
```

##### Bad Quality Solution

```
cat: cat cat
dog dog dog;
return [giraffe if giraffe for giraffe in giraffe]
```
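To make the quality levels concrete, here is a minimal sketch that wraps the good-quality body in the task's function header and runs it on the docstring examples. It also checks whether the decent-quality body compiles. The helper `parses` and the string `decent_source` are hypothetical names introduced only for this illustration. That the decent sample is meant to be syntactically broken is an assumption based on how it is written above (`=` instead of `==`, and a conditional expression with no `else`).

```python
def generate_integers(a, b):
    """Good-quality sample solution wrapped in the task's function header."""
    lower = max(2, min(a, b))
    upper = min(8, max(a, b))
    return [i for i in range(lower, upper + 1) if i % 2 == 0]


# The decent-quality body, kept as text because it is not valid Python as written.
decent_source = (
    "def generate_integers(a, b):\n"
    "    lower = 2\n"
    "    upper = 8\n"
    "    return [i if i % 2 = 0 for i in range(lower, upper)]\n"
)


def parses(source):
    """Return True if the source compiles, False if it raises a SyntaxError."""
    try:
        compile(source, "<sample>", "exec")
    except SyntaxError:
        return False
    return True


print(generate_integers(2, 8))    # [2, 4, 6, 8]
print(generate_integers(10, 14))  # []
print(parses(decent_source))      # False
```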

In [13]:
import json
from IPython.display import display, JSON


with open('good_nuggets_humaneval.json') as f:
    good_nuggets_data = json.load(f)

with open('bad_nuggets_humaneval.json') as f:
    bad_nuggets_data = json.load(f)

with open('baseline_humaneval.json') as f:
    baseline_data = json.load(f)


In [21]:
# Find task_ids that are incorrect in the baseline and then correct with Nuggets good. Is there any pattern? 
def get_list_of_incorrect_tasks(data):
    incorrect_tasks = [task_id for task_id in data if not data[task_id]["passed"]]
    return incorrect_tasks


baseline_incorrect_tasks = get_list_of_incorrect_tasks(baseline_data)
good_nuggets_incorrect_tasks = get_list_of_incorrect_tasks(good_nuggets_data)

# Task IDs that failed in one run but passed in the other (baseline vs. good-quality Nuggets)

incorrect_to_correct = [task_id for task_id in baseline_incorrect_tasks if task_id not in good_nuggets_incorrect_tasks]
print(f"The following tasks went from incorrect to correct with Nuggets: {incorrect_to_correct}")

correct_to_incorrect = [task_id for task_id in good_nuggets_incorrect_tasks if task_id not in baseline_incorrect_tasks]
print(f"The following tasks went from correct to incorrect with Nuggets: {correct_to_incorrect}")

The following tasks went from incorrect to correct with Nuggets: ['18', '51', '57', '79', '83', '92', '114', '150', '154']
The following tasks went from correct to incorrect with Nuggets: ['59', '73', '68', '74', '95', '81', '124', '155']
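The same comparison could also be written with set differences, which avoids repeated list scans and makes the counts easy to report. This is only a sketch: `failed_ids` and `compare_runs` are hypothetical helpers, and the toy dictionaries stand in for the loaded JSON files. The assumptions are that each entry has a boolean `passed` field (as in the cell above) and that task IDs are numeric strings (as in the printed output).

```python
def failed_ids(data):
    """Return the set of task IDs whose entry did not pass."""
    return {task_id for task_id, result in data.items() if not result["passed"]}


def compare_runs(baseline, treatment):
    """Return (fixed, broken): task IDs that flipped between the two runs."""
    baseline_failed = failed_ids(baseline)
    treatment_failed = failed_ids(treatment)
    # key=int assumes task IDs are numeric strings such as '18' or '154'.
    fixed = sorted(baseline_failed - treatment_failed, key=int)
    broken = sorted(treatment_failed - baseline_failed, key=int)
    return fixed, broken


# Toy data for illustration only; not taken from the experiment files.
toy_baseline = {"1": {"passed": False}, "2": {"passed": True}, "10": {"passed": True}}
toy_nuggets = {"1": {"passed": True}, "2": {"passed": True}, "10": {"passed": False}}

fixed, broken = compare_runs(toy_baseline, toy_nuggets)
print(f"incorrect -> correct ({len(fixed)}): {fixed}")
print(f"correct -> incorrect ({len(broken)}): {broken}")
```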


### Inspect the tasks that went from incorrect to correct when using Nuggets with a good quality sample


In [33]:
def print_generations(data, result_subset):
    '''
    Prints the first generation for each task ID in result_subset.
    Args:
        data: the JSON object (dict keyed by task ID) with the results to get generations from
        result_subset: the list of task IDs whose generations we want to print
    '''
    for task_id in result_subset:
        # Each entry stores a list of generations; only the first one is printed.
        print(data[task_id]['generation'][0])

In [34]:
# Inspect the tasks that went from incorrect to correct when using Nuggets with a good quality sample

print_generations(good_nuggets_data, incorrect_to_correct)

def how_many_times(string: str, substring: str) -> int:
    """ Find how many times a given substring can be found in the original string. Count overlaping cases.
    >>> how_many_times('', 'a')
    0
    >>> how_many_times('aaa', 'a')
    3
    >>> how_many_times('aaaa', 'aa')
    3
    """
    if substring == '':
        return 0
    return sum(1 for i in range(len(string)) if string.startswith(substring, i))

def remove_vowels(text):
    """
    remove_vowels is a function that takes string and returns string without vowels.
    >>> remove_vowels('')
    ''
    >>> remove_vowels("abcdef\nghijklm")
    'bcdf\nghjklm'
    >>> remove_vowels('abcdef')
    'bcdf'
    >>> remove_vowels('aaaaa')
    ''
    >>> remove_vowels('aaBAA')
    'B'
    >>> remove_vowels('zbcd')
    'zbcd'
    """
    vowels = ['a', 'e', 'i', 'o', 'u']
    return ''.join([char for char in text if char.lower() not in vowels])

def monotonic(l: list):
    """Return True is list elements are monotonically increasing o

## Analysis

From visual inspection of the functions above, the tasks that changed from incorrect to correct when using Nuggets share some characteristics:

- Iterative processing: many solutions require iterating over all the elements of a data structure, and several can be written as list comprehensions.
- Mathematical reasoning: many of the solutions require some math (sum, binary, power, mod, ...).

Notably, these characteristics are also present in the task used as the one-shot example. This suggests that Nuggets helps most on tasks that resemble the provided example in the type of reasoning or programming skill the solution requires. This conclusion is based on visual inspection of the nine tasks listed above, not on a systematic test.
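The visual inspection could be complemented by a rough automated check. The sketch below uses Python's standard `ast` module to flag whether a generation contains a comprehension. `uses_comprehension` is a hypothetical helper that is not part of this notebook, and the example source is copied from the `remove_vowels` output above. It detects syntax only, so it says nothing about whether a comprehension was the right choice, and it returns `None` for generations that do not parse (for example, truncated ones).

```python
import ast


def uses_comprehension(source):
    """Return True/False if the source parses, or None if it does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # e.g. a truncated or malformed generation
    comprehension_nodes = (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)
    return any(isinstance(node, comprehension_nodes) for node in ast.walk(tree))


example = '''
def remove_vowels(text):
    vowels = ['a', 'e', 'i', 'o', 'u']
    return ''.join([char for char in text if char.lower() not in vowels])
'''

print(uses_comprehension(example))  # True
```

Applied to each entry's `generation[0]`, a check like this would give a count to set beside the qualitative observation.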



### Inspect the tasks that went from correct to incorrect when using Nuggets with a good quality sample


#### Solutions provided by Nuggets model

In [35]:
print_generations(good_nuggets_data, correct_to_incorrect)

def largest_prime_factor(n: int):
    """Return the largest prime factor of n. Assume n > 1 and is not a prime.
    >>> largest_prime_factor(13195)
    29
    >>> largest_prime_factor(2048)
    2
    """
    for i in range(int(n**0.5), 1, -1):
        if n % i == 0 and is_prime(i):
            return i
    return 1

def smallest_change(arr):
    """
    Given an array arr of integers, find the minimum number of elements that
    need to be changed to make the array palindromic. A palindromic array is an array that
    is read the same backwards and forwards. In one change, you can change one element to any other element.

    For example:
    smallest_change([1,2,3,5,4,7,9,6]) == 4
    smallest_change([1, 2, 3, 4, 3, 2, 2]) == 1
    smallest_change([1, 2, 3, 2, 1]) == 0
    """

    # Find the smallest palindrome
    smallest_palindrome = min(arr)

    # Find the largest palindrome
    largest_palindrome = max(arr)

    # Find the number of elements that need to be changed
    return l

#### Solutions provided by baseline model

In [36]:
print_generations(baseline_data, correct_to_incorrect)

def largest_prime_factor(n: int):
    """Return the largest prime factor of n. Assume n > 1 and is not a prime.
    >>> largest_prime_factor(13195)
    29
    >>> largest_prime_factor(2048)
    2
    """
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return n


def smallest_change(arr):
    """
    Given an array arr of integers, find the minimum number of elements that
    need to be changed to make the array palindromic. A palindromic array is an array that
    is read the same backwards and forwards. In one change, you can change one element to any other element.

    For example:
    smallest_change([1,2,3,5,4,7,9,6]) == 4
    smallest_change([1, 2, 3, 4, 3, 2, 2]) == 1
    smallest_change([1, 2, 3, 2, 1]) == 0
    """
    # Write your code here
    if not arr:
        return 0

    # Find the middle index of the array
    mid = len(arr) // 2

    # Find the minimum number of elements that need to be changed
    # to make 

## Analysis

The tasks that go from correct in the baseline to incorrect when using Nuggets show a less clear pattern. The observations below are an attempt to find commonalities.

- In general, the model does worse on problems that require reasoning with strings. This may be because the one-shot example only required working with math.
- The model tries to use a list comprehension when one is not required, or when doing so makes the problem harder (e.g., valid_date).
- The model misses small edge cases, or gets them wrong, that it handled in the baseline (e.g., even_odd_count, total_match).

#### Case \#1

```
def pluck(arr):
    """
    "Given an array representing a branch of a tree that has non-negative integer nodes
    your task is to pluck one of the nodes and return it.
    The plucked node should be the node with the smallest even value.
    If multiple nodes with the same smallest even value are found return the node that has smallest index.

    The plucked node should be returned in a list, [ smalest_value, its index ],
    If there are no even values or the given array is empty, return [].

    Example 1:
        Input: [4,2,3]
        Output: [2, 1]
        Explanation: 2 has the smallest even value, and 2 has the smallest index.

    Example 2:
        Input: [1,2,3]
        Output: [2, 1]
        Explanation: 2 has the smallest even value, and 2 has the smallest index. 

    Example 3:
        Input: []
        Output: []
    
    Example 4:
        Input: [5, 0, 3, 0, 4, 2]
        Output: [0, 1]
        Explanation: 0 is the smallest value, but  there are two zeros,
                     so we will choose the first zero, which has the smallest index.

    Constraints:
        * 1 <= nodes.length <= 10000
        * 0 <= node.value
    """

    ## BASELINE SOLUTION ##
    # Find the smallest even value
    smallest_even_value = float('inf')
    smallest_even_index = -1
    for i, node in enumerate(arr):
        if node % 2 == 0 and node < smallest_even_value:
            smallest_even_value = node
            smallest_even_index = i

    # If there are no even values or the given array is empty, return []
    if smallest_even_index == -1:
        return []

    return [smallest_even_value, smallest_even_index]
    
    
    ## NUGGETS SOLUTION ##
    if not arr:
        return []

    even_nodes = [node for node in arr if node % 2 == 0]
    if not even_nodes:
        return []

    return min(even_nodes, key=lambda x: (x, arr.index(x)))
    
```


### Notes 
- The Nuggets solution tries to reuse a technique from the one-shot example (a list comprehension) but applies it incorrectly.
- The solution is nearly correct, but its return value is not in the format the docstring asks for: it returns only the smallest even value instead of the list [smallest value, index].
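To illustrate the format problem noted above, the sketch below reproduces the Nuggets generation next to a comprehension-based variant that returns `[value, index]`. The names `pluck_nuggets` and `pluck_with_index` are hypothetical labels added for this comparison. The second function is one possible fix, not output from either model.

```python
def pluck_nuggets(arr):
    """The Nuggets generation from Case #1, reproduced for comparison."""
    if not arr:
        return []
    even_nodes = [node for node in arr if node % 2 == 0]
    if not even_nodes:
        return []
    # Returns a bare value, not the [value, index] pair the docstring asks for.
    return min(even_nodes, key=lambda x: (x, arr.index(x)))


def pluck_with_index(arr):
    """Comprehension-based variant that returns [value, index]."""
    # (value, index) pairs let min() break ties by the smaller index.
    even_nodes = [(value, index) for index, value in enumerate(arr) if value % 2 == 0]
    if not even_nodes:
        return []
    value, index = min(even_nodes)
    return [value, index]


print(pluck_nuggets([4, 2, 3]))     # 2
print(pluck_with_index([4, 2, 3]))  # [2, 1]
```

Pairing each value with its index inside the comprehension keeps the one-shot style while carrying the information needed for the expected output.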