In [4]:
import data
import evaluation
dataset = data.Data("config.json", "nagini_examples/tree_master_finetune")
ev = evaluation.Evaluation(dataset, "codellama", predictor_endpoint="huggingface-pytorch-inference-2024-03-21-09-55-48-749")

### Evaluating the finetuned mar6 model

Training data was:
- list: [prepend, append, join_lists, contains, drop, reverse, merge]
- tree: [val_head, count, sum, contains, inorder, min, mirror, subtree]
- lseg: [lemma_append, lemma_assoc, prepend, remove_first, contains, contains_iter, insert, append, index_of]

Held out:
- list: [insert_sorted ❌, drop_iter ❌]
- tree: [insert ❌, height ✅]
- lseg: [remove_last ❌, insert_iter ❌, reverse ❌]

Additionally evaluated on:
- list: [remove_first ✅, remove_last ✅, split ✅, index_of ✅, insert ✅, count ✅, insertion_sort ❌, merge_sort ❌]
- tree: [min_depth ✅]
- lseg: [insert_sorted ❌, insertion_sort ❌, count ✅, split ❌, merge ❌, merge_sort ❌]

Datasets to use: list_mar13_finetune, lseg_finetune, tree_finetune.

In [None]:
ev.run_eval(k=5, n=3, key="lseg")

### Failure modes:

list::drop_iter: calls an undefined lemma_extend => translation error => does not improve with error depth. Possibly because there is no "translation error" in the training set? Try a higher temperature?

list::insert_sorted: generates the precondition Requires(is_list(node) and Acc(node.val) and Acc(node.next)), which makes the method uncallable!

list::insertion_sort: try renaming insert_sorted to insert_node_in_sorted_list(node, head) and variables inside insertion_sort appropriately.

list::merge_sort: extra Fold/Unfold statements. Tried giving the specs of dependent methods in the prompt => does not help.

-----------------

tree::insert: emits `if key < node.key:` without a preceding Unfold/Unfolding => does not improve with error depth. Try a higher temperature.

~~tree::subtree: 1. extra Unfold after return statement 2. when 1. solved, extra Fold (does not understand that permissions are to be leaked). Solved at temperature 1.5 at k=2.~~

-----------
lseg::remove_last
```python
    if Unfolding(lseg(first, last), first.next is last):
        return first
```
misses a Fold(lseg(first, first)), which is required to satisfy the postcondition `Ensures(lseg(first, Result()))`

lseg::insert_iter: No Unfold before Node(val, ptr.next)

lseg::reverse: wrong invariants. even then, does not respond to error message that loop invariant does not hold on entry.

lseg::insert_sorted: changes the code logic! Rewrites `node.next = head; return node` to `head.next = node; return head`.
At attempt 4, error depth 3 it almost got it right (temperature was 1.5).

lseg::insertion_sort: attempt 2, depth 1: almost correct. except unfolding in conditional. (3,2) extra Unfold/Fold.

lseg::split: missing precondition on head being a list. Missing Fold(lseg(None, None)) before Fold(lseg(head, None)).

lseg::merge: invents Unfold(head1, head2). Does not respond to error message at any temperature.

lseg::merge_sort: same problems as list::merge_sort. Having examples with extra fold/unfold in the training data might help.

----------------

Some common issues:
- misses some basic Unfold()/Unfolding() statements.
- does not respond to error messages / line number information.
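
The attempt / error-depth search that these notes refer to (for example "attempt 4, error depth 3") can be sketched roughly as below. This is a minimal illustration, not the actual `evaluation.Evaluation` implementation: `search`, `generate`, `verify` and the `temperatures` schedule are all hypothetical stand-ins, and the assumption is that each attempt starts fresh while each further error depth feeds the previous verifier error back to the model.

```python
from typing import Callable, Optional, Sequence, Tuple


def search(
    generate: Callable[[Optional[str], float], str],
    verify: Callable[[str], Optional[str]],
    k: int,
    n: int,
    temperatures: Sequence[float],
) -> Tuple[bool, Optional[Tuple[int, int]], Optional[str]]:
    """Try up to k fresh attempts, each with up to n error-feedback rounds."""
    for attempt in range(1, k + 1):
        # Hypothetical schedule: one temperature per attempt, last value reused.
        temp = temperatures[min(attempt - 1, len(temperatures) - 1)]
        error = None  # a fresh attempt starts without any error message
        for depth in range(1, n + 1):
            program = generate(error, temp)
            error = verify(program)  # None means verification succeeded
            if error is None:
                return True, (attempt, depth), program
    return False, None, None


# Toy stand-ins: this "model" only repairs its program after seeing an error.
def toy_generate(error: Optional[str], temperature: float) -> str:
    return "fixed" if error else "broken"


def toy_verify(program: str) -> Optional[str]:
    return None if program == "fixed" else "Conditional statement might fail."


result = search(toy_generate, toy_verify, k=5, n=3, temperatures=[0.1, 0.6, 1.5])
print(result)
```

A model that "does not respond to error messages" corresponds to a `generate` that ignores its `error` argument, in which case the inner loop never helps and only a new attempt (possibly at a different temperature) can change the outcome.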

### Evaluating the finetuned mar13 model


In [5]:
datasets = {
    "list": "list_mar13_finetune",
    "tree": "tree_finetune",
    "lseg": "lseg_mar13_finetune"
}
hold_out = {
    "list": ["insert", "remove_last", "drop_iter", "count", "split", "merge_sort"],
    "tree": ["insert", "height", "min_depth"],
    "lseg": ["remove_last", "count_iter", "index_of", "reverse", "count", "split", "merge_sort", "insert_sorted_iter", "merge_iter"]
}
training_examples = {}

for key in datasets:
    dataset = data.Data("config.json", f"nagini_examples/{datasets[key]}")
    examples = dataset.get_list_of_examples(key)
    training_examples[key] = [ex for ex in examples if ex not in hold_out[key]]

print("training data was: ")
training_examples

training data was:  {'list': ['prepend', 'append', 'remove_first', 'join_lists', 'contains', 'remove', 'index_of', 'drop', 'reverse', 'insert_sorted', 'insertion_sort', 'merge'], 'tree': ['val_head', 'count', 'sum', 'contains', 'inorder', 'min', 'mirror', 'subtree'], 'lseg': ['join', 'prepend', 'remove_first', 'contains', 'contains_iter', 'insert', 'insert_iter', 'append', 'append_iter', 'insert_sorted', 'insertion_sort', 'insertion_sort_iter', 'merge', 'split_iter']}


In [None]:
ev.run_eval(k=5, n=3, key="lseg")

Training data was:
list: ['prepend', 'append', 'remove_first', 'join_lists', 'contains', 'remove', 'index_of', 'drop', 'reverse', 'insert_sorted' ❌, 'insertion_sort' ❌, 'merge']
tree: ['val_head', 'count', 'sum', 'contains', 'inorder', 'min', 'mirror', 'subtree']
lseg: ['join', 'prepend', 'remove_first', 'contains', 'contains_iter', 'insert', 'insert_iter', 'append', 'append_iter', 'insert_sorted', 'insertion_sort', 'insertion_sort_iter', 'merge', 'split_iter']

Held out / evaluated on:
list: ["insert" ✅, "remove_last" ✅, "drop_iter" ❌, "count" ✅, "split" ✅, "merge_sort" ❌],
tree: ["insert" ❌, "height" ✅, "min_depth" ❌],
lseg: ["remove_last" ❌, "count_iter" ❌, "index_of" ❌, "reverse" ❌, "count" ❌, "split" ✅, "merge_sort" ❌]

6 / 16 (unseen)
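
The 6 / 16 figure can be recomputed from the ✅ / ❌ marks above. The snippet below is a small sketch that transcribes those marks by hand into a dictionary (the name `held_out_results` is made up for this example) and counts the verified methods per data structure.

```python
# Marks copied by hand from the held-out lists above (True = ✅, False = ❌).
held_out_results = {
    "list": {"insert": True, "remove_last": True, "drop_iter": False,
             "count": True, "split": True, "merge_sort": False},
    "tree": {"insert": False, "height": True, "min_depth": False},
    "lseg": {"remove_last": False, "count_iter": False, "index_of": False,
             "reverse": False, "count": False, "split": True, "merge_sort": False},
}

# (verified, total) per data structure.
per_key = {key: (sum(marks.values()), len(marks))
           for key, marks in held_out_results.items()}
passed = sum(p for p, _ in per_key.values())
total = sum(t for _, t in per_key.values())

for key, (p, t) in per_key.items():
    print(f"{key}: {p} / {t}")
print(f"overall: {passed} / {total} (unseen)")
```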

list:
training
+ remove_first: responds to the error message (i.e. verifies at (1,2))
- insert_sorted: missing precondition on is_list(head). Zero progress from error message at all temperatures i.e. outputs the exact same program. Try separating the two preconditions in the training data.
- insertion_sort: same problem as mar6 model. Extra folds after method calls. Having examples with extra fold/unfold in the training data might help.

hold_out
+ insert, remove_last, count, (gets right at (1,1)!)
+ split: gets "right" at (2,1), but without the postcondition that head is also a list -- similar to drop in this sense.
- drop_iter: extra Fold(is_list(head)) before entering the loop. At temp 1.1: inserts `join` => translation error; does not learn from the error. Comes close to getting it right in attempt 4, i.e. no `join` and no redundant Fold before the loop, but extra Folds inside the loop. Try at temp ~1.2 for 2 or 3 more attempts.

tree:
insert: inserts Invariant => results in nagini runtime error. does respond to error message "conditional stmt failed.." in that it adds an Unfold (attempt 2, temp 0.6 -> try a higher error depth?)
min_depth: comes close

lseg:
- remove_last: wrong postcondition: Ensures(lseg(Result(), last)) instead of Ensures(lseg(first, Result()))
- index_of: almost correct! Responds to the error message about the loop invariant not holding on entry. Extra Fold before the `join; return`. At higher temp there is an extra Unfold before the `join; return` inside the loop, but the one outside the loop is as expected. Having examples with extra fold/unfold in the training data might help.
- reverse: attempt 1: wrong invariants. attempt 2: right invariants but an extra wrong invariant. in either case does not get past the basic error conditional stmt might fail (unfold head.next) 
- count, count_iter: Ensures(Result(), None) 🤦‍♂️, does not learn from error at any temperature. Correct otherwise!



In [2]:
ev.run_eval(k=5, n=3, key="lseg")

Running example: join ; attempt: 1 ; error depth: 1
Using temperature: 0.1
Generated program from model:
def join(a: Optional[Node], b: Optional[Node], c: Optional[Node]) -> None:
    """Join two list segments."""
    Requires(lseg(a, b) and lseg(b, c))
    Ensures(lseg(a, c))
    if b is c:
        return
    if a is b:
        return
    Unfold(lseg(a, b))
    join(a.next, b, c)
    Fold(lseg(a, c))


response ['', 'Verification successful', 'Verification took 1.75 seconds.']
Verification result:
 Verification successful 


Running example: prepend ; attempt: 1 ; error depth: 1
Using temperature: 0.1
Generated program from model:
def prepend(head: Optional[Node], val: int) -> Node:
    """Prepend a new node with value val to the list."""
    Requires(lseg(head, None))
    Ensures(lseg(Result(), None))
    n = Node(val, head)
    Fold(lseg(None, None))
    Fold(lseg(n, None))
    return n


response ['', 'Verification successful', 'Verification took 1.79 seconds.']
Verification result

EndpointConnectionError: Could not connect to the endpoint URL: "https://runtime.sagemaker.eu-west-1.amazonaws.com/endpoints/huggingface-pytorch-inference-2024-03-21-09-55-48-749/invocations"

### Evaluating the "master" model
Training data was:
list: ['prepend', 'append', 'remove_first', 'join_lists', 'contains', 'remove', 'index_of', 'drop', 'reverse', 'insert_sorted', 'insertion_sort', 'merge']
tree: ['val_head', 'count', 'sum', 'contains', 'inorder', 'min', 'mirror', 'subtree']
lseg: ['join', 'prepend', 'remove_first', 'contains', 'contains_iter', 'insert', 'insert_iter', 'append', 'append_iter', 'insert_sorted', 'insertion_sort', 'insertion_sort_iter', 'split_iter']

Held out / evaluated on:
list: ["insert" ✅, "remove_last" ✅, "drop_iter" ❌, "count" ✅, "split" ✅, "merge_sort" ❌],
vs pretrained:
EvalResult(results={'prepend': False, 'append': False, 'remove_first': False, 'remove_last': True, 'join_lists': True, 'contains': True, 'insert': False, 'remove': False, 'index_of': True, 'drop': False, 'drop_iter': False, 'reverse': FP, 'insert_sorted': FP, 'insertion_sort': False, 'count': False, 'split': False, 'merge': False, 'merge_sort': False}, verified_at={'remove_last': (1, 1), 'join_lists': (1, 1), 'contains': (1, 2), 'index_of': (1, 1), 'reverse': FP(4, 1), 'insert_sorted': FP(4, 3)})

tree: ["insert" ✅, "height" ✅, "min_depth" ✅],
vs. pretrained:
EvalResult(results={'val_head': True, 'height': True, 'count': True, 'sum': True, 'insert': False, 'contains': True, 'inorder': True, 'min': False, 'mirror': True, 'subtree': FP, 'min_depth': True}, verified_at={'val_head': (1, 1), 'height': (1, 1), 'count': (1, 1), 'sum': (1, 1), 'contains': (4, 2), 'inorder': (1, 1), 'mirror': (1, 1), 'subtree': (2, 1), 'min_depth': (1, 1)})


lseg: ["remove_last" ❌, "count_iter" ✅, "index_of" ✅, "reverse" ❌, "count" ✅, "split" ✅, "merge" ✅, "merge_sort" ❌]
vs. pretrained:
EvalResult(results={'join': False, 'prepend': True, 'remove_first': False, 'remove_last': FP, 'contains': True, 'contains_iter': True, 'insert': True, 'insert_iter': False, 'append': False, 'index_of': True, 'reverse': FP, 'insert_sorted': False, 'merge': True, 'count': True, 'count_iter': True, 'split': FP, 'split_iter': False, 'merge_sort': False}, verified_at={'prepend': (1, 1), 'remove_last': (4, 1), 'contains': (1, 2), 'contains_iter': (1, 1), 'insert': (1, 1), 'index_of': (1, 1), 'reverse': (4, 3), 'merge': (1, 1), 'count': (1, 1), 'count_iter': (1, 1), 'split': (4, 3)})
12 / 17 (unseen)
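
For a side-by-side view, the held-out marks of the "master" model can be compared against the pretrained `EvalResult` entries for the same methods. The sketch below copies both by hand; it assumes that `FP` marks a false positive and therefore counts it as not verified, which is an interpretation rather than something the output states. The names `master`, `pretrained` and `count_verified` are introduced for this example only.

```python
FP = "FP"  # assumption: false positive, counted as not verified

master = {
    "list": {"insert": True, "remove_last": True, "drop_iter": False,
             "count": True, "split": True, "merge_sort": False},
    "tree": {"insert": True, "height": True, "min_depth": True},
    "lseg": {"remove_last": False, "count_iter": True, "index_of": True,
             "reverse": False, "count": True, "split": True,
             "merge": True, "merge_sort": False},
}
# Pretrained results, restricted to the same held-out methods.
pretrained = {
    "list": {"insert": False, "remove_last": True, "drop_iter": False,
             "count": False, "split": False, "merge_sort": False},
    "tree": {"insert": False, "height": True, "min_depth": True},
    "lseg": {"remove_last": FP, "count_iter": True, "index_of": True,
             "reverse": FP, "count": True, "split": FP,
             "merge": True, "merge_sort": False},
}


def count_verified(results):
    """Count entries that are exactly True (FP and False do not count)."""
    return sum(1 for marks in results.values()
               for ok in marks.values() if ok is True)


total = sum(len(marks) for marks in master.values())
print(f"master:     {count_verified(master)} / {total}")
print(f"pretrained: {count_verified(pretrained)} / {total}")

# Methods verified by the master model but not by the pretrained one.
gained = sorted(f"{key}::{name}"
                for key, marks in master.items()
                for name, ok in marks.items()
                if ok is True and pretrained[key][name] is not True)
print("verified only by master:", gained)
```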

list:
remove_first: responds to the error message (i.e. verifies at (1,2))

held out:
remove_last: responds to the error message (i.e. verifies at (1,2))

drop_iter: temp 0.1: still invokes join, extra fold before loop.. in style of lseg.
merge_sort: off by one spec (extra unfold which it moves around)

tree:
insert: (1,3) (2,3) off by one --> try higher n. --> verifies at (3,2)

lseg:
remove_last: same as previous model
reverse: off by one.
merge_sort: same as list::merge_sort, tendency to Unfold before a method call, probably learned from recursive methods that call .next. Off by one.



In [5]:
ev.run_example("insert", k=7, n=7, key="tree", examples=[])

Running example: insert ; attempt: 1 ; error depth: 1
Using temperature: 0.1
Generated program from model:
def insert(node: TreeNode, key: int) -> None:
    """Insert a node with given key into a binary tree."""
    Requires(tree(node))
    Ensures(tree(node))
    if key < node.key:
        if node.left is None:
            n = TreeNode(key)
            Fold(tree(n))
            node.left = n
        else:
            Unfold(tree(node))
            insert(node.left, key)
            Fold(tree(node))
    else:
        if node.right is None:
            n = TreeNode(key)
            Fold(tree(n))
            node.right = n
        else:
            Unfold(tree(node))
            insert(node.right, key)
            Fold(tree(node))
    Fold(tree(node))

response ['', 'Verification failed', 'Errors:', 'Conditional statement might fail. There might be insufficient permission to access node.key. (tmp.py@39.7)', 'Verification took 1.90 seconds.']
Verification result:
 Verification failed: Con

(True,
 (3, 2),
 'def insert(node: TreeNode, key: int) -> None:\n    """Insert a node with given key into a binary tree."""\n    Requires(tree(node))\n    Ensures(tree(node))\n    Unfold(tree(node))\n    if key < node.key:\n        if node.left is None:\n            n = TreeNode(key)\n            Fold(tree(n))\n            node.left = n\n        else:\n            insert(node.left, key)\n    else:\n        if node.right is None:\n            n = TreeNode(key)\n            Fold(tree(n))\n            node.right = n\n        else:\n            insert(node.right, key)\n    Fold(tree(node))\n')
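
The tuple returned by `ev.run_example` is easier to read once it is unpacked. The sketch below assumes, based only on the output above, that the fields are a verified flag, the (attempt, error depth) pair at which verification succeeded, and the program source; the variable names are chosen for this example.

```python
# Program text copied from the returned tuple above, split per source line.
program_text = (
    'def insert(node: TreeNode, key: int) -> None:\n'
    '    """Insert a node with given key into a binary tree."""\n'
    '    Requires(tree(node))\n'
    '    Ensures(tree(node))\n'
    '    Unfold(tree(node))\n'
    '    if key < node.key:\n'
    '        if node.left is None:\n'
    '            n = TreeNode(key)\n'
    '            Fold(tree(n))\n'
    '            node.left = n\n'
    '        else:\n'
    '            insert(node.left, key)\n'
    '    else:\n'
    '        if node.right is None:\n'
    '            n = TreeNode(key)\n'
    '            Fold(tree(n))\n'
    '            node.right = n\n'
    '        else:\n'
    '            insert(node.right, key)\n'
    '    Fold(tree(node))\n'
)
result = (True, (3, 2), program_text)

# Assumed field meaning: verified flag, (attempt, error depth), program source.
verified, (attempt, depth), program = result
print(f"verified={verified} at attempt {attempt}, error depth {depth}")
print(program)
```

Compared with the first attempt shown above, the verified program unfolds `tree(node)` once before the `if key < node.key:` test and folds it once at the end, instead of unfolding and folding around each recursive call.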

In [7]:
import data
dataset = data.Data("config.json", "nagini_examples/lseg_master_finetune")

In [10]:
print("method, dependencies")
for m in dataset.get_list_of_examples("lseg"):
    print(m, ",", dataset.get_dependencies("lseg", m))

method, dependencies
join , []
prepend , []
remove_first , []
remove_last , []
contains , []
contains_iter , ['join']
insert , ['prepend']
insert_iter , ['prepend', 'join']
append , []
append_iter , ['join']
index_of , ['join']
reverse , []
insert_sorted , []
insertion_sort , ['insert_sorted']
insert_sorted_iter , ['join']
insertion_sort_iter , ['insert_sorted']
merge , []
merge_iter , ['join']
count , []
count_iter , ['join']
split , []
split_iter , ['join']
merge_sort , ['count', 'split', 'merge']
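
When dependencies have to appear before the methods that call them (for instance when assembling a file or prompt that includes the specs of dependent methods), the table above can be ordered topologically. The sketch below hard-codes the printed dependencies and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+); `dependencies` and `order` are names introduced for this example, and `data.Data` is not involved.

```python
from graphlib import TopologicalSorter

# method -> list of methods it depends on, copied from the output above.
dependencies = {
    "join": [], "prepend": [], "remove_first": [], "remove_last": [],
    "contains": [], "contains_iter": ["join"], "insert": ["prepend"],
    "insert_iter": ["prepend", "join"], "append": [], "append_iter": ["join"],
    "index_of": ["join"], "reverse": [], "insert_sorted": [],
    "insertion_sort": ["insert_sorted"], "insert_sorted_iter": ["join"],
    "insertion_sort_iter": ["insert_sorted"], "merge": [],
    "merge_iter": ["join"], "count": [], "count_iter": ["join"],
    "split": [], "split_iter": ["join"],
    "merge_sort": ["count", "split", "merge"],
}

# static_order() yields every dependency before the methods that need it.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Sanity check: each dependency precedes its dependent method.
for method, deps in dependencies.items():
    assert all(order.index(d) < order.index(method) for d in deps)
```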
