# Building the Seed Dataset

This is the most biased way to create a seed dataset: we use working solutions to MultiPL-HumanEval-Racket problems generated by MultiPLCoder-Racket-34B. The model gets a pass@1 of 29.1%, so a fair number of problems should have at least one working solution.

In [33]:
import json
from pathlib import Path
import gzip
from typing import Optional
import pandas as pd

In [30]:
def gunzip_json(path: Path) -> Optional[dict]:
    """
    Reads a .json.gz file, but produces None if any error occurs.
    """
    try:
        with gzip.open(path, "rt") as f:
            return json.load(f)
    except Exception:
        # Deliberately broad: a missing or corrupt file just means "skip this problem".
        return None

def remove_hashlang(prompt):
    to_remove = "#lang racket\n\n"
    assert prompt.startswith(to_remove)
    return prompt[len(to_remove):]

def process_executions(executions_path: Path):
    executions = gunzip_json(executions_path)
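    # The completions live next to the results file: foo.results.json.gz -> foo.json.gz.
    # The slice drops the 16-character ".results.json.gz" suffix.
    completions_path = executions_path.parent / (executions_path.name[:-16] + ".json.gz")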
    completions = gunzip_json(completions_path)
    if completions is None or executions is None:
        return None
    prompt = completions["prompt"]
    # Get the index of the first value of result["status"] that is "OK"
    statuses = [ result["status"] for result in executions["results"] ]
    try:
        ok_index = statuses.index("OK")
    except ValueError:
        return None
    completion = completions["completions"][ok_index]
    return {
        "contents": remove_hashlang(prompt.rstrip()) + completion,
        "src_file": str(completions_path),
        "src_index": ok_index,
    }
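
The two steps that `process_executions` relies on, deriving the completions file name from the results file name and picking the first passing completion, can be checked in isolation. The following is a minimal sketch on toy data, not part of the original pipeline: the helper names `completions_path_for` and `first_ok_index`, the problem name `HumanEval_0_example`, and the non-`"OK"` status string are all hypothetical, and the JSON shape is only inferred from the code above.

```python
import gzip
import json
import tempfile
from pathlib import Path

RESULTS_SUFFIX = ".results.json.gz"  # 16 characters long


def completions_path_for(executions_path: Path) -> Path:
    """Map foo.results.json.gz to its sibling foo.json.gz (hypothetical helper)."""
    assert executions_path.name.endswith(RESULTS_SUFFIX)
    stem = executions_path.name[: -len(RESULTS_SUFFIX)]
    return executions_path.parent / (stem + ".json.gz")


def first_ok_index(statuses):
    """Return the index of the first "OK" status, or None if nothing passed."""
    try:
        return statuses.index("OK")
    except ValueError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    # Toy results file; the name and the failing status are placeholders.
    executions_path = Path(tmp) / "HumanEval_0_example.results.json.gz"
    toy_results = {"results": [{"status": "Exception"}, {"status": "OK"}, {"status": "OK"}]}
    with gzip.open(executions_path, "wt") as f:
        json.dump(toy_results, f)

    # Read it back the same way gunzip_json does.
    with gzip.open(executions_path, "rt") as f:
        statuses = [r["status"] for r in json.load(f)["results"]]

    demo_completions_name = completions_path_for(executions_path).name
    demo_ok_index = first_ok_index(statuses)

print(demo_completions_name)  # HumanEval_0_example.json.gz
print(demo_ok_index)          # 1
```

Only the first passing completion is kept per problem, so each problem contributes at most one seed.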

In [31]:
EXPERIMENT_PATH = Path("/work/arjunguha-research-group/projects/MultiPL-T/eval/multiplcoder_34b_humaneval_rkt")
    
results = []
for executions_path in EXPERIMENT_PATH.glob("*.results.json.gz"):
    maybe_dict = process_executions(executions_path)
    if maybe_dict is not None:
        results.append(maybe_dict)
print(f"Dataset size: {len(results)}")

Dataset size: 78


We found 78 seed instructions. CodeParrot Self Instruction had 80, so this is close enough.

Let's look at one of the instructions to check that we have correctly removed
the `#lang racket\n\n` prefix.

In [32]:
print(results[70]["contents"])

;; For a given list of integers, return a list consisting of a sum and a product of all the integers in a list.
;; Empty sum should be equal to 0 and empty product should be equal to 1.
;; >>> (sum_product (list ))
;; (list 0 1)
;; >>> (sum_product (list 1 2 3 4))
;; (list 10 24)
(define (sum_product numbers)
	(list (apply + numbers) (apply * numbers)))
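
Printing one seed only spot-checks a single entry. A check over every entry might look like the sketch below; the helper names `has_hashlang` and `seeds_with_hashlang` are hypothetical, and the toy records stand in for the real `results` list.

```python
HASHLANG = "#lang racket"


def has_hashlang(contents: str) -> bool:
    """True if the seed still starts with the #lang line (ignoring leading whitespace)."""
    return contents.lstrip().startswith(HASHLANG)


def seeds_with_hashlang(seeds):
    """Return the indices of seeds whose contents still carry the #lang line."""
    return [i for i, seed in enumerate(seeds) if has_hashlang(seed["contents"])]


# Toy records for illustration only; the second one is deliberately left unstripped.
toy_seeds = [
    {"contents": ";; Identity function.\n(define (f x)\n\tx)"},
    {"contents": "#lang racket\n\n;; Identity function.\n(define (g x)\n\tx)"},
]

bad_indices = seeds_with_hashlang(toy_seeds)
print(bad_indices)  # [1]
```

In the notebook, `seeds_with_hashlang(results)` would be expected to return an empty list, since `remove_hashlang` asserts on the prefix before stripping it.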


In [None]:
pd.DataFrame(results).to_json("multipl_humaneval_rkt_seeds.jsonl", orient="records", lines=True)
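
With `orient="records", lines=True`, pandas writes one JSON object per line, so the output can be read back with the standard library alone. The sketch below validates such a file; `read_jsonl` and `validate_seeds` are hypothetical helpers, and the file is a toy stand-in written to a temporary directory rather than the real `multipl_humaneval_rkt_seeds.jsonl`.

```python
import json
import tempfile
from pathlib import Path

# The three fields that process_executions puts in every record.
EXPECTED_KEYS = {"contents", "src_file", "src_index"}


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


def validate_seeds(records):
    """Raise ValueError if any record lacks an expected field; return the record count."""
    for i, record in enumerate(records):
        missing = EXPECTED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {i} is missing {sorted(missing)}")
    return len(records)


with tempfile.TemporaryDirectory() as tmp:
    # Toy records with placeholder file names, for illustration only.
    toy = [
        {"contents": "(define (f x)\n\tx)", "src_file": "a.json.gz", "src_index": 0},
        {"contents": "(define (g x)\n\tx)", "src_file": "b.json.gz", "src_index": 3},
    ]
    demo_path = Path(tmp) / "toy_seeds.jsonl"
    demo_path.write_text("".join(json.dumps(r) + "\n" for r in toy), encoding="utf-8")

    demo_records = read_jsonl(demo_path)
    demo_count = validate_seeds(demo_records)

print(demo_count)  # 2
```

Pointing `read_jsonl` at the real output file should yield the same number of records as the dataset size printed earlier.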