# Numina 1st Place Solution

Our solution was based on a simple tree search algorithm to generate and prune a diverse set of reasoning traces with code execution from the Python REPL. Concretely, the algorithm works as follows:

![](./images/tree-search.png)

1. For each problem, copy the input $M$ times to define the initial batch of prompts to provide the model. These effectively define the number of candidates one uses for majority voting.
2. Sample $M$ completions until a complete block of Python code is produced (like the DeepSeekMath Instruct/RL models, our model produces code blocks in the ToRA format).
3. Execute each Python block and concatenate the output, including tracebacks if they appear.
4. Repeat $N$ times to produce a tree of width $M$ and depth $N$. If a branch fails to produce sensible outputs (e.g. incomplete code blocks or no `\boxed{}` output) prune that branch.
5. Postprocess the solution candidates and then apply majority voting to select the final answer.
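
The five steps above can be sketched end to end with a toy stand-in for the model. This is a minimal illustration, not the notebook's implementation: `fake_model`, `run_python`, and `tree_search` are hypothetical names, the "model" is deterministic, and code runs in-process via `exec` rather than in a sandboxed subprocess.

```python
import contextlib
import io
from collections import Counter

FENCE = "`" * 3  # literal triple backtick, built indirectly for readability


def fake_model(prompt: str, step: int) -> str:
    """Hypothetical stand-in for the LLM: first a code block, then a boxed answer."""
    if step == 0:
        return f"Let's compute.\n{FENCE}python\nprint(6 * 7)\n{FENCE}\n{FENCE}output\n"
    return "The answer is \\boxed{42}"


def run_python(code: str) -> str:
    """Run code in-process and capture stdout (assumption: trusted toy code only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def tree_search(problem: str, width: int, depth: int) -> int:
    branches = [problem] * width                  # step 1: copy the prompt M times
    finished = []
    for step in range(depth):                     # step 4: repeat N times
        still_open = []
        for text in branches:
            text += fake_model(text, step)        # step 2: sample up to a code block
            if "\\boxed{" in text:
                finished.append(text)             # this branch has produced an answer
            elif text.endswith(f"{FENCE}output\n"):
                code = text.split(f"{FENCE}python")[-1].split(FENCE)[0]
                text += run_python(code) + f"\n{FENCE}\n"  # step 3: append the output
                still_open.append(text)
            # any other ending is treated as malformed and the branch is pruned
        branches = still_open
    answers = [int(t.split("\\boxed{")[-1].split("}")[0]) for t in finished]
    return Counter(answers).most_common(1)[0][0]  # step 5: majority vote


print(tree_search("What is 6 times 7?\n", width=4, depth=2))
```

The sketch assumes at least one branch finishes within `depth` steps; the real pipeline falls back to a default answer when no candidate survives.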

To accelerate inference we used [vLLM](https://github.com/vllm-project/vllm) and 8-bit models that were quantized with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ). On modern hardware, one can skip the quantization step and run inference in standard 16-bit precision.


## Setup

In [1]:
import os

# Are we running on Kaggle?
is_kaggle : bool = True if "KAGGLE_KERNEL_RUN_TYPE" in os.environ else False

In [None]:
# Install dependencies
INSTALL_DEPS_FROM_PIP = False
if INSTALL_DEPS_FROM_PIP:
    %pip install vllm==0.4.2
    %pip install grpcio==1.62.2 # fixes a bug with ray dashboard
    %pip install antlr4-python3-runtime==4.11.0

    # For the python env used in Python code execution
    #%pip install networkx shapely sage matplotlib gmpy2 scipy numpy sympy mpmath

INSTALL_FROM_KAGGLE_DATASET=True
if INSTALL_FROM_KAGGLE_DATASET:
    
    !pip uninstall -y torch
    !pip install -U --no-index --find-links=/kaggle/input/vllm-whl -U vllm
    !pip install -U --upgrade /kaggle/input/vllm-t4-fix/grpcio-1.62.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
    !pip install -U --upgrade /kaggle/input/vllm-t4-fix/ray-2.11.0-cp310-cp310-manylinux2014_x86_64.whl
    !pip install -U --upgrade /kaggle/input/antlr4-python3-runtime-package-4-11/antlr4_python3_runtime-4.11.0-py3-none-any.whl
    

## Imports

In [2]:
import re
from vllm import LLM, SamplingParams
import pandas as pd
from collections import Counter
from datasets import load_dataset, Dataset, concatenate_datasets
from dataclasses import dataclass
import os
from typing import Dict, Any, List
# code execution
import signal
import subprocess
import tempfile
from contextlib import contextmanager
from typing import Tuple
from transformers import PreTrainedTokenizer
import torch
from tqdm import tqdm
import time
from sympy import N, simplify
from sympy.parsing.latex import parse_latex


## Configuration

We found it useful to define a single `Config` class that gathers all the settings used for a single submission:

In [3]:
@dataclass
class Config:
    model_id : str

    system_prompt : str # Append an optional system prompt to each problem

    # Tree search parameters
    num_samples : int # Number of candidates to generate (width of the tree)
    num_generations : int # Number of steps to generate per candidate (depth of the tree)
    restart_on_fail: bool # Regenerate a step if it fails to generate Python codeblocks

    # Generation parameters
    do_sample : bool
    temperature : float
    top_p : float
    top_k: int
    max_new_tokens : int
    

    # Run on train or test data?
    is_submission : bool = True if os.getenv('KAGGLE_IS_COMPETITION_RERUN') else False
    validation_set : str = "kaggle-validation-set-medium"

    notebook_time_limit : int = 9 * 60 * 60 - 15 * 60 # 9 hours - 15 minute buffer

    # Debug by solving only the first problem
    debug : bool = False



## Utility functions

In [4]:
def apply_template(example: Dict[str, Any], tokenizer: PreTrainedTokenizer, prompt: str) -> Dict[str, Any]:
    messages = [
        {"role": "user", "content": prompt.format(example["prompt"], "{}")},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    example["text"] = text
    return example
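
As a small illustration of the formatting step in `apply_template`: the prompt template is filled with the problem text, and a literal `"{}"` is passed as a second argument, presumably so that a template containing an empty `\boxed{}` survives `str.format`. The helper name `fill_prompt` and the second template are hypothetical; only the `prompt.format(problem, "{}")` call comes from the notebook.

```python
def fill_prompt(prompt: str, problem: str) -> str:
    # Same call shape as in apply_template: the first "{}" receives the problem,
    # and any second "{}" is replaced by a literal "{}".
    return prompt.format(problem, "{}")


# With the template used in this notebook ("{}"), the extra argument is ignored.
plain = fill_prompt("{}", "What is 1 + 1?")

# Hypothetical template with an empty \boxed{} that must stay intact.
with_box = fill_prompt("{} Put the final answer in \\boxed{}.", "What is 1 + 1?")

print(plain)
print(with_box)
```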

In [5]:
def get_kaggle_env(config: Config):
    """Adapted from: https://www.kaggle.com/code/eabdullin/mathgenie-interlm-20b-interactive-code-running"""

    def get_train_data():
        dataset = load_dataset(f"AI-MO/{config.validation_set}", split="train[:5]")
        dataset = dataset.map(lambda x: {'answer': str(int(x['answer']) % 1000)})
        df = dataset.to_pandas()
        return df

    if not config.is_submission:    
        class train_env():
            def __init__(self, randomize=False):
                self.randomize = randomize
                
                self.df = get_train_data()
                self.df['ground_truth'] = self.df['answer']
                self.df['answer'] = -1
                
                if self.randomize:
                    self.df = self.df.reset_index().sample(frac=1).reset_index(drop=True)
                
                self.predict_called = True
                self.counter = 0
                self.len = len(self.df)
            
            
            def iter_test(self):
                while self.counter<self.len:
                    if self.predict_called:
                        self.predict_called = False
                        yield (self.df.loc[[self.counter]][['id','problem']]),(self.df.loc[[self.counter]][['id','answer']])
                    else:
                        print("You must call `predict()` successfully before you can continue with `iter_test()`")
                        yield None 
                    
            def predict(self, answer):
                # Write the prediction into the row for the current problem
                self.df.loc[self.counter, 'answer'] = answer['answer'].values[0]
                self.predict_called = True
                self.counter+=1

        env = train_env(randomize=True)
        iter_test = env.iter_test()
    else:
        # Set up the evaluation API
        import aimo

        env = aimo.make_env()
        iter_test = env.iter_test()

    return env, iter_test

### Python REPL for executing code

In [6]:
class PythonREPL:
    def __init__(self, timeout=5):
        self.timeout = timeout

    @contextmanager
    def time_limit(self, seconds):
        def signal_handler(signum, frame):
            raise TimeoutError(f"Timed out after {seconds} seconds.")

        signal.signal(signal.SIGALRM, signal_handler)
        signal.alarm(seconds)
        try:
            yield
        finally:
            signal.alarm(0)  # Disable the alarm

    def __call__(self, query: str) -> Tuple[bool, str]:
        query = "import math\nimport numpy as np\nimport sympy as sp\n" + query
        query = query.strip().split("\n")
        if "print(" not in query[-1]:
            if "#" in query[-1]:
                query[-1] = query[-1].split("#")[0]
            query[-1] = "print(" + query[-1] + ")"
        query = "\n".join(query)

        with tempfile.TemporaryDirectory() as temp_dir:
            temp_file_path = os.path.join(temp_dir, "tmp.py")

            with open(temp_file_path, "w") as f:
                f.write(query)

            with self.time_limit(self.timeout):
                result = subprocess.run(
                    ["python3", temp_file_path],
                    capture_output=True,
                    check=False,
                    text=True,
                    timeout=self.timeout,
                )

                if result.returncode == 0:
                    output = result.stdout
                    return True, output.strip()
                else:
                    error_msg = result.stderr.strip()
                    msgs = error_msg.split("\n")
                    new_msgs = []
                    want_next = False
                    for m in msgs:
                        if "Traceback" in m:
                            new_msgs.append(m)
                        elif m == msgs[-1]:
                            new_msgs.append(m)
                        elif temp_file_path in m:
                            st = m.index('"/') + 1 if '"/' in m else 0
                            ed = m.index(temp_file_path) + 1 if temp_file_path in m else None
                            clr = m[st:ed] if not ed else m[st:]
                            m = m.replace(clr, "")
                            new_msgs.append(m)
                            want_next = True
                        elif want_next:
                            new_msgs.append(m)
                            want_next = False
                    error_msg = "\n".join(new_msgs)
                    return False, error_msg.strip()


def execute_completion(
    executor: PythonREPL, completion: str, return_status: bool = False, last_code_block: bool = False
) -> str | Tuple[str, bool]:
    # executions = ["!" + code for code in re.findall(r"```bash(.*?)```", completion, re.DOTALL) if "!" not in code]
    executions= re.findall(r"```python(.*?)```", completion, re.DOTALL)
    
    if len(executions) == 0:  # directly return cot result
        return (completion, False) if return_status else completion
    else:
        if last_code_block:
            executions = [executions[-1]]

        # Python
        execution_outputs = []
        successes = []
        for code in executions:
            success = False
            
            if "subprocess" in code:
                output = "subprocess is not allowed"
                execution_outputs.append(output)
                successes.append(success)
                continue
            
            if "venv" in code:
                output = "venv is not allowed"
                execution_outputs.append(output)
                successes.append(success)
                continue
            
            try:
                success, output = executor(code)
            except (TimeoutError, subprocess.TimeoutExpired) as e:
                # Either the SIGALRM handler or subprocess.run's own timeout fired
                print("time out")
                output = e

            if not success and not return_status:
                output = ""

            execution_outputs.append(output)
            successes.append(success)

        output = str(execution_outputs[-1]).strip()
        success = successes[-1]

        if return_status:
            return output, success
        else:
            return output


def postprocess_completion(
    text: str, return_status: bool = False, last_code_block=False, timeout=5
) -> str | Tuple[str, bool]:
    executor = PythonREPL(timeout=timeout)

    result = execute_completion(executor, text, return_status=return_status, last_code_block=last_code_block)
    del executor

    return result
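
One detail of `PythonREPL.__call__` worth seeing in isolation is how the last line of a code block is wrapped in `print(...)` so that a bare expression still produces output. The following is a minimal sketch of that rewrite only; `wrap_last_line` is a hypothetical name, and the subprocess execution and traceback cleaning are left out.

```python
def wrap_last_line(code: str) -> str:
    """Wrap the final line in print(...) unless it already prints something."""
    lines = code.strip().split("\n")
    if "print(" not in lines[-1]:
        # Drop a trailing comment so it does not swallow the closing parenthesis
        lines[-1] = "print(" + lines[-1].split("#")[0] + ")"
    return "\n".join(lines)


print(wrap_last_line("x = 2\nx + 1"))
print(wrap_last_line("x = 2\nprint(x)"))
```

Note that the heuristic assumes the last line is an expression. If it is an assignment such as `y = 3`, the wrapped line becomes `print(y = 3)`, which raises a `TypeError`, and the branch then sees a traceback instead of a result.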


### Post-processing functions to extract final answers
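
The core idea behind the extraction helpers in this section is brace matching: find the last `\boxed{`, then walk forward counting braces until the opening one is closed. Below is a condensed sketch of that idea under simplifying assumptions (no `\fbox` fallback, no normalization); `last_boxed_content` is a hypothetical name and not part of the notebook.

```python
def last_boxed_content(text: str):
    """Return the content of the last \\boxed{...} in text, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start < 0:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: return what lies between the braces
                return text[start + len(marker): i]
    return None  # the box was opened but never closed


print(last_boxed_content("First \\boxed{1}, finally \\boxed{\\frac{3}{4}}"))
```

Counting braces rather than using a regular expression is what lets nested LaTeX such as `\frac{3}{4}` come back intact.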

In [7]:
def last_boxed_only_string(string):
    """
    Extracts the last LaTeX boxed or framed expression from a string.
    Args:
        string (str): The input string containing LaTeX expressions.
    Returns:
        str or None: The last boxed or framed expression, if found;
        otherwise, None.
    """

    idx = string.rfind("\\boxed")
    if idx < 0:
        idx = string.rfind("\\fbox")
        if idx < 0:
            return None

    i = idx
    right_brace_idx = None
    num_left_braces_open = 0
    while i < len(string):
        if string[i] == "{":
            num_left_braces_open += 1
        if string[i] == "}":
            num_left_braces_open -= 1
            if num_left_braces_open == 0:
                right_brace_idx = i
                break
        i += 1

    if right_brace_idx is None:
        retval = None
    else:
        retval = string[idx : right_brace_idx + 1]

    return retval


def remove_boxed(s):
    """
    Removes the LaTeX boxed command, returning the content inside the braces.
    Args:
        s (str): The string containing a LaTeX boxed expression.
    Returns:
        str or None: The content inside the boxed command, if valid;
        otherwise, None.
    """

    left = "\\boxed{"
    try:
        assert s[: len(left)] == left
        assert s[-1] == "}"
        length = len(left)
        return s[length:-1]
    except Exception:
        return None


def extract_boxed_answer(pred_str, strip_double_curly_brace=False):
    """
    Extracts the answer from a LaTeX boxed expression within
    a prediction string.
    Args:
        pred_str (str): The string containing one or more LaTeX
        boxed expressions.
        strip_double_curly_brace (bool): If True, removes an additional
        layer of braces.
    Returns:
        str or None: The extracted answer, if any; otherwise, None.
    """

    boxed_str = last_boxed_only_string(pred_str)
    if boxed_str is None:
        return None
    answer = remove_boxed(boxed_str)
    if answer is None:
        return None
    if strip_double_curly_brace:
        match = re.match(r"^\{(.*)\}$", answer)
        if match:
            answer = match.group(1)
    return answer


def normalize_final_answer(final_answer: str) -> str:
    """
    Normalizes a final answer string by removing or replacing various LaTeX
    and text elements.
    Args:
        final_answer (str): The answer string to normalize.
    Returns:
        str: The normalized answer string.
    """

    match = re.search(r"(.*?)Problem:", final_answer, flags=re.S)
    if match:
        final_answer = match.group(1)  # keep the first group, i.e. all text before "Problem:"
    # Normalize a final answer to a quantitative reasoning question.
    # final_answer = final_answer.split('=')[-1]
    SUBSTITUTIONS = [
        ("an ", ""),
        ("a ", ""),
        (".$", "$"),
        ("\\$", ""),
        (r"\ ", ""),
        (" ", ""),
        ("mbox", "text"),
        (",\\text{and}", ","),
        ("\\text{and}", ","),
        ("\\text{m}", "\\text{}"),
        ("\\le", "<"),
    ]
    REMOVED_EXPRESSIONS = [
        "square",
        "ways",
        "integers",
        "dollars",
        "mph",
        "inches",
        "ft",
        "hours",
        "km",
        "units",
        "\\ldots",
        "sue",
        "points",
        "feet",
        "minutes",
        "digits",
        "cents",
        "degrees",
        "cm",
        "gm",
        "pounds",
        "meters",
        "meals",
        "edges",
        "students",
        "childrentickets",
        "multiples",
        "\\text{s}",
        "\\text{.}",
        "\\text{\ns}",
        "\\text{}^2",
        "\\text{}^3",
        "\\text{\n}",
        "\\text{}",
        r"\mathrm{th}",
        r"^\circ",
        r"^{\circ}",
        r"\;",
        r",\!",
        "{,}",
        '"',
        "\\dots",
        "\n",
        "\r",
        "\f",
        "\\%",
    ]
    for before, after in SUBSTITUTIONS:
        final_answer = final_answer.replace(before, after)
    for expr in REMOVED_EXPRESSIONS:
        final_answer = final_answer.replace(expr, "")

    # Extract answer that is in LaTeX math, is bold,
    # is surrounded by a box, etc.
    final_answer = re.sub(r"(\\text\{)(.*?)(\})", "\\2", final_answer)
    final_answer = re.sub(r"(\\textbf\{)(.*?)(\})", "\\2", final_answer)
    final_answer = re.sub(r"(\\overline\{)(.*?)(\})", "\\2", final_answer)
    final_answer = re.sub(r"(\\boxed\{)(.*)(\})", "\\2", final_answer)
    assert "\n" not in final_answer
    assert "\r" not in final_answer
    assert "\f" not in final_answer
    if len(re.findall(r"finalansweris(.*)", final_answer)) > 0:
        final_answer = re.findall(r"finalansweris(.*)", final_answer)[-1]

    if len(re.findall(r"answer?is:?(.*)", final_answer)) > 0:
        final_answer = re.findall(r"answer?is:?(.*)", final_answer)[-1]

    if len(re.findall(r"oxed\{(.*?)\}", final_answer)) > 0:
        final_answer = re.findall(r"oxed\{(.*?)\}", final_answer)[-1]

    if len(re.findall(r"\$(.*?)\$", final_answer)) > 0:
        final_answer = re.findall(r"\$(.*?)\$", final_answer)[-1]
    final_answer = final_answer.strip()
    if "rac" in final_answer and "\\frac" not in final_answer:
        final_answer = final_answer.replace("rac", "\\frac")

    final_answer = re.sub(r"(frac)([^{])(.)", "frac{\\2}{\\3}", final_answer)
    final_answer = re.sub(r"(sqrt)([^{])", "sqrt{\\2}", final_answer)
    final_answer = final_answer.replace("$", "")

    if final_answer.replace(",", "").isdigit():
        final_answer = final_answer.replace(",", "")

    return final_answer


def naive_parse(answer: str) -> str:
    """
    Extracts and returns the numeric digits from the input string, processing them in reverse order
    until a non-numeric character is encountered after encountering the first numeric character. 
    Adapted from: https://www.kaggle.com/code/abdurrafae/improved-code-interpretation

    Args:
        answer (str): The input string to parse.

    Returns:
        str: A string consisting of the numeric digits extracted from the input, in their original order.

    Example:
        >>> naive_parse("abc123def")
        '123'
        >>> naive_parse("def456ghi")
        '456'
        >>> naive_parse("no numbers here")
        ''
    """
    out = []
    start = False
    end = False
    for l in reversed(list(answer)):
        if l in "0123456789" and not end:
            start = True
            out.append(l)
        else:
            if start:
                end = True

    out = reversed(out)
    return "".join(out)


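def validate_answer_is_numeric(x: str | int | float) -> int:
    FLOAT_TOLERANCE = 0.2
    try:
        # Keep the original float so the rounding error can actually be measured
        f = float(x)
        x = round(f)
        # Reject values that are not within tolerance of an integer
        if abs(x - f) > FLOAT_TOLERANCE:
            x = -1
    except Exception:
        x = -1
    return x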


def get_majority_vote(responses: List[int]) -> int:
    if len(responses) < 1:
        return 0
    else:
        c = Counter(responses)
        value, count = c.most_common()[0]
        return value


def filter_answers(answers: List[str]) -> List[int]:
    formatted_answers = [validate_answer_is_numeric(a) for a in answers]

    # Filter for non-negative answers
    formatted_answers = [a for a in formatted_answers if a >= 0]
    # Compute modulo
    formatted_answers = [a % 1_000 for a in formatted_answers]
    # Sanity check: answers must lie in [0, 999] after the modulo
    formatted_answers = [a for a in formatted_answers if a <= 999]
    return formatted_answers


def check_sympy_equivalence(ref_answer: str, model_answer: str) -> bool:
    def do_answers_match(ref_answer: str, model_answer: str) -> bool:
        ref_sympy = parse_latex(ref_answer)
        model_sympy = parse_latex(model_answer)
        diff = simplify(ref_sympy - model_sympy)
        return True if -1e-12 < N(diff) < 1e-12 or diff.is_zero else False

    try:
        result = do_answers_match(ref_answer, model_answer)
        return result
    except Exception as e:
        print(e)
        return False


def check_string_match(ref_answer: str, model_answer: str) -> bool:
    try:
        return ref_answer == model_answer
    except Exception as e:
        print(e)
    return False


def check_answer(ref_answer: str, model_answer: str) -> bool:
    # check if strings are the same
    correct = check_string_match(ref_answer, model_answer)
    if correct:
        return True

    # use the sympy library to check if the expressions are the same
    correct = check_sympy_equivalence(ref_answer, model_answer)
    if correct:
        return True

    return False
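
To make the voting path concrete, here is a compact sketch of what `filter_answers` followed by `get_majority_vote` does to a list of raw candidates: non-numeric and negative entries are dropped, the rest are reduced modulo 1000, and the most common value wins. The function name `vote` and the sample candidates are illustrative only, and the integer tolerance check is omitted for brevity.

```python
from collections import Counter
from typing import List


def vote(candidates: List[str]) -> int:
    kept = []
    for candidate in candidates:
        try:
            value = round(float(candidate))
        except (TypeError, ValueError, OverflowError):
            continue  # non-numeric candidates are discarded
        if value >= 0:
            kept.append(value % 1000)  # answers are reported modulo 1000
    # Fall back to 0 when nothing survives, as get_majority_vote does
    return Counter(kept).most_common(1)[0][0] if kept else 0


print(vote(["45", "45.0", "1045", "-1", "x+1", "7"]))  # -> 45
```

Here `"45"`, `"45.0"`, and `"1045"` all collapse to 45, `"-1"` and `"x+1"` are discarded, and 45 beats 7 by three votes to one.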

### vLLM with code generation and execution

In [8]:
def build_vllm(config) -> LLM:
    model_id = config.model_id
        
    num_gpus = torch.cuda.device_count()
    if "awq" in model_id.lower():
        quantization = "AWQ"
    elif "gptq" in model_id.lower():
        quantization = "gptq"
    else:
        quantization = None

    vllm = LLM(
        model=model_id,
        tensor_parallel_size=num_gpus,
        quantization=quantization,
        swap_space=0,
    )
    return vllm


def generate_batched(examples: Dict[str, Any], vllm: LLM, sampling_params: SamplingParams) -> Dict[str, Any]:
    outputs = vllm.generate(examples["gen_texts"], sampling_params, use_tqdm=True)
    examples["gen_texts"] = [o.prompt + o.outputs[0].text for o in outputs]
    return examples


def process_code(example: Dict[str, Any], config: Config, restart_on_fail: bool = False, last_step: bool = False) -> Dict[str, Any]:
    gen_text = example["gen_texts"]
    num_python_blocks = len(re.findall(r"```python(.*?)```", gen_text, re.DOTALL))

    if num_python_blocks == 0:
        if restart_on_fail:
            print("no code has ever been generated, RESTARTING")
            # reset the text to the original
            example["gen_texts"] = example["text"]
        else:
            print("no code has ever been generated, STOP")
            example["should_prune"] = True
            example["has_code"] = False
        return example

    if gen_text[-10:] != "```output\n" and ("answer is" in gen_text[-100:] or "\\boxed" in gen_text[-100:]):
        num_output_blocks = len(re.findall(r"```output(.*?)```", gen_text, re.DOTALL))
        if num_output_blocks == 0:
            print("the model hallucinated the code answer")
            example["should_prune"] = True
            return example

        if "boxed" in gen_text[-100:]:
            try:
                answer = normalize_final_answer(extract_boxed_answer(gen_text[-100:]))
            except Exception:
                answer = "-1"
        else:
            answer = normalize_final_answer(gen_text[-100:])

        example["model_answers"] = answer
        if not config.is_submission:
            example["corrects"] = check_answer(example["ground_truth"], answer)
        example["should_prune"] = True
        print("Answer is: ", answer, example["ground_truth"], example["corrects"])
        return example
    
    if last_step:
        # no point in continuing if we are at the last step
        return example
    
    if gen_text[-10:] != "```output\n":
        # something else has gone wrong with the generation
        print("warning: output block not found: ", gen_text[-40:])
        if restart_on_fail:
            example["gen_texts"] = example["text"]
        else:
            example["should_prune"] = True
        return example

    code_result, status = postprocess_completion(gen_text, return_status=True, last_code_block=True)
    # add the code result for the next round of generation
    TRUNCATION_LIMIT = 200
    if len(code_result) > TRUNCATION_LIMIT:
        code_result = code_result[:TRUNCATION_LIMIT] + " ... (output truncated)"
    example["gen_texts"] = gen_text + f"{code_result}\n```"

    return example
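
The last few lines of `process_code` are what stitch a generation round to the next one: generation stops right after the opening of an output block, the execution result is appended (truncated if long), and the block is closed so the model can continue from there. A minimal sketch of that stitching step, with `append_output` as a hypothetical helper name:

```python
FENCE = "`" * 3  # literal triple backtick, built indirectly for readability
TRUNCATION_LIMIT = 200  # same limit as in process_code


def append_output(gen_text: str, code_result: str) -> str:
    """Close the open output block with the (possibly truncated) execution result."""
    if len(code_result) > TRUNCATION_LIMIT:
        code_result = code_result[:TRUNCATION_LIMIT] + " ... (output truncated)"
    return gen_text + f"{code_result}\n{FENCE}"


# A generation that stopped on the stop string, i.e. right after opening the output block
partial = f"{FENCE}python\nprint(2 ** 10)\n{FENCE}\n{FENCE}output\n"
print(append_output(partial, "1024"))
```

Truncating the output keeps a runaway `print` loop from eating the context window, at the cost of occasionally hiding the value the model needed.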


## Define the run configuration and load model

In [9]:
debug = False
model_id = "AI-MO/Numina-Math-7B-GPTQ"
system_prompt = "{}"
validation_set = "kaggle-validation-set-medium"
is_submission = False   # Set to True if you are running on the private test set
num_samples = 48        # Number of candidates to generate (width of the tree)
num_generations = 4     # Number of steps to generate per candidate (depth of the tree)
temperature = 0.8
restart_on_fail=True
top_p = 1.0
top_k = 0
max_new_tokens = 2048

In [10]:
config = Config(
    debug=debug,
    model_id=model_id,
    system_prompt=system_prompt,
    validation_set=validation_set,
    restart_on_fail=restart_on_fail,
    is_submission=is_submission,
    num_samples=num_samples,
    num_generations=num_generations,
    do_sample=True,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    max_new_tokens=max_new_tokens
    )
print(f"=== Running submission with config ===\n\n{config}")

=== Running submission with config ===

Config(model_id='AI-MO/Numina-Math-7B-GPTQ', system_prompt='{}', num_samples=48, num_generations=4, restart_on_fail=True, do_sample=True, temperature=0.8, top_p=1.0, top_k=0, max_new_tokens=2048, is_submission=False, validation_set='kaggle-validation-set-medium', notebook_time_limit=31500, debug=False)


In [11]:
# load the vllm instance and set sampling parameters
sampling_params = SamplingParams(
    temperature=config.temperature,
    max_tokens=config.max_new_tokens,
    stop=["```output\n"],
    include_stop_str_in_output=True,
)
vllm = build_vllm(config)

INFO 07-09 08:31:16 gptq_marlin.py:137] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
INFO 07-09 08:31:16 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='AI-MO/Numina-Math-7B-GPTQ', speculative_config=None, tokenizer='AI-MO/Numina-Math-7B-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=AI-MO/Numina-Math-7B-GPTQ)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 07-09 08:31:18 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-09 08:31:18 weight_utils.py:261] No model.safetensors.index.json found in remote.
INFO 07-09 08:31:33 model_runner.py:160] Loading model weights took 7.3827 GB
INFO 07-09 08:31:34 gpu_executor.py:83] # GPU blocks: 8544, # CPU blocks: 0
INFO 07-09 08:31:34 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-09 08:31:34 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-09 08:31:42 model_runner.py:965] Graph capturing finished in 8 secs.


## Generate solutions with tree search

In [12]:
all_start = time.time()
NUM_PROCS = os.cpu_count()
env, iter_test = get_kaggle_env(config)

final_answers = []
time_taken_per_problem = []

# Loop through problems
for test, sample_submission in tqdm(iter_test, desc="Solving problems"):
    start_time = time.time()

    # Wrap each problem with chat template
    example = apply_template(
        {"prompt": test.problem.values[0]},
        tokenizer=vllm.get_tokenizer(),
        prompt=config.system_prompt,
    )
    print(f"=== INPUT FOR PROBLEM ID {test.id.values[0]} ===\n{example}\n")

    # Copy the prompt M times to define the width of the tree
    sample_dataset = []
    for i in range(config.num_samples):
        repeated_example = {
            "problem": "remove",  # not used for the submission TODO Remove
            "ground_truth": "unknown",  # not used for the submission TODO Remove
            "text": example["text"],
            "gen_texts": example["text"],  # used to store all the generated text
            "should_prune": False,
            "problem_index": -1,  # not used for the submission TODO Remove
            "model_answers": "-1",
            "has_code": True,
            "corrects": False,  # not used for the submission TODO Remove
        }
        sample_dataset.append(repeated_example)

    sample_dataset = Dataset.from_list(sample_dataset)
    pruned_datasets = []

    # For each candidate, we generate N times to define the depth of the tree (e.g. 6 steps = 5 code blocks)
    for step in range(config.num_generations):
        sample_dataset = sample_dataset.map(
            generate_batched,
            batch_size=128,
            batched=True,
            fn_kwargs={"vllm": vllm, "sampling_params": sampling_params},
            load_from_cache_file=False,
        )
        # Execute Python code blocks from step i
        sample_dataset = sample_dataset.map(
            process_code,
            num_proc=NUM_PROCS,
            load_from_cache_file=False,
            fn_kwargs={
                "config": config,
                "restart_on_fail": config.restart_on_fail,
                "last_step": step == (config.num_generations - 1),
            },
        )
        # Define which nodes should be pruned
        pruned_dataset = sample_dataset.filter(
            lambda x: x["should_prune"] is True, load_from_cache_file=False
        )
        if len(pruned_dataset) > 0:
            pruned_datasets.append(pruned_dataset)
        sample_dataset = sample_dataset.filter(
            lambda x: x["should_prune"] is False, load_from_cache_file=False
        )

    # Gather candidates from all the nodes
    pruned_datasets.append(sample_dataset)
    sample_dataset = concatenate_datasets(pruned_datasets)
    candidates = sample_dataset["model_answers"]

    # Postprocess the results and submit
    print(f"=== CANDIDATE ANSWERS ({len(candidates)}) ===\n{candidates}\n")
    filtered_answers = filter_answers(candidates)
    print(f"=== FILTERED ANSWERS ({len(filtered_answers)}) ===\n{filtered_answers}\n")
    majority_answer = get_majority_vote(filtered_answers)
    print(f"=== MAJORITY ANSWER (mod 1000) ===\n{majority_answer}\n")
    sample_submission["answer"] = majority_answer
    env.predict(sample_submission)

    # Store answers to compute accuracy on validation set
    test["model_answer"] = majority_answer
    test["candidates"] = [candidates]
    final_answers.append(test)

    # Log time taken
    time_taken_per_problem.append(time.time() - start_time)

    if config.debug:
        break

print("Time taken", time.time() - all_start)
print("Time per problem", time_taken_per_problem)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]



=== INPUT FOR PROBLEM ID 2 ===
{'prompt': 'What is the degree measure of the acute angle formed by lines with slopes $2$ and $\\frac{1}{3}$?', 'text': '### Problem: What is the degree measure of the acute angle formed by lines with slopes $2$ and $\\frac{1}{3}$?\n### Solution: '}



Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:09<00:00,  5.13it/s, est. speed input: 179.48 toks/s, output: 1809.98 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:04<00:00, 11.96it/s, est. speed input: 4851.00 toks/s, output: 890.99 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  45 unknown False
cannot determine truth value of Relational
Answer is:  45 unknown cannot determine truth value of Relational
FalseAnswer is: 
 45 unknown False
cannot determine truth value of Relational
Answer is:  cannot determine truth value of Relational
45 unknown Falsecannot determine truth value of Relational

Answer is: Answer is:   4545  unknown cannot determine truth value of RelationalFalseunknown 
FalseAnswer is: 

 45.0 unknown False
cannot determine truth value of Relational
Answer is:  45 unknown False
cannot determine truth value of Relational
Answer is:  45 cannot determine truth value of Relational
Answer is:  45 unknown False
unknown cannot determine truth value of Relationalcannot determine truth value of Relational
False

Answer is: Answer is:  45cannot determine truth value of Relational
  unknown45 unknown False
Answer is:   False45
 unknown False
cannot determine truth value of Relationalcannot determine tru

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]


Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.26it/s, est. speed input: 2123.73 toks/s, output: 349.12 toks/s]
num_proc must be <= 8. Reducing num_proc to 8 for dataset of size 8.


Map (num_proc=8):   0%|          | 0/8 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  45 unknowncannot determine truth value of Relational 
Answer is: False 
45 unknown Falsecannot determine truth value of Relational

Answer is:  45 unknown False
cannot determine truth value of Relationalcannot determine truth value of Relational

Answer is: Answer is:  45  unknown 45False 
unknown False


Filter:   0%|          | 0/8 [00:00<?, ? examples/s]

Filter:   0%|          | 0/8 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]


Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.61it/s, est. speed input: 1371.41 toks/s, output: 193.45 toks/s]
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.


Map (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  45 unknown False
cannot determine truth value of Relational
Answer is:  45.0 unknown False


Filter:   0%|          | 0/3 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3 [00:00<?, ? examples/s]

Solving problems: 1it [00:40, 40.47s/it]

=== CANDIDATE ANSWERS (48) ===
['45', '45', '45', '45', '45', '45', '45', '45', '45', '45.0', '45', '45', '45', '45', '45.0', '45', '45', '45', '45.0', '45', '45.0', '45', '45', '45', '45', '45', '45', '45', '45', '45.0', '45', '45', '45', '45', '45', '45', '45', '45', '45', '45.0', '45', '45', '45', '45', '45', '45', '45.0', '-1']

=== FILTERED ANSWERS (47) ===
[45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45]

=== MAJORITY ANSWER (mod 1000) ===
45

=== INPUT FOR PROBLEM ID 0 ===
{'prompt': 'Cities $A$ and $B$ are $45$ miles apart. Alicia lives in $A$ and Beth lives in $B$. Alicia bikes towards $B$ at 18 miles per hour. Leaving at the same time, Beth bikes toward $A$ at 12 miles per hour. How many miles from City $A$ will they be when they meet?', 'text': '### Problem: Cities $A$ and $B$ are $45$ miles apart. Alicia lives in $A$ and Beth lives in $

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:12<00:00,  3.95it/s, est. speed input: 343.74 toks/s, output: 1874.92 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:04<00:00, 10.87it/s, est. speed input: 6192.45 toks/s, output: 693.16 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  27 unknown False
cannot determine truth value of Relationalcannot determine truth value of Relational
Answer is: 
 cannot determine truth value of Relational27 unknown 
Answer is: False
Answer is:  27 unknown cannot determine truth value of Relational
Answer is:  False 
2727 unknown  False
unknown Falsecannot determine truth value of Relational
cannot determine truth value of Relationalcannot determine truth value of Relational 
Answer is: 
 Answer is: 27 cannot determine truth value of Relational 27unknown  
False
Answer is:  27 unknownFalsecannot determine truth value of Relational

unknownAnswer is: 
  False27Answer is:  27cannot determine truth value of Relational unknown 

 Answer is: unknownFalse  
27False 
unknown False
cannot determine truth value of Relational
Answer is:  27 unknown False
cannot determine truth value of Relational
Answer is:  27 unknown False
cannot determine truth value of Relational
Answer is:  27 unknow

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]


Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.28it/s, est. speed input: 1177.66 toks/s, output: 256.99 toks/s]
num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.


Map (num_proc=2):   0%|          | 0/2 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  27 unknown False


Filter:   0%|          | 0/2 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]


Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.83it/s, est. speed input: 3319.66 toks/s, output: 147.98 toks/s]
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  27 unknown False


Filter:   0%|          | 0/1 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1 [00:00<?, ? examples/s]

Solving problems: 2it [01:20, 40.20s/it]

=== CANDIDATE ANSWERS (48) ===
['27', '27', '27', '27', '18', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27', '27']

=== FILTERED ANSWERS (48) ===
[27, 27, 27, 27, 18, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27]

=== MAJORITY ANSWER (mod 1000) ===
27

=== INPUT FOR PROBLEM ID 1 ===
{'prompt': 'Positive real numbers $x$ and $y$ satisfy $y^3=x^2$ and $(y-x)^2=4y^2$. What is $x+y$?', 'text': '### Problem: Positive real numbers $x$ and $y$ satisfy $y^3=x^2$ and $(y-x)^2=4y^2$. What is $x+y$?\n### Solution: '}



Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:14<00:00,  3.31it/s, est. speed input: 165.28 toks/s, output: 1576.39 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 48/48 [00:07<00:00,  6.50it/s, est. speed input: 3589.90 toks/s, output: 1102.05 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

cannot determine truth value of Relational
cannot determine truth value of RelationalAnswer is: cannot determine truth value of Relational 

2Answer is:  cannot determine truth value of Relational
Answer is:   Answer is: unknown36 cannot determine truth value of Relational36unknown unknown  Falsecannot determine truth value of Relational
 
 FalseAnswer is: 
False2Answer is: 
 
 0.128188  unknown Falsecannot determine truth value of Relational
36 unknown unknowncannot determine truth value of Relational
 False
FalseAnswer is: 
 cannot determine truth value of Relationalcannot determine truth value of Relational36Answer is:   36unknown

 
Answer is: Answer is: unknown    False36cannot determine truth value of RelationalFalse36
  
unknownunknown 
 FalseAnswer is: False
 
36 unknown False
cannot determine truth value of Relational
Answer is:  cannot determine truth value of Relational
unknownAnswer is:  36 2unknown  False
 False
cannot determine truth value of Relational
Answer is:  36 unk

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/19 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 19/19 [00:05<00:00,  3.59it/s, est. speed input: 2578.18 toks/s, output: 454.32 toks/s]
num_proc must be <= 19. Reducing num_proc to 19 for dataset of size 19.


Map (num_proc=19):   0%|          | 0/19 [00:00<?, ? examples/s]

cannot determine truth value of Relationalcannot determine truth value of Relational
Answer is:  36 cannot determine truth value of Relational
unknown Answer is: 
False 
36Answer is:   36.0 cannot determine truth value of Relationalcannot determine truth value of Relationalunknown unknown False
False
Answer is: 

cannot determine truth value of Relational Answer is: 36
   36unknownAnswer is:   Falseunknowncannot determine truth value of Relational 0
Falsecannot determine truth value of Relational
 
unknown FalseAnswer is: 

 Answer is: 36  36unknown  unknownFalse 
False

Answer is:  Answer is: The value \( x + y = 36 \) is accurate.
  36
36  cannot determine truth value of Relationalunknowncannot determine truth value of Relationalunknown  False
False
Answer is: 

 Answer is: 36  unknown36  False
unknown False
cannot determine truth value of Relational
Answer is:  9+3^{2/3} unknown False


Filter:   0%|          | 0/19 [00:00<?, ? examples/s]

Filter:   0%|          | 0/19 [00:00<?, ? examples/s]

Map:   0%|          | 0/6 [00:00<?, ? examples/s]


Processed prompts: 100%|█████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.11it/s, est. speed input: 850.89 toks/s, output: 314.05 toks/s]
num_proc must be <= 6. Reducing num_proc to 6 for dataset of size 6.


Map (num_proc=6):   0%|          | 0/6 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  36 unknown False


Filter:   0%|          | 0/6 [00:00<?, ? examples/s]

Filter:   0%|          | 0/6 [00:00<?, ? examples/s]

Solving problems: 3it [02:16, 47.58s/it]

=== CANDIDATE ANSWERS (48) ===
['0.128188', '2', '2', '36', '36', '36', '36', '36', '36', '36', '36', '2', '36', '36', '36', '2', '36', '36', '36', '0', '36', '36', '36', '36', '36', '36', '36', '36', '36', '36.0', '36', '36', '36', '36', '36', '36', '36', '0', '9+3^{2/3}', '36', '36', '36', '36', '-1', '-1', '-1', '-1', '-1']

=== FILTERED ANSWERS (42) ===
[0, 2, 2, 36, 36, 36, 36, 36, 36, 36, 36, 2, 36, 36, 36, 2, 36, 36, 36, 0, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 0, 36, 36, 36, 36]

=== MAJORITY ANSWER (mod 1000) ===
36

=== INPUT FOR PROBLEM ID 3 ===
{'prompt': 'What is the value of\n\\[2^3 - 1^3 + 4^3 - 3^3 + 6^3 - 5^3 + \\dots + 18^3 - 17^3?\\]', 'text': '### Problem: What is the value of\n\\[2^3 - 1^3 + 4^3 - 3^3 + 6^3 - 5^3 + \\dots + 18^3 - 17^3?\\]\n### Solution: '}



Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:11<00:00,  4.29it/s, est. speed input: 261.77 toks/s, output: 1780.09 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 48/48 [00:02<00:00, 20.56it/s, est. speed input: 9906.12 toks/s, output: 1359.04 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  3159 unknown False
cannot determine truth value of Relationalcannot determine truth value of Relational
Answer is: cannot determine truth value of Relational 3159

Answer is: Answer is:   5831 unknown False
 3159 unknown unknownFalse False

cannot determine truth value of Relationalcannot determine truth value of Relational
Answer is:  3159 cannot determine truth value of Relationalunknown
 
Answer is: FalseAnswer is:  3159 unknown False
 3159
 unknown cannot determine truth value of Relationalcannot determine truth value of Relational
False
Answer is: Answer is:  28519  3159unknowncannot determine truth value of Relational
 unknown
 Answer is:   False3159
False unknown 
Falsecannot determine truth value of Relational
cannot determine truth value of Relational
Answer is:  3159
 unknown 
FalseAnswer is:   3159unknown False
cannot determine truth value of Relationalcannot determine truth value of Relational
cannot determine truth val

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Solving problems: 4it [02:43, 39.36s/it]

=== CANDIDATE ANSWERS (48) ===
['3159', '5831', '3159', '3159', '3159', '28519', '3159', '3159', '3159', '3159', '3159', '3159', '3159', '3159', '5831', '3159', '3159', '3159', '3159', '3249', '3159', '3159', '511', '3159', '3159', '487', '3159', '3159', '3159', '5831', '3159', '3159', '3159', '4199', '3159', '3159', '3700', '511', '3159', '3159', '3159', '3159', '-511', '-3159', '3159', '3159', '3159', '-56979']

=== FILTERED ANSWERS (45) ===
[159, 831, 159, 159, 159, 519, 159, 159, 159, 159, 159, 159, 159, 159, 831, 159, 159, 159, 159, 249, 159, 159, 511, 159, 159, 487, 159, 159, 159, 831, 159, 159, 159, 199, 159, 159, 700, 511, 159, 159, 159, 159, 159, 159, 159]

=== MAJORITY ANSWER (mod 1000) ===
159

=== INPUT FOR PROBLEM ID 4 ===
{'prompt': 'In a table tennis tournament every participant played every other participant exactly once. Although there were twice as many right-handed players as left-handed players, the number of games won by left-handed players was $40\\%$ more than th

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 48/48 [00:19<00:00,  2.51it/s, est. speed input: 223.03 toks/s, output: 1746.42 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]


Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.18it/s, est. speed input: 2614.51 toks/s, output: 1204.68 toks/s]
num_proc must be <= 48. Reducing num_proc to 48 for dataset of size 48.


Map (num_proc=48):   0%|          | 0/48 [00:00<?, ? examples/s]


cannot determine truth value of Relationalcannot determine truth value of Relational

Answer is: 36 Answer is: 
 Answer is:  3unknown unknown   False36False

 unknown False
cannot determine truth value of Relational
Answer is: cannot determine truth value of Relational
 33 unknown False
Answer is:  85cannot determine truth value of Relational 
unknownAnswer is:  False
 36 unknown False


Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Filter:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/42 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 42/42 [00:15<00:00,  2.71it/s, est. speed input: 3398.82 toks/s, output: 936.23 toks/s]
num_proc must be <= 42. Reducing num_proc to 42 for dataset of size 42.


Map (num_proc=42):   0%|          | 0/42 [00:00<?, ? examples/s]

cannot determine truth value of Relational
Answer is:  36 unknown cannot determine truth value of Relational
Falsecannot determine truth value of Relationalcannot determine truth value of Relational
Answer is: Answer is:  36 unknown False

 44850 unknown False

Answer is:  18 unknown False
cannot determine truth value of Relational
Answer is:  36cannot determine truth value of Relationalcannot determine truth value of Relational

Answer is:  unknown  Answer is:  Falsecannot determine truth value of Relational

1683240 Answer is: unknown   cannot determine truth value of RelationalFalse3unknown
  
unknownAnswer is: False  
36False 
unknown False


Filter:   0%|          | 0/42 [00:00<?, ? examples/s]

Filter:   0%|          | 0/42 [00:00<?, ? examples/s]

Map:   0%|          | 0/33 [00:00<?, ? examples/s]


Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 33/33 [00:18<00:00,  1.78it/s, est. speed input: 2944.39 toks/s, output: 803.63 toks/s]
num_proc must be <= 33. Reducing num_proc to 33 for dataset of size 33.


Map (num_proc=33):   0%|          | 0/33 [00:00<?, ? examples/s]

cannot determine truth value of RelationalAnswer is: 
 False 105 unknown
cannot determine truth value of Relational
Answer is:  36 unknown False


Filter:   0%|          | 0/33 [00:00<?, ? examples/s]

Filter:   0%|          | 0/33 [00:00<?, ? examples/s]

Solving problems: 5it [04:20, 52.06s/it]

=== CANDIDATE ANSWERS (48) ===
['36', '36', '3', '33', '85', '36', '36', '36', '44850', '3240', '168', '18', '36', '3', '36', '105', '36', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1']

=== FILTERED ANSWERS (17) ===
[36, 36, 3, 33, 85, 36, 36, 36, 850, 240, 168, 18, 36, 3, 36, 105, 36]

=== MAJORITY ANSWER (mod 1000) ===
36

Time taken 261.3344883918762
Time per problem [40.46649241447449, 40.00853776931763, 56.36067175865173, 26.762176990509033, 96.70355486869812]
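
The step from `CANDIDATE ANSWERS` to `FILTERED ANSWERS` to `MAJORITY ANSWER (mod 1000)` in the log above can be sketched roughly as follows. This is a minimal reconstruction inferred only from the printed lists, not the notebook's actual implementation: the helper names `filter_answers` and `majority_vote` are hypothetical, and the rules (drop anything that does not parse as a number, drop negatives such as the `-1` placeholder, truncate to an integer, reduce mod 1000) are assumptions chosen to reproduce the output shown.

```python
import math
from collections import Counter


def filter_answers(candidates):
    """Hypothetical filter: keep non-negative numeric candidates as ints mod 1000."""
    filtered = []
    for cand in candidates:
        try:
            value = float(cand)
        except ValueError:
            continue  # e.g. '9+3^{2/3}' does not parse as a plain number
        if not math.isfinite(value) or value < 0:
            continue  # assumption: negatives (including the '-1' placeholder) are dropped
        filtered.append(int(value) % 1000)  # '45.0' -> 45, '0.128188' -> 0, '3159' -> 159
    return filtered


def majority_vote(answers):
    """Return the most common answer, or None if nothing survived filtering."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


candidates = ["0.128188", "2", "36", "36.0", "9+3^{2/3}", "-1", "3159", "-511"]
filtered = filter_answers(candidates)
print(filtered)
print(majority_vote(filtered))
```

One consequence of this kind of filter, visible in the fourth problem above, is that the vote is taken over residues: `3159` is counted as `159`, so the reported majority answer is the value mod 1000 rather than the raw candidate.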





## Compute local validation accuracy

In [13]:
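if not config.is_submission:
    # Join the per-problem predictions onto the validation problems
    answers_df = env.df.merge(pd.concat(final_answers))
    # Cast predictions to str so they can be compared with ground_truth
    answers_df["correct"] = answers_df["ground_truth"] == answers_df["model_answer"].astype(str)
    # Fraction of problems where the majority answer matches the ground truth
    print("Accuracy", answers_df["correct"].mean())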

Accuracy 1.0
