# Zero-shot MMOS-DeepSeekMath-7B with self-consistency and generated code reasoning evaluation

Self-consistency replaces standard greedy decoding in reasoning pipelines: several diverse answers are sampled and then aggregated, e.g., by taking the most common answer ([SC-CoT paper](https://arxiv.org/pdf/2203.11171.pdf)).
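
As a minimal sketch of the aggregation step, the snippet below takes a list of sampled answers and returns the most common one. The function name `aggregate_by_majority_vote` and the fallback value are illustrative assumptions, not the notebook's own aggregation code.

```python
from collections import Counter

def aggregate_by_majority_vote(sampled_answers, default_answer=0):
    """Return the most common valid answer (hypothetical helper).

    Answers that could not be parsed are represented as None and ignored.
    """
    valid_answers = [answer for answer in sampled_answers if answer is not None]
    if not valid_answers:
        # No sample produced a usable answer, fall back to the default
        return default_answer
    return Counter(valid_answers).most_common(1)[0][0]

print(aggregate_by_majority_vote([52, 52, None, 17, 52, 17]))  # 52
```

When two answers are tied, `Counter.most_common` returns the one that was seen first, so the sampling order breaks ties in this sketch.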

In this kernel, we use the MMOS-DeepSeekMath-7B RL-tuned backbone. In my experiments, this model produced more consistent code reasoning, and executing the generated code blocks helps reduce arithmetic hallucinations.

## References

- https://www.kaggle.com/code/ironbar/autobots-roll-out/notebook
- https://www.kaggle.com/code/abdurrafae/improved-code-interpretation
- https://kaggle.com/code/xiaoz259/pure-rng/notebook
- https://www.kaggle.com/code/olyatsimboy/aimo-openmath-mistral-baseline
- https://www.kaggle.com/code/aatiffraz/prompt-prediction-w-mixtral-mistral7b-gemma-llama
- https://www.kaggle.com/code/thedrcat/aimo-mixtral-baseline

## Configuration

In [1]:
import os

def is_running_at_home():
    return os.environ.get('USERNAME', 'Kaggle') == 'gbarbadillo'

if is_running_at_home():
    class ConfigurationPaths:
        dataset = '/mnt/hdd0/Kaggle/aimo/external_data/filtered_MATH_test_5.csv'
        model_path = "/home/gbarbadillo/data/deepseekmath"
        few_shot_dataset = '/mnt/hdd0/Kaggle/aimo/data/AIMO_train_with_solutions.csv'
    vllm_args = ['--gpu-memory-utilization', '0.80']
else:
    class ConfigurationPaths:
        dataset = '/kaggle/input/filtered-math/filtered_MATH_test_5.csv'
        model_path = "/kaggle/input/deepseek-math"
        few_shot_dataset = '/kaggle/input/filtered-math/AIMO_train_with_solutions.csv'
    vllm_args = [
        '--dtype', 'half', #bfloat16 gives error on T4 GPUs
        '--gpu-memory-utilization', '0.99', # 0.625 should be the same memory as T4
        '--kv-cache-dtype', 'fp8_e5m2',
        '--swap-space', '3',
        '--max-num-seqs', '256', # default is 256
        '--enforce-eager',
    ]

class CFG:
    # Data parameters
    quick_save = False # If true it will set the time limit to 1 so the saving of the notebook is really quick
    submission_mode = False # If True it will use aimo.env otherwise a mock environment
    ## Data parameters only used when submission_mode is False
    dataset = ConfigurationPaths.dataset
    problem_indices = None # list(range(580))[::11] # If not None will restrict the evaluation to the given problem idx of the dataset
    # Model and VLLM parameters
    model_path = ConfigurationPaths.model_path
    cuda_visible_devices = [0, 1] # A list of cuda visible devices
    context_window_size = 1024
    vllm_verbose = False
    vllm_args = vllm_args
    max_workers = 80 # If None the number of workers will match the number of repetitions
    # Run parameters
    time_limit = 31500 # seconds, 31500 by default which is 8.75 hours
    verbose = False
    save_results = True
    result_priority = ['code_answer', 'text_answer'] #['code_answer', 'boxed_answer', 'text_answer'] # Select which answers will be used as result
    # few-shot parameters
    few_shot_dataset = ConfigurationPaths.few_shot_dataset
    few_shot_samples = 1
    max_sample_tokens = 300 # problems with more than this many tokens will be filtered out
    max_prompt_tokens = 1024 # 3072 # only prompts with fewer than this many tokens will be used
    difficulty_levels = None # levels outside this range won't be used
    # Inference parameters
    confidence_level = 0.95 # used to stop sampling solutions once the difference between the first and second most voted options is statistically significant
    n_repetitions = 100
    random_seed = None # None or int
    max_new_tokens = 640 #2048
    max_coding_errors = 2
    code_output_truncate_length = 125 # max length of the code output before truncation
    default_answer = 0 # this will be the response when the system does not have a valid answer
    stop_words = ["```output", "```python", "```\nOutput" , ")\n```" , "``````output", 'Problem:', 'User:']
    # https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683
    # temperature for text generation
    temperature_text = 0.9
    top_p_text = 0.9
    # temperature for coding generation
    temperature_code = 0.9
    top_p_code = 0.9
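
The `confidence_level` setting above implies that sampling can stop early once the most voted answer leads the runner-up by a statistically significant margin. The stopping rule itself is not shown in this section, so the following is only a sketch under assumptions: it uses a one-sided z-test with the multinomial normal approximation, and the names `is_lead_significant` and `min_samples` are hypothetical.

```python
from collections import Counter
from statistics import NormalDist

def is_lead_significant(answers, confidence_level=0.95, min_samples=5):
    """One-sided z-test on the gap between the two most voted answers.

    Sketch only: the normal approximation and `min_samples` are
    assumptions, not taken from the notebook.
    """
    n = len(answers)
    if n < min_samples:
        return False
    top_two = Counter(answers).most_common(2)
    p1 = top_two[0][1] / n
    p2 = top_two[1][1] / n if len(top_two) > 1 else 0.0
    gap = p1 - p2
    # Variance of (p1 - p2) when the vote counts follow a multinomial distribution
    variance = (p1 + p2 - gap ** 2) / n
    if variance <= 0:
        return True  # every sample agrees on the same answer
    z_score = gap / variance ** 0.5
    return z_score > NormalDist().inv_cdf(confidence_level)

print(is_lead_significant([42] * 20 + [7] * 3))  # True
print(is_lead_significant([42] * 6 + [7] * 5))   # False
```

With `n_repetitions = 100`, a rule of this kind would let easy problems finish after a handful of agreeing samples and keep the remaining time budget for harder ones.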

## Imports

In [2]:
import logging

for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

In [3]:
import time
NOTEBOOK_START_TIME = time.time()

In [4]:
def install_vllm_and_openai():
    if is_running_at_home():
        logging.info("VLLM is already installed, skipping installation")
        return
    try:
        import openai
    except ImportError:
        logging.info("Installing VLLM and openai")
        !pip uninstall -q -y torch
        !pip install --no-index --find-links=/kaggle/input/making-wheels-of-necessary-packages-for-vllm vllm -q
        !pip install --no-index --find-links=/kaggle/input/making-wheels-of-necessary-packages-for-vllm openai -q

# I need to do the installation here because otherwise torch will be imported before the vllm installation
install_vllm_and_openai()

2024-06-18 10:13:35,678 - INFO - VLLM is already installed, skipping installation


In [5]:
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

import sys
import subprocess
from IPython.display import display, Markdown
import pandas as pd
from tqdm.auto import tqdm
import torch
import gc
import re
import math
import random
import json
from collections import Counter
import numpy as np
import tempfile
from pydantic import BaseModel
from typing import Optional
import datetime
from scipy.stats import norm

from transformers import AutoTokenizer
import transformers
print(f"Transformers Version: {transformers.__version__}")

import matplotlib.pyplot as plt
import matplotlib as mpl

from openai import OpenAI
import threading
import requests
from concurrent.futures import ProcessPoolExecutor
import psutil

plt.plot()
plt.close('all')
plt.rcParams["figure.figsize"] = (20, 5)
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['font.size'] = 16
logging.info('Imported all libraries.')

Transformers Version: 4.41.2


2024-06-18 10:13:39,725 - INFO - Imported all libraries.


In [6]:
# Disable OpenAI and httpx logging
# Configure logging level for specific loggers by name
logging.getLogger("openai").setLevel(logging.ERROR)
logging.getLogger("httpx").setLevel(logging.ERROR)
# # Disable specific loggers by name
# logging.getLogger("openai").disabled = True
# logging.getLogger("httpx").disabled = True

## Code

### Load data

In [7]:
class MockEnvWithDataframe:
    """
    This class has the same interface as aimo.env, thus you can reuse the same code
    for making submissions or evaluating other datasets
    """
    def __init__(self, df):
        """
        Initializes the mock environment with a dataframe containing problems.
        """
        self.df = df
        self.submissions = []

    def iter_test(self):
        """
        Simulates the iter_test function by yielding each problem with an accompanying sample_submission.
        """
        for _, row in self.df.iterrows():
            problem = pd.DataFrame([row])
            sample_submission = pd.DataFrame({'id': problem.id, 'answer': [None]})
            yield problem, sample_submission

    def predict(self, sample_submission):
        self.submissions.append(sample_submission)
        
    def get_all_submissions(self):
        return pd.concat(self.submissions)

In [8]:
if CFG.submission_mode:
    import aimo
    env = aimo.make_env()
    if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
        N_PROBLEMS = 50
    else:
        N_PROBLEMS = 3
else:
    df = pd.read_csv(CFG.dataset)
    if CFG.problem_indices is not None:
        df = df.iloc[CFG.problem_indices].reset_index(drop=True)
    if 'answer' in df.columns:
        df['ground_truth'] = df['answer']
    elif CFG.dataset == '/kaggle/input/ai-mathematical-olympiad-prize/test.csv': 
        df['ground_truth'] = 0
    N_PROBLEMS = len(df)
    display(df)
    env = MockEnvWithDataframe(df)
iter_test = env.iter_test()

Unnamed: 0,problem,level,type,solution,stage,source,n_boxed,answer,input_tokens,output_tokens,total_tokens,id,ground_truth
0,John computes the sum of the elements of each ...,Level 5,Algebra,"Among the two-element subsets of $\{1,2,3,4,5,...",test,MATH,1,105,43,94,137,5572,105
1,Rationalize the denominator: $\frac{1}{1 + \sq...,Level 5,Algebra,We begin by grouping terms in the denominator ...,test,MATH,1,12,75,405,480,5583,12
2,Given that $f(2)=5$ and $f^{-1}(x+4)=2f^{-1}(x...,Level 5,Algebra,Note that $f(2)=5$ implies $f^{-1}(5)=2$. Appl...,test,MATH,1,23,42,148,190,5584,23
3,Simplify $\frac{3}{\sqrt[5]{16}}+\frac{1}{\sqr...,Level 5,Algebra,Rationalizing each of the two fractions on its...,test,MATH,1,5,81,250,331,5600,5
4,Let $f$ be defined by \[f(x) = \left\{\n\begi...,Level 5,Algebra,The number $f^{-1}(0)$ is the value of $x$ suc...,test,MATH,1,0,89,744,833,5633,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
575,An angle $x$ is chosen at random from the inte...,Level 5,Precalculus,Because $\cos(90^{\circ}-x)=\sin x$ and $\sin(...,test,MATH,1,92,120,241,361,6464,92
576,The equation\n\[4 \cos 27^\circ = \sqrt{a + \s...,Level 5,Precalculus,"First, we derive the values of $\cos 36^\circ....",test,MATH,1,18,92,791,883,6465,18
577,"Let $x_1,$ $x_2,$ $x_3,$ $y_1,$ $y_2,$ and $y_...",Level 5,Precalculus,"In general,\n\[\frac{1}{2} \begin{vmatrix} x_1...",test,MATH,1,144,178,278,456,6466,144
578,Find the smallest positive real number $C$ for...,Level 5,Precalculus,Let $\bold{v} = \begin{pmatrix} x \\ y \end{pm...,test,MATH,1,4,94,632,726,6467,4


### Model

In [9]:
def get_tokenizer(model_path):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer

tokenizer = get_tokenizer(CFG.model_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
class TextGenerator():
    """
    Abstraction for generating text and code efficiently in separate steps
    """
    def __init__(self, cfg, port=8000):
        self.cfg = cfg
        self.reset()
        self.client = OpenAI(api_key='EMPTY', base_url=f"http://localhost:{port}/v1")

    def reset(self):
        self.prompt_tokens = 0
        self.generated_tokens = 0
        self.past_key_values = None
        self.set_generation_mode('text')
        self.max_new_tokens = self.cfg.max_new_tokens

    def set_generation_mode(self, mode):
        if mode == 'text':
            self.set_sampling_parameters(self.cfg.temperature_text, self.cfg.top_p_text)
        elif mode == 'code':
            self.set_sampling_parameters(self.cfg.temperature_code, self.cfg.top_p_code)
        else:
            raise KeyError(mode)

    def set_sampling_parameters(self, temperature, top_p):
        self.sampling_parameters = dict(temperature=temperature, top_p=top_p)

    def are_generation_tokens_available(self):
        return self.generated_tokens < self.max_new_tokens

    def verify_max_new_tokens(self):
        if self.max_new_tokens > self.cfg.context_window_size - self.prompt_tokens:
            self.max_new_tokens = self.cfg.context_window_size - self.prompt_tokens
            logging.warning(f'Reducing max_new_tokens to {self.max_new_tokens} to avoid exceeding the context window of {self.cfg.context_window_size}')

    def generate(self, prompt, mode='text'):
        self.set_generation_mode(mode)
        if not self.are_generation_tokens_available():
            logging.warning('Input text exceeded the available generation tokens. This is likely happening because of a large code output.')
            return prompt

        t0 = time.time()
        clear_memory()
        completion = self.client.completions.create(
            model=self.cfg.model_path,
            prompt=prompt,
            max_tokens=self.max_new_tokens - self.generated_tokens,
            **self.sampling_parameters,
            echo=True,
            stop=self.cfg.stop_words,
        )
        decoded_output = completion.choices[0].text
        if completion.choices[0].finish_reason == 'stop' and completion.choices[0].stop_reason is not None:
            decoded_output += completion.choices[0].stop_reason

        if self.prompt_tokens == 0:
            self.prompt_tokens = completion.usage.prompt_tokens
            # logging.info(f'Prompt has {self.prompt_tokens} tokens.')
            self.verify_max_new_tokens()
        newly_generated_tokens = completion.usage.completion_tokens
        self.generated_tokens = completion.usage.total_tokens - self.prompt_tokens
        # logging.info(f'Generating {mode} speed: {newly_generated_tokens/(time.time() - t0):.1f} tokens/s ({newly_generated_tokens}) ({self.generated_tokens}/{self.max_new_tokens})')
        return decoded_output

    def __call__(self, prompt, mode):
        return self.generate(prompt, mode)

In [11]:
def log_gpu_memory():
    for device in range(torch.cuda.device_count()):
        # logging.info(f'GPU {device} memory allocated: {torch.cuda.memory_allocated(device)/1024**3:.1f} GB, max memory allocated: {torch.cuda.max_memory_allocated(device)/1024**3:.1f} GB')
        available, total = torch.cuda.mem_get_info(device)
        logging.info(f'GPU {device} memory available: {available/1024**3:.1f}/{total/1024**3:.1f} GB')
    logging.info(f'RAM memory available: {psutil.virtual_memory().available/1024**3:.1f}/{psutil.virtual_memory().total/1024**3:.1f} GB')

def empty_gpu_vram():
    logging.info('Emptying GPU VRAM...')
    global tokenizer
    variables_to_delete = ['tokenizer']
    for variable_name in variables_to_delete:
        if variable_name in globals():
            del globals()[variable_name]
    gc.collect()
    gc.collect()
    torch.cuda.empty_cache()
    log_gpu_memory()

log_gpu_memory()

2024-06-18 10:13:40,197 - INFO - GPU 0 memory available: 23.4/23.7 GB
2024-06-18 10:13:40,277 - INFO - GPU 1 memory available: 23.2/23.7 GB
2024-06-18 10:13:40,278 - INFO - RAM memory available: 57.6/62.5 GB


In [12]:
def run_vllm_server(device=0, port=8000):
    # https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#command-line-arguments-for-the-server
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = str(device)
    if CFG.vllm_verbose:
        stdout_kwargs = dict()
    else:
        # this discards all the output of the server
        stdout_kwargs = dict(stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    process = subprocess.Popen(['python', '-m', 'vllm.entrypoints.openai.api_server',
                    '--model', CFG.model_path,
                    '--port', str(port),
                    '--seed', str(CFG.random_seed or random.randint(0, 1000000)),
                    '--max-model-len', str(CFG.context_window_size),
                    ] + CFG.vllm_args,
                    env=env,
                    **stdout_kwargs)
    while not stop_event.is_set():
        time.sleep(1)
    # logging.info(f"Stopping VLLM server on device {device}...")
    process.kill()

def is_server_running(server_url):
    try:
        response = requests.get(server_url)
        return response.status_code == 200
    except requests.exceptions.ConnectionError:
        return False

def wait_for_server(server_url):
    while not is_server_running(server_url):
        time.sleep(1)
    logging.info(f"Server is running at {server_url}!")

def get_device_port(device):
    return 8000 + device

def create_model_and_inference_artifacts():
    global stop_event
    if 'stop_event' in globals():
        return
    stop_event = threading.Event()
    for device in CFG.cuda_visible_devices:
        port = get_device_port(device)
        server_thread = threading.Thread(target=run_vllm_server, kwargs=dict(device=device, port=port))
        logging.info(f'Starting VLLM server on device {device}...')
        server_thread.start()
        wait_for_server(f"http://localhost:{port}/v1/models")
    log_gpu_memory()

def stop_vllm_servers():
    global stop_event
    if 'stop_event' in globals():
        stop_event.set()
        logging.info('Stopping VLLM servers...')
        time.sleep(5)
        empty_gpu_vram()
        del stop_event

In [13]:
def is_gpu_memory_available(required_memory=14):
    for device in range(torch.cuda.device_count()):
        available_memory = torch.cuda.mem_get_info(device)[0]/1024**3
        logging.info(f'Available memory on GPU {device} is {available_memory:.1f} GB')
        if available_memory < required_memory:
            return False
    return True

def wait_for_gpu_memory(wait_time=60, required_memory=14):
    while not is_gpu_memory_available(required_memory):
        logging.info(f'Waiting for GPU memory to be available...')
        time.sleep(wait_time)
    logging.info(f'GPU memory is available. Let\'s go training!')
    time.sleep(1) # wait a bit more to ensure the server is ready

is_gpu_memory_available()

2024-06-18 10:13:40,321 - INFO - Available memory on GPU 0 is 23.4 GB
2024-06-18 10:13:40,322 - INFO - Available memory on GPU 1 is 23.2 GB


True

### Prompts

#### Define problems

In [14]:
prompts_df = pd.read_csv(CFG.few_shot_dataset)
prompts_df.head()

Unnamed: 0,problem,solution,answer,input_tokens,output_tokens,total_tokens,id
0,"Let $k, l > 0$ be parameters. The parabola $y ...",```python\nfrom sympy import *\n# Symbolic Cal...,52,78,327,405,0
1,Each of the three-digits numbers $111$ to $999...,```python\n# Greedy Algorithm\ndef is_yellow(x...,250,60,168,228,1
2,Let the `sparkle` operation on positive intege...,"1. Sum the digits\n2. Compute the factorial, f...",702,125,209,334,2
3,What is the minimum value of $5x^2+5y^2-8xy$ w...,```python\nimport numpy as np\nfrom scipy.opti...,800,55,194,249,3
4,There exists a unique increasing geometric seq...,```python\ndef find_solution():\n for a1 in...,211,21,182,203,4


In [15]:
logging.info(f'The number of problems for few-shot prompting is {len(prompts_df)} previous to filtering')
logging.info(f'Filtering problems longer than {CFG.max_sample_tokens} tokens and outside levels {CFG.difficulty_levels}')
prompts_df = prompts_df[prompts_df.total_tokens < CFG.max_sample_tokens]
if CFG.difficulty_levels is not None:
    prompts_df = prompts_df[prompts_df.level.isin([f'Level {i}' for i in CFG.difficulty_levels])]
prompts_df.reset_index(drop=True, inplace=True)
logging.info(f'The number of problems for few-shot prompting is {len(prompts_df)} after filtering')

2024-06-18 10:13:40,371 - INFO - The number of problems for few-shot prompting is 10 previous to filtering
2024-06-18 10:13:40,372 - INFO - Filtering problems longer than 300 tokens and outside levels None
2024-06-18 10:13:40,375 - INFO - The number of problems for few-shot prompting is 7 after filtering


#### Create few shot prompts

In [16]:
# https://github.com/deepseek-ai/DeepSeek-Math/tree/main
few_shot_prompt_templates = [
"""
User: QUESTION_PLACEHOLDER
Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}. The answer is a non negative integer.

Assistant: Sure, we can solve the problem by writing a Python script.

ANSWER_PLACEHOLDER
""",
"""
User: QUESTION_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.

Assistant: Sure, we can solve the problem by writing a Python program.

ANSWER_PLACEHOLDER
""",
"""
Problem:

QUESTION_PLACEHOLDER

You are an expert mathematical programmer. Solve the above mathematical problem by writing a Python program.
Express your answer as a numeric type or a SymPy object. The answer must be an integer greater or equal to zero.
Please reason step by step, and always end with "The answer is $\\boxed{}$".

ANSWER_PLACEHOLDER
"""
]


def create_random_few_shot_prompt(n=CFG.few_shot_samples, template_idx=0):
    prompt_template = few_shot_prompt_templates[template_idx % len(few_shot_prompt_templates)]
    prompt = ''
    problem_indices = np.random.choice(np.arange(len(prompts_df)), n, replace=False)
    for problem_idx in problem_indices:
        row = prompts_df.loc[problem_idx]
        prompt += prompt_template.replace('QUESTION_PLACEHOLDER', row['problem']).replace('ANSWER_PLACEHOLDER', row['solution'])
        #prompt += f'\nThe final answer is $\\boxed{{{row["answer"]}}}$\n'
    prompt += prompt_template.replace('QUESTION_PLACEHOLDER', 'PROBLEM_PLACEHOLDER').replace('ANSWER_PLACEHOLDER', '```python')
    return prompt.strip()

def create_random_few_shot_prompt_with_token_limit(token_limit=CFG.max_prompt_tokens, template_idx=0):
    while 1:
        prompt = create_random_few_shot_prompt(template_idx=template_idx)
        if len(tokenizer.tokenize(prompt)) < token_limit:
            return prompt


prompt = create_random_few_shot_prompt_with_token_limit()
print(f'Number of tokens in prompt: {len(tokenizer.tokenize(prompt))}')

Number of tokens in prompt: 320


In [17]:
%%time
print('Create some random prompts to see token length distribution')
sorted([len(tokenizer.tokenize(create_random_few_shot_prompt())) for _ in range(10)])

Create some random prompts to see token length distribution
CPU times: user 21.2 ms, sys: 161 µs, total: 21.3 ms
Wall time: 19.3 ms


[279, 288, 345, 345, 345, 345, 351, 366, 399, 399]

#### Simple prompts

In [18]:
prompt_options = [
"""Below is a math problem you are to solve (non negative integer answer):

\"PROBLEM_PLACEHOLDER\"

To accomplish this, first determine a sympy-based approach for solving the problem by listing each step to take and what functions need to be called in each step. Be clear so even an idiot can follow your instructions, and remember, your final answer should be a non negative integer, not an algebraic expression!
Write the entire script covering all the steps (use comments and document it well) and print the result. After solving the problem, output the final numerical answer within \\boxed{}.

Approach:

```python""",
"""Below is a math problem you are to solve (non negative integer answer):

\"PROBLEM_PLACEHOLDER\"

Analyze this problem and think step by step to come to a solution with programs. After solving the problem, output the final numerical answer within \\boxed{}.

```python""",
"""User: PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.

Assistant: Sure, we can solve the problem by writing a Python script.

```python""",
"""User: PROBLEM_PLACEHOLDER
Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}. The answer is a non negative integer.

Assistant: Sure, we can solve the problem by writing a Python program.

```python""",
"""You are an expert mathematical programmer. Solve the mathematical problem below by writing a Python program.

- Express your answer as a numeric type or a sympy object. The answer must be an integer greater or equal to zero.
- Please reason step by step, and write clean and readable code.
- You can use python libraries such as sympy, math or numpy to solve the problem.
- Finally always end with "The answer is $\\boxed{}$".

PROBLEM_PLACEHOLDER

```python""",
"""
User: PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.
Use all the available information in the problem description, and be very careful with the assumptions and simplifications you make.
You might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.
Use code even for the simpler calculations to avoid mistakes.

Assistant: Sure, we can solve the problem by writing a Python program.

```python""",
]

print(len(prompt_options))
sorted([len(tokenizer.tokenize(prompt)) for prompt in prompt_options])

6


[55, 62, 64, 110, 119, 136]

In [19]:
del tokenizer

#### Prompt examples

In [20]:
few_shot_prompt_templates = [] # TODO: remove this line


def get_formatted_prompt(problem, repetition_idx):
    prompt_idx = (repetition_idx + len(problem)) % (len(prompt_options) + len(few_shot_prompt_templates))
    if prompt_idx < len(prompt_options):
        prompt = prompt_options[prompt_idx]
    else:
        prompt = create_random_few_shot_prompt(template_idx=repetition_idx)
    prompt = prompt.replace('PROBLEM_PLACEHOLDER', problem)
    return prompt

In [21]:
for problem_idx in range(len(prompt_options) + len(few_shot_prompt_templates)):
    display(Markdown(get_formatted_prompt('What is $1 + 10$?', problem_idx)))
    display(Markdown('---'))


User: What is $1 + 10$?
Please reason step by step, and put your final answer within \boxed{}. The answer is a non negative integer.
Use all the available information in the problem description, and be very careful with the assumptions and simplifications you make.
You might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.
Use code even for the simpler calculations to avoid mistakes.

Assistant: Sure, we can solve the problem by writing a Python program.

```python

---

Below is a math problem you are to solve (non negative integer answer):

"What is $1 + 10$?"

To accomplish this, first determine a sympy-based approach for solving the problem by listing each step to take and what functions need to be called in each step. Be clear so even an idiot can follow your instructions, and remember, your final answer should be a non negative integer, not an algebraic expression!
Write the entire script covering all the steps (use comments and document it well) and print the result. After solving the problem, output the final numerical answer within \boxed{}.

Approach:

```python

---

Below is a math problem you are to solve (non negative integer answer):

"What is $1 + 10$?"

Analyze this problem and think step by step to come to a solution with programs. After solving the problem, output the final numerical answer within \boxed{}.

```python

---

User: What is $1 + 10$?
Please reason step by step, and put your final answer within \boxed{}. The answer is a non negative integer.

Assistant: Sure, we can solve the problem by writing a Python script.

```python

---

User: What is $1 + 10$?
Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \boxed{}. The answer is a non negative integer.

Assistant: Sure, we can solve the problem by writing a Python program.

```python

---

You are an expert mathematical programmer. Solve the mathematical problem below by writing a Python program.

- Express your answer as a numeric type or a sympy object. The answer must be an integer greater or equal to zero.
- Please reason step by step, and write clean and readable code.
- You can use python libraries such as sympy, math or numpy to solve the problem.
- Finally always end with "The answer is $\boxed{}$".

What is $1 + 10$?

```python

---
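
The prompts above appear rotated relative to the order of `prompt_options` because `get_formatted_prompt` offsets the repetition index by the length of the problem text. The sketch below reproduces only that index arithmetic; the helper name `select_prompt_index` is hypothetical.

```python
def select_prompt_index(problem, repetition_idx, n_prompts):
    """Same modular arithmetic as get_formatted_prompt (illustrative helper)."""
    return (repetition_idx + len(problem)) % n_prompts

problem = 'What is $1 + 10$?'  # 17 characters long
print([select_prompt_index(problem, idx, n_prompts=6) for idx in range(6)])
# [5, 0, 1, 2, 3, 4]
```

Since 17 % 6 = 5, the first repetition uses the last entry of `prompt_options`, which matches the order of the rendered prompts. Different problems therefore start the rotation at different prompts.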

### Utils

In [22]:
def clear_memory():
    for _ in range(2):
        torch.cuda.empty_cache()
        gc.collect()
        time.sleep(0.01)

In [23]:
def is_ending_time(max_time=CFG.time_limit):
    # Use a distinct local name to avoid shadowing the function itself
    time_limit_reached = get_time_spent() > max_time
    if time_limit_reached:
        logging.warning('Reached the time limit, inference will be skipped.')
    return time_limit_reached

def get_time_spent():
    return time.time() - NOTEBOOK_START_TIME

assert not is_ending_time(10000)
assert is_ending_time(0)



In [24]:
def is_quick_save_condition(idx, test):
    if CFG.quick_save and idx == 0 and CFG.submission_mode:
        if test['id'].values[0] == '000aaa':
            if test['problem'].values[0] == 'What is $1-1$?':
                logging.info('Quick save condition reached. Skipping inference')
                return True
    return False

In [25]:
def get_timestamp():
    return datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

print(get_timestamp())

2024-06-18_10:13:40


In [26]:
N_REPETITIONS = CFG.n_repetitions
PROBLEM_REPETITIONS = []

def adjust_repetitions_to_meet_ending_time(answered_problems,
                                           max_time=CFG.time_limit,
                                           min_problem_threshold=20,
                                           hysteresis=0.975):
    global N_REPETITIONS, PROBLEM_REPETITIONS
    PROBLEM_REPETITIONS.append(N_REPETITIONS)
    if answered_problems < min_problem_threshold:
        return
    spent_time = get_time_spent()
    mean_problem_time = spent_time/sum(PROBLEM_REPETITIONS)
    estimated_ending_time = (N_PROBLEMS - answered_problems)*mean_problem_time*N_REPETITIONS + spent_time
    if estimated_ending_time > max_time and N_REPETITIONS > 1:
        N_REPETITIONS -= 1
        logging.warning(f'Decreasing the number of repetitions to {N_REPETITIONS} to try to meet ending time')
        logging.info(f'Mean problem time: {mean_problem_time:.1f} seconds, estimated ending time {estimated_ending_time/3600:.1f} hours')
    elif estimated_ending_time < max_time*hysteresis and N_REPETITIONS < CFG.n_repetitions:
        N_REPETITIONS += 1
        logging.warning(f'Increasing the number of repetitions to {N_REPETITIONS} because it seems to be enough time to meet the ending time')
        logging.info(f'Mean problem time: {mean_problem_time:.1f} seconds, estimated ending time {estimated_ending_time/3600:.1f} hours')
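
To make the estimate in `adjust_repetitions_to_meet_ending_time` concrete, here is a worked example of the same arithmetic as a pure function. Note that `mean_problem_time` in the cell above is effectively the mean time per repetition, because the elapsed time is divided by the total number of repetitions run so far. The function name and the input numbers are illustrative assumptions.

```python
def estimate_ending_time(spent_time, repetitions_so_far, remaining_problems, n_repetitions):
    """Project the total run time in seconds (illustrative helper)."""
    mean_repetition_time = spent_time / repetitions_so_far
    return remaining_problems * mean_repetition_time * n_repetitions + spent_time

# Hypothetical run: 20 problems answered with 100 repetitions each in 14,000 s,
# and 30 problems still pending
estimate = estimate_ending_time(14_000, 20 * 100, 30, 100)
print(f'{estimate / 3600:.2f} hours')  # 9.72 hours
```

In this example the projection exceeds the 8.75 hour limit (31500 s), so the function above would lower `N_REPETITIONS` by one and re-evaluate after the next problem.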

### Response parsing

In [27]:
def text_to_int_answer(text):
    try:
        answer = float(text)
        if answer < 0 or not answer.is_integer():
            return None
        return int(answer)
    except (ValueError, OverflowError):
        return None

assert 5 == text_to_int_answer('5')
assert 5 == text_to_int_answer('5.0')
assert text_to_int_answer('-1') is None
assert text_to_int_answer('0.5') is None
assert text_to_int_answer('pi') is None

In [28]:
def parse_boxed_answer(text):
    matches = re.findall(r'\\boxed\{(\d+)\}', text)
    if matches:
        return text_to_int_answer(matches[-1])
    return None

test_text = """

blah blah \\boxed{5} 7
"""
assert parse_boxed_answer(test_text) == 5

test_text = """

blah blah {5} 7
"""
assert parse_boxed_answer(test_text) is None

In [29]:
def parse_response_in_text(text):
    response = parse_boxed_answer(text)
    if response is not None:
        return response
    return parse_last_answer(text)

def parse_last_answer(text):
    pattern = r'(?:the answer is|the final answer is)\s*:?\s*\$?(\d+(\.\d+)?)\$?'
    matches = re.findall(pattern, text, re.IGNORECASE)
    if matches:
        return text_to_int_answer(matches[-1][0])
    return None

test_cases = [
    ('The answer is: $651$', 651),
    ('The answer is: $5$.', 5),
    ('The answer is: 6.', 6),
    ('The final answer is 0.', 0),
    ('The final answer is 126.', 126),
    ('The final answer is: $2$.', 2),
    ('The answer is $\\boxed{3}$', 3),
    ('The answer is $\\boxed{-1}$', None),
    ('The answer is $\\boxed{1.5}$', None),
    ('The answer is: $-1$.', None),
    ('The answer is: $4.5$.', None),
    ('The final answer is 0.6', None),
]
for text, answer in test_cases:
    assert parse_response_in_text(text) == answer
    assert parse_response_in_text(text.lower()) == answer

In [30]:
def parse_response_in_code(code_output):
    if code_output is None:
        return None
    try:
        code_output = code_output.strip()
        if code_output.startswith('[') and code_output.endswith(']'):
            return text_to_int_answer(code_output[1:-1])
        return text_to_int_answer(code_output)
    except Exception as e:
        print(f'Exception when trying to get a response from code: {e}')
        return None
    
assert parse_response_in_code('0') == 0
assert parse_response_in_code('[0]') == 0
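
The parsers above yield up to two candidate answers per sample, one from the code output and one from the text. How the notebook combines them is not shown in this section; a minimal sketch, assuming `CFG.result_priority` is read as an ordered preference list and using a hypothetical helper name, could look like this:

```python
def select_answer(parsed_answers, result_priority=('code_answer', 'text_answer')):
    """Return the first non-None answer following the priority order.

    Assumption: `parsed_answers` is a dict keyed by answer type.
    """
    for answer_type in result_priority:
        answer = parsed_answers.get(answer_type)
        if answer is not None:
            return answer
    return None

print(select_answer({'code_answer': None, 'text_answer': 12}))  # 12
print(select_answer({'code_answer': 7, 'text_answer': 12}))     # 7
```

Preferring the code answer fits the motivation stated in the introduction: an executed result is less exposed to arithmetic hallucinations than a number written in free text.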

### Code interpreter

In [31]:
def code_interpreter(code):
    code = preprocess_code(code)
    output, run_success = execute_code(code)
    return output, run_success

def preprocess_code(code):
    code = ensure_symbols_are_real(code)
    code = add_simplify_to_print(code)
    code = f'from sympy import *\n{code}'
    return code

def add_simplify_to_print(code):
    # \b keeps names that merely end in "print" (e.g. pprint) from being rewritten
    code = re.sub(r'\bprint\(', 'simplify_print(', code)
    new_code = """
def simplify_print(x):
    print(recursive_simplify(x))
        
def recursive_simplify(x):
    if isinstance(x, list):
        return [recursive_simplify(y) for y in x]
    return simplify(x)
"""
    code = new_code + '\n' + code
    return code

def ensure_symbols_are_real(code):
    def replace_symbols_call(match):
        matched_text = match.group()
        if "real" not in matched_text:
            return f"{matched_text[:-1]}, real=True)"
        else:
            return matched_text
    code = re.sub(r"symbols\([^)]+\)", replace_symbols_call, code)
    return code

assert ensure_symbols_are_real("x, y, z = symbols('x y z')") == "x, y, z = symbols('x y z', real=True)"
assert ensure_symbols_are_real("x, y, z = symbols('x y z', real=True)") == "x, y, z = symbols('x y z', real=True)"

def execute_code(code, timeout_limit=7):
    with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
        temp_file.write(code)
        temp_filepath = temp_file.name
    cmd = f'timeout {timeout_limit} {sys.executable} {temp_filepath}'
    ret = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    os.remove(temp_filepath)
    if ret.returncode == 0:
        return truncate_text(get_last_line(ret.stdout)), True
    elif ret.returncode == 124:
        return f'The execution of the code timed out. The code needs to run in less than {timeout_limit} seconds.', False
    else:
        #output = remove_references_to_temp_code_file(ret.stderr, temp_filepath)
        output = truncate_text(get_last_line(ret.stderr))
        return output, False
    
def remove_references_to_temp_code_file(output, filepath):
    return output.replace(f'File "{filepath}", ', '')

def get_last_line(text):
    lines = text.strip().splitlines()
    if lines:
        return lines[-1]
    return text.strip()

def truncate_text(text, max_length=CFG.code_output_truncate_length):
    """Sometimes code output can be very long"""
    if len(text) > max_length:
        return text[:max_length] + '...'
    return text

test_code = """
print('Hello')
"""
print(code_interpreter('print(0)'))
print(code_interpreter('foo'))

('0', True)
("NameError: name 'foo' is not defined", False)
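
`execute_code` relies on the shell `timeout` command and its exit code 124, which is only available on Unix-like systems. As a hedged alternative, not the notebook's method, the sketch below gets the same "last line plus success flag" result from the `timeout=` argument of `subprocess.run`. The name `run_snippet` and the wording of the timeout message are assumptions made for illustration.

```python
import os
import subprocess
import sys
import tempfile

def run_snippet(code, timeout_limit=7):
    """Run `code` in a fresh interpreter; return (last output line, success flag)."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as temp_file:
        temp_file.write(code)
        temp_filepath = temp_file.name
    try:
        # A list of arguments avoids the shell; timeout= raises TimeoutExpired.
        ret = subprocess.run([sys.executable, temp_filepath],
                             capture_output=True, text=True, timeout=timeout_limit)
    except subprocess.TimeoutExpired:
        return f'Timed out after {timeout_limit} seconds.', False
    finally:
        os.remove(temp_filepath)
    # Report stdout on success and stderr on failure, keeping only the last line.
    stream = ret.stdout if ret.returncode == 0 else ret.stderr
    lines = stream.strip().splitlines()
    return (lines[-1] if lines else ''), ret.returncode == 0

print(run_snippet('print(1 + 1)'))
print(run_snippet('foo'))
```

Keeping only the last line mirrors `get_last_line`: for a traceback that line is the exception message, which is usually the most useful part to feed back to the model.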


In [32]:
test_code = """
from sympy import symbols, Eq, solve

def solve_equation():
    x = symbols('x')
    equation = Eq(4 + x, 4)
    solution = solve(equation, x)

    return solution

result = solve_equation()
print(result)
"""
print(code_interpreter(test_code))

test_code = """
from sympy import symbols, Eq, solve

def solve_equation():
    x = symbols('x', real=True)
    equation = Eq(4 + x, 4)
    solution = solve(equation, x)

    return solution

result = solve_equation()
print(result)
"""
print(code_interpreter(test_code))

('[0]', True)
('[0]', True)


In [33]:
def parse_last_python_code_block(text):
    return text.split('```python')[-1].split("```")[0]

test_text = """
```python
hello
``````output
"""
assert parse_last_python_code_block(test_text) == '\nhello\n'

test_text = """
```python
hello
```
"""
assert parse_last_python_code_block(test_text) == '\nhello\n'

In [34]:
def add_code_output_to_prompt(decoded_output, code_output):
    if decoded_output.endswith(")\n```"):
        prompt = decoded_output+'```output\n'+str(code_output)+'\n```\n'
    else:
        prompt = decoded_output+'\n'+str(code_output)+'\n```\n'
    return prompt

In [35]:
class CodeRunner():
    """
    Abstraction to run code that:

    - Accumulates the code if the runs are successful
    - Measures number of coding errors
    """
    def __init__(self, max_coding_errors=2):
        self.accumulated_code = ''
        self.n_coding_errors = 0
        self.successful_code_output = None
        self.max_coding_errors = max_coding_errors
        self.code_interpreter_calls = 0

    def run_code(self, code):
        self.code_interpreter_calls += 1
        new_code = self.accumulated_code + "\n" + code
        code_output, run_success = code_interpreter(new_code)
        if run_success:
            self.accumulated_code = new_code
            self.successful_code_output = code_output
        else:
            self.n_coding_errors += 1
            self.successful_code_output = None
        return code_output

    def max_coding_errors_reached(self):
        max_coding_errors_reached = self.n_coding_errors >= self.max_coding_errors
        if max_coding_errors_reached:
            # logging.warning(f'Stopping solution generation because {self.n_coding_errors} coding errors were done.')
            pass
        return max_coding_errors_reached
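
To illustrate the accumulation rule, here is a small self-contained sketch. `MiniCodeRunner` and `toy_interpreter` are hypothetical stand-ins (the toy interpreter runs code in-process with `exec` and reports the variable `result`, unlike the subprocess-based `code_interpreter` above). It shows that a later block can use names defined by earlier successful blocks, while a failing block is counted as an error and discarded.

```python
class MiniCodeRunner:
    """Simplified stand-in for CodeRunner with an injectable interpreter."""
    def __init__(self, interpreter, max_coding_errors=2):
        self.interpreter = interpreter
        self.accumulated_code = ''
        self.n_coding_errors = 0
        self.max_coding_errors = max_coding_errors

    def run_code(self, code):
        candidate = self.accumulated_code + '\n' + code
        output, success = self.interpreter(candidate)
        if success:
            self.accumulated_code = candidate  # keep only code that ran
        else:
            self.n_coding_errors += 1          # failed block is dropped
        return output

def toy_interpreter(code):
    """Hypothetical interpreter: exec the code and report the value of `result`."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as e:
        return f'{type(e).__name__}: {e}', False
    return str(namespace.get('result')), True

runner = MiniCodeRunner(toy_interpreter)
runner.run_code('x = 2')
print(runner.run_code('result = x * 21'))  # 42, x comes from the first block
print(runner.run_code('result = y'))       # a NameError message, block discarded
print(runner.run_code('result = x + 1'))   # 3, earlier successful code still works
print(runner.n_coding_errors)              # 1
```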

### Results

In [36]:
class InferenceResult(BaseModel):
    # text
    prompt: str
    response: Optional[str] = None
    # answers
    boxed_answer: Optional[int] = None
    text_answer: Optional[int] = None
    code_answer: Optional[int] = None
    # output
    output_tokens: int = 0
    reached_max_tokens: bool = False
    # code
    coding_errors: int = 0
    code_interpreter_calls: int = 0

In [37]:
def is_difference_significative(n_first, n_second, n_tries, confidence_level=CFG.confidence_level):
    if n_second == 0:
        if n_first == n_tries:
            return is_difference_significative(n_first, 1, n_tries + 1, confidence_level)
        elif n_first < n_tries:
            return is_difference_significative(n_first, 1, n_tries, confidence_level)
        else:
            raise ValueError(f'n_first ({n_first}) cannot be greater than n_tries ({n_tries})')
    p_first = n_first/n_tries
    p_second = n_second/n_tries
    uncertainty = (p_first*(1-p_first)/n_tries + p_second*(1-p_second)/n_tries)**0.5
    z = (p_first - p_second)/uncertainty
    logging.info(f'p_first: {p_first*100:.1f}% p_second: {p_second*100:.1f}% Confidence level for the difference: {2*(norm.cdf(z) - 0.5)*100:.1f}%')
    return z > norm.interval(confidence_level)[1]

is_difference_significative(3, 0, 3)

2024-06-18 10:13:42,120 - INFO - p_first: 75.0% p_second: 25.0% Confidence level for the difference: 89.8%


False
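
The logged 75% / 25% comes from the `n_second == 0` branch: a unanimous 3 out of 3 is evaluated as if one extra run had voted for a different answer, that is 3 versus 1 out of 4. The sketch below reproduces that arithmetic with the standard library only. The function names are illustrative, and the 0.95 confidence level is an assumption, since `CFG.confidence_level` is not shown in this section.

```python
from statistics import NormalDist

def z_score(n_first, n_second, n_tries):
    """z statistic for the gap between the two most voted answers."""
    p_first = n_first / n_tries
    p_second = n_second / n_tries
    uncertainty = (p_first * (1 - p_first) / n_tries
                   + p_second * (1 - p_second) / n_tries) ** 0.5
    return (p_first - p_second) / uncertainty

def threshold(confidence_level):
    """Upper end of the central normal interval, like norm.interval(level)[1]."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)

z = z_score(3, 1, 4)            # unanimous 3/3 treated as 3 vs 1 out of 4
limit = threshold(0.95)         # assumed confidence level
print(f'z = {z:.3f}, threshold = {limit:.3f}, significant: {z > limit}')
```

With these numbers z is about 1.63, below the threshold of about 1.96, so three unanimous answers would not be enough to stop sampling early at that level.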

In [38]:
def log_ground_truth(idx):
    if isinstance(env, MockEnvWithDataframe) and 'ground_truth' in df.columns:
        logging.info(f'Ground truth: {df["ground_truth"].loc[idx]}')

class Results():
    def __init__(self):
        self.results = dict()

    def initialize(self, idx):
        self.results[idx] = []

    def add_result(self, idx, result: InferenceResult):
        self.results[idx].append(result)
    
    def log_results_distribution(self, idx):
        log_ground_truth(idx)
        keys = ['boxed_answer', 'text_answer', 'code_answer']
        for key in keys:
            values = self.get_result_distribution(idx, key)
            logging.info(f'{key} distribution: {values}')

    def get_valid_results(self, idx, result_priority):
        results = []
        for result in self.results[idx]:
            result = result.dict()
            for key in result_priority:
                if result[key] is not None:
                    results.append(result[key])
                    break
        if results:
            return results
        raise NoValidResults(idx)

    def get_most_frequent_result(self, idx, result_priority=CFG.result_priority):
        valid_results = self.get_valid_results(idx, result_priority)
        counter_ret = Counter(valid_results).most_common()
        logging.info(f'Result counts for {idx}: {counter_ret}')
        result, count = get_minimum_most_frequent_value(counter_ret)
        return result, count

    def is_best_solution_found(self, idx, result_priority=CFG.result_priority):
        try:
            valid_results = self.get_valid_results(idx, result_priority)
            counter_ret = Counter(valid_results).most_common()
            logging.info(f'Result counts for {idx}: {counter_ret}')
            if len(counter_ret) == 1:
                return is_difference_significative(counter_ret[0][1], 0, len(valid_results))
            else:
                return is_difference_significative(counter_ret[0][1], counter_ret[1][1], len(valid_results))
        except NoValidResults:
            return False

    def get_result_distribution(self, idx, key):
        results = self.results[idx]
        distribution = np.array([result.dict()[key] for result in results])
        return distribution
    
    def save(self, filepath='results.json'):
        logging.info(f'Saving results in {os.path.realpath(filepath)}')
        results = {idx: [result.dict() for result in results] for idx, results in self.results.items()}
        with open(filepath, 'w') as f:
            json.dump(results, f, indent=4)

    def load(self, filepath):
        logging.info(f'Loading results from {filepath}')
        with open(filepath, 'r') as f:
            results = json.load(f)
        self.results = {int(idx): [InferenceResult(**result) for result in results] for idx, results in results.items()}
            
    def __repr__(self):
        return str(self.results)
    
def get_minimum_most_frequent_value(counter_ret):
    max_count = counter_ret[0][1]
    candidates = []
    for value, count in counter_ret:
        if count == max_count:
            candidates.append(value)
        else:
            break
    return min(candidates), max_count

class NoValidResults(Exception):
    pass

assert get_minimum_most_frequent_value([(2, 1), (3, 1)]) == (2, 1)
assert get_minimum_most_frequent_value([(3, 1), (2, 1)]) == (2, 1)
assert get_minimum_most_frequent_value([(3, 2), (2, 1)]) == (3, 2)
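
Put together, the aggregation is a majority vote in which ties are broken by taking the smallest answer. A compact sketch of the whole step on a raw list of answers (the name `majority_vote` is illustrative, not part of the notebook):

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning answer, votes); ties go to the smallest answer."""
    counts = Counter(answers).most_common()
    max_count = counts[0][1]
    tied = [value for value, count in counts if count == max_count]
    return min(tied), max_count

print(majority_vote([5, 4, 5, 4, 7]))  # (4, 2): 4 and 5 tie, the smaller wins
print(majority_vote([5, 5, 4]))        # (5, 2)
```

Breaking ties deterministically keeps repeated evaluations of the same saved results reproducible, which would not be guaranteed if the order returned by `most_common()` decided the winner.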

### Inference

In [39]:
def solve_problem_with_code_interpreter(prompt, port=8000):
    text_generator = TextGenerator(cfg=CFG, port=port)
    clear_memory()
    code_runner = CodeRunner(CFG.max_coding_errors)
    decoded_output = prompt
    stop_word_cond = True
    generation_mode = 'text'
    while stop_word_cond and text_generator.are_generation_tokens_available():
        if decoded_output.endswith("Problem:") or decoded_output.endswith("User:"):
            break
        is_code_block_finished = not decoded_output.endswith("```python") and generation_mode == 'code'
        if is_code_block_finished:
            code_text = parse_last_python_code_block(decoded_output)
            code_output = code_runner.run_code(code_text)
            if code_runner.max_coding_errors_reached():
                break
            decoded_output = add_code_output_to_prompt(decoded_output, code_output)

        if decoded_output.endswith("```python"):
            decoded_output += '\n'
            generation_mode = 'code'
        else:
            generation_mode = 'text'

        decoded_output = text_generator(decoded_output, mode=generation_mode)
        stop_word_cond = any(decoded_output.endswith(stop_word) for stop_word in CFG.stop_words)

    if prompt.endswith("```python"):
        decoded_output = decoded_output.replace(prompt, '```python')
        prompt = prompt[:-len("```python")]
    else:
        decoded_output = decoded_output.replace(prompt, '')
    result = InferenceResult(
        prompt=prompt,
        response=decoded_output,
        output_tokens=text_generator.generated_tokens,
        coding_errors=code_runner.n_coding_errors,
        code_interpreter_calls=code_runner.code_interpreter_calls
    )
    if not text_generator.are_generation_tokens_available():
        # Solution was not achieved, so it does not make sense to parse responses
        # logging.warning(f'Max number of new tokens {CFG.max_new_tokens} was reached. Solution not found.')
        result.reached_max_tokens = True
    else:
        # logging.info(f'Total generated tokens: {text_generator.generated_tokens}')
        if not code_runner.max_coding_errors_reached():
            result.boxed_answer = parse_boxed_answer(decoded_output)
            result.text_answer = parse_response_in_text(decoded_output)
            result.code_answer = parse_response_in_code(code_runner.successful_code_output)
    return result

### Show

In [40]:
def display_decoded_output(idx, text):
    display(Markdown('---'))
    display(Markdown(f'### Problem {idx}'))
    display(Markdown(text.replace('Assistant: ', 'Assistant: \n')))
    display(Markdown('---'))

### Results analysis

In [41]:
def show_inference_insights(results):
    keys = ['coding_errors', 'output_tokens', 'code_interpreter_calls']
    answers = ['boxed_answer', 'text_answer', 'code_answer']
    rows = []
    for idx in results.results:
        logging.info(f'Logging inference insights for problem {idx}')
        row = dict(n_runs=len(results.get_result_distribution(idx, keys[0])))
        for key in keys:
            values = results.get_result_distribution(idx, key)
            logging.info(f'{key} distribution: {values}')
            row[f'mean_{key}'] = round(np.mean(values), 1)
            row[f'median_{key}'] = round(np.median(values), 1)
        values = results.get_result_distribution(idx, 'reached_max_tokens')
        logging.info(f'reached_max_tokens distribution: {values}')
        row['unfinished_responses'] = np.sum(values)
        values = results.get_result_distribution(idx, 'coding_errors')
        row['max_coding_errors_reached'] = np.sum(values >= CFG.max_coding_errors)
        
        for answer in answers:
            values = results.get_result_distribution(idx, answer)
            logging.info(f'{answer} distribution: {values}')
            row[f'{answer}s'] = np.sum(values != None)
        rows.append(row)
        logging.info('')
    insights = pd.DataFrame(rows)
    summary = insights.sum()
    for column in insights.columns:
        if 'mean' in column or 'median' in column:
            summary[column] = round(summary[column] / len(insights), 1)
    insights.loc['all'] = summary
    for column in insights.columns[-5:]:
        insights[column] = (insights[column]/insights['n_runs']*100).round(1)
    return insights

In [42]:
def get_accuracy_report(results, result_priority):
    report = df[['answer', 'ground_truth']].copy()
    report['answer'] = 0
    report['n_runs'] = 0
    report['correct_counts'] = 0
    report['highest_wrong_counts'] = 0
    report['wrong_counts'] = 0
    report['highest_correct_tokens'] = None

    for idx in results.results:
        try:
            report.loc[idx, 'n_runs'] = get_n_runs(idx, results)
            values = np.array(results.get_valid_results(idx, result_priority))
            counter_ret = Counter(values).most_common()
            report.loc[idx, 'answer'] = get_minimum_most_frequent_value(counter_ret)[0]
            ground_truth = df.loc[idx, 'ground_truth']
            for pred, count in counter_ret:
                if pred == ground_truth:
                    report.loc[idx, 'correct_counts'] = count
                    break
            report.loc[idx, 'highest_wrong_counts'] = get_highest_wrong_count(counter_ret, ground_truth)
            report.loc[idx, 'wrong_counts'] = get_wrong_counts(counter_ret, ground_truth)
        except NoValidResults:
            report.loc[idx, 'answer'] = None
        if report.loc[idx, 'correct_counts'] > 0:
            report.loc[idx, 'highest_correct_tokens'] = get_highest_correct_tokens(idx, ground_truth, results)
    report['is_correct'] = (report['answer'] == report['ground_truth']).astype(int)
    report['pass'] = report['correct_counts'] > 0
    report.loc[report['answer'].isna(), 'is_correct'] = np.nan
    return add_summary_to_report(report)

def add_summary_to_report(report):
    summary = report.sum()
    for key in report.columns[:2]:
        summary[key] = '-'
    summary['highest_correct_tokens'] = report['highest_correct_tokens'].max()
    report.loc['summary'] = summary
    return report

def get_highest_wrong_count(counter_ret, ground_truth):
    for pred, count in counter_ret:
        if pred != ground_truth:
            return count
    return 0

def get_wrong_counts(counter_ret, ground_truth):
    wrong_counts = 0
    for pred, count in counter_ret:
        if pred != ground_truth:
            wrong_counts += count
    return wrong_counts


def get_n_runs(idx, results):
    return len(results.results[idx])

def get_highest_correct_tokens(idx, ground_truth, results):
    highest_correct_tokens = 0
    tokens = results.get_result_distribution(idx, 'output_tokens')
    for answer in CFG.result_priority:
        values = results.get_result_distribution(idx, answer)
        correct_answer_tokens = tokens[values == ground_truth]
        if len(correct_answer_tokens) > 0:
            max_tokens = max(correct_answer_tokens)
            highest_correct_tokens = max(highest_correct_tokens, max_tokens)
    return highest_correct_tokens

In [43]:
def analyze_MATH_results(result_priority):
    logging.info(f'Analyzing MATH results for {result_priority} priorities')
    accuracy_report = get_accuracy_report(results, result_priority)
    print_disaggregated_metrics(accuracy_report)
    accuracy_report = accuracy_report.loc[accuracy_report.index[:-1]]
    print_relevant_metrics(accuracy_report)
    for key in ['level', 'type']:
        accuracy_report[key] = df[key]
        plot_grouped_results(accuracy_report, key)


def print_relevant_metrics(accuracy_report):
    correct = accuracy_report['is_correct'].value_counts().get(1, 0)
    unanswered = accuracy_report['is_correct'].isna().sum()
    wrong = accuracy_report['is_correct'].value_counts().get(0, 0)
    total = correct + unanswered + wrong
    accuracy = correct/total
    print('\tAggregated metrics majority vote')
    print(f'Correct: {correct}/{total} ({accuracy:.2f} ± {estimate_uncertainty(accuracy, total):.2f})')
    print(f'Unanswered: {unanswered}/{total} ({unanswered/total:.2f} ± {estimate_uncertainty(unanswered/total, total):.2f})')
    print(f'Wrong: {wrong}/{total} ({wrong/total:.2f} ± {estimate_uncertainty(wrong/total, total):.2f})')
    print('\tAggregated metrics pass')
    correct = accuracy_report['pass'].sum()
    accuracy = correct/total
    print(f'Correct: {correct}/{total} ({accuracy:.2f} ± {estimate_uncertainty(accuracy, total):.2f})')


def estimate_uncertainty(proportion, n):
    return 1.96 * np.sqrt(proportion * (1 - proportion) / n)


def print_disaggregated_metrics(accuracy_report):
    correct = accuracy_report.loc['summary', 'correct_counts']
    wrong = accuracy_report.loc['summary', 'wrong_counts']
    total = accuracy_report.loc['summary', 'n_runs']
    unanswered = total - correct - wrong
    print('\tDisaggregated metrics')
    print(f'Correct: {correct}/{total} ({correct/total:.2f} ± {estimate_uncertainty(correct/total, total):.2f})')
    print(f'Unanswered: {unanswered}/{total} ({unanswered/total:.2f} ± {estimate_uncertainty(unanswered/total, total):.2f})')
    print(f'Wrong: {wrong}/{total} ({wrong/total:.2f} ± {estimate_uncertainty(wrong/total, total):.2f})')


def plot_grouped_results(df, group):
    categories = sorted(df[group].unique().tolist())
    correct = []
    unanswered = []
    wrong = []
    for category in categories:
        correct.append(df[df[group] == category].is_correct.value_counts().get(1, 0))
        unanswered.append(df[df[group] == category].is_correct.isna().sum())
        wrong.append(df[df[group] == category].is_correct.value_counts().get(0, 0))

    correct.append(np.sum(correct))
    unanswered.append(np.sum(unanswered))
    wrong.append(np.sum(wrong))
    categories.append('overall')

    total = np.array(correct) + np.array(unanswered) + np.array(wrong)
    correct = np.array(correct)/total
    unanswered = np.array(unanswered)/total
    wrong = np.array(wrong)/total
    plt.bar(categories, correct, label='Correct', color='tab:green')
    plt.bar(categories, unanswered, bottom=correct, label='Unanswered', color='tab:orange')
    plt.bar(categories, wrong, bottom=np.array(correct)+np.array(unanswered), label='Wrong', color='tab:red')
    for idx, value in enumerate(categories):
        plt.text(value, correct[idx]/2, f'{correct[idx]*100:.0f}%', ha='center', va='center')
        plt.text(value, correct[idx] + unanswered[idx]/2, f'{unanswered[idx]*100:.0f}%', ha='center', va='center')
        plt.text(value, correct[idx] + unanswered[idx] + wrong[idx]/2, f'{wrong[idx]*100:.0f}%', ha='center', va='center')
    #plt.legend(loc=0)
    plt.ylim(0, 1)
    plt.grid(axis='y')
    plt.title(f'Results grouped by {group}')
    plt.show()

### Inference with VLLM

In [44]:
if is_running_at_home():
    wait_for_gpu_memory()
    create_model_and_inference_artifacts()

2024-06-18 10:13:42,269 - INFO - Available memory on GPU 0 is 23.4 GB
2024-06-18 10:13:42,271 - INFO - Available memory on GPU 1 is 23.1 GB
2024-06-18 10:13:42,272 - INFO - GPU memory is available. Let's go training!
2024-06-18 10:13:43,273 - INFO - Starting VLLM server on device 0...
2024-06-18 10:14:09,390 - INFO - Server is running at http://localhost:8000/v1/models!
2024-06-18 10:14:09,391 - INFO - Starting VLLM server on device 1...
2024-06-18 10:14:25,459 - INFO - Server is running at http://localhost:8001/v1/models!
2024-06-18 10:14:25,460 - INFO - GPU 0 memory available: 4.2/23.7 GB
2024-06-18 10:14:25,461 - INFO - GPU 1 memory available: 4.0/23.7 GB
2024-06-18 10:14:25,463 - INFO - RAM memory available: 40.1/62.5 GB


In [45]:
def monitor_progress(submits):
    progress = 0
    with tqdm(total=len(submits), smoothing=0) as progress_bar:
        while 1:
            time.sleep(1)
            current_progress = np.sum([submit.done() for submit in submits])
            if current_progress > progress:
                progress_bar.update(current_progress - progress)
                progress = current_progress
            if progress == len(submits):
                break


def safe_solve_problem_with_code_interpreter(prompt, port=8000):
    try:
        return solve_problem_with_code_interpreter(prompt, port)
    except Exception as e:
        logging.error(f'Error when solving problem: {e}')
        return InferenceResult(prompt=prompt, coding_errors=-1, response=f'Error when solving problem: {e}')


def search_answer_for_problem(problem_idx):
    create_model_and_inference_artifacts()
    test = df.loc[problem_idx:problem_idx]
    results = Results()
    results.initialize(problem_idx)

    max_workers = CFG.max_workers or N_REPETITIONS
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        submits = []
        for repetition_idx in range(N_REPETITIONS):
            prompt = get_formatted_prompt(test['problem'].values[0], repetition_idx)
            device = CFG.cuda_visible_devices[repetition_idx % len(CFG.cuda_visible_devices)]
            port = get_device_port(device)
            submits.append(pool.submit(safe_solve_problem_with_code_interpreter, prompt=prompt, port=port))
        monitor_progress(submits)
        problem_results = [submit.result() for submit in submits]
    for result in problem_results:
        results.add_result(problem_idx, result)
        if CFG.verbose:
            display_decoded_output(problem_idx, test['problem'].values[0] + '\n\n' + result.response)
    # results.log_results_distribution(problem_idx)
    try:
        result, count = results.get_most_frequent_result(problem_idx, CFG.result_priority)
        answer = result % 1000
        logging.info(f'Predicted answer for problem {problem_idx} is: {answer} with {count} votes')
    except NoValidResults:
        logging.warning(f'No valid results for problem {problem_idx}. Using default answer: {CFG.default_answer}')
        answer = CFG.default_answer
    log_ground_truth(problem_idx)
    return answer
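
The scheduling pattern used here is: submit one task per repetition, assign servers round-robin with `repetition_idx % len(devices)`, then collect the futures. The sketch below isolates that pattern under stated assumptions: it uses `ThreadPoolExecutor` instead of `ProcessPoolExecutor` so it runs anywhere, `fake_solve` stands in for the real solver, and `device_port` assumes one server per GPU on consecutive ports starting at 8000, as the server logs above suggest (the real `get_device_port` is not shown in this section).

```python
import time
from concurrent.futures import ThreadPoolExecutor

DEVICES = [0, 1]  # assumed GPU ids, playing the role of CFG.cuda_visible_devices

def device_port(device, base_port=8000):
    # Assumption: server for GPU k listens on base_port + k.
    return base_port + device

def fake_solve(prompt, port):
    time.sleep(0.01)  # stands in for a request to the server on `port`
    return port

n_repetitions = 5
with ThreadPoolExecutor(max_workers=n_repetitions) as pool:
    submits = [
        pool.submit(fake_solve, prompt='2+2?',
                    port=device_port(DEVICES[i % len(DEVICES)]))
        for i in range(n_repetitions)
    ]
    # result() blocks until each task is done and preserves submission order.
    ports = [submit.result() for submit in submits]

print(ports)  # [8000, 8001, 8000, 8001, 8000]
```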

In [51]:
def search_answer_for_prompt(prompt_template, problem_idx, repetitions=CFG.n_repetitions):
    create_model_and_inference_artifacts()
    prompt = prompt_template.replace('PROBLEM_PLACEHOLDER', df['problem'].loc[problem_idx])
    results = Results()
    results.initialize(problem_idx)
    max_workers = CFG.max_workers or repetitions
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        submits = []
        for repetition_idx in range(repetitions):
            device = CFG.cuda_visible_devices[repetition_idx % len(CFG.cuda_visible_devices)]
            port = get_device_port(device)
            submits.append(pool.submit(safe_solve_problem_with_code_interpreter, prompt=prompt, port=port))
        monitor_progress(submits)
        problem_results = [submit.result() for submit in submits]
    for result in problem_results:
        results.add_result(problem_idx, result)
        if CFG.verbose:
            display_decoded_output(problem_idx, '\n\n' + result.response)
    try:
        result, count = results.get_most_frequent_result(problem_idx, CFG.result_priority)
        answer = result % 1000
        logging.info(f'Predicted answer for problem {problem_idx} is: {answer} with {count} votes')
    except NoValidResults:
        logging.warning(f'No valid results for problem {problem_idx}. Using default answer: {CFG.default_answer}')
        answer = CFG.default_answer
    logging.info(f'Ground truth is: {df.loc[problem_idx, "ground_truth"]}')

## Load previous results

In [47]:
old_results = Results()
old_results.load('/mnt/hdd0/Kaggle/aimo/experiments/17_vllm/400_repetitions.json')
accuracy_report = get_accuracy_report(old_results, CFG.result_priority)

2024-06-18 10:14:25,519 - INFO - Loading results from /mnt/hdd0/Kaggle/aimo/experiments/17_vllm/400_repetitions.json


In [48]:
accuracy_report.sort_values('correct_counts', ascending=False)

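| idx | answer | ground_truth | n_runs | correct_counts | highest_wrong_counts | wrong_counts | highest_correct_tokens | is_correct | pass |
|---|---|---|---|---|---|---|---|---|---|
| summary | - | - | 252979 | 93344 | 37179 | 80289 | 639 | 351.0 | 539 |
| 32 | 4 | 4 | 512 | 502 | 0 | 0 | 633 | 1.0 | 1 |
| 29 | 1160 | 1160 | 515 | 488 | 9 | 17 | 619 | 1.0 | 1 |
| 53 | 81 | 81 | 491 | 484 | 1 | 5 | 586 | 1.0 | 1 |
| 54 | 60000 | 60000 | 490 | 480 | 3 | 10 | 442 | 1.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 341 | 52 | 41 | 417 | 0 | 26 | 212 | | 0.0 | 0 |
| 163 | 280 | 20160 | 438 | 0 | 20 | 226 | | 0.0 | 0 |
| 166 | 66 | 14 | 438 | 0 | 345 | 438 | | 0.0 | 0 |
| 238 | 1007 | 62 | 433 | 0 | 112 | 222 | | 0.0 | 0 |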


In [92]:
accuracy_report[accuracy_report.correct_counts == 10]

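| idx | answer | ground_truth | n_runs | correct_counts | highest_wrong_counts | wrong_counts | highest_correct_tokens | is_correct | pass |
|---|---|---|---|---|---|---|---|---|---|
| 211 | 42 | 45 | 438 | 10 | 198 | 253 | 478 | 0.0 | 1 |
| 229 | 10 | 6 | 435 | 10 | 59 | 194 | 428 | 0.0 | 1 |
| 246 | 205 | 225 | 431 | 10 | 206 | 326 | 274 | 0.0 | 1 |
| 299 | 3988 | 997 | 420 | 10 | 23 | 58 | 469 | 0.0 | 1 |
| 392 | 3 | 4 | 417 | 10 | 54 | 279 | 622 | 0.0 | 1 |
| 407 | 97 | 7 | 417 | 10 | 40 | 221 | 624 | 0.0 | 1 |
| 451 | 6 | 35 | 417 | 10 | 90 | 174 | 617 | 0.0 | 1 |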


I could try to solve these problems, which have at least one correct count.

In [50]:
raise

RuntimeError: No active exception to reraise

## Try with hints

### 70

In [101]:
problem_idx = 70
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

| idx | answer | ground_truth | n_runs | correct_counts | highest_wrong_counts | wrong_counts | highest_correct_tokens | is_correct | pass |
|---|---|---|---|---|---|---|---|---|---|
| 70 | 4 | 5 | 474 | 2 | 394 | 451 | 620 | 0.0 | 1 |


## Problem 70

How many distinct rectangles are there with integer side lengths such that the numerical value of area of the rectangle in square units is equal to $5$ times the numerical value of the perimeter in units? (Two rectangles are considered to be distinct if they are not congruent.)

## Solution

Let the side lengths of the rectangle be $a$ and $b$ with $a\leq b$. Then $ab=10(a+b).$ Expanding and moving all the terms to the left hand side gives $ab-10a-10b=0.$ We apply Simon's Favorite Factoring Trick and add $100$ to both sides to allow us to factor the left hand side: $$ab-10a-10b+100 = (a-10)(b-10)=100$$From this, we know that $(a-10,b-10)$ must be a pair of factors of $100$. Consequently, the pairs $(a,b)$ that provide different areas are $(11,110),$ $(12, 60),$ $(14, 35),$ $(15, 30),$ and $(20,20)$. There are therefore $\boxed{5}$ distinct rectangles with the desired property.
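
The count in the solution can be checked by brute force. The sketch below enumerates integer rectangles with `a <= b` whose area equals 5 times the perimeter, that is `a * b == 10 * (a + b)`. The search bound of 200 is an assumption; it comfortably covers the largest side, 110, found in the solution above.

```python
# Area a*b must equal 5 * perimeter = 5 * 2 * (a + b).
LIMIT = 200  # assumed search bound, larger than the biggest side in the solution

rectangles = [
    (a, b)
    for a in range(1, LIMIT + 1)
    for b in range(a, LIMIT + 1)   # a <= b so congruent rectangles count once
    if a * b == 10 * (a + b)
]

print(rectangles)       # [(11, 110), (12, 60), (14, 35), (15, 30), (20, 20)]
print(len(rectangles))  # 5
```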

In [53]:
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 10:17:06,766 - INFO - Result counts for 70: [(5, 19), (3, 16), (2, 12), (4, 9), (1, 8), (6, 2), (30, 1), (0, 1), (24, 1), (125, 1)]
2024-06-18 10:17:06,767 - INFO - Predicted answer for problem 70 is: 5 with 19 votes
2024-06-18 10:17:06,768 - INFO - Ground truth is: 5


In [54]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 10:17:58,199 - INFO - Result counts for 70: [(5, 43), (4, 24), (3, 12), (6, 6), (2, 5), (1, 4), (9, 2), (8, 1)]
2024-06-18 10:17:58,200 - INFO - Predicted answer for problem 70 is: 5 with 43 votes
2024-06-18 10:17:58,200 - INFO - Ground truth is: 5


In [55]:
search_answer_for_problem(problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 10:18:40,521 - INFO - Result counts for 70: [(4, 90), (8, 2), (7, 2), (5, 1)]
2024-06-18 10:18:40,522 - INFO - Predicted answer for problem 70 is: 4 with 90 votes
2024-06-18 10:18:40,522 - INFO - Ground truth: 5


4

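A very surprising finding: the model solves this problem much more easily without code. The correct answer, 5, wins the vote with both text-only prompts (19 and 43 votes out of 100 runs), while `search_answer_for_problem` gives it a single vote and gives 90 votes to the wrong answer 4.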

#### Is the non-negative hint helpful?

In [68]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer."""
search_answer_for_prompt(prompt_template, problem_idx, repetitions=400)

  0%|          | 0/400 [00:00<?, ?it/s]

2024-06-18 11:50:06,717 - INFO - Result counts for 70: [(5, 114), (4, 78), (3, 63), (2, 49), (1, 45), (6, 15), (9, 6), (8, 6), (0, 2), (7, 2), (12, 1), (17, 1)]
2024-06-18 11:50:06,722 - INFO - Predicted answer for problem 70 is: 5 with 114 votes
2024-06-18 11:50:06,733 - INFO - Ground truth is: 5


In [69]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx, repetitions=400)

  0%|          | 0/400 [00:00<?, ?it/s]

2024-06-18 11:52:46,761 - INFO - Result counts for 70: [(5, 147), (4, 91), (3, 47), (2, 30), (6, 17), (7, 13), (8, 12), (1, 11), (10, 1), (9, 1)]
2024-06-18 11:52:46,762 - INFO - Predicted answer for problem 70 is: 5 with 147 votes
2024-06-18 11:52:46,762 - INFO - Ground truth is: 5


In [83]:
def check_statistical_difference(count1, count2, n):
    ratio1 = count1/n
    ratio2 = count2/n
    uncertainty1 = estimate_uncertainty(ratio1, n)
    uncertainty2 = estimate_uncertainty(ratio2, n)
    print(f'Ratio 1: {ratio1:.2f} +- {uncertainty1:.2f}')
    print(f'Ratio 2: {ratio2:.2f} +- {uncertainty2:.2f}')
    print(f'Ratio difference: {ratio1 - ratio2:.2f} +- {(uncertainty1**2 + uncertainty2**2)**0.5:.2f}')

In [84]:
check_statistical_difference(147, 114, 400)

Ratio 1: 0.37 +- 0.05
Ratio 2: 0.28 +- 0.04
Ratio difference: 0.08 +- 0.06


The difference is significant: for this problem, telling the model that the answer is a non-negative integer is harmful.
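
Spelled out with the numbers printed above, where $\hat p_1$ is the prompt without the hint and $\hat p_2$ the prompt with it, and using the same 95% normal approximation as `estimate_uncertainty`, the comparison reads as follows (values are rounded):

$$\hat p_1 = \frac{147}{400} = 0.3675, \qquad \hat p_2 = \frac{114}{400} = 0.285$$

$$u_i = 1.96\sqrt{\frac{\hat p_i(1-\hat p_i)}{400}} \;\Rightarrow\; u_1 \approx 0.047, \quad u_2 \approx 0.044$$

$$\hat p_1 - \hat p_2 = 0.0825 \pm \sqrt{u_1^2 + u_2^2} \approx 0.0825 \pm 0.065$$

The resulting interval, roughly $[0.018, 0.147]$, does not contain zero, which is what supports the conclusion at the 95% level.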

#### Assistant prompt

In [102]:
prompt_template = """User: PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.
Assistant: """
search_answer_for_prompt(prompt_template, problem_idx, repetitions=400)

  0%|          | 0/400 [00:00<?, ?it/s]

2024-06-18 13:39:09,637 - INFO - Result counts for 70: [(5, 134), (4, 93), (3, 57), (6, 19), (8, 16), (2, 16), (7, 15), (1, 6), (9, 5), (10, 4)]
2024-06-18 13:39:09,638 - INFO - Predicted answer for problem 70 is: 5 with 134 votes
2024-06-18 13:39:09,638 - INFO - Ground truth is: 5


### 193

In [56]:
problem_idx = 193
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

Unnamed: 0,answer,ground_truth,n_runs,correct_counts,highest_wrong_counts,wrong_counts,highest_correct_tokens,is_correct,pass
193,14,9,438,2,83,282,203,0.0,1


## Problem 193

The lengths, in order, of four consecutive sides of an equiangular hexagon are 1, 7, 2 and 4 units, respectively. What is the sum of the lengths of the two remaining sides?

## Solution

Name the vertices of the hexagon so that hexagon $ABCDEF$ has $AB=1$, $BC=7$, $CD=2$, and $DE=4$.  The hexagon is equiangular, so each interior angle measures $180(6-2)/6=120$ degrees.  Extend sides $AB$, $CD$, and $EF$ and call their intersection points $G$, $H$, and $J$ as shown.  The exterior angles of the hexagon each measure $180-120=60$ degrees, so triangles $JDE$, $CBH$, $FGA$, and $JHG$ are all equilateral.  It follows that $JD=DE=4$ units and $CH=CB=7$ units.  Therefore the side length $JH$ of triangle $JGH$ is $4+2+7=13$ units.  Turning to side $HG$, we find that $AF=AG=13-(7+1)=5$ units.  Finally, we solve $JG=JE+EF+FG$ for $EF$ to get $EF=13-(4+5)=4$ units.  The sum of the missing sides is $5+4=\boxed{9}$ units.

[asy]
size(6cm);
defaultpen(linewidth(.7pt)+fontsize(8pt));
dotfactor=4;

pair A=(8,0), B=(7,0), C=7*dir(60), D=9*dir(60), Ep=(13,0)+9*dir(120), F=(13,0)+5*dir(120), G=(13,0), H=(0,0), J=13*dir(60);

pair[] dots = {A, B, C, D, Ep, F};

dot(dots);

draw(A--B--C--D--Ep--F--cycle);

draw(B--H--C,dashed);
draw(D--J--Ep,dashed);
draw(F--G--A,dashed);

label("$A$",A,S);
label("$B$",B,S);
label("$C$",C,NW);
label("$D$",D,NW);
label("$E$",Ep,NE);
label("$F$",F,NE);
label("$G$",G,SE);
label("$H$",H,SW);
label("$J$",J,N);

label("$1$",(A+B)/2,N);
label("$7$",(B+C)/2,NE);
label("$2$",(C+D)/2,SE);
label("$4$",(D+Ep)/2,S);
[/asy]
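
The reference answer can be cross-checked numerically. In an equiangular hexagon each side turns by the 60 degree exterior angle, so the six sides, seen as complex numbers, must sum to zero. The sketch below solves that closure condition for the two unknown lengths; `remaining_sides` is a hypothetical helper and the choice of the first side along the real axis is an arbitrary convention.

```python
import cmath
import math

def remaining_sides(known_sides):
    """Solve for the last two sides of an equiangular hexagon given the first four, in order."""
    omega = cmath.exp(1j * math.pi / 3)  # rotation by the 60 degree exterior angle
    known = sum(length * omega**k for k, length in enumerate(known_sides))
    u, v = omega**4, omega**5  # directions of the two unknown sides
    r = -known  # closure: a*u + b*v = r
    # Split into real and imaginary parts and solve the 2x2 system with Cramer's rule.
    det = u.real * v.imag - u.imag * v.real
    a = (r.real * v.imag - r.imag * v.real) / det
    b = (u.real * r.imag - u.imag * r.real) / det
    return a, b

a, b = remaining_sides([1, 7, 2, 4])
print(round(a, 6), round(b, 6), round(a + b, 6))
```

The two lengths come out as 4 and 5 (up to floating point error), which agrees with the solution's $EF=4$, $AF=5$ and the sum of 9.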

In [59]:
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 10:51:38,513 - INFO - Result counts for 193: [(14, 13), (8, 5), (6, 4), (3, 3), (5, 3), (9, 2), (4, 2), (28, 2), (2, 1), (346, 1), (40, 1), (7, 1), (201, 1), (36, 1), (1, 1)]
2024-06-18 10:51:38,513 - INFO - Predicted answer for problem 193 is: 14 with 13 votes
2024-06-18 10:51:38,514 - INFO - Ground truth is: 9


In [57]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 10:20:20,938 - INFO - Result counts for 193: [(14, 28), (2, 7), (8, 7), (7, 4), (16, 4), (5, 3), (4, 2), (6, 2), (70, 2), (3, 2), (11, 1), (10, 1)]
2024-06-18 10:20:20,939 - INFO - Predicted answer for problem 193 is: 14 with 28 votes
2024-06-18 10:20:20,939 - INFO - Ground truth is: 9


In [58]:
search_answer_for_problem(problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 10:24:47,541 - INFO - Result counts for 193: [(4, 15), (14, 14), (7, 5), (28, 4), (10, 3), (6, 2), (22, 2), (3, 1), (5, 1), (16, 1), (106, 1), (2, 1), (9, 1), (19, 1), (18, 1), (34, 1), (0, 1), (12, 1), (42, 1), (11, 1)]
2024-06-18 10:24:47,542 - INFO - Predicted answer for problem 193 is: 4 with 15 votes
2024-06-18 10:24:47,543 - INFO - Ground truth: 9


4

In [60]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.

An equiangular hexagon is a special case of a regular hexagon, where all the angles are equal."""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 11:32:31,955 - INFO - Result counts for 193: [(14, 29), (9, 8), (5, 6), (8, 6), (3, 5), (6, 3), (7, 3), (10, 2), (106, 2), (120, 1), (11, 1), (2, 1), (12, 1), (28, 1), (24, 1), (13, 1), (20, 1), (16, 1)]
2024-06-18 11:32:31,956 - INFO - Predicted answer for problem 193 is: 14 with 29 votes
2024-06-18 11:32:31,956 - INFO - Ground truth is: 9


In [61]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.

An equiangular hexagon is a special case of a regular hexagon, where all the angles are equal.
An equiangular hexagon will have all the angles equal to 120 degrees."""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 11:35:24,446 - INFO - Result counts for 193: [(14, 30), (8, 11), (7, 8), (5, 6), (6, 4), (28, 3), (9, 3), (3, 3), (10, 1), (4, 1), (1, 1)]
2024-06-18 11:35:24,447 - INFO - Predicted answer for problem 193 is: 14 with 30 votes
2024-06-18 11:35:24,447 - INFO - Ground truth is: 9


In [62]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.

An equiangular hexagon is a special case of a regular hexagon, where all the angles are equal.
An equiangular hexagon will have all the angles equal to 120 degrees.
We already know the angles and the dimensions of 4 sides, thus we simply have to find where the remaining 2 sides meet."""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 11:37:14,609 - INFO - Result counts for 193: [(14, 22), (5, 10), (8, 7), (9, 6), (7, 6), (6, 5), (4, 4), (3, 3), (10, 2), (2, 2), (166, 1), (120, 1), (13, 1), (68, 1), (353, 1), (12, 1), (18, 1), (16, 1), (96, 1)]
2024-06-18 11:37:14,609 - INFO - Predicted answer for problem 193 is: 14 with 22 votes
2024-06-18 11:37:14,610 - INFO - Ground truth is: 9


### 300

In [99]:
problem_idx = 300
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

Unnamed: 0,answer,ground_truth,n_runs,correct_counts,highest_wrong_counts,wrong_counts,highest_correct_tokens,is_correct,pass
300,3,1209,420,2,21,73,452,0.0,1


## Problem 300

Let $f : \mathbb{R} \to \mathbb{R}$ be a function such that $f(5) = 3$ and
\[f(4xy) = 2y[f(x + y) + f(x - y)]\]for all real numbers $x$ and $y.$  Find $f(2015).$

## Solution

Setting $y = 0,$ we get $f(0) = 0.$

Then setting $x = 0,$ we get
\[f(0) = 2y[f(y) + f(-y)].\]Assuming $y \neq 0,$ we get $f(-y) + f(y) = 0.$  Hence, $f(-y) = -f(y)$ for all $y.$

We can reverse the roles of $x$ and $y$ to get
\[f(4xy) = 2x[f(x + y) + f(y - x)],\]so
\[2y[f(x + y) + f(x - y)] = 2x[f(x + y) + f(y - x)].\]Hence,
\[y f(x - y) - x f(y - x) = (x - y) f(x + y).\]Since $f(y - x) = -f(x - y),$
\[(x + y) f(x - y) = (x - y) f(x + y).\]We want to take $x$ and $y$ so that $x + y = 5$ and $x - y = 2015.$  Solving, we find $x = 1010$ and $y = -1005.$  Then
\[5 f(2015) = 2015 f(5),\]so $f(2015) = \frac{2015 f(5)}{5} = \boxed{1209}.$
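
Since $x + y$ and $x - y$ can take any pair of real values, the relation $(x + y) f(x - y) = (x - y) f(x + y)$ forces $f(t)/t$ to be constant for $t \neq 0$, so $f(t) = 3t/5$. A minimal sketch that checks this candidate against the original equation with exact rational arithmetic; the sample grid is an arbitrary choice and only spot-checks the identity.

```python
from fractions import Fraction

def f(x):
    """Candidate solution f(x) = 3x/5, fixed by f(5) = 3."""
    return Fraction(3, 5) * x

# Spot-check the functional equation on a small grid of rational points.
samples = [Fraction(n, d) for n in range(-6, 7) for d in (1, 2, 3)]
holds = all(
    f(4 * x * y) == 2 * y * (f(x + y) + f(x - y))
    for x in samples
    for y in samples
)
print(holds, f(5), f(2015))
```

The candidate satisfies the equation on every sampled pair and gives $f(2015) = 1209$.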

In [100]:
logging.info('Just the problem')
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)
logging.info('Reason step by step')
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

2024-06-18 13:21:25,882 - INFO - Just the problem


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:22:38,200 - INFO - Result counts for 300: [(0, 2), (3, 2), (1215, 1)]
2024-06-18 13:22:38,201 - INFO - Predicted answer for problem 300 is: 0 with 2 votes
2024-06-18 13:22:38,201 - INFO - Ground truth is: 1209
2024-06-18 13:22:38,202 - INFO - Reason step by step


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:23:51,421 - INFO - Result counts for 300: [(6, 1)]
2024-06-18 13:23:51,421 - INFO - Predicted answer for problem 300 is: 6 with 1 votes
2024-06-18 13:23:51,422 - INFO - Ground truth is: 1209


In [65]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.

Let's start by inspecting the cases when x=0 and y=0"""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 11:44:07,206 - INFO - Result counts for 300: [(0, 7), (1209, 1), (2418, 1)]
2024-06-18 11:44:07,207 - INFO - Predicted answer for problem 300 is: 0 with 7 votes
2024-06-18 11:44:07,207 - INFO - Ground truth is: 1209


### 464

In [85]:
problem_idx = 464
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

Unnamed: 0,answer,ground_truth,n_runs,correct_counts,highest_wrong_counts,wrong_counts,highest_correct_tokens,is_correct,pass
464,468,636,417,2,21,391,337,0.0,1


## Problem 464

My clock chimes two times 15 minutes after the hour, four times 30 minutes after the hour and six times 45 minutes after the hour. The clock also chimes eight times on each hour in addition to chiming the number of times equal to the hour. (So at 2:00 p.m., the clock chimes $8 + 2 = 10$ times.) Starting at 12:05 a.m., how many times does the clock chime in a 24-hour period?

## Solution

Twenty-four hours will pass, so, ignoring the chimes on the hour that are equal to the hour, there are $24 \cdot (2 + 4 + 6 + 8) = 480$ chimes. Now, the chimes that are equal to the hour can be calculated by $2 \cdot (12 + 1 + 2 + \ldots + 9 + 10 + 11) = 2 \cdot 78 = 156$. Thus, there are a total of $480 + 156 = \boxed{636}$ chimes.
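
The count is easy to confirm by brute force over every minute of the 24-hour window. This is only a sketch: `count_chimes` is a hypothetical helper, and minutes are counted from midnight so that the window starts at minute 5 (12:05 a.m.).

```python
def count_chimes(start_minute=5, duration_minutes=24 * 60):
    """Count chimes minute by minute, starting at 12:05 a.m. by default."""
    quarter_chimes = {15: 2, 30: 4, 45: 6}
    total = 0
    for t in range(start_minute, start_minute + duration_minutes):
        hour, minute = (t // 60) % 24, t % 60
        if minute == 0:
            # Eight chimes plus the hour in 12-hour convention (0 and 12 both count as 12).
            total += 8 + (hour % 12 or 12)
        elif minute in quarter_chimes:
            total += quarter_chimes[minute]
    return total

print(count_chimes())
```

The function returns 636, matching the ground truth.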

In [97]:
logging.info('Just the problem')
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)
logging.info('Reason step by step')
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

2024-06-18 13:12:00,055 - INFO - Just the problem


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:12:54,353 - INFO - Result counts for 567: [(14, 14), (15, 14), (22, 6), (18, 6), (6, 5), (12, 3), (16, 2), (24, 2), (0, 2), (30, 2), (3, 2), (5, 2), (21, 2), (9, 1), (20, 1), (39, 1), (23, 1), (4, 1), (46, 1), (17, 1), (35, 1), (10, 1), (13, 1)]
2024-06-18 13:12:54,354 - INFO - Predicted answer for problem 567 is: 14 with 14 votes
2024-06-18 13:12:54,355 - INFO - Ground truth is: 20
2024-06-18 13:12:54,355 - INFO - Reason step by step


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:13:47,550 - INFO - Result counts for 567: [(15, 27), (14, 24), (24, 5), (22, 5), (3, 3), (12, 3), (13, 3), (6, 3), (31, 3), (16, 2), (18, 2), (21, 2), (27, 1), (26, 1), (9, 1), (10, 1), (11, 1), (46, 1), (23, 1), (30, 1), (29, 1), (5, 1)]
2024-06-18 13:13:47,551 - INFO - Predicted answer for problem 567 is: 15 with 27 votes
2024-06-18 13:13:47,551 - INFO - Ground truth is: 20


In [None]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.

Let's summarize the information given in the problem:

- The clock chimes 2 times at 15 minutes past the hour.
- The clock chimes 4 times at 30 minutes past the hour.
- The clock chimes 6 times at 45 minutes past the hour.
- The clock chimes 8 times at the hour.
- In addition at the hour it chimes the number of the hour (in a 12 hour convention)."""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 12:54:18,912 - INFO - Result counts for 464: [(480, 7), (348, 7), (162, 5), (174, 3), (460, 3), (156, 2), (328, 2), (180, 2), (780, 2), (224, 2), (432, 2), (142, 1), (492, 1), (236, 1), (200, 1), (600, 1), (310, 1), (444, 1), (111, 1), (323, 1), (756, 1), (315, 1), (768, 1), (386, 1), (342, 1), (557, 1), (240, 1), (254, 1), (456, 1), (207, 1), (552, 1), (228, 1), (440, 1), (380, 1), (166, 1), (264, 1), (186, 1), (318, 1), (504, 1), (202, 1), (268, 1), (503, 1), (558, 1), (154, 1), (1068, 1), (430, 1), (64, 1), (252, 1), (337, 1), (571, 1), (296, 1), (161, 1), (276, 1), (420, 1), (118, 1), (606, 1), (120, 1), (379, 1), (115, 1), (250, 1), (128, 1), (510, 1), (306, 1), (171, 1)]
2024-06-18 12:54:18,912 - INFO - Predicted answer for problem 464 is: 348 with 7 votes
2024-06-18 12:54:18,913 - INFO - Ground truth is: 636


### 567

In [96]:
problem_idx = 567
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

logging.info('Just the problem')
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)
logging.info('Reason step by step')
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

Unnamed: 0,answer,ground_truth,n_runs,correct_counts,highest_wrong_counts,wrong_counts,highest_correct_tokens,is_correct,pass
567,22,20,417,2,142,354,439,0.0,1


## Problem 567

Find the number of solutions to $\cos \frac{x}{4} = \cos x$ in the interval $0 < x < 24 \pi.$

## Solution

From the equation $\cos \frac{x}{4} = \cos x,$ $\cos x - \cos \frac{x}{4} = 0.$  From the sum-to-product formula, we can write this as
\[-2 \sin \frac{5x}{8} \sin \frac{3x}{8} = 0.\]Hence, $\sin \frac{5x}{8} = 0$ or $\sin \frac{3x}{8} = 0.$

If $\sin \frac{5x}{8} = 0,$ then $x = \frac{8m \pi}{5}$ for some integer $m,$ $1 \le m \le 14.$  If $\sin \frac{3x}{8} = 0,$ then $x = \frac{8n \pi}{3}$ for some integer $n,$ $1 \le n \le 8.$  Note that $m = 5$ and $n = 3$ give the same solution $x = 8 \pi,$ and $m = 10$ and $n = 6$ give the same solution $x = 16 \pi.$  Thus, the number of solutions is $14 + 8 - 2 = \boxed{20}.$
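
The inclusion-exclusion count can be sketched with exact fractions, storing each solution as $x / \pi$ so that the duplicates $8\pi$ and $16\pi$ collapse in a set. `solutions_over_pi` is a hypothetical helper; it enumerates the two families derived in the solution rather than searching the interval independently.

```python
import math
from fractions import Fraction

def solutions_over_pi(upper=24):
    """Return the distinct values x/pi with x = 8k*pi/5 or x = 8k*pi/3 and 0 < x < upper*pi."""
    found = set()
    for denominator in (5, 3):
        k = 1
        while Fraction(8 * k, denominator) < upper:
            found.add(Fraction(8 * k, denominator))
            k += 1
    return found

solutions = solutions_over_pi()
# Numerically confirm that every enumerated value satisfies cos(x/4) = cos(x).
all_valid = all(
    math.isclose(math.cos(float(t) * math.pi / 4), math.cos(float(t) * math.pi), abs_tol=1e-9)
    for t in solutions
)
print(len(solutions), all_valid)
```

The set holds 20 distinct values, matching the ground truth.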

2024-06-18 13:10:16,889 - INFO - Just the problem


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:11:08,018 - INFO - Result counts for 567: [(15, 26), (14, 11), (6, 6), (22, 4), (3, 4), (24, 3), (12, 3), (18, 2), (8, 2), (48, 2), (0, 1), (60, 1), (16, 1), (27, 1), (19, 1), (7, 1), (17, 1), (13, 1), (36, 1), (2, 1), (32, 1), (21, 1), (25, 1), (23, 1), (4, 1)]
2024-06-18 13:11:08,018 - INFO - Predicted answer for problem 567 is: 15 with 26 votes
2024-06-18 13:11:08,019 - INFO - Ground truth is: 20
2024-06-18 13:11:08,020 - INFO - Reason step by step


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:12:00,013 - INFO - Result counts for 567: [(15, 23), (14, 21), (22, 13), (24, 6), (18, 3), (21, 3), (3, 3), (4, 2), (16, 2), (42, 1), (39, 1), (17, 1), (25, 1), (7, 1), (62, 1), (12, 1), (13, 1)]
2024-06-18 13:12:00,014 - INFO - Predicted answer for problem 567 is: 15 with 23 votes
2024-06-18 13:12:00,014 - INFO - Ground truth is: 20


In [89]:
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}.

Can we rewrite the equation in a more convenient form?"""
search_answer_for_prompt(prompt_template, problem_idx)

  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 12:59:19,764 - INFO - Result counts for 567: [(15, 37), (14, 29), (22, 5), (16, 4), (4, 2), (24, 2), (26, 1), (5, 1), (13, 1), (7, 1), (30, 1), (3, 1), (39, 1)]
2024-06-18 12:59:19,764 - INFO - Predicted answer for problem 567 is: 15 with 37 votes
2024-06-18 12:59:19,765 - INFO - Ground truth is: 20


### 222

In [95]:
problem_idx = 222
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

logging.info('Just the problem')
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)
logging.info('Reason step by step')
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

Unnamed: 0,answer,ground_truth,n_runs,correct_counts,highest_wrong_counts,wrong_counts,highest_correct_tokens,is_correct,pass
222,2,224,438,5,39,43,590,0.0,1


## Problem 222

Let a sequence be defined as follows: $a_1 = 3,$ $a_2 = 3,$ and for $n \ge 2,$
\[a_{n + 1} a_{n - 1} = a_n^2 + 2007.\]Find the largest integer less than or equal to $\frac{a_{2007}^2+a_{2006}^2}{a_{2007}a_{2006}}$.

## Solution

The fact that the equation $a_{n+1}a_{n-1} = a_n^2 + 2007$ holds for $n \geq 2$ implies that $a_na_{n-2} = a_{n-1}^2 + 2007$ for $n \geq 3$. Subtracting the second equation from the first one yields $a_{n+1} a_{n-1} -a_n a_{n-2} = a_n^2 -a_{n-1}^2$, or
\[a_{n+1} a_{n-1} + a_{n-1}^2 = a_n a_{n-2} + a_n^2.\]Dividing the last equation by $a_{n-1} a_n$ and simplifying produces
\[\frac{a_{n+1}+ a_{n-1}}{a_n}=\frac{a_n+a_{n-2}}{a_{n-1}}.\]This equation shows that $\frac{a_{n+1}+a_{n-1}}{a_n}$ is constant for $n\geq 2$.

Because $a_3a_1 = a_2^2 + 2007$, $a_3=2016/3=672$. Thus
\[\frac{a_{n+1}+a_{n-1}}{a_n} = \frac{672+3}{3}=225,\]and $a_{n+1}=225a_n-a_{n-1}$ for $n \geq 2$.

Note that $a_3 = 672 >3 = a_2$. Furthermore, if $a_n > a_{n-1}$, then $a_{n+1}a_{n-1} = a_n^2 + 2007$ implies that \[a_{n+1} = \frac{a_n^2}{a_{n-1}}+\frac{2007}{a_{n-1}} = a_n\left(\frac{a_n}{a_{n-1}}\right) + \frac{2007}{a_{n-1}}>a_n + \frac{2007}{a_{n-1}} > a_n.\]Thus by mathematical induction, $a_n > a_{n-1}$ for all $n \geq 3$. Therefore the recurrence $a_{n+1} = 225a_n - a_{n-1}$ implies that $a_{n+1}> 225a_n - a_n = 224a_n$ and therefore $a_n \geq 2007$ for $n \geq 4$.

Finding $a_{n+1}$ from $a_{n+1} a_{n-1} = a_n^2+ 2007$ and substituting into $225 = \frac{a_{n+1}+a_{n-1}}{a_n}$ shows that
\[\frac{a_n^2 + a_{n-1}^2}{a_n a_{n-1}} = 225 -\frac{2007}{a_n a_{n-1}}.\]Thus the largest integer less than or equal to the original fraction is $\boxed{224}$.
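
Python integers have arbitrary precision, so the floor can also be computed directly from the defining recurrence. A minimal sketch; `floor_ratio` is a hypothetical helper, and the internal assertion relies on the sequence staying integral, as the linear recurrence $a_{n+1} = 225a_n - a_{n-1}$ implies.

```python
def floor_ratio(n_terms=2007):
    """Floor of (a_n^2 + a_{n-1}^2) / (a_n * a_{n-1}) for n = n_terms, using exact integers."""
    prev, curr = 3, 3  # a_1, a_2
    for _ in range(n_terms - 2):
        nxt, remainder = divmod(curr * curr + 2007, prev)
        assert remainder == 0  # every term is an integer
        prev, curr = curr, nxt
    # Here prev = a_{n_terms - 1} and curr = a_{n_terms}.
    return (curr * curr + prev * prev) // (curr * prev)

print(floor_ratio())
```

The function returns 224, matching the ground truth.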

2024-06-18 13:07:59,101 - INFO - Just the problem


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:09:00,356 - INFO - Result counts for 222: [(3, 16), (2, 11), (1, 4), (642, 1), (2008, 1), (235, 1), (223, 1), (63, 1)]
2024-06-18 13:09:00,357 - INFO - Predicted answer for problem 222 is: 3 with 16 votes
2024-06-18 13:09:00,358 - INFO - Ground truth is: 224
2024-06-18 13:09:00,358 - INFO - Reason step by step


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:10:16,824 - INFO - Result counts for 222: [(2, 12), (223, 2), (3, 1), (1, 1), (2007, 1), (224, 1)]
2024-06-18 13:10:16,825 - INFO - Predicted answer for problem 222 is: 2 with 12 votes
2024-06-18 13:10:16,825 - INFO - Ground truth is: 224


### 451

In [94]:
problem_idx = 451
display(accuracy_report.loc[problem_idx:problem_idx])
display(Markdown(f'## Problem {problem_idx}\n\n' + df.loc[problem_idx, 'problem']))
display(Markdown('## Solution\n\n' + df.loc[problem_idx, 'solution']))

logging.info('Just the problem')
prompt_template = "PROBLEM_PLACEHOLDER"
search_answer_for_prompt(prompt_template, problem_idx)
logging.info('Reason step by step')
prompt_template = """PROBLEM_PLACEHOLDER
Please reason step by step, and put your final answer within \\boxed{}."""
search_answer_for_prompt(prompt_template, problem_idx)

Unnamed: 0,answer,ground_truth,n_runs,correct_counts,highest_wrong_counts,wrong_counts,highest_correct_tokens,is_correct,pass
451,6,35,417,10,90,174,617,0.0,1


## Problem 451

In the statement below, the two blanks can be filled by positive single-digit numbers in such a way that the statement is always true:

$$\text{If }2x\equiv y+5\ (\bmod\ 9)\text{, then }x\equiv \underline{\ \ \ }\,y+\underline{\ \ \ }\ (\bmod\ 9).$$What is the product of the two digits that go in the blanks?

## Solution

Multiplying both sides of the congruence $$2x\equiv y+5\pmod 9$$by $5$ gives $$10x \equiv 5y+25\pmod 9,$$then reducing both sides modulo $9$ gives $$x\equiv 5y+7\pmod 9.$$Thus, the product of the blanks is $5\cdot 7=\boxed{35}$.
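
The blanks can also be found by exhaustive search over all positive single-digit pairs, keeping those for which the implication holds for every residue pair. `valid_blanks` is a hypothetical helper written for this check.

```python
def valid_blanks(modulus=9):
    """Return all digit pairs (a, b) such that 2x = y + 5 implies x = a*y + b (mod modulus)."""
    pairs = []
    for a in range(1, 10):
        for b in range(1, 10):
            implication_holds = all(
                (x - (a * y + b)) % modulus == 0
                for x in range(modulus)
                for y in range(modulus)
                if (2 * x - (y + 5)) % modulus == 0  # only pairs satisfying the premise
            )
            if implication_holds:
                pairs.append((a, b))
    return pairs

print(valid_blanks())
```

The only pair found is (5, 7), whose product is 35.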

2024-06-18 13:06:09,459 - INFO - Just the problem


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:06:59,692 - INFO - Result counts for 451: [(8, 27), (5, 14), (1, 12), (7, 8), (35, 4), (2, 4), (15, 3), (6, 3), (14, 1), (10, 1)]
2024-06-18 13:06:59,693 - INFO - Predicted answer for problem 451 is: 8 with 27 votes
2024-06-18 13:06:59,694 - INFO - Ground truth is: 35
2024-06-18 13:06:59,695 - INFO - Reason step by step


  0%|          | 0/100 [00:00<?, ?it/s]

2024-06-18 13:07:59,002 - INFO - Result counts for 451: [(8, 19), (5, 12), (1, 8), (4, 5), (7, 4), (2, 3), (0, 1), (15, 1), (3, 1), (10, 1), (25, 1)]
2024-06-18 13:07:59,003 - INFO - Predicted answer for problem 451 is: 8 with 19 votes
2024-06-18 13:07:59,003 - INFO - Ground truth is: 35


## Learnings

- Some problems are better solved without using code.
- Telling the model that the answer is non-negative might be harmful.
- Sometimes the "reason step by step" prompt works worse than simply giving the problem as input.
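
All the comparisons in this section rest on the same aggregation step: count the sampled answers and keep the most voted one. A minimal sketch of that step, assuming that runs without a parseable answer are represented as `None` and skipped; `majority_vote` is a hypothetical name and not the notebook's `search_answer_for_prompt`.

```python
from collections import Counter

def majority_vote(answers):
    """Return (predicted_answer, votes, ranked_counts) for a list of sampled answers."""
    counts = Counter(answer for answer in answers if answer is not None)
    if not counts:
        return None, 0, []
    ranked = counts.most_common()  # sorted by votes, highest first
    predicted, votes = ranked[0]
    return predicted, votes, ranked

# Toy sample in the style of the logs above: many runs without a parseable answer.
sampled = [14] * 13 + [8] * 5 + [9] * 2 + [None] * 80
predicted, votes, ranked = majority_vote(sampled)
print(f'Result counts: {ranked}')
print(f'Predicted answer is: {predicted} with {votes} votes')
```

When few runs produce an answer, the winning vote count can be very small, so the prediction is only as reliable as the number of valid samples behind it.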