TIM

This repository accompanies the paper From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models, accepted to ACL 2026 Main. It contains the public release of our benchmark and the training/evaluation pipeline for tool-assisted mathematical reasoning.

What Is Included

  • A structured Python package for training, benchmarking, and multi-dimensional evaluation.
  • A canonical benchmark split for competition-style math problems.
  • Train-data generators deriving DPO-style JSONL from the train dataset.
  • Command-line entry points for benchmarking, correctness checks, missing-step evaluation, and pairwise comparison.

Repository Layout

TIM/
├── data/
│   ├── benchmark/
│   └── train/
├── src/
│   ├── cli/
│   ├── benchmarking.py
│   ├── evaluation.py
│   ├── judging.py
│   └── ...
├── tests/
├── pyproject.toml
└── README.md

Installation

python -m venv .venv
source .venv/bin/activate
pip install -e .

Set your API key before running judge or benchmark commands. Example:

export OPENAI_API_KEY=...

Quick Start

Run a base benchmark on the included benchmark split:

tim-benchmark \
  --model gpt-4.1-2025-04-14 \
  --mode base \
  --output outputs/base_gpt-4.1-2025-04-14.csv

Run a tool-assisted benchmark. The public tool mode is compatible with the internal talm naming used in older scripts:

tim-benchmark \
  --model o4-mini-2025-04-16 \
  --mode tool \
  --output outputs/tool_o4-mini-2025-04-16.csv

Add correctness labels to a response file:

tim-eval-correctness \
  --responses outputs/tool_o4-mini-2025-04-16.csv \
  --output outputs/tool_o4-mini-2025-04-16_correctness.csv

Compute missing-step rates for correct solutions:

tim-eval-miss-rate \
  --responses outputs/tool_o4-mini-2025-04-16_correctness.csv \
  --output outputs/tool_o4-mini-2025-04-16_miss_rate.csv
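
The labeled CSVs produced by the two evaluation commands above can be summarized with a short script. The following is a minimal Python sketch using only the standard library; the column names is_correct and miss_rate are assumptions, so check the headers your version of the CLI actually emits:

```python
import csv
from pathlib import Path

def summarize(path, column):
    """Average a numeric column (e.g. 0/1 correctness labels) over a results CSV."""
    with Path(path).open(newline="") as f:
        # Skip rows where the label is empty (e.g. unjudged responses).
        values = [float(row[column]) for row in csv.DictReader(f) if row[column] != ""]
    return sum(values) / len(values) if values else float("nan")

# Illustrative usage (paths and column names are assumptions, not guaranteed by the CLI):
# accuracy = summarize("outputs/tool_o4-mini-2025-04-16_correctness.csv", "is_correct")
# avg_miss = summarize("outputs/tool_o4-mini-2025-04-16_miss_rate.csv", "miss_rate")
```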

Compare base and tool-assisted runs head-to-head:

tim-compare \
  --base outputs/base_gpt-4.1-2025-04-14.csv \
  --tool outputs/tool_gpt-4.1-2025-04-14.csv \
  --output outputs/base_vs_tool_gpt-4.1-2025-04-14.csv
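
Beyond the judge-based comparison, a quick sanity check for tool-induced failures is to join the two correctness-labeled CSVs and count problems the base model solves but the tool-assisted run misses. A hedged sketch, assuming problem_id and is_correct columns (adjust both to your CLI's actual headers):

```python
import csv

def tool_regressions(base_path, tool_path, id_col="problem_id", label_col="is_correct"):
    """Count problems the base run solves (label >= 0.5) but the tool run gets wrong.

    The id_col / label_col defaults are assumptions about the CSV headers;
    adjust them to match the files your version of the CLI produces.
    """
    def load(path):
        with open(path, newline="") as f:
            return {row[id_col]: float(row[label_col]) for row in csv.DictReader(f)}

    base, tool = load(base_path), load(tool_path)
    # Only compare problems present in both runs.
    return sum(1 for pid in base.keys() & tool.keys()
               if base[pid] >= 0.5 and tool[pid] < 0.5)
```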

Data

The repository includes benchmark, validation, and training artifacts with their original data lineage preserved. The active train, validation, and benchmark sets are disjoint (no problem appears in more than one):

  • data/benchmark/pymath_dataset.csv: Public evaluation benchmark with 1,000 examples.

  • data/benchmark/pymath_validation_dataset.csv: Validation split with 82 examples.

  • data/train/train_dataset_v2_gpt-4.1-2025-04-14.csv: Preference data for DPO training.

  • data/train/aime.csv

  • data/train/olympiadbench.csv

  • data/train/olympicarena.csv

  • data/train/omni_math.csv

Because the AoPS AIME pages do not have a standard open-data license, the AIME-sourced CSV rows may store AoPS problem links instead of extracted solution text. Run the materializer once to replace those links in the public benchmark, validation, and train CSVs with extracted AIME solution text:

tim-materialize-aime-solutions

This command replaces those AoPS links with extracted AIME solution text in:

  • data/benchmark/pymath_dataset.csv
  • data/benchmark/pymath_validation_dataset.csv
  • data/train/train_dataset_v2_gpt-4.1-2025-04-14.csv
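
Before running the materializer (or to verify that it worked), you can audit how many rows still hold AoPS links instead of solution text. A minimal sketch; the solution column name and the exact link prefix are assumptions about the CSV layout, not guaranteed by this repository:

```python
import csv

def count_aops_links(path, column="solution"):
    """Count rows whose solution field still looks like an AoPS URL.

    The column name and URL prefix are assumptions; adjust them to the
    actual CSV schema and link format used in the data files.
    """
    with open(path, newline="") as f:
        return sum(1 for row in csv.DictReader(f)
                   if row.get(column, "").strip().startswith("https://artofproblemsolving.com"))
```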

Generate a DPO-style training set from data/train/train_dataset_v2_gpt-4.1-2025-04-14.csv for fine-tuning GPT-4.1 in the OpenAI dashboard:

tim-generate-dpo-data \
  --train-dataset data/train/train_dataset_v2_gpt-4.1-2025-04-14.csv \
  --output data/train/openai_dpo_data_gpt-4.1-2025-04-14.jsonl

This derived JSONL uses the train CSV's problem, positive_response_text, and negative_response_text columns. If any AIME-backed rows in the benchmark or train CSVs still store AoPS links, run tim-materialize-aime-solutions first.
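
For reference, the conversion is roughly the following. This is a sketch, not the shipped generator: it uses the three column names documented above, and the record layout follows OpenAI's published preference fine-tuning JSONL shape, which may differ in detail from what tim-generate-dpo-data actually emits:

```python
import csv
import json

def rows_to_dpo_jsonl(csv_path, jsonl_path):
    """Sketch of converting the train CSV into OpenAI preference (DPO) JSONL.

    Reads the problem / positive_response_text / negative_response_text columns
    named in this README; the record schema here is an assumption based on
    OpenAI's preference fine-tuning format.
    """
    with open(csv_path, newline="") as src, open(jsonl_path, "w") as dst:
        for row in csv.DictReader(src):
            record = {
                "input": {"messages": [{"role": "user", "content": row["problem"]}]},
                "preferred_output": [
                    {"role": "assistant", "content": row["positive_response_text"]}
                ],
                "non_preferred_output": [
                    {"role": "assistant", "content": row["negative_response_text"]}
                ],
            }
            dst.write(json.dumps(record) + "\n")
```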

Data Source Attribution

The public benchmark and source tables in this repository build on the following benchmark datasets and competition materials:

  1. AIME
    • Source: AoPS Wiki AIME Problems and Solutions (2024-2025)
    • License: No standard open-data license was identified for the AoPS AIME pages. Because we do not treat these materials as openly licensed benchmark data, the data files keep references to the source pages rather than actual AIME solutions. Use the materialization step described above to obtain complete datasets.
  2. OlympiadBench
  3. OlympicArena
  4. Omni-MATH

Please refer to the respective sources for detailed licensing terms.

Notes

  • All code in this public release uses repo-relative paths. No absolute paths tied to a local username are required.
  • API keys are read from environment variables only.
  • AIME-backed rows intentionally keep AoPS solution-page links because no standard open-data license was identified for those AoPS pages. To get complete datasets, run tim-materialize-aime-solutions once and then use the resulting files for benchmark/eval/train workflows.
  • This repository is licensed under the BSD 3-Clause License. See LICENSE.

AI-Generated Content Disclaimer

Parts of our datasets, including preference-style training artifacts, were generated using OpenAI's GPT-4.1 model.

  • The content adheres to OpenAI's Usage Policies.
  • Outputs were reviewed and refined to align with the dataset's objectives.
  • No prohibited use cases or violations of OpenAI's terms are present in this dataset.

Please ensure compliance with OpenAI's policies if redistributing or modifying this dataset.

Usage Guidelines

  • Use this dataset for research and educational purposes.
  • Commercial use may require additional permissions depending on source licenses.

Citation

If you use this repository, please cite our work:

@article{bayat2025proof,
  title={From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models},
  author={Bayat, Farima Fatahi and Pezeshkpour, Pouya and Hruschka, Estevam},
  journal={arXiv preprint arXiv:2511.10899},
  year={2025}
}

Disclosure

Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses, which may be CC license and Apache 2.0 license. In the event of conflicts between Megagon Labs, Inc., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party's license conditions. All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. 
If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.
