microsoft/bapo-cot

Code for "Reasoning about Reasoning"

This repository contains code for reproducing the reasoning-length experiments in:

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, and Jennifer Neville. Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs. arXiv, 2026. https://arxiv.org/abs/2602.02909

The code runs reasoning and non-reasoning models on synthetic tasks of varying size and records both the number of reasoning tokens each model uses to solve the tasks and its accuracy on them. See the paper for details about the experimental design and results. OpenAI and Google API keys are needed to run the experiments.
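The core measurement can be pictured as a per-size summary: for each instance size n, average the reasoning tokens spent and the fraction of correct answers. The sketch below is illustrative only. It assumes each trial is a dict with the hypothetical keys n, reasoning_tokens, and correct, and the function name summarize_by_n is likewise hypothetical; the repository's actual record format may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Iterable


def summarize_by_n(records: Iterable[dict[str, Any]]) -> dict[int, dict[str, float]]:
    """Group trials by instance size n and report mean reasoning tokens and accuracy.

    The keys 'n', 'reasoning_tokens', and 'correct' are hypothetical; adapt them
    to whatever fields the real result files contain.
    """
    grouped: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        grouped[record["n"]].append(record)

    summary: dict[int, dict[str, float]] = {}
    for n in sorted(grouped):
        trials = grouped[n]
        summary[n] = {
            # Average number of reasoning tokens spent per trial at this size.
            "mean_reasoning_tokens": mean(t["reasoning_tokens"] for t in trials),
            # Fraction of trials answered correctly at this size.
            "accuracy": mean(1.0 if t["correct"] else 0.0 for t in trials),
        }
    return summary


if __name__ == "__main__":
    # Toy records for illustration only; these are not results from the paper.
    toy = [
        {"n": 4, "reasoning_tokens": 100, "correct": True},
        {"n": 4, "reasoning_tokens": 300, "correct": False},
        {"n": 8, "reasoning_tokens": 500, "correct": True},
    ]
    for n, stats in summarize_by_n(toy).items():
        print(n, stats)
```

Plotting mean_reasoning_tokens against n is one simple way to see how reasoning length scales with instance size.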

Setup

  • Python 3.11+ recommended (we used 3.11.14)
  • Install the packages listed in requirements.txt (e.g., in a venv)
  • Set up API keys in environment variables (OPENAI_API_KEY, GOOGLE_API_KEY)
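Before spending API credits, it can help to confirm the setup steps above. The following is a minimal sketch of a hypothetical check_setup.py (not part of the repository) that checks the interpreter version against the recommended 3.11 and reports which of the two environment variables are unset or empty. It does not verify that the keys themselves are valid.

```python
import os
import sys
from typing import Mapping, Optional, Sequence

# Environment variables named in the setup instructions.
REQUIRED_KEYS: Sequence[str] = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required API key variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def python_version_ok(version: Optional[tuple[int, ...]] = None) -> bool:
    """True if the interpreter meets the recommended Python 3.11+."""
    version = tuple(sys.version_info[:3]) if version is None else version
    return version >= (3, 11)


if __name__ == "__main__":
    problems = []
    if not python_version_ok():
        problems.append(f"Python 3.11+ recommended, found {sys.version.split()[0]}")
    missing = missing_api_keys()
    if missing:
        problems.append("missing environment variables: " + ", ".join(missing))
    if problems:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("Setup check failed: " + "; ".join(problems))
    print("Setup looks OK.")
```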

Note: the code also includes support for the Anthropic API, but this is not necessary to reproduce the experiments reported in the paper.

Run experiments

To run all experiments:

python experiments.py --config config.json
  • Results are written to files matching *_results_combined.json.
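To inspect the output files described above, a small loader is enough. This is a sketch, not repository code: load_combined_results is a hypothetical helper, and only the file-name pattern comes from this README. It makes no assumption about the JSON structure inside each file.

```python
import json
from pathlib import Path
from typing import Any, Union


def load_combined_results(directory: Union[str, Path] = ".") -> dict[str, Any]:
    """Load every *_results_combined.json file in `directory`, keyed by file name.

    The JSON content is returned as-is; its schema is not assumed here.
    A missing directory simply yields an empty dict.
    """
    results: dict[str, Any] = {}
    for path in sorted(Path(directory).glob("*_results_combined.json")):
        with path.open(encoding="utf-8") as f:
            results[path.name] = json.load(f)
    return results


if __name__ == "__main__":
    loaded = load_combined_results()
    if not loaded:
        print("No *_results_combined.json files found; run experiments.py first.")
    for name, data in loaded.items():
        print(name, type(data).__name__)
```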

The total API cost of running all experiments is roughly $1,000 USD (split about evenly between OpenAI and Google). This can be reduced by lowering the number of trials or capping the instance size n.
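Why do both levers help? Under a simple cost model, each call costs input tokens times the input price plus output tokens times the output price, and a run repeats this for every trial at every instance size. The sketch below assumes reasoning tokens are billed at the output-token rate, which is our understanding of how the OpenAI and Gemini APIs charge for them; check current provider pricing. Pricing, call_cost_usd, and run_cost_usd are hypothetical names, and all prices and token counts must be supplied by the user.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (placeholder values supplied by the user)."""
    usd_per_m_input: float
    usd_per_m_output: float  # assumed to also apply to reasoning tokens


def call_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one API call; output_tokens should include any reasoning tokens."""
    return (
        input_tokens * pricing.usd_per_m_input
        + output_tokens * pricing.usd_per_m_output
    ) / 1_000_000


def run_cost_usd(
    avg_tokens_by_n: Mapping[int, Tuple[int, int]],
    trials: int,
    pricing: Pricing,
    max_n: Optional[int] = None,
) -> float:
    """Estimate a run's cost: trials x per-call cost, summed over instance sizes n.

    avg_tokens_by_n maps n -> (average input tokens, average output tokens).
    Sizes above max_n are skipped, mirroring the option of capping n.
    """
    total = 0.0
    for n, (input_tokens, output_tokens) in avg_tokens_by_n.items():
        if max_n is not None and n > max_n:
            continue
        total += trials * call_cost_usd(input_tokens, output_tokens, pricing)
    return total
```

In this model the cost is linear in the number of trials, so halving the trials halves the estimate. Capping n drops the largest instances, which are the most expensive whenever reasoning length grows with n.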

Generating plots

  • Ensure the combined result JSONs (*_results_combined.json) are present.
  • Run python plot.py to save the plots as PDFs in plots/.
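The two plotting steps above can be wrapped in a small guard script. This is a hypothetical sketch, not part of the repository: ready_to_plot only checks that at least one combined result file exists, and the script then runs plot.py with the current interpreter. Creating plots/ up front is an assumption that is harmless if plot.py already creates it.

```python
import subprocess
import sys
from pathlib import Path
from typing import Union


def ready_to_plot(directory: Union[str, Path] = ".") -> bool:
    """True if at least one *_results_combined.json file exists in `directory`."""
    return any(Path(directory).glob("*_results_combined.json"))


if __name__ == "__main__":
    if not ready_to_plot():
        sys.exit("No *_results_combined.json files found; run experiments.py first.")
    # Harmless if plot.py also creates the directory.
    Path("plots").mkdir(exist_ok=True)
    # Use the same interpreter (and therefore the same venv) as this script.
    subprocess.run([sys.executable, "plot.py"], check=True)
```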

Transparency documentation

Intended uses

This repository is intended to be used for academic research, benchmarking, and comparative analysis of token usage by reasoning and non-reasoning LLMs. This code is being shared with the research community to facilitate reproduction of our results and foster further research in this area.

Out-of-scope uses and limitations

This code was written for research purposes and does not conform to production-grade standards. The experiments are limited in scope to measuring token usage and accuracy on synthetic problem-solving tasks. The prompts are written in English, so the results may not reflect performance in other languages.

License

This code is released under an MIT License (see LICENSE).

Nothing disclosed here, including the Out of Scope Uses section, should be interpreted as or deemed a restriction or modification to the license the code is released under.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Contact

This research was conducted by members of Microsoft Research. If you have suggestions or questions, please contact Kiran Tomlinson at kitomlinson@microsoft.com.
