This repository contains code for reproducing the reasoning length experiments in:
Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, and Jennifer Neville. Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs. arXiv, 2026. https://arxiv.org/abs/2602.02909
The code runs reasoning and non-reasoning models on synthetic tasks of varying size and records how many reasoning tokens each model uses to solve the tasks, along with its accuracy. See the paper for details about the experimental design and results. OpenAI and Google API keys are needed to run the experiments.
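As a rough illustration of the two quantities being measured, the sketch below aggregates accuracy and mean reasoning-token counts per model and instance size. The record fields (`model`, `n`, `correct`, `reasoning_tokens`) and the function name `summarize` are hypothetical and chosen only for this example; they are not necessarily the schema or code used in this repository.

```python
from collections import defaultdict
from statistics import mean


def summarize(trials):
    """Group hypothetical trial records by (model, n) and report
    accuracy and mean reasoning-token count for each group."""
    groups = defaultdict(list)
    for trial in trials:
        groups[(trial["model"], trial["n"])].append(trial)

    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
            "mean_reasoning_tokens": mean(r["reasoning_tokens"] for r in rows),
            "trials": len(rows),
        }
    return summary


# Made-up records for illustration only; these are not results from the paper.
example = [
    {"model": "model-a", "n": 10, "correct": True, "reasoning_tokens": 120},
    {"model": "model-a", "n": 10, "correct": False, "reasoning_tokens": 80},
    {"model": "model-a", "n": 20, "correct": True, "reasoning_tokens": 300},
]

for key, stats in summarize(example).items():
    print(key, stats)
```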
- Python 3.11+ recommended (we used 3.11.14)
- Install requirements in requirements.txt (e.g., with venv)
- Set up API keys in environment variables (OPENAI_API_KEY, GOOGLE_API_KEY)
Note: the code also includes support for the Anthropic API, but this is not necessary to reproduce the experiments reported in the paper.
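Before launching a long run, it can help to confirm that both keys are visible to Python. The following is a minimal sketch; the helper name `missing_api_keys` is made up for this example and is not part of the repository.

```python
import os

# The two environment variables named in the setup steps above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```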
To run all experiments:
python experiments.py --config config.json
- Outputs land in *_results_combined.json.
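To see which combined result files a run produced, the output pattern can be matched with a glob. This sketch assumes the files are written to the directory it is run from, which the text above does not state; the helper name `find_combined_results` is hypothetical.

```python
from pathlib import Path


def find_combined_results(directory="."):
    """Return paths matching *_results_combined.json in `directory`, sorted by name.

    Assumption: the combined result files sit directly in `directory`.
    """
    return sorted(Path(directory).glob("*_results_combined.json"))


if __name__ == "__main__":
    paths = find_combined_results()
    if not paths:
        print("No combined result files found.")
    for path in paths:
        print(path)
```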
The total cost of running the experiments is ~$1000 USD (roughly evenly split between OpenAI and Google). This can be reduced by lowering the number of trials or limiting the instance size n.
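For rough budgeting, the sketch below scales the ~$1000 figure by the fraction of trials kept. It assumes cost grows linearly with the number of trials, which is an assumption of this example and is not stated by the authors. The effect of limiting n depends on how token usage grows with instance size, so it is not modeled here.

```python
# Approximate total quoted above, in USD.
BASELINE_COST_USD = 1000.0


def estimated_cost(trial_fraction, baseline=BASELINE_COST_USD):
    """Estimate cost when running only a fraction of the original trials.

    Assumption: cost scales linearly with trial count. Treat the result
    as a ballpark figure, not a quote.
    """
    if not 0 < trial_fraction <= 1:
        raise ValueError("trial_fraction must be in (0, 1]")
    return baseline * trial_fraction


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.25):
        print(f"{fraction:.0%} of trials: ~${estimated_cost(fraction):.0f}")
```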
- Ensure the combined result JSONs are present.
- Save PDFs to plots/: python plot.py
This repository is intended to be used for academic research, benchmarking, and comparative analysis of token usage by reasoning and non-reasoning LLMs. This code is being shared with the research community to facilitate reproduction of our results and foster further research in this area.
This code was written for research purposes and does not conform to production-grade standards. The experiments are limited in scope to measuring token usage and accuracy on synthetic problem-solving tasks. The prompts are written in English and may not represent performance in other languages.
This code is released under an MIT License (see LICENSE).
Nothing disclosed here, including the Out of Scope Uses section, should be interpreted as or deemed a restriction or modification to the license the code is released under.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
This research was conducted by members of Microsoft Research. If you have suggestions or questions, please contact Kiran Tomlinson at kitomlinson@microsoft.com.