Code for "Reasoning about Reasoning"

This repository contains code for reproducing the reasoning length experiments in:

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, and Jennifer Neville. Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs. arXiv, 2026. https://arxiv.org/abs/2602.02909

The code runs reasoning and non-reasoning models on synthetic tasks of varying size and counts how many reasoning tokens these models use to solve the tasks, as well as their accuracy on the tasks. See the paper for details about the experimental design and results. OpenAI and Google API keys are needed to run the experiments.
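As a rough illustration of the two quantities measured (reasoning token counts and accuracy, each as a function of task size), the sketch below aggregates per-trial records by instance size. The record layout, with keys n, reasoning_tokens, and correct, and the function name summarize_trials are assumptions made for this example; they are not taken from experiments.py or results.py.

```python
from collections import defaultdict
from statistics import mean


def summarize_trials(trials):
    """Group trial records by instance size n and summarize each group.

    Each record is assumed (hypothetically) to be a dict with keys
    "n", "reasoning_tokens", and "correct".
    """
    by_size = defaultdict(list)
    for trial in trials:
        by_size[trial["n"]].append(trial)

    # Report groups in order of increasing instance size.
    return {
        n: {
            "mean_reasoning_tokens": mean(t["reasoning_tokens"] for t in group),
            "accuracy": mean(1.0 if t["correct"] else 0.0 for t in group),
        }
        for n, group in sorted(by_size.items())
    }
```

Calling summarize_trials on a list of such records returns one entry per instance size, which is the shape needed to plot token usage or accuracy against n.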

Setup

Python 3.11+ recommended (we used 3.11.14)
Install requirements in requirements.txt (e.g., with venv)
Set up API keys in environment variables (OPENAI_API_KEY, GOOGLE_API_KEY)

Note: the code also includes support for the Anthropic API, but this is not necessary to reproduce the experiments reported in the paper.
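Because a full run is expensive, it can help to confirm that both keys are visible to Python before starting. The following is a minimal sketch of such a check; the helper name missing_api_keys is hypothetical and is not part of this repository.

```python
import os

# The two environment variables named in the setup steps above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling missing_api_keys() with no arguments inspects the current environment; an empty list means both keys are set.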

Run experiments

To run all experiments:

python experiments.py --config config.json

Results are written to files matching *_results_combined.json.

The total cost of running all experiments is approximately $1,000 USD, split roughly evenly between OpenAI and Google. To reduce it, lower the number of trials or limit the instance size n.

Generating plots

Ensure the combined result JSONs are present.
Run python plot.py to save the plots as PDFs in plots/.
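To check the first step, the sketch below lists and loads any files matching *_results_combined.json. It assumes the files sit in the directory you pass in (the current directory by default) and makes no assumption about the JSON schema; both function names are hypothetical and not part of plot.py or results.py.

```python
import glob
import json
import os


def find_combined_results(directory="."):
    """Return sorted paths of files matching *_results_combined.json."""
    pattern = os.path.join(directory, "*_results_combined.json")
    return sorted(glob.glob(pattern))


def load_combined_results(directory="."):
    """Load each combined result file into a dict keyed by file name."""
    results = {}
    for path in find_combined_results(directory):
        with open(path, encoding="utf-8") as f:
            results[os.path.basename(path)] = json.load(f)
    return results
```

An empty list from find_combined_results() suggests the experiments have not finished writing their outputs yet, or that the files are in a different directory.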

Transparency documentation

Intended uses

This repository is intended to be used for academic research, benchmarking, and comparative analysis of token usage by reasoning and non-reasoning LLMs. This code is being shared with the research community to facilitate reproduction of our results and foster further research in this area.

Out-of-scope uses and limitations

This code was written for research purposes and does not conform to production-grade standards. The experiments are limited in scope to measuring token usage and accuracy on synthetic problem-solving tasks. The prompts are written in English and may not represent performance in other languages.

License

This code is released under an MIT License (see LICENSE).

Nothing disclosed here, including the "Out-of-scope uses and limitations" section, should be interpreted as or deemed a restriction or modification to the license the code is released under.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Contact

This research was conducted by members of Microsoft Research. If you have suggestions or questions, please contact Kiran Tomlinson at kitomlinson@microsoft.com.
