This repository includes code and data to reproduce all results: dataset construction, single-pass baselines, InfoQA, and the fitting scripts that align empirical accuracy with our theoretical capacity curves.
```
InfoQA/
├── datasets/                          # Synthetic benchmark (controllable hops & noise)
│   ├── 1hop/ 2hop/ 3hop/ 4hop/        # JSON contexts for each hop and length bucket
│   ├── general_noise.json
│   ├── multi_hop_chain_company_stats.json
│   └── syn_data.py                    # Script to (re)generate the datasets
│
├── fitting/
│   └── draw_all.py                    # Fit empirical results to theory & plot curves
│
├── utils/
│   ├── utils.py
│   ├── infoqa_wo_decom.py             # Ablation: without decomposition
│   ├── infoqa_wo_pru.py               # Ablation: without pruning
│   ├── MHQA_direct.py                 # Direct prompting
│   ├── MHQA_cot.py                    # Chain-of-Thought
│   ├── MHQA_SC.py                     # Self-Consistency
│   ├── MHQA_SFRF.py                   # Self-Refine
│   ├── MHQA_ReAct.py                  # ReAct
│   ├── MHQA_plan_and_solve.py         # Plan-and-Solve
│   ├── MHQA_self_ask.py               # Self-Ask
│   └── MHQA_infoqa.py                 # InfoQA (proof-of-concept)
│
├── run_all.sh                         # Reproduce main experiments end-to-end
├── requirements.txt                   # Python dependencies
└── README.md
```
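The `datasets/` tree above pairs each hop count with distractor noise. As a rough illustration of what "controllable hops & noise" means, here is a toy generator for a chain-plus-noise instance; this is a hypothetical sketch, not the actual logic of `datasets/syn_data.py`, and the real JSON schema may differ:

```python
import random

def make_chain(num_hops, noise_facts, seed=0):
    """Build a toy multi-hop chain plus distractor noise facts.

    Hypothetical sketch only; the repository's real generator is
    datasets/syn_data.py and its record format may differ.
    """
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(num_hops + 1)]
    # Gold reasoning chain: E0 -> E1 -> ... -> E_num_hops
    gold = [f"{a} is linked to {b}." for a, b in zip(entities, entities[1:])]
    # Irrelevant distractor facts mixed into the context
    noise = [f"N{i} is linked to N{i + 1}." for i in range(noise_facts)]
    context = gold + noise
    rng.shuffle(context)
    question = (
        f"Starting from {entities[0]}, which entity is reached "
        f"after {num_hops} hops?"
    )
    return {
        "question": question,
        "context": context,
        "answer": entities[-1],
        "hops": num_hops,
    }

sample = make_chain(num_hops=3, noise_facts=5)
```

Varying `num_hops` and `noise_facts` is how such a benchmark controls reasoning depth and context difficulty independently.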
- Python 3.10
- Create a virtual environment and install dependencies:
```
conda create -n infoqa python=3.10
conda activate infoqa
pip install -r requirements.txt
```

- (Optional) Regenerate the Synthetic Benchmark:

```
cd datasets
python syn_data.py
cd ..
```

- Reproduce the main experiments end-to-end:

```
bash run_all.sh
```

Default settings match the paper (temperature = 0.2, max generation length = 4096). Outputs, including metrics and logs, are written to method-specific folders (see run_all.sh).
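If you adapt the method scripts to a different backend, the decoding defaults above are the two settings to carry over. A minimal sketch of building a chat-completions-style request with those defaults; the wrapper function and the model name are placeholders, not the repo's actual client code:

```python
# Decoding defaults reported above (temperature = 0.2, max generation
# length = 4096). GEN_CONFIG and build_request are hypothetical helpers
# for illustration only.
GEN_CONFIG = {"temperature": 0.2, "max_tokens": 4096}

def build_request(model, prompt, config=GEN_CONFIG):
    """Assemble a chat-style request payload with the paper's defaults."""
    return {
        "model": model,  # placeholder model name; substitute your own
        "messages": [{"role": "user", "content": prompt}],
        **config,
    }

req = build_request("your-model-name", "Who founded FooCorp?")
```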
- Fit the empirical results to the theoretical capacity curves and plot:

```
cd fitting
python draw_all.py
cd ..
```
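For intuition on what the fitting step does, the sketch below fits a simple exponential decay `acc(L) = a * exp(-L / c)` to synthetic accuracy-vs-length points via linear regression on log-accuracy. This is an illustrative stand-in only; the actual theoretical curve and fitting routine are in `fitting/draw_all.py`:

```python
import math

def fit_exponential(lengths, accs):
    """Least-squares fit of acc = a * exp(-L / c) by regressing log(acc) on L.

    Returns (a, c). Illustrative only; the repo's real curve form may differ.
    """
    ys = [math.log(a) for a in accs]
    n = len(lengths)
    mx = sum(lengths) / n
    my = sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(lengths, ys))
        / sum((x - mx) ** 2 for x in lengths)
    )
    intercept = my - slope * mx
    return math.exp(intercept), -1.0 / slope  # prefactor a, capacity scale c

# Synthetic "empirical" accuracies generated from a known curve (a=0.9, c=10000)
lengths = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]
accs = [0.9 * math.exp(-L / 10000.0) for L in lengths]
a_hat, c_hat = fit_exponential(lengths, accs)
```

Because the synthetic points lie exactly on the curve, the fit recovers the generating parameters, which is the same sanity check one would run before fitting real accuracy measurements.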