CogBench: Benchmarking Cognitive Alignment of Large Language Models in Educational Question Answering
CogBench is a benchmark for assessing the cognitive alignment of large language models (LLMs) in educational question answering.
- Benchmark: 2,100 K–12 mathematics questions, each with multiple valid, cognition-differentiated solutions
- Average 2.16 solutions per question; 3.2 curriculum knowledge components per question
- Grade coverage: Primary 40%, Middle 35%, High 25%
- 3 cognition-aware QA tasks; 3 complementary metrics (CA, KC, KD)
- Curriculum-Aware Knowledge Graph (CAKG) aligned to grade levels and solution strategies
- Evaluated 11 LLMs (open-source and proprietary) via APIs (Sept–Dec 2025)
- Key findings:
- Large gap between standard accuracy (up to 0.942) and cognitive alignment under unconstrained QA (best CA 0.534, KC 0.604)
- Grade-constrained prompting improves alignment (best CA 0.560, KC 0.753; KD up to 0.790)
- Knowledge-constrained prompting often reduces alignment due to activation of higher-level parametric patterns
- Fine-tuning (SFT + DPO) improves CA (0.47→0.63) and KC (0.54→0.68) with slight drops in ACC (0.88→0.83) and KD (0.72→0.61)
- Automatic metrics correlate well with expert human judgments on consistency and diversity
CogBench is built using a Multi-solution–Alignment–Evaluation pipeline:
- **Multi-solution Generation**
- Multi-turn sampling with controlled decoding (temperature, top-k, nucleus) produces diverse, correct solution traces per question.
- Only answers with correct final results are retained (validated against gold answers).
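The sampling-and-filtering loop above can be sketched as follows. Here `solver` is a hypothetical stand-in for an LLM call with a decoding temperature, and the stub only illustrates the retain-if-correct, deduplicated logic, not the actual model:

```python
def sample_solutions(question, gold_answer, solver, n_turns=8, temps=(0.2, 0.7, 1.0)):
    """Multi-turn sampling with varied decoding settings (sketch).

    `solver(question, temperature)` is a hypothetical stand-in for an LLM call
    returning (solution_text, final_answer).
    """
    kept = []
    for turn in range(n_turns):
        temperature = temps[turn % len(temps)]  # cycle through decoding settings
        text, answer = solver(question, temperature=temperature)
        # Retain only traces whose final result matches the gold answer,
        # and skip verbatim duplicates of already-kept traces.
        if answer == gold_answer and all(text != t for t, _ in kept):
            kept.append((text, temperature))
    return kept

def stub_solver(question, temperature):
    # Toy stand-in: a higher temperature yields an alternative (still correct) trace.
    if temperature < 0.5:
        return ("add the numbers directly", 12)
    return ("decompose 5 into 3 + 2, then add", 12)

solutions = sample_solutions("7 + 5 = ?", 12, stub_solver, n_turns=4)
```

In the real pipeline the per-turn decoding settings would also vary top-k and nucleus (top-p) parameters, and the anchor-based attenuation described next would steer later turns away from already-used concepts.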
- **Probability Attenuation for Diversity**
- Identify anchor tokens (key concepts) from existing solutions.
- Use semantic inference (via embedding space and Moore–Penrose pseudoinverse) to attenuate probabilities of previously used/semantically similar tokens during decoding.
- Encourage discovery of novel, valid reasoning paths relying on different knowledge/strategies.
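A minimal sketch of the attenuation step, penalizing candidate tokens in proportion to their closest-anchor similarity. The paper's pseudoinverse-based semantic inference is simplified here to plain cosine similarity over toy embeddings, and `alpha` is a hypothetical attenuation strength:

```python
import numpy as np

def attenuate_logits(logits, token_emb, anchor_emb, alpha=4.0):
    """Penalize vocabulary tokens semantically close to anchor tokens
    drawn from earlier solutions (simplified sketch).

    logits:     (V,) next-token logits
    token_emb:  (V, d) embedding per vocabulary token
    anchor_emb: (A, d) embeddings of anchor (key-concept) tokens
    """
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    sim = t @ a.T                                   # (V, A) cosine similarities
    penalty = np.clip(sim.max(axis=1), 0.0, None)   # closest-anchor similarity
    return logits - alpha * penalty                 # attenuated logits

logits = np.zeros(3)
token_emb = np.eye(3)        # toy 3-token vocabulary
anchor_emb = token_emb[:1]   # anchor = token 0 (an already-used concept)
new_logits = attenuate_logits(logits, token_emb, anchor_emb)
```

After attenuation, tokens matching or resembling already-used concepts are sampled less often, nudging later decoding turns toward novel but still valid reasoning paths.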
- **Curriculum-Aware Knowledge Graph (CAKG)**
- Extract K–12 cognition-aware math knowledge from official standards.
- Organize as cumulative, grade-tagged subgraphs that encode procedural strategies and reasoning patterns.
- Emphasize fine-grained, solution strategy–level knowledge beyond topic hierarchies.
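As an illustration, grade-tagged triples and cumulative subgraphs might be modeled like this; the schema, field names, and sample triples are assumptions for exposition, not the CAKG's actual format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """One CAKG knowledge triple, tagged with the grade that introduces it (illustrative schema)."""
    head: str
    relation: str
    tail: str
    grade: int  # grade level at which this knowledge component is introduced

# Hypothetical triples at the solution-strategy level, not just topic level.
cakg = [
    Triple("addition", "strategy", "count on", 1),
    Triple("addition", "strategy", "decompose and regroup", 2),
    Triple("linear equation", "strategy", "isolate the variable", 7),
]

def subgraph_up_to(triples, grade):
    """Cumulative subgraph: a grade-g learner has all knowledge tagged <= g."""
    return [t for t in triples if t.grade <= grade]
```

The cumulative filter reflects that curriculum knowledge accumulates across grades, so a middle-school subgraph contains all primary-school components as well.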
- **Solution–Cognition Alignment**
- Encode solutions and CAKG triples (e.g., Qwen3 Embedding) and retrieve top-k relevant knowledge by cosine similarity.
- Induce candidate grade levels from retrieved triples.
- Human-in-the-loop expert validation refines solution–knowledge–grade mappings.
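The retrieval step can be sketched with toy vectors standing in for Qwen3 Embedding outputs; candidate grade levels would then be induced from the grade tags of the retrieved triples:

```python
import numpy as np

def topk_triples(solution_vec, triple_vecs, k=2):
    """Rank CAKG triple embeddings by cosine similarity to a solution embedding
    and return the indices and scores of the top-k matches."""
    s = solution_vec / np.linalg.norm(solution_vec)
    t = triple_vecs / np.linalg.norm(triple_vecs, axis=1, keepdims=True)
    sims = t @ s                      # cosine similarity per triple
    order = np.argsort(-sims)[:k]     # best-first indices
    return order, sims[order]

solution_vec = np.array([1.0, 0.0])
triple_vecs = np.array([
    [0.0, 1.0],   # unrelated triple
    [1.0, 0.2],   # closely related triple
    [1.0, 1.0],   # partially related triple
])
order, sims = topk_triples(solution_vec, triple_vecs, k=2)
```

In the pipeline, the retrieved triples (and their grade tags) are only candidates; the human-in-the-loop expert pass is what finalizes each solution–knowledge–grade mapping.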
- **Cognition-Aware Evaluation**
- Tasks:
- Unconstrained QA: solve without cognitive cues (baseline behavior).
- Grade-Constrained QA: generate solutions tailored to a specified grade.
- Knowledge-Constrained QA: solve using only provided curriculum knowledge.
- Metrics:
- Cognitive Accuracy (CA): correct answers that also meet the target cognitive level.
- Knowledge Consistency (KC): adherence to grade-appropriate curriculum knowledge.
- Knowledge Divergence (KD): differentiation of knowledge usage across grades (pairwise Jaccard distance).
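The KD metric, as described, can be sketched as a mean pairwise Jaccard distance between the knowledge sets a model uses at different grades (averaging over grade pairs is an assumption here; the sample knowledge names are hypothetical):

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def knowledge_divergence(knowledge_by_grade):
    """Mean pairwise Jaccard distance of knowledge used across grade levels."""
    pairs = list(combinations(knowledge_by_grade.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

used = {
    "primary": {"count on", "decompose and regroup"},
    "middle":  {"decompose and regroup", "solve linear equation"},
}
kd = knowledge_divergence(used)   # one shared component out of three -> 2/3
```

A KD near 1 means the model draws on clearly different knowledge at different grades; a KD near 0 means it reuses the same knowledge regardless of the target grade.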
Dataset statistics:
- Sources: 1.2K Olympiad problems (from a public website) + 0.9K CMMath problems
- Coverage: Primary (Grades 1–6), Middle (7–9), High (10–12)
- Per-question: at least two solutions at different cognitive levels
- Generation base model for multi-solution sampling: Qwen3-30B-A3B
- Expert alignment: education experts verify solution–knowledge–grade mapping
- Reliability: high-quality, cognition-aware labels after expert review
The evaluation program is in the `evaluation` folder, and the metrics it uses are in the `metric` folder.
- Unconstrained: `response1_title_only`
- Grade-constrained: `response2_title_grade`
- Knowledge-constrained: `response3_title_knowledge`
python -m evaluation.response --model_name gpt-5-nano-2025-08-07
python -m evaluation.evaluate_response --model_name gpt-5-nano-2025-08-07
python -m evaluation.find_knowledge_used --model_name gpt-5-nano-2025-08-07
python -m evaluation.calculate_metrics --model_name gpt-5-nano-2025-08-07

Project Lead: ethanlu@mail.bnu.edu.cn
Dataset: https://huggingface.co/datasets/realEthanTLu/CogBench
Web Page: https://cogbench.lutong.space/
I would like to express my sincere gratitude to Jun Xue for his support and his contributions to part of the code in this work.
If you use CogBench or our construction framework, please cite:
@article{CogBench2026,
title={CogBench: Benchmarking Cognitive Alignment of Large Language Models in Educational Question Answering},
author={Tong Lu and Zhichun Wang and Yuanhao Sun and Yaoyu Zhou and Mingrui Li and Yiming Guan and Zhiyong Bai},
year={2026},
journal={Findings of ACL}
}