CogBench: Benchmarking Cognitive Alignment of Large Language Models in Educational Question Answering

CogBench is a benchmark for assessing the cognitive alignment capabilities of large language models (LLMs) in educational question answering.

🔥 Highlights

Benchmark: 2,100 K–12 mathematics questions, each with multiple valid, cognition-differentiated solutions
Average 2.16 solutions per question; 3.2 curriculum knowledge components per question
Grade coverage: Primary 40%, Middle 35%, High 25%
3 cognition-aware QA tasks; 3 complementary metrics (CA, KC, KD)
Curriculum-Aware Knowledge Graph (CAKG) aligned to grade levels and solution strategies
Evaluated 11 LLMs (open-source and proprietary) via APIs (Sept–Dec 2025)
Key findings:
- Large gap between standard accuracy (up to 0.942) and cognitive alignment under unconstrained QA (best CA 0.534, KC 0.604)
- Grade-constrained prompting improves alignment (best CA 0.560, KC 0.753; KD up to 0.790)
- Knowledge-constrained prompting often reduces alignment due to activation of higher-level parametric patterns
- Fine-tuning (SFT + DPO) improves CA (0.47→0.63) and KC (0.54→0.68) with slight drops in ACC (0.88→0.83) and KD (0.72→0.61)
- Automatic metrics correlate well with expert human judgments on consistency and diversity

🚀 Overview

CogBench is built using a Multi-solution–Alignment–Evaluation pipeline:

Multi-solution Generation
- Multi-turn sampling with controlled decoding (temperature, top-k, nucleus) produces diverse, correct solution traces per question.
- Only answers with correct final results are retained (validated against gold answers).
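
The retention step above can be sketched as follows. This is a minimal illustration, not the repository's code: the `solution` and `final_answer` keys, the helper names, and the exact-fraction normalization are all assumptions made for the example.

```python
from fractions import Fraction


def normalize_answer(text):
    """Canonical form of an answer: an exact Fraction if numeric, else lowercased text."""
    cleaned = text.strip()
    try:
        return Fraction(cleaned)  # accepts "3", "0.5", "1/2"
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()


def keep_correct(candidates, gold_answer):
    """Keep sampled solutions whose final answer matches the gold answer.

    `candidates` is a list of dicts with hypothetical keys
    'solution' (reasoning text) and 'final_answer' (string).
    Verbatim duplicate solutions are dropped.
    """
    gold = normalize_answer(gold_answer)
    kept, seen = [], set()
    for cand in candidates:
        if normalize_answer(cand["final_answer"]) != gold:
            continue  # wrong final result: discard
        if cand["solution"] in seen:
            continue  # identical trace already kept
        seen.add(cand["solution"])
        kept.append(cand)
    return kept


samples = [
    {"solution": "Use a ratio table.", "final_answer": "1/2"},
    {"solution": "Set up a linear equation.", "final_answer": "0.5"},
    {"solution": "Guess and check.", "final_answer": "0.6"},
    {"solution": "Use a ratio table.", "final_answer": "1/2"},
]
valid = keep_correct(samples, "0.5")
```

Comparing exact fractions rather than raw strings lets "1/2" and "0.5" count as the same answer; a real pipeline would likely need a richer answer parser.
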
Probability Attenuation for Diversity
- Identify anchor tokens (key concepts) from existing solutions.
- Use semantic inference (via embedding space and Moore–Penrose pseudoinverse) to attenuate probabilities of previously used/semantically similar tokens during decoding.
- Encourage discovery of novel, valid reasoning paths relying on different knowledge/strategies.
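
The attenuation idea can be illustrated with a small sketch. Note the substitution: instead of the pseudoinverse-based semantic inference described above, this example uses plain cosine similarity to the anchor tokens, and the `strength` parameter, the dictionary layout, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attenuate(probs, embeddings, anchors, strength=0.9):
    """Down-weight tokens that are semantically close to any anchor token.

    probs:      dict token -> next-token probability
    embeddings: dict token -> embedding vector
    anchors:    tokens (key concepts) already used in earlier solutions
    strength:   assumed knob in [0, 1]; 0 leaves probs unchanged
    """
    scaled = {}
    for token, p in probs.items():
        sim = max((cosine(embeddings[token], embeddings[a]) for a in anchors),
                  default=0.0)
        sim = min(max(sim, 0.0), 1.0)  # clamp so the factor stays in [0, 1]
        scaled[token] = p * (1.0 - strength * sim)
    total = sum(scaled.values())
    if total == 0.0:
        return dict(probs)  # everything suppressed: fall back to the original
    return {token: value / total for token, value in scaled.items()}


toy_probs = {"area": 0.5, "surface": 0.3, "ratio": 0.2}
toy_emb = {"area": [1.0, 0.0], "surface": [0.9, 0.1], "ratio": [0.0, 1.0]}
new_probs = attenuate(toy_probs, toy_emb, anchors=["area"])
```

After renormalization, probability mass shifts from the anchor and its near-synonym toward the unrelated token, which is the behavior the diversity step relies on.
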
Curriculum-Aware Knowledge Graph (CAKG)
- Extract K–12 cognition-aware math knowledge from official standards.
- Organize as cumulative, grade-tagged subgraphs that encode procedural strategies and reasoning patterns.
- Emphasize fine-grained, solution strategy–level knowledge beyond topic hierarchies.
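
One way to picture cumulative, grade-tagged subgraphs is shown below. The `Triple` fields, the example triples, and the reading of "cumulative" as "everything introduced at or below a grade" are assumptions for illustration, not the CAKG schema itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    grade: int  # grade at which this piece of knowledge is introduced


def cumulative_subgraph(triples, grade):
    """Return all triples available to a student at `grade` (this grade and below)."""
    return [t for t in triples if t.grade <= grade]


# Hypothetical triples; real CAKG content comes from official standards.
graph = [
    Triple("chicken-rabbit problem", "solved_by", "enumeration table", 3),
    Triple("chicken-rabbit problem", "solved_by", "assumption method", 5),
    Triple("chicken-rabbit problem", "solved_by", "system of linear equations", 7),
]

grade5 = cumulative_subgraph(graph, 5)
grade7 = cumulative_subgraph(graph, 7)
```

Here the same problem maps to different solution strategies by grade, which is the strategy-level granularity the graph is meant to capture.
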
Solution–Cognition Alignment
- Encode solutions and CAKG triples (e.g., Qwen3 Embedding) and retrieve top-k relevant knowledge by cosine similarity.
- Induce candidate grade levels from retrieved triples.
- Human-in-the-loop expert validation refines solution–knowledge–grade mappings.
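
The retrieval and grade-induction steps can be sketched as below, assuming embeddings have already been computed by some encoder. The tuple layout, function names, and the rule of returning the set of grades seen in the retrieved triples are illustrative assumptions; the toy vectors stand in for real embedding outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(solution_vec, triples, k=3):
    """Rank (text, grade, embedding) triples by similarity to the solution; keep k."""
    ranked = sorted(triples, key=lambda t: cosine(solution_vec, t[2]), reverse=True)
    return ranked[:k]


def candidate_grades(retrieved):
    """Candidate grade levels: the distinct grades of the retrieved triples, ascending."""
    return sorted({grade for _, grade, _ in retrieved})


solution_vec = [1.0, 0.0]
triples = [
    ("enumeration table", 3, [1.0, 0.1]),
    ("system of linear equations", 7, [0.8, 0.6]),
    ("matrix elimination", 10, [0.0, 1.0]),
]
top2 = retrieve_top_k(solution_vec, triples, k=2)
grades = candidate_grades(top2)
```

The candidates produced this way are only proposals; the expert validation step decides the final solution-knowledge-grade mapping.
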
Cognition-Aware Evaluation
- Tasks:
  - Unconstrained QA: solve without cognitive cues (baseline behavior).
  - Grade-Constrained QA: generate solutions tailored to a specified grade.
  - Knowledge-Constrained QA: solve using only provided curriculum knowledge.
- Metrics:
  - Cognitive Accuracy (CA): correct answers that also meet the target cognitive level.
  - Knowledge Consistency (KC): adherence to grade-appropriate curriculum knowledge.
  - Knowledge Divergence (KD): differentiation of knowledge usage across grades (pairwise Jaccard distance).
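
A rough sketch of how the three metrics could be computed is given below. Only the pairwise Jaccard distance for KD is stated above; the record fields, the per-response KC ratio, and the averaging choices are assumptions, and the code in the metrics folder may define them differently.

```python
from itertools import combinations


def cognitive_accuracy(records):
    """CA: share of responses that are correct AND meet the target cognitive level.

    Each record is a dict with hypothetical boolean keys 'correct' and 'meets_level'.
    """
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["correct"] and r["meets_level"])
    return hits / len(records)


def knowledge_consistency(used, allowed):
    """KC for one response: fraction of used knowledge that is grade-appropriate."""
    if not used:
        return 0.0  # assumed convention when no knowledge was detected
    return len(used & allowed) / len(used)


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knowledge_divergence(sets_by_grade):
    """KD: mean pairwise Jaccard distance between knowledge sets of different grades."""
    pairs = list(combinations(sets_by_grade.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


ca = cognitive_accuracy([
    {"correct": True, "meets_level": True},
    {"correct": True, "meets_level": False},
    {"correct": False, "meets_level": True},
    {"correct": True, "meets_level": True},
])
kc = knowledge_consistency({"ratio", "equation"}, {"ratio", "table"})
kd = knowledge_divergence({3: {"a", "b"}, 7: {"b", "c"}, 10: {"c", "d"}})
```

The second record shows why CA can sit well below standard accuracy: a correct answer reached with knowledge outside the target level does not count.
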

📊 Dataset & Annotations

Sources: 1.2K Olympiad problems (public website) + 0.9K CMMath problems
Coverage: Primary (Grades 1–6), Middle (7–9), High (10–12)
Per-question: at least two solutions at different cognitive levels
Generation base model for multi-solution sampling: Qwen3-30B-A3B
Expert alignment: education experts verify solution–knowledge–grade mapping
Reliability: high-quality, cognition-aware labels after expert review

📦 Usage

The evaluation program is in the evaluation folder, and the metrics it uses are in the metrics folder.

Three prompting modes:

Unconstrained: response1_title_only
Grade-constrained: response2_title_grade
Knowledge-constrained: response3_title_knowledge

Run the evaluation scripts:

python -m evaluation.response --model_name gpt-5-nano-2025-08-07
python -m evaluation.evaluate_response --model_name gpt-5-nano-2025-08-07
python -m evaluation.find_knowledge_used --model_name gpt-5-nano-2025-08-07
python -m evaluation.calculate_metrics --model_name gpt-5-nano-2025-08-07
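
For convenience, the four commands can be chained from a small wrapper. This is a sketch that assumes the stages are meant to run in the order listed, each depending on the previous one's output, and that it is launched from the repository root; `build_commands` and `run_pipeline` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys

# Module names taken from the commands above, in the listed order.
STAGES = [
    "evaluation.response",
    "evaluation.evaluate_response",
    "evaluation.find_knowledge_used",
    "evaluation.calculate_metrics",
]


def build_commands(model_name):
    """Return one argv list per stage, using the current Python interpreter."""
    return [
        [sys.executable, "-m", stage, "--model_name", model_name]
        for stage in STAGES
    ]


def run_pipeline(model_name):
    """Run every stage in order; stop at the first one that exits with an error."""
    for command in build_commands(model_name):
        subprocess.run(command, check=True)


commands = build_commands("gpt-5-nano-2025-08-07")
# To execute the full pipeline: run_pipeline("gpt-5-nano-2025-08-07")
```

Using `check=True` makes a failed stage raise `subprocess.CalledProcessError`, so later stages do not run on incomplete results.
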

📬 Contact

Project Lead: ethanlu@mail.bnu.edu.cn

Dataset: https://huggingface.co/datasets/realEthanTLu/CogBench

Web Page: https://cogbench.lutong.space/

🙏 Acknowledgements

I would like to express my sincere gratitude to Jun Xue for his support and for contributing part of the code for this paper.

📄 Citation

If you use CogBench or our construction framework, please cite:

@article{CogBench2026,
  title={CogBench: Benchmarking Cognitive Alignment of Large Language Models in Educational Question Answering},
  author={Tong Lu and Zhichun Wang and Yuanhao Sun and Yaoyu Zhou and Mingrui Li and Yiming Guan and Zhiyong Bai},
  year={2026},
  journal={Findings of ACL}
}
