CausalReasoningLLM

Causal reasoning benchmarks and tasks for large language models. For a detailed review, see Yang, L., Clivio, O., Shirvaikar, V., & Falck, F. (2023, December). A critical review of Causal Inference benchmarks for Large Language Models. In AAAI 2024 Workshop on "Are Large Language Models Simply Causal Parrots?".

Benchmark name	Paper title	Link / URL to data
ART	The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code	https://github.com/allenai/abductive-commonsense-reasoning
BIGbench empirical_judgments	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/empirical_judgments/task.json
BIGbench cause_and_effect	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cause_and_effect
BIGbench com2sense	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks
BIGbench crass_ai	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/crass_ai
BIGbench entailed_polarity	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/entailed_polarity
BIGbench entailed_polarity_hindi	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/entailed_polarity_hindi
BIGbench fantasy_reasoning	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/fantasy_reasoning
BIGbench figure_of_speech_detection	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/figure_of_speech_detection
BIGbench forecasting_subquestions	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/forecasting_subquestions
BIGbench goal_step_wikihow	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/goal_step_wikihow
BIGbench human_organs_senses	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/human_organs_senses
BIGbench indic_cause_and_effect	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/indic_cause_and_effect
BIGbench minute_mysteries_qa	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/minute_mysteries_qa
BIGbench winowhy	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/winowhy
BIGbench tellmewhy	Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.	https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/tellmewhy
Causal Chains	Causal Parrots: Large Language Models May Talk Causality But Are Not Causal	https://github.com/MoritzWillig/causalParrots/blob/master/media/causal_chains.pdf
Causal-TimeBank (CTB)	(Original) The Causal News Corpus: Annotating Causal Relations in Event Sentences from News (LLM) Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation.	https://github.com/paramitamirza/Causal-TimeBank
CausalBank	(Original) Guided generation of cause and effect (LLM) Boosting Language Models Reasoning with Chain-of-Knowledge Prompting	https://nlp.jhu.edu/causalbank/
CausalDiscovery (Causal Parrots)	Causal Parrots: Large Language Models May Talk Causality But Are Not Causal	https://github.com/MoritzWillig/causalParrots
CLadder	CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models	https://huggingface.co/datasets/causalnlp/CLadder
COPA	(Original) Choice of plausible alternatives: An evaluation of commonsense causal reasoning. (LLM) Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation.	https://people.ict.usc.edu/~gordon/copa.html#:~:text=An%20evaluation%20of%20commonsense%20causal,sets%20of%20500%20questions%20each.
Corr2Cause	Can Large Language Models Infer Causation from Correlation?	https://github.com/causalNLP/corr2cause
Counterfactual reasoning	Counterfactual reasoning: Do Language Models need world knowledge for causal inference?	https://github.com/goldengua/Counterfactual_Inference_LM/tree/main/dataset
DREAM	DREAM: A challenge data set and models for dialogue-based reading comprehension	https://paperswithcode.com/dataset/dream
e-CARE	Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation.	https://github.com/Waste-Wood/e-CARE
EventStoryLine v0.9 (ESC)	(Original) The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction (LLM) Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation.	https://github.com/cltl/EventStoryLine
FCR	Towards fine-grained causal reasoning and QA	https://github.com/YangLinyi/Fine-grained-Causal-Reasoning
FinCausal	Financial document causality detection shared task (FinCausal 2020)	https://wp.lancs.ac.uk/cfie/fincausal2020/
Intuitive Physics	Causal Parrots: Large Language Models May Talk Causality But Are Not Causal	https://github.com/MoritzWillig/causalParrots/blob/master/media/intuitive_physics.pdf
Knowledge Base Fact Embeddings	Causal Parrots: Large Language Models May Talk Causality But Are Not Causal
LogiQA	A challenge dataset for machine reading comprehension with logical reasoning	https://github.com/lgw863/LogiQA-dataset
MAVEN-ERE	(Original) MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction (LLM) Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation.	https://github.com/THU-KEG/MAVEN-ERE
Natural World Chain	Causal Parrots: Large Language Models May Talk Causality But Are Not Causal	https://github.com/MoritzWillig/causalParrots/blob/master/media/causal_world.pdf
Neuropathic-pain-diagnosis	(Original) Neuropathic pain diagnosis simulator for causal discovery algorithm evaluation (LLM) Causal-discovery performance of ChatGPT in the context of neuropathic pain diagnosis.	https://github.com/TURuibo/Neuropathic-Pain-Diagnosis-Simulator
NLP tasks for cause and effects	Super-natural instructions: Generalization via declarative instructions on 1600+ NLP tasks	https://github.com/allenai/natural-instructions
RACE	RACE: Large-scale reading comprehension dataset from examinations	https://www.cs.cmu.edu/~glai1/data/race/
TimeTravel	The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code	https://github.com/qkaren/Counterfactual-StoryRW
Tuebingen cause-effect pairs dataset	(Original) Distinguishing cause from effect using observational data: methods and benchmarks (LLM) Causal Reasoning and Large Language Models: Opening a New Frontier for Causality	https://github.com/amit-sharma/chatgpt-causality-pairs/tree/main
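
The table above is tab-separated (benchmark name, paper title, link), so it can be read programmatically, for example to pick out the benchmarks hosted on GitHub. The following is a minimal sketch, not part of this repository: `TABLE` is a hand-copied excerpt of three rows, and `load_benchmarks` is a hypothetical helper name. Rows without a link, such as Knowledge Base Fact Embeddings, get `None` as their URL.

```python
import csv
import io

# Hypothetical excerpt: the header plus three rows copied from the table above.
TABLE = (
    "Benchmark name\tPaper title\tLink / URL to data\n"
    "CLadder\tCLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models"
    "\thttps://huggingface.co/datasets/causalnlp/CLadder\n"
    "e-CARE\tIs ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation."
    "\thttps://github.com/Waste-Wood/e-CARE\n"
    "Knowledge Base Fact Embeddings"
    "\tCausal Parrots: Large Language Models May Talk Causality But Are Not Causal\n"
)


def load_benchmarks(text):
    """Parse tab-separated rows into dicts; a missing link becomes None."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        # Pad short rows so that a row without a link still unpacks cleanly.
        fields = fields + [None] * (3 - len(fields))
        name, paper, url = fields[:3]
        rows.append({"name": name, "paper": paper, "url": url})
    return rows


benchmarks = load_benchmarks(TABLE)

# Example filter: names of benchmarks whose data link points at GitHub.
github_hosted = [b["name"] for b in benchmarks if b["url"] and "github.com" in b["url"]]
print(github_hosted)
```

To run this over the full table, the same function could be given the table text saved to a file; the file name and location would be up to the reader.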
