# Chain-of-thought: Extensions and Analysis

- 📺 **Video:** [https://youtu.be/9sFyzMywKmo](https://youtu.be/9sFyzMywKmo)

## Overview
- Analyze when chain-of-thought prompting helps and how to improve robustness.
- Explore self-consistency, verifier models, and editing noisy chains.

## Key ideas
- **Sampling multiple chains:** vote over diverse rationales to reduce errors.
- **Verification:** post-check reasoning steps for arithmetic or logic mistakes.
- **Compression:** distill verbose chains into concise explanations.
- **Failure modes:** hallucinated steps or unsupported leaps need filtering.
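
The verification and filtering ideas above can be sketched in a few lines. This is a minimal illustration, not the lecture's method: it assumes each reasoning step is a string of the form `a + b = c` (a format chosen for this example), and the helper names `step_is_valid` and `filter_chains` are hypothetical. A practical verifier would more likely be a trained model or a more general checker.

```python
import re
from collections import Counter

# Assumed step format for this sketch: "<int> + <int> = <int>"
STEP_PATTERN = re.compile(r'^\s*(-?\d+)\s*\+\s*(-?\d+)\s*=\s*(-?\d+)\s*$')


def step_is_valid(step):
    """Return True if the step parses and its arithmetic is correct."""
    match = STEP_PATTERN.match(step)
    if match is None:
        # Steps that do not parse are treated as unsupported and rejected.
        return False
    a, b, c = (int(g) for g in match.groups())
    return a + b == c


def filter_chains(chains):
    """Keep only (steps, answer) pairs in which every step passes the check."""
    return [(steps, answer) for steps, answer in chains
            if all(step_is_valid(s) for s in steps)]


chains = [
    (['4 + 3 = 7', '7 + 2 = 9'], 9),
    (['4 + 3 = 6', '6 + 2 = 8'], 8),        # arithmetic slip in the first step
    (['4 + 3 = 8', '8 + 2 = 10'], 10),      # arithmetic slip in the first step
    (['4 + 3 = 7', 'so clearly 12'], 12),   # unsupported leap
]

verified = filter_chains(chains)
votes = Counter(answer for _, answer in verified)
print('Verified chains:', len(verified))                       # 1
print('Vote after filtering:', votes.most_common(1)[0][0])     # 9
```

Note that this check only validates each step in isolation. It does not confirm that one step's result feeds the next, or that the stated answer matches the last step; both would be natural extensions.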

## Demo
Simulate self-consistency: generate several reasoning chains, inject random noise into the first step of each, and take a majority vote over the final answers, following the lecture (https://youtu.be/R6N3Wf0iHzs).

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

question = 'What is 4 + 3 + 2?'
chains = []
for _ in range(5):
    # Perturb the first step to mimic an occasional arithmetic slip.
    noise = random.choice([-1, 0, 1])
    partial = 4 + 3 + noise
    final = partial + 2
    chain = [
        f'Add 4 + 3 to get {partial}.',
        f'Add 2 to reach {final}.',
        f'Answer {final}.'
    ]
    chains.append((chain, final))

# Tally the final answers and keep the most frequent one.
votes = {}
for _, result in chains:
    votes[result] = votes.get(result, 0) + 1
best = max(votes, key=votes.get)
print('Chains generated:')
for chain, result in chains:
    print(chain, '->', result)
print()
print('Majority vote answer:', best)
```

```
Chains generated:
['Add 4 + 3 to get 7.', 'Add 2 to reach 9.', 'Answer 9.'] -> 9
['Add 4 + 3 to get 7.', 'Add 2 to reach 9.', 'Answer 9.'] -> 9
['Add 4 + 3 to get 6.', 'Add 2 to reach 8.', 'Answer 8.'] -> 8
['Add 4 + 3 to get 7.', 'Add 2 to reach 9.', 'Answer 9.'] -> 9
['Add 4 + 3 to get 8.', 'Add 2 to reach 10.', 'Answer 10.'] -> 10

Majority vote answer: 9
```


## Try it
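- Modify the demo: change the number of chains or the noise choices and observe how the majority vote responds.
- Add a tiny dataset or a counter-example where the majority vote picks the wrong answer.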
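
One possible counter-example, sketched under assumptions that are not part of the lecture: when the errors are systematic rather than random, majority voting settles on the wrong answer no matter how many chains are sampled. The function `simulate_vote` and its `noise_choices` parameter are hypothetical names introduced here; the setup reuses the demo's sum 4 + 3 + 2.

```python
import random
from collections import Counter


def simulate_vote(noise_choices, n_chains=25, seed=0):
    """Majority-vote answer for 4 + 3 + 2 when the first step of each chain
    is perturbed by a value drawn uniformly from noise_choices."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_chains):
        partial = 4 + 3 + rng.choice(noise_choices)
        answers.append(partial + 2)
    return Counter(answers).most_common(1)[0][0]


# Mostly-correct chains: noise 0 is the single most likely draw,
# so the vote is very likely (though not guaranteed) to return 9.
mostly_correct = simulate_vote([-1, 0, 0, 0, 1])

# Systematic bias: every chain makes the same off-by-one slip,
# so the vote is unanimously wrong and returns 10.
biased = simulate_vote([1])

print('Mostly-correct chains vote:', mostly_correct)
print('Systematically biased chains vote:', biased)
```

The biased case shows why self-consistency alone cannot repair a shared mistake: agreement among chains measures consistency, not correctness.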


## References
- [The Mythos of Model Interpretability](https://arxiv.org/pdf/1606.03490.pdf)
- [Deep Unordered Composition Rivals Syntactic Methods for Text Classification](https://www.aclweb.org/anthology/P15-1162/)
- [Analysis Methods in Neural Language Processing: A Survey](https://arxiv.org/pdf/1812.08951.pdf)
- ["Why Should I Trust You?" Explaining the Predictions of Any Classifier](https://arxiv.org/pdf/1602.04938.pdf)
- [Axiomatic Attribution for Deep Networks](https://arxiv.org/pdf/1703.01365.pdf)
- [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/pdf/1905.05950.pdf)
- [What Do You Learn From Context? Probing For Sentence Structure In Contextualized Word Representations](https://arxiv.org/pdf/1905.06316.pdf)
- [Annotation Artifacts in Natural Language Inference Data](https://www.aclweb.org/anthology/N18-2017/)
- [Hypothesis Only Baselines in Natural Language Inference](https://www.aclweb.org/anthology/S18-2023/)
- [Did the Model Understand the Question?](https://www.aclweb.org/anthology/P18-1176/)
- [Swag: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](https://www.aclweb.org/anthology/D18-1009.pdf)
- [Generating Visual Explanations](https://arxiv.org/pdf/1603.08507.pdf)
- [e-SNLI: Natural Language Inference with Natural Language Explanations](https://arxiv.org/abs/1812.01193)
- [Explaining Question Answering Models through Text Generation](https://arxiv.org/pdf/2004.05569.pdf)
- [Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems](https://arxiv.org/abs/1705.04146)
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
- [The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning](https://arxiv.org/abs/2205.03401)
- [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916)
- [Complementary Explanations for Effective In-Context Learning](https://arxiv.org/pdf/2211.13892.pdf)
- [PAL: Program-aided Language Models](https://arxiv.org/abs/2211.10435)
- [Measuring and Narrowing the Compositionality Gap in Language Models](https://arxiv.org/abs/2210.03350)


*Links only; we do not redistribute slides or papers.*