# GPT-3

- 📺 **Video:** [https://youtu.be/jn41DLgnqek](https://youtu.be/jn41DLgnqek)

## Overview
- Examine the GPT-3 architecture and scaling trends that enabled emergent abilities.
- Connect parameter counts, dataset size, and performance metrics.

## Key ideas
- **Decoder-only transformer:** autoregressive modeling with causal attention.
- **Scaling laws:** loss falls predictably, roughly as a power law, as parameters, data, and compute grow.
- **In-context learning:** larger models adapt from prompts without weight updates.
- **API access:** inference costs and latency matter for deployment.
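
To make the causal-attention idea concrete, here is a minimal sketch in plain Python. It is not GPT-3's implementation: the function name `causal_attention_weights` and the all-zero score matrix are assumptions chosen for illustration. It only shows the masking rule, where token `i` may attend to tokens `0..i` and future positions receive zero weight.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # token i may only see tokens 0..i
        m = max(visible)                           # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get weight exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 4-token sequence (all equal, for clarity).
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

With equal scores, each token spreads its attention uniformly over itself and everything before it, which makes the lower-triangular mask easy to see.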

## Demo
Print a toy scaling curve showing how loss decreases with parameter count, using the power-law form discussed in the lecture (https://youtu.be/jxu1qmwM04c), then fit the exponent in log-log space.

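```python
import numpy as np

# Toy power law: loss(N) = 1.2 * N**(-0.05) + 1.1 (illustrative constants, not measurements)
params = np.array([10**7, 10**8, 10**9, 10**10, 10**11], dtype=float)
irreducible = 1.1  # loss floor that more parameters cannot remove
loss = 1.2 * (params ** -0.05) + irreducible

for p, l in zip(params, loss):
    print(f"Parameters: {p/1e9:.2f}B | Approx. validation loss: {l:.3f}")

# Fit a line in log-log space after subtracting the irreducible loss;
# the slope recovers the exponent (-0.05). Subtracting min(loss) instead
# would push the last point toward log10(0) and distort the slope.
coef = np.polyfit(np.log10(params), np.log10(loss - irreducible), 1)
print()
print(f"Fitted slope: {coef[0]:.3f}")
```

```
Parameters: 0.01B | Approx. validation loss: 1.636
Parameters: 0.10B | Approx. validation loss: 1.578
Parameters: 1.00B | Approx. validation loss: 1.526
Parameters: 10.00B | Approx. validation loss: 1.479
Parameters: 100.00B | Approx. validation loss: 1.438

Fitted slope: -0.050
```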
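
The same toy law can also be read in the other direction: given a target loss, how many parameters would it take? The sketch below reuses the demo's constants (1.2, 0.05, 1.1). The names `toy_loss` and `params_for_loss` are hypothetical helpers, and the numbers illustrate the shape of a power law only; they are not predictions about GPT-3.

```python
# Constants of the toy law from the demo; illustrative, not fitted to real data.
A, ALPHA, L_INF = 1.2, 0.05, 1.1

def toy_loss(n_params):
    """Toy power law: loss as a function of parameter count."""
    return A * n_params ** (-ALPHA) + L_INF

def params_for_loss(target_loss):
    """Invert toy_loss: parameter count needed to reach target_loss."""
    if target_loss <= L_INF:
        raise ValueError("target is at or below the irreducible loss; unreachable")
    return (A / (target_loss - L_INF)) ** (1.0 / ALPHA)

# 175e9 is GPT-3's parameter count; the loss value is still a toy number.
print(f"Toy loss at 175B parameters: {toy_loss(175e9):.3f}")
print(f"Parameters for toy loss 1.40: {params_for_loss(1.40):.2e}")
```

Because the exponent is small, the inverse involves a power of `1 / 0.05 = 20`, so modest loss improvements demand very large increases in parameter count.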


## Try it
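- Modify the demo: change the exponent (`-0.05`) or the constant offset (`1.1`) and watch how the printed losses and the fitted slope respond.
- Add a tiny dataset or counter-example, for instance a noisy or non-power-law `loss` array, and check whether the log-log fit still looks linear.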
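
One possible counter-example, sketched under the assumption that the true offset is unknown: refit the demo's toy law while guessing different values for the irreducible loss. The helper `fit_slope` and the list of guesses are hypothetical. The point is that only the correct offset recovers the true exponent.

```python
import math

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

params = [10.0 ** k for k in range(7, 12)]
loss = [1.2 * p ** -0.05 + 1.1 for p in params]   # same toy law as the demo
log_n = [math.log10(p) for p in params]

# Try several guesses for the irreducible loss; only 1.1 is the true offset here.
for guess in (0.0, 0.5, 1.1, 1.3):
    slope = fit_slope(log_n, [math.log10(l - guess) for l in loss])
    print(f"assumed offset {guess:.1f} -> fitted slope {slope:.4f}")
```

Guesses below the true offset flatten the fitted slope and guesses above it steepen the slope, so an exponent read off a log-log plot is only as reliable as the assumed loss floor.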


## References
- [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
- [Demystifying Prompts in Language Models via Perplexity Estimation](https://arxiv.org/abs/2212.04037)
- [Calibrate Before Use: Improving Few-Shot Performance of Language Models](https://arxiv.org/abs/2102.09690)
- [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110)
- [Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?](https://arxiv.org/abs/2202.12837)
- [In-context Learning and Induction Heads](https://arxiv.org/abs/2209.11895)
- [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207)
- [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
- [Stanford Alpaca: An Instruction-following LLaMA Model](https://crfm.stanford.edu/2023/03/13/alpaca.html) (website)
- [Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation](https://arxiv.org/abs/2212.07981)
- [WiCE: Real-World Entailment for Claims in Wikipedia](https://arxiv.org/abs/2303.01432)
- [SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization](https://arxiv.org/abs/2111.09525)
- [FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation](https://arxiv.org/abs/2305.14251)
- [RARR: Researching and Revising What Language Models Say, Using Language Models](https://arxiv.org/abs/2210.08726)


*Links only; we do not redistribute slides or papers.*