# Reinforcement Learning from Human Feedback (RLHF)

- 📺 **Video:** [https://youtu.be/DwAdhx6GFh8](https://youtu.be/DwAdhx6GFh8)

## Overview
- Align models by fine-tuning with reward models trained on human preference data.
- Understand the steps: supervised fine-tuning, reward learning, policy optimization.

## Key ideas
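- **Preference data:** annotators compare or rank candidate responses to the same prompt for quality and safety.
- **Reward model:** a learned scoring function, trained on those comparisons, that assigns higher scores to the responses people prefer.
- **Policy optimization:** e.g., PPO maximizes the learned reward while a KL penalty keeps the policy near the reference model (typically the supervised fine-tuned model).
- **Safety guardrails:** combine reward shaping with constraints.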
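
The reward-model and policy-optimization ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the common Bradley-Terry formulation (the probability that one response beats another is the sigmoid of their reward difference) and a KL-shaped reward; the function names and the `beta` value are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood that each chosen response beats its rejected one."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); logaddexp keeps this stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus a penalty for drifting away from the reference policy.

    beta is an assumed penalty weight chosen only for illustration.
    """
    return reward - beta * (logp_policy - logp_ref)

# Equal rewards: the model cannot tell the pair apart, so the loss is log(2).
loss_equal = bradley_terry_loss([0.0], [0.0])
# A larger margin in favor of the chosen response lowers the loss.
loss_good = bradley_terry_loss([2.0], [0.0])
# The policy assigns more log-probability than the reference, so it is penalized.
shaped = kl_shaped_reward(reward=1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.1)

print('Loss with equal rewards:', loss_equal)
print('Loss with a margin of 2:', loss_good)
print('KL-shaped reward:', shaped)
```

Training a reward model amounts to lowering this loss over many comparison pairs, and the shaped reward is one common way to express "maximize reward while staying near the reference model" as a single scalar signal.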

## Demo
Simulate pairwise preference comparisons, train a logistic reward model on the feature differences between preferred and rejected responses, and run a single simple policy update, paralleling the lecture (https://youtu.be/0pLSyg7Z9hA).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Four candidate responses, each described by two hand-set features.
responses = ['Concise answer with sources.', 'Vague answer with speculation.', 'Helpful step-by-step reasoning.', 'Off-topic joke.']
features = np.array([
    [1.0, 0.8],  # relevant, safe
    [0.2, 0.4],
    [0.9, 0.9],
    [0.1, 0.1]
])
# Each pair is (index of preferred response, index of rejected response).
preferences = [(0, 1), (2, 1), (0, 3), (2, 3)]

# Build a symmetric training set on feature differences:
# label 1 for (better - worse), label 0 for the mirrored difference.
X, y = [], []
for better, worse in preferences:
    diff = features[better] - features[worse]
    X.append(diff)
    y.append(1)
    X.append(-diff)
    y.append(0)

clf = LogisticRegression().fit(X, y)
# Use the classifier's raw score (logit) for each response as its reward.
rewards = clf.decision_function(features)
print('Reward estimates:', rewards)

# Start from a uniform policy and reweight it by the exponentiated advantage.
policy = np.array([0.25, 0.25, 0.25, 0.25])
advantage = rewards - rewards.mean()
policy = np.exp(np.log(policy + 1e-6) + 0.5 * advantage)
policy /= policy.sum()
print('Updated policy distribution:', policy)
```

Output:

```
Reward estimates: [1.88613449 0.58578553 1.85394001 0.20599333]
Updated policy distribution: [0.34040625 0.17767664 0.33497052 0.14694659]
```
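
The policy update in the demo can be read as the closed-form solution of a KL-regularized objective: maximizing expected reward minus `beta * KL(pi || pi_ref)` gives a policy proportional to `pi_ref * exp(reward / beta)`. The sketch below checks this numerically, assuming `beta = 2.0` so that `1 / beta` matches the demo's `0.5` step; `beta`, `objective`, and the random-search check are illustrative assumptions rather than part of the lecture code.

```python
import numpy as np

# Reward estimates copied from the demo output.
rewards = np.array([1.88613449, 0.58578553, 1.85394001, 0.20599333])
pi_ref = np.full(4, 0.25)  # the demo's starting (uniform) policy
beta = 2.0                 # assumed KL weight; 1 / beta equals the demo's 0.5 step

def objective(pi):
    """Expected reward minus beta * KL(pi || pi_ref)."""
    kl = np.sum(pi * (np.log(pi) - np.log(pi_ref)))
    return float(pi @ rewards - beta * kl)

# Closed-form maximizer: pi* is proportional to pi_ref * exp(rewards / beta).
pi_star = pi_ref * np.exp(rewards / beta)
pi_star /= pi_star.sum()

# Sanity check: random points on the probability simplex should not score higher.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.full(4, 2.0), size=1000)
best_random = max(objective(p) for p in candidates)

print('Closed-form policy:', pi_star)
print('Objective at closed form:', objective(pi_star))
print('Best random objective:', best_random)
```

Subtracting the mean reward, as the demo does when it computes `advantage`, shifts every logit by the same constant and therefore does not change the normalized distribution. A smaller `beta` (larger step) moves more probability mass onto the highest-reward responses; a larger `beta` keeps the policy closer to `pi_ref`.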


## Try it
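- Modify the demo: change the update step size (`0.5`) or the feature values and observe how the policy distribution shifts.
- Add a tiny dataset or counter-example, such as a preference pair that contradicts the existing ones.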
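
One possible counter-example is a preference cycle, which no scalar reward can represent. The sketch below uses NumPy only and fits a linear reward by gradient descent on the pairwise logistic loss; the three-response `features`, the `cycle` pairs, and the helper `fit_reward_weights` are hypothetical and not part of the demo.

```python
import numpy as np

# Hypothetical features for three responses (not taken from the demo).
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
# Cyclic preferences: 0 beats 1, 1 beats 2, and 2 beats 0.
cycle = [(0, 1), (1, 2), (2, 0)]

diffs = np.array([features[a] - features[b] for a, b in cycle])

def fit_reward_weights(diffs, w_init, lr=0.5, steps=2000):
    """Gradient descent on the pairwise loss mean(log(1 + exp(-diffs @ w)))."""
    w = np.array(w_init, dtype=float)
    for _ in range(steps):
        margins = diffs @ w
        # Derivative of log(1 + exp(-m)) with respect to m is -1 / (1 + exp(m)).
        grad = -(diffs / (1.0 + np.exp(margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

# Start away from zero so the collapse is visible.
w = fit_reward_weights(diffs, w_init=[1.0, -1.0])
rewards = features @ w

print('Learned weights:', w)
print('Reward estimates:', rewards)
```

Because the three feature differences sum to zero, the loss is minimized at zero weights, so all responses end up with essentially the same reward and the policy update would leave the distribution unchanged.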


## References
- [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
- [Demystifying Prompts in Language Models via Perplexity Estimation](https://arxiv.org/abs/2212.04037)
- [Calibrate Before Use: Improving Few-Shot Performance of Language Models](https://arxiv.org/abs/2102.09690)
- [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110)
- [Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?](https://arxiv.org/abs/2202.12837)
- [In-context Learning and Induction Heads](https://arxiv.org/abs/2209.11895)
- [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207)
- [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
- [Stanford Alpaca: An Instruction-following LLaMA Model](https://crfm.stanford.edu/2023/03/13/alpaca.html)
- [Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation](https://arxiv.org/abs/2212.07981)
- [WiCE: Real-World Entailment for Claims in Wikipedia](https://arxiv.org/abs/2303.01432)
- [SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization](https://arxiv.org/abs/2111.09525)
- [FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation](https://arxiv.org/abs/2305.14251)
- [RARR: Researching and Revising What Language Models Say, Using Language Models](https://arxiv.org/abs/2210.08726)


*Links only; we do not redistribute slides or papers.*