# Part 6: Human‑in‑the‑Loop & Preference Models

This notebook demonstrates how to simulate human preference data for evaluating language models. Human feedback is often collected in the form of ratings or pairwise comparisons. We simulate pairwise preferences on a set of prompts with multiple raters, aggregate the results, and discuss how to interpret them.

## Simulating Pairwise Preferences

We simulate 30 prompts, each evaluated by three human raters. Each prompt has a truly better model (A or B), and A is drawn as the better one with probability 0.6. If Model A is truly better for a prompt, each rater independently prefers it with probability 0.8; if Model B is truly better, each rater prefers A with probability 0.2. Preferences are encoded as 1 for **A wins** and −1 for **B wins**.

In [1]:

import numpy as np

np.random.seed(123)
n_prompts = 30
n_raters = 3
# Each prompt independently favours Model A with probability 0.6
# (1 = A is truly better, 0 = B is truly better)
true_quality = np.random.choice([1, 0], size=n_prompts, p=[0.6, 0.4])

# preferences[i,j] = 1 if rater j prefers A on prompt i, -1 if prefers B
preferences = np.zeros((n_prompts, n_raters), dtype=int)
for i in range(n_prompts):
    prob_A = 0.8 if true_quality[i] == 1 else 0.2
    for j in range(n_raters):
        preferences[i, j] = np.random.choice([1, -1], p=[prob_A, 1 - prob_A])

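# Aggregate preference per prompt by majority vote: sign of the row sum
# (with an odd number of +1/-1 votes the sum is never 0, so no ties occur)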
agg_pref = np.sign(preferences.sum(axis=1))
A_wins = np.sum(agg_pref == 1)
B_wins = np.sum(agg_pref == -1)
ties = np.sum(agg_pref == 0)
proportion_A = A_wins / (A_wins + B_wins + ties)

# Results dictionary
results = {
    'Prompts': n_prompts,
    'A_wins': int(A_wins),
    'B_wins': int(B_wins),
    'Ties': int(ties),
    'Proportion_A_preferred': round(proportion_A, 3)
}
results


{'Prompts': 30, 'A_wins': 17, 'B_wins': 13, 'Ties': 0, 'Proportion_A_preferred': 0.567}
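
As a sanity check on the simulated counts, the expected share of prompts on which the majority vote favours A can be worked out directly from the simulation parameters. The sketch below is an illustration rather than part of the original cell; the helper name `majority_prob` is hypothetical, and it assumes raters vote independently, as in the simulation.

```python
from math import comb

def majority_prob(p, n_raters):
    """Probability that a strict majority of n_raters independent raters
    prefers A when each one prefers A with probability p."""
    need = n_raters // 2 + 1
    return sum(
        comb(n_raters, k) * p**k * (1 - p) ** (n_raters - k)
        for k in range(need, n_raters + 1)
    )

# Parameters taken from the simulation above
p_majority_if_a_better = majority_prob(0.8, 3)  # 0.8**3 + 3 * 0.8**2 * 0.2
p_majority_if_b_better = majority_prob(0.2, 3)  # 0.2**3 + 3 * 0.2**2 * 0.8

# Mix the two cases by the prior probability that A is truly better (0.6)
expected_share_a = 0.6 * p_majority_if_a_better + 0.4 * p_majority_if_b_better
print(round(p_majority_if_a_better, 3), round(expected_share_a, 4))
```

Majority voting over three raters raises the per-prompt accuracy from 0.8 for a single rater to about 0.896, and the expected share of prompts won by A is about 0.579, close to the observed 0.567.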

### Discussion

In this simulation, the majority vote of the three raters preferred Model A on **17 out of 30 prompts**, with **0** ties (a tie is impossible when an odd number of raters each cast a ±1 vote). That corresponds to **56.7%** of prompts favouring Model A overall.
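
With only 30 prompts, a 56.7% win rate carries considerable uncertainty. One way to quantify it is a Wilson score interval for the proportion. This is a minimal sketch, assuming each prompt is an independent trial and using a hypothetical helper `wilson_interval`; it is not part of the original notebook.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Counts from the results dictionary above: A won 17 of 30 prompts
low, high = wilson_interval(17, 30)
print(f"95% interval for A's win rate: ({low:.3f}, {high:.3f})")
```

The interval comfortably includes 0.5, so on this sample alone the result does not separate Model A from a coin flip, even though A is better on 60% of prompts by construction.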

Human‑in‑the‑loop evaluations often use aggregated preferences or ratings to train reward models and to assess whether one model is better than another. Pairwise comparison data is widely used in **reinforcement learning from human feedback (RLHF)**, where a reward model is trained to predict which of two outputs is preferred.
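
To make the reward-model idea concrete, here is a minimal sketch of fitting a linear reward function to pairwise preferences with a Bradley-Terry style logistic loss. Everything in it is an assumption for illustration: the synthetic feature vectors, the "true" weights `w_true`, the linear form of the reward, and the plain gradient-descent loop. Reward models used in RLHF are typically neural networks, but the loss has the same shape.

```python
import numpy as np

def pairwise_loss(w, x_chosen, x_rejected):
    """Mean negative log-likelihood of the observed preferences
    under a linear reward r(x) = x @ w."""
    margin = (x_chosen - x_rejected) @ w
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    return np.mean(np.logaddexp(0.0, -margin))

def pairwise_grad(w, x_chosen, x_rejected):
    """Gradient of pairwise_loss with respect to w."""
    diff = x_chosen - x_rejected
    margin = diff @ w
    # d/dm log(1 + exp(-m)) = -sigmoid(-m) = -1 / (1 + exp(m))
    coeff = -1.0 / (1.0 + np.exp(margin))
    return (coeff[:, None] * diff).mean(axis=0)

rng = np.random.default_rng(0)
n_pairs, n_features = 200, 4
w_true = np.array([1.5, -1.0, 0.5, 0.0])  # hypothetical "true" reward weights

# Synthetic feature vectors for the two outputs in each comparison
x_a = rng.normal(size=(n_pairs, n_features))
x_b = rng.normal(size=(n_pairs, n_features))

# Simulated rater prefers A with probability sigmoid(r(A) - r(B))
p_a = 1.0 / (1.0 + np.exp(-(x_a - x_b) @ w_true))
a_preferred = rng.random(n_pairs) < p_a
x_chosen = np.where(a_preferred[:, None], x_a, x_b)
x_rejected = np.where(a_preferred[:, None], x_b, x_a)

# Plain gradient descent on the pairwise loss
w = np.zeros(n_features)
initial_loss = pairwise_loss(w, x_chosen, x_rejected)
for _ in range(500):
    w -= 0.5 * pairwise_grad(w, x_chosen, x_rejected)
final_loss = pairwise_loss(w, x_chosen, x_rejected)

# Fraction of pairs where the fitted reward ranks the chosen output higher
train_accuracy = np.mean((x_chosen - x_rejected) @ w > 0)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, accuracy: {train_accuracy:.2f}")
```

The loss depends only on the difference in reward between the chosen and rejected outputs, which is why pairwise data suffices: the model never needs an absolute score from a rater.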

Keep in mind that real human evaluations involve additional complexities: rater biases, annotation guidelines, quality checks, and inter‑rater reliability metrics.  Nevertheless, this toy example shows how to aggregate preferences and quantify which model wins more often.
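
One of the inter‑rater reliability metrics mentioned above is Fleiss' kappa, which compares observed agreement among raters with the agreement expected by chance. The sketch below is one possible implementation for ±1 votes laid out like the `preferences` array (prompts by raters); the function name `fleiss_kappa_from_votes` and the small example matrices are hypothetical.

```python
import numpy as np

def fleiss_kappa_from_votes(votes):
    """Fleiss' kappa for a (n_items, n_raters) array of +1 / -1 votes."""
    votes = np.asarray(votes)
    n_items, n_raters = votes.shape
    # counts[i, c] = number of raters who put item i in category c
    counts = np.stack(
        [(votes == 1).sum(axis=1), (votes == -1).sum(axis=1)], axis=1
    )
    # Per-item agreement: share of rater pairs that agree on the item
    p_item = (np.sum(counts**2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_observed = p_item.mean()
    # Chance agreement from the overall category proportions
    p_category = counts.sum(axis=0) / (n_items * n_raters)
    p_expected = np.sum(p_category**2)
    if np.isclose(p_expected, 1.0):
        return float("nan")  # undefined when every vote is in one category
    return float((p_observed - p_expected) / (1.0 - p_expected))

# Perfect agreement on every item
kappa_perfect = fleiss_kappa_from_votes([[1, 1, 1], [-1, -1, -1]])
# Raters split 2-1 on every item
kappa_split = fleiss_kappa_from_votes([[1, 1, -1], [1, -1, -1]])
print(kappa_perfect, kappa_split)
```

In the notebook, calling `fleiss_kappa_from_votes(preferences)` would give the agreement level for the simulated raters. Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.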

---

This notebook is part of a series on evaluating LLMs.  See the accompanying article for more context and further discussion.