##  Human & Preference-Based Evaluation

Automated metrics do not always agree with human judgments of quality, so a thorough LLM evaluation also collects human feedback.

### 2.1 Human Rating Scales

Likert Scale – 1 to 5 rating for fluency, coherence, etc.

Ranking-based Evaluation – Compare multiple outputs and rank them.


Likert Scale – 1 to 5 Rating

Human evaluators rate text quality on a 1 to 5 scale based on:
	•	Fluency: Is the text grammatically correct?
	•	Coherence: Does it make sense in context?
	•	Relevance: Does it answer the question or match the task?

In [1]:
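def ask_rating(prompt):
    """Prompt until the evaluator enters a whole number from 1 to 5."""
    while True:
        answer = input(prompt).strip()
        if answer in {"1", "2", "3", "4", "5"}:
            return int(answer)
        print("Please enter a whole number from 1 to 5.")


def collect_human_ratings():
    # Candidate outputs to be rated by a human evaluator
    outputs = [
        "The cat sits on the mat.",
        "Cat mat on sits.",
        "A feline is resting on a carpet."
    ]

    scores = []
    for idx, output in enumerate(outputs):
        print(f"\n[{idx + 1}] Generated Text: {output}")
        fluency = ask_rating("Rate Fluency (1-5): ")
        coherence = ask_rating("Rate Coherence (1-5): ")
        relevance = ask_rating("Rate Relevance (1-5): ")

        # Average the three criteria into a single score per output
        avg_score = (fluency + coherence + relevance) / 3
        scores.append((output, avg_score))

    # Rank outputs from highest to lowest average score
    scores.sort(key=lambda x: x[1], reverse=True)
    print("\n**Ranked Outputs (Best to Worst):**")
    for rank, (text, score) in enumerate(scores, start=1):
        print(f"{rank}. {text} (Avg Score: {score:.2f})")

# Run the human evaluation collection
collect_human_ratings()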


[1] Generated Text: The cat sits on the mat.

[2] Generated Text: Cat mat on sits.

[3] Generated Text: A feline is resting on a carpet.

**Ranked Outputs (Best to Worst):**
1. Cat mat on sits. (Avg Score: 5.00)
2. A feline is resting on a carpet. (Avg Score: 3.67)
3. The cat sits on the mat. (Avg Score: 3.33)
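In practice, each output is usually rated by several annotators rather than one. The sketch below shows one way to aggregate such ratings per criterion and overall. The `ratings` dictionary and the `summarize_ratings` helper are hypothetical, and the scores are made up purely for illustration.

```python
from statistics import mean

# Hypothetical data: ratings[output][criterion] holds one 1-5 score per annotator.
ratings = {
    "The cat sits on the mat.": {
        "fluency": [5, 5, 4],
        "coherence": [5, 4, 5],
        "relevance": [4, 5, 4],
    },
    "Cat mat on sits.": {
        "fluency": [1, 2, 1],
        "coherence": [2, 1, 2],
        "relevance": [2, 3, 2],
    },
}


def summarize_ratings(ratings):
    """Average each criterion across annotators, then average the criteria."""
    summary = {}
    for output, by_criterion in ratings.items():
        per_criterion = {name: mean(scores) for name, scores in by_criterion.items()}
        overall = mean(list(per_criterion.values()))
        summary[output] = {**per_criterion, "overall": overall}
    return summary


summary = summarize_ratings(ratings)
ranked = sorted(summary.items(), key=lambda item: item[1]["overall"], reverse=True)
for rank, (output, stats) in enumerate(ranked, start=1):
    print(f"{rank}. {output} (Overall: {stats['overall']:.2f})")
```

Keeping the per-criterion averages alongside the overall score makes it easier to see why an output ranks low, for example poor fluency despite acceptable relevance.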


Ranking-Based Evaluation

Instead of assigning numerical scores, humans compare multiple outputs and rank them in order of preference.

Example: Comparing Model Outputs

In [2]:
def compare_outputs():
    outputs = [
        "Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems.",
        "Global warming increases CO2 levels, causing environmental changes.",
        "Rising heat is bad."
    ]

    # Ask the evaluator to judge every unordered pair of outputs once
    rankings = []
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            print(f"\nChoose the better response:\n[1] {outputs[i]}\n[2] {outputs[j]}")
            choice = input("Enter 1 or 2: ").strip()
            # Re-prompt so that invalid input is not silently counted as a vote for [2]
            while choice not in {"1", "2"}:
                choice = input("Please enter 1 or 2: ").strip()
            rankings.append((outputs[i], outputs[j], int(choice)))

    # Count wins and rank outputs
    scores = {output: 0 for output in outputs}
    for output1, output2, choice in rankings:
        if choice == 1:
            scores[output1] += 1
        else:
            scores[output2] += 1

    # Sort outputs by number of pairwise wins
    sorted_outputs = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    print("\n**Final Ranking:**")
    for rank, (text, score) in enumerate(sorted_outputs, start=1):
        print(f"{rank}. {text} (Wins: {score})")

# Run pairwise ranking
compare_outputs()


Choose the better response:
[1] Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems.
[2] Global warming increases CO2 levels, causing environmental changes.

Choose the better response:
[1] Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems.
[2] Rising heat is bad.

Choose the better response:
[1] Global warming increases CO2 levels, causing environmental changes.
[2] Rising heat is bad.

**Final Ranking:**
1. Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems. (Wins: 2)
2. Global warming increases CO2 levels, causing environmental changes. (Wins: 1)
3. Rising heat is bad. (Wins: 0)
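Counting wins works for a single evaluator, but when several annotators judge the same pairs and sometimes disagree, a common alternative is to fit a Bradley-Terry model, which assigns each output a strength such that P(i beats j) = s_i / (s_i + s_j). The sketch below is not part of the original notebook: the `bradley_terry` function, the labels `A`, `B`, `C`, and the win counts are all hypothetical, and the fitting loop is the standard iterative (minorization-maximization) update.

```python
def bradley_terry(items, wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    strengths = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strengths[i] + strengths[j])
                for j in items
                if j != i
            )
            # Keep the previous value if item i was never compared with anything
            updated[i] = total_wins / denom if denom > 0 else strengths[i]
        norm = sum(updated.values())
        strengths = {item: value / norm for item, value in updated.items()}
    return strengths


# Hypothetical votes from four annotators over three outputs A, B, C
items = ["A", "B", "C"]
wins = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("A", "C"): 4, ("C", "A"): 0,
    ("B", "C"): 3, ("C", "B"): 1,
}

strengths = bradley_terry(items, wins)
for item in sorted(items, key=strengths.get, reverse=True):
    print(f"{item}: {strengths[item]:.3f}")
```

The same probabilistic view of preferences underlies the pairwise loss commonly used to train reward models.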


### 2.2 Reinforcement Learning from Human Feedback (RLHF)

Collect human preference data (A > B style ranking).

Train a reward model on this data.

Fine-tune LLMs using RL with PPO (Proximal Policy Optimization).

LLMs such as the models behind ChatGPT are aligned with human preferences using Reinforcement Learning from Human Feedback (RLHF).

📌 Steps in RLHF

1️⃣ Collect Human Preferences
	•	Show humans two model outputs (A & B) for the same prompt.
	•	Ask them to choose the better one.

2️⃣ Train a Reward Model
	•	Convert human choices into training data.
	•	Train a reward model that predicts which output is better.

3️⃣ Fine-Tune LLM Using RL (PPO Algorithm)
	•	Use Proximal Policy Optimization (PPO) to fine-tune the model.
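	•	The model is trained to maximize the reward model's score, while a KL penalty against the original model keeps it from drifting into degenerate text that merely exploits the reward model.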
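Steps 2 and 3 can be made concrete with two small formulas. The sketch below assumes the commonly used pairwise loss for reward models, -log(sigmoid(r_chosen - r_rejected)), and a KL-penalized reward for the PPO stage. The function names and the `beta` value are illustrative assumptions, not taken from any particular library.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def kl_shaped_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward used during RL: reward model score minus a KL-style penalty.

    The penalty grows when the fine-tuned policy assigns the sampled text a
    much higher log-probability than the frozen reference model does.
    """
    return reward - beta * (logprob_policy - logprob_reference)


# The loss is small when the human-preferred output already scores higher...
print(pairwise_reward_loss(2.0, 0.0))
# ...and large when the reward model ranks the pair the wrong way round.
print(pairwise_reward_loss(0.0, 2.0))
# A policy that drifts from the reference model sees its reward reduced.
print(kl_shaped_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0))
```

Minimizing the first quantity over many labeled pairs pushes the reward model to score preferred outputs higher; the second quantity is what the policy then maximizes.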

In [3]:
!pip install trl torch transformers accelerate datasets


In [None]:
import torch
from datasets import Value, load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Load a small slice (1%) of the SHP human preference dataset
dataset = load_dataset("stanfordnlp/SHP", split="train[:1%]")

# Load a pre-trained transformer with a single scalar output to act as the reward head
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)

# Tokenize each post ("history") together with response A.
# Simplification: the scalar head is regressed onto SHP's "labels" column
# (1 if response A was preferred over response B, else 0). A full reward model
# would score A and B separately and train on a pairwise preference loss.
def preprocess(examples):
    return tokenizer(
        examples["history"],
        examples["human_ref_A"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized_dataset = dataset.map(preprocess, batched=True)

# With num_labels=1 the model uses a regression (MSE) loss, which needs float labels
tokenized_dataset = tokenized_dataset.cast_column("labels", Value("float32"))

# Reduce dataset size for quick training
train_dataset = tokenized_dataset.shuffle(seed=42).select(range(min(100, len(tokenized_dataset))))

# Define training arguments (no evaluation is run, which is the default)
training_args = TrainingArguments(
    output_dir="./reward_model",
    per_device_train_batch_size=4,
    save_strategy="epoch",
    logging_steps=10,
    report_to="none",  # Disable external loggers such as WandB
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train the reward model
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
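Once a reward model like the one above has been trained, a simple sanity check is pairwise accuracy on held-out pairs: how often the response with the higher reward score is also the one humans preferred. The sketch below assumes the scores have already been computed. The `pairwise_accuracy` helper and the example numbers are hypothetical; the label convention (1 means response A was preferred) follows SHP's `labels` column.

```python
def pairwise_accuracy(scores_a, scores_b, labels):
    """Fraction of pairs where the higher-scored response matches the human label.

    labels[i] is 1 if humans preferred response A and 0 if they preferred B.
    """
    if not labels:
        raise ValueError("At least one labeled pair is required.")
    correct = 0
    for score_a, score_b, label in zip(scores_a, scores_b, labels):
        predicted = 1 if score_a > score_b else 0
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical reward scores for four held-out pairs
scores_a = [0.9, 0.2, 0.6, 0.1]
scores_b = [0.4, 0.7, 0.8, 0.3]
labels = [1, 0, 1, 0]

accuracy = pairwise_accuracy(scores_a, scores_b, labels)
print(f"Pairwise accuracy: {accuracy:.2f}")
```

An accuracy near 0.5 would suggest the reward model has learned little beyond chance and is not yet a useful signal for the PPO stage.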

