## Human & Preference-Based Evaluation

Automated metrics don’t always align with human judgments, so a thorough LLM evaluation also needs human feedback.

### 2.1 Human Rating Scales

- Likert scale: a 1-to-5 rating for criteria such as fluency and coherence.
- Ranking-based evaluation: evaluators compare multiple outputs and rank them.


#### Likert Scale – 1 to 5 Rating

Human evaluators rate text quality on a 1-to-5 scale based on:

- Fluency: Is the text grammatically correct?
- Coherence: Does it make sense in context?
- Relevance: Does it answer the question or match the task?

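def ask_rating(criterion):
    """Prompt until the evaluator enters a whole number from 1 to 5."""
    while True:
        try:
            rating = int(input(f"Rate {criterion} (1-5): "))
        except ValueError:
            rating = 0  # non-numeric input falls through to the retry message
        if 1 <= rating <= 5:
            return rating
        print("Please enter a whole number from 1 to 5.")


def collect_human_ratings():
    # Candidate outputs to be rated by a human evaluator
    outputs = [
        "The cat sits on the mat.",
        "Cat mat on sits.",
        "A feline is resting on a carpet."
    ]

    scores = []
    for idx, output in enumerate(outputs):
        print(f"\n[{idx + 1}] Generated Text: {output}")
        fluency = ask_rating("Fluency")
        coherence = ask_rating("Coherence")
        relevance = ask_rating("Relevance")

        # Unweighted mean of the three criteria
        avg_score = (fluency + coherence + relevance) / 3
        scores.append((output, avg_score))

    # Rank outputs by average score, highest first
    scores.sort(key=lambda x: x[1], reverse=True)
    print("\n**Ranked Outputs (Best to Worst):**")
    for rank, (text, score) in enumerate(scores, start=1):
        print(f"{rank}. {text} (Avg Score: {score:.2f})")

# Run the human evaluation collection
collect_human_ratings()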


[1] Generated Text: The cat sits on the mat.

[2] Generated Text: Cat mat on sits.

[3] Generated Text: A feline is resting on a carpet.

**Ranked Outputs (Best to Worst):**
1. Cat mat on sits. (Avg Score: 5.00)
2. A feline is resting on a carpet. (Avg Score: 3.67)
3. The cat sits on the mat. (Avg Score: 3.33)
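
In practice, ratings usually come from more than one evaluator and are aggregated offline rather than typed in interactively. The following is a minimal sketch of that aggregation, assuming ratings are stored as a nested dictionary; the annotator names, the `aggregate_likert` helper, and the scores themselves are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean

CRITERIA = ("fluency", "coherence", "relevance")

# Hypothetical ratings from two annotators (illustrative values, not real data).
example_ratings = {
    "annotator_a": {
        "The cat sits on the mat.": {"fluency": 5, "coherence": 5, "relevance": 5},
        "Cat mat on sits.": {"fluency": 1, "coherence": 2, "relevance": 3},
    },
    "annotator_b": {
        "The cat sits on the mat.": {"fluency": 4, "coherence": 5, "relevance": 5},
        "Cat mat on sits.": {"fluency": 2, "coherence": 2, "relevance": 2},
    },
}


def aggregate_likert(ratings):
    """Average each criterion across annotators, then average the criteria."""
    collected = {}
    for per_output in ratings.values():
        for text, scores in per_output.items():
            # Reject anything outside the 1-to-5 Likert range
            for criterion in CRITERIA:
                if not 1 <= scores[criterion] <= 5:
                    raise ValueError(f"{criterion} rating out of range for {text!r}")
            collected.setdefault(text, []).append(scores)

    summary = {}
    for text, score_list in collected.items():
        means = {c: mean(s[c] for s in score_list) for c in CRITERIA}
        means["overall"] = mean(means[c] for c in CRITERIA)
        summary[text] = means
    return summary


if __name__ == "__main__":
    for text, means in aggregate_likert(example_ratings).items():
        print(f"{text} -> overall {means['overall']:.2f}")
```

Keeping the per-criterion means alongside the overall score makes it possible to see whether a low score comes from fluency, coherence, or relevance.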


#### Ranking-Based Evaluation

Instead of assigning numerical scores, humans compare multiple outputs and rank them in order of preference.

Example: Comparing Model Outputs

def compare_outputs():
    outputs = [
        "Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems.",
        "Global warming increases CO2 levels, causing environmental changes.",
        "Rising heat is bad."
    ]

    # Ask the evaluator to judge every unordered pair of outputs
    rankings = []
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            print(f"\nChoose the better response:\n[1] {outputs[i]}\n[2] {outputs[j]}")
            choice = input("Enter 1 or 2: ").strip()
            # Re-prompt so that invalid input is not counted as a vote
            while choice not in ("1", "2"):
                choice = input("Please enter 1 or 2: ").strip()
            rankings.append((outputs[i], outputs[j], int(choice)))

    # Count wins and rank outputs
    scores = {output: 0 for output in outputs}
    for output1, output2, choice in rankings:
        if choice == 1:
            scores[output1] += 1
        else:
            scores[output2] += 1

    # Sort outputs by number of pairwise wins, highest first
    sorted_outputs = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    print("\n**Final Ranking:**")
    for rank, (text, score) in enumerate(sorted_outputs, start=1):
        print(f"{rank}. {text} (Wins: {score})")

# Run pairwise ranking
compare_outputs()


Choose the better response:
[1] Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems.
[2] Global warming increases CO2 levels, causing environmental changes.

Choose the better response:
[1] Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems.
[2] Rising heat is bad.

Choose the better response:
[1] Global warming increases CO2 levels, causing environmental changes.
[2] Rising heat is bad.

**Final Ranking:**
1. Climate change means the planet is getting hotter due to pollution, affecting weather and ecosystems. (Wins: 2)
2. Global warming increases CO2 levels, causing environmental changes. (Wins: 1)
3. Rising heat is bad. (Wins: 0)
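
Pairwise judgments like these are often stored as `(chosen, rejected)` pairs so that they can be re-ranked or reused later without prompting anyone again. The sketch below assumes that storage format; the `rank_from_preferences` helper and the short labels `"A"`, `"B"`, `"C"` are hypothetical stand-ins for the three responses above.

```python
from collections import Counter


def rank_from_preferences(preferences):
    """Rank items by pairwise wins.

    preferences: list of (chosen, rejected) pairs.
    Returns a list of (item, wins) sorted by wins, highest first.
    """
    wins = Counter()
    for chosen, rejected in preferences:
        wins[chosen] += 1
        wins[rejected] += 0  # ensure items with no wins still appear
    return sorted(wins.items(), key=lambda item: item[1], reverse=True)


# Hypothetical recorded judgments: A beat B, A beat C, B beat C.
example_preferences = [("A", "B"), ("A", "C"), ("B", "C")]

if __name__ == "__main__":
    for rank, (item, count) in enumerate(rank_from_preferences(example_preferences), start=1):
        print(f"{rank}. {item} (Wins: {count})")
```

Counting wins is the simplest way to turn pairs into a ranking; it treats every comparison equally and breaks ties by the order in which items were first seen.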


### 2.2 Reinforcement Learning from Human Feedback (RLHF)

1. Collect human preference data (A > B style rankings).
2. Train a reward model on this data.
3. Fine-tune the LLM with reinforcement learning, typically PPO (Proximal Policy Optimization), using the reward model's scores as the reward signal.
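
To make the reward-model step more concrete, here is a toy sketch of training on A > B preference pairs with the pairwise logistic (Bradley-Terry style) loss, `-log(sigmoid(r_chosen - r_rejected))`. It is an illustration under strong assumptions, not a real RLHF pipeline: the reward model is a plain linear function, each response is represented by two hand-made features, and all names and values are hypothetical. The PPO fine-tuning step is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear stand-in for a reward model: a weighted sum of features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights so chosen responses score higher than rejected ones.

    pairs: list of (chosen_features, rejected_features).
    Minimizes -log(sigmoid(r_chosen - r_rejected)) by gradient descent.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(w_i) = -(1 - sigmoid(margin)) * (chosen_i - rejected_i)
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response: [level of detail, mentions a cause].
example_pairs = [
    ([1.0, 1.0], [0.6, 1.0]),
    ([1.0, 1.0], [0.2, 0.0]),
    ([0.6, 1.0], [0.2, 0.0]),
]

if __name__ == "__main__":
    learned = train_reward_model(example_pairs, n_features=2)
    for chosen, rejected in example_pairs:
        print(reward(learned, chosen) > reward(learned, rejected))
```

In an actual RLHF setup the linear function would be replaced by a neural network that scores a prompt and response, but the pairwise objective has the same shape.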