# MemAlign: Aligning LLM Judges with Human Feedback
This notebook demonstrates how to use MemAlign to align an LLM judge with human preferences.

MemAlign uses a dual-memory system:

- **Semantic Memory**: Distills general guidelines from human feedback patterns
- **Episodic Memory**: Retrieves similar past examples using embeddings for few-shot learning

## What you'll learn:
1. Create an LLM judge for evaluating responses
2. Prepare alignment and test datasets with edge cases
3. Evaluate the judge before alignment (baseline)
4. Align the judge using human feedback
5. Evaluate the improved judge (post-alignment)
6. Inspect the learned guidelines
7. Unalign (remove) specific feedback from the judge
8. Register the judge as a scorer for future experiments

# Setup
First, let's import the required modules and set up the environment.

In [None]:
%pip install --upgrade "mlflow>=3.9.0" openai litellm dspy

In [None]:
from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import MemAlignOptimizer
from mlflow.entities import AssessmentSource, AssessmentSourceType

import mlflow

import time
import os

## Set up your provider and model

In [None]:
# This example uses the OpenAI API; replace the placeholders below with your own values.
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"  # TODO: set your OpenAI API key
mlflow.set_tracking_uri("your MLflow server")
mlflow.set_registry_uri("your MLflow server")
experiment_name = "memalign-demo"
experiment = mlflow.set_experiment(experiment_name)
experiment_id = experiment.experiment_id
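
Hard-coding an API key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is already exported in the environment that launched the notebook; `require_env` is a hypothetical helper written for this example, not part of MLflow:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example: fail fast before any judge call is made.
# api_key = require_env("OPENAI_API_KEY")
```

Failing early this way surfaces a missing key at setup time rather than in the middle of an evaluation run.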

# Step 1: Create an LLM Judge
We'll create a judge that evaluates whether customer service responses are helpful.

In [None]:
JUDGE_NAME = "helpfulness"

initial_judge = make_judge(
    name=JUDGE_NAME,
    instructions=(
        "Evaluate whether the customer support bot's response is helpful "
        "given the user query.\n\n"
        "User query: {{ inputs }}\n"
        "Assistant response: {{ outputs }}\n"
    ),
    feedback_value_type=bool,
    model="openai:/gpt-5_2",  # Judge model (an OpenAI model in this example)
)

print(f"Created judge: {initial_judge.name}")
print(f"Model: {initial_judge.model}")

# Step 2: Create Toy Datasets
We'll create two datasets:

1. **Alignment set** (5 examples): Used to teach the judge our preferences
2. **Test set** (5 examples): Used to evaluate the judge's performance
## The tricky case: Factually correct but emotionally cold
LLM judges often rate **factually correct responses as helpful**, even when they lack empathy. But in customer service, a cold, transactional response to a frustrated user is unhelpful: it should acknowledge the customer's emotions first.

In [None]:
# Alignment dataset - 4 easy examples + 1 tricky case
alignment_examples = [
    {
        "inputs": "What are your store hours?",
        "outputs": "We're open Monday to Friday, 9am to 6pm.",
        "is_helpful": True,
        "rationale": "Direct, accurate answer to a simple question.",
    },
    {
        "inputs": "Thanks for your help!",
        "outputs": "You're welcome! Let me know if you need anything else.",
        "is_helpful": True,
        "rationale": "Warm, friendly acknowledgment.",
    },
    {
        "inputs": "Can you help me track my order?",
        "outputs": "Figure it out yourself.",
        "is_helpful": False,
        "rationale": "Rude and dismissive.",
    },
    {
        "inputs": "I have a question about returns.",
        "outputs": "Whatever.",
        "is_helpful": False,
        "rationale": "Dismissive and unprofessional.",
    },
    # Tricky: Factually correct with solution, but lacks empathy
    {
        "inputs": "The sweater I ordered looks completely different from what was shown on the website.",
        "outputs": "Product colors may vary slightly due to lighting and display settings. You can initiate a return through your order history if needed.",
        "is_helpful": False,
        "rationale": "Response provides factual explanation and offers a solution (return option), "
                     "but fails to acknowledge the customer's disappointment. Should start with "
                     "'I'm sorry the product didn't meet your expectations' before explaining.",
    },
]

print(f"Created {len(alignment_examples)} alignment examples")

In [None]:
# Test dataset - 4 easy examples + 1 tricky case
test_examples = [
    {
        "inputs": "Do you offer gift wrapping?",
        "outputs": "Yes! You can select gift wrapping at checkout for $3.",
        "is_helpful": True,
        "rationale": "Helpful and informative.",
    },
    {
        "inputs": "I love your product!",
        "outputs": "Thank you so much! We're glad you're enjoying it.",
        "is_helpful": True,
        "rationale": "Warm acknowledgment of positive feedback.",
    },
    {
        "inputs": "How do I cancel my subscription?",
        "outputs": "Why would you want to do that? That's stupid.",
        "is_helpful": False,
        "rationale": "Insulting and unprofessional.",
    },
    {
        "inputs": "Is this item in stock?",
        "outputs": "I don't care.",
        "is_helpful": False,
        "rationale": "Rude and unhelpful.",
    },
    # Tricky: Factually correct but lacks empathy for frustrated user
    {
        "inputs": "I've been charged twice for my subscription this month. This is really frustrating!",
        "outputs": "We see two charges on your account because you updated your payment method. "
                   "One charge will be reversed automatically within 5-7 business days.",
        "is_helpful": False,
        "rationale": "Factually correct but too cold and transactional. "
                     "Should start with empathy (e.g., 'Sorry for the confusion') and end with "
                     "support-oriented language when responding to a frustrated customer.",
    },
]

print(f"Created {len(test_examples)} test examples")
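
With only five examples per set, it is worth confirming that every example has the expected fields and that both labels are represented. The sketch below shows one way to do that; `summarize_labels` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, and the two demo examples are stand-ins with the same shape as the datasets above.

```python
from collections import Counter

REQUIRED_KEYS = {"inputs", "outputs", "is_helpful", "rationale"}


def summarize_labels(examples):
    """Check that every example has the expected keys and count each label."""
    for i, example in enumerate(examples):
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            raise ValueError(f"Example {i} is missing keys: {sorted(missing)}")
    return Counter(example["is_helpful"] for example in examples)


# Two stand-in examples with the same shape as the datasets above.
demo_examples = [
    {"inputs": "q1", "outputs": "a1", "is_helpful": True, "rationale": "r1"},
    {"inputs": "q2", "outputs": "a2", "is_helpful": False, "rationale": "r2"},
]
label_counts = summarize_labels(demo_examples)
print(dict(label_counts))
```

In the notebook you would call `summarize_labels(alignment_examples)` and `summarize_labels(test_examples)`; each set as defined above contains two helpful and three unhelpful examples.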

# Step 3: Create Traces and Log Human Feedback
MemAlign learns from traces that have human feedback attached. We'll:

1. Create traces for each example
2. Log human feedback (ground truth) for alignment examples

You can log feedback either programmatically (as below) or through the MLflow UI (see [here](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/alignment/#collecting-feedback-for-alignment)).

In [None]:
# Step 1: Create all traces first (separate from feedback logging)
def create_traces(examples, prefix):
    """Create traces from examples."""
    trace_ids = []

    for i, example in enumerate(examples):
        with mlflow.start_span(f"{prefix}_{i}") as span:
            span.set_inputs({"inputs": example["inputs"]})
            span.set_outputs({"outputs": example["outputs"]})
            trace_ids.append(span.trace_id)

    return trace_ids

# Create traces for alignment and test sets
alignment_trace_ids = create_traces(alignment_examples, "alignment")
print(f"Created {len(alignment_trace_ids)} alignment traces")

test_trace_ids = create_traces(test_examples, "test")
print(f"Created {len(test_trace_ids)} test traces")
time.sleep(2)  # Ensure traces are committed before adding assessments
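
A fixed `time.sleep(2)` is a guess about how long the backend needs. One alternative is to poll a readiness check with a timeout. The sketch below is generic: `wait_until` is a hypothetical helper, and the stand-in predicate simply succeeds on its third call.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in predicate that succeeds on the third call.
calls = {"count": 0}


def ready():
    calls["count"] += 1
    return calls["count"] >= 3


succeeded = wait_until(ready, timeout=2.0, interval=0.01)
```

In this notebook the predicate might check that `mlflow.search_traces` returns all expected trace IDs; whether that is a sufficient readiness signal depends on your tracking backend.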

In [None]:
# Step 2: Log human feedback for alignment examples

for trace_id, example in zip(alignment_trace_ids, alignment_examples):
    mlflow.log_feedback(
        trace_id=trace_id,
        name=JUDGE_NAME,
        value=example["is_helpful"],
        rationale=example["rationale"],
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="human_expert"
        ),
    )

print(f"Logged human feedback for {len(alignment_trace_ids)} alignment traces")

# Step 4: Evaluate Baseline Judge Performance
Before alignment, let's see how the initial judge performs on both datasets. We expect the judge to make mistakes on edge cases like the tricky empathy examples.



In [None]:
def evaluate_judge(judge, examples, dataset_name):
    """Evaluate judge on examples and compute accuracy."""
    correct = 0
    results = []

    print(f"\n{'='*60}")
    print(f"Evaluating on {dataset_name} ({len(examples)} examples)")
    print(f"{'='*60}")

    for i, example in enumerate(examples):
        # Run judge
        feedback = judge(
            inputs=example["inputs"],
            outputs=example["outputs"]
        )

        predicted = feedback.value
        expected = example["is_helpful"]
        is_correct = predicted == expected

        if is_correct:
            correct += 1

        results.append({
            "example": i + 1,
            "predicted": predicted,
            "expected": expected,
            "correct": is_correct,
            "rationale": feedback.rationale[:100] + "..." if len(feedback.rationale) > 100 else feedback.rationale,
        })

        # Print result
        status = "CORRECT" if is_correct else "WRONG"
        print(f"\nExample {i+1}: [{status}]")
        print(f"  Input: {example['inputs'][:50]}...")
        print(f"  Predicted: {predicted}, Expected: {expected}")
        if not is_correct:
            print(f"  Judge rationale: {feedback.rationale[:150]}...")

    accuracy = correct / len(examples) * 100
    print(f"\n{'-'*60}")
    print(f"Accuracy: {correct}/{len(examples)} ({accuracy:.1f}%)")
    print(f"{'-'*60}")

    return accuracy, results
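
Accuracy alone hides which kind of mistake the judge makes. Because both tricky cases are labeled unhelpful, false positives (the judge says helpful, the human says unhelpful) are the errors to watch. The sketch below tallies a confusion matrix from result dicts that use the same `predicted` and `expected` keys as `evaluate_judge`; `confusion_counts` is a hypothetical helper, and it assumes both values are plain booleans.

```python
def confusion_counts(results):
    """Tally true/false positives and negatives from boolean predictions."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for row in results:
        predicted, expected = row["predicted"], row["expected"]
        if predicted and expected:
            counts["tp"] += 1
        elif not predicted and not expected:
            counts["tn"] += 1
        elif predicted and not expected:
            counts["fp"] += 1  # judge said helpful, human said unhelpful
        else:
            counts["fn"] += 1  # judge said unhelpful, human said helpful
    return counts


# Made-up results in the same shape as those returned by evaluate_judge.
demo_results = [
    {"predicted": True, "expected": True},
    {"predicted": False, "expected": False},
    {"predicted": True, "expected": False},
]
demo_counts = confusion_counts(demo_results)
print(demo_counts)
```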

In [None]:
# Evaluate baseline on alignment set
baseline_align_accuracy, baseline_align_results = evaluate_judge(
    initial_judge, alignment_examples, "Alignment Set (Baseline)"
)

In [None]:
# Evaluate baseline on test set
baseline_test_accuracy, baseline_test_results = evaluate_judge(
    initial_judge, test_examples, "Test Set (Baseline)"
)

# Step 5: Align the Judge with MemAlign
Now we'll use MemAlign to align the judge with our human feedback.

MemAlign will:

1. **Distill guidelines** from the feedback rationales (semantic memory)
2. **Store examples** for few-shot retrieval (episodic memory)
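
To make the two memories concrete, here is a toy sketch of the idea. It is not MemAlign's implementation: the vectors are hand-written 2-D stand-ins for real embeddings, and `cosine`, `top_k_similar`, and `build_prompt` are hypothetical helpers written for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec, memory, k):
    """Return the k stored examples whose vectors are closest to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(instructions, guidelines, examples):
    """Combine base instructions, distilled guidelines, and retrieved examples."""
    lines = [instructions, "", "Guidelines:"]
    lines += [f"- {g}" for g in guidelines]
    lines += ["", "Similar past examples:"]
    lines += [f"- {e['text']} -> {e['label']}" for e in examples]
    return "\n".join(lines)


# Toy episodic memory: hand-written vectors, not real embeddings.
episodic = [
    {"text": "cold reply to upset customer", "label": False, "vec": [1.0, 0.1]},
    {"text": "friendly thanks", "label": True, "vec": [0.0, 1.0]},
    {"text": "rude dismissal", "label": False, "vec": [0.7, 0.7]},
]
# Toy semantic memory: one guideline of the kind distilled from rationales.
semantic = ["Acknowledge the customer's frustration before explaining."]

retrieved = top_k_similar([1.0, 0.0], episodic, k=2)
prompt = build_prompt("Evaluate helpfulness.", semantic, retrieved)
print(prompt)
```

The guidelines apply to every evaluation, while the retrieved examples change with each query; the `k` here plays the same role as `retrieval_k` in the optimizer.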

In [None]:
# Create the MemAlign optimizer
optimizer = MemAlignOptimizer(
    reflection_lm="openai:/gpt-5_2",  # Model for distilling guidelines
    embedding_model="openai/text-embedding-3-large",  # Model for embeddings
    retrieval_k=3,  # Number of similar examples to retrieve during evaluation
)

print("Created MemAlign optimizer")

In [None]:
# Retrieve traces with human feedback for alignment
all_traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
    return_type="list",
)

alignment_traces = [
    trace for trace in all_traces
    if trace.info.trace_id in alignment_trace_ids
]

print(f"Retrieved {len(alignment_traces)} traces for alignment")
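
If the search returns fewer traces than expected, alignment would silently run on a subset. A small sketch of a check that reports which IDs were not found; `split_found_missing` is a hypothetical helper, and the IDs below are made up for illustration.

```python
def split_found_missing(candidate_ids, wanted_ids):
    """Return (found, missing): wanted IDs that do / do not appear among candidates."""
    available = set(candidate_ids)  # set lookup avoids rescanning a list per ID
    found = [trace_id for trace_id in wanted_ids if trace_id in available]
    missing = [trace_id for trace_id in wanted_ids if trace_id not in available]
    return found, missing


# Made-up IDs for illustration.
found, missing = split_found_missing(
    candidate_ids=["tr-1", "tr-2", "tr-9"],
    wanted_ids=["tr-1", "tr-2", "tr-3"],
)
print(f"found={found}, missing={missing}")
```

In the notebook, `candidate_ids` would be `[t.info.trace_id for t in all_traces]` and `wanted_ids` would be `alignment_trace_ids`; a non-empty `missing` list suggests the traces are not all available yet.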

In [None]:
# Align the judge
aligned_judge = initial_judge.align(
    traces=alignment_traces,
    optimizer=optimizer
)

print(f"\nAlignment complete!")

# Step 6: Inspect Learned Guidelines (Semantic Memory)
Let's see what guidelines MemAlign distilled from our feedback.

In [None]:
# View the full instructions (original + distilled guidelines)
print("\nFull Judge Instructions (with guidelines)")
print("="*60)
print(aligned_judge.instructions)

# Step 7: Evaluate Aligned Judge Performance
Let's see how the aligned judge performs compared to the baseline.

In [None]:
# Evaluate aligned judge on alignment set
aligned_align_accuracy, aligned_align_results = evaluate_judge(
    aligned_judge, alignment_examples, "Alignment Set (Aligned)"
)

In [None]:
# Evaluate aligned judge on test set
aligned_test_accuracy, aligned_test_results = evaluate_judge(
    aligned_judge, test_examples, "Test Set (Aligned)"
)

In [None]:
# Summary comparison
print("\n" + "="*60)
print("PERFORMANCE COMPARISON")
print("="*60)
print(f"\n{'Dataset':<25} {'Baseline':<15} {'Aligned':<15} {'Improvement':<15}")
print("-"*60)
print(f"{'Alignment Set':<25} {baseline_align_accuracy:<15.1f} {aligned_align_accuracy:<15.1f} {aligned_align_accuracy - baseline_align_accuracy:+.1f}%")
print(f"{'Test Set':<25} {baseline_test_accuracy:<15.1f} {aligned_test_accuracy:<15.1f} {aligned_test_accuracy - baseline_test_accuracy:+.1f}%")
print("-"*60)
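
The table reports aggregate change only. To see which individual examples flipped, the per-example results can be compared directly. A sketch, assuming both result lists cover the same examples in the same order and carry the `example` and `correct` keys built by `evaluate_judge`; `changed_examples` is a hypothetical helper and the data is made up.

```python
def changed_examples(before, after):
    """List examples whose correctness differs between two result lists."""
    changes = []
    for old, new in zip(before, after):
        if old["correct"] != new["correct"]:
            kind = "fixed" if new["correct"] else "regressed"
            changes.append((old["example"], kind))
    return changes


# Made-up results in the same shape as those returned by evaluate_judge.
demo_before = [
    {"example": 1, "correct": True},
    {"example": 2, "correct": False},
    {"example": 3, "correct": True},
]
demo_after = [
    {"example": 1, "correct": True},
    {"example": 2, "correct": True},
    {"example": 3, "correct": False},
]
demo_changes = changed_examples(demo_before, demo_after)
print(demo_changes)
```

In the notebook this would be called as `changed_examples(baseline_test_results, aligned_test_results)`.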


# Step 8: Unalign - Remove Specific Feedback
Sometimes you may want to remove specific examples from the judge's memory. For instance, if some feedback was incorrect or is no longer relevant.

Let's remove one of the alignment traces (say, the last one where the judge fails initially) and see how it affects the performance.

In [None]:
# Check current memory state (underscore-prefixed attributes are private and may change between versions)
print(f"Before unalignment:")
print(f"  Semantic memory: {len(aligned_judge._semantic_memory)} guidelines")
print(f"  Episodic memory: {len(aligned_judge._episodic_memory)} examples")

In [None]:
# Remove the last alignment example
traces_to_remove = [t for t in alignment_traces if t.info.trace_id == alignment_trace_ids[-1]]

print(f"Removing {len(traces_to_remove)} trace(s) from the judge's memory...")
for trace in traces_to_remove:
    print(f"  - Trace ID: {trace.info.trace_id}")

# Unalign
updated_judge = aligned_judge.unalign(traces=traces_to_remove)

In [None]:
# View updated instructions
print("\nUpdated instructions (after unalignment)")
print("="*60)
print(updated_judge.instructions)

In [None]:
# Evaluate updated judge on test set
updated_test_accuracy, updated_test_results = evaluate_judge(
    updated_judge, test_examples, "Test Set (After Unalignment)"
)


After unalignment, the guideline on response empathy should no longer appear in the instructions, and the judge's prediction on the relevant test example typically reverts to incorrect.
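
Rather than comparing the two printed instruction blocks by eye, a line diff can show what changed. The sketch below uses the standard-library `difflib`; `removed_lines` is a hypothetical helper, and the instruction strings are made up rather than actual MemAlign output.

```python
import difflib


def removed_lines(before: str, after: str):
    """Return lines present in `before` but absent from `after`, per a line diff."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff marks lines that exist only in the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]


# Made-up instruction strings for illustration.
demo_before = (
    "Evaluate whether the response is helpful.\n"
    "Guideline: be polite.\n"
    "Guideline: acknowledge frustration before explaining."
)
demo_after = (
    "Evaluate whether the response is helpful.\n"
    "Guideline: be polite."
)
demo_removed = removed_lines(demo_before, demo_after)
print(demo_removed)
```

In the notebook this would be `removed_lines(aligned_judge.instructions, updated_judge.instructions)`, assuming both attributes are plain strings as the `print` calls above suggest.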

# Step 9: Register the Judge as a Scorer
Finally, let's register the aligned judge so it can be used in future MLflow experiments. This allows you to:

- Use the judge consistently across experiments
- Share the judge with team members
- Track judge versions over time

In [None]:
# Register the aligned judge
registered_judge = aligned_judge.register()

print(f"Judge registered successfully!")
print(f"  Name: {registered_judge.name}")

In [None]:
# List all registered scorers
from mlflow.genai.scorers import list_scorers

scorers = list_scorers(experiment_id=experiment_id)
print(f"\nRegistered scorers in experiment:")
for scorer in scorers:
    print(f"  - {scorer.name} (model: {scorer.model})")

In [None]:
# Retrieve the registered scorer
from mlflow.genai.scorers import get_scorer

retrieved_judge = get_scorer(name="helpfulness", experiment_id=experiment_id)

print(f"Retrieved registered judge: {retrieved_judge.name}")

In [None]:
# Use the retrieved judge
test_result = retrieved_judge(
    inputs="I'm having trouble with my order and feeling frustrated.",
    outputs="I understand this is frustrating. Let me look into your order right away and help resolve this."
)

print(f"\nTest evaluation:")
print(f"  helpful: {test_result.value}")
print(f"  Rationale: {test_result.rationale}")


# Summary
In this notebook, we demonstrated the complete MemAlign workflow:

1. Created an LLM judge for evaluating response helpfulness
2. Prepared datasets with a tricky case: factually correct but emotionally cold responses
3. Evaluated baseline performance - the judge incorrectly rated cold responses as helpful
4. Aligned the judge using human feedback with MemAlign
5. Inspected learned guidelines - MemAlign learned that empathy matters
6. Evaluated improved performance - the aligned judge now considers emotional tone
7. Unaligned specific traces - removed feedback from the judge's memory
8. Registered the judge for use in future experiments
## Key takeaways:
- MemAlign captures nuance: It learned that factual correctness alone isn't enough
- Dual memory system: Guidelines (semantic) + examples (episodic) provide robust alignment
- Incremental updates: Use .align() to add feedback and .unalign() to remove it
- Persistence: Register judges to share and reuse across experiments

# Cleanup (optional) - delete the registered scorer

In [None]:
from mlflow.genai.scorers import delete_scorer

delete_scorer(name="helpfulness", experiment_id=experiment_id, version="all")