# MemAlign: Aligning LLM Judges with Human Feedback
This notebook demonstrates how to use MemAlign to align an LLM judge with human preferences.

MemAlign uses a dual-memory system:

- **Semantic Memory**: Distills general guidelines from human feedback patterns
- **Episodic Memory**: Retrieves similar past examples using embeddings for few-shot learning

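
As a rough mental model (not MemAlign's actual implementation), the two memories can be pictured as a list of guideline strings plus a store of labeled examples. The sketch below is illustrative only: the class and field names are hypothetical, and simple word overlap stands in for the embedding similarity that the episodic memory relies on.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One piece of human feedback kept in episodic memory."""
    query: str
    response: str
    label: bool
    rationale: str


@dataclass
class ToyDualMemory:
    """Hypothetical stand-in for a dual-memory judge; names are illustrative only."""
    guidelines: list[str] = field(default_factory=list)    # semantic memory
    examples: list[Example] = field(default_factory=list)  # episodic memory

    def retrieve(self, query: str, k: int = 1) -> list[Example]:
        # Word overlap is a crude placeholder for embedding similarity.
        words = set(query.lower().split())
        ranked = sorted(
            self.examples,
            key=lambda ex: len(words & set(ex.query.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def build_prompt(self, base_instructions: str, query: str, k: int = 1) -> str:
        # Guidelines apply to every query; examples are selected per query.
        lines = [base_instructions, "", "Guidelines:"]
        lines += [f"- {g}" for g in self.guidelines]
        lines += ["", "Similar past examples:"]
        for ex in self.retrieve(query, k):
            lines.append(f"- Q: {ex.query} | A: {ex.response} | helpful={ex.label}")
        return "\n".join(lines)


memory = ToyDualMemory(
    guidelines=["Acknowledge frustration before explaining."],
    examples=[
        Example("What are your store hours?", "9am to 6pm.", True, "Direct answer."),
        Example("My sweater looks different from the website.", "Colors may vary.", False, "No empathy."),
    ],
)
prompt = memory.build_prompt("Is the response helpful?", "The sweater I got looks wrong.")
print(prompt)
```

The design point is that guidelines are always included, while examples are chosen per query, so the prompt stays short as the example store grows.
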
## What you'll learn:
You'll learn how to create an LLM judge for evaluating responses and how to use human feedback to align that judge with human preferences, improving its accuracy. MemAlign also supports unalignment, which removes specific pieces of feedback from the judge's memory, for instance when they are no longer useful or contain sensitive data that must be deleted. Lastly, we'll register the judge to the experiment so that it is persisted and can be used to monitor production traffic.

# Setup
First, let's import the required modules and set up the environment.

In [1]:
!pip install --upgrade "mlflow>=3.9.0" litellm dspy jinja2 tqdm databricks-agents


In [1]:
import os

import mlflow
from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import MemAlignOptimizer
from mlflow.entities import AssessmentSource, AssessmentSourceType



## Set up your provider and model

In [2]:
# For example, to use the OpenAI API, provide your API key below:
os.environ["OPENAI_API_KEY"] = ""  # TODO: set your OpenAI API key
mlflow.set_tracking_uri("")  # TODO: set your MLflow tracking URI
mlflow.set_registry_uri("")  # TODO: set your MLflow registry URI
experiment_name = "memalign-demo"
experiment = mlflow.set_experiment(experiment_name)
experiment_id = experiment.experiment_id

2026/02/05 23:02:59 INFO mlflow.tracking.fluent: Experiment with name 'memalign-demo' does not exist. Creating a new experiment.


# Step 1: Create an LLM Judge
We'll create a judge that evaluates whether customer service responses are helpful.

In [3]:
JUDGE_NAME = "helpfulness"

initial_judge = make_judge(
    name=JUDGE_NAME,
    instructions=(
        "Evaluate whether the customer support bot's response is helpful "
        "given the user query.\n\n"
        "User query: {{ inputs }}\n"
        "Assistant response: {{ outputs }}\n"
    ),
    feedback_value_type=bool,
    model="openai:/gpt-5.2",
)
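
The `{{ inputs }}` and `{{ outputs }}` placeholders in the instructions stand for the data of the trace being judged. The snippet below only illustrates that kind of substitution with a hypothetical `render` helper built on a regular expression; MLflow does its own rendering internally, and its behavior may differ from this sketch.

```python
import re

TEMPLATE = (
    "Evaluate whether the customer support bot's response is helpful "
    "given the user query.\n\n"
    "User query: {{ inputs }}\n"
    "Assistant response: {{ outputs }}\n"
)


def render(template: str, **values: str) -> str:
    """Replace each {{ name }} placeholder with the matching keyword argument."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for placeholder '{name}'")
        return values[name]

    # \s* tolerates optional spaces inside the braces, e.g. {{inputs}} or {{ inputs }}.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rendered = render(
    TEMPLATE,
    inputs="What are your store hours?",
    outputs="We're open Monday to Friday, 9am to 6pm.",
)
print(rendered)
```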

# Step 2: Create Alignment Traces with Human Feedback
MemAlign learns from traces that have human feedback attached. We'll create traces and log human feedback in a single step using a helper function.

You can log feedback either programmatically (as below) or through the MLflow UI (see [here](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/alignment/#collecting-feedback-for-alignment)).

## The tricky case: Factually correct but emotionally cold
LLM judges often rate **factually correct responses as helpful**, even when they lack empathy. In customer service, however, a cold, transactional response to a frustrated user is unhelpful; it should acknowledge the customer's emotions first.

In [None]:
feedback_source = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN,
    source_id="human_expert",
)

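def create_trace(inputs, outputs, feedback):
    """Log a one-span trace, attach human feedback to it, and return the stored trace."""
    with mlflow.start_span("alignment_trace") as span:
        span.set_inputs({"inputs": inputs})
        span.set_outputs({"outputs": outputs})
        trace_id = span.trace_id

    # Make sure the trace has been written before attaching the assessment.
    mlflow.flush_trace_async_logging()
    mlflow.log_assessment(trace_id, feedback)
    # Re-fetch so the returned trace includes the logged feedback.
    return mlflow.get_trace(trace_id)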

alignment_traces = [
    create_trace(
        inputs="What are your store hours?",
        outputs="We're open Monday to Friday, 9am to 6pm.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=True,
            rationale="Direct, accurate answer to a simple question.",
            source=feedback_source,
        ),
    ),
    create_trace(
        inputs="Thanks for your help!",
        outputs="You're welcome! Let me know if you need anything else.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=True,
            rationale="Warm, friendly acknowledgment.",
            source=feedback_source,
        ),
    ),
    create_trace(
        inputs="Can you help me track my order?",
        outputs="Figure it out yourself.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=False,
            rationale="Rude and dismissive.",
            source=feedback_source,
        ),
    ),
    create_trace(
        inputs="I have a question about returns.",
        outputs="Whatever.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=False,
            rationale="Dismissive and unprofessional.",
            source=feedback_source,
        ),
    ),
    # Tricky: Factually correct with solution, but lacks empathy
    create_trace(
        inputs="The sweater I ordered looks completely different from what was shown on the website.",
        outputs=(
            "Product colors may vary slightly due to lighting and display settings. "
            "You can initiate a return through your order history if needed."
        ),
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=False,
            rationale=(
                "Response provides factual explanation and offers a solution (return option), "
                "but fails to acknowledge the customer's disappointment. Should start with "
                "'I'm sorry the product didn't meet your expectations' before explaining."
            ),
            source=feedback_source,
        ),
    ),
]

In [None]:
# Additional traces to evaluate judge performance (held out from alignment)
test_traces = [
    create_trace(
        inputs="Do you offer gift wrapping?",
        outputs="Yes! You can select gift wrapping at checkout for $3.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=True,
            rationale="Helpful and informative.",
            source=feedback_source,
        ),
    ),
    create_trace(
        inputs="I love your product!",
        outputs="Thank you so much! We're glad you're enjoying it.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=True,
            rationale="Warm acknowledgment of positive feedback.",
            source=feedback_source,
        ),
    ),
    create_trace(
        inputs="How do I cancel my subscription?",
        outputs="Why would you want to do that? That's stupid.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=False,
            rationale="Insulting and unprofessional.",
            source=feedback_source,
        ),
    ),
    create_trace(
        inputs="Is this item in stock?",
        outputs="I don't care.",
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=False,
            rationale="Rude and unhelpful.",
            source=feedback_source,
        ),
    ),
    # Tricky: Factually correct but lacks empathy for frustrated user
    create_trace(
        inputs="I've been charged twice for my subscription this month. This is really frustrating!",
        outputs=(
            "We see two charges on your account because you updated your payment method. "
            "One charge will be reversed automatically within 5-7 business days."
        ),
        feedback=mlflow.entities.Feedback(
            name=JUDGE_NAME,
            value=False,
            rationale=(
                "Factually correct but too cold and transactional. "
                "Should start with empathy (e.g., 'Sorry for the confusion') and end with "
                "support-oriented language when responding to a frustrated customer."
            ),
            source=feedback_source,
        ),
    ),
]
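
Because the test set is meant to be held out, it can be worth confirming that no query appears in both sets and that each set contains both labels. This is a minimal sketch over the query strings and human labels copied from the two cells above; the helper name `summarize` is an illustrative choice, not part of MLflow.

```python
from collections import Counter

# (query, human label) pairs copied from the alignment and test cells.
alignment_set = [
    ("What are your store hours?", True),
    ("Thanks for your help!", True),
    ("Can you help me track my order?", False),
    ("I have a question about returns.", False),
    ("The sweater I ordered looks completely different from what was shown on the website.", False),
]
test_set = [
    ("Do you offer gift wrapping?", True),
    ("I love your product!", True),
    ("How do I cancel my subscription?", False),
    ("Is this item in stock?", False),
    ("I've been charged twice for my subscription this month. This is really frustrating!", False),
]


def summarize(pairs):
    """Count how many examples carry each human label."""
    return Counter(label for _, label in pairs)


# Queries present in both sets would leak test data into alignment.
overlap = {q for q, _ in alignment_set} & {q for q, _ in test_set}
print("Overlapping queries:", len(overlap))
print("Alignment labels:", dict(summarize(alignment_set)))
print("Test labels:", dict(summarize(test_set)))
```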


# Step 3: Evaluate Baseline Judge Performance
Before alignment, let's see how the initial judge performs. We expect the judge to make mistakes on edge cases like the tricky empathy examples.

But first, let's define some helper functions for measuring the judge's performance.

In [7]:
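def get_human_label(trace, judge_name):
    """Return the human-provided label for this judge, or None if there is none."""
    for assessment in trace.info.assessments:
        if (assessment.name == judge_name and
            assessment.source.source_type == "HUMAN"):
            return assessment.value
    return None


def compute_accuracy(traces, judge_to_test):
    """Fraction of human-labeled traces on which the judge agrees with the human."""
    correct = 0
    total = 0
    for trace in traces:
        human_label = get_human_label(trace, judge_to_test.name)
        if human_label is None:
            continue  # skip traces without human feedback
        result = judge_to_test(trace=trace)
        if result.value == human_label:
            correct += 1
        total += 1
    # Avoid ZeroDivisionError when no trace carries a human label.
    return correct / total if total else 0.0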


def evaluate_and_print(traces, judge, label):
    accuracy = compute_accuracy(traces, judge)
    print(f"  {label}: {accuracy:.0%}")
    return accuracy


def display_comparison(baseline_align, baseline_test, aligned_align, aligned_test):
    """Print a table comparing baseline and aligned accuracy on both sets."""
    rule = "=" * 60
    dash = "-" * 60
    print(
        f"\n{rule}\n"
        "PERFORMANCE COMPARISON\n"
        f"{rule}\n"
        f"\n{'Dataset':<25} {'Baseline':<15} {'Aligned':<15} {'Change':<15}\n"
        f"{dash}\n"
        f"{'Alignment Set':<25} {baseline_align:<15.0%} {aligned_align:<15.0%} {(aligned_align - baseline_align):+.0%}\n"
        f"{'Test Set':<25} {baseline_test:<15.0%} {aligned_test:<15.0%} {(aligned_test - baseline_test):+.0%}\n"
        f"{dash}"
    )


In [8]:
baseline_align_accuracy = evaluate_and_print(alignment_traces, initial_judge, "Alignment traces")
baseline_test_accuracy = evaluate_and_print(test_traces, initial_judge, "Test traces")

  Alignment traces: 100%
  Test traces: 80%
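
With five test traces, 80% means that four of the five judge verdicts matched the human label. Accuracy alone does not say in which direction the error went, so a breakdown by outcome can help. The sketch below uses hypothetical per-trace verdicts (the notebook does not print them), chosen so that the single miss is a cold response rated as helpful; `outcome_counts` is an illustrative helper.

```python
from collections import Counter


def outcome_counts(pairs):
    """Tally (judge verdict, human label) pairs into confusion-matrix cells."""
    names = {
        (True, True): "true_positive",
        (False, False): "true_negative",
        (True, False): "false_positive",   # judge says helpful, human disagrees
        (False, True): "false_negative",   # judge says unhelpful, human disagrees
    }
    return Counter(names[(judge, human)] for judge, human in pairs)


# Hypothetical (judge, human) verdicts for five test traces.
pairs = [(True, True), (True, True), (False, False), (False, False), (True, False)]

counts = outcome_counts(pairs)
accuracy = (counts["true_positive"] + counts["true_negative"]) / len(pairs)
print(dict(counts))
print(f"Accuracy: {accuracy:.0%}")
```

A judge that errs only through false positives is too lenient, which is the pattern that empathy-focused feedback is meant to correct.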


# Step 4: Align the Judge with MemAlign
Now we'll use MemAlign to align the judge with our human feedback.

MemAlign will:

1. **Distill guidelines** from the feedback rationales (semantic memory)
2. **Store examples** for few-shot retrieval (episodic memory)

In [9]:
optimizer = MemAlignOptimizer(
    reflection_lm="openai:/gpt-5.2",
    embedding_model="openai:/text-embedding-3-large",
    retrieval_k=3,
)
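
The `retrieval_k` argument presumably controls how many episodic examples are retrieved for each evaluation (an interpretation of the parameter name rather than a documented guarantee). Similarity search over embeddings is commonly done with cosine similarity; the sketch below uses made-up three-dimensional vectors in place of real embeddings, and the function names are illustrative rather than MemAlign's own.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, memory, k=3):
    """Return the names of the k stored items most similar to the query vector."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up 3-d "embeddings"; real embedding vectors have far more dimensions.
memory = [
    ("store hours", [1.0, 0.0, 0.0]),
    ("thanks", [0.0, 1.0, 0.0]),
    ("track order", [0.7, 0.1, 0.6]),
    ("returns question", [0.2, 0.1, 0.9]),
    ("sweater complaint", [0.1, 0.2, 1.0]),
]

query = [0.0, 0.1, 1.0]  # pretend embedding of a new complaint
print(top_k(query, memory, k=3))
```

If `k` exceeds the number of stored examples, this sketch simply returns everything it has.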

In [10]:
aligned_judge = initial_judge.align(
    traces=alignment_traces,
    optimizer=optimizer
)

Distilling guidelines: 100%|██████████| 1/1 [00:03<00:00,  3.51s/it]


# Step 5: Inspect Learned Guidelines (Semantic Memory)
Let's see what guidelines MemAlign distilled from our feedback.

In [11]:
print(aligned_judge.instructions)

Evaluate whether the customer support bot's response is helpful given the user query.

User query: {{ inputs }}
Assistant response: {{ outputs }}


Distilled Guidelines (4):
  - Maintain a polite, professional tone; never be rude, dismissive, or curt when customers ask for help.
  - For complaints or dissatisfaction, explicitly acknowledge and empathize (e.g., apologize) before giving factual explanations or next-step solutions.
  - For straightforward informational questions, give a direct and accurate answer without unnecessary extra content.
  - When users express gratitude, respond with a warm, friendly acknowledgment and optionally offer further help.



# Step 6: Evaluate Aligned Judge Performance
Let's see how the aligned judge performs compared to the baseline.

In [12]:
aligned_align_accuracy = evaluate_and_print(alignment_traces, aligned_judge, "Alignment traces")
aligned_test_accuracy = evaluate_and_print(test_traces, aligned_judge, "Test traces")

  Alignment traces: 100%
  Test traces: 100%


In [13]:
display_comparison(baseline_align_accuracy, baseline_test_accuracy,
                  aligned_align_accuracy, aligned_test_accuracy)


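============================================================
PERFORMANCE COMPARISON
============================================================
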
Dataset                   Baseline        Aligned         Change         
------------------------------------------------------------
Alignment Set             100%            100%            +0%
Test Set                  80%             100%            +20%
------------------------------------------------------------



# Step 7: Unalign - Remove Specific Feedback
Sometimes you may want to remove specific examples from the judge's memory, for instance if some feedback was incorrect or is no longer relevant.

Let's remove one of the alignment traces (the last one, the tricky empathy example) and see how it affects performance.

In [14]:
print("Before unalignment:")
# Note: underscore-prefixed attributes are private and may change between MLflow versions.
print(f"  Semantic memory: {len(aligned_judge._semantic_memory)} guidelines")
print(f"  Episodic memory: {len(aligned_judge._episodic_memory)} examples")

Before unalignment:
  Semantic memory: 4 guidelines
  Episodic memory: 5 examples


In [15]:
traces_to_remove = [alignment_traces[-1]]
updated_judge = aligned_judge.unalign(traces=traces_to_remove)

In [16]:
print(updated_judge.instructions)

Evaluate whether the customer support bot's response is helpful given the user query.

User query: {{ inputs }}
Assistant response: {{ outputs }}


Distilled Guidelines (3):
  - Maintain a polite, professional tone; never be rude, dismissive, or curt when customers ask for help.
  - For straightforward informational questions, give a direct and accurate answer without unnecessary extra content.
  - When users express gratitude, respond with a warm, friendly acknowledgment and optionally offer further help.



In [17]:
updated_test_accuracy = evaluate_and_print(test_traces, updated_judge, "Test traces")

  Test traces: 80%



After unalignment, the guideline on response empathy is gone from the instructions, and test accuracy drops back to 80%: the judge again gets the relevant test example wrong.
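
To confirm which guidelines an unalignment dropped, the two instruction strings can be compared directly. The following sketch parses the bulleted guidelines out of text shaped like the printed output above (abbreviated here) and assumes that format stays stable; `extract_guidelines` is a hypothetical helper, not part of MLflow.

```python
def extract_guidelines(instructions: str) -> list[str]:
    """Collect the bullet lines that follow a 'Distilled Guidelines' header."""
    guidelines, in_section = [], False
    for line in instructions.splitlines():
        stripped = line.strip()
        if stripped.startswith("Distilled Guidelines"):
            in_section = True
        elif in_section and stripped.startswith("- "):
            guidelines.append(stripped[2:])
    return guidelines


# Abbreviated copies of the instructions printed before and after unalignment.
before = """Evaluate whether the response is helpful.

Distilled Guidelines (4):
  - Maintain a polite, professional tone.
  - For complaints, acknowledge and empathize before explaining.
  - For straightforward questions, give a direct answer.
  - When users express gratitude, respond warmly.
"""
after = """Evaluate whether the response is helpful.

Distilled Guidelines (3):
  - Maintain a polite, professional tone.
  - For straightforward questions, give a direct answer.
  - When users express gratitude, respond warmly.
"""

remaining = extract_guidelines(after)
removed = [g for g in extract_guidelines(before) if g not in remaining]
print("Removed guidelines:", removed)
```

In the notebook, `before` and `after` would be `aligned_judge.instructions` and `updated_judge.instructions`.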

# Step 8: Register the Judge as a Scorer
Finally, let's register the aligned judge so it can be used in future MLflow experiments. This allows you to:

- Use the judge consistently across experiments
- Share the judge with team members
- Track judge versions over time

In [18]:
registered_judge = aligned_judge.register()

In [19]:
from mlflow.genai import list_scorers

scorers = list_scorers(experiment_id=experiment_id)
print("\nRegistered scorers in experiment:")
for scorer in scorers:
    print(f"  - {scorer.name} (model: {scorer.model})")


Registered scorers in experiment:
  - helpfulness (model: openai:/gpt-5.2)


In [20]:
from mlflow.genai import get_scorer

retrieved_judge = get_scorer(name="helpfulness", experiment_id=experiment_id)

In [21]:
test_result = retrieved_judge(
    inputs="I'm having trouble with my order and feeling frustrated.",
    outputs="I understand this is frustrating. Let me look into your order right away and help resolve this."
)
print(f"Helpful: {test_result.value}\n\nRationale: {test_result.rationale}")

Helpful: True
Rationale: The response is polite and professional, acknowledges the customer’s frustration with empathetic language, and offers immediate help to investigate and resolve the order issue. This aligns well with the need to validate feelings before moving to next steps. While it could be even more helpful by asking for an order number or specific issue details, it is still a supportive and appropriate first reply.



# Summary
In this notebook, we demonstrated the complete MemAlign workflow:

1. Created an LLM judge for evaluating response helpfulness
2. Created traces with human feedback, including a tricky case: factually correct but emotionally cold responses
3. Evaluated baseline performance - the judge incorrectly rated cold responses as helpful
4. Aligned the judge using human feedback with MemAlign
5. Inspected learned guidelines - MemAlign learned that empathy matters
6. Evaluated improved performance - the aligned judge now considers emotional tone
7. Unaligned specific traces - removed feedback from the judge's memory
8. Registered the judge for use in future experiments

## Key takeaways:
- MemAlign captures nuance: It learned that factual correctness alone isn't enough
- Dual memory system: Guidelines (semantic) + examples (episodic) provide robust alignment
- Incremental updates: Use .align() to add feedback and .unalign() to remove it
- Persistence: Register judges to share and reuse across experiments

# Cleanup (optional) - delete the registered scorer

In [36]:
from mlflow.genai.scorers import delete_scorer

delete_scorer(name="helpfulness", experiment_id=experiment_id, version="all")