In [1]:
# add parent directory to path
import sys
sys.path.append('..')

import utils.utils as utils

utils.query_llm(
    prompt="""hey""",
    system_prompt_included=False,
    is_hippa=True,
    model="openai-gpt-4o-high-quota-chat",
)

'Hello! How can I assist you today?'

These are diverse explanations of DPO, generated with ChatGPT through the following sequence of prompts:

1. "Explain DPO (Direct Preference Optimization)"
2. "Give this to me as unstructured text."
3. "Give me a bunch of different versions of these explanations. give me like ten different versions"
4. "make each one longer and more detailed"
5. "Give these explanations as a list of strings in Python. If the text involves mathematical expressions write it in latex text"

In [1]:
dpo_explanations = [
    # 1. Simple and Friendly
    """DPO stands for Direct Preference Optimization. It’s a way to teach AI models—like chatbots or writing assistants—to give better, more helpful answers. Usually, when we try to make a model better, we use a process called reinforcement learning, which is kind of like giving the model a score for every answer it gives and telling it to aim for higher scores. But that can be complicated, expensive, and hard to get right.

DPO takes a simpler path. Instead of scores, it just shows the model two different answers to the same question and says, “This one is better.” That’s it. The model then tries to increase the chances of generating the better answer next time. By repeating this process over and over, the model gradually learns what kinds of answers people like more.

It’s like training by comparison—showing it lots of “A is better than B” examples. The result is a more human-aligned model, with fewer complications along the way.""",

    # 2. Technical and Precise
    """Direct Preference Optimization (DPO) is a scalable and stable alternative to reinforcement learning-based preference fine-tuning methods like RLHF. The core idea is to directly optimize a pairwise preference objective without needing to explicitly learn or sample from a reward model.

Formally, DPO receives a dataset of triplets \\((x, y^+, y^-)\\), where \\(x\\) is a prompt, \\(y^+\\) is a preferred response, and \\(y^-\\) is a less preferred one. The model learns to assign a higher conditional probability to \\(y^+\\) than to \\(y^-\\). Rather than optimizing this with a typical classification loss, DPO introduces a log-ratio term between the target policy and a frozen reference policy, resulting in the following loss derived from the Bradley-Terry model:

\\[
\\mathcal{L}_{\\text{DPO}} = -\\log \\sigma\\left( \\beta \\cdot \\left[ \\log \\pi_\\theta(y^+|x) - \\log \\pi_\\theta(y^-|x) - \\log \\pi_{\\text{ref}}(y^+|x) + \\log \\pi_{\\text{ref}}(y^-|x) \\right] \\right)
\\]

Here, \\(\\pi_\\theta\\) is the current model, \\(\\pi_{\\text{ref}}\\) is the frozen reference model, and \\(\\beta\\) is a temperature hyperparameter.""",

    # 3. Metaphor-Driven
    """Imagine you’re teaching someone to be a movie critic. You don’t hand them a rulebook or a numerical rubric. Instead, you sit them down and say, “Between these two movie reviews, I like this one more.” Over time, with enough of these comparisons, they develop a sense of taste that aligns with yours.

That’s the essence of DPO. Rather than giving a model a reward score or training it to maximize an abstract reward, you simply give it a lot of examples of what people prefer. Given two possible answers to a question, you mark which one you like more. The model uses these comparisons to shift itself gradually toward responses that align with human preferences.

This approach is much easier and more stable than traditional reinforcement learning. You don’t have to engineer a reward function or worry about unstable training dynamics. You just compare A vs. B, and the model learns which direction to go. It’s preference learning boiled down to its simplest, most intuitive form.""",

    # 4. Academic Summary
    """Direct Preference Optimization (DPO) presents a significant simplification of preference alignment for language models by eliminating the need for a learned reward model or policy gradient-based fine-tuning. Traditional methods like RLHF require two major steps: (1) training a reward model to predict human preference scores, and (2) using reinforcement learning (e.g., PPO) to train the model to maximize these scores. This process is often unstable, sensitive to hyperparameters, and computationally intensive.

DPO reframes preference learning as a binary classification problem: given a prompt and two candidate completions—one preferred by a human annotator and one not—the model is trained to increase the probability of the preferred response relative to the other. Crucially, it introduces a reference model to anchor the comparison, preventing distributional collapse and providing a stable baseline.

The method is theoretically grounded in the Bradley-Terry model and operationalized through a logit difference scaled by a temperature parameter. Empirical results demonstrate that DPO achieves alignment quality comparable to or exceeding RLHF baselines with a fraction of the implementation and training complexity.""",

    # 5. For Newcomers
    """Direct Preference Optimization is a way to train AI models to better reflect what people actually want. Let’s say you ask a chatbot a question, and it gives you two possible answers. A human looks at them and says, “I like this one more.” DPO takes that feedback and uses it to teach the model: give answers more like this one, and less like the other.

It does this by comparing how likely the model thinks each answer is, and then adjusting its behavior so that it leans toward the better one. No complex reward scores or advanced math needed—just comparisons. Over time, with enough examples, the model becomes more aligned with human values and preferences.

Because it skips over the complicated parts of reinforcement learning, DPO is faster, simpler, and easier to trust.""",

    # 6. For Engineers
    """DPO is a supervised learning-based approach to preference fine-tuning. Instead of building a reward model and using PPO or other reinforcement learning tools, DPO operates directly on preference pairs: \\((x, y^+, y^-)\\).

The loss encourages the model to prefer \\(y^+\\) over \\(y^-\\) in terms of conditional log-probability, using a contrastive objective. To prevent model collapse or mode drift, the log-probabilities are compared relative to a frozen reference model (e.g., the original pre-trained base). The reference model acts like a stabilizer: it measures how much the new model is shifting its behavior, and DPO penalizes overly aggressive updates.

This makes DPO:
- Easy to implement using standard backprop
- Compatible with any causal LM
- Efficient in compute (no rollouts or reward sampling)

If you’ve used pairwise ranking losses before (e.g., for retrieval), DPO will feel very familiar—just applied in a generative context.""",

    # 7. For Product Teams or Executives
    """DPO, or Direct Preference Optimization, is a technique that helps AI models behave more in line with human expectations. It’s useful in situations where you care about how the model responds—whether it's polite, helpful, safe, or on-brand.

The traditional way to do this is to first build a reward model that tries to quantify what “good” means, then train the model to maximize that reward. But that process is expensive, complex, and can produce weird or unstable results.

DPO skips all of that. Instead, it just trains the model using comparisons. Show it two possible answers, and mark which one a human prefers. The model then learns to generate responses more like the one that was picked. Over time, this produces a much more aligned, user-friendly model, without the need for complicated reward engineering.

This simplicity makes DPO especially appealing in fast-moving product environments, where model behavior needs to be controlled, measurable, and easy to improve.""",

    # 8. Jargon-Free Analogy
    """Training a model with DPO is like teaching someone to write good emails. You don’t need to explain grammar rules or give them scores. You just show them examples: “Here are two versions of a reply—this one is better.” The more of those you give, the more they learn what “better” means to you.

DPO trains models in exactly that way. Instead of scoring every possible answer, it just gives the model lots of these A-vs-B comparisons. Over time, the model starts to naturally prefer the kinds of responses humans like more.

It’s like raising a model on good taste—not through rigid rules, but through consistent, comparative feedback.""",

    # 9. One-Liner + Expansion
    """DPO fine-tunes models to match human preferences using a simple, stable loss over pairwise comparisons—no reinforcement learning needed.

DPO simplifies preference learning by treating it as a supervised classification problem. For each prompt and a pair of candidate responses, it teaches the model to favor the preferred one based on how much more likely it should be, compared to the baseline model’s beliefs. The method avoids the complexity and instability of reward models and RL algorithms like PPO, and yields competitive results using only standard tools like backpropagation and log-likelihoods. It's the easiest way to align models with human preferences without a full reinforcement learning stack.""",

    # 10. Critical Comparison Style
    """Reinforcement Learning from Human Feedback (RLHF) has been the dominant paradigm for preference alignment in LLMs, but it comes with significant costs: reward model training, rollout instability, and brittle convergence. DPO addresses these problems head-on with a simpler, more robust solution.

Rather than estimating reward functions, DPO directly optimizes the preference relation between response pairs. It leverages log-likelihood differences under the model and a frozen reference to create a stable contrastive loss. This results in a model that prefers human-validated responses without needing to simulate full reward optimization.

In practice, DPO has shown that you can achieve RLHF-quality results without the infrastructure burden, making it ideal for organizations looking to scale aligned LLMs more affordably and reliably."""
]
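# Join the explanations with '^^^^' as a separator
explanations_text = '^^^^'.join(dpo_explanations)

# Save to file; the text contains non-ASCII characters (curly quotes, dashes),
# so set the encoding explicitly instead of relying on the platform default
with open("../data/arxiv/list_of_dpo_explanations_chatgpt_generated.txt", "w", encoding="utf-8") as f:
    f.write(explanations_text)

print(f"Saved {len(dpo_explanations)} explanations to ../data/arxiv/list_of_dpo_explanations_chatgpt_generated.txt")

# Total character count of the joined text, separators included
print(len(explanations_text))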

Saved 10 explanations to ../data/arxiv/list_of_dpo_explanations_chatgpt_generated.txt
9164
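
The loss quoted in explanation 2 can be checked numerically. The following is a minimal sketch in plain Python, not part of the original notebook: the function name `dpo_loss` and the example log-probabilities are made up for illustration, and the inputs are assumed to be summed sequence log-probabilities for a single preference pair.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    margin = [log pi_theta(y+|x) - log pi_theta(y-|x)]
           - [log pi_ref(y+|x)   - log pi_ref(y-|x)]
    """
    margin = (policy_logp_chosen - policy_logp_rejected) \
        - (ref_logp_chosen - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); pick the branch that
    # never exponentiates a large positive number.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical log-probabilities, chosen only to illustrate the behaviour.
same_as_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
prefers_chosen = dpo_loss(-8.0, -14.0, -10.0, -12.0)
prefers_rejected = dpo_loss(-14.0, -8.0, -10.0, -12.0)

print(same_as_reference, prefers_chosen, prefers_rejected)
```

When the policy matches the reference the margin is zero and the loss is log 2. It falls below log 2 when the policy favors the preferred response more strongly than the reference does, and rises above it in the opposite case.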


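Let's automate this: query gpt-4.1 repeatedly until the generated explanations add up to at least 9164 characters, the length of the ChatGPT-generated set above.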

In [7]:
dpo_explanations = []
target_length = 9164
current_length = 0

# Keep sampling until the combined length of the responses reaches the target
while current_length < target_length:
    prompt = {}
    prompt['system'] = """You are a teacher that explains concepts in a way that is easy to understand but also fully accurate and detailed. Please make sure the output is in paragraph form."""
    prompt['user'] = """Explain DPO (Direct Preference Optimization)"""

    # temperature=1 so that repeated calls with the same prompt give different explanations
    response = utils.query_llm(prompt=prompt, temperature=1, system_prompt_included=True, is_hippa=False, model="gpt-4.1")

    dpo_explanations.append(response)
    # Counts response characters only; the '^^^^' separators are added when joining
    current_length += len(response)

    print(f"Generated explanation {len(dpo_explanations)}, current length: {current_length}")

# Join the explanations with '^^^^' as a separator
explanations_text = '^^^^'.join(dpo_explanations)

# Save to file, with an explicit encoding in case the responses contain non-ASCII characters
with open("../data/arxiv/list_of_dpo_explanations_gpt_4_1_generated.txt", "w", encoding="utf-8") as f:
    f.write(explanations_text)

print(f"Saved {len(dpo_explanations)} explanations to ../data/arxiv/list_of_dpo_explanations_gpt_4_1_generated.txt")
print(f"Final length: {len(explanations_text)}")

Generated explanation 1, current length: 1710
Generated explanation 2, current length: 3129
Generated explanation 3, current length: 5028
Generated explanation 4, current length: 6867
Generated explanation 5, current length: 8530
Generated explanation 6, current length: 10014
Saved 6 explanations to ../data/arxiv/list_of_dpo_explanations_gpt_4_1_generated.txt
Final length: 10034
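
The final length (10034) is larger than the last running total (10014) because the running total counts only the responses, while the saved text also contains the `'^^^^'` separators. The sketch below reproduces that arithmetic from the logged totals, using placeholder strings of the same lengths in place of the real responses; the variable names are illustrative.

```python
SEPARATOR = "^^^^"

# Running totals as printed by the generation loop above.
running_totals = [1710, 3129, 5028, 6867, 8530, 10014]

# Per-response lengths are the differences between consecutive totals.
lengths = [running_totals[0]] + [
    later - earlier for earlier, later in zip(running_totals, running_totals[1:])
]

# Placeholder strings standing in for the real responses.
placeholders = ["x" * n for n in lengths]
joined = SEPARATOR.join(placeholders)

# n responses are joined by n - 1 separators.
separator_chars = len(SEPARATOR) * (len(placeholders) - 1)

print(lengths)                                      # [1710, 1419, 1899, 1839, 1663, 1484]
print(sum(lengths), separator_chars, len(joined))   # 10014 20 10034

# Splitting on the separator recovers the individual items.
recovered = joined.split(SEPARATOR)
print(len(recovered))                               # 6
```

Splitting recovers the original list only if no response contains `'^^^^'` itself, which is worth checking before relying on the saved file.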


Let's also diversify the format (not just paragraph form)

In [None]:
dpo_explanations = []
target_length = 9164
current_length = 0

while current_length < target_length:
    prompt = {}
    prompt['system'] = """You are a teacher that explains concepts in a way that is easy to understand but also fully accurate and detailed."""
    prompt['user'] = """Explain DPO (Direct Preference Optimization)"""

    response = utils.query_llm(prompt=prompt, temperature=1, system_prompt_included=True, is_hippa=False, model="gpt-4.1")
    
    dpo_explanations.append(response)
    current_length += len(response)
    
    print(f"Generated explanation {len(dpo_explanations)}, current length: {current_length}")

# Join the explanations with '^^^^' as a separator
explanations_text = '^^^^'.join(dpo_explanations)

# Save to file, with an explicit encoding in case the responses contain non-ASCII characters
with open("../data/arxiv/list_of_dpo_explanations_gpt_4_1_generated.txt", "w", encoding="utf-8") as f:
    f.write(explanations_text)

print(f"Saved {len(dpo_explanations)} explanations to ../data/arxiv/list_of_dpo_explanations_gpt_4_1_generated.txt")
print(f"Final length: {len(explanations_text)}")
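
One possible way to vary the format, sketched under assumptions: rotate through a list of format instructions appended to the system prompt. The instruction list, the helper `generate_explanations`, and its `max_calls` guard are all hypothetical, and `query_fn` stands in for the `utils.query_llm` call so the sketch runs without the LLM backend.

```python
from itertools import cycle

# Hypothetical format instructions; adjust or extend as needed.
FORMAT_INSTRUCTIONS = [
    "Please make sure the output is in paragraph form.",
    "Please format the output as a bulleted list.",
    "Please format the output as a series of questions and answers.",
    "Please format the output as numbered steps.",
]

BASE_SYSTEM = (
    "You are a teacher that explains concepts in a way that is easy to "
    "understand but also fully accurate and detailed."
)
USER_PROMPT = "Explain DPO (Direct Preference Optimization)"


def generate_explanations(query_fn, target_length,
                          formats=FORMAT_INSTRUCTIONS, max_calls=100):
    """Call query_fn with rotating format instructions until the combined
    response length reaches target_length or max_calls is exhausted."""
    explanations = []
    current_length = 0
    for _, fmt in zip(range(max_calls), cycle(formats)):
        if current_length >= target_length:
            break
        prompt = {"system": f"{BASE_SYSTEM} {fmt}", "user": USER_PROMPT}
        response = query_fn(prompt)
        explanations.append(response)
        current_length += len(response)
    return explanations


# Stub in place of the real LLM call, for demonstration only.
def fake_query(prompt):
    return "[" + prompt["system"].split(". ")[-1] + "] " + "x" * 80


demo = generate_explanations(fake_query, target_length=450)
print(len(demo))
```

In the notebook, `query_fn` could be a small wrapper around the call used in the cell above, with the same keyword arguments. Saving the result under a different filename than the paragraph-only run would keep both sets.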