# ACReFOSC Demo: Generating a Preference Dataset for DPO 🧑‍💻

This notebook provides a hands-on demonstration of how to use the **ACReFOSC (A Companion REpository to the French OLDI Seed Corpus)** dataset.

It shows how one can construct a **preference dataset** from the provided files. This type of dataset is useful for fine-tuning language models using methods like Direct Preference Optimization (DPO), which teaches a model to prefer one style of response over another.

---

## Dataset Overview 📊

The script will use two core files from the repository:

* `acrefosc_raw.json`: This file contains the primary data. For each English source segment, it includes:
    * The final, **human post-edited French translation** (the high-quality "chosen" response).
    * Nine different **raw machine translation hypotheses** (the lower-quality "rejected" responses).
    * Quality Estimation (QE) scores for each machine-generated hypothesis from two different metrics: `COMET` and `MetricX`.

* `acrefosc_prompts.json`: This file contains the prompts used to generate the `Llama-4-Scout_segment-level` translations, adapted to discourage chain-of-thought (CoT) output.
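
Judging from the keys that the processing code in this notebook reads, a record in `acrefosc_raw.json` can be pictured roughly as below. This is only a sketch: the key names (`id`, `text`, `post_edited_text`, one key per system, and `<system>_<metric>_QE_Score`) follow the notebook code, but every value is an invented placeholder, only two of the nine systems are shown, and `list_hypotheses` is a hypothetical helper that is not part of the repository.

```python
# Hypothetical record: key names mirror those read by the notebook code,
# but every value below is an invented placeholder, not real data.
example_record = {
    "id": "example-0",
    "text": "The cat sleeps.",
    "post_edited_text": "Le chat dort.",
    "NLLB-3.3B": "Le chat dort.",
    "NLLB-3.3B_COMET_QE_Score": 0.91,
    "NLLB-3.3B_MetricX_QE_Score": 0.8,
    "OPUS-MT_en-fr": "Le chat est endormi.",
    "OPUS-MT_en-fr_COMET_QE_Score": 0.84,
    "OPUS-MT_en-fr_MetricX_QE_Score": 1.9,
}


def list_hypotheses(record, system_keys, metric="COMET"):
    """Return (system, text, score) for each system present in the record."""
    rows = []
    for key in system_keys:
        if key in record:
            # A missing score key yields None rather than raising.
            score = record.get(f"{key}_{metric}_QE_Score")
            rows.append((key, record[key], score))
    return rows


rows = list_hypotheses(
    example_record, ["NLLB-3.3B", "OPUS-MT_en-fr"], metric="MetricX"
)
for system, text, score in rows:
    print(system, text, score)
```

Inspecting one real record this way before building pairs is a quick check that the expected keys are present.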

---

## The Core Task: Creating Preference Pairs ✅❌

The key operation is to create pairs of "chosen" and "rejected" translations for each source text. The `create_preference_dataset` function in this notebook automates this process. Crucially, the code checks that a machine-generated hypothesis differs from the human post-edited reference before treating it as a "rejected" candidate, so no pair contains two identical translations.

It also demonstrates several **advanced selection strategies** for choosing the "rejected" example, using the provided QE scores:

* **`all`**: The simplest approach. It creates a preference pair for every machine translation that isn't identical to the final human version.
* **`hardest`**: A more nuanced strategy. For each source text, it selects only the *best-scoring* machine translation as the rejected example. This creates "hard" negative examples that are closer in quality to the chosen response.
* **`easiest`**: This strategy selects the *worst-scoring* machine translation as the rejected example, creating a dataset with more clear-cut quality differences.

**Note**: **MetricX is an error score, where lower is better**, whereas the script treats a higher COMET score as better. The "hardest" negative example therefore has the *minimum* MetricX score (or the *maximum* COMET score) among the machine-generated options, and the "easiest" has the opposite.
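
The filtering and selection logic described above can be sketched on toy data as follows. This is a simplified illustration rather than the notebook's own function: `select_rejected` and `LOWER_IS_BETTER` are hypothetical names, the texts and scores are invented, and ties may be broken differently than in `create_preference_dataset`.

```python
# Metrics for which a LOWER score means a better translation.
# Assumption: of the two metrics used here, only MetricX behaves this way.
LOWER_IS_BETTER = {"MetricX"}


def select_rejected(reference, hypotheses, strategy="all", metric="COMET"):
    """Pick rejected candidates from a list of (text, score) pairs."""
    # Never pair the reference with a hypothesis identical to it.
    candidates = [(t, s) for t, s in hypotheses if t != reference]
    if strategy == "all":
        return candidates
    # Score-based strategies need a score to rank on.
    scored = [(t, s) for t, s in candidates if s is not None]
    if not scored:
        return []
    # Rank from best to worst according to the metric's direction.
    ranked = sorted(
        scored, key=lambda pair: pair[1], reverse=metric not in LOWER_IS_BETTER
    )
    if strategy == "hardest":
        return [ranked[0]]   # best-scoring negative
    if strategy == "easiest":
        return [ranked[-1]]  # worst-scoring negative
    raise ValueError(f"Unknown strategy: {strategy!r}")


# Invented MetricX-style error scores (lower is better).
reference = "Le chat dort."
toy = [
    ("Le chat dort.", 0.5),         # identical to the reference: filtered out
    ("Le chat est endormi.", 1.9),
    ("Chat dormir.", 6.3),
]

all_pairs = select_rejected(reference, toy, "all", "MetricX")
hardest = select_rejected(reference, toy, "hardest", "MetricX")
easiest = select_rejected(reference, toy, "easiest", "MetricX")
print(len(all_pairs), hardest, easiest)
```

Keeping the metric direction in one place (here, a set of lower-is-better metric names) avoids repeating the `min`/`max` switch in every strategy branch.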

The notebook will first clone the repository, then process the data to generate and save a `.jsonl` file ready for DPO training. Let's get started! 👇

In [1]:
!git clone https://github.com/mmarmonier/ACReFOSC.git

Cloning into 'ACReFOSC'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 29 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (29/29), 5.53 MiB | 4.62 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [2]:
%cd ACReFOSC

/content/ACReFOSC


In [3]:
import json

# 1. Load both datasets from their files
with open('data/acrefosc_raw.json', 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

with open('data/acrefosc_prompts.json', 'r', encoding='utf-8') as f:
    prompts_data = json.load(f)

# 2. Create a dictionary for quick prompt lookup by ID
prompts_map = {item['id']: item['prompt'] for item in prompts_data}

# 3. List the keys for the MT systems
mt_system_keys = [
    "DeepSeek-R1",
    "Llama-4-Scout_document-level",
    "Llama-4-Scout_document-level_no-instruction",
    "Llama-4-Scout_document-level_Wikipedia",
    "Llama-4-Scout_segment-level",
    "MADLAD-400-3B",
    "NLLB-200-600M-Distilled",
    "NLLB-3.3B",
    "OPUS-MT_en-fr",
]

# 4. Generate the preference dataset with advanced selection strategies
def create_preference_dataset(raw_data, prompts_map, system_keys,
                              format_style='openai',
                              selection_strategy='all',
                              selection_metric='COMET'):
    """
    Generates a preference dataset with flexible selection strategies.

    Args:
        raw_data (list): Data from acrefosc_raw.json.
        prompts_map (dict): Dictionary mapping IDs to prompts.
        system_keys (list): List of MT system key names.
        format_style (str): Output format ('simple' or 'openai').
        selection_strategy (str): How to choose rejected examples.
            'all': Naive approach, creates a pair for every non-identical hypothesis.
            'hardest': Selects the best-scoring non-identical hypothesis.
            'easiest': Selects the worst-scoring non-identical hypothesis.
        selection_metric (str): Which QE score to use ('COMET' or 'MetricX').
    """
    preference_dataset = []
    for instance in raw_data:
        instance_id = instance['id']
        prompt = prompts_map.get(instance_id, f"Translate to French: {instance['text']}")
        chosen_response = instance['post_edited_text']

        hypotheses = []
        for key in system_keys:
            if key in instance and instance[key] != chosen_response:
                score_key = f"{key}_{selection_metric}_QE_Score"
                hypotheses.append({
                    "text": instance[key],
                    "score": instance.get(score_key),
                    "system": key
                })

        # Filter out hypotheses without a valid score for selection strategies
        scored_hypotheses = [h for h in hypotheses if h['score'] is not None]
        if not scored_hypotheses and selection_strategy != 'all':
            continue

        pairs_to_add = []
        if selection_strategy == 'all':
            pairs_to_add = hypotheses
        elif selection_strategy == 'hardest':
            # Note: MetricX is an error score (lower is better).
            # So the "hardest" (best) negative has the MINIMUM MetricX score.
            best_hypo = min(scored_hypotheses, key=lambda x: x['score']) if selection_metric == 'MetricX' \
                        else max(scored_hypotheses, key=lambda x: x['score'])
            pairs_to_add = [best_hypo]
        elif selection_strategy == 'easiest':
            # The "easiest" (worst) negative has the MAXIMUM MetricX score.
            worst_hypo = max(scored_hypotheses, key=lambda x: x['score']) if selection_metric == 'MetricX' \
                         else min(scored_hypotheses, key=lambda x: x['score'])
            pairs_to_add = [worst_hypo]
        else:
            # Fail loudly instead of silently returning an empty dataset.
            raise ValueError(f"Unknown selection_strategy: {selection_strategy!r}")

        for hypo in pairs_to_add:
            if format_style == 'openai':
                pair = {
                    "input": {"messages": [{"role": "user", "content": prompt}]},
                    "preferred_output": {"messages": [{"role": "assistant", "content": f"<translation>{chosen_response}</translation>"}]},
                    "non_preferred_output": {"messages": [{"role": "assistant", "content": f"<translation>{hypo['text']}</translation>"}]}
                }
            else:
                pair = {"prompt": prompt, "chosen": f"<translation>{chosen_response}</translation>", "rejected": f"<translation>{hypo['text']}</translation>"}
            preference_dataset.append(pair)

    return preference_dataset

# --- Create and save the dataset ---

# You can now choose your strategy.
# 'all': Creates the most pairs.
# 'hardest': Creates one pair per instance with the best-scoring negative.
# 'easiest': Creates one pair per instance with the worst-scoring negative.
strategy = 'all'
metric = 'MetricX' # or 'COMET'
dpo_dataset = create_preference_dataset(raw_data, prompts_map, mt_system_keys,
                                        selection_strategy=strategy,
                                        selection_metric=metric)

# Define the output filename
output_filename = f'dpo_preference_dataset_{strategy}_{metric}.jsonl'

# Save the generated dataset to a .jsonl file
with open(output_filename, 'w', encoding='utf-8') as f:
    for pair in dpo_dataset:
        f.write(json.dumps(pair, ensure_ascii=False) + '\n')

print(f"✅ Success! Generated {len(dpo_dataset)} preference pairs using the '{strategy}' strategy with {metric} scores.")
print(f"Dataset saved to '{output_filename}'")

✅ Success! Generated 49580 preference pairs using the 'all' strategy with MetricX scores.
Dataset saved to 'dpo_preference_dataset_all_MetricX.jsonl'
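
Before handing the file to a trainer, it can be worth reading it back and checking a few invariants. The sketch below assumes the default `'openai'` `format_style` used above; `check_preference_file` is a hypothetical helper, not part of the repository.

```python
import json


def check_preference_file(path):
    """Count pairs in an OpenAI-style JSONL file and check basic invariants."""
    n_pairs = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            pair = json.loads(line)  # raises if the line is not valid JSON
            chosen = pair["preferred_output"]["messages"][0]["content"]
            rejected = pair["non_preferred_output"]["messages"][0]["content"]
            # The generation step should already have excluded identical pairs.
            if chosen == rejected:
                raise ValueError(f"Identical pair on line {line_number}")
            n_pairs += 1
    return n_pairs


# Example call (file name taken from the cell output above):
# check_preference_file('dpo_preference_dataset_all_MetricX.jsonl')
```

If the file was written correctly, the returned count should match the number of pairs reported when the dataset was generated.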


In [4]:
!head dpo_preference_dataset_all_MetricX.jsonl

{"input": {"messages": [{"role": "user", "content": "You are an expert English-French translator of encyclopedic documents. In translating, you adhere to the following guidelines:\n\n1. Refer to the source document context when available. Context helps clarify meaning, resolve ambiguities, and maintain tone and accuracy in translation.\n2. Do not convert any units of measurement. Translate them exactly as noted in the source content.\n3. Encyclopedic documents should be translated using a formal tone.\n4. Provide fluent translations without deviating excessively from the structure of the source segment.\n5. Do not expand or replace information compared to what is present in the source segment. Do not add any explanatory or parenthetical information, definitions, etc.\n6. Do not ignore any meaningful text that was present in the source segment.\n7. If a named entity in the source language has a canonical equivalent in the target language, use this canonical equivalent.\n8. If a named en