# Clean an Existing Preference Dataset with LLMs as Judges

In this example, we will use distilabel to clean an existing preference dataset, using LLMs as judges that provide AI feedback on the quality of the data.

[`distilabel`](https://github.com/argilla-io/distilabel) is a synthetic data and AI feedback framework for engineers who need fast, reliable and scalable pipelines based on verified research papers.

To evaluate the responses, we will use the serverless Hugging Face Inference API, which is integrated with distilabel.

To further curate the data, we will use [`Argilla`](https://github.com/argilla-io/argilla), which allows us to provide human feedback on the data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

## Setup

In [None]:
!pip install -qU "transformers~=4.0" "torch~=2.0" "distilabel[argilla, hf-inference-endpoints]"

In [None]:
import random

from datasets import load_dataset
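from distilabel.llms import InferenceEndpointsLLM  # newer releases also expose this under distilabel.models
from distilabel.pipeline import Pipeline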
from distilabel.steps import (
    KeepColumns,
    LoadDataFromDicts,
    PreferenceToArgilla
)
from distilabel.steps.tasks import UltraFeedback

## The dataset

In this example, we will clean a preference dataset, the [`Intel/orca_dpo_pairs`](https://huggingface.co/datasets/Intel/orca_dpo_pairs) dataset.

In [None]:
dataset = load_dataset(
    'Intel/orca_dpo_pairs',
    split='train[:20]'
)

In [None]:
dataset

Dataset({
    features: ['system', 'question', 'chosen', 'rejected'],
    num_rows: 20
})

In [None]:
dataset[0]

{'system': '',
 'question': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:",
 'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]',
 'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[

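Next, we will shuffle the `chosen` and `rejected` responses into a single `generations` column so that the judge cannot rely on their position, and we will record the original labels in an `order` column.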

In [None]:
def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)

    order = ['chosen' if x == chosen else 'rejected' for x in pair]

    return {'generations': pair, 'order': order}

dataset = dataset.map(lambda x: shuffle_and_track(x['chosen'], x['rejected']))

In [None]:
dataset[0]

{'system': '',
 'question': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:",
 'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]',
 'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[
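One edge case is worth noting: `shuffle_and_track` derives the labels by comparing strings, so a row whose `chosen` and `rejected` texts are identical would be labelled `['chosen', 'chosen']`. The following is a minimal sketch of a variant that shuffles labelled pairs instead. The name `shuffle_and_track_by_index` and the `rng` parameter are illustrative assumptions, not part of distilabel or the original notebook.

```python
import random


def shuffle_and_track_by_index(chosen, rejected, rng=random):
    """Shuffle a response pair and record which slot holds which label."""
    # Shuffle (label, text) tuples rather than bare strings, so each label
    # travels with its text even when both responses are identical.
    labelled = [("chosen", chosen), ("rejected", rejected)]
    rng.shuffle(labelled)

    return {
        "generations": [text for _, text in labelled],
        "order": [label for label, _ in labelled],
    }


# Identical responses still yield exactly one label of each kind.
result = shuffle_and_track_by_index("Madrid", "Madrid", rng=random.Random(0))
assert sorted(result["order"]) == ["chosen", "rejected"]
```

It can be dropped into the same `dataset.map(lambda x: ...)` call as above; passing a seeded `random.Random` instance makes the shuffle reproducible.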

### (Optional) Create a custom step

A **step** is a block in a `distilabel` pipeline used to manipulate, generate, or evaluate data, among other tasks. A set of predefined steps is provided, but we can also create our own custom steps. Instead of preprocessing the data as in the previous section, it is possible to use a custom step to shuffle the columns. This step should be in a separate module to be imported and used in the pipeline.

In this case, the pipeline would start by loading the `orca_dpo_pairs` dataset using the `LoadDataFromHub` step and then applying the `ShuffleStep`.

In [None]:
from typing import TYPE_CHECKING, List
from distilabel.steps import GlobalStep, StepInput
import random

if TYPE_CHECKING:
    from distilabel.steps.typing import StepOutput


class ShuffleStep(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ['instruction', 'chosen', 'rejected']

    @property
    def outputs(self) -> List[str]:
        return ['instruction', 'generations', 'order']

    def process(self, inputs: StepInput) -> "StepOutput":
        outputs = []

        for _input in inputs:
            chosen = _input['chosen']
            rejected = _input['rejected']
            pair = [chosen, rejected]
            random.shuffle(pair)

            order = ['chosen' if x == chosen else 'rejected' for x in pair]

            outputs.append({
                'instruction': _input['instruction'],
                'generations': pair,
                'order': order
            })

        yield outputs

## Define the pipeline

To clean an existing preference dataset, we will need to define a `Pipeline` with all the necessary steps. A similar workflow can be used to clean an SFT dataset.

### Load the dataset

In [None]:
load_dataset = LoadDataFromDicts(
    data=dataset.select(range(1)).to_list(),  # list of row dicts; dataset[:1] would be a dict of columns
    output_mappings={'question': 'instruction'},
    pipeline=Pipeline(name='showcase-pipeline')
)

load_dataset.load()
next(load_dataset.process())

### Evaluate the responses

To evaluate the quality of the responses, we will use [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) with the `UltraFeedback` task, which can judge responses along different dimensions (helpfulness, honesty, instruction-following, truthfulness). Here we set `aspect='overall-rating'` to obtain a single overall score per response. For an SFT dataset, we can use `PrometheusEval` instead.

In [None]:
evaluate_responses = UltraFeedback(
    aspect='overall-rating',
    llm=InferenceEndpointsLLM(
        model_id='meta-llama/Llama-3.1-70B-Instruct',
        tokenizer_id='meta-llama/Llama-3.1-70B-Instruct',
        generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
    ),
    pipeline=Pipeline(name='showcase-pipeline')
)

evaluate_responses.load()

next(evaluate_responses.process(
    [{
        'instruction': "What's the capital of Spain?",
        'generations': ['Madrid', 'Barcelona']
    }]
))

### Keep only the required columns

We will keep only the columns needed for the final dataset and drop the rest.

In [None]:
keep_columns = KeepColumns(
    columns=[
        'instruction',
        'generations',
        'order',
        'ratings',
        'rationales',
        'model_name'
    ],
    pipeline=Pipeline(name='showcase-pipeline')
)
keep_columns.load()

next(keep_columns.process(
    [{
        'system': "",
        'instruction': "What's the capital of Spain?",
        'chosen': 'Madrid',
        'rejected': 'Barcelona',
        'generations': ['Madrid', 'Barcelona'],
        'order': ['chosen', 'rejected'],
        'ratings': [5, 1],
        'rationales': ['', ''],
        'model_name': 'meta-llama/Llama-3.1-70B-Instruct'
    }]
))
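Conceptually, this step is a per-row projection onto the listed columns. The plain-Python stand-in below illustrates the effect on the same kind of row; the helper `keep_columns_plain` is a hypothetical name used only for illustration and is not how distilabel implements `KeepColumns`.

```python
def keep_columns_plain(rows, columns):
    """Return copies of `rows` restricted to `columns`, in the given order."""
    return [{name: row[name] for name in columns} for row in rows]


rows = [{
    "system": "",
    "instruction": "What's the capital of Spain?",
    "chosen": "Madrid",
    "rejected": "Barcelona",
    "generations": ["Madrid", "Barcelona"],
    "order": ["chosen", "rejected"],
    "ratings": [5, 1],
}]

trimmed = keep_columns_plain(
    rows, ["instruction", "generations", "order", "ratings"]
)
# The original label columns are gone; only the requested ones remain.
assert "chosen" not in trimmed[0] and "system" not in trimmed[0]
```

Dropping `chosen` and `rejected` here loses nothing, because `generations` together with `order` carries the same information.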

### (Optional) Further data curation

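We can use Argilla to further curate our data. The `PreferenceToArgilla` step pushes the evaluated pairs to an Argilla instance, so replace the placeholder `api_url` and `api_key` values with those of your own deployment.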

In [None]:
to_argilla = PreferenceToArgilla(
    dataset_name='cleaned-dataset',
    dataset_workspace='argilla',
    api_url='https://<username>-<space-name>.hf.space',
    api_key='<api-key>',
    num_generations=2
)

## Run the pipeline

In [None]:
with Pipeline(name='clean-dataset') as pipeline:
    load_dataset = LoadDataFromDicts(
        data=dataset,
        output_mappings={'question': 'instruction'}
    )

    evaluate_responses = UltraFeedback(
        aspect='overall-rating',
        llm=InferenceEndpointsLLM(
            model_id='meta-llama/Llama-3.1-70B-Instruct',
            tokenizer_id='meta-llama/Llama-3.1-70B-Instruct',
            generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
        )
    )

    keep_columns = KeepColumns(
        columns=[
            'instruction',
            'generations',
            'order',
            'ratings',
            'rationales',
            'model_name'
        ]
    )

    to_argilla = PreferenceToArgilla(
        dataset_name='cleaned-dataset',
        dataset_workspace='argilla',
        api_url='https://<username>-<space-name>.hf.space',
        api_key='<api-key>',
        num_generations=2
    )

    load_dataset.connect(evaluate_responses)
    evaluate_responses.connect(keep_columns)
    keep_columns.connect(to_argilla)

In [None]:
distiset = pipeline.run()
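Once the pipeline has run, each row carries the shuffled `order` alongside the judge's `ratings`, so the two can be lined up to decide whether the original preference holds. The sketch below shows one possible policy. The function `judge_verdict`, the `min_gap` threshold, and the assumption that a failed judgment appears as `None` in `ratings` are illustrative choices, not something the pipeline prescribes.

```python
def judge_verdict(order, ratings, min_gap=1):
    """Compare the judge's ratings with the original chosen/rejected labels.

    Returns "keep" if the judge agrees with the original preference,
    "swap" if it prefers the rejected response, "tie" if the ratings are
    closer than `min_gap`, and "unrated" if a rating is missing.
    """
    # `order` and `ratings` are aligned with `generations`, so zipping them
    # maps each original label to the score its response received.
    by_label = dict(zip(order, ratings))
    chosen = by_label.get("chosen")
    rejected = by_label.get("rejected")

    if chosen is None or rejected is None:
        return "unrated"
    if abs(chosen - rejected) < min_gap:
        return "tie"
    return "keep" if chosen > rejected else "swap"


# The response originally labelled "rejected" was shown first and scored higher.
assert judge_verdict(["rejected", "chosen"], [5, 1]) == "swap"
```

Rows marked `keep` can stay as they are, `swap` rows are candidates for flipping, and `tie` or `unrated` rows are good candidates for human review in Argilla. Assuming the default configuration, the rows should be reachable under `distiset['default']['train']`.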