# Generate a Preference Dataset with distilabel

In this example, we will use `distilabel` to generate a synthetic preference dataset for DPO, ORPO or RLHF.

[`distilabel`](https://github.com/argilla-io/distilabel) is a synthetic data and AI feedback framework for engineers who need fast, reliable and scalable pipelines based on verified research papers.

To generate the responses and evaluate them, we will use the serverless Hugging Face Inference API integrated with `distilabel`.

To further curate the data, we will use [`Argilla`](https://github.com/argilla-io/argilla), which allows us to provide human feedback on the data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

## Setup

In [2]:
!pip install -qU "transformers~=4.0" "torch~=2.0" "distilabel[argilla, hf-inference-endpoints]"


In [None]:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

## Define the pipeline

To generate our preference dataset, we will need to define a `Pipeline` with all the necessary steps.

### Load the dataset

We will use the [`argilla/10Kprompts-mini`](https://huggingface.co/datasets/argilla/10Kprompts-mini) dataset as our source data.

In [4]:
load_dataset = LoadDataFromHub(
    repo_id='argilla/10Kprompts-mini',
    num_examples=1,
    pipeline=Pipeline(name='showcase-pipeline')
)
load_dataset.load()
next(load_dataset.process())


([{'instruction': 'How can I create an efficient and robust workflow that utilizes advanced automation techniques to extract targeted data, including customer information, from diverse PDF documents and effortlessly integrate it into a designated Google Sheet? Furthermore, I am interested in establishing a comprehensive and seamless system that promptly activates an SMS notification on my mobile device whenever a new PDF document is uploaded to the Google Sheet, ensuring real-time updates and enhanced accessibility.',
   'topic': 'Software Development'}],
 True)

### Generate responses

We need to generate responses for the given instructions. We will use two different models available on the Hugging Face Hub through the serverless Inference API:
- [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [`mistralai/Mixtral-8x7B-Instruct-v0.1`](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)

In [None]:
generate_responses = [
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id='meta-llama/Meta-Llama-3-8B-Instruct',
            tokenizer_id='meta-llama/Meta-Llama-3-8B-Instruct',
            generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
        ),
        pipeline=Pipeline(name='showcase-pipeline')
    ),
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id='mistralai/Mixtral-8x7B-Instruct-v0.1',
            tokenizer_id='mistralai/Mixtral-8x7B-Instruct-v0.1',
            generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
        ),
        pipeline=Pipeline(name='showcase-pipeline')
    )
]

for task in generate_responses:
    task.load()
    print(next(task.process([{'instruction': 'Which are the top cities in Spain?'}])))

### Group the responses

The task that evaluates the responses expects its input as a list of generations. However, each model's response was saved in the `generation` column of its own subset, `text_generation_0` or `text_generation_1`. We will combine these two columns into a single column in the `default` subset.

In [None]:
group_responses = GroupColumns(
    columns=['generation', 'model_name'],
    output_columns=['generations', 'model_names'],
    pipeline=Pipeline(name='showcase-pipeline')
)

next(
    group_responses.process(
        [{
            'generation': 'Madrid',
            'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'
        }],
        [{
            'generation': 'Barcelona',
            'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'
        }]
    )
)
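To make the grouping step more concrete, here is a minimal plain-Python sketch of the idea: rows coming from several branches are aligned by position, and the selected columns are collected into list-valued columns. The helper `group_columns` and the sample batches are hypothetical and written only for illustration; this is not distilabel's implementation of `GroupColumns`.

```python
def group_columns(branches, columns, output_columns):
    """Merge row-aligned batches from several branches into list-valued columns."""
    grouped = []
    # zip(*branches) yields the i-th row of every branch together.
    for rows in zip(*branches):
        # Keep the shared columns from the first branch, dropping the grouped ones.
        merged = {k: v for k, v in rows[0].items() if k not in columns}
        for column, output_column in zip(columns, output_columns):
            merged[output_column] = [row[column] for row in rows]
        grouped.append(merged)
    return grouped


# Hypothetical batches, one per generation branch.
llama_batch = [{
    'instruction': 'Which are the top cities in Spain?',
    'generation': 'Madrid',
    'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct',
}]
mixtral_batch = [{
    'instruction': 'Which are the top cities in Spain?',
    'generation': 'Barcelona',
    'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1',
}]

grouped = group_columns(
    [llama_batch, mixtral_batch],
    columns=['generation', 'model_name'],
    output_columns=['generations', 'model_names'],
)
print(grouped)
```

Each output row keeps the shared `instruction` and gains one list per grouped column, which is the shape the evaluation task expects.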

### Evaluate the responses

To build our preference dataset, we need to evaluate the responses generated by the models. We will use [`meta-llama/Meta-Llama-3-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for this, applying the `UltraFeedback` task, which can judge the responses along different dimensions (helpfulness, honesty, instruction-following, truthfulness). In this example we pass `aspect='overall-rating'` to get a single overall score per response.

In [None]:
evaluate_responses = UltraFeedback(
    aspect='overall-rating',
    llm=InferenceEndpointsLLM(
        model_id='meta-llama/Meta-Llama-3-70B-Instruct',
        tokenizer_id='meta-llama/Meta-Llama-3-70B-Instruct',
        generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
    ),
    pipeline=Pipeline(name='showcase-pipeline')
)

evaluate_responses.load()
next(
    evaluate_responses.process(
        [{
            'instruction': "What's the capital of Spain?",
            'generations': ['Madrid', 'Barcelona']
        }]
    )
)

### Convert to a preference dataset

We can automatically convert the rated responses into a preference dataset with `chosen` and `rejected` columns.

In [None]:
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name='showcase-pipeline'))
format_dpo.load()

next(
    format_dpo.process(
        [{
            'instruction': "What's the capital of Spain?",
            'generations': ['Madrid', 'Barcelona'],
            'generation_models': [
                'Meta-Llama-3-8B-Instruct',
                'Mixtral-8x7B-Instruct-v0.1'
            ],
            'ratings': [5, 1]
        }]
    )
)
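The conversion from ratings to a preference pair can be sketched in plain Python. This is an illustrative assumption about the logic, not the library's code: the helper `to_preference_pair` is hypothetical, it takes the highest-rated response as `chosen` and the lowest-rated one as `rejected`, and the real `FormatTextGenerationDPO` step may select the rejected response differently and format its output as chat messages.

```python
def to_preference_pair(row):
    """Pick the best- and worst-rated generations as chosen/rejected.

    Returns None when every rating is equal, since no preference exists.
    """
    ratings = row['ratings']
    generations = row['generations']
    indices = range(len(ratings))
    best = max(indices, key=ratings.__getitem__)
    worst = min(indices, key=ratings.__getitem__)
    if ratings[best] == ratings[worst]:
        # All ratings are tied, so there is no meaningful pair to build.
        return None
    return {
        'prompt': row['instruction'],
        'chosen': generations[best],
        'rejected': generations[worst],
        'chosen_rating': ratings[best],
        'rejected_rating': ratings[worst],
    }


example = {
    'instruction': "What's the capital of Spain?",
    'generations': ['Madrid', 'Barcelona'],
    'ratings': [5, 1],
}
pair = to_preference_pair(example)
print(pair)
```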

Alternatively, we can send the data to Argilla, label it manually, and then convert it to a preference dataset.

In [None]:
to_argilla = PreferenceToArgilla(
    dataset_name='preference-dataset',
    dataset_workspace='argilla',
    api_url="https://<username>-<space-name>.hf.space",
    api_key="<api-key>",
    num_generations=2
)

## Run the pipeline

In [None]:
with Pipeline(name='generate-dataset') as pipeline:
    # Load dataset
    load_dataset = LoadDataFromHub(repo_id='argilla/10Kprompts-mini')

    # Generate responses
    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id='meta-llama/Meta-Llama-3-8B-Instruct',
                tokenizer_id='meta-llama/Meta-Llama-3-8B-Instruct',
                generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id='mistralai/Mixtral-8x7B-Instruct-v0.1',
                tokenizer_id='mistralai/Mixtral-8x7B-Instruct-v0.1',
                generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
            )
        )
    ]

    # Group responses
    group_responses = GroupColumns(
        columns=['generation', 'model_name'],
        output_columns=['generations', 'model_names']
    )

    # Evaluate responses
    evaluate_responses = UltraFeedback(
        aspect='overall-rating',
        llm=InferenceEndpointsLLM(
            model_id='meta-llama/Meta-Llama-3-70B-Instruct',
            tokenizer_id='meta-llama/Meta-Llama-3-70B-Instruct',
            generation_kwargs={'max_new_tokens': 512, 'temperature': 0.7}
        )
    )

    # Convert to preference dataset
    format_dpo = FormatTextGenerationDPO()
    to_argilla = PreferenceToArgilla(
        dataset_name='preference-dataset',
        dataset_workspace='argilla',
        api_url="https://<username>-<space-name>.hf.space",
        api_key="<api-key>",
        num_generations=2
    )

    # Connect components
    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)

    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)

In [None]:
# Run the pipeline
distiset = pipeline.run()
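The `connect` calls in the pipeline definition describe a directed acyclic graph of steps. As a rough mental model, the sketch below rebuilds that graph with the standard library and derives one valid execution order. The step labels are chosen for illustration, and this is not how distilabel schedules steps internally; it only shows which steps must finish before others can start.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on (its predecessors).
# The labels are illustrative names for the steps defined in the pipeline.
dependencies = {
    'text_generation_0': {'load_dataset'},
    'text_generation_1': {'load_dataset'},
    'group_responses': {'text_generation_0', 'text_generation_1'},
    'evaluate_responses': {'group_responses'},
    'format_dpo': {'evaluate_responses'},
    'to_argilla': {'evaluate_responses'},
}

# static_order() returns the steps so that every step follows its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two generation steps have no dependency on each other, so their relative position in the printed order is arbitrary; the same holds for `format_dpo` and `to_argilla`.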