# Generate a dataset for preference alignment

This notebook guides you through generating a dataset for preference alignment with the `distilabel` package. Each row of the dataset pairs one instruction with responses from several models, which can later be compared and ranked.

Let's dig in.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Generate a dataset for preference alignment</h2>
    <p>Now that you've seen how to generate a dataset for preference alignment, try generating one yourself.</p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Generate a dataset for preference alignment</p>
    <p>🐕 Generate a dataset for preference alignment with response evolution</p>
    <p>🦁 Generate a dataset for preference alignment with response evolution and model pooling</p>
</div>

## Install dependencies

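The command below installs `distilabel` with the `hf-transformers` extra. You can swap that extra for `vllm` or `hf-inference-endpoints` if you prefer a different inference backend.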

In [1]:
!pip install "distilabel[hf-transformers,outlines,instructor]"

## Start synthesizing

As we've seen in the previous notebook, we can create a distilabel pipeline for preference dataset generation. A bare-minimum pipeline is already provided below. You can build on it to generate a larger dataset for preference alignment. Swap out models, model providers, and generation arguments to see how they affect the quality of the dataset. Start with small experiments and scale up later.
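One way to keep such experiments small and organized is to enumerate the combinations you want to compare before building any pipeline. The sketch below is plain Python and does not call distilabel. The temperature and token-limit values are hypothetical, and passing each `generation_kwargs` dictionary to an LLM class such as `TransformersLLM` is an assumption you should check against the components gallery.

```python
from itertools import product

# Models taken from the pipeline in this notebook.
models = ["HuggingFaceTB/SmolLM2-1.7B-Instruct", "Qwen/Qwen2.5-1.5B-Instruct"]

# Hypothetical values to compare; adjust them for your own experiments.
temperatures = [0.2, 0.8]
max_new_tokens = 256

# One configuration per (model, temperature) combination.
configs = [
    {
        "model": model,
        "generation_kwargs": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }
    for model, temperature in product(models, temperatures)
]

for config in configs:
    print(config)
```

With two models and two temperatures this yields four configurations, which is a manageable number of runs to inspect by hand before scaling up.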

Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. 

An example of loading data from the Hub instead of dictionaries is provided below.

```python
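from datasets import load_dataset
from distilabel.pipeline import Pipeline

with Pipeline(...) as pipeline:
    ...

if __name__ == "__main__":
    # "my-dataset" is a placeholder; replace it with a real Hub dataset ID.
    dataset = load_dataset("my-dataset", split="train")
    distiset = pipeline.run(dataset=dataset)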
```

Don't forget to push your dataset to the Hub after running the pipeline!

In [None]:
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

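with Pipeline() as pipeline:
    # Seed the pipeline with a single instruction.
    data = LoadDataFromDicts(data=[{"instruction": "What is synthetic data?"}])
    # Two different models answer the same instruction.
    llm_a = TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    gen_a = TextGeneration(llm=llm_a)
    llm_b = TransformersLLM(model="Qwen/Qwen2.5-1.5B-Instruct")
    gen_b = TextGeneration(llm=llm_b)
    # Collect both `generation` columns into one `grouped_generation` column.
    group = GroupColumns(columns=["generation"])
    # Fan out to both generators, then merge their outputs.
    data >> [gen_a, gen_b] >> group

distiset = pipeline.run()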

In [12]:
data_row = distiset["default"]["train"][0]
print(f"Instruction: {data_row['instruction']}")
print(f"Generation 1: {data_row['grouped_generation'][0]}")
print(f"Generation 2: {data_row['grouped_generation'][1]}")

Instruction: What is synthetic data?
Generation 1: Synthetic data, also known as simulated or artificial data, refers to artificially generated data that mimics real-world patterns and characteristics. It's created using algorithms and statistical models to mimic the properties of real data, but it doesn't actually represent any specific real-world phenomenon. Synthetic data can be used in various fields such as machine learning, data science, and research. It's often used when collecting real data is difficult, expensive, or impossible due to privacy concerns, regulatory restrictions, or other reasons. For example, synthetic data might be used for training machine learning models on sensitive topics like healthcare or finance without revealing actual personal information.
Generation 2: Synthetic data refers to data that has been generated artificially rather than being collected from real-world sources. It can be used in various applications where privacy concerns or the need for larg
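
The row above holds two candidate responses but no preference label yet. As a minimal sketch of the remaining step, the helper below turns one such row into a prompt/chosen/rejected record. It assumes you already have one numeric rating per generation, for example from a judge model or a human annotator. The function name `to_preference_record`, the example answers, and the ratings are all hypothetical and are not produced by the pipeline in this notebook.

```python
from typing import Dict, Sequence


def to_preference_record(
    instruction: str,
    generations: Sequence[str],
    ratings: Sequence[float],
) -> Dict[str, str]:
    """Build one prompt/chosen/rejected record from rated generations."""
    if len(generations) != len(ratings):
        raise ValueError("Each generation needs exactly one rating.")
    if len(generations) < 2:
        raise ValueError("At least two generations are needed to form a pair.")
    # Sort indices by rating, highest first (ties keep their original order).
    order = sorted(range(len(ratings)), key=lambda i: ratings[i], reverse=True)
    return {
        "prompt": instruction,
        "chosen": generations[order[0]],
        "rejected": generations[order[-1]],
    }


# Hypothetical row shaped like the pipeline output; the ratings are made up.
row = {
    "instruction": "What is synthetic data?",
    "grouped_generation": ["Answer from model A.", "Answer from model B."],
}
hypothetical_ratings = [4.0, 2.5]

record = to_preference_record(
    row["instruction"], row["grouped_generation"], hypothetical_ratings
)
print(record)
```

Keeping only the highest- and lowest-rated responses is one simple pairing strategy; with more than two generations per instruction you could also emit several pairs per row.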

## 🌯 That's a wrap

You've now seen how to generate a dataset for preference alignment. You could use this to:

- Build training datasets for preference alignment.
- Create evaluation datasets for preference alignment.

Next

🏋️‍♂️ Fine-tune a model with preference alignment on a synthetic dataset, following the [preference tuning chapter](../../2_preference_alignment/README.md)
