# Exercise 2: Synthetic Data Generation with Seed Data

In this exercise, we use an existing dataset (**Seed Data**) to jumpstart the creation of high-quality training data. A **BBC News** corpus serves as the grounding context, and from it we generate a dataset suitable for **Direct Preference Optimization (DPO)** training or **RAG** evaluation.

### **Learning Objectives**
By the end of this exercise, you will know how to:

1.  **Ingest Seed Data**: Load and clean an existing dataset (BBC News) to serve as the context for generation.
2.  **Augment Data with LLMs**: Configure Data Designer to read from the seed dataset and generate new, related columns.
3.  **Generate Complex Pairs**: Create **Question**, **Positive Answer** (Fact-based), and **Negative Answer** (Hallucination) triplets based on the seed context.
4.  **Scale Generation**: Run a batch generation job to create a dataset ready for DPO fine-tuning.
5.  **Profile Data**: Analyze the token usage and distribution of the generated text.

---

### 1. Prepare Seed Data
Load the BBC News dataset, select a subset, and clean it to use as our `context` column.

In [None]:
import pandas as pd
from datasets import load_dataset

In [None]:
ds = load_dataset("fawern/bbc-news-pretraining-corpus")

In [None]:
# Collect the raw article text from the training split.
articles = [row['text'] for row in ds['train']]

In [None]:
# Keep the first 100 articles as the seed set, stored in a single `context` column.
seed_df = pd.DataFrame({'context': articles[:100]})

In [None]:
seed_df.loc[0, 'context']

In [None]:
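# Replace newlines with spaces so each article is a single line of text.
seed_df['context'] = seed_df['context'].str.replace('\n', ' ', regex=False)
# Save the seed set; Data Designer reads it from this CSV file later.
seed_df.to_csv('articles_sample.csv', index=False)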
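
The cleaning step above only replaces newlines. If the seed corpus may contain empty, very short, or duplicated articles, a slightly stricter pass can help before saving. The following is a minimal sketch, not part of the original exercise; the helper name `clean_seed` and the `min_chars` threshold are hypothetical and should be tuned to the corpus.

```python
import pandas as pd

def clean_seed(df: pd.DataFrame, column: str = "context", min_chars: int = 200) -> pd.DataFrame:
    """Normalize whitespace, then drop short, empty, and duplicate rows.

    `clean_seed` and `min_chars` are illustrative names, not part of any library.
    """
    out = df.copy()
    # Collapse every run of whitespace (newlines, tabs, repeated spaces) into one space.
    out[column] = out[column].fillna("").str.replace(r"\s+", " ", regex=True).str.strip()
    # Drop rows that are too short to support a meaningful question.
    out = out[out[column].str.len() >= min_chars]
    # Drop exact duplicates so the same article is not used as a seed twice.
    out = out.drop_duplicates(subset=column)
    return out.reset_index(drop=True)

# Toy data: two identical long articles, one short one, and one missing value.
sample = pd.DataFrame(
    {"context": ["A  long\narticle " * 20, "A  long\narticle " * 20, "too short", None]}
)
cleaned = clean_seed(sample)
print(len(cleaned))
```

In the notebook this could be applied as `seed_df = clean_seed(seed_df)` before the call to `to_csv`.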

### 2. Imports and Client Setup
Import the Data Designer classes, register a model provider that points at a locally hosted endpoint, and define the model configuration.

In [None]:
from data_designer.essentials import (
    DataDesigner,
    ModelConfig,
    InferenceParameters,
    DataDesignerConfigBuilder,
    SeedConfig
)
from data_designer.config.models import ModelProvider

In [None]:
local_nim_provider = ModelProvider(
    name="local-nim",
    endpoint="http://localhost:8080/v1",
    provider_type="openai",
    api_key="dummy"
)

In [None]:
data_designer = DataDesigner(model_providers=[local_nim_provider])

In [None]:
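MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"
MODEL_PROVIDER = "local-nim"  # must match the provider name registered above
MODEL_ALIAS = "local-model"   # the alias that column definitions refer to
SYSTEM_PROMPT = "/no_think"   # asks the Nemotron model to skip its reasoning trace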

In [None]:
model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.5,
            top_p=1.0,
            max_tokens=1024
        )
    )
]

### 3. Configure Data Designer with Seed Data
Create a `DataDesignerConfigBuilder` with the model configuration, then attach the seed CSV file to it.

In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

In [None]:
seed_dataset = SeedConfig(dataset='articles_sample.csv')

In [None]:
config_builder.with_seed_dataset(seed_dataset)

### 4. Define Synthetic Columns (RAG/DPO Use Case)
We will generate three columns based on the seed `context`:
- `question`: A factual question based on the text.
- `positive_answer`: The correct answer from the text.
- `negative_answer`: A plausible but incorrect answer (hallucination).

In [None]:
config_builder.add_column(
    name="question",
    model_alias=MODEL_ALIAS,
    column_type="llm-text",
    system_prompt=SYSTEM_PROMPT,
    prompt=(
        "Write a clear and factual question based on the following context:\n\n"
        "{{ context }}"
    ),
)
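
The `{{ context }}` placeholder in the prompt is filled per record with the value of the seed column of that name. The sketch below illustrates the idea with a plain-Python stand-in for Jinja-style substitution. It is not Data Designer's own renderer, and the `render` helper and sample row are hypothetical.

```python
import re

# Matches placeholders such as {{ context }} or {{question}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, row: dict) -> str:
    """Replace each {{ name }} placeholder with the matching value from `row`."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            # Fail loudly when a prompt refers to a column that does not exist.
            raise KeyError(f"prompt references unknown column: {name}")
        return str(row[name])
    return PLACEHOLDER.sub(substitute, template)

# Hypothetical record: one seed column plus one already-generated column.
row = {
    "context": "The BBC reported record profits.",
    "question": "What did the BBC report?",
}
prompt = render("Text:\n{{ context }}\n\nQuestion: {{ question }}\n", row)
print(prompt)
```

A misspelled placeholder name is the usual failure mode here, which is why the sketch raises an error instead of leaving the placeholder in the prompt.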

In [None]:
config_builder.add_column(
    name="positive_answer",
    model_alias=MODEL_ALIAS,
    column_type="llm-text",
    system_prompt=SYSTEM_PROMPT,
    prompt=(
        "Text:\n{{ context }}\n\n"
        "Question: {{ question }}\n"
        "Provide a correct, factual, and detailed answer taken directly from the text.\n"
        "Answer:"
    ),
)

In [None]:
config_builder.add_column(
    name="negative_answer",
    model_alias=MODEL_ALIAS,
    column_type="llm-text",
    system_prompt=SYSTEM_PROMPT,
    prompt=(
        "Text:\n{{ context }}\n\n"
        "Question: {{ question }}\n"
        "Provide a plausible but incorrect, detailed answer that does not match the text.\n"
        "Incorrect answer:"
    ),
)

In [None]:
config_builder.validate()
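
Both answer prompts reference `{{ question }}`, so the `question` column has to exist before either answer can be generated. Data Designer presumably works out this ordering itself; the sketch below only illustrates the dependency structure using the standard library, with the prompts abbreviated to their placeholders.

```python
import re
from graphlib import TopologicalSorter

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Prompts abbreviated from the column definitions; only the placeholders matter here.
prompts = {
    "question": "Write a question based on: {{ context }}",
    "positive_answer": "Text: {{ context }} Question: {{ question }}",
    "negative_answer": "Text: {{ context }} Question: {{ question }}",
}
seed_columns = {"context"}  # columns supplied by the seed dataset

# Map each generated column to the generated columns its prompt references.
dependencies = {
    name: set(PLACEHOLDER.findall(prompt)) - seed_columns
    for name, prompt in prompts.items()
}

# Any referenced name that is neither a seed column nor a generated column is a typo.
unknown = set().union(*dependencies.values()) - set(prompts)

# A valid generation order lists every column after the columns it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two answer columns depend only on `question` and `context`, not on each other, so their relative order is not fixed.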

### 5. Preview and Generate
Generate a small preview first to verify the prompts and inspect the outputs before running the full job.

In [None]:
preview = data_designer.preview(config_builder, num_records=5)

In [None]:
preview.display_sample_record()

In [None]:
preview.dataset.head(5)

In [None]:
preview.analysis.to_report()

In [None]:
preview.dataset

### 6. Full Generation
Run the full generation job. `num_records=10` keeps the run short for this exercise; increase it to produce a larger DPO dataset.

In [None]:
result = data_designer.create(config_builder, num_records=10, dataset_name='DPO-SDG')

In [None]:
dataset = result.load_dataset()

In [None]:
dataset.head()
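
DPO trainers typically expect each example as a prompt with a preferred and a rejected response. The sketch below maps the three generated columns onto that shape. The `prompt`/`chosen`/`rejected` field names follow a common convention and are an assumption here, as are the helper name `to_dpo_records` and the toy rows.

```python
import json

def to_dpo_records(rows):
    """Map generated rows to prompt/chosen/rejected records, skipping unusable ones."""
    records = []
    for row in rows:
        fields = [row.get("question"), row.get("positive_answer"), row.get("negative_answer")]
        # Skip rows where any of the three fields is missing or blank.
        if not all(isinstance(f, str) and f.strip() for f in fields):
            continue
        question, chosen, rejected = (f.strip() for f in fields)
        # A pair with identical answers carries no preference signal.
        if chosen == rejected:
            continue
        records.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return records

# Toy rows standing in for the generated dataset; the second has an empty negative answer.
rows = [
    {"question": "Who won?", "positive_answer": "Team A won.", "negative_answer": "Team B won."},
    {"question": "When?", "positive_answer": "In 2004.", "negative_answer": ""},
]
records = to_dpo_records(rows)
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

If `dataset` is a pandas DataFrame, as the `dataset.head()` call suggests, the rows could be passed as `to_dpo_records(dataset.to_dict("records"))`. For a RAG-style setup, the `context` column could also be prepended to the prompt.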

### 7. Analysis
Load the analysis of the generated dataset and render it as a report.

In [None]:
analysis = result.load_analysis()

In [None]:
analysis.to_report()
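
As a complement to the built-in report, the length distribution of each generated column can be checked directly with pandas. This is a rough sketch that counts whitespace-separated words as a proxy for tokens; real token counts depend on the model's tokenizer. The helper name `length_profile` and the toy data are hypothetical.

```python
import pandas as pd

def length_profile(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Summarize whitespace-delimited word counts for each text column."""
    # Split each text on whitespace and count the pieces; missing values count as 0.
    counts = pd.DataFrame(
        {col: df[col].fillna("").str.split().str.len() for col in columns}
    )
    # One row per column, with min / mean / max word counts.
    return counts.agg(["min", "mean", "max"]).T

# Toy data standing in for the generated dataset.
toy = pd.DataFrame({
    "question": ["Who won the match?", "When was it played?"],
    "positive_answer": ["Team A won the match by two goals.", "It was played on Saturday."],
})
profile = length_profile(toy, ["question", "positive_answer"])
print(profile)
```

Comparing the profiles of `positive_answer` and `negative_answer` is a quick way to spot a length bias between preferred and rejected responses, which a preference-trained model could otherwise pick up on.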