# Synthetic Data Generation Tutorial using LLaMA and Mixtral

This tutorial demonstrates how to use the SDG repository (`sdg_hub`) to generate synthetic question-answer pairs from documents with large language models such as LLaMA 3.3 70B. For comparison, we also generate data with the Mixtral 8x7B model. We'll cover:

1. Setting up the environment
2. Connecting to LLM servers
3. Configuring the data generation pipeline
4. Generating data with different models
5. Comparing results

In [21]:
# Enable auto-reloading of modules - useful during development
%load_ext autoreload
%autoreload 2


### Setup Instructions

Before running this notebook, install the `sdg_hub` package:

```bash
pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
```
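
Before moving on, it can help to confirm that the packages imported in the next cell are actually importable. The following is a minimal sketch: the helper name `missing_packages` is hypothetical, and the module list is an assumption based on the imports used in this notebook (`dotenv` is the import name of the `python-dotenv` distribution).

```python
from importlib.util import find_spec


def missing_packages(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


# Assumed list, based on the imports used later in this notebook
required = ["datasets", "openai", "dotenv", "transformers", "sdg_hub"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```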

In [22]:
# Import required libraries
# datasets: For handling our data
# OpenAI: For interfacing with the LLM servers
# SDG components: For building our data generation pipeline
from datasets import load_dataset, Dataset
from openai import OpenAI
from dotenv import load_dotenv
import os

from sdg_hub.flow import Flow
from sdg_hub.pipeline import Pipeline
from sdg_hub.sdg import SDG
from sdg_hub.registry import PromptRegistry



### Setting up LLaMA 3.3 70B Model

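First, we need an OpenAI-compatible API endpoint for the LLaMA model. One option is to host the model yourself with vLLM, as shown in step 1. The cell below instead connects to an already hosted endpoint and keeps the local vLLM address as a commented-out alternative.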

1. Start the vLLM server (run in terminal):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --dtype float16 \
    --tensor-parallel-size 8 
```

2. Connect to the model using the OpenAI client below:

In [23]:
# Configure the OpenAI client to connect to an OpenAI-compatible endpoint
# For a local vLLM server, use: endpoint_llama3 = "http://localhost:8000/v1"
endpoint_llama3 = "https://inference-3scale-apicast-production.apps.rits.fmaas.res.ibm.com/llama-3-3-70b-instruct/v1"
endpoint_mixtral = "https://inference-3scale-apicast-production.apps.rits.fmaas.res.ibm.com/mixtral-8x7b-instruct-v01/v1"
openai_api_key = "EMPTY"  # vLLM doesn't require a real API key; the hosted endpoint uses the header below
openai_api_base = endpoint_llama3

# Load RITS_API_KEY from a .env file; never print or display the key in a notebook
load_dotenv()
default_headers = {'RITS_API_KEY': os.environ['RITS_API_KEY']}

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
    default_headers=default_headers,
)

# Verify we can see the model
teacher_model = client.models.list().data[0].id
print(f"Connected to model: {teacher_model}")

Connected to model: meta-llama/llama-3-3-70b-instruct
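
When the API key comes from the environment, it is safer to fail with a clear message if the variable is missing and to print only a masked form when debugging. The sketch below illustrates one way to do this. The helpers `build_auth_headers` and `mask_secret` are hypothetical names introduced for illustration; only the `RITS_API_KEY` header name comes from this notebook, and the key value shown is a placeholder.

```python
import os


def build_auth_headers(env=None, var_name="RITS_API_KEY"):
    """Build default headers for the OpenAI client from an environment mapping.

    Raises a descriptive error instead of a bare KeyError when the key is missing.
    """
    env = os.environ if env is None else env
    api_key = env.get(var_name)
    if not api_key:
        raise RuntimeError(
            f"{var_name} is not set; add it to your .env file or export it in the shell."
        )
    return {var_name: api_key}


def mask_secret(value, visible=4):
    """Return a masked form of a secret that is safe to print."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Usage with a placeholder value (not a real key)
headers = build_auth_headers({"RITS_API_KEY": "example-key-1234"})
print(mask_secret(headers["RITS_API_KEY"]))
```

The resulting dictionary can be passed as `default_headers` to the `OpenAI` client in the same way as in the cell above.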


### Configure LLaMA 3.3 Prompt Template

We need to register the correct chat template for our model to ensure proper prompt formatting.

In [24]:
# Register the LLaMA 3.3 chat template
# This ensures proper formatting of prompts for the model
from transformers import AutoTokenizer

# Load the tokenizer to get the chat template
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-3-3-70b-instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Register the chat template in our prompt registry
# @PromptRegistry.register("meta-llama/Llama-3.3-70B-Instruct")
# @PromptRegistry.register("meta-llama/llama-3-3-70b-instruct")
@PromptRegistry.register(teacher_model)
def llama_3_3_70b_chat_template():
    return tokenizer.chat_template
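
A decorator-based registry of this kind can be pictured as a dictionary from model name to template function. The sketch below is a toy illustration, not `sdg_hub`'s actual implementation; `TemplateRegistry` and `get_template` are hypothetical names. It shows why the name passed to `register` matters if lookup is by exact string, which is presumably why this notebook registers under `teacher_model`, the ID reported by the server.

```python
class TemplateRegistry:
    """Toy registry mapping a model name to a function that returns its chat template."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(func):
            # Store the function under the exact model name given
            cls._registry[name] = func
            return func
        return decorator

    @classmethod
    def get_template(cls, name):
        if name not in cls._registry:
            raise KeyError(f"No chat template registered for model '{name}'")
        return cls._registry[name]()


@TemplateRegistry.register("example-model")
def example_template():
    # Placeholder string standing in for a real Jinja chat template
    return "{{ messages }}"


print(TemplateRegistry.get_template("example-model"))
```

In this toy version, looking up `"Example-Model"` would fail because the key differs in case from the registered name.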


### Configure the Data Generation Pipeline

Now we'll set up our Synthetic Data Generation (SDG) pipeline with the following components:
1. SDG Flow configuration from YAML
2. SDG Pipeline setup
3. SDG configuration with batch processing, number of workers, and save frequency parameters

In [25]:
# Load the flow configuration from YAML file
# flow_cfg = Flow(client).get_flow_from_file("synth_knowledge1.5_llama3.3.yaml")
flow_cfg = Flow(client).get_flow_from_file("synth_knowledge1.5_llama3.3_rits.yaml")

# Initialize the SDG pipeline with processing parameters
sdg = SDG(
    [Pipeline(flow_cfg)],
    num_workers=1,      # Number of parallel workers
    batch_size=1,       # Batch size for processing
    save_freq=1000,     # How often to save checkpoints
)
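
To build intuition for how `batch_size` and `num_workers` interact, the sketch below splits a list of seeds into batches and processes the batches with a thread pool. This is an illustration under assumptions, not how `sdg_hub` is implemented internally; `split_into_batches`, `process_batch`, and `run` are hypothetical helpers, and `process_batch` is a stand-in for a call to the generation pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    # Stand-in for a call to the generation pipeline
    return [f"generated from {item}" for item in batch]


def run(items, batch_size, num_workers):
    """Process all batches with num_workers threads and flatten the results."""
    batches = split_into_batches(items, batch_size)
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves the order of the input batches
        for batch_result in pool.map(process_batch, batches):
            results.extend(batch_result)
    return results


seeds = ["doc-0", "doc-1", "doc-2", "doc-3", "doc-4"]
print(split_into_batches(seeds, batch_size=2))
print(len(run(seeds, batch_size=2, num_workers=2)))
```

With a single seed document, as in this notebook, `num_workers=1` and `batch_size=1` are sufficient; larger values only pay off when there are many seeds to process.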

### Load and Prepare Seed Data

We'll load our seed data (documents) that will be used to generate question-answer pairs.

In [26]:
# Load the seed data from JSON file
# seed_data_path = "Your seed data path"  # Replace with your data path
# seed_data_path = "../instructlab/annotation/sample_data/emotion_classification.jsonl"
# seed_data_path = "../instructlab/skills/sample_data/mdtable_seeds.jsonl"
# seed_data_path = "../../../sample/seed_data_20250411_en.jsonl"
# seed_data_path = "../../../sample/seed_data_20250411_en_2.jsonl"
seed_data_path = "../../../sample/seed_data_20250411_ja.jsonl"
ds = load_dataset('json', data_files=seed_data_path, split='train')

# For testing, we'll use just one example
# example_index = 0
example_index = 9
ds = ds.select(range(example_index, example_index + 1))
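
Before handing a seed file to the pipeline, a quick structural check can catch malformed lines early. The following is a minimal sketch using only the standard library. The helpers `load_jsonl` and `check_required_fields` are hypothetical, and the `document` field is an assumption inferred from the comparison cell later in this notebook, which reads `generated_data[0]['document']`; your flow may require additional fields.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc
    return records


def check_required_fields(records, required=("document",)):
    """Return the indices of records that lack any of the required fields."""
    return [i for i, rec in enumerate(records) if any(k not in rec for k in required)]


# Demonstration with a temporary file containing one valid and one incomplete record
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "seed.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"document": "first"}\n\n{"title": "no document field"}\n')
    records = load_jsonl(path)
    incomplete = check_required_fields(records)

print(f"{len(records)} records, incomplete indices: {incomplete}")
```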

### Generate Data with LLaMA 3.3

Now we'll use our configured pipeline to generate synthetic question-answer pairs.

In [27]:
# Generate synthetic data and save checkpoints
generated_data = sdg.generate(ds, checkpoint_dir="Tmp")

100%|██████████| 1/1 [00:00<00:00, 19508.39it/s]


  0%|          | 0/1 [00:00<?, ?it/s]

Filter: 100%|██████████| 36/36 [00:00<00:00, 7166.35 examples/s]
Filter: 100%|██████████| 36/36 [00:00<00:00, 5998.53 examples/s]




Map: 100%|██████████| 36/36 [00:00<00:00, 2034.48 examples/s]
Filter: 100%|██████████| 36/36 [00:00<00:00, 7685.79 examples/s]
Filter: 100%|██████████| 34/34 [00:00<00:00, 6038.80 examples/s]


Map: 100%|██████████| 32/32 [00:00<00:00, 2690.97 examples/s]
Filter: 100%|██████████| 32/32 [00:00<00:00, 6927.72 examples/s]
Filter: 100%|██████████| 32/32 [00:00<00:00, 5869.24 examples/s]


100%|██████████| 1/1 [13:11<00:00, 791.97s/it]


### Setting up Mixtral Model

For comparison, we'll also generate data using the Mixtral model. First, start the Mixtral server:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --dtype float16 \
    --tensor-parallel-size 8 
```

In [28]:
# Connect to Mixtral model running on a different server
mistral_client = OpenAI(
    api_key="EMPTY",
    base_url=endpoint_mixtral,  # Update with your Mixtral server address
    default_headers=default_headers,
)

# Verify connection to Mixtral model
mistral_client_teacher_model = mistral_client.models.list().data[0].id
print(f"Connected to Mixtral model: {mistral_client_teacher_model}")

Connected to Mixtral model: mistralai/mixtral-8x7B-instruct-v0.1


In [29]:
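# Load the Mixtral tokenizer to get its chat template
tokenizer_mixtral = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Register the Mixtral chat template under the model ID reported by the server
@PromptRegistry.register(mistral_client_teacher_model)
def mixtral_8x7b_instruct_chat_template():
    # Use the Mixtral tokenizer here, not the LLaMA tokenizer loaded earlier
    return tokenizer_mixtral.chat_template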

# @PromptRegistry.register("text-classifier-knowledge-v3-clm") ## ???
# def text_classifier_knowledge_v3_clm_template():
#     return ""


### Configure Mixtral Pipeline

Set up a similar pipeline for Mixtral model generation.

In [30]:
# Create flow configuration for Mixtral
flow_cfg_mistral = Flow(mistral_client).get_flow_from_file(
    # "../../src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5.yaml"
    # "../../src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5_rits.yaml"
    "synth_knowledge1.5_mixtral-8x7b_rits.yaml"
)

# Initialize SDG pipeline for Mixtral
sdg_mistral = SDG(
    [Pipeline(flow_cfg_mistral)],
    num_workers=1,
    batch_size=1,
    save_freq=1000,
)

### Generate Data with Mixtral

Generate synthetic data using the Mixtral model for comparison.

In [31]:
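# Generate data using the Mixtral model
# Note: this reuses the checkpoint directory of the LLaMA run; consider a separate
# directory per model if checkpoints should not be mixed
generated_data_mistral = sdg_mistral.generate(ds, checkpoint_dir="Tmp")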

100%|██████████| 1/1 [00:00<00:00, 19152.07it/s]


  0%|          | 0/1 [00:00<?, ?it/s]

Filter: 100%|██████████| 39/39 [00:00<00:00, 7298.35 examples/s]
Filter: 100%|██████████| 39/39 [00:00<00:00, 6230.35 examples/s]


Map: 100%|██████████| 38/38 [00:00<00:00, 3139.70 examples/s]
Filter: 100%|██████████| 38/38 [00:00<00:00, 8041.96 examples/s]
Filter: 100%|██████████| 38/38 [00:00<00:00, 6360.59 examples/s]


Map: 100%|██████████| 37/37 [00:00<00:00, 3099.76 examples/s]
Filter: 100%|██████████| 37/37 [00:00<00:00, 7886.03 examples/s]
Filter: 100%|██████████| 37/37 [00:00<00:00, 6327.80 examples/s]


100%|██████████| 1/1 [01:10<00:00, 70.02s/it]


### Compare Generated Data

Let's compare the outputs from both models by saving them to a markdown file for easy review.

In [32]:
# Save comparison results to a markdown file
k = 5  # Maximum number of examples to compare
output_file = "model_comparison.md"

# Compare at most k examples, limited by the shorter of the two result sets
num_examples = min(k, len(generated_data), len(generated_data_mistral))

# Write as UTF-8 so that non-ASCII seed data (e.g. Japanese) is saved correctly
with open(output_file, "w", encoding="utf-8") as f:
    # Write the source document first
    f.write(f"### Document\n{generated_data[0]['document']}\n\n")

    # Compare generated Q&A pairs
    for i in range(num_examples):
        f.write(f"Example #{i + 1}\n")

        # LLaMA 3.3 results
        f.write("### Result from llama3.3\n")
        f.write(generated_data[i]['question'] + "\n")
        f.write("*******************************\n")
        f.write(generated_data[i]['response'] + "\n")
        f.write("=================================\n")

        # Mixtral results
        f.write("### Result from mixtral\n")
        f.write(generated_data_mistral[i]['question'] + "\n")
        f.write("*******************************\n")
        f.write(generated_data_mistral[i]['response'] + "\n")
        f.write("\n\n")

print(f"Wrote {num_examples} examples to {output_file}")

Wrote 5 examples to model_comparison.md
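
Alongside the side-by-side markdown file, a few summary numbers can make the comparison easier to scan. The sketch below computes the row count and mean question and response length for each result set. The `summarize` helper is hypothetical, and the toy rows are placeholders standing in for `generated_data` and `generated_data_mistral`; the `question` and `response` field names are taken from the cell above.

```python
def summarize(rows, question_key="question", response_key="response"):
    """Compute the row count and mean question/response length in characters."""
    rows = list(rows)
    count = len(rows)
    if count == 0:
        return {"count": 0, "avg_question_chars": 0.0, "avg_response_chars": 0.0}
    return {
        "count": count,
        "avg_question_chars": sum(len(r[question_key]) for r in rows) / count,
        "avg_response_chars": sum(len(r[response_key]) for r in rows) / count,
    }


# Toy rows standing in for generated_data and generated_data_mistral
llama_rows = [{"question": "What is A?", "response": "A is the first letter."}]
mixtral_rows = [
    {"question": "What is B?", "response": "B follows A."},
    {"question": "Why?", "response": "Because."},
]

for name, rows in [("llama3.3", llama_rows), ("mixtral", mixtral_rows)]:
    print(name, summarize(rows))
```

Because iterating over a Hugging Face `Dataset` yields one dictionary per row, the same function should also accept the generated datasets directly. Character counts are only a rough proxy for quality, so they complement rather than replace manual review.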


### Production Usage

For large-scale data generation, use the command-line script instead of this notebook:

```bash
python scripts/generate.py --ds_path seed_data.jsonl \
    --bs 2 --num_workers 10 \
    --save_path <your_save_path> \
    --flow ../src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5.yaml \
    --checkpoint_dir <your_checkpoint_dir> \
    --endpoint <your_endpoint>
```

Note: For LLaMA 3.3, use `synth_knowledge1.5_llama3.3.yaml` as the flow configuration file.
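
If you want to launch the script from Python, for example once per model, the command can be assembled as an argument list. This is a sketch under assumptions: `build_generate_command` is a hypothetical helper, the flags are copied from the command above, and the save path, checkpoint directory, and flow file location are placeholders that you need to adapt to your checkout.

```python
import shlex


def build_generate_command(ds_path, save_path, flow, checkpoint_dir, endpoint,
                           batch_size=2, num_workers=10):
    """Assemble the argument list for scripts/generate.py."""
    return [
        "python", "scripts/generate.py",
        "--ds_path", ds_path,
        "--bs", str(batch_size),
        "--num_workers", str(num_workers),
        "--save_path", save_path,
        "--flow", flow,
        "--checkpoint_dir", checkpoint_dir,
        "--endpoint", endpoint,
    ]


cmd = build_generate_command(
    ds_path="seed_data.jsonl",
    save_path="output/generated.jsonl",        # placeholder path
    flow="synth_knowledge1.5_llama3.3.yaml",   # adjust to where the flow file lives
    checkpoint_dir="checkpoints/llama3.3",     # placeholder path
    endpoint="http://localhost:8000/v1",
)

# Print the command for inspection; run it with subprocess.run(cmd, check=True)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.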