# Synthetic Data Generation

Synthetic Data Generation (SDG) is the process of creating data using statistical simulations or AI models.

### Why Synthetic Data?
- Scalability
- Privacy preservation
- Ability to simulate hard-to-capture scenarios

### Limitations of Synthetic Data
- Lack of real-world authenticity
- Overfitting and bias

### Overcoming Limitations
- Hybrid approach
- Validation on real data
- Regularization
- Diversification
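
As one way to picture the diversification point, the sketch below draws a random label, category, and type for each sample so that prompts do not all target the same combination. The label names, category names, and the `sample_combinations` helper are hypothetical placeholders, not values taken from this project's configuration.

```python
import random
from typing import Dict, List, Tuple


def sample_combinations(
    labels: List[str],
    categories_types: Dict[str, List[str]],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Draw n random (label, category, type) triples (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the draws are reproducible
    categories = list(categories_types)
    combos = []
    for _ in range(n):
        label = rng.choice(labels)
        category = rng.choice(categories)
        # The type is drawn from the types listed under the chosen category
        type_ = rng.choice(categories_types[category])
        combos.append((label, category, type_))
    return combos


# Placeholder values, assumed for illustration only
labels = ["polite", "impolite"]
categories_types = {
    "travel": ["hotel booking", "flight delay"],
    "finance": ["loan inquiry"],
}
for combo in sample_combinations(labels, categories_types, n=3):
    print(combo)
```

Seeding the generator is a design choice made here for repeatability; an unseeded generator gives different combinations on each run.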

## Generate Synthetic Data

### Log in with a Hugging Face access token

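If no `HF_TOKEN` is found in the `.env` file, a login widget appears. Paste your Hugging Face token, which must have appropriate access to the model being used, uncheck the option to add it as a git credential, and log in.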

In [1]:
import os
from dotenv import load_dotenv
from huggingface_hub import login

def read_token() -> None:
    """
    Logs into Hugging Face using a token stored in a '.env' file under the key `HF_TOKEN`.
    If the '.env' file is missing or `HF_TOKEN` is not provided, the user will be prompted to log in.
    """
    load_dotenv()
    token = os.getenv("HF_TOKEN")
    login(token)

read_token()



### Synthetic Data Generation function

In [14]:
import random
from datetime import datetime
import pandas as pd
from transformers import pipeline
import re
from typing import Dict, List, Tuple

def parse_string(input_string: str) -> Tuple[str, str]:
    """
    Parses a string containing `OUTPUT:` and `REASONING:` sections and extracts their values.

    Args:
        input_string (str): The input string containing `OUTPUT:` and `REASONING:` labels.

    Returns:
        Tuple[str, str]: A tuple containing two strings:
                         - The content following `OUTPUT:`.
                         - The content following `REASONING:`.

    Raises:
        ValueError: If the input string does not match the expected format with both `OUTPUT:` and `REASONING:` sections.

    Note:
        - The function is case-sensitive and assumes `OUTPUT:` and `REASONING:` are correctly capitalized.
    """
    # Use regular expressions to extract OUTPUT and REASONING
    match = re.search(r"OUTPUT:\s*(.+?)\s*REASONING:\s*(.+)", input_string, re.DOTALL)

    if not match:
        raise ValueError(
            "The generated response is not in the expected 'OUTPUT:... REASONING:...' format."
        )

    # Extract the matched groups: output and reasoning
    output = match.group(1).strip()
    reasoning = match.group(2).strip()

    return output, reasoning

def sdg(
    sample_size: int,
    labels: List[str],
    label_descriptions: str,
    categories_types: Dict[str, List[str]],
    use_case: str,
    prompt_examples: str,
    model: str,
    max_new_tokens: int,
    batch_size: int,
    output_dir: str,
    save_reasoning: bool,
) -> None:
    """
    Generates synthetic data based on specified categories and labels.

    Args:
        sample_size (int): The number of synthetic data samples to generate.
        labels (List[str]): The labels used to classify the synthetic data.
        label_descriptions (str): A description of the meaning of each label.
        categories_types (Dict[str, List[str]]): The categories, each mapped to a list of its types, used for data generation and diversification.
        use_case (str): The use case of the synthetic data to provide context for the language model.
        prompt_examples (str): The examples used in the Few-Shot or Chain-of-Thought prompting.
        model (str): The large language model used for generating the synthetic data.
        max_new_tokens (int): The maximum number of new tokens to generate for each sample.
        batch_size (int): The number of samples per batch to append to the output file.
        output_dir (str): The directory path where the output file will be saved.
        save_reasoning (bool): Whether to save the reasoning or explanation behind the generated data.
    """

    categories = list(categories_types.keys())

    # Generate filename with current date and time
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, f"{timestamp}.csv")

    # If sample_size is not divisible by batch_size, an extra batch is added
    num_batches = (sample_size + batch_size - 1) // batch_size

    print(
        f"\U0001f680  Synthetic data will be appended to {output_path} in {num_batches} batch(es)."
    )

    for batch in range(num_batches):
        # Calculate the start and end indices for the current batch
        start = batch * batch_size
        end = min(start + batch_size, sample_size)

        # Store results of the current batch
        batch_data = []

        # Assign random labels to the current batch
        batch_random_labels = random.choices(labels, k=batch_size)

        # Assign random categories to the current batch
        batch_random_categories = random.choices(categories, k=batch_size)

        for i in range(start, end):
            # Pick one random type from the types listed for the ith category
            # (random.choice returns a single item, not a one-element list)
            random_type = random.choice(
                categories_types[batch_random_categories[i - start]]
            )
            prompt = f"""You should create synthetic data for specified labels and categories. 
            This is especially useful for {use_case}.

            *Label Descriptions*
            {label_descriptions}

            *Examples*
            {prompt_examples}

            ####################

            Generate one output for the classification below.
            You may use the examples I have provided as a guide, but you cannot simply modify or rewrite them.
            Only return the OUTPUT and REASONING. 
            Do not return the LABEL, CATEGORY, or TYPE.

            LABEL: {batch_random_labels[i - start]}
            CATEGORY: {batch_random_categories[i - start]}
            TYPE: {random_type}
            OUTPUT:
            REASONING:
            """
            messages = [
                {
                    "role": "system",
                    "content": f"You are a helpful assistant designed to generate synthetic data for {use_case} with labels {labels} in categories {categories}.",
                },
                {"role": "user", "content": prompt},
            ]
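            # Build the pipeline only for the first sample and reuse it afterwards,
            # so the model is not reloaded on every iteration
            if i == 0:
                generator = pipeline("text-generation", model=model)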
            result = generator(messages, max_new_tokens=max_new_tokens)[0][
                "generated_text"
            ][-1]["content"]

            # Uncomment to see the raw outputs
            # print(result)

            text, reasoning = parse_string(result)

            entry = {
                "text": text,
                "label": batch_random_labels[i - start],
                "model": model,
            }

            if save_reasoning:
                entry["reasoning"] = reasoning

            batch_data.append(entry)

        # Convert the batch results to a DataFrame
        batch_df = pd.DataFrame(batch_data)

        # Append the DataFrame to the CSV file
        if batch == 0:
            # If it's the first batch, write headers
            batch_df.to_csv(output_path, mode="w", index=False)
        else:
            # For subsequent batches, append without headers
            batch_df.to_csv(output_path, mode="a", header=False, index=False)
        print(f"\U000026a1  Saved batch number {batch + 1}/{num_batches}")

    print("\n\nGenerated Data:")
    df = pd.read_csv(output_path)
    display(df.head())
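
To make the expected response format concrete, the sketch below applies the same regular expression as `parse_string` to a hand-written response. The sample string is made up for illustration and is not actual model output, and `split_output_reasoning` is a hypothetical standalone name used only so the block runs on its own.

```python
import re
from typing import Tuple

# Same pattern as in parse_string: lazy OUTPUT capture, greedy REASONING capture
PATTERN = re.compile(r"OUTPUT:\s*(.+?)\s*REASONING:\s*(.+)", re.DOTALL)


def split_output_reasoning(response: str) -> Tuple[str, str]:
    """Standalone copy of the parsing logic, for illustration."""
    match = PATTERN.search(response)
    if not match:
        raise ValueError("Response is not in 'OUTPUT:... REASONING:...' format.")
    return match.group(1).strip(), match.group(2).strip()


# Hand-written example response (an assumption, not model output)
sample = "OUTPUT: Thank you for your patience.\nREASONING: It expresses gratitude."
text, reasoning = split_output_reasoning(sample)
print(text)
print(reasoning)

# A response without a REASONING section is rejected
try:
    split_output_reasoning("OUTPUT: Thank you for your patience.")
except ValueError as err:
    print(f"Rejected: {err}")
```

Because `sdg` calls `parse_string` without catching the `ValueError`, a single malformed response stops the run; catching the error and retrying or skipping that sample is one possible way to handle it.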

### Run Synthetic Data Generation

Load the configuration from the config file.
The sample size is set to 4 for testing purposes; to train a final model you will need more than 10K samples for accurate results.
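
The configuration module is expected to expose the five names that the `sdg` call reads: `labels`, `label_descriptions`, `categories_types`, `use_case`, and `prompt_examples`. The actual file is not shown here, so the following is only a sketch of what such a module might contain. Every value is a placeholder assumption, apart from the four label names, which appear in the generated output.

```python
# Hypothetical contents for a config module such as polite-guard-config.py.
# All values are illustrative assumptions, not the project's real configuration.
from typing import Dict, List

labels: List[str] = ["polite", "somewhat polite", "neutral", "impolite"]

label_descriptions: str = (
    "polite: courteous and respectful.\n"
    "somewhat polite: generally respectful but less warm.\n"
    "neutral: factual, with no clear politeness markers.\n"
    "impolite: rude or dismissive.\n"
)

# Each category maps to a list of more specific types (assumed structure)
categories_types: Dict[str, List[str]] = {
    "travel": ["hotel booking", "flight delay"],
    "finance": ["loan inquiry", "credit score"],
}

use_case: str = "customer service interactions"

prompt_examples: str = (
    "LABEL: polite\n"
    "CATEGORY: travel\n"
    "TYPE: hotel booking\n"
    "OUTPUT: Thank you for reaching out. I would be happy to help with your booking.\n"
    "REASONING: The text thanks the customer and offers help.\n"
)

# Quick sanity checks before starting a long generation run
assert all(label in label_descriptions for label in labels)
assert all(types for types in categories_types.values())
```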

In [15]:
import importlib.util
import sys

# Dynamically load the configuration module
config_path = "./config/polite-guard-config.py"
if not os.path.exists(config_path):
    print(f"Error: Configuration file not found at {config_path}")
    sys.exit(1)

try:
    spec = importlib.util.spec_from_file_location("config_module", config_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not load spec for module at {config_path}")
    sdg_config = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(sdg_config)
except Exception as e:
    print(f"Error loading configuration from {config_path}: {e}")
    sys.exit(1)

sdg(
    sample_size=4,
    labels=sdg_config.labels,
    label_descriptions=sdg_config.label_descriptions,
    categories_types=sdg_config.categories_types,
    use_case=sdg_config.use_case,
    prompt_examples=sdg_config.prompt_examples,
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=256,
    batch_size=20,
    output_dir="./output",
    save_reasoning=True,
)

🚀  Synthetic data will be appended to ./output/20250507_163808.csv in 1 batch(es).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


⚡  Saved batch number 1/1


Generated Data:


| | text | label | model | reasoning |
|---|---|---|---|---|
| 0 | I'll check what options we have for you, and I... | somewhat polite | meta-llama/Llama-3.2-3B-Instruct | This text would be classified as "somewhat pol... |
| 1 | I'm glad you're interested in joining our team... | polite | meta-llama/Llama-3.2-3B-Instruct | This text is polite because it expresses enthu... |
| 2 | Your credit score will be calculated based on ... | neutral | meta-llama/Llama-3.2-3B-Instruct | This text would be classified as "neutral." Th... |
| 3 | You're wasting your time with this play. It's ... | impolite | meta-llama/Llama-3.2-3B-Instruct | This statement is impolite because it directly... |
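
Once a larger file has been generated, a quick look at the label distribution can show whether some labels are under-represented. The sketch below uses only the standard library. The inline CSV mirrors the column layout written by `sdg`, but its rows are placeholders, and `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def label_counts(csv_file: TextIO) -> Counter:
    """Count how many rows carry each label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# Placeholder rows using the same columns as the generated file
sample_csv = io.StringIO(
    "text,label,model,reasoning\n"
    "example text 1,polite,some-model,example reasoning\n"
    "example text 2,neutral,some-model,example reasoning\n"
    "example text 3,polite,some-model,example reasoning\n"
)
counts = label_counts(sample_csv)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

For a real run, the same helper could be given a file opened with `open(path, newline="")`, where `path` points at the CSV in the output directory.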
