# **Generate Synthetic Data**

In this notebook, we focus on **generating synthetic data** for training and evaluation purposes. This involves retrieving chunks of data, generating synthetic questions based on these chunks, and filtering and embedding the generated questions.

### Objectives:
- **Retrieve Data Chunks:** Retrieve chunks of data from the search index.
- **Generate Synthetic Questions:** Generate synthetic questions based on the retrieved chunks.
- **Filter Questions:** Filter the generated questions to ensure quality.
- **Embed Questions:** Embed the filtered questions for further use.
- **Store Data:** Save the high-quality synthetic questions to a JSON Lines file.

### Key Steps:
1. **Retrieve Data Chunks:** Retrieve chunks of data from the search index using Azure AI Search (formerly Azure Cognitive Search).
2. **Generate Synthetic Questions:** Use a synthetic data generator to create questions based on the retrieved chunks.
3. **Filter Questions:** Filter the generated questions using criteria such as Jaccard similarity and embedding similarity.
4. **Embed Questions:** Embed the filtered questions using a text embedding model.
5. **Store Data:** Save the high-quality synthetic questions to a JSON Lines file for future use.

The result is a set of synthetic questions that has been generated, filtered, embedded, and stored on disk for later training and evaluation.

In [1]:
from dotenv import dotenv_values, load_dotenv
from typing import List
from typing_extensions import Annotated

from pydantic import BaseModel, Field

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

# load_dotenv(".env")
config = dotenv_values(".env")

aoai_endpoint = config["AZURE_OPENAI_API_BASE"]
aoai_key = config["AZURE_OPENAI_API_KEY"]
aoai_api_version = config["AZURE_OPENAI_API_VERSION"]
aoai_chat_model = config["AZURE_OPENAI_MODEL"]
aoai_chat_model_mini = config["AZURE_OPENAI_MODEL_MINI"]
aoai_embedding_model = config["AZURE_OPENAI_EMBEDDING_MODEL"]
search_endpoint = config["SEARCH_ENDPOINT"]
search_key = config["SEARCH_KEY"]
credential = AzureKeyCredential(search_key)
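Indexing `config[...]` raises a bare `KeyError` on the first missing key, and a key that is declared without a value may come back as `None` or an empty string. The following is a minimal sketch of a fail-fast check that reports all problem keys at once. The helper name `find_missing_keys` and the toy values are hypothetical; they are not part of this notebook or of `python-dotenv`.

```python
from typing import List, Mapping, Optional, Sequence

# Keys read by the cell above
REQUIRED_KEYS = [
    "AZURE_OPENAI_API_BASE",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_MODEL",
    "AZURE_OPENAI_MODEL_MINI",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "SEARCH_ENDPOINT",
    "SEARCH_KEY",
]


def find_missing_keys(
    config: Mapping[str, Optional[str]], required: Sequence[str]
) -> List[str]:
    """Return the required keys that are absent, None, or blank in config."""
    return [key for key in required if not (config.get(key) or "").strip()]


# Toy configuration (placeholder values, not real settings)
example_config = {"SEARCH_ENDPOINT": "https://example.invalid", "SEARCH_KEY": ""}
missing = find_missing_keys(
    example_config, ["SEARCH_ENDPOINT", "SEARCH_KEY", "AZURE_OPENAI_API_KEY"]
)
print(missing)  # ['SEARCH_KEY', 'AZURE_OPENAI_API_KEY']
```

In the notebook, this could be called as `find_missing_keys(config, REQUIRED_KEYS)` right after `dotenv_values(".env")`, raising an error if the returned list is non-empty.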

### **Get all chunks**

In [2]:
def get_all_chunks(index, k=1000):
    """Return up to k chunks (id, title, text, embedding) from the given search index."""
    try:
        # Perform a match-all search query
        search_client = SearchClient(endpoint=search_endpoint, index_name=index, credential=credential)
        # `top` is capped at 1000 chunks/documents per retrieval call. If the index
        # holds more than 1000 chunks/documents, you will need to paginate through the results.
        results = search_client.search(
            search_text="*",
            top=k
        )

        # Format search results
        data = []
        for result in results:
            data.append(
                {
                    "chunk_id": result["chunk_id"],
                    "title": result["title"],
                    "chunk": result["chunk"],
                    "chunk_embedding": result["text_vector"]
                }
            )
        return data
    except Exception as e:
        # Re-raise instead of returning None, which would only fail later at len(chunks)
        print(f"Failed to retrieve chunks from index '{index}': {e}")
        raise
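The comment in `get_all_chunks` notes that indexes with more than 1000 chunks need pagination. The sketch below shows one way that loop could look. It is deliberately written against a hypothetical `fetch_page(skip, top)` callable so it runs without a search service; in the notebook, such a callable might wrap `search_client.search(search_text="*", top=top, skip=skip)`, which is an assumption about how you would wire it up rather than code from this notebook.

```python
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 1000  # per-call cap mentioned in get_all_chunks


def iterate_all_documents(
    fetch_page: Callable[[int, int], List[Dict]], page_size: int = PAGE_SIZE
) -> Iterator[Dict]:
    """Yield documents page by page until a short or empty page signals the end."""
    skip = 0
    while True:
        page = list(fetch_page(skip, page_size))
        yield from page
        if len(page) < page_size:
            break
        skip += page_size


# Stand-in for the search index: 2,500 fake documents
fake_index = [{"chunk_id": str(i)} for i in range(2500)]


def fake_fetch_page(skip: int, top: int) -> List[Dict]:
    # Hypothetical replacement for a real search call with skip/top
    return fake_index[skip:skip + top]


all_docs = list(iterate_all_documents(fake_fetch_page))
print(len(all_docs))  # 2500
```

Offset-based paging is only reliable when the result order is stable between calls, so a real implementation would likely also sort on a unique field. Check the service limits on `skip` before relying on this for very large indexes.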

In [3]:
index_name = "product-customer-vector"

chunks = get_all_chunks(
    index=index_name,
    k=30
)

len(chunks)

30

### **Generate synthetic chunk-dependent questions**

In [4]:
import json
import os
import random
import string

import dotenv
from tqdm import tqdm


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from utils.genai.generate_synthetic_questions import GenerateSyntheticData
from utils.genai.filter_synthetic_questions import filter_synthetic_questions
from utils.genai.invoker import embed_text
from utils.sdg_generator_helper import (
    get_instruction,
    get_domain,
    get_tone,
    get_question_length,
    get_difficulty,
    get_topic,
    get_language,
    set_is_grounded
)



In [5]:
def get_model_config():
    dotenv.load_dotenv(".env")
    return {
        "azure_endpoint": aoai_endpoint,
        "api_key": aoai_key,
        "azure_deployment": aoai_chat_model, #"gpt-4o-mini-sdg-llm"
        "api_version": aoai_api_version
    }

In [6]:
def get_top_k_similar_chunks(chunks, current_index, k=3):
    """
    For the chunk at current_index, compute cosine similarity to all other chunks,
    and return the texts of the top-k most similar chunks (excluding the current one).
    """
    embeddings = np.array([chunk["chunk_embedding"] for chunk in chunks])
    current_embedding = embeddings[current_index].reshape(1, -1)
    similarities = cosine_similarity(current_embedding, embeddings).flatten()

    # Exclude the current chunk by setting its similarity to -inf
    similarities[current_index] = -np.inf

    # Clamp k so argpartition does not fail when there are fewer than k other chunks
    k = min(k, len(chunks) - 1)
    if k <= 0:
        return []

    # Get the indices of the top-k most similar chunks, then sort them by descending similarity
    top_k_indices = np.argpartition(similarities, -k)[-k:]
    top_k_indices = top_k_indices[np.argsort(similarities[top_k_indices])[::-1]]
    return [chunks[i]["chunk"] for i in top_k_indices]

def build_multi_chunk_context(primary_chunk, additional_chunks):
    """
    Combine the primary chunk with additional similar chunks into one context string.
    Use clear delimiters to indicate different parts.
    """
    context_parts = []
    context_parts.append("Primary Chunk:\n" + primary_chunk)
    for i, add_chunk in enumerate(additional_chunks, start=1):
        context_parts.append(f"Additional Context {i}:\n" + add_chunk)
    
    full_context = "\n\n".join(context_parts)
    return full_context


def generate_synthetic_questions(chunks, generator, num_questions_per_chunk, multi=False):
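    """
    Generate num_questions_per_chunk synthetic questions for each chunk.
    If multi is True, the context is the chunk plus its top-k most similar chunks
    (by cosine similarity); otherwise the chunk alone is used as context.
    Returns a tuple (all_results, failed_results).
    """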
    all_results = []
    failed_results = []
    total_iterations = len(chunks) * num_questions_per_chunk

    with tqdm(total=total_iterations, desc="Generating synthetic questions") as pbar:
        for current_index, chunk in enumerate(chunks):
            if multi:
                top_k = 3
                similar_chunks = get_top_k_similar_chunks(chunks, current_index, k=top_k)
                multi_chunk_context = build_multi_chunk_context(chunk["chunk"], similar_chunks)
                context = multi_chunk_context
                task_path = "configs/settings/tasks/task_multi_grounded_not_grounded_questions.txt"
            else:
                context = chunk["chunk"]
                task_path = "configs/settings/tasks/task_single_grounded_not_grounded_questions.txt"
                multi_chunk_context = "single chunk was used"

            for question_index in range(num_questions_per_chunk):
                # Sample metadata for each question
                domain = get_domain()
                tone = get_tone()  # stored with each result; not passed to the generator call below
                difficulty = get_difficulty()
                question_length = get_question_length(min_length=4)
                topic = get_topic()
                language = get_language()
                instructions = "None"
                task = get_instruction(task_path)
                is_grounded = set_is_grounded()

                try:
                    # Call the generator with the selected context (a single chunk, or the multi-chunk context when multi=True)
                    result = generator(
                        chunk=context,
                        is_grounded=is_grounded,
                        domain=domain,
                        difficulty=difficulty,
                        topic=topic,
                        language=language,
                        instructions=instructions,
                        question_length=question_length,
                        task=task
                    )
                    # Since we requested one question, grab the first element.
                    question = result["question"][0]
                    explanation = result["explanation"][0]
                    response = result["response"][0]
                    synthetic_chunk_id = f"{chunk['chunk_id']}_synthetic_{question_index+1}"


                    all_results.append({
                        "synthetic_question": question,
                        "explanation": explanation,
                        "synthetic_response": response,
                        "chunk_id": synthetic_chunk_id,
                        "is_grounded": is_grounded,
                        "chunk_data": chunk["chunk"],
                        "aggregated_context": multi_chunk_context,
                        "question_number": question_index,
                        "domain": domain,
                        "difficulty": difficulty,
                        "tone": tone,
                        "language": language,
                        "question_length": question_length
                    })
                except Exception as e:
                    failed_results.append({
                        "error": str(e),
                        "chunk_id": chunk["chunk_id"],
                        "is_grounded": is_grounded,
                        "chunk_data": chunk["chunk"],
                        "question_number": question_index,
                        "domain": domain,
                        "difficulty": difficulty,
                        "tone": tone,
                        "language": language,
                        "question_length": question_length
                    })

                pbar.update(1)

    return all_results, failed_results
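To make the ranking in `get_top_k_similar_chunks` concrete, here is a NumPy-only sketch of the same idea on toy 2-D vectors. It is not the notebook's function: `top_k_by_cosine` is a hypothetical name, it returns indices rather than chunk texts, it uses a full sort instead of `argpartition`, and it assumes no embedding is the zero vector.

```python
import numpy as np


def top_k_by_cosine(embeddings: np.ndarray, current_index: int, k: int) -> list:
    """Return indices of the k rows most similar to row current_index, best first."""
    # Normalise rows so that a dot product equals cosine similarity
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = normalized @ normalized[current_index]
    # Exclude the query row itself
    similarities[current_index] = -np.inf
    # Never ask for more neighbours than there are other rows
    k = min(k, len(embeddings) - 1)
    order = np.argsort(similarities)[::-1]
    return order[:k].tolist()


toy = np.array([
    [1.0, 0.0],  # chunk 0
    [0.9, 0.1],  # chunk 1: almost parallel to chunk 0
    [0.0, 1.0],  # chunk 2: orthogonal to chunk 0
    [0.7, 0.7],  # chunk 3: 45 degrees from chunk 0
])
neighbours = top_k_by_cosine(toy, current_index=0, k=2)
print(neighbours)  # [1, 3]
```

Chunk 1 ranks first because it points in nearly the same direction as chunk 0, and the orthogonal chunk 2 is left out.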


In [7]:
# TODO: multi-chunk generation is not currently supported
multi_choice = False

# Get the model configuration
model_config = get_model_config()

# Instantiate your generator
sdg_generator = GenerateSyntheticData(model_config, multi=multi_choice)

# Generate synthetic chunk-dependent questions, each repeated N times with varied parameters
synthetic_data, failed_samples = generate_synthetic_questions(chunks, sdg_generator, num_questions_per_chunk=1, multi=multi_choice)

Generating synthetic questions: 100%|██████████| 30/30 [01:36<00:00,  3.23s/it]


In [8]:
synthetic_data[0]

{'synthetic_question': 'What is the price of the SummitClimber Backpack purchased by David Kim?',
 'explanation': "The provided chunk contains detailed information about David Kim's recent purchases, including the SummitClimber Backpack. The description specifies the price of the SummitClimber Backpack as $240, making the question directly answerable from the chunk.",
 'synthetic_response': 'The SummitClimber Backpack purchased by David Kim costs $240.',
 'chunk_id': '252167c94bb6_aHR0cHM6Ly9zdG9yYWdlcG92ZWwuYmxvYi5jb3JlLndpbmRvd3MubmV0L2NvbnRhaW5lcnBvdmVsL2N1c3RvbWVyXzUubWQ1_pages_0_synthetic_1',
 'is_grounded': True,
 'chunk_data': '## Customer_Info\n\nFirst Name: David \nLast Name: Kim \nAge: 42 \nEmail Address: davidkim@example.com \nPhone Number: 555-555-5555 \nShipping Address: 654 Pine St,  Suburbia USA, 23456 \nMembership: Gold \n\n## Recent_Purchases\n\norder_number: 7 \ndate: 2023-02-15 \nitem:\n- description:  Adventurer Pro Backpack, quantity 2, price $180 \n\xa0 item_numbe

In [9]:
accepted_samples, rejected_samples = filter_synthetic_questions(
    synthetic_data=synthetic_data,
    jaccard_threshold=0.8,
    embedding_threshold=0.8,
    combination_mode="or",
    min_question_length=5,
    max_question_length=150,
    use_embedding=False
)

print(f"\nNumber of filtered out synthetic questions: {len(rejected_samples)}")

Filtering synthetic questions: 100%|██████████| 30/30 [00:00<00:00, 1997.54it/s]


Number of filtered out synthetic questions: 0
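The implementation of `filter_synthetic_questions` is not shown in this notebook. As an illustration of what a `jaccard_threshold` of 0.8 could mean, the sketch below computes Jaccard similarity on lowercase word sets and drops any question that is too close to one already kept. The tokenization, the question-to-question comparison, and both function names are assumptions, not the helper's actual behavior.

```python
import re


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def drop_near_duplicates(questions, threshold=0.8):
    """Keep a question only if it is below the threshold against every kept question."""
    kept = []
    for question in questions:
        if all(jaccard_similarity(question, other) < threshold for other in kept):
            kept.append(question)
    return kept


questions = [
    "What is the price of the SummitClimber Backpack?",
    "What is the price of the SummitClimber backpack",
    "Which membership level does David Kim have?",
]
unique_questions = drop_near_duplicates(questions)
print(len(unique_questions))  # 2
```

The second question differs from the first only in casing and punctuation, so its word set is identical (similarity 1.0) and it is dropped.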





### **Adding embeddings**

In [10]:
# Count total questions
total_questions = len(accepted_samples)

# Create a single progress bar
with tqdm(total=total_questions, desc="Embedding all questions") as pbar:
    for item in accepted_samples:
        item["synthetic_embedding"] = embed_text(item["synthetic_question"])
        pbar.update(1)

Embedding all questions: 100%|██████████| 30/30 [00:04<00:00,  6.30it/s]


### **Storing synthetic data**

In [11]:
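def save_high_quality_synthetic_questions(json_list, path):
    """Write a list of JSON-serializable dicts to a JSON Lines file, one object per line."""
    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for obj in json_list:
            f.write(json.dumps(obj) + "\n")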

In [12]:
save_high_quality_synthetic_questions(accepted_samples, "data/ft-judge/single/chunk-specific-synthetic-questions.jsonl")
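A quick way to check the stored file is to read it back and compare it with what was written. The sketch below does a round trip on a temporary file with two toy records; `load_jsonl` is a hypothetical helper, not something defined in this notebook or its `utils` package.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Toy records shaped like a small subset of the fields saved above
sample = [
    {"synthetic_question": "What is the price?", "is_grounded": True},
    {"synthetic_question": "Who placed order 7?", "is_grounded": False},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "sample.jsonl")
    # Write one JSON object per line, as the save function does
    with open(tmp_path, "w", encoding="utf-8") as f:
        for obj in sample:
            f.write(json.dumps(obj) + "\n")
    loaded = load_jsonl(tmp_path)

print(loaded == sample)  # True
```

Applied to the notebook's output, `len(load_jsonl(path))` should equal `len(accepted_samples)`.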