# **Generate Synthetic Data**

In this notebook, we **generate synthetic data** for training and evaluation. We retrieve chunks from the search index, generate synthetic questions grounded in those chunks, filter out low-quality questions, and embed the ones that remain.

### Objectives:
- **Retrieve Data Chunks:** Retrieve chunks of data from the search index.
- **Generate Synthetic Questions:** Generate synthetic questions based on the retrieved chunks.
- **Filter Questions:** Filter the generated questions to ensure quality.
- **Embed Questions:** Embed the filtered questions for further use.
- **Store Data:** Save the high-quality synthetic questions to a JSON Lines file.

### Key Steps:
1. **Retrieve Data Chunks:** Retrieve chunks of data from the search index using Azure AI Search (formerly Azure Cognitive Search).
2. **Generate Synthetic Questions:** Use a synthetic data generator to create questions based on the retrieved chunks.
3. **Filter Questions:** Filter the generated questions using criteria such as Jaccard similarity and embedding similarity.
4. **Embed Questions:** Embed the filtered questions using a text embedding model.
5. **Store Data:** Save the high-quality synthetic questions to a JSON Lines file for future use.

The output is a filtered, embedded set of synthetic questions stored as JSON Lines, ready to use for training and evaluation.

In [1]:
from dotenv import dotenv_values, load_dotenv
from typing import List
from typing_extensions import Annotated

from pydantic import BaseModel, Field

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

# load_dotenv(".env")
config = dotenv_values(".env")

aoai_endpoint = config["AZURE_OPENAI_API_BASE"]
aoai_key = config["AZURE_OPENAI_API_KEY"]
aoai_api_version = config["AZURE_OPENAI_API_VERSION"]
aoai_chat_model = config["AZURE_OPENAI_MODEL"]
aoai_chat_model_mini = config["AZURE_OPENAI_MODEL_MINI"]
aoai_embedding_model = config["AZURE_OPENAI_EMBEDDING_MODEL"]
search_endpoint = config["SEARCH_ENDPOINT"]
search_key = config["SEARCH_KEY"]
credential = AzureKeyCredential(search_key)
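
Indexing `config[...]` directly raises a bare `KeyError` on the first missing variable. The following is a minimal sketch, assuming the same variable names as the cell above, that reports every missing or empty setting at once. `require_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration; they are not part of `python-dotenv`.

```python
from typing import Dict, Iterable, Mapping, Optional

# Names taken from the configuration cell above.
REQUIRED_KEYS = (
    "AZURE_OPENAI_API_BASE",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_MODEL",
    "AZURE_OPENAI_MODEL_MINI",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "SEARCH_ENDPOINT",
    "SEARCH_KEY",
)


def require_keys(values: Mapping[str, Optional[str]], keys: Iterable[str]) -> Dict[str, str]:
    """Return the requested settings, or raise one error listing every missing key."""
    keys = list(keys)  # allow any iterable, since it is traversed twice
    missing = [key for key in keys if not values.get(key)]
    if missing:
        raise KeyError(f"Missing or empty settings in .env: {', '.join(missing)}")
    return {key: values[key] for key in keys}
```

With the `config` mapping loaded above, this could be called as `settings = require_keys(config, REQUIRED_KEYS)` before any client is created.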

### **Get all chunks**

In [2]:
def get_all_chunks(index, k=1000):
    try:
        # Run a match-all query against the index.
        search_client = SearchClient(endpoint=search_endpoint, index_name=index, credential=credential)
        results = search_client.search(
            search_text="*",
            # `top` is capped at 1000 chunks/documents per retrieval call.
            # If the index holds more than 1000, paginate through the results.
            top=k,
        )

        # format search results
        data = []
        for result in results:
            data.append(
                {
                    "chunk_id": result["chunk_id"],
                    "title": result["title"],
                    "chunk": result["chunk"],
                    "chunk_embedding": result["text_vector"]
                }
            )
        return data
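    except Exception as e:
        # Report which index failed; callers must handle the None return value.
        print(f"Failed to retrieve chunks from index '{index}': {e}")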
        return None
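
The comment in `get_all_chunks` notes that larger indexes need pagination. The sketch below shows one way to do it, under stated assumptions: `fetch_all` is a hypothetical helper that takes any callable returning one page for a given `skip` and `top`, so it can be tried without a live index. Wiring it to the search client through the `skip` keyword of `SearchClient.search` is an assumption about how this notebook would use it, and stable paging may also require an explicit sort order on the query.

```python
from typing import Any, Callable, Dict, Iterable, List

Page = Iterable[Dict[str, Any]]


def fetch_all(fetch_page: Callable[[int, int], Page], page_size: int = 1000) -> List[Dict[str, Any]]:
    """Collect every document by requesting consecutive pages until one comes back short."""
    documents: List[Dict[str, Any]] = []
    skip = 0
    while True:
        page = list(fetch_page(skip, page_size))
        documents.extend(page)
        # A page smaller than the requested size means there is nothing left.
        if len(page) < page_size:
            return documents
        skip += page_size


# Assumed wiring to the client used in get_all_chunks (not run here):
# all_results = fetch_all(
#     lambda skip, top: search_client.search(search_text="*", skip=skip, top=top)
# )
```

Keeping the page fetcher as a parameter separates the paging loop from the Azure call, which makes the loop easy to check against an in-memory list.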

### **Generate synthetic chunk-dependent questions**

In [3]:
import json
import os

import dotenv
from tqdm import tqdm

from utils.genai.generate_synthetic_data import GenerateSyntheticData, generate_synthetic_questions
from utils.genai.filter_synthetic_data import filter_synthetic_questions
from utils.genai.invoker import embed_text

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\povelf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
def get_model_config():
    dotenv.load_dotenv(".env")
    return {
        "azure_endpoint": aoai_endpoint,
        "api_key": aoai_key,
        "azure_deployment": aoai_chat_model,  # e.g. "gpt-4o-mini-sdg-llm"
        "api_version": aoai_api_version
    }

**Retrieve chunks from index**

In [6]:
index_name = "product-customer-vector"

chunks = get_all_chunks(
    index=index_name,
    k=30
)

**Generate synthetic samples**

In [11]:
# TODO: multi-chunk generation is not currently supported
multi_choice = False

# Get the model configuration
model_config = get_model_config()

# Instantiate your generator
sdg_generator = GenerateSyntheticData(model_config, multi=multi_choice)

# Generate synthetic chunk-dependent questions, each repeated N times with varied parameters
synthetic_data, failed_samples = generate_synthetic_questions(chunks, sdg_generator, num_questions_per_chunk=1, multi=multi_choice)

Generating synthetic questions: 100%|██████████| 30/30 [03:52<00:00,  7.76s/it]


In [13]:
synthetic_data[0]

{'synthetic_question': 'What was the total amount spent by David Kim on backpacks in the provided purchase records?',
 'explanation': 'The question is grounded as it directly asks for the total amount spent on backpacks, which can be calculated using the purchase records provided in the chunk. The response accurately calculates the total cost by summing the cost of Adventurer Pro Backpacks and SummitClimber Backpacks, aligning with the information in the chunk.',
 'synthetic_response': 'David Kim spent a total of $840 on backpacks. He purchased two Adventurer Pro Backpacks for $180 each, totaling $360, and two SummitClimber Backpacks for $240 each, totaling $480. The combined total for the backpacks is $840.',
 'chunk_id': '252167c94bb6_aHR0cHM6Ly9zdG9yYWdlcG92ZWwuYmxvYi5jb3JlLndpbmRvd3MubmV0L2NvbnRhaW5lcnBvdmVsL2N1c3RvbWVyXzUubWQ1_pages_0_synthetic_1',
 'is_grounded': True,
 'chunk_data': '## Customer_Info\n\nFirst Name: David \nLast Name: Kim \nAge: 42 \nEmail Address: davidkim@examp

**Filter out low quality samples**

In [14]:
accepted_samples, rejected_samples = filter_synthetic_questions(
    synthetic_data=synthetic_data,
    jaccard_threshold=0.8,
    embedding_threshold=0.8,
    combination_mode="or",
    min_question_length=5,
    max_question_length=150,
    use_embedding=False
)

print(f"\nNumber of filtered out synthetic questions: {len(rejected_samples)}")

Filtering synthetic questions: 100%|██████████| 30/30 [00:00<00:00, 4117.85it/s]


Number of filtered out synthetic questions: 0
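
The internals of `filter_synthetic_questions` are not shown in this notebook. As a rough illustration of what a `jaccard_threshold` of 0.8 could mean, here is a minimal sketch of Jaccard-based near-duplicate removal. All three function names are hypothetical, and the regex tokenizer is a stand-in: the real utility may tokenize differently (the NLTK `punkt` download above suggests it uses NLTK).

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def drop_near_duplicates(questions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a question only if it scores below the threshold against every kept question."""
    kept: List[str] = []
    for question in questions:
        if all(jaccard_similarity(question, other) < threshold for other in kept):
            kept.append(question)
    return kept
```

An embedding-based check would follow the same pattern, with cosine similarity between question embeddings in place of `jaccard_similarity`.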





**Adding embeddings**

In [15]:
# Embed each accepted question, with a single progress bar over all samples
for item in tqdm(accepted_samples, desc="Embedding all questions"):
    item["synthetic_embedding"] = embed_text(item["synthetic_question"])

Embedding all questions: 100%|██████████| 30/30 [00:29<00:00,  1.02it/s]


**Storing synthetic data**

In [None]:
def save_data(json_list, file_path):
    # Create the parent directory if it doesn't exist
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Write mode ("w") overwrites any existing file; use "a" to append instead
    with open(file_path, "w", encoding="utf-8") as f:
        for obj in json_list:
            f.write(json.dumps(obj) + "\n")

In [None]:
save_data(accepted_samples, "data/ft-judge/single/chunk-specific-synthetic-questions.jsonl")
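
To confirm that the file was written as expected, the records can be read back one JSON object per line. This is a minimal sketch; `load_data` is a hypothetical counterpart to `save_data` and is not part of the `utils` package.

```python
import json
from typing import Any, Dict, List


def load_data(file_path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records: List[Dict[str, Any]] = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `len(load_data("data/ft-judge/single/chunk-specific-synthetic-questions.jsonl"))` should match `len(accepted_samples)` after the save step.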