# **Generate Synthetic Data**

In this notebook, we focus on **generating synthetic data** for training and evaluation purposes. This involves retrieving chunks of data, generating synthetic questions from those chunks, and then embedding, filtering, and storing the generated questions.

### Key Steps:
1. **Retrieve Data Chunks:** Retrieve chunks of data from the search index using Azure AI Search (formerly Azure Cognitive Search).
2. **Generate Synthetic Questions:** Use a synthetic data generator to create questions based on the retrieved chunks.
3. **Embed Questions:** Embed the generated questions using a text embedding model.
4. **Filter Questions:** Filter the generated questions using criteria such as Jaccard similarity and embedding similarity.
5. **Store Data:** Save the high-quality synthetic questions to a CSV file for future use.

The result is a filtered set of synthetic questions, stored on disk, that can be used for training and evaluation.

In [1]:
# from azure.ai.projects import AIProjectClient
# from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

from dotenv import load_dotenv
load_dotenv(".env", override=True)

# # FOUNDRY
# project_client = AIProjectClient(
#     credential=DefaultAzureCredential(),
#     endpoint=os.getenv("PROJECT_ENDPOINT")
# )
# foundry_key = os.getenv("FOUNDRY_KEY")

# # Connections:
# search_connection = project_client.connections.get("searchpovel")

# AOAI
chatModel = os.getenv("chatModel")
chatModelMini = os.getenv("chatModelMini")
chatModelMiniFast = os.getenv("chatModelMiniFast")
aoai_embedding_model = os.getenv("embeddingModel")

aoai_version = os.getenv("AOAI_API_VERSION")
aoai_endpoint = os.getenv("AOAI_ENDPOINT")
aoai_key = os.getenv("AOAI_KEY")

# SEARCH
search_key = os.getenv("SEARCH_KEY")
search_credential = AzureKeyCredential(search_key)
search_endpoint = os.getenv("SEARCH_ENDPOINT") #search_connection.target
index_name = os.getenv("SEARCH_INDEX_NAME")


## **1 Retrieve Data Chunks**

In [2]:
from utils.search import get_all_chunks

# Retrieve all chunks from the search index
chunks = get_all_chunks(
    index=index_name,
    credential=search_credential,
    search_endpoint=search_endpoint,
    k=200  # Adjust k as needed
)


## **2 Generate synthetic samples**

**Generation Strategy:**
- For each chunk, find the k=5 most similar chunks using cosine similarity (k-NN)
- Generate questions focused on the main chunk's topic
- A question can either be grounded, meaning that the correct answer can be found in the provided chunk(s), or not grounded, meaning that it cannot be answered by the provided chunk(s)
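
The neighbour lookup described above can be sketched in plain Python. This is only an illustration of k-NN by cosine similarity, not the implementation inside `generate_synthetic_questions`; the helper names `cosine_similarity` and `top_k_similar` and the toy 2-D embeddings are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(main_index, embeddings, k=5):
    """Return the indices of the k chunks most similar to embeddings[main_index]."""
    main = embeddings[main_index]
    scored = [
        (cosine_similarity(main, emb), i)
        for i, emb in enumerate(embeddings)
        if i != main_index  # Never return the main chunk itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D embeddings (hypothetical data, for illustration only)
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
neighbours = top_k_similar(0, toy_embeddings, k=2)
print(neighbours)
```

For a real index with many chunks, a vectorized or approximate nearest-neighbour search would likely be preferable to this quadratic loop.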

### **Generation Settings**

The synthetic questions are generated with various characteristics to ensure diverse, realistic test data. 

**Configuration Files:** All settings are stored in `configs/settings/` and can be customized:

- **`domains.jsonl`** - Question domains (e.g., "Related to Product", "Related to Customer")
- **`difficulties.jsonl`** - Complexity levels (Beginner, Intermediate, Advanced, Expert)
- **`tones.jsonl`** - Question styles (Neutral, Friendly, Angry)
- **`languages.jsonl`** - Language options (English, Swedish)
- **`length_categories.jsonl`** - Question lengths (Short 5-7 words, Medium 8-15 words, Long 16-25 words, Very Long 25+ words)
- **`topics.jsonl`** - Topic selection strategy

Each option has a **weight** that controls how often it's selected during generation. Options with higher weights are selected more often.
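
As a rough illustration of how weights might drive selection, the sketch below draws options with `random.choices`. The `name` and `weight` field names and the `pick_option` helper are assumptions for this example; the actual schema of the `.jsonl` files and the sampling logic in the generator may differ. The weights reuse the difficulty values from this notebook's settings output.

```python
import random
from collections import Counter

# Hypothetical in-memory copy of a settings file (field names are assumed)
difficulties = [
    {"name": "Beginner", "weight": 0.25},
    {"name": "Intermediate", "weight": 0.30},
    {"name": "Advanced", "weight": 0.25},
    {"name": "Expert", "weight": 0.20},
]


def pick_option(options, rng=random):
    """Pick one option name, with probability proportional to its weight."""
    names = [option["name"] for option in options]
    weights = [option["weight"] for option in options]
    return rng.choices(names, weights=weights, k=1)[0]


# Seeded generator so the draw is reproducible
rng = random.Random(0)
sample = [pick_option(difficulties, rng) for _ in range(1000)]
print(Counter(sample))
```

Weights do not need to sum to 1, since `random.choices` treats them as relative weights.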

**Run the cell below to see the current configuration:**

In [4]:
from utils.helper import load_and_display_settings
load_and_display_settings()

 SYNTHETIC QUESTION GENERATION SETTINGS

📋 Domains:
----------------------------------------------------------------------
  • Related to Product             Weight: 1.00  ████████████████████
  • Related to Customer            Weight: 1.00  ████████████████████

📋 Difficulties:
----------------------------------------------------------------------
  • Beginner                       Weight: 0.25  █████
  • Intermediate                   Weight: 0.30  ██████
  • Advanced                       Weight: 0.25  █████
  • Expert                         Weight: 0.20  ████

📋 Tones:
----------------------------------------------------------------------
  • Neutral                        Weight: 1.00  ████████████████████
  • Friendly                       Weight: 1.00  ████████████████████
  • Angry                          Weight: 0.50  ██████████

📋 Languages:
----------------------------------------------------------------------
  • English                        Weight: 1.00  ██████████████

**Generate data**

In [None]:
from utils.generate_data import SyntheticDataGenerator, generate_synthetic_questions

# Instantiate your generator
sdg_generator = SyntheticDataGenerator(chatModel)

# Generate synthetic questions using k-NN for similar chunks
synthetic_data, failed_samples = generate_synthetic_questions(
    chunks,
    sdg_generator
)

In [None]:
synthetic_data[0]

In [None]:
print(f"number of successful samples: {len(synthetic_data)}")
print(f"number of failed samples: {len(failed_samples)}")

## **3 Embed synthetic questions**

In [None]:
from utils.llm import embed_text

def embed_single_item(item):
    """Helper function to embed a single item"""
    item["synthetic_question_embedding"] = embed_text(item["synthetic_question"])
    return item

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# Count total questions
total_questions = len(synthetic_data)

# Parallel embedding with progress bar
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(embed_single_item, item) for item in synthetic_data]

    with tqdm(total=total_questions, desc="Embedding all questions") as pbar:
        for future in as_completed(futures):
            future.result()  # The item is updated in place; this re-raises any embedding error
            pbar.update(1)

## **4 Filter out low-quality samples**
- Remove questions that are duplicates, too short, or too long
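
To make the idea concrete, here is a minimal sketch of a length filter combined with near-duplicate detection based on Jaccard similarity over word sets. It is not the code in `utils.clean_data`; the helper names are hypothetical, and measuring length in characters is an assumption, since the unit used by `filter_synthetic_questions` is not stated here.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # Two empty strings are considered identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_questions(questions, min_len=5, max_len=150, threshold=0.95):
    """Split questions into (accepted, rejected) lists.

    Length is measured in characters here, which is an assumption.
    """
    accepted, rejected = [], []
    for question in questions:
        if not (min_len <= len(question) <= max_len):
            rejected.append(question)  # Too short or too long
        elif any(jaccard_similarity(question, kept) >= threshold for kept in accepted):
            rejected.append(question)  # Near-duplicate of an accepted question
        else:
            accepted.append(question)
    return accepted, rejected


# Hypothetical questions, for illustration only
questions = [
    "How do I reset my password?",
    "how do I reset my password?",
    "Hi?",
    "What is the refund policy?",
]
accepted, rejected = filter_questions(questions)
print(accepted)
print(rejected)
```

An embedding-based check could follow the same pattern, comparing the cosine similarity of question embeddings against the threshold instead of word overlap.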

In [None]:
from utils.clean_data import filter_synthetic_questions

accepted_samples, rejected_samples = filter_synthetic_questions(
    synthetic_data=synthetic_data,
    min_question_length=5,
    max_question_length=150,
    similarity_threshold=0.95, # Adjust as needed
    remove_duplicates=True # Set to False to skip duplicate detection
)

print("\n✅ Processing complete!")
print(f"Accepted samples: {len(accepted_samples)}")
print(f"Rejected samples: {len(rejected_samples)}")

## **5 Storing synthetic data**

In [None]:
import json
import pandas as pd

def save_data(json_list, file_path):
    """Save a list of sample dicts to a CSV file at file_path."""
    # Create the parent directory if file_path has one and it doesn't exist
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Serialize list fields to JSON strings for CSV storage (on copies, so the input is not mutated)
    rows = []
    for item in json_list:
        row = dict(item)
        if isinstance(row.get('similar_chunks'), list):
            row['similar_chunks'] = json.dumps(row['similar_chunks'])
        rows.append(row)

    df = pd.json_normalize(rows)
    df.to_csv(file_path, index=False)

In [None]:
# save_data(accepted_samples, 'data/synthetic_samples.csv')
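
Because `save_data` stores `similar_chunks` as a JSON string, that column needs to be decoded again when the file is read back. The sketch below shows the round trip with the standard-library `csv` module and an in-memory buffer, so it runs without extra dependencies; the sample row is hypothetical, and reading the real file with `pd.read_csv` followed by `json.loads` on the column should work along the same lines.

```python
import csv
import io
import json

# Hypothetical sample row with the list field already serialized to a JSON string
row = {
    "synthetic_question": "How do I reset my password?",
    "similar_chunks": json.dumps(["chunk-1", "chunk-2"]),
}

# Write the row to an in-memory CSV buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back and decode the JSON string into a Python list again
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
for item in loaded:
    item["similar_chunks"] = json.loads(item["similar_chunks"])

print(loaded[0]["similar_chunks"])
```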