# Generate Dataset from Azure Search Index using Simulator

## Overview

This notebook demonstrates how to generate a synthetic dataset of queries and responses using your Azure Search index with the Simulator tool. The generated dataset can be useful for:

- Testing and evaluating RAG workflows
- Fine-tuning prompts
- Benchmarking search capabilities
- Creating synthetic training data

**Prerequisites:**

- Azure OpenAI Service access
- Azure AI Search service with an indexed dataset

## 1. Setup

### Configure environment variables

The following environment variables must be set before proceeding:

In [1]:
import os

from dotenv import load_dotenv
load_dotenv()

# Check for required environment variables and report every missing one at once
required_vars = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_SEARCH_INDEX_NAME",
]
missing = [name for name in required_vars if not os.environ.get(name)]
assert not missing, f"Please set these environment variables: {', '.join(missing)}"

In [2]:
# Set up search variables for later use
search_endpoint = os.environ.get("AZURE_SEARCH_ENDPOINT")
index_name = os.environ.get("AZURE_SEARCH_INDEX_NAME")
api_key = os.environ.get("AZURE_SEARCH_API_KEY")

## 2. Initialize the Simulator

### 2.1 Configure Azure OpenAI model

In [4]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Configure the model for the simulator
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)
# Mask the API key so the secret is not written to the notebook output
print({**model_config, "api_key": "***"})

{'azure_endpoint': 'https://aoai-51324400.openai.azure.com/', 'azure_deployment': 'gpt-4o-mini', 'api_key': '***', 'api_version': '2025-01-01-preview'}


### 2.2 Create simulator instance

In [5]:
from azure.ai.evaluation.simulator import Simulator

simulator = Simulator(model_config=model_config)

Class Simulator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


## 3. Connect to Search Index

### 3.1 Create a function to retrieve data from the search index

In [6]:
import requests

def generate_text_from_index(search_term: str) -> str:
    """Query the search index and return up to 500 characters of concatenated content."""
    # Create the search request
    url = f"{search_endpoint}/indexes/{index_name}/docs/search?api-version=2023-11-01"
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }
    search_query = {"search": search_term, "top": 10}

    # Send the request (the timeout keeps a stalled call from hanging the notebook)
    response = requests.post(url=url, headers=headers, json=search_query, timeout=30)

    # Check for errors: raise on a 4xx/5xx status instead of silently returning an empty string
    response.raise_for_status()

    # Extract text from response
    text = ""
    for result in response.json()["value"]:
        # Change this field based on your index schema
        if "content" in result:
            text += result["content"] + " "

    # Limit text length to prevent token limit issues
    return text[:500]
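
The `content` field and the 500-character cap in the function above are specific to this example. As a minimal sketch, the extraction step could be separated from the HTTP call so that it can be adapted to another schema and tried without a live search service. The helper name `extract_text` and its parameters are illustrative assumptions, not part of any Azure SDK, and the sample payload is hand-written.

```python
from typing import Any, Dict


def extract_text(results: Dict[str, Any], field: str = "content", max_chars: int = 500) -> str:
    """Join one text field from each hit in a search response payload."""
    parts = []
    for hit in results.get("value", []):
        value = hit.get(field)
        # Skip hits that lack the field or hold a non-string value
        if isinstance(value, str) and value:
            parts.append(value)
    return " ".join(parts)[:max_chars]


# Hand-written payload in the shape the function above reads: a "value" list of documents
sample = {
    "value": [
        {"content": "Leather hiking boots."},
        {"title": "no content field"},
        {"content": "Waterproof tent."},
    ]
}
print(extract_text(sample))
```

Passing a different `field` (for example a hypothetical `"description"` column) is then a one-argument change rather than an edit to the request code.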

### 3.2 Test the search functionality

In [7]:
# Choose a search term relevant to your data
search_term = "Hiking Boots"
text = generate_text_from_index(search_term)
print(f"Generated text length: {len(text)} characters")
print("\nSample of retrieved text:")
print(text[:300] + "...")

Generated text length: 500 characters

Sample of retrieved text:
Introducing the TrekReady Hiking Boots - stepping up your hiking game, one footprint at a time! Crafted from leather, these stylistic Trailmates are made to last. TrekReady infuses durability with its reinforced stitching and toe protection, making sure your journey is never stopped short. Comfort? ...


## 4. Create Application Callback

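Define the callback function that the simulator calls for each generated query. The callback retrieves text from your index for the query, asks the Azure OpenAI deployment to answer it, and returns the updated message list.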

In [10]:
from typing import Any, Dict, Optional
from openai import AzureOpenAI

async def callback(
    messages: Dict,
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    # Get the latest message
    messages_list = messages["messages"]
    latest_message = messages_list[-1]
    query = latest_message["content"]

    deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
    endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
    
    # Initialize Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
        api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    )

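    # Retrieve grounding text from the index (this overwrites the incoming `context` argument)
    context = generate_text_from_index(query)

    # Call the OpenAI API: the retrieved text goes in as a system message, the query as the user message
    completion = client.chat.completions.create(
        model=deployment,
        messages=[
            {
                "role": "system",
                "content": context,
            },
            {
                "role": "user",
                "content": query,
            }
        ],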
        max_tokens=800,
        temperature=0.7,
    )
    
    # Extract and return the response
    response = completion.choices[0].message.content
    
    # Format the response
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": context,
    }
    
    # Add the response to messages
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
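
The input and output shapes of the callback can be checked without calling Azure at all. The following is a minimal sketch of a stand-in with the same signature and return keys as the callback above; `stub_callback` and the sample turn are hypothetical, and the message shape is inferred from how the callback above indexes `messages["messages"]`.

```python
import asyncio
from typing import Any, Dict, Optional


async def stub_callback(
    messages: Dict[str, Any],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Echo-style stand-in for the real callback; makes no network calls."""
    query = messages["messages"][-1]["content"]
    retrieved = f"(stub context for: {query})"
    # Append the assistant turn in the same format the real callback uses
    messages["messages"].append(
        {"role": "assistant", "content": f"Stub answer to: {query}", "context": retrieved}
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": retrieved,
    }


# Hypothetical single-turn input
turn = {"messages": [{"role": "user", "content": "What are the boots made from?"}]}
result = asyncio.run(stub_callback(turn))
print(result["messages"][-1])
```

Inside a notebook cell, where an event loop is already running, use `result = await stub_callback(turn)` instead of `asyncio.run(...)`.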

## 5. Generate Dataset

### 5.1 Run the simulator

- The simulator uses the text retrieved from the index to generate queries.
- The simulator then sends each query to the `target` callback to get a response.

In [11]:
from pathlib import Path

# Run the simulator
outputs = await simulator(
    target=callback,
    text=text,
    num_queries=4,         # Number of query-response pairs to generate
    max_conversation_turns=1,  # Number of conversation turns
)

Generating: 100%|████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.91s/message]


### 5.2 Save the generated dataset

In [12]:
# Save the outputs to a file ("w" overwrites, so re-running the cell does not duplicate records)
output_file = Path("05-simulate-datasets.results.json")
with output_file.open("w", encoding="utf-8") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
    
print(f"Dataset saved to {output_file.absolute()}")

Dataset saved to /workspaces/BUILD25-LAB334/labs/05-simulate-datasets.results.json


### 5.3 Review the generated dataset

In [13]:
import pandas as pd

pd.read_json(output_file, lines=True).head(5)

| | query | response | context |
|---|---|---|---|
| 0 | What are the TrekReady Hiking Boots made from? | The TrekReady Hiking Boots are crafted from le... | Introducing the TrekReady Hiking Boots - stepp... |
| 1 | What feature ensures the TrekReady Hiking Boot... | The durability of the TrekReady Hiking Boots i... | Introducing the TrekReady Hiking Boots - stepp... |
| 2 | What type of materials do the TrekReady Hiking... | The TrekReady Hiking Boots use breathable mate... | Introducing the TrekReady Hiking Boots - stepp... |
| 3 | What design aspect contributes to the lightwei... | The lightweight nature of the TrekReady Hiking... | Introducing the TrekReady Hiking Boots - stepp... |
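
Before using the file downstream, it can help to confirm that every record carries the fields shown in the output above. The following is a minimal sketch that uses only the standard library; the helper names `load_jsonl` and `find_incomplete` are illustrative, and the demo rows are made up so the block runs without the generated file.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

REQUIRED_KEYS = ("query", "response")  # fields every record is expected to carry


def load_jsonl(path: Path) -> List[Dict]:
    """Parse a JSON Lines file, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_incomplete(records: List[Dict]) -> List[int]:
    """Return indices of records with a missing or empty required field."""
    return [i for i, r in enumerate(records) if any(not r.get(k) for k in REQUIRED_KEYS)]


# Demo on a throwaway file with made-up rows; point load_jsonl at your own output file instead
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        '{"query": "q1", "response": "r1", "context": "c1"}\n'
        '{"query": "q2", "response": ""}\n',
        encoding="utf-8",
    )
    records = load_jsonl(demo)
    incomplete = find_incomplete(records)

print(f"{len(records)} records, incomplete rows: {incomplete}")
```

In the notebook, `load_jsonl(output_file)` would read the saved dataset, and a non-empty result from `find_incomplete` points at rows worth regenerating.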


## 6. Next Steps

Now that you have generated a dataset, you can:

1. Use it to evaluate retrieval quality
2. Fine-tune prompts based on common query patterns
3. Create test cases for your application
4. Analyze the dataset to identify improvement opportunities

To customize this notebook for your needs:

- Modify the `search_term` to target specific content in your index
- Pass a `tasks` list to the `simulator` call to reflect your domain-specific use cases (the call in this notebook does not set one)
- Adjust the field names in `generate_text_from_index()` to match your index schema
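
The simulator call in section 5 passes only `text`, `num_queries`, and `max_conversation_turns`. The following is a minimal sketch of how a task list might be supplied alongside them, assuming the `Simulator` call accepts a `tasks` list of strings; check the azure-ai-evaluation reference for your installed version. The helper `build_simulator_kwargs` and the task descriptions are hypothetical.

```python
from typing import Any, Dict, List


def build_simulator_kwargs(
    text: str, tasks: List[str], max_conversation_turns: int = 1
) -> Dict[str, Any]:
    """Assemble keyword arguments for the simulator call, one query per task."""
    if not tasks:
        raise ValueError("tasks must contain at least one task description")
    return {
        "text": text,
        "tasks": tasks,
        "num_queries": len(tasks),  # keep the two in step so each task is used once
        "max_conversation_turns": max_conversation_turns,
    }


# Made-up task descriptions for an outdoor-gear catalog; replace with your own
example_tasks = [
    "I am a customer comparing hiking boots before a purchase.",
    "I am a support agent looking up care instructions.",
]
kwargs = build_simulator_kwargs(text="Sample product text", tasks=example_tasks)
print(kwargs["num_queries"])

# In the notebook, with `simulator` and `callback` defined as above:
# outputs = await simulator(target=callback, **kwargs)
```

Deriving `num_queries` from the length of the task list is a design choice in this sketch, made so that the number of generated queries and the number of tasks cannot drift apart.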