# Generate Dataset from Azure Search Index using Simulator

---

## Overview

This notebook demonstrates how to generate a synthetic dataset of queries and responses from your Azure AI Search index using the `Simulator` class in the `azure.ai.evaluation` package. The generated dataset can be useful for:

- Testing and evaluating RAG workflows
- Fine-tuning prompts
- Benchmarking search capabilities
- Creating synthetic training data


## Prerequisites

1. An Azure OpenAI model deployment (chat completion)
2. An Azure AI Search index (this notebook uses one named `contoso-products`)

---

## 1. Setup Environment

In [11]:
# ------- 1. Check that required environment variables are defined
import os

from dotenv import load_dotenv
load_dotenv()

assert os.environ.get("AZURE_OPENAI_API_KEY") is not None, "Please set the AZURE_OPENAI_API_KEY environment variable"
assert os.environ.get("AZURE_OPENAI_ENDPOINT") is not None, "Please set the AZURE_OPENAI_ENDPOINT environment variable"
assert os.environ.get("AZURE_OPENAI_API_VERSION") is not None, "Please set the AZURE_OPENAI_API_VERSION environment variable"
assert os.environ.get("AZURE_OPENAI_DEPLOYMENT") is not None, "Please set the AZURE_OPENAI_DEPLOYMENT environment variable"
assert os.environ.get("AZURE_SEARCH_ENDPOINT") is not None, "Please set the AZURE_SEARCH_ENDPOINT environment variable"
assert os.environ.get("AZURE_SEARCH_API_KEY") is not None, "Please set the AZURE_SEARCH_API_KEY environment variable"
assert os.environ.get("AZURE_SEARCH_INDEX_NAME") is not None, "Please set the AZURE_SEARCH_INDEX_NAME environment variable"
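The assertions above stop at the first missing variable, so several missing settings take several runs to discover. A minimal sketch of an alternative that reports every unset or empty variable at once is shown below; the helper name `missing_env_vars` is hypothetical and not part of the notebook, and the example uses a hand-written mapping so it runs without real credentials.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


# Hand-written mapping standing in for os.environ, so no real secrets are needed
example_env = {"AZURE_OPENAI_API_KEY": "placeholder", "AZURE_SEARCH_ENDPOINT": ""}
required = ["AZURE_OPENAI_API_KEY", "AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_INDEX_NAME"]

missing = missing_env_vars(required, example_env)
print(f"Missing or empty: {missing}")
```

In the notebook itself, the same helper could be called with the seven variable names above and the default `os.environ`, followed by a single `assert not missing, ...`.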

In [12]:
# ------- 2. Initialize the required variables to work with the Azure AI Search service
search_endpoint = os.environ.get("AZURE_SEARCH_ENDPOINT")
index_name = os.environ.get("AZURE_SEARCH_INDEX_NAME")
api_key = os.environ.get("AZURE_SEARCH_API_KEY")

---

## 2. Initialize the Simulator

### 2.1 Create a Model Configuration

In [13]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)
# Print the configuration with the API key masked so the secret never lands in the notebook output
print({**model_config, "api_key": "***"})

{'azure_endpoint': 'https://aoai-51373678.openai.azure.com/', 'azure_deployment': 'gpt-4o-mini', 'api_key': '***', 'api_version': '2025-01-01-preview'}


### 2.2 Instantiate the Simulator with the model configuration

In [14]:
from azure.ai.evaluation.simulator import Simulator

simulator = Simulator(model_config=model_config)

---

## 3. Connect to the Search Index

### 3.1 Define a function to retrieve search results for a query

In [15]:
import requests

def generate_text_from_index(search_term: str) -> str:
    """Return up to 500 characters of text from the top search hits for `search_term`."""

    # Create the search request (strip any trailing slash to avoid a double slash in the URL)
    url = f"{search_endpoint.rstrip('/')}/indexes/{index_name}/docs/search?api-version=2023-11-01"
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }
    search_query = {"search": search_term, "top": 10}

    # Send the request; the timeout keeps the cell from hanging on a stalled connection
    response = requests.post(url=url, headers=headers, json=search_query, timeout=30)

    # Check for errors: raise on any non-2xx status instead of silently returning ""
    response.raise_for_status()

    # Extract text from response
    parts = []
    for result in response.json().get("value", []):
        # Change this field based on your index schema
        if result.get("content"):
            parts.append(result["content"])

    # Limit text length to prevent token limit issues
    return " ".join(parts)[:500]
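If your index stores its text under a field other than `content`, the extraction step is the part that needs to change. The sketch below isolates that step so it can be tried offline; `extract_text` is a hypothetical helper, and `sample_payload` is a hand-written stand-in that only mimics the `value` list the function above reads from the search response.

```python
from typing import Any, Dict, Sequence


def extract_text(
    payload: Dict[str, Any],
    fields: Sequence[str] = ("content",),
    max_chars: int = 500,
) -> str:
    """Join the first non-empty candidate field of each hit, then truncate."""
    parts = []
    for hit in payload.get("value", []):
        for field in fields:
            value = hit.get(field)
            if isinstance(value, str) and value.strip():
                parts.append(value.strip())
                break  # use only the first matching field per hit
    return " ".join(parts)[:max_chars]


# Hand-written payload for illustration; not real output from the index
sample_payload = {
    "value": [
        {"id": "1", "content": "A folding table made of aluminum."},
        {"id": "2", "description": "A four-person tent."},
        {"id": "3", "title": "No text fields here"},
    ]
}

combined = extract_text(sample_payload, fields=("content", "description"))
print(combined)
```

Hits that contain none of the listed fields are skipped, which mirrors how the original function ignores results without `content`.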

### 3.2 Test the function with a query

In [16]:
# Choose a search term relevant to your data
search_term = "Dining Table"
text = generate_text_from_index(search_term)
print(f"Generated text length: {len(text)} characters")
print("\nSample of retrieved text:")
print(text[:300] + "...")

Generated text length: 500 characters

Sample of retrieved text:
CampBuddy's BaseCamp Folding Table is an adventurer's best friend. Lightweight yet powerful, the table is a testament to fun-meets-function and will elevate any outing to new heights. Crafted from resilient, rust-resistant aluminum, the table boasts a generously sized 48 x 24 inches tabletop, perfec...


---

## 4. Create Application Callback

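Define the callback function that the simulator calls with each generated query. The callback retrieves matching text from your index, asks the model to answer the query, and returns the response together with the retrieved context.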

In [17]:
from typing import Any, Dict, Optional
from openai import AzureOpenAI

async def callback(
    messages: Dict,
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    # Get the latest message
    messages_list = messages["messages"]
    latest_message = messages_list[-1]
    query = latest_message["content"]

    deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
    endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
    
    # Initialize Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
        api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    )

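    # Retrieve grounding text from the index for this query.
    # Note: this overwrites the incoming `context` argument, which this callback does not otherwise use.
    context = generate_text_from_index(query)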
    
    # Call the OpenAI API
    completion = client.chat.completions.create(
        model=deployment,
        messages=[
            {
                "role": "user",
                "content": context,
            },
            {
                "role": "user",
                "content": query,
            }
        ],
        max_tokens=800,
        temperature=0.7,
    )
    
    # Extract and return the response
    response = completion.choices[0].message.content
    
    # Format the response
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": context,
    }
    
    # Add the response to messages
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
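The callback above sends the retrieved text and the query as two separate `user` messages, so the model has no explicit instruction to ground its answer in that text. One possible variation, sketched below, places the retrieved text in a `system` message with a short grounding instruction. The helper name `build_grounded_messages` and the prompt wording are assumptions for illustration, not part of the notebook.

```python
from typing import Dict, List


def build_grounded_messages(context: str, query: str) -> List[Dict[str, str]]:
    """Build a chat message list that asks the model to answer from `context` only."""
    system_prompt = (
        "Answer the user's question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]


# Illustrative inputs; in the callback these would be the retrieved text and the latest query
example_messages = build_grounded_messages(
    context="The table top measures 48 x 24 inches.",
    query="How big is the table top?",
)
for message in example_messages:
    print(message["role"], "->", message["content"][:60])
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create(...)` in place of the two `user` messages.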

---

## 5. Generate & Save Dataset

### 5.1 Run the simulator
- The simulator uses the text retrieved from the index to generate queries
- The simulator then sends each query to the `target` callback to get a response

In [18]:
from pathlib import Path

# Run the simulator
outputs = await simulator(
    target=callback,
    text=text,
    num_queries=4,         # Number of query-response pairs to generate
    max_conversation_turns=1,  # Number of conversation turns
)

Generating: 100%|████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.90s/message]


### 5.2 Save the generated dataset

In [19]:
# Save the outputs to a file
output_file = Path("00-simulate-datasets.results.jsonl")
# Open in write mode so that re-running the cell does not append duplicate rows
with output_file.open("w", encoding="utf-8") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
    
print(f"Dataset saved to {output_file.absolute()}")

Dataset saved to /workspaces/BUILD25-LAB334/labs/00-simulate-datasets.results.jsonl
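Before using the file downstream, it can help to confirm that every line parses as JSON and carries the expected fields. The sketch below is one way to do that with the standard library only; `validate_jsonl` is a hypothetical helper, and the `query` and `response` keys are assumed from the columns shown when the dataset is loaded with pandas. The demo writes a small temporary file so it runs without the real dataset.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Sequence


def validate_jsonl(path: Path, required_keys: Sequence[str] = ("query", "response")) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(row, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            missing = [key for key in required_keys if not row.get(key)]
            if missing:
                problems.append(f"line {line_no}: missing {', '.join(missing)}")
    return problems


# Demo on a temporary file with one good row, one incomplete row, and one broken line
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        json.dumps({"query": "q1", "response": "r1", "context": "c1"}) + "\n"
        + json.dumps({"query": "q2", "response": ""}) + "\n"
        + "not json\n",
        encoding="utf-8",
    )
    problems = validate_jsonl(sample)

for problem in problems:
    print(problem)
```

In the notebook, calling `validate_jsonl(output_file)` after the save step would apply the same check to the generated dataset.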


### 5.3 Review the generated dataset

In [20]:
import pandas as pd

pd.read_json(output_file, lines=True).head(5)

|   | query | response | context |
|---|---|---|---|
| 0 | What is the size of CampBuddy's BaseCamp Foldi... | The size of CampBuddy's BaseCamp Folding Table... | CampBuddy's BaseCamp Folding Table is an adven... |
| 1 | What material is CampBuddy's BaseCamp Folding ... | CampBuddy's BaseCamp Folding Table is made fro... | CampBuddy's BaseCamp Folding Table is an adven... |
| 2 | What feature of the CampBuddy's BaseCamp Foldi... | The adjustable legs of the CampBuddy's BaseCam... | CampBuddy's BaseCamp Folding Table is an adven... |
| 3 | What is the primary purpose of CampBuddy's Bas... | The primary purpose of CampBuddy's BaseCamp Fo... | CampBuddy's BaseCamp Folding Table is an adven... |


### 5.4 Review the saved dataset file

- Open the `00-simulate-datasets.results.jsonl` file in your Visual Studio Code editor
- You should see one JSON object per line, each with `query`, `response`, and `context` fields

---

## 6. Next Steps

Now that you have generated a dataset, you can:

1. Use it to evaluate retrieval quality
2. Fine-tune prompts based on common query patterns
3. Create test cases for your application
4. Analyze the dataset to identify improvement opportunities
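As a starting point for the first and fourth items, the sketch below computes a crude lexical overlap between a response and its retrieved context. This is only a rough proxy under simple assumptions (lowercased alphanumeric tokens, no stemming or semantics) and is not one of the evaluators in `azure.ai.evaluation`; the function name `context_overlap` is hypothetical.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_overlap(response: str, context: str) -> float:
    """Fraction of distinct response tokens that also appear in the context (0.0 to 1.0)."""
    response_tokens = _tokens(response)
    if not response_tokens:
        return 0.0
    return len(response_tokens & _tokens(context)) / len(response_tokens)


# Illustrative strings; in practice these would come from rows of the generated dataset
score = context_overlap(
    response="The table is aluminum",
    context="The table is made of aluminum.",
)
print(f"Overlap: {score:.2f}")
```

Rows with a low score may be worth inspecting by hand, since the response uses words that the retrieved context does not contain.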

---

## 7. Homework: Try It Out

To customize this notebook for your needs:

- Recreate the Azure AI Search index with your data and index name
- Modify the `search_term` to target specific content in your index
- Pass a `tasks` list to the simulator call to reflect your domain-specific use cases (this notebook runs the simulator without one)
- Adjust the field names in `generate_text_from_index()` to match your index schema