# Synthetic Dataset Generation for RAG Systems

This notebook demonstrates how to create synthetic datasets for Retrieval-Augmented Generation (RAG) systems.
The process involves:
1. Document loading and chunking
2. Embedding generation and similarity-based context selection
3. Query generation from contexts
4. Query evolution through multiple transformation steps
5. Expected output generation
6. Packaging the result as a structured synthetic data entry
7. Larger-scale synthetic dataset creation using DeepEval




## 1. Import Packages and Initialize Models

Setting up all required dependencies and initializing the core models:
- CustomLLModel: Wrapper for language model operations
- CustomEmbeddingModel: Handles text embeddings
- DeepEvalSynthesizer: Advanced synthetic data generation



In [1]:
import logging
import sys
from datetime import datetime
from pydantic import BaseModel
from typing import Optional, List
import random
import numpy as np
from langchain.document_loaders import TextLoader
from langchain.text_splitter import TokenTextSplitter
from custom_models import CustomLLModel, CustomEmbeddingModel
from synthetic_dataset_generate import DeepEvalSynthesizer
from utils.llm_con import get_chat_openai

# Configure logging to a UTF-8 log file and to the notebook console.
# Note: logging.FileHandler subclasses logging.StreamHandler, so calling
# setStream(sys.stdout) on every StreamHandler would redirect the file
# handler too. Registering an explicit console handler avoids that.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('synthetic_dataset_generation.log', encoding='utf-8'),
        logging.StreamHandler(sys.stdout),
    ]
)
logger = logging.getLogger(__name__)

print("="*60)
print("SYNTHETIC DATASET GENERATION PIPELINE")
print("="*60)
print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print()

# Initialize models
logger.info("Initializing language models and embedding models...")
try:
    llm = get_chat_openai()
    # Create the embedding model once and share it with the synthesizer
    embeddings = CustomEmbeddingModel()
    deep_e = DeepEvalSynthesizer(CustomLLModel(llm), embeddings)
    logger.info("Models initialized successfully")
    print("✅ All models loaded and ready")
except Exception as e:
    logger.error(f"Failed to initialize models: {str(e)}")
    raise


SYNTHETIC DATASET GENERATION PIPELINE
Started at: 2025-08-03 10:12:28

2025-08-03 10:12:28,694 - INFO - Initializing language models and embedding models...
2025-08-03 10:12:29,012 - INFO - Models initialized successfully
✅ All models loaded and ready


## 2. Document Loading and Text Chunking

Loading documents and splitting them into manageable chunks:
- TokenTextSplitter ensures consistent chunk sizes based on token count
- chunk_size=1000: Each chunk contains roughly 1000 tokens
- chunk_overlap=0: No overlap between chunks to avoid redundancy

Note: A .docx path is also defined, but only the .txt file is loaded in the current flow.
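
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace as a stand-in tokenizer, which is an assumption for illustration only: `TokenTextSplitter` counts model tokens, so its chunk boundaries will differ. The function name `chunk_tokens` is hypothetical.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace tokens.

    Whitespace splitting is a stand-in for a real tokenizer (assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven"
print(chunk_tokens(sample, chunk_size=3))                   # no overlap
print(chunk_tokens(sample, chunk_size=3, chunk_overlap=1))  # windows share one token
```

A document shorter than `chunk_size` produces a single chunk, which is what happens with the short recipe file used in this run.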


In [2]:
logger.info("Starting document loading and chunking process...")

# Document paths
docx_path = '../../Datasets/docx_example.docx'  # Available but not used in current flow
txt_path = "../../Datasets/txt_example.txt"

# Configure text splitter
chunk_size = 1000
chunk_overlap = 0
text_splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

logger.info(f"Text splitter configured - chunk_size: {chunk_size}, overlap: {chunk_overlap}")

# Load and split document
try:
    loader = TextLoader(txt_path)
    raw_chunks = loader.load_and_split(text_splitter)
    
    logger.info(f"Document loaded successfully from: {txt_path}")
    logger.info(f"Generated {len(raw_chunks)} chunks")
    
    print(f"Document Processing Summary:")
    print(f"- Source file: {txt_path}")
    print(f"- Total chunks created: {len(raw_chunks)}")
    print(f"- Chunk size: {chunk_size} tokens")
    print(f"- Chunk overlap: {chunk_overlap} tokens")
    
except Exception as e:
    logger.error(f"Failed to load document: {str(e)}")
    raise


2025-08-03 10:12:38,283 - INFO - Starting document loading and chunking process...
2025-08-03 10:12:38,711 - INFO - Text splitter configured - chunk_size: 1000, overlap: 0
2025-08-03 10:12:38,714 - INFO - Document loaded successfully from: ../../Datasets/txt_example.txt
2025-08-03 10:12:38,714 - INFO - Generated 1 chunks
Document Processing Summary:
- Source file: ../../Datasets/txt_example.txt
- Total chunks created: 1
- Chunk size: 1000 tokens
- Chunk overlap: 0 tokens


In [3]:
# Display chunk information for verification
print("\n" + "="*50)
print("CHUNK PREVIEW")
print("="*50)

for i, chunk in enumerate(raw_chunks[:3]):  # Show first 3 chunks
    print(f"\nChunk {i+1}:")
    print(f"Content length: {len(chunk.page_content)} characters")
    print(f"Preview: {chunk.page_content[:200]}...")
    if i < len(raw_chunks) - 1:
        print("-" * 30)

if len(raw_chunks) > 3:
    print(f"\n... and {len(raw_chunks) - 3} more chunks")

raw_chunks



CHUNK PREVIEW

Chunk 1:
Content length: 1292 characters
Preview: Apple Turnovers

2 prepared 15 oz. pie crusts
3 cups thinly sliced apples with peel
1/2 cup brown sugar
1 tsp. cinnamon
2 tsp. fresh lemon juice
2 Tbsp. flour
2 Tbsp. sugar
1/2 tsp. salt
1 tsp. vanill...


[Document(metadata={'source': '../../Datasets/txt_example.txt'}, page_content='Apple Turnovers\n\n2 prepared 15 oz. pie crusts\n3 cups thinly sliced apples with peel\n1/2 cup brown sugar\n1 tsp. cinnamon\n2 tsp. fresh lemon juice\n2 Tbsp. flour\n2 Tbsp. sugar\n1/2 tsp. salt\n1 tsp. vanilla\n2 Tbsp. Butter\n\nLet pie crust stand at room temperature while preparing the other\ningredients. Combine apples, brown sugar, cinnamon and lemon \njuice in pan. Add 2 Tbsp. water to allow easy mixing.  Cook\nover medium heat until mixture bubbles.  Cover and continue cooking\nover low heat for 10 minutes stirring occasionally.\nGradually add flour, sugar and salt to mixture and cook until the \nmixture begins to thicken.  Add in vanilla and butter and remove \nmixture from heat.  Spread out pie crusts on ungreased cookie sheet.\nSpread apple mixture evenly on half of each crust.  Fold over\nother side of crust and press edges with a little warm water to\nseal.  Cut small slits in top of crust and b

## 3. Embedding Generation and Context Selection

Creating embeddings for all chunks and selecting relevant contexts:

Process:
1. Extract text content from all chunks
2. Generate embeddings for each chunk
3. Randomly select a reference chunk
4. Find similar chunks using cosine similarity
5. Combine reference and similar chunks as context

Similarity threshold: 0.7 (configurable)
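
The selection steps above can be sketched with the standard library alone. This is an illustrative variant, not the notebook's cell: the names `cosine_similarity` and `select_contexts` and the toy two-dimensional vectors are assumptions, and the `reference_index` argument exists only to make the example repeatable.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_contexts(
    texts: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.7,
    reference_index: Optional[int] = None,
) -> Tuple[int, List[str]]:
    """Pick a reference chunk and append every other chunk that is similar enough."""
    if reference_index is None:
        reference_index = random.randrange(len(vectors))
    reference = vectors[reference_index]

    # Score every chunk except the reference itself.
    scored = [
        (i, cosine_similarity(reference, vec))
        for i, vec in enumerate(vectors)
        if i != reference_index
    ]
    similar = sorted(
        (pair for pair in scored if pair[1] >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reference_index, [texts[reference_index]] + [texts[i] for i, _ in similar]


texts = ["chunk A", "chunk B", "chunk C"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(select_contexts(texts, vectors, threshold=0.7, reference_index=0))
```

In this sketch the reference chunk is excluded before scoring, so the list of similar chunks never contains the reference, whose similarity with itself is always 1.0.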


In [4]:
logger.info("Starting embedding generation and context selection...")

# Extract content from chunks
content = [rc.page_content for rc in raw_chunks]
logger.info(f"Extracted content from {len(content)} chunks")

# Generate embeddings
print("🔄 Generating embeddings for all chunks...")
try:
    embeddings_vectors = embeddings.embed_texts(content)
    logger.info(f"Generated embeddings for {len(embeddings_vectors)} text chunks")
    print(f"✅ Embeddings generated successfully")
    print(f"- Embedding dimensions: {len(embeddings_vectors[0]) if embeddings_vectors else 'N/A'}")
except Exception as e:
    logger.error(f"Failed to generate embeddings: {str(e)}")
    raise


2025-08-03 10:13:01,869 - INFO - Starting embedding generation and context selection...
2025-08-03 10:13:01,870 - INFO - Extracted content from 1 chunks
🔄 Generating embeddings for all chunks...


  self.model = HuggingFaceEmbeddings(


2025-08-03 10:13:10,192 - INFO - PyTorch version 2.6.0+cu118 available.
2025-08-03 10:13:10,671 - INFO - Load pretrained SentenceTransformer: PartAI/Tooka-SBERT-V2-Large
2025-08-03 10:13:18,275 - INFO - Generated embeddings for 1 text chunks
✅ Embeddings generated successfully
- Embedding dimensions: 1024


In [5]:
# Context selection using similarity
print("\n🔄 Selecting relevant contexts using similarity matching...")

# Randomly select reference chunk
reference_index = random.randint(0, len(embeddings_vectors) - 1)
reference_embedding = embeddings_vectors[reference_index]
contexts = [content[reference_index]]

logger.info(f"Reference chunk selected: index {reference_index}")
print(f"📌 Reference chunk index: {reference_index}")

# Find similar chunks (the reference chunk always matches itself with similarity 1.0)
similarity_threshold = 0.7
similar_indices = []

print(f"🔍 Finding chunks with similarity >= {similarity_threshold}...")

for i, embedding in enumerate(embeddings_vectors):
    # Calculate cosine similarity
    product = np.dot(reference_embedding, embedding)
    norm = np.linalg.norm(reference_embedding) * np.linalg.norm(embedding)
    
    if norm == 0:
        similarity = 0
    else:
        similarity = product / norm
    
    if similarity >= similarity_threshold:
        similar_indices.append((i, similarity))
        logger.debug(f"Similar chunk found: index {i}, similarity: {similarity:.3f}")

# Sort by similarity (highest first)
similar_indices.sort(key=lambda x: x[1], reverse=True)

# Add similar chunks to contexts
for i, similarity in similar_indices:
    if i != reference_index:  # Avoid duplicating reference chunk
        contexts.append(content[i])

logger.info(f"Context selection completed:")
logger.info(f"- Reference chunk: 1")
logger.info(f"- Similar chunks found: {len(similar_indices)}")
logger.info(f"- Total contexts: {len(contexts)}")

print(f"\n📊 Context Selection Results:")
print(f"- Reference chunk: index {reference_index}")
print(f"- Similar chunks found: {len(similar_indices)}")
print(f"- Total context pieces: {len(contexts)}")

if similar_indices:
    print(f"- Similarity scores: {[f'{sim:.3f}' for _, sim in similar_indices[:5]]}")



🔄 Selecting relevant contexts using similarity matching...
2025-08-03 10:13:24,285 - INFO - Reference chunk selected: index 0
📌 Reference chunk index: 0
🔍 Finding chunks with similarity >= 0.7...
2025-08-03 10:13:24,287 - INFO - Context selection completed:
2025-08-03 10:13:24,288 - INFO - - Reference chunk: 1
2025-08-03 10:13:24,289 - INFO - - Similar chunks found: 1
2025-08-03 10:13:24,290 - INFO - - Total contexts: 1

📊 Context Selection Results:
- Reference chunk: index 0
- Similar chunks found: 1
- Total context pieces: 1
- Similarity scores: ['1.000']


## 4. Initial Query Generation

Generate initial queries based on the selected contexts using the LLM.
The prompt asks the model to create questions or statements that can be
addressed by the given context.

This step creates the foundation queries that will be evolved in the next step.
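
Because the prompt asks for a list of JSON objects with an `input` key, the reply usually needs to be parsed before the queries can be used. The helper below is a minimal sketch under that assumption; `extract_inputs` is a hypothetical name, and real replies may need more forgiving handling.

```python
import json
import re
from typing import List


def extract_inputs(response_text: str) -> List[str]:
    """Pull the `input` strings out of an LLM reply that should contain a JSON list."""
    # Grab the outermost [...] span so surrounding prose or Markdown
    # code fences around the JSON are ignored.
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []  # malformed or truncated JSON
    return [
        item["input"]
        for item in items
        if isinstance(item, dict) and isinstance(item.get("input"), str)
    ]


reply = 'Sure:\n[{"input": "What ingredients are needed?"}, {"input": "How long to bake?"}]'
print(extract_inputs(reply))
```

With the notebook's objects this would be called as `extract_inputs(query_response.content)`. A reply that is cut off in the middle of the list yields an empty list instead of raising.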



In [6]:
logger.info("Starting initial query generation...")

print("\n🔄 Generating initial queries from contexts...")

# Original prompt - DO NOT MODIFY
prompt = f"""I want you act as a copywriter. Based on the given context,
which is list of strings, please generate a list of JSON objects
with a `input` key. The `input` can either be a question or a
statement that can be addressed by the given context.

contexts:
{contexts}"""

try:
    query_response = llm.invoke(prompt)
    logger.info("Initial query generated successfully")
    print("✅ Initial queries generated")
    
except Exception as e:
    logger.error(f"Failed to generate initial query: {str(e)}")
    raise


2025-08-03 10:13:41,932 - INFO - Starting initial query generation...

🔄 Generating initial queries from contexts...
2025-08-03 10:13:47,766 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:13:47,785 - INFO - Initial query generated successfully
✅ Initial queries generated


In [7]:
print("\n" + "="*50)
print("GENERATED QUERIES")
print("="*50)
print(query_response.content)
print("="*50)



GENERATED QUERIES
[
  {
    "input": "What ingredients are needed to make Apple Turnovers?"
  },
  {
    "input": "How long should the apple mixture be cooked before adding the flour and sugar?"
  },
  {
    "input": "What temperature should the oven be preheated to when baking Apple Turnovers?"
  },
  {
    "input": "How long should Apple Turnovers be baked in the oven?"
  },
  {
    "input": "What should be done to seal the edges of the Apple Turnovers?"
  },
  {
    "input": "Can you make individual-sized Apple Turnovers using this recipe?"
  },
  {
    "input": "What can you serve Apple Turnovers with?"
  },
  {
    "input": "Why might someone consider this recipe 'skinny'?"
  },
  {
    "input": "What is the purpose of adding lemon juice to the apple mixture?"
  },
  {
    "input": "How do you prepare the pie crusts before adding the filling?"
  },
  {
    "input": "What is the role of butter in the final apple mixture?"
  },
  {
    "input": "Can you bake Apple Turnovers without

## 5. Query Evolution Process

Evolve the initial query through multiple transformation steps to create
more sophisticated and challenging questions.

Evolution Templates:
1. Multi-context: Requires information from all context elements
2. Reasoning: Requests multi-step logical reasoning
3. Hypothetical scenario: Incorporates speculative scenarios

Process:
- Start with an example query
- At each step, apply one randomly chosen evolution template
- Each step transforms the query to be more complex

Configuration:
- Example query: "what is recipe of Apple Turnovers" (hard-coded in the cell, not taken from the queries generated in Section 4)
- Evolution steps: 3
- Templates: Multi-context, Reasoning, Hypothetical scenario
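
The chaining idea can be sketched without a real model. Everything below is illustrative: the two short templates, the `{input}` placeholder name, `evolve`, and `fake_llm` are assumptions, not the notebook's prompts. The point it shows is that the templates are plain strings whose placeholders are filled at each step, so every step rewrites the previous step's output.

```python
import random
from typing import Callable, List, Optional

# Hypothetical, shortened templates. They are plain (non-f) strings, so the
# {context} and {input} markers survive until str.replace fills them in.
TEMPLATES = {
    "Reasoning": (
        "Rewrite to require multi-step reasoning.\n"
        "Context: {context}\nInput: {input}\nRewritten Input:"
    ),
    "Hypothetical scenario": (
        "Rewrite as a hypothetical scenario.\n"
        "Context: {context}\nInput: {input}\nRewritten Input:"
    ),
}


def evolve(
    query: str,
    context: List[str],
    steps: int,
    call_llm: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the full history: the original query plus one rewrite per step."""
    rng = rng or random.Random()
    history = [query]
    for _ in range(steps):
        name = rng.choice(sorted(TEMPLATES))
        prompt = (
            TEMPLATES[name]
            .replace("{context}", str(context))
            .replace("{input}", history[-1])  # feed the latest version back in
        )
        history.append(call_llm(prompt).strip())
    return history


def fake_llm(prompt: str) -> str:
    """Stand-in model: echoes the Input line with a marker appended."""
    current = prompt.split("Input: ", 1)[1].split("\n", 1)[0]
    return current + " [evolved]"


for version in evolve("what is recipe of Apple Turnovers", ["ctx"], 3, fake_llm):
    print(version)
```

Returning the whole history rather than only the final query makes it easy to inspect how each step changed the wording.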



In [8]:
logger.info("Starting query evolution process...")

# Evolution configuration
example_generated_query = "what is recipe of Apple Turnovers"
context = contexts
original_input = example_generated_query
num_evolution_steps = 3

logger.info(f"Evolution configuration:")
logger.info(f"- Original query: '{original_input}'")
logger.info(f"- Evolution steps: {num_evolution_steps}")
logger.info(f"- Context pieces: {len(context)}")

print(f"\n🔄 Query Evolution Process")
print(f"- Original query: '{original_input}'")
print(f"- Evolution steps: {num_evolution_steps}")
print(f"- Available templates: Multi-context, Reasoning, Hypothetical scenario")

# Evolution templates - DO NOT MODIFY THESE PROMPTS
# These are plain (non-f) strings: the {context} and {original_input}
# placeholders are filled in by evolve_query() at every step, so each step
# rewrites the current query rather than the original one.
multi_context_template = """
I want you to rewrite the given `input` so that it requires readers to use information from all elements in `Context`.

1. `Input` should require information from all `Context` elements.
2. `Rewritten Input` must be concise and fully answerable from `Context`.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

reasoning_template = """
I want you to rewrite the given `input` so that it explicitly requests multi-step reasoning.

1. `Rewritten Input` should require multiple logical connections or inferences.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

hypothetical_scenario_template = """
I want you to rewrite the given `input` to incorporate a hypothetical or speculative scenario.

1. `Rewritten Input` should encourage applying knowledge from `Context` to deduce outcomes.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

evolution_templates = [multi_context_template, reasoning_template, hypothetical_scenario_template]
template_names = ["Multi-context", "Reasoning", "Hypothetical scenario"]

logger.info(f"Loaded {len(evolution_templates)} evolution templates")

def evolve_query(original_input: str, context, steps: int) -> str:
    """
    Evolve a query through multiple transformation steps.
    
    Args:
        original_input: The initial query to evolve
        context: The context information
        steps: Number of evolution steps to perform
    
    Returns:
        The final evolved query
    """
    current_input = original_input
    
    print(f"\n📈 Evolution Steps:")
    print(f"Step 0 (Original): '{current_input}'")
    
    for step in range(steps):
        # Choose random template
        template_idx = random.randint(0, len(evolution_templates) - 1)
        chosen_template = evolution_templates[template_idx]
        template_name = template_names[template_idx]
        
        # Prepare prompt
        evolved_prompt = (
            chosen_template
            .replace("{context}", str(context))
            .replace("{original_input}", current_input)
        )
        
        logger.info(f"Evolution step {step + 1}: Using {template_name} template")
        print(f"\nStep {step + 1}: Applying {template_name} template")
        
        # Log the prompt being sent (truncated for readability)
        logger.debug(f"Prompt sent to LLM (step {step + 1}):\n{evolved_prompt[:200]}...")
        
        try:
            response = llm.invoke(evolved_prompt)
            current_input = response.content.strip()
            
            print(f"Result: '{current_input}'")
            logger.info(f"Evolution step {step + 1} completed successfully")
            
        except Exception as e:
            logger.error(f"Evolution step {step + 1} failed: {str(e)}")
            print(f"❌ Evolution step {step + 1} failed, keeping previous version")
            break

    return current_input

# Perform query evolution
print("\n" + "="*60)
print("QUERY EVOLUTION PROCESS")
print("="*60)

try:
    evolved_query = evolve_query(original_input, context, num_evolution_steps)
    logger.info("Query evolution completed successfully")
    logger.info(f"Final evolved query: '{evolved_query}'")
    
except Exception as e:
    logger.error(f"Query evolution failed: {str(e)}")
    evolved_query = original_input
    print(f"❌ Evolution failed, using original query: '{evolved_query}'")


2025-08-03 10:14:02,301 - INFO - Starting query evolution process...
2025-08-03 10:14:02,302 - INFO - Evolution configuration:
2025-08-03 10:14:02,303 - INFO - - Original query: 'what is recipe of Apple Turnovers'
2025-08-03 10:14:02,304 - INFO - - Evolution steps: 3
2025-08-03 10:14:02,305 - INFO - - Context pieces: 1

🔄 Query Evolution Process
- Original query: 'what is recipe of Apple Turnovers'
- Evolution steps: 3
- Available templates: Multi-context, Reasoning, Hypothetical scenario
2025-08-03 10:14:02,307 - INFO - Loaded 3 evolution templates

QUERY EVOLUTION PROCESS

📈 Evolution Steps:
Step 0 (Original): 'what is recipe of Apple Turnovers'
2025-08-03 10:14:02,309 - INFO - Evolution step 1: Using Hypothetical scenario template

Step 1: Applying Hypothetical scenario template
2025-08-03 10:14:04,631 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
Result: 'How do you make apple turnovers using two pie crusts and spiced apples?'
2025-08-03 1

In [9]:
print(f"\n🎯 FINAL EVOLVED QUERY:")
print(f"'{evolved_query}'")
print("="*60)

evolved_query



🎯 FINAL EVOLVED QUERY:
'How do you make apple turnovers using two pie crusts and spiced apples?'


'How do you make apple turnovers using two pie crusts and spiced apples?'

## 6. Expected Output Generation

Generate the expected answer for the evolved query based on the provided context.
This creates the ground truth answer that can be used for evaluation.

The prompt asks the model for an answer that is factually aligned with the provided context,
so the result can serve as a reference answer for evaluation. The prompt does not verify
the answer, so it is worth spot-checking against the source text.



In [10]:
logger.info("Starting expected output generation...")

print("\n🔄 Generating expected output (ground truth answer)...")

# Expected output template - DO NOT MODIFY
expected_output_template = """
I want you to generate an answer for the given `input`. This answer has to be factually aligned to the provided context.

Context: {context}
Input: {evolved_query}
Answer:
"""

# Prepare the prompt
prompt = expected_output_template.replace("{context}", str(context)).replace("{evolved_query}", evolved_query)

logger.info(f"Generating expected output for query: '{evolved_query}'")

try:
    expected_output = llm.invoke(prompt)
    logger.info("Expected output generated successfully")
    print("✅ Expected output generated")
    
except Exception as e:
    logger.error(f"Failed to generate expected output: {str(e)}")
    raise


2025-08-03 10:14:35,214 - INFO - Starting expected output generation...

🔄 Generating expected output (ground truth answer)...
2025-08-03 10:14:35,215 - INFO - Generating expected output for query: 'How do you make apple turnovers using two pie crusts and spiced apples?'
2025-08-03 10:14:39,853 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:14:39,855 - INFO - Expected output generated successfully
✅ Expected output generated


In [11]:
print("\n" + "="*50)
print("EXPECTED OUTPUT (GROUND TRUTH)")
print("="*50)
print(expected_output.content)
print("="*50)



EXPECTED OUTPUT (GROUND TRUTH)
To make apple turnovers using two pie crusts and spiced apples, first let the pie crusts sit at room temperature. In a pan, combine 3 cups of thinly sliced apples (with peel), 1/2 cup brown sugar, 1 teaspoon cinnamon, and 2 teaspoons fresh lemon juice. Add 2 tablespoons of water to help with mixing, then cook over medium heat until the mixture bubbles. Reduce the heat to low, cover, and cook for 10 minutes, stirring occasionally. Gradually stir in 2 tablespoons flour, 2 tablespoons sugar, and 1/2 teaspoon salt until the filling thickens. Remove from heat and stir in 1 teaspoon vanilla and 2 tablespoons butter. On an ungreased cookie sheet, place the two 15 oz. prepared pie crusts. Spread the spiced apple mixture evenly over half of each crust. Fold the other half of the crust over the filling, pressing the edges together with a little warm water to seal. Cut small slits in the tops for ventilation. Bake at 375°F for 30 minutes, or until the crusts are go

## 7. Synthetic Data Structure Creation

Create a structured synthetic data object using a Pydantic model.
This ensures data consistency and provides a clean interface for the synthetic dataset.

Structure includes:
- query: The evolved query/question
- expected_output: The ground truth answer
- context: List of context strings used
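
Once entries have this shape, they are straightforward to persist. The sketch below writes and reads JSON Lines using only the standard library. It uses a dataclass named `Record` as a stand-in for the Pydantic model so the example runs without extra dependencies; that substitution, and the helper names `to_jsonl` and `from_jsonl`, are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Record:
    """Stdlib stand-in for the Pydantic model, with the same three fields."""
    query: str
    expected_output: Optional[str]
    context: List[str]


def to_jsonl(records: List[Record]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> List[Record]:
    """Parse JSON Lines back into records, skipping blank lines."""
    return [Record(**json.loads(line)) for line in text.split("\n") if line.strip()]


records = [
    Record(
        query="How long should the turnovers bake?",
        expected_output="30 minutes at 375 degrees.",
        context=["Apple Turnovers\n\n2 prepared 15 oz. pie crusts"],
    )
]
round_tripped = from_jsonl(to_jsonl(records))
print(round_tripped == records)
```

Newlines inside the context strings are escaped by `json.dumps`, so each record still occupies exactly one line.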



In [12]:
logger.info("Creating synthetic data structure...")

class SyntheticData(BaseModel):
    """
    Pydantic model for synthetic data entries.
    
    Attributes:
        query: The question or statement to be answered
        expected_output: The ground truth answer
        context: List of context strings that contain the information
    """
    query: str
    expected_output: Optional[str]
    context: List[str]
    
    class Config:
        """Pydantic configuration"""
        validate_assignment = True

def as_str(x):
    """
    Helper function to extract string content from various object types.
    
    Args:
        x: Object that might have a 'content' attribute or is already a string
        
    Returns:
        String representation of the object
    """
    return x.content if hasattr(x, "content") else x

# Create synthetic data entry
try:
    synthetic_data = SyntheticData(
        query=evolved_query,
        expected_output=as_str(expected_output),
        context=context,
    )
    
    logger.info("Synthetic data structure created successfully")
    print("✅ Synthetic data entry created")
    
    # Initialize dataset list
    synthetic_dataset = []
    synthetic_dataset.append(synthetic_data)
    
    logger.info(f"Synthetic dataset initialized with {len(synthetic_dataset)} entry")
    
except Exception as e:
    logger.error(f"Failed to create synthetic data structure: {str(e)}")
    raise


2025-08-03 10:14:57,018 - INFO - Creating synthetic data structure...
2025-08-03 10:14:57,021 - INFO - Synthetic data structure created successfully
✅ Synthetic data entry created
2025-08-03 10:14:57,022 - INFO - Synthetic dataset initialized with 1 entry


In [13]:
print("\n" + "="*50)
print("SYNTHETIC DATA SUMMARY")
print("="*50)
print(f"Query: {synthetic_data.query}")
print(f"Context pieces: {len(synthetic_data.context)}")
print(f"Expected output length: {len(synthetic_data.expected_output)} characters")
print("="*50)

# Display context for verification
print("\nContext used:")
for i, ctx in enumerate(synthetic_dataset[0].context):
    print(f"Context {i+1}: {ctx[:100]}...")
    if i >= 2:  # Limit display to first 3 contexts
        remaining = len(synthetic_dataset[0].context) - 3
        if remaining > 0:  # Avoid printing "... and 0 more"
            print(f"... and {remaining} more context pieces")
        break

synthetic_dataset[0].context



SYNTHETIC DATA SUMMARY
Query: How do you make apple turnovers using two pie crusts and spiced apples?
Context pieces: 1
Expected output length: 1119 characters

Context used:
Context 1: Apple Turnovers

2 prepared 15 oz. pie crusts
3 cups thinly sliced apples with peel
1/2 cup brown su...


['Apple Turnovers\n\n2 prepared 15 oz. pie crusts\n3 cups thinly sliced apples with peel\n1/2 cup brown sugar\n1 tsp. cinnamon\n2 tsp. fresh lemon juice\n2 Tbsp. flour\n2 Tbsp. sugar\n1/2 tsp. salt\n1 tsp. vanilla\n2 Tbsp. Butter\n\nLet pie crust stand at room temperature while preparing the other\ningredients. Combine apples, brown sugar, cinnamon and lemon \njuice in pan. Add 2 Tbsp. water to allow easy mixing.  Cook\nover medium heat until mixture bubbles.  Cover and continue cooking\nover low heat for 10 minutes stirring occasionally.\nGradually add flour, sugar and salt to mixture and cook until the \nmixture begins to thicken.  Add in vanilla and butter and remove \nmixture from heat.  Spread out pie crusts on ungreased cookie sheet.\nSpread apple mixture evenly on half of each crust.  Fold over\nother side of crust and press edges with a little warm water to\nseal.  Cut small slits in top of crust and bake at 375 degrees\nfor 30 minutes until crust is golden brown.  Serve warm. 

## 8. Advanced Synthetic Dataset Generation with DeepEval

Use DeepEval's synthetic data generation to create a larger synthetic dataset directly from documents.

DeepEval provides:
- Query generation from document chunks
- Multiple evolution techniques (Multi-context, Comparative, In-Breadth, and Constrained appear in the run below)
- Quality scoring and filtering (scores are reported in the `context_quality` and `synthetic_input_quality` columns)
- Batch processing capabilities

This approach can generate larger volumes of synthetic data than the manual steps above.
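
Generated entries can repeat or score poorly, so a small post-processing pass is often useful. The sketch below works on plain dictionaries and assumes each row has the `input` and `synthetic_input_quality` keys, matching the column names in the DataFrame printed by this notebook. The function name `filter_goldens` and the 0.5 threshold are hypothetical choices, not DeepEval defaults.

```python
from typing import Dict, List


def filter_goldens(rows: List[Dict], min_quality: float = 0.5) -> List[Dict]:
    """Drop rows below a quality threshold and rows whose input was already seen."""
    seen = set()
    kept = []
    for row in rows:
        quality = row.get("synthetic_input_quality")
        if quality is None or quality < min_quality:
            continue  # unscored or low-quality entry
        key = row["input"].strip().lower()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


rows = [
    {"input": "How long should the turnovers bake?", "synthetic_input_quality": 1.0},
    {"input": "how long should the turnovers bake?", "synthetic_input_quality": 1.0},
    {"input": "What seals the crust edges?", "synthetic_input_quality": 0.2},
    {"input": "Which spices go in the filling?", "synthetic_input_quality": 0.9},
]
for row in filter_goldens(rows):
    print(row["input"])
```

With a pandas DataFrame, the same rows could be obtained with `df.to_dict("records")` before filtering.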



In [14]:
logger.info("Starting advanced synthetic dataset generation with DeepEval...")

print("\n🔄 Generating synthetic datasets using DeepEval...")
print("This process may take several minutes depending on document size...")

# Document path for DeepEval processing
deepeval_document_path = '../../Datasets/txt_example.txt'

logger.info(f"Processing document: {deepeval_document_path}")

try:
    # Generate golden datasets using DeepEval
    print("🔄 DeepEval processing in progress...")
    result = deep_e.generate_goldens_from_documents(document_paths=deepeval_document_path)
    
    logger.info("DeepEval synthetic dataset generation completed")
    print("✅ DeepEval dataset generation completed")
    
    # Get results as DataFrame for analysis
    df = deep_e.to_dataframe()
    
    logger.info(f"Generated {len(df)} synthetic data entries using DeepEval")
    print(f"📊 Generated {len(df)} entries in the DeepEval dataset")
    
except Exception as e:
    logger.error(f"DeepEval dataset generation failed: {str(e)}")
    print(f"❌ DeepEval generation failed: {str(e)}")
    df = None


2025-08-03 10:15:13,641 - INFO - Starting advanced synthetic dataset generation with DeepEval...

🔄 Generating synthetic datasets using DeepEval...
This process may take several minutes depending on document size...
2025-08-03 10:15:13,642 - INFO - Processing document: ../../Datasets/txt_example.txt


Output()

2025-08-03 10:15:15,451 - INFO - Load pretrained SentenceTransformer: PartAI/Tooka-SBERT-V2-Large
2025-08-03 10:15:22,021 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:22,031 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:22,033 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"


2025-08-03 10:15:24,614 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:24,634 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:24,708 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:25,433 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:25,435 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:25,453 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:26,356 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:26,357 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-03 10:15:27,175 - INFO - HTTP Request: POST http://localhost:11456/v1/chat/completions "

2025-08-03 10:15:31,835 - INFO - DeepEval synthetic dataset generation completed
✅ DeepEval dataset generation completed
2025-08-03 10:15:31,838 - INFO - Generated 24 synthetic data entries using DeepEval
📊 Generated 24 entries in the DeepEval dataset


In [15]:
# Display DeepEval results if successful
if df is not None and not df.empty:
    print("\n" + "="*60)
    print("DEEPEVAL SYNTHETIC DATASET SUMMARY")
    print("="*60)
    print(f"Total entries: {len(df)}")
    print(f"Columns: {list(df.columns)}")
    
    if len(df) > 0:
        print(f"\nFirst entry preview:")
        for col in df.columns:
            value = df.iloc[0][col]
            if isinstance(value, str) and len(value) > 100:
                print(f"{col}: {value[:100]}...")
            else:
                print(f"{col}: {value}")
    
    # Display DataFrame
    display(df)
else:
    print("No DeepEval results to display")

print("="*60)
print("PIPELINE COMPLETION SUMMARY")
print("="*60)
print(f"✅ Document processed: {len(raw_chunks)} chunks created")
print(f"✅ Embeddings generated: {len(embeddings_vectors)} vectors")
print(f"✅ Contexts selected: {len(contexts)} pieces")
print(f"✅ Query evolved: {num_evolution_steps} steps")
print(f"✅ Manual synthetic data: {len(synthetic_dataset)} entry")
if df is not None:
    print(f"✅ DeepEval synthetic data: {len(df)} entries")
print(f"Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*60)

logger.info("Synthetic dataset generation pipeline completed successfully")



DEEPEVAL SYNTHETIC DATASET SUMMARY
Total entries: 24
Columns: ['input', 'actual_output', 'expected_output', 'context', 'retrieval_context', 'n_chunks_per_context', 'context_length', 'evolutions', 'context_quality', 'synthetic_input_quality', 'source_file']

First entry preview:
input: What ingredients and steps are required to prepare the thickened filling for apple turnovers using a...
actual_output: None
expected_output: To prepare the thickened filling for apple turnovers, combine 3 cups thinly sliced apples with peel,...
context: ['Apple Turnovers\n\n2 prepared 15 oz. pie crusts\n3 cups thinly sliced apples with peel\n1/2 cup brown sugar\n1 tsp. cinnamon\n2 tsp. fresh lemon juice\n2 Tbsp. flour\n2 Tbsp. sugar\n1/2 tsp. salt\n1 tsp. vanilla\n2 Tbsp. Butter\n\nLet pie crust stand at room temperature while preparing the other\ningredients. Combine apples, brown sugar, cinnamon and lemon \njuice', ' brown sugar, cinnamon and lemon \njuice in pan. Add 2 Tbsp. water to allow easy mixing

| | input | actual_output | expected_output | context | retrieval_context | n_chunks_per_context | context_length | evolutions | context_quality | synthetic_input_quality | source_file |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What ingredients and steps are required to pre... | | To prepare the thickened filling for apple tur... | [Apple Turnovers\n\n2 prepared 15 oz. pie crus... | | 2 | 765 | [Multi-context] | | 1.0 | ../../Datasets/txt_example.txt |
| 1 | How does cooking the apple filling before addi... | | Cooking the apple filling before adding it to ... | [Apple Turnovers\n\n2 prepared 15 oz. pie crus... | | 2 | 765 | [Comparative] | | 1.0 | ../../Datasets/txt_example.txt |
| 2 | What ingredients and steps are required to pre... | | To prepare the thickened filling for apple tur... | [Apple Turnovers\n\n2 prepared 15 oz. pie crus... | | 2 | 765 | [Multi-context] | | 1.0 | ../../Datasets/txt_example.txt |
| 3 | How does cooking the apple filling before addi... | | Cooking the apple filling before adding it to ... | [Apple Turnovers\n\n2 prepared 15 oz. pie crus... | | 2 | 765 | [Comparative] | | 1.0 | ../../Datasets/txt_example.txt |
| 4 | Bake individual apple turnovers in mini pie cr... | | Bake individual apple turnovers in mini pie cr... | [ pie crusts\ninto smaller pieces and make ind... | | 2 | 658 | [In-Breadth] | | 1.0 | ../../Datasets/txt_example.txt |
| 5 | How should apple turnovers be baked, and what ... | | Bake apple turnovers at 375 degrees for 30 min... | [ pie crusts\ninto smaller pieces and make ind... | | 2 | 658 | [Constrained] | | 1.0 | ../../Datasets/txt_example.txt |
| 6 | What ingredients and steps are required to pre... | | To prepare the thickened filling for apple tur... | [Apple Turnovers\n\n2 prepared 15 oz. pie crus... | | 2 | 765 | [Multi-context] | | 1.0 | ../../Datasets/txt_example.txt |
| 7 | How does cooking the apple filling before addi... | | Cooking the apple filling before adding it to ... | [Apple Turnovers\n\n2 prepared 15 oz. pie crus... | | 2 | 765 | [Comparative] | | 1.0 | ../../Datasets/txt_example.txt |
| 8 | Bake individual apple turnovers in mini pie cr... | | Bake individual apple turnovers in mini pie cr... | [ pie crusts\ninto smaller pieces and make ind... | | 2 | 658 | [In-Breadth] | | 1.0 | ../../Datasets/txt_example.txt |
| 9 | How should apple turnovers be baked, and what ... | | Bake apple turnovers at 375 degrees for 30 min... | [ pie crusts\ninto smaller pieces and make ind... | | 2 | 658 | [Constrained] | | 1.0 | ../../Datasets/txt_example.txt |


PIPELINE COMPLETION SUMMARY
✅ Document processed: 1 chunks created
✅ Embeddings generated: 1 vectors
✅ Contexts selected: 1 pieces
✅ Query evolved: 3 steps
✅ Manual synthetic data: 1 entry
✅ DeepEval synthetic data: 24 entries
Completed at: 2025-08-03 10:15:38
2025-08-03 10:15:38,630 - INFO - Synthetic dataset generation pipeline completed successfully
