# 🔬 Evaluation Dataset Generation

> **Important Note**: This notebook has been sanitized for open-source publication. Original evaluation questions containing company-specific data have been removed.

## 📚 Documentation Available

The original embedded code has been replaced by documentation. Please refer to:

1. **[EVALUATION_METHODOLOGY.md](./EVALUATION_METHODOLOGY.md)** - Complete evaluation framework and methodology
2. **[ORIGINAL_DATASET_ARCHIVED.md](./ORIGINAL_DATASET_ARCHIVED.md)** - Information about the original dataset

## 🎯 What This Notebook Originally Did

This notebook was used to systematically create evaluation questions that tested:
- Natural language understanding
- Data analysis accuracy  
- Query complexity handling
- System limitation recognition

The methodology and learnings are preserved in the documentation above, while sensitive company data has been removed.

In [None]:
# Evaluation Dataset Generation Notebook
# ==========================================
# This notebook was used to systematically generate evaluation questions
# for testing the SmartSCM Assistant's query understanding and data analysis capabilities.
# 
# For detailed methodology, see: EVALUATION_METHODOLOGY.md
#
# Note: This notebook has been sanitized. The original evaluation used company data,
# which has been replaced with synthetic datasets.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Evaluation Question Categories

This notebook originally contained 99 evaluation questions across three difficulty levels, plus a separate set of out-of-scope questions:
- **33 Easy Questions**: Simple queries testing basic filtering and aggregation
- **33 Medium Questions**: Multi-condition queries with grouping and calculations  
- **33 Hard Questions**: Complex multi-step analysis with advanced pandas operations
- **20+ Out-of-Scope Questions**: Testing the system's ability to recognize limitations
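
One way to hold such a question set in code is sketched below. This is only an illustration: the `EvalQuestion` class and the category labels are hypothetical names, not the structure used in the original notebook, and the four entries are the sample questions quoted later in this notebook rather than the removed originals.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalQuestion:
    """Hypothetical record for one evaluation question."""
    text: str
    category: str  # assumed labels: "easy", "medium", "hard", "out_of_scope"


# Placeholder entries taken from the sample questions in this notebook.
questions = [
    EvalQuestion("How many orders do we have?", "easy"),
    EvalQuestion("Average order value by region", "medium"),
    EvalQuestion("Orders with more than 5 different products", "hard"),
    EvalQuestion("How many cancelled orders?", "out_of_scope"),
]

# Tally questions per category to check the intended balance.
counts = Counter(q.category for q in questions)

# Out-of-scope questions have no expected numeric answer, so they are
# usually scored separately from the in-scope ones.
in_scope = [q for q in questions if q.category != "out_of_scope"]
```

Keeping the category on each record makes it straightforward to verify the per-level balance before running an evaluation.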

## Methodology

The evaluation dataset was created following a systematic approach:

1. **Schema Analysis**: Reviewed all dataset columns and relationships
2. **Pattern Identification**: Identified common analytical query patterns
3. **Complexity Progression**: Questions ranged from simple counts to complex statistical analysis
4. **Natural Language Variation**: Tested different phrasings and question formats
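
As a rough illustration of the schema analysis step, the sketch below summarizes each column of a small synthetic table. The `orders` table and its column names are assumptions made for this example, not the original schema.

```python
import pandas as pd

# Synthetic table; the column names are assumptions for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["North", "South", "North"],
    "order_value": [100.0, None, 300.0],
})

# One row per column: data type, number of distinct values, missing values.
schema = pd.DataFrame({
    "dtype": orders.dtypes.astype(str),
    "n_unique": orders.nunique(),
    "n_missing": orders.isna().sum(),
})
```

A summary like this helps decide which columns can support counts, filters, or grouped calculations, and which questions would fall outside what the data can answer.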

## Sample Question Categories

### Easy Level Examples
- Count operations: "How many orders do we have?"
- Simple filters: "Show me orders by customer"
- Basic aggregations: "What is the total revenue?"
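
The easy-level questions above typically map to a single pandas operation each. The following is a minimal sketch on synthetic data; the `orders` table and its column names are assumptions, not the original dataset.

```python
import pandas as pd

# Synthetic orders table; every column name is an assumption for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["Acme", "Acme", "Globex", "Initech"],
    "revenue": [100.0, 250.0, 75.0, 300.0],
})

# "How many orders do we have?" -> row count
order_count = len(orders)

# "Show me orders by customer" -> number of orders per customer
orders_by_customer = orders.groupby("customer")["order_id"].count()

# "What is the total revenue?" -> a single aggregation
total_revenue = orders["revenue"].sum()
```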

### Medium Level Examples  
- Conditional filtering: "Which customers have more than 100 orders?"
- Date-based analysis: "How many orders were placed in January?"
- Grouped calculations: "Average order value by region"
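
Medium-level questions usually combine a condition with a grouping or a date filter. The sketch below shows one possible translation of each example, again on synthetic data with assumed column names; the order-count threshold is lowered from 100 to 2 so the toy table produces a result.

```python
import pandas as pd

# Synthetic orders table; column names are assumptions for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer": ["Acme", "Acme", "Acme", "Globex", "Initech"],
    "region": ["North", "North", "South", "South", "North"],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-11", "2024-01-30", "2024-03-02"]
    ),
    "order_value": [100.0, 200.0, 50.0, 150.0, 300.0],
})

# Conditional filtering: customers with more than 2 orders (toy threshold).
orders_per_customer = orders["customer"].value_counts()
frequent_customers = orders_per_customer[orders_per_customer > 2].index.tolist()

# Date-based analysis: number of orders placed in January.
january_orders = int((orders["order_date"].dt.month == 1).sum())

# Grouped calculation: average order value by region.
avg_value_by_region = orders.groupby("region")["order_value"].mean()
```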

### Hard Level Examples
- Multi-step analysis: "For each customer, average quantity per product"
- Statistical queries: "Regions with >50% express shipping"
- Complex conditions: "Orders with more than 5 different products"
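
Hard-level questions generally need several chained steps. The following sketch assumes a synthetic line-item table (one row per product on an order) with hypothetical column names; the distinct-product threshold is lowered from 5 to 2 to suit the toy data.

```python
import pandas as pd

# Synthetic line items: one row per product on an order.
# All column names and values are assumptions for illustration.
lines = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3, 4],
    "customer": ["Acme", "Acme", "Acme", "Acme", "Acme", "Globex", "Globex"],
    "region": ["North", "North", "North", "North", "North", "South", "South"],
    "product": ["A", "B", "A", "B", "C", "A", "A"],
    "quantity": [10, 4, 20, 6, 1, 5, 7],
    "shipping": ["express", "express", "express", "express", "express",
                 "standard", "express"],
})

# Multi-step analysis: for each customer, average quantity per product.
avg_qty = lines.groupby(["customer", "product"])["quantity"].mean()

# Statistical query: regions where more than 50% of orders ship express.
# Collapse to one row per order first so multi-line orders are not overcounted.
order_level = lines.drop_duplicates("order_id")
express_share = (
    (order_level["shipping"] == "express").groupby(order_level["region"]).mean()
)
express_regions = express_share[express_share > 0.5].index.tolist()

# Complex condition: orders with more than 2 different products (toy threshold).
products_per_order = lines.groupby("order_id")["product"].nunique()
multi_product_orders = products_per_order[products_per_order > 2].index.tolist()
```

Note that the strict `>` comparison excludes a region sitting at exactly 50%, which is the kind of boundary case that makes these questions harder to answer correctly.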

### Out-of-Scope Examples
- Missing data: "How many cancelled orders?" (no cancellation field)
- Wrong timeframe: "Show orders from 2018" (data starts 2024)
- Non-existent values: Questions about data not in the dataset
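
A scope check for questions like these could be sketched as follows. The `scope_problems` helper is hypothetical and is not how the SmartSCM Assistant detects limitations; it only illustrates the two checks implied above (a missing field and a year outside the data range) on synthetic data with assumed column names.

```python
import pandas as pd

# Synthetic table with no cancellation field and only 2024 dates.
orders = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-11"]),
})


def scope_problems(df, required_columns=(), year=None, date_column="order_date"):
    """Hypothetical helper: list reasons a question cannot be answered from df."""
    problems = []
    for col in required_columns:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if year is not None and date_column in df.columns:
        first = int(df[date_column].dt.year.min())
        last = int(df[date_column].dt.year.max())
        if not first <= year <= last:
            problems.append(f"year {year} outside data range {first}-{last}")
    return problems


# "How many cancelled orders?" needs a field the data does not have.
cancelled_check = scope_problems(orders, required_columns=["cancelled"])

# "Show orders from 2018" asks for a year before the data starts.
old_year_check = scope_problems(orders, year=2018)

# An answerable question produces no problems.
in_scope_check = scope_problems(orders, required_columns=["order_id"], year=2024)
```

An empty list means the question can be attempted; a non-empty list gives the reason the system should decline rather than guess.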

---

**For complete methodology and insights, see [`EVALUATION_METHODOLOGY.md`](./EVALUATION_METHODOLOGY.md)**

**Note**: Original questions referenced company-specific data and have been removed to protect confidentiality. The evaluation framework and methodology remain documented for educational purposes.