# Creating Custom Datasets for LLMRouter

**Estimated Time:** 45 minutes  
**Level:** Advanced  
**Prerequisites:** 00_Quick_Start, 02_Data_Preparation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ulab-uiuc/LLMRouter/blob/main/tutorials/notebooks/10_Creating_Custom_Datasets.ipynb)

## Learning Objectives

By the end of this tutorial, you will:
- ✅ Understand all data formats in LLMRouter
- ✅ Create query datasets from scratch
- ✅ Create routing ground truth data
- ✅ Convert existing datasets (ChatBot Arena, MT-Bench, etc.)
- ✅ Create domain-specific datasets
- ✅ Validate data quality

---

In [None]:
# Setup
!git clone https://github.com/ulab-uiuc/LLMRouter.git
%cd LLMRouter
!pip install -e . -q

## 1. Understanding Data Formats

LLMRouter uses three main data types:

### 1.1 Query Data (JSONL)

A plain list of queries, one JSON object per line:

```jsonl
{"query": "What is machine learning?", "id": "q1"}
{"query": "Explain quantum physics", "id": "q2"}
```

### 1.2 Routing Data (JSONL)

Ground truth for training: the model that performs best on each query.

```jsonl
{"query": "What is ML?", "best_llm": "gpt-4", "performance": 0.95}
{"query": "Code a sort", "best_llm": "code-llama", "performance": 0.88}
```
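
Both files above are JSON Lines: one JSON object per line. The cells in this tutorial read and write them inline, but a pair of small helpers can make that less repetitive. The sketch below is only an illustration; `read_jsonl` and `write_jsonl` are hypothetical names for this tutorial and are not part of the LLMRouter API.

```python
import json

def write_jsonl(path, records):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate trailing or stray blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the line number so a bad record is easy to find
                raise ValueError(f"{path}: invalid JSON on line {line_no}: {exc}") from exc
    return records
```

Raising on malformed lines, rather than silently skipping them, is a deliberate choice here: a dropped routing label is harder to notice later than an early error.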

### 1.3 LLM Candidates (JSON)

Available models (covered in Tutorial 09)

In [None]:
import json
import pandas as pd

# Load example data
with open('data/example_data/query_data/default_query_test.jsonl', 'r') as f:
    queries = [json.loads(line) for line in f]

with open('data/example_data/routing_data/default_routing_test_data.jsonl', 'r') as f:
    routing_data = [json.loads(line) for line in f]

print(f"Sample query: {queries[0]}")
print(f"\nSample routing data: {routing_data[0]}")

## 2. Creating Query Dataset from Scratch

Let's create a custom domain-specific dataset.

**Example Domain:** Programming Questions

In [None]:
# Create programming queries dataset
programming_queries = [
    {"query": "Write a Python function to reverse a string", "id": "prog_1", "category": "coding"},
    {"query": "Explain the difference between == and === in JavaScript", "id": "prog_2", "category": "concept"},
    {"query": "What is a binary search tree?", "id": "prog_3", "category": "theory"},
    {"query": "Debug this code: for i in range(10) print(i)", "id": "prog_4", "category": "debugging"},
    {"query": "How do I optimize SQL queries?", "id": "prog_5", "category": "optimization"},
    {"query": "Implement quicksort in C++", "id": "prog_6", "category": "coding"},
    {"query": "What are design patterns? Give examples", "id": "prog_7", "category": "architecture"},
    {"query": "Convert this loop to list comprehension: result = []; for i in range(10): result.append(i*2)", "id": "prog_8", "category": "refactoring"},
]

# Save to JSONL
with open('my_programming_queries.jsonl', 'w') as f:
    for q in programming_queries:
        f.write(json.dumps(q) + '\n')

print(f"✅ Created {len(programming_queries)} programming queries")
print("\nCategories:")
categories = {}
for q in programming_queries:
    cat = q['category']
    categories[cat] = categories.get(cat, 0) + 1

for cat, count in sorted(categories.items()):
    print(f"  {cat}: {count}")

## 3. Creating Routing Ground Truth

### Method 1: Manual Labeling

For small datasets, you can assign the best model to each query by hand.

In [None]:
# Manual routing labels
# Assume we have these models: gpt-4, claude-3, code-llama, llama-3-8b

routing_labels = [
    {"query": "Write a Python function to reverse a string", 
     "best_llm": "code-llama", 
     "performance": 0.92,
     "reason": "code generation task"},
    
    {"query": "Explain the difference between == and === in JavaScript", 
     "best_llm": "gpt-4", 
     "performance": 0.95,
     "reason": "requires clear explanation"},
    
    {"query": "What is a binary search tree?", 
     "best_llm": "claude-3", 
     "performance": 0.90,
     "reason": "conceptual explanation"},
    
    {"query": "Debug this code: for i in range(10) print(i)", 
     "best_llm": "code-llama", 
     "performance": 0.88,
     "reason": "code debugging"},
    
    {"query": "How do I optimize SQL queries?", 
     "best_llm": "gpt-4", 
     "performance": 0.93,
     "reason": "complex technical topic"},
    
    {"query": "Implement quicksort in C++", 
     "best_llm": "code-llama", 
     "performance": 0.94,
     "reason": "algorithm implementation"},
    
    {"query": "What are design patterns? Give examples", 
     "best_llm": "gpt-4", 
     "performance": 0.91,
     "reason": "requires examples and explanation"},
    
    {"query": "Convert this loop to list comprehension: result = []; for i in range(10): result.append(i*2)", 
     "best_llm": "code-llama", 
     "performance": 0.96,
     "reason": "code transformation"},
]

# Save routing data
with open('my_routing_labels.jsonl', 'w') as f:
    for label in routing_labels:
        f.write(json.dumps(label) + '\n')

print(f"✅ Created {len(routing_labels)} routing labels")

# Analyze distribution
model_counts = {}
for label in routing_labels:
    model = label['best_llm']
    model_counts[model] = model_counts.get(model, 0) + 1

print("\nModel distribution:")
for model, count in sorted(model_counts.items()):
    print(f"  {model}: {count} queries ({count/len(routing_labels)*100:.1f}%)")

### Method 2: Generate from Existing Evaluations

If you have already run several models on the same queries and scored their outputs, you can derive routing labels from those evaluation results.

In [None]:
# Example: You ran 4 models on each query and measured performance
evaluation_results = [
    {
        "query": "Write a Python function to reverse a string",
        "results": {
            "gpt-4": {"score": 0.85, "cost": 0.01},
            "claude-3": {"score": 0.83, "cost": 0.015},
            "code-llama": {"score": 0.92, "cost": 0.002},
            "llama-3-8b": {"score": 0.75, "cost": 0.0005},
        }
    },
    {
        "query": "Explain quantum physics",
        "results": {
            "gpt-4": {"score": 0.95, "cost": 0.01},
            "claude-3": {"score": 0.93, "cost": 0.015},
            "code-llama": {"score": 0.60, "cost": 0.002},
            "llama-3-8b": {"score": 0.70, "cost": 0.0005},
        }
    },
]

def generate_routing_from_eval(evaluations, strategy='best_performance', threshold=0.8):
    """Generate routing data from evaluation results.

    Strategies:
    - best_performance: Choose model with highest score
    - cost_aware: Choose model with highest score/cost ratio
    - threshold: Use cheapest model scoring at or above `threshold`
      (falls back to the highest-scoring model if none qualifies)

    The default threshold of 0.8 is an illustrative value; tune it for your data.
    """
    routing_data = []

    for eval_item in evaluations:
        query = eval_item['query']
        results = eval_item['results']

        if strategy == 'best_performance':
            # Select model with highest score
            best_model = max(results.items(), key=lambda x: x[1]['score'])
        elif strategy == 'cost_aware':
            # Maximize score/cost ratio (cost is floored to avoid division by zero)
            best_model = max(results.items(),
                             key=lambda x: x[1]['score'] / max(x[1]['cost'], 0.0001))
        elif strategy == 'threshold':
            # Cheapest model that clears the threshold, else the best scorer
            eligible = [m for m in results.items() if m[1]['score'] >= threshold]
            if eligible:
                best_model = min(eligible, key=lambda x: x[1]['cost'])
            else:
                best_model = max(results.items(), key=lambda x: x[1]['score'])
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        routing_data.append({
            'query': query,
            'best_llm': best_model[0],
            'performance': best_model[1]['score'],
            'cost': best_model[1]['cost'],
            'strategy': strategy,
        })

    return routing_data

# Generate routing data
auto_routing = generate_routing_from_eval(evaluation_results, 'best_performance')

print("Generated routing data:")
for item in auto_routing:
    print(f"  {item['query'][:40]}... → {item['best_llm']} (score: {item['performance']})")
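
Before relying on the `cost_aware` strategy, it helps to see what a raw score/cost ratio does on the sample numbers. The sketch below recomputes the ratio for the first evaluation above, using the same floor of `0.0001` on cost; the variable names are only for this illustration.

```python
# Scores and costs copied from the first sample evaluation above
results = {
    "gpt-4": {"score": 0.85, "cost": 0.01},
    "claude-3": {"score": 0.83, "cost": 0.015},
    "code-llama": {"score": 0.92, "cost": 0.002},
    "llama-3-8b": {"score": 0.75, "cost": 0.0005},
}

# Same ratio as the cost_aware branch: score / max(cost, 0.0001)
ratios = {name: r["score"] / max(r["cost"], 0.0001) for name, r in results.items()}

# Print models from highest to lowest ratio
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} score={results[name]['score']:.2f} ratio={ratio:.1f}")

best_by_score = max(results, key=lambda name: results[name]["score"])
best_by_ratio = max(ratios, key=ratios.get)
print(f"best by score: {best_by_score}, best by ratio: {best_by_ratio}")
```

On these numbers the ratio selects `llama-3-8b`, the lowest-scoring model, because its cost is twenty times lower than `gpt-4`'s. A plain ratio therefore leans strongly toward cheap models, so check the resulting label distribution before training on it.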

## 4. Converting Existing Datasets

### 4.1 From ChatBot Arena Format

In [None]:
# Example ChatBot Arena data (conversation format)
chatbot_arena_data = [
    {
        "conversation": [
            {"role": "user", "content": "Explain neural networks"},
            {"role": "assistant", "content": "Neural networks are..."},
        ],
        "model_a": "gpt-4",
        "model_b": "claude-3",
        "winner": "model_a",  # or "model_b" or "tie"
    },
]

def convert_chatbot_arena(arena_data):
    """Convert ChatBot Arena format to LLMRouter format."""
    query_data = []
    routing_data = []
    
    for i, item in enumerate(arena_data):
        # Extract first user message as query
        query = next((msg['content'] for msg in item['conversation'] 
                     if msg['role'] == 'user'), None)
        
        if not query:
            continue
        
        # Add to query dataset
        query_data.append({
            'query': query,
            'id': f'arena_{i}',
        })
        
        # Determine best model from winner
        if item['winner'] == 'model_a':
            best_model = item['model_a']
            performance = 1.0
        elif item['winner'] == 'model_b':
            best_model = item['model_b']
            performance = 1.0
        else:  # tie
            best_model = item['model_a']  # arbitrary choice
            performance = 0.5
        
        routing_data.append({
            'query': query,
            'best_llm': best_model,
            'performance': performance,
            'source': 'chatbot_arena',
        })
    
    return query_data, routing_data

queries, routing = convert_chatbot_arena(chatbot_arena_data)
print(f"✅ Converted {len(queries)} ChatBot Arena examples")
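
The converter above assigns ties to `model_a` arbitrarily, which can add label noise. One option, sketched here under the assumption that the `winner` field only marks a clear winner with `"model_a"` or `"model_b"`, is to separate decisive battles from the rest before converting. `split_decisive` and the sample `battles` list are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical battles in the same shape as chatbot_arena_data above
# (the conversation field is omitted because it is not needed here)
battles = [
    {"model_a": "gpt-4", "model_b": "claude-3", "winner": "model_a"},
    {"model_a": "gpt-4", "model_b": "llama-3-8b", "winner": "tie"},
    {"model_a": "claude-3", "model_b": "llama-3-8b", "winner": "model_b"},
]

def split_decisive(battles):
    """Separate battles with a clear winner from everything else (ties, unknown values)."""
    decisive = [b for b in battles if b["winner"] in ("model_a", "model_b")]
    other = [b for b in battles if b["winner"] not in ("model_a", "model_b")]
    return decisive, other

decisive, other = split_decisive(battles)

# b["winner"] is "model_a" or "model_b", so b[b["winner"]] is the winning model's name
wins = Counter(b[b["winner"]] for b in decisive)
print(f"decisive={len(decisive)}, set aside={len(other)}, wins={dict(wins)}")
```

Whether to drop ties or keep them with a lower `performance` value depends on how much data you can afford to lose.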

### 4.2 From MT-Bench Format

In [None]:
# MT-Bench format (multi-turn benchmark)
mt_bench_data = [
    {
        "question_id": 1,
        "category": "writing",
        "turns": [
            "Write a short story about AI",
            "Now make it a poem",
        ],
        "reference_answer": "...",
    },
]

def convert_mt_bench(mt_bench_data):
    """Convert MT-Bench to LLMRouter format."""
    query_data = []
    
    for item in mt_bench_data:
        # Use first turn as query
        query_data.append({
            'query': item['turns'][0],
            'id': f"mtbench_{item['question_id']}",
            'category': item['category'],
            'multi_turn': len(item['turns']) > 1,
        })
    
    return query_data

queries = convert_mt_bench(mt_bench_data)
print(f"✅ Converted {len(queries)} MT-Bench examples")
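
The converter above keeps only the first turn of each question. If follow-up turns matter for your routing task, one possible approach is to emit a record per turn and carry the earlier turns along. This is a sketch: the `turn` and `context` fields and the `_t{n}` id suffix are assumptions made for illustration, not fields that LLMRouter is known to read.

```python
def expand_mt_bench_turns(mt_bench_data):
    """Emit one query record per turn, carrying earlier user turns as context."""
    records = []
    for item in mt_bench_data:
        for turn_idx, turn in enumerate(item["turns"]):
            records.append({
                "query": turn,
                "id": f"mtbench_{item['question_id']}_t{turn_idx + 1}",
                "category": item["category"],
                "turn": turn_idx + 1,                  # 1-based turn number
                "context": item["turns"][:turn_idx],   # earlier user turns only
            })
    return records

sample = [{
    "question_id": 1,
    "category": "writing",
    "turns": ["Write a short story about AI", "Now make it a poem"],
}]

for record in expand_mt_bench_turns(sample):
    print(record["id"], "|", record["query"], "| context:", record["context"])
```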

## 5. Data Quality Validation

Always validate your dataset before training.

In [None]:
def validate_dataset(query_file, routing_file, llm_file):
    """Validate dataset consistency and quality."""
    
    # Load data
    with open(query_file, 'r') as f:
        queries = [json.loads(line) for line in f]
    
    with open(routing_file, 'r') as f:
        routing = [json.loads(line) for line in f]
    
    with open(llm_file, 'r') as f:
        llm_data = json.load(f)
    
    print("📊 Dataset Validation Report\n" + "="*60)
    
    # 1. Check sizes
    print(f"\n1. Dataset Sizes:")
    print(f"   Queries: {len(queries)}")
    print(f"   Routing labels: {len(routing)}")
    print(f"   LLM models: {len(llm_data)}")
    
    if len(queries) != len(routing):
        print(f"   ⚠️ WARNING: Query count != Routing count")
    
    # 2. Check required fields (every record, not just the first few)
    print(f"\n2. Field Validation:")
    fields_ok = True
    for i, q in enumerate(queries):
        if 'query' not in q:
            print(f"   ❌ Query {i} missing 'query' field")
            fields_ok = False

    for i, r in enumerate(routing):
        if 'query' not in r or 'best_llm' not in r:
            print(f"   ❌ Routing {i} missing required fields")
            fields_ok = False

    # Only report success when no record failed the checks above
    if fields_ok:
        print("   ✅ All required fields present")
    
    # 3. Check model names
    print(f"\n3. Model Name Validation:")
    used_models = set(r['best_llm'] for r in routing if 'best_llm' in r)
    available_models = set(llm_data.keys())
    
    missing = used_models - available_models
    if missing:
        print(f"   ❌ Models in routing but not in LLM data: {missing}")
    else:
        print(f"   ✅ All models are available")
    
    # 4. Distribution analysis
    print(f"\n4. Model Distribution:")
    model_counts = {}
    for r in routing:
        model = r['best_llm']
        model_counts[model] = model_counts.get(model, 0) + 1
    
    for model in sorted(model_counts.keys()):
        count = model_counts[model]
        pct = count / len(routing) * 100
        print(f"   {model}: {count} ({pct:.1f}%)")
    
    # 5. Query length analysis
    print(f"\n5. Query Length Statistics:")
    lengths = [len(q['query'].split()) for q in queries if 'query' in q]
    if lengths:
        print(f"   Min: {min(lengths)} words")
        print(f"   Max: {max(lengths)} words")
        print(f"   Mean: {sum(lengths)/len(lengths):.1f} words")
    else:
        print("   No queries to analyze")
    
    print("\n" + "="*60)
    print("✅ Validation complete")

# Run validation
# validate_dataset(
#     'my_programming_queries.jsonl',
#     'my_routing_labels.jsonl',
#     'data/example_data/llm_candidates/default_llm.json'
# )
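
The report above compares the two files by count only, so equal counts can still hide queries with no label or labels with no query. A minimal cross-check, assuming the two files are linked by exact `query` text (the routing examples in this tutorial carry no `id`), might look like this. `find_unmatched` and the sample records are hypothetical.

```python
def find_unmatched(queries, routing):
    """Return (queries without a label, labels without a query), matched on query text."""
    query_texts = {q["query"] for q in queries if "query" in q}
    routed_texts = {r["query"] for r in routing if "query" in r}
    unlabeled = sorted(query_texts - routed_texts)   # in query file only
    orphaned = sorted(routed_texts - query_texts)    # in routing file only
    return unlabeled, orphaned

sample_queries = [
    {"query": "What is ML?", "id": "q1"},
    {"query": "Code a sort", "id": "q2"},
]
sample_routing = [
    {"query": "What is ML?", "best_llm": "gpt-4"},
    {"query": "Explain quantum physics", "best_llm": "gpt-4"},
]

unlabeled, orphaned = find_unmatched(sample_queries, sample_routing)
print("Queries without a routing label:", unlabeled)
print("Routing labels without a query:", orphaned)
```

Both sample lists have two records, so the count check passes even though one query is unlabeled and one label is orphaned.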

## 6. Creating Train/Test Splits

In [None]:
import random

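def create_train_test_split(data, test_ratio=0.2, random_seed=42):
    """Split data into train and test sets."""
    # Use a local RNG so the split is reproducible without reseeding the global one
    rng = random.Random(random_seed)

    # Shuffle a copy so the caller's list is left untouched
    shuffled = data.copy()
    rng.shuffle(shuffled)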
    
    # Split
    split_idx = int(len(shuffled) * (1 - test_ratio))
    train = shuffled[:split_idx]
    test = shuffled[split_idx:]
    
    return train, test

# Example
all_data = routing_labels  # from earlier
train_data, test_data = create_train_test_split(all_data, test_ratio=0.2)

# Save splits
with open('train_routing.jsonl', 'w') as f:
    for item in train_data:
        f.write(json.dumps(item) + '\n')

with open('test_routing.jsonl', 'w') as f:
    for item in test_data:
        f.write(json.dumps(item) + '\n')

print(f"✅ Split complete:")
print(f"   Train: {len(train_data)} examples")
print(f"   Test: {len(test_data)} examples")
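
With a dataset this small, a purely random split can put a model's only example in the test set, so the router never sees that model during training. A stratified split by `best_llm` is one way to reduce that risk. The sketch below is an assumption-laden illustration, not an LLMRouter utility; `stratified_split` is a hypothetical name, and the sample labels mirror the model distribution of the manual labels above.

```python
import random
from collections import Counter, defaultdict

def stratified_split(data, key="best_llm", test_ratio=0.2, random_seed=42):
    """Split per label so every label with 2+ examples appears in both splits."""
    rng = random.Random(random_seed)
    groups = defaultdict(list)
    for item in data:
        groups[item[key]].append(item)

    train, test = [], []
    for label in sorted(groups):  # sorted so the result does not depend on input order of labels
        items = list(groups[label])
        rng.shuffle(items)
        if len(items) < 2:
            n_test = 0  # a single example stays in train
        else:
            # At least one test example, but always leave one for training
            n_test = min(max(1, round(len(items) * test_ratio)), len(items) - 1)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical labels: 4 code-llama, 3 gpt-4, 1 claude-3
labels = (
    [{"query": f"code {i}", "best_llm": "code-llama"} for i in range(4)]
    + [{"query": f"explain {i}", "best_llm": "gpt-4"} for i in range(3)]
    + [{"query": "theory 0", "best_llm": "claude-3"}]
)
train, test = stratified_split(labels)
print("train:", dict(Counter(r["best_llm"] for r in train)))
print("test: ", dict(Counter(r["best_llm"] for r in test)))
```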

## 7. Advanced: Synthetic Data Generation

Generate synthetic queries using LLMs.

In [None]:
# This requires an LLM API
# Example template for generating queries

generation_prompt = """
Generate 10 diverse programming questions that vary in:
- Difficulty (beginner to advanced)
- Topic (algorithms, debugging, concepts, etc.)
- Length (short to long)

Format each as:
{"query": "...", "difficulty": "...", "topic": "..."}

One per line.
"""

# You would call an LLM API here
# generated_queries = call_llm(generation_prompt)

print("Example synthetic generation prompt:")
print(generation_prompt)
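
Model output rarely follows a format request perfectly, so it is worth parsing the response defensively before adding it to a dataset. The sketch below assumes the response is plain text with one JSON object per line, as the prompt requests. `parse_generated_queries` is a hypothetical helper, and `sample_response` is a hand-written stand-in: no LLM API is called.

```python
import json

def parse_generated_queries(raw_text, id_prefix="synth"):
    """Parse one-JSON-object-per-line output, setting aside lines that fail validation."""
    parsed, skipped = [], []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped.append(line)  # chatty filler or broken JSON
            continue
        query = record.get("query") if isinstance(record, dict) else None
        if not isinstance(query, str) or not query.strip():
            skipped.append(line)  # valid JSON but no usable query text
            continue
        record["id"] = f"{id_prefix}_{len(parsed) + 1}"
        parsed.append(record)
    return parsed, skipped

# Hand-written stand-in for a model response
sample_response = """
{"query": "What does the 'static' keyword mean in Java?", "difficulty": "beginner", "topic": "concepts"}
Sure! Here is another one:
{"query": "", "difficulty": "advanced", "topic": "algorithms"}
{"query": "Why does this recursive function overflow the stack?", "difficulty": "intermediate", "topic": "debugging"}
"""

parsed, skipped = parse_generated_queries(sample_response)
print(f"kept {len(parsed)}, skipped {len(skipped)}")
for record in parsed:
    print(record["id"], "|", record["query"])
```

Returning the skipped lines, instead of discarding them, makes it easy to inspect how often the model strays from the requested format.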

## 8. Best Practices

### Dataset Size Guidelines

- **Minimum**: 100 training examples
- **Good**: 500-1000 examples
- **Excellent**: 5000+ examples

### Quality > Quantity

✅ **Good practices:**
- Diverse query types
- Balanced model distribution
- Clear routing decisions
- Queries representative of real-world usage

❌ **Avoid:**
- Duplicate queries
- Biased distributions
- Ambiguous labels
- Out-of-domain queries
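
Duplicate queries are the easiest of these problems to check mechanically. A minimal sketch, assuming that two queries count as duplicates when they match after lowercasing and collapsing whitespace, follows. `normalize` and `drop_duplicate_queries` are hypothetical helpers; near-duplicates with different wording would need a fuzzier comparison.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def drop_duplicate_queries(records):
    """Keep the first record for each normalized query text."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record["query"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique

sample = [
    {"query": "What is a binary search tree?", "id": "prog_3"},
    {"query": "what is a  binary search tree?  ", "id": "prog_9"},
    {"query": "Implement quicksort in C++", "id": "prog_6"},
]
unique = drop_duplicate_queries(sample)
print(f"{len(sample)} records -> {len(unique)} unique")
```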

### Model Coverage

Each model should have:
- At least 50-100 examples
- Examples showcasing its strengths
- Coverage across difficulty levels
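
The example-count guideline above can be checked directly from the routing labels. The sketch below counts how often each candidate model is labeled best and reports those below a minimum; `under_covered_models` is a hypothetical helper, and the default of 50 is taken from the lower bound above.

```python
from collections import Counter

def under_covered_models(routing, candidates, min_examples=50):
    """Return {model: count} for candidates labeled best fewer than min_examples times."""
    counts = Counter(r["best_llm"] for r in routing)
    return {m: counts[m] for m in candidates if counts[m] < min_examples}

# Hypothetical label set: 60 gpt-4 labels, 10 code-llama labels, none for claude-3
sample_routing = (
    [{"query": f"q{i}", "best_llm": "gpt-4"} for i in range(60)]
    + [{"query": f"c{i}", "best_llm": "code-llama"} for i in range(10)]
)
gaps = under_covered_models(sample_routing, ["gpt-4", "code-llama", "claude-3"])
print(gaps)
```

Passing the candidate list explicitly means a model that never appears in the labels is still reported, with a count of zero.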

## Summary

### What You Learned:
- ✅ All LLMRouter data formats
- ✅ Creating query datasets
- ✅ Creating routing ground truth
- ✅ Converting existing datasets
- ✅ Data validation
- ✅ Train/test splitting
- ✅ Best practices

### Key Files Created:
1. `my_programming_queries.jsonl` - Custom queries
2. `my_routing_labels.jsonl` - Routing ground truth
3. `train_routing.jsonl` - Training split
4. `test_routing.jsonl` - Test split

### Next Steps:
- **[03_Training_Single_Round_Routers.ipynb](03_Training_Single_Round_Routers.ipynb)** - Train with your data
- **[05_Inference_and_Evaluation.ipynb](05_Inference_and_Evaluation.ipynb)** - Evaluate performance
- **[11_Advanced_Customization.ipynb](11_Advanced_Customization.ipynb)** - Advanced techniques