# KNNRouter - Training

This notebook demonstrates how to train the **KNNRouter** (K-Nearest Neighbors Router).

## Overview

KNNRouter uses a K-Nearest Neighbors classifier to route queries to the most suitable LLM based on:
- Query embeddings (using Longformer)
- Historical performance data

**Key Features**:
- Simple and interpretable
- Fast to train: fitting only stores (or indexes) the training embeddings, and inference is a nearest-neighbor lookup
- Works well with limited training data
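
To make the routing idea concrete, here is a minimal NumPy-only sketch. It assumes each past query is labeled with the name of the LLM that performed best on it. The 2-D embeddings, the model names (`model_a`, `model_b`), and the `route` helper are hypothetical illustrations, not part of `llmrouter`.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D "embeddings" of past queries; real ones would come from Longformer.
train_embeddings = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # queries where model_a did best
    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9],   # queries where model_b did best
])
best_models = ["model_a", "model_a", "model_a", "model_b", "model_b", "model_b"]


def route(query_embedding, k=3):
    """Return the majority label among the k nearest training queries."""
    # Euclidean distance from the new query to every stored query.
    distances = np.linalg.norm(train_embeddings - query_embedding, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(best_models[i] for i in nearest)
    return votes.most_common(1)[0][0]


print(route(np.array([0.05, 0.1])))
print(route(np.array([0.95, 1.0])))
```

A new query is routed to whichever model won most often among its closest historical neighbors, which is why the approach stays easy to inspect.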

## 1. Environment Setup

In [None]:
# Install required packages (for Colab)
# !pip install llmrouter scikit-learn transformers torch

In [None]:
import os
import sys
from pathlib import Path

# Set project root (assumes this notebook sits two directories below the repository root)
PROJECT_ROOT = Path(os.getcwd()).parent.parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

os.chdir(PROJECT_ROOT)
print(f"Working directory: {os.getcwd()}")

In [None]:
# Import required modules
from llmrouter.models.knnrouter import KNNRouter, KNNRouterTrainer
from llmrouter.utils import setup_environment

setup_environment()
print("Environment setup complete!")

## 2. Configuration

The `hparam` section of the YAML configuration sets the KNN classifier's parameters:

| Parameter | Description | Default |
|-----------|-------------|--------|
| `n_neighbors` | Number of neighbors (K value) | 5 |
| `weights` | Weight function: "uniform" or "distance" | "uniform" |
| `algorithm` | Algorithm: "auto", "ball_tree", "kd_tree", "brute" | "auto" |
| `metric` | Distance metric | "minkowski" |
| `p` | Power for Minkowski metric (1=Manhattan, 2=Euclidean) | 2 |
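
These names match the constructor arguments of scikit-learn's `KNeighborsClassifier`, which this notebook uses directly in the tuning section. Assuming they are passed through unchanged, the sketch below shows how `weights` can change a routing decision on a toy 1-D dataset. The data and model names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: one nearby "model_a" sample and two farther "model_b" samples.
X_toy = np.array([[0.0], [1.0], [1.1]])
y_toy = np.array(["model_a", "model_b", "model_b"])
query = np.array([[0.3]])

# Uniform voting: each of the 3 neighbors counts once, so model_b wins 2 to 1.
uniform_knn = KNeighborsClassifier(
    n_neighbors=3, weights="uniform", metric="minkowski", p=2
).fit(X_toy, y_toy)

# Distance voting: votes are weighted by 1/distance, so the close model_a sample dominates.
distance_knn = KNeighborsClassifier(
    n_neighbors=3, weights="distance", metric="minkowski", p=2
).fit(X_toy, y_toy)

uniform_choice = uniform_knn.predict(query)[0]
distance_choice = distance_knn.predict(query)[0]
print(f"uniform:  {uniform_choice}")
print(f"distance: {distance_choice}")
```

With `p=2` the Minkowski metric is the Euclidean distance; setting `p=1` would use the Manhattan distance instead.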

In [None]:
import yaml

# Configuration file path
CONFIG_PATH = "configs/model_config_train/knnrouter.yaml"

# Load and display configuration
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

print("Current Configuration:")
print("=" * 50)
print(yaml.dump(config, default_flow_style=False))

In [None]:
# Optionally modify configuration
# You can create a custom config for experimentation

CUSTOM_CONFIG = {
    'data_path': {
        'query_data_train': 'data/example_data/query_data/default_query_train.jsonl',
        'query_data_test': 'data/example_data/query_data/default_query_test.jsonl',
        'query_embedding_data': 'data/example_data/routing_data/query_embeddings_longformer.pt',
        'routing_data_train': 'data/example_data/routing_data/default_routing_train_data.jsonl',
        'routing_data_test': 'data/example_data/routing_data/default_routing_test_data.jsonl',
        'llm_data': 'data/example_data/llm_candidates/default_llm.json',
        'llm_embedding_data': 'data/example_data/llm_candidates/default_llm_embeddings.json'
    },
    'model_path': {
        'ini_model_path': '',
        'save_model_path': 'saved_models/knnrouter/knnrouter.pkl'
    },
    'metric': {
        'weights': {
            'performance': 1,
            'cost': 0,
            'llm_judge': 0
        }
    },
    'hparam': {
        'n_neighbors': 5,
        'weights': 'uniform',
        'algorithm': 'auto',
        'leaf_size': 30,
        'p': 2,
        'metric': 'minkowski',
        'n_jobs': -1
    }
}

# Save custom config (optional)
# custom_config_path = 'configs/model_config_train/knnrouter_custom.yaml'
# with open(custom_config_path, 'w') as f:
#     yaml.dump(CUSTOM_CONFIG, f)
# print(f"Custom config saved to {custom_config_path}")

## 3. Initialize Router

In [None]:
# Initialize KNNRouter with configuration
router = KNNRouter(yaml_path=CONFIG_PATH)

print("Router initialized successfully!")
print(f"Number of training samples: {len(router.routing_data_train)}")
print(f"Number of LLM candidates: {len(router.llm_data)}")
print(f"LLM candidates: {list(router.llm_data.keys())}")

In [None]:
# Inspect the KNN model configuration
print("KNN Model Parameters:")
print(router.knn_model.get_params())

## 4. Data Exploration

In [None]:
import pandas as pd

# Explore training data
train_df = router.routing_data_train
print("Training Data Shape:", train_df.shape)
print("\nColumns:", list(train_df.columns))
print("\nSample data:")
train_df.head()

In [None]:
# Analyze performance distribution
print("Performance Statistics:")
print(train_df['performance'].describe())

print("\nPerformance by Model:")
print(train_df.groupby('model_name')['performance'].mean().sort_values(ascending=False))

In [None]:
# Visualize performance distribution
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Performance histogram
axes[0].hist(train_df['performance'], bins=20, edgecolor='black')
axes[0].set_xlabel('Performance')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Performance Distribution')

# Performance by model
model_perf = train_df.groupby('model_name')['performance'].mean().sort_values()
model_perf.plot(kind='barh', ax=axes[1])
axes[1].set_xlabel('Average Performance')
axes[1].set_title('Average Performance by Model')

plt.tight_layout()
plt.show()

## 5. Training

In [None]:
# Initialize trainer
trainer = KNNRouterTrainer(router=router, device='cpu')

print("Trainer initialized!")
print(f"Training samples: {len(trainer.query_embedding_list)}")
print(f"Save path: {trainer.save_model_path}")

In [None]:
# Train the model
print("Starting training...")
print("=" * 50)

trainer.train()

print("=" * 50)
print("Training completed!")

## 6. Model Verification

In [None]:
# Verify the trained model
from llmrouter.utils import load_model

# Load the saved model
saved_model = load_model(trainer.save_model_path)

print("Model loaded successfully!")
print(f"Model type: {type(saved_model).__name__}")
print(f"Number of classes: {len(saved_model.classes_)}")
print(f"Classes: {saved_model.classes_}")

In [None]:
# Quick prediction test
import numpy as np

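# Use the first training sample as a smoke test; it is its own nearest
# neighbor, so this checks the pipeline rather than measuring accuracy
test_embedding = np.asarray(trainer.query_embedding_list[0]).reshape(1, -1)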
prediction = saved_model.predict(test_embedding)

print(f"Test prediction: {prediction[0]}")

# Get prediction probabilities
proba = saved_model.predict_proba(test_embedding)
print("\nPrediction probabilities:")
for model, prob in zip(saved_model.classes_, proba[0]):
    print(f"  {model}: {prob:.4f}")
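
For reading the probabilities above: with `weights='uniform'`, scikit-learn's `predict_proba` returns the fraction of the K nearest neighbors that belong to each class. A small sketch on made-up 1-D data (the model names are hypothetical) illustrates this for K=5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three model_a samples, two model_b samples, and one far-away model_c sample.
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [5.0]])
y_toy = np.array(["model_a", "model_a", "model_a", "model_b", "model_b", "model_c"])

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X_toy, y_toy)

# The 5 nearest neighbors of 0.2 are the points 0.0 through 0.4:
# 3 of them are model_a, 2 are model_b, none are model_c.
proba = knn.predict_proba(np.array([[0.2]]))

for model, prob in zip(knn.classes_, proba[0]):
    print(f"  {model}: {prob:.4f}")
```

Probabilities are therefore multiples of 1/K under uniform weights, so a small K gives coarse confidence values.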

## 7. Hyperparameter Tuning (Optional)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Prepare data
X = np.array(trainer.query_embedding_list)
y = np.array(trainer.model_name_list)

print(f"Feature shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Unique labels: {len(np.unique(y))}")

In [None]:
# Sweep K with 5-fold cross-validation
k_values = [1, 3, 5, 7, 9, 11, 15]
results = []

print("Cross-validation for different K values:")
print("=" * 40)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    mean_score = scores.mean()
    std_score = scores.std()
    results.append((k, mean_score, std_score))
    print(f"K={k:2d}: {mean_score:.4f} (+/- {std_score:.4f})")

# Find best K
best_k, best_score, _ = max(results, key=lambda x: x[1])
print(f"\nBest K: {best_k} with accuracy: {best_score:.4f}")
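
The sweep above can run into trouble on small routing datasets: for classifiers, an integer `cv` uses stratified folds, which warn or fail when a class has fewer samples than folds, and a K larger than the training split cannot be evaluated. The helper below is a hypothetical sketch (not part of `llmrouter` or scikit-learn) that derives a fold count and a list of usable K values from the labels.

```python
import numpy as np


def safe_cv_settings(y, k_values, max_folds=5):
    """Pick a fold count and K candidates that fit the label distribution."""
    _, counts = np.unique(y, return_counts=True)
    # Stratified folds need at least one sample of each class per fold.
    n_folds = int(min(max_folds, counts.min()))
    if n_folds < 2:
        raise ValueError("Every class needs at least 2 samples for stratified CV.")
    # Conservative estimate of the smallest training split across folds.
    min_train_size = len(y) - int(np.ceil(len(y) / n_folds))
    usable_k = [k for k in k_values if k <= min_train_size]
    return n_folds, usable_k


# Example with an imbalanced label set: 12 samples of one model, 3 of another.
y_demo = np.array(["model_a"] * 12 + ["model_b"] * 3)
n_folds, usable_k = safe_cv_settings(y_demo, [1, 3, 5, 7, 9, 11, 15])
print(f"folds={n_folds}, usable K values={usable_k}")
```

The returned values could then be passed as `cv=n_folds` and used in place of `k_values` in the loop.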

In [None]:
# Visualize K selection
import matplotlib.pyplot as plt

k_values_plot = [r[0] for r in results]
scores_plot = [r[1] for r in results]
stds_plot = [r[2] for r in results]

plt.figure(figsize=(10, 6))
plt.errorbar(k_values_plot, scores_plot, yerr=stds_plot, marker='o', capsize=5)
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best K={best_k}')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('KNN Performance vs. K Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 8. Save Final Model

In [None]:
# Retrain with best K if different from original
if best_k != router.cfg['hparam']['n_neighbors']:
    print(f"Retraining with optimal K={best_k}...")
    
    # Create new KNN model with best K; the other settings match the
    # scikit-learn defaults used in the cross-validation sweep above
    from sklearn.neighbors import KNeighborsClassifier
    from llmrouter.utils import save_model
    
    best_knn = KNeighborsClassifier(
        n_neighbors=best_k,
        weights='uniform',
        algorithm='auto',
        n_jobs=-1
    )
    
    # Train with best K
    best_knn.fit(X, y)
    
    # Save optimal model
    optimal_model_path = trainer.save_model_path.replace('.pkl', '_optimal.pkl')
    save_model(best_knn, optimal_model_path)
    print(f"Optimal model saved to: {optimal_model_path}")
else:
    print(f"Configured K={best_k} already gives the best cross-validation accuracy among the tested values.")

## Summary

In this notebook, we:

1. **Loaded Configuration**: Set up KNNRouter with YAML configuration
2. **Explored Data**: Analyzed training data distribution
3. **Trained Model**: Used KNNRouterTrainer to fit the KNN classifier
4. **Verified Model**: Loaded and tested the saved model
5. **Tuned Hyperparameters**: Selected K by 5-fold cross-validation and retrained when it differed from the configured value

**Next Steps**:
- Use `02_knnrouter_inference.ipynb` to perform inference with the trained model
- Experiment with other distance metrics such as cosine or Manhattan (the default Minkowski metric with `p=2` is already Euclidean)
- Try weighted voting with `weights='distance'`
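
As a starting point for the last two experiments, here is a minimal sketch that compares a few metric and weighting variants with cross-validation. It runs on synthetic data standing in for the real embeddings and labels; with the notebook's data you would substitute `X` and `y` from the tuning section. The variant names and the synthetic clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for (query embeddings, routed model names).
X_demo = np.vstack([rng.normal(0.0, 1.0, (30, 8)), rng.normal(3.0, 1.0, (30, 8))])
y_demo = np.array(["model_a"] * 30 + ["model_b"] * 30)

variants = {
    "euclidean/uniform": dict(metric="minkowski", p=2, weights="uniform"),
    "cosine/uniform": dict(metric="cosine", weights="uniform"),
    "euclidean/distance": dict(metric="minkowski", p=2, weights="distance"),
}

cv_means = {}
for name, params in variants.items():
    knn = KNeighborsClassifier(n_neighbors=5, **params)
    scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {cv_means[name]:.4f}")
```

Scores on this synthetic data say nothing about the real routing task; the point is only the comparison loop.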