<a href="https://colab.research.google.com/github/your-username/moe-llm/blob/main/notebooks/moe_llm_colab_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating a Mixture of Experts (MoE) LLM in Google Colab

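This notebook provides a complete workflow for evaluating a Mixture of Experts (MoE) Large Language Model on several benchmarks in Google Colab. To stay within Colab's resource limits, it falls back to a small demonstration model when no trained model is available and reduces the evaluation workload (100 perplexity examples, 2-shot prompts, small batch sizes).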

## Overview

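1. **Setup Environment**: Clone the repository and install dependencies
2. **Load Model**: Load a pre-trained MoE model, or create a small one for demonstration
3. **Evaluate on Benchmarks**: Measure perplexity, then evaluate on MMLU and GSM8K
4. **Generate Text**: Sample completions for a few prompts
5. **Visualize Results**: Plot the evaluation results
6. **Compare with DeepSeek-V3**: Compare the results with DeepSeek-V3's published results
7. **Analyze Model Behavior**: Inspect the expert routing patterns
8. **Conclusion**: Summarize the results and next steps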

Let's get started!

## 1. Setup Environment

First, let's check what GPU we have available and set up our environment.

In [None]:
# Check GPU availability
!nvidia-smi

Now, let's clone the repository and install the required dependencies:

In [None]:
# Clone the repository
!git clone https://github.com/your-username/moe-llm.git
%cd moe-llm

In [None]:
# Install dependencies
!pip install -r requirements.txt

# Install additional dependencies for evaluation
!pip install accelerate bitsandbytes sentencepiece datasets matplotlib

## 2. Load Model

Now, let's load a pre-trained MoE model. You can either use a model you've trained previously or load a smaller model for demonstration purposes.

In [None]:
import torch
import os
import json
import logging
from model.config import MoEConfig
from model.model import MoELLM
from data.tokenizer import get_tokenizer
from utils.logging import configure_logging

# Configure logging
logger = configure_logging(log_level="INFO", log_file="logs/evaluation.log")

# Check if we have a trained model
if os.path.exists("trained_model"):
    print("Loading previously trained model...")
    model_path = "trained_model"
else:
    print("No trained model found. Creating a smaller model for demonstration...")
    model_path = None

# Load or create model
if model_path:
    # Load tokenizer
    tokenizer = get_tokenizer(
        tokenizer_name_or_path=model_path,
        use_fast=True
    )
    
    # Load model configuration
    if os.path.exists(os.path.join(model_path, "config.json")):
        with open(os.path.join(model_path, "config.json"), "r") as f:
            config_dict = json.load(f)
        model_config = MoEConfig(**config_dict)
    else:
        # Create default config
        model_config = MoEConfig(
            vocab_size=len(tokenizer),
            hidden_size=768,
            intermediate_size=2048,
            num_hidden_layers=12,
            num_attention_heads=12,
            num_experts=8,
            num_experts_per_token=2
        )
    
    # Initialize model
    model = MoELLM(model_config)
    
    # Load model weights
    if os.path.exists(os.path.join(model_path, "pytorch_model.bin")):
        model.load_state_dict(torch.load(os.path.join(model_path, "pytorch_model.bin"), map_location="cpu"))
    else:
        print("Warning: No model weights found. Using randomly initialized weights.")
else:
    # Load tokenizer from a pre-trained model
    tokenizer = get_tokenizer(
        tokenizer_name_or_path="EleutherAI/gpt-neo-1.3B",
        use_fast=True
    )
    
    # Create a smaller model for demonstration
    model_config = MoEConfig(
        vocab_size=len(tokenizer),
        hidden_size=512,  # Very small for demonstration
        intermediate_size=1024,
        num_hidden_layers=6,
        num_attention_heads=8,
        num_experts=4,
        num_experts_per_token=2,
        max_position_embeddings=1024,
        max_sequence_length=1024
    )
    
    # Initialize model
    model = MoELLM(model_config)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set model to evaluation mode
model.eval()

# Print model information
print(f"Model loaded with {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
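
The parameter count printed above is the total. In an MoE model only `num_experts_per_token` of the `num_experts` feed-forward experts run for each token, so the active count per token is smaller. The following is a rough sketch of that arithmetic for the demonstration configuration. It assumes every layer is an MoE layer and that each expert is a two-matrix feed-forward block without biases; the repository's `MoELLM` may be structured differently, and the helper names are hypothetical.

```python
def expert_ffn_params(hidden_size, intermediate_size):
    """Weights in one expert, assuming an up- and a down-projection with no biases."""
    return 2 * hidden_size * intermediate_size


def moe_ffn_param_counts(hidden_size, intermediate_size, num_hidden_layers,
                         num_experts, num_experts_per_token):
    """Return (total, active-per-token) expert parameters across all layers.

    Assumes every hidden layer contains an MoE block (an assumption, not
    something the notebook states).
    """
    per_expert = expert_ffn_params(hidden_size, intermediate_size)
    total = num_hidden_layers * num_experts * per_expert
    active = num_hidden_layers * num_experts_per_token * per_expert
    return total, active


# Values taken from the demonstration MoEConfig in the cell above.
total, active = moe_ffn_param_counts(
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=6,
    num_experts=4,
    num_experts_per_token=2,
)
print(f"Expert FFN parameters: {total / 1e6:.2f}M total, {active / 1e6:.2f}M active per token")
```

Attention, embedding, and router parameters are not included, so these numbers will be lower than the total printed by the cell above.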

## 3. Evaluate on Benchmarks

Now, let's evaluate the model on various benchmarks. We'll start with a simple perplexity evaluation on a small dataset, and then move on to more complex benchmarks if resources allow.

In [None]:
from evaluation.perplexity import calculate_perplexity
from datasets import load_dataset

# Load a small validation dataset
print("Loading validation dataset...")
try:
    validation_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
    
    # Calculate perplexity on a subset of the data to save time
    print("Calculating perplexity...")
    perplexity_metrics = calculate_perplexity(
        model=model,
        tokenizer=tokenizer,
        texts=[t for t in validation_dataset["text"] if t.strip()][:100],  # First 100 non-empty lines (WikiText contains many blank lines)
        device=device,
        batch_size=1,
        stride=512,
        max_length=1024
    )
    
    print(f"Perplexity: {perplexity_metrics['perplexity']:.4f}")
except Exception as e:
    print(f"Error calculating perplexity: {e}")
    perplexity_metrics = {"perplexity": float('nan')}
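
For reference, perplexity is the exponential of the mean per-token negative log-likelihood, and the `stride` and `max_length` arguments suggest a sliding-window evaluation. The sketch below shows both ideas in plain Python. It is not the repository's `calculate_perplexity`, whose internals are not shown here; the function names are hypothetical, and the windowing follows the common strided scheme in which each window scores only the tokens not already scored by the previous one.

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def sliding_windows(num_tokens, max_length, stride):
    """Yield (begin, end, num_scored) for each evaluation window.

    Each window spans at most max_length tokens and starts stride tokens
    after the previous one; num_scored counts the tokens that were not
    covered by an earlier window.
    """
    prev_end = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + max_length, num_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == num_tokens:
            break


# A model that assigns probability 1/4 to every token has perplexity 4.
print(perplexity_from_log_probs([math.log(0.25)] * 8))

# Windows for a 2000-token text with the settings used above.
for window in sliding_windows(2000, max_length=1024, stride=512):
    print(window)
```

With `stride` smaller than `max_length`, later windows see up to `max_length - stride` tokens of context before the tokens they score, at the cost of more forward passes.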

Now, let's evaluate the model on some standard benchmarks. Note that these evaluations can be resource-intensive, so we'll use smaller versions or subsets of the benchmarks.

In [None]:
from evaluation.benchmarks import BenchmarkEvaluator

# Initialize benchmark evaluator
print("Initializing benchmark evaluator...")
evaluator = BenchmarkEvaluator(
    model=model,
    tokenizer=tokenizer,
    device=device,
    output_dir="benchmark_results"
)

# Initialize dictionary to store all metrics
all_metrics = {}
all_metrics["perplexity"] = perplexity_metrics["perplexity"]

### 3.1 Evaluate on MMLU

The Massive Multitask Language Understanding (MMLU) benchmark tests the model's knowledge across various subjects.

In [None]:
# Evaluate on MMLU (using a small subset)
try:
    print("Evaluating on MMLU (subset)...")
    # Note: This downloads the MMLU dataset, which is quite large
    # Only the number of few-shot examples and the batch size are reduced here to save time
    mmlu_metrics = evaluator.evaluate_mmlu(
        data_path="cais/mmlu",
        num_few_shot=2,  # Reduced for speed
        batch_size=2
    )
    
    print(f"MMLU accuracy: {mmlu_metrics['mmlu_accuracy']:.4f}")
    all_metrics.update(mmlu_metrics)
except Exception as e:
    print(f"Error evaluating on MMLU: {e}")
    all_metrics["mmlu_accuracy"] = float('nan')
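
To make the `num_few_shot` setting concrete, here is a minimal sketch of how a few-shot multiple-choice prompt is commonly assembled for MMLU. It assumes each item provides a question, four choices, and the index of the correct choice; the prompt wording and helper names are illustrative, and `evaluator.evaluate_mmlu` may format its prompts differently.

```python
CHOICE_LABELS = "ABCD"


def format_question(question, choices, answer_index=None):
    """Render one item; leave the answer blank when answer_index is None."""
    lines = [question.strip()]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    answer = "" if answer_index is None else f" {CHOICE_LABELS[answer_index]}"
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_few_shot_prompt(subject, shots, target):
    """Join solved examples (shots) and the unsolved target into one prompt.

    shots:  list of (question, choices, answer_index) tuples
    target: (question, choices) tuple for the item being evaluated
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(q, c, a) for q, c, a in shots]
    blocks.append(format_question(target[0], target[1]))
    return header + "\n\n" + "\n\n".join(blocks)


# Two solved examples, matching num_few_shot=2 in the cell above.
shots = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], 1),
    ("What is 3 * 3?", ["6", "7", "8", "9"], 3),
]
target = ("What is 10 / 2?", ["2", "4", "5", "10"])
prompt = build_few_shot_prompt("elementary mathematics", shots, target)
print(prompt)
```

The model's answer is then usually read from whichever of the four label tokens receives the highest probability after the final `Answer:`.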

### 3.2 Evaluate on GSM8K

The Grade School Math 8K (GSM8K) benchmark tests the model's mathematical reasoning abilities.

In [None]:
# Evaluate on GSM8K (using a small subset)
try:
    print("Evaluating on GSM8K (subset)...")
    gsm8k_metrics = evaluator.evaluate_gsm8k(
        data_path="gsm8k",
        split="test",
        num_few_shot=2,  # Reduced for speed
        batch_size=2
    )
    
    print(f"GSM8K accuracy: {gsm8k_metrics['gsm8k_accuracy']:.4f}")
    all_metrics.update(gsm8k_metrics)
except Exception as e:
    print(f"Error evaluating on GSM8K: {e}")
    all_metrics["gsm8k_accuracy"] = float('nan')
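
GSM8K reference solutions end with a line of the form `#### <answer>`, so accuracy is typically an exact match on the final number. The sketch below illustrates one simple scoring rule: take the last number in the generated text and compare it with the reference. This is an assumption about how such scoring can work, not a description of `evaluator.evaluate_gsm8k`, and the helper names are hypothetical.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution):
    """Extract the final answer that follows '####' in a reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation):
    """Return the last number in the generated text, or None if there is none."""
    matches = NUMBER.findall(generation)
    return matches[-1].replace(",", "") if matches else None


def exact_match_accuracy(generations, solutions):
    """Fraction of generations whose final number equals the reference answer."""
    correct = 0
    for generation, solution in zip(generations, solutions):
        predicted = predicted_answer(generation)
        if predicted is not None and float(predicted) == float(reference_answer(solution)):
            correct += 1
    return correct / len(solutions)


generations = [
    "48 + 24 = 72, so the answer is 72.",
    "I am not sure.",
]
solutions = [
    "She sold 48 + 24 = 72 clips.\n#### 72",
    "The total is 1,200 dollars.\n#### 1,200",
]
print(exact_match_accuracy(generations, solutions))
```

Taking the last number is a heuristic: it fails when the model keeps writing after its answer, which is one reason generation length limits matter for this benchmark.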

## 4. Generate Text with the Model

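Let's generate some text with the model as a qualitative check. If the model is running with randomly initialized weights (the demonstration fallback in Section 2), expect incoherent output.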

In [None]:
# Define prompts
prompts = [
    "Explain the concept of a Mixture of Experts (MoE) architecture in simple terms.",
    "What are the advantages of using a Mixture of Experts model compared to a dense model?",
    "Solve the following math problem step by step: If a train travels at 60 mph for 3 hours and then at 80 mph for 2 hours, what is the average speed for the entire journey?"
]

# Generate text for each prompt
for prompt in prompts:
    print(f"\nPrompt: {prompt}\n")
    
    # Tokenize prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    
    # Generate text
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=input_ids,
            max_length=min(input_ids.shape[1] + 200, model_config.max_sequence_length),
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2
        )
    
    # Decode output
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    # Print generated text
    print(f"Generated text:\n{generated_text}\n")
    print("-" * 80)
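
The `temperature` and `top_p` arguments above control how the next token is sampled. The sketch below shows the standard technique in plain Python: divide the logits by the temperature, apply softmax, then keep the smallest set of most-likely tokens whose cumulative probability reaches `top_p` and renormalize. How `model.generate` implements these options is not shown in this notebook, so treat this as an illustration of the general idea; the function name is hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


filtered = top_p_filter([2.0, 1.0, 0.0, -5.0], temperature=1.0, top_p=0.9)
print(filtered)
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(filtered), weights=list(filtered.values()))`.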

## 5. Visualize Results

Let's visualize the evaluation results.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Save all metrics to a file
os.makedirs("benchmark_results", exist_ok=True)
with open("benchmark_results/all_metrics.json", "w") as f:
    json.dump(all_metrics, f, indent=2)

# Extract benchmark metrics
benchmark_metrics = {
    "MMLU": all_metrics.get("mmlu_accuracy", float('nan')),
    "GSM8K": all_metrics.get("gsm8k_accuracy", float('nan')),
    "Perplexity": 1.0 / all_metrics.get("perplexity", float('nan'))  # Plotted as 1/perplexity so that higher is better and the bar fits the 0-1 axis
}

# Filter out NaN values
benchmark_metrics = {k: v for k, v in benchmark_metrics.items() if not np.isnan(v)}

if benchmark_metrics:
    # Create bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(benchmark_metrics.keys(), benchmark_metrics.values())
    plt.ylim(0, 1)
    plt.title("Benchmark Results")
    plt.ylabel("Score")
    
    # Add value labels
    for i, (key, value) in enumerate(benchmark_metrics.items()):
        plt.text(i, value + 0.02, f"{value:.4f}", ha="center")
    
    plt.tight_layout()
    plt.savefig("benchmark_results/benchmark_results.png")
    plt.show()
else:
    print("No valid benchmark metrics to visualize.")

## 6. Compare with DeepSeek-V3

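Let's compare our results with DeepSeek-V3's published results. The DeepSeek-V3 values in the cell below are placeholders; replace them with the figures from the paper before drawing any conclusions.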

In [None]:
# DeepSeek-V3 published results (placeholder values - replace with actual published results)
deepseek_results = {
    "MMLU": 0.80,
    "GSM8K": 0.85,
    "Perplexity": 0.90  # Normalized for visualization
}

# Filter out benchmarks we didn't evaluate (sorted list gives a stable bar order and valid tick labels)
common_benchmarks = sorted(set(benchmark_metrics.keys()) & set(deepseek_results.keys()))
filtered_our_results = {k: benchmark_metrics[k] for k in common_benchmarks}
filtered_deepseek_results = {k: deepseek_results[k] for k in common_benchmarks}

if filtered_our_results:
    # Create comparison bar chart
    plt.figure(figsize=(12, 6))
    
    x = np.arange(len(common_benchmarks))
    width = 0.35
    
    plt.bar(x - width/2, [filtered_our_results[k] for k in common_benchmarks], width, label="Our Model")
    plt.bar(x + width/2, [filtered_deepseek_results[k] for k in common_benchmarks], width, label="DeepSeek-V3")
    
    plt.xlabel("Benchmark")
    plt.ylabel("Score")
    plt.title("Comparison with DeepSeek-V3")
    plt.xticks(x, common_benchmarks)
    plt.ylim(0, 1)
    plt.legend()
    
    # Add value labels
    for i, benchmark in enumerate(common_benchmarks):
        plt.text(i - width/2, filtered_our_results[benchmark] + 0.02, f"{filtered_our_results[benchmark]:.4f}", ha="center")
        plt.text(i + width/2, filtered_deepseek_results[benchmark] + 0.02, f"{filtered_deepseek_results[benchmark]:.4f}", ha="center")
    
    plt.tight_layout()
    plt.savefig("benchmark_results/comparison_with_deepseek.png")
    plt.show()
    
    # Calculate and print performance gap
    print("Performance Gap Analysis:")
    for benchmark in common_benchmarks:
        gap = filtered_our_results[benchmark] - filtered_deepseek_results[benchmark]
        print(f"  {benchmark}: {gap:+.4f} ({gap * 100:+.2f} percentage points)")
else:
    print("No common benchmarks to compare.")

## 7. Analyze Model Behavior

Let's analyze the model's behavior by examining the expert routing patterns.

In [None]:
# Define a function to extract expert routing patterns
def get_expert_routing_patterns(model, tokenizer, text, device):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    # Create hooks to capture router outputs
    router_outputs = []
    hooks = []
    
    def hook_fn(module, input, output):
        # Capture routing weights; this assumes the router's forward output is a
        # tuple whose third element holds them
        router_outputs.append(output[2].detach().cpu().numpy())
    
    # Register hooks for all MoE layers
    for name, module in model.named_modules():
        if "moe_layer" in name and hasattr(module, "router"):
            hook = module.router.register_forward_hook(hook_fn)
            hooks.append(hook)
    
    # Forward pass
    with torch.no_grad():
        model(**inputs)
    
    # Remove hooks
    for hook in hooks:
        hook.remove()
    
    return router_outputs

# Analyze expert routing for a sample text
try:
    sample_text = "The Mixture of Experts architecture is a neural network design that routes different inputs to different specialized sub-networks (experts) based on the input content."
    
    print("Analyzing expert routing patterns...")
    routing_patterns = get_expert_routing_patterns(model, tokenizer, sample_text, device)
    
    if routing_patterns:
        # Visualize routing patterns for the first layer
        plt.figure(figsize=(12, 6))
        plt.imshow(routing_patterns[0][0], aspect="auto", cmap="viridis")
        plt.colorbar(label="Routing Weight")
        plt.xlabel("Expert")
        plt.ylabel("Token")
        plt.title("Expert Routing Pattern (First Layer)")
        plt.tight_layout()
        plt.savefig("benchmark_results/expert_routing_pattern.png")
        plt.show()
        
        # Calculate expert utilization
        expert_utilization = routing_patterns[0][0].sum(axis=0) / routing_patterns[0][0].sum()
        
        plt.figure(figsize=(10, 6))
        plt.bar(range(len(expert_utilization)), expert_utilization)
        plt.xlabel("Expert")
        plt.ylabel("Utilization")
        plt.title("Expert Utilization (First Layer)")
        plt.tight_layout()
        plt.savefig("benchmark_results/expert_utilization.png")
        plt.show()
    else:
        print("No routing patterns captured.")
except Exception as e:
    print(f"Error analyzing expert routing: {e}")
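
The utilization plot above shows each expert's share of the total routing weight. The sketch below reproduces that calculation on toy data using a common top-k gating rule: softmax over the router logits, keep the `k` largest probabilities, and renormalize them. It is a plain-Python illustration under that assumption; the repository's router may use a different gating function, and the helper names are hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Return {expert_index: weight} for one token using softmax top-k gating."""
    peak = max(router_logits)
    exps = [math.exp(value - peak) for value in router_logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Select the k most probable experts and renormalize their weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def expert_utilization(per_token_logits, num_experts, k=2):
    """Share of total routing weight received by each expert across all tokens."""
    load = [0.0] * num_experts
    for logits in per_token_logits:
        for expert, weight in top_k_route(logits, k).items():
            load[expert] += weight
    total = sum(load)
    return [value / total for value in load]


# Two tokens, four experts, two experts per token.
toy_logits = [
    [5.0, 1.0, 0.0, -1.0],   # routes to experts 0 and 1
    [0.0, 4.0, 3.0, -2.0],   # routes to experts 1 and 2
]
utilization = expert_utilization(toy_logits, num_experts=4, k=2)
print(utilization)
```

A perfectly balanced router would give each of the four experts a share of 0.25; large deviations, or experts stuck at zero as expert 3 is here, indicate load imbalance.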

## 8. Conclusion

In this notebook, we evaluated a Mixture of Experts (MoE) Large Language Model on perplexity, MMLU, and GSM8K, and compared its scores with reference values for DeepSeek-V3. We also examined the model's expert routing patterns.

### Summary of Results

The final values are printed by the evaluation cells above and saved to `benchmark_results/all_metrics.json` under these keys:

- Perplexity: `perplexity`
- MMLU Accuracy: `mmlu_accuracy`
- GSM8K Accuracy: `gsm8k_accuracy`

### Next Steps

- Train the model on larger datasets for better performance
- Evaluate on more benchmarks
- Fine-tune the model on specific tasks
- Optimize the model for inference

### Resources

- [GitHub Repository](https://github.com/your-username/moe-llm)
- [Technical Report](https://github.com/your-username/moe-llm/blob/main/report/technical_report.md)
- [DeepSeek-V3 Paper](https://arxiv.org/pdf/2412.19437)