# Evaluating a Mixture of Experts (MoE) LLM

This notebook demonstrates how to evaluate a Mixture of Experts (MoE) Large Language Model: it measures perplexity on WikiText-2 and accuracy on the MMLU, GSM8K, MATH, and BBH benchmarks.

## Setup

First, let's install the required dependencies and set up the environment.

In [None]:
# Clone the repository
!git clone https://github.com/your-username/moe-llm.git
# `!cd` runs in a throwaway subshell; `%cd` changes the notebook's working directory
%cd moe-llm

In [None]:
# Install dependencies into the environment of the running kernel
%pip install -r requirements.txt

In [None]:
# Import libraries
import os
import torch
import logging
import json
from transformers import AutoTokenizer

# Import our modules
from model.config import MoEConfig, EvaluationConfig
from model.model import MoELLM
from data.tokenizer import get_tokenizer
from evaluation.benchmarks import BenchmarkEvaluator
from evaluation.perplexity import calculate_perplexity
from utils.logging import configure_logging
from utils.visualization import plot_benchmark_results, plot_comparison_with_deepseek

# Configure logging (create the log directory first in case it does not exist)
os.makedirs("logs", exist_ok=True)
logger = configure_logging(log_level="INFO", log_file="logs/evaluation.log")

## Load Model and Tokenizer

Let's load the trained model and tokenizer.

In [None]:
# Load tokenizer
tokenizer = get_tokenizer(
    tokenizer_name_or_path="model",  # Path to the saved model
    use_fast=True
)

In [None]:
# Load model configuration
model_config = MoEConfig.from_pretrained("model")

# Initialize model
model = MoELLM(model_config)

# Load model weights
model.load_state_dict(torch.load("model/pytorch_model.bin", map_location="cpu"))

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set model to evaluation mode
model.eval()

## Calculate Perplexity

Let's calculate the perplexity of the model on the WikiText-2 validation split.

In [None]:
# Load validation dataset
from datasets import load_dataset

validation_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")

# Calculate perplexity
perplexity_metrics = calculate_perplexity(
    model=model,
    tokenizer=tokenizer,
    texts=validation_dataset["text"],
    device=device,
    batch_size=1,
    stride=512,
    max_length=1024
)

print(f"Perplexity: {perplexity_metrics['perplexity']:.4f}")
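
The `stride` and `max_length` arguments suggest a sliding-window evaluation. The sketch below illustrates that scheme under stated assumptions; it is not the actual `calculate_perplexity` implementation, and `window_spans` and `perplexity_from_nll` are hypothetical helper names. Each window feeds up to `max_length` tokens to the model but scores only the tokens that earlier windows have not scored yet, and the perplexity is the exponential of the mean per-token negative log-likelihood.

```python
import math


def window_spans(n_tokens, max_length=1024, stride=512):
    """Return (begin, end, n_scored) for each evaluation window.

    Tokens [begin, end) are fed to the model, but only the last n_scored
    tokens contribute to the loss, so every token is scored exactly once.
    Assumes stride <= max_length.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


def perplexity_from_nll(total_nll, n_scored_tokens):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(total_nll / n_scored_tokens)


# Hypothetical 2,500-token text, using the notebook's stride and max_length.
for begin, end, n_scored in window_spans(2500, max_length=1024, stride=512):
    print(f"context tokens [{begin}, {end}), scored tokens: {n_scored}")
```

A smaller `stride` gives each scored token more left context, which usually lowers the reported perplexity at the cost of more forward passes.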

## Evaluate on Benchmarks

Let's evaluate the model on four benchmarks: MMLU, GSM8K, MATH, and BBH.

In [None]:
# Initialize benchmark evaluator
evaluator = BenchmarkEvaluator(
    model=model,
    tokenizer=tokenizer,
    device=device,
    output_dir="benchmark_results"
)

In [None]:
# Evaluate on MMLU
mmlu_metrics = evaluator.evaluate_mmlu(
    data_path="cais/mmlu",
    num_few_shot=5,
    batch_size=4
)

print(f"MMLU accuracy: {mmlu_metrics['mmlu_accuracy']:.4f}")
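
To make `num_few_shot=5` concrete, here is a minimal sketch of how a few-shot MMLU prompt is commonly assembled. How `BenchmarkEvaluator` builds its prompts is not shown in this notebook, so `format_example` and `build_prompt` are hypothetical helpers, and the `question` / `choices` / `answer` fields (with `answer` as an integer index) are an assumption about the dataset layout. The two questions are made up for illustration.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]


def format_example(question, choices, answer_index=None):
    """Render one question; leave the answer blank for the query example."""
    lines = [question.strip()]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    answer = "" if answer_index is None else f" {CHOICE_LABELS[answer_index]}"
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_prompt(subject, few_shot_examples, test_example):
    """Concatenate a header, the solved examples, and the unsolved query."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [
        format_example(ex["question"], ex["choices"], ex["answer"])
        for ex in few_shot_examples
    ]
    query = format_example(test_example["question"], test_example["choices"])
    return "\n\n".join([header, *shots, query])


# Illustrative examples only (not taken from the dataset).
shot = {"question": "What is 2 + 3?", "choices": ["4", "5", "6", "7"], "answer": 1}
query = {"question": "What is 3 * 3?", "choices": ["6", "8", "9", "12"]}
print(build_prompt("elementary mathematics", [shot], query))
```

The model's prediction is then typically taken as whichever of the labels A to D it scores highest as the next token after the final `Answer:`.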

In [None]:
# Evaluate on GSM8K
gsm8k_metrics = evaluator.evaluate_gsm8k(
    data_path="gsm8k",
    split="test",
    num_few_shot=8,
    batch_size=4
)

print(f"GSM8K accuracy: {gsm8k_metrics['gsm8k_accuracy']:.4f}")
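
GSM8K reference solutions end with a line of the form `#### <number>`, so accuracy is usually an exact match on the final number. The sketch below shows one way to do that; it is an assumption about the scoring, not the actual `evaluate_gsm8k` logic, and all helper names are hypothetical. It takes the last number in the model's generation as its prediction.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _to_float(token):
    """Convert a matched number such as '1,234' to a float."""
    return float(token.replace(",", ""))


def extract_reference_answer(solution):
    """Return the number after the final '####' marker, or None if absent."""
    if "####" not in solution:
        return None
    match = _NUMBER.search(solution.split("####")[-1])
    return _to_float(match.group()) if match else None


def extract_predicted_answer(generation):
    """Heuristic: treat the last number in the generation as the answer."""
    matches = _NUMBER.findall(generation)
    return _to_float(matches[-1]) if matches else None


def is_correct(generation, solution, tol=1e-6):
    """Compare predicted and reference answers numerically."""
    predicted = extract_predicted_answer(generation)
    reference = extract_reference_answer(solution)
    return predicted is not None and reference is not None and abs(predicted - reference) <= tol


# Illustrative strings in the GSM8K answer format.
reference = "She sold 48 / 2 = 24 clips in May.\nAltogether she sold 48 + 24 = 72 clips.\n#### 72"
generation = "In May she sold 24 clips, so the total is 48 + 24 = 72."
print(is_correct(generation, reference))
```

The last-number heuristic is simple but brittle: a generation that keeps talking after stating its answer can be scored as wrong.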

In [None]:
# Evaluate on MATH
math_metrics = evaluator.evaluate_math(
    data_path="hendrycks_math",
    split="test",
    num_few_shot=4,
    batch_size=4
)

print(f"MATH accuracy: {math_metrics['math_accuracy']:.4f}")

In [None]:
# Evaluate on BBH
bbh_metrics = evaluator.evaluate_bbh(
    data_path="lukaemon/bbh",
    num_few_shot=3,
    batch_size=4
)

print(f"BBH accuracy: {bbh_metrics['bbh_accuracy']:.4f}")

## Combine and Visualize Results

Let's combine all the benchmark results and visualize them.

In [None]:
# Combine all benchmark results
all_metrics = {}
all_metrics.update(mmlu_metrics)
all_metrics.update(gsm8k_metrics)
all_metrics.update(math_metrics)
all_metrics.update(bbh_metrics)
all_metrics["perplexity"] = perplexity_metrics["perplexity"]

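# Save all results (create the directory in case the evaluator has not done so)
os.makedirs("benchmark_results", exist_ok=True)
with open("benchmark_results/all_benchmark_results.json", "w") as f:
    json.dump(all_metrics, f, indent=2)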

# Plot benchmark results
plot_path = plot_benchmark_results(
    results_file="benchmark_results/all_benchmark_results.json",
    output_dir="plots",
    output_name="benchmark_results.png"
)

# Display the plot
from IPython.display import Image
Image(filename=plot_path)

## Compare with DeepSeek-V3

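Let's compare our results with DeepSeek-V3's published results. The reference values in the next cell are placeholders and must be replaced with the published numbers before drawing any conclusions.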

In [None]:
# DeepSeek-V3 reference scores
# NOTE: these are placeholder values, not published results; replace them before comparing
deepseek_results = {
    "mmlu_accuracy": 0.80,
    "gsm8k_accuracy": 0.85,
    "math_accuracy": 0.50,
    "bbh_accuracy": 0.75
}

# Plot comparison
comparison_plot_path = plot_comparison_with_deepseek(
    our_results=all_metrics,
    deepseek_results=deepseek_results,
    output_dir="plots",
    output_name="comparison_with_deepseek.png"
)

# Display the plot
Image(filename=comparison_plot_path)

## Analyze Performance Discrepancies

Let's analyze the performance discrepancies between our model and DeepSeek-V3.

In [None]:
# Calculate performance discrepancies
discrepancies = {}
for key in deepseek_results:
    if key in all_metrics:
        discrepancies[key] = all_metrics[key] - deepseek_results[key]

# Print discrepancies (ours minus DeepSeek-V3); a difference of two accuracies
# is expressed in percentage points, not percent
print("Performance Discrepancies:")
for key, value in discrepancies.items():
    print(f"{key}: {value:+.4f} ({value * 100:+.2f} percentage points)")
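
When reading these gaps, it helps to know how much of a difference could be sampling noise from a finite test set. The sketch below uses the binomial standard error of an accuracy estimate as a rough guide; it treats the reference score as exact, and the helper names and the 0.80 / 0.85 / 0.81 scores are hypothetical. The GSM8K test split contains 1,319 problems.

```python
import math


def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def gap_exceeds_noise(our_accuracy, n_examples, reference_accuracy, z=1.96):
    """True if the gap is larger than a ~95% normal-approximation margin."""
    margin = z * accuracy_standard_error(our_accuracy, n_examples)
    return abs(our_accuracy - reference_accuracy) > margin


# Hypothetical scores on the 1,319 GSM8K test problems.
n_gsm8k = 1319
se = accuracy_standard_error(0.80, n_gsm8k)
print(f"standard error: {se:.4f}")
print(f"0.80 vs 0.85 exceeds noise: {gap_exceeds_noise(0.80, n_gsm8k, 0.85)}")
print(f"0.80 vs 0.81 exceeds noise: {gap_exceeds_noise(0.80, n_gsm8k, 0.81)}")
```

This is only a first-order check: it ignores prompt-format and decoding differences, which can shift benchmark scores by more than the sampling error.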

## Generate Text with the Model

Let's generate some text with the model to demonstrate its capabilities.

In [None]:
# Define prompts
prompts = [
    "Explain the concept of a Mixture of Experts (MoE) architecture in simple terms.",
    "What are the advantages of using a Mixture of Experts model compared to a dense model?",
    "Solve the following math problem step by step: If a train travels at 60 mph for 3 hours and then at 80 mph for 2 hours, what is the average speed for the entire journey?"
]

# Generate text for each prompt
for prompt in prompts:
    print(f"Prompt: {prompt}\n")
    
    # Tokenize prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    
    # Generate text
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=input_ids,
            max_length=input_ids.shape[1] + 200,
            do_sample=True,
            temperature=0.7,
            top_p=0.9
        )
    
    # Decode output
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    # Print generated text
    print(f"Generated text: {generated_text}\n")
    print("-" * 80 + "\n")
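
The `temperature` and `top_p` arguments above control how each next token is sampled. The internals of `model.generate` are not shown here, so the following is a sketch of the standard definitions rather than this model's implementation: temperature rescales the logits before the softmax, and top-p (nucleus) sampling keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. All function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.7):
    """Softmax over logits / temperature; values below 1 sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index from the temperature-scaled, top-p-filtered distribution."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


# Toy 4-token vocabulary: the least likely token falls outside the nucleus.
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

Because sampling is stochastic, rerunning the generation cell produces different text each time unless the random seed is fixed.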

## Conclusion

In this notebook, we evaluated a Mixture of Experts (MoE) Large Language Model on MMLU, GSM8K, MATH, and BBH, measured its perplexity on WikiText-2, and compared its scores against DeepSeek-V3 reference values. We also quantified the per-benchmark gaps and sampled text from the model to illustrate its generation capabilities.