# Training a Mixture of Experts (MoE) LLM in Google Colab

This notebook demonstrates how to train a Mixture of Experts (MoE) Large Language Model in Google Colab using memory-efficient techniques: an 8-bit Adam optimizer, gradient checkpointing, bf16 mixed precision, and gradient accumulation.

## Setup

First, let's install the required dependencies and set up the environment.

In [None]:
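# Clone the repository
!git clone https://github.com/your-username/moe-llm.git
# Use the %cd magic here: `!cd` runs in a temporary subshell, so the directory change would not persist
%cd moe-llm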

In [None]:
# Install dependencies
!pip install -r requirements.txt

In [None]:
# Import libraries
import os
import torch
import logging
from datasets import load_dataset
from transformers import AutoTokenizer

# Import our modules
from model.config import MoEConfig, TrainingConfig
from model.model import MoELLM
from data.dataset import load_and_prepare_datasets, create_dataloaders
from data.tokenizer import get_tokenizer
from training.trainer import MoETrainer, create_optimizer, create_lr_scheduler
from training.memory_utils import get_memory_stats, print_memory_stats, optimize_memory_efficiency
from utils.logging import configure_logging, TensorboardLogger, WandBLogger

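# Configure logging
# Create the log directory up front in case configure_logging does not do so itself
os.makedirs("logs", exist_ok=True)
logger = configure_logging(log_level="INFO", log_file="logs/training.log")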

## Check Available Resources

Let's check the available resources in Google Colab.

In [None]:
# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

# Check CPU resources
import psutil
print(f"CPU count: {psutil.cpu_count()}")
print(f"RAM: {psutil.virtual_memory().total / 1024**3:.2f} GB")

## Configure Model and Training

Let's configure the model and training parameters.

In [None]:
# Model configuration
model_config = MoEConfig(
    vocab_size=32000,
    hidden_size=1024,  # Reduced for Colab
    intermediate_size=2816,  # Reduced for Colab
    num_hidden_layers=16,  # Reduced for Colab
    num_attention_heads=16,  # Reduced for Colab
    num_experts=8,
    num_experts_per_token=2,
    max_position_embeddings=2048,  # Reduced for Colab
    max_sequence_length=2048,  # Reduced for Colab
    use_mla=True,  # Multi-head Latent Attention
    mla_dim=64,  # Reduced for Colab
    use_mtp=True,  # Multi-Token Prediction
    mtp_num_tokens=2  # Reduced for Colab
)

# Training configuration
training_config = TrainingConfig(
    batch_size=2,  # Reduced for Colab
    gradient_accumulation_steps=8,  # Increased for Colab
    learning_rate=5e-5,
    weight_decay=0.01,
    max_steps=10000,  # Reduced for Colab
    warmup_steps=500,
    optimizer_type="8bit-adam",  # Memory-efficient optimizer
    lr_scheduler_type="cosine",
    use_gradient_checkpointing=True,  # Memory optimization
    mixed_precision="bf16",  # Memory optimization; needs a GPU with bf16 support (Ampere or newer, so not a T4)
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    max_seq_length=2048,  # Reduced for Colab
    preprocessing_num_workers=2  # Reduced for Colab
)
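
How `MoELLM` routes tokens to experts is not shown in this notebook. A common scheme is softmax gating followed by top-k selection, and the sketch below illustrates what `num_experts=8` with `num_experts_per_token=2` would mean under that assumption. The helper `top_k_route` and the example logits are hypothetical and not part of the repository.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# One token, 8 experts, 2 experts per token (made-up router scores).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top_k_route(logits, k=2)
print(routes)
```

Only the selected experts run for a given token, which is why an MoE model can hold many more parameters than it uses per forward pass.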

## Load Tokenizer and Datasets

Let's load the tokenizer and prepare the datasets.

In [None]:
# Load tokenizer
tokenizer = get_tokenizer(
    tokenizer_name_or_path="EleutherAI/gpt-neox-20b",  # Using an existing tokenizer
    use_fast=True
)

# Update model config with tokenizer vocab size
model_config.vocab_size = len(tokenizer)

In [None]:
# Load and prepare datasets
# Using a smaller dataset for Colab
dataset_paths = ["roneneldan/TinyStories"]

datasets = load_and_prepare_datasets(
    tokenizer=tokenizer,
    dataset_paths=dataset_paths,
    max_seq_length=training_config.max_seq_length,
    streaming=False,  # Set to True for larger datasets
    text_column="text",
    preprocessing_num_workers=training_config.preprocessing_num_workers
)

# Create dataloaders
dataloaders = create_dataloaders(
    datasets=datasets,
    batch_size=training_config.batch_size,
    num_workers=training_config.preprocessing_num_workers
)
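
The small per-device batch is offset by gradient accumulation. The arithmetic below works out how much data feeds each optimizer update, using the values from `TrainingConfig`. The total-token figure assumes a single GPU, that every sequence is packed to the full `max_seq_length`, and that `max_steps` counts optimizer updates; none of this is confirmed by the notebook.

```python
# Values copied from the TrainingConfig cell.
batch_size = 2
gradient_accumulation_steps = 8
max_seq_length = 2048
max_steps = 10000

# Sequences and tokens contributing to one optimizer update.
sequences_per_update = batch_size * gradient_accumulation_steps
tokens_per_update = sequences_per_update * max_seq_length

# Upper bound on tokens seen during training (assumes full-length sequences
# and that max_steps counts optimizer updates, not micro-batches).
max_total_tokens = tokens_per_update * max_steps

print(f"Sequences per update: {sequences_per_update}")  # 16
print(f"Tokens per update:    {tokens_per_update}")     # 32768
print(f"Max tokens in run:    {max_total_tokens}")      # 327680000
```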

## Initialize Model

Let's initialize the model with the configured parameters.

In [None]:
# Initialize model
model = MoELLM(model_config)

# Apply memory optimizations
model = optimize_memory_efficiency(model)

# Print model size
from training.memory_utils import get_model_size
model_size = get_model_size(model)
print(f"Model size: {model_size['total_parameters'] / 1e6:.2f}M parameters")
print(f"Parameter size: {model_size['parameter_size_mb']:.2f} MB")
print(f"Estimated total size: {model_size['estimated_total_size_mb']:.2f} MB")
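
As a sanity check on the reported size, the parameter count can be estimated by hand. This is a rough sketch under stated assumptions: standard attention with four `hidden_size × hidden_size` projections (MLA would be smaller), gated feed-forward experts with three weight matrices each, one linear router per layer, and untied input and output embeddings. Normalization layers and the multi-token prediction heads are ignored, and the real `MoELLM` layout may differ.

```python
# Values copied from the MoEConfig cell; everything else is an assumption.
hidden_size = 1024
intermediate_size = 2816
num_hidden_layers = 16
num_experts = 8
num_experts_per_token = 2
vocab_size = 32000  # the notebook later overwrites this with len(tokenizer)

# Assumed layer shapes (not taken from MoELLM):
attention_params = 4 * hidden_size * hidden_size     # Q, K, V, O projections
expert_params = 3 * hidden_size * intermediate_size  # gate, up, down projections
router_params = hidden_size * num_experts            # one linear router per layer
embedding_params = 2 * vocab_size * hidden_size      # untied input/output embeddings

total_params = num_hidden_layers * (
    attention_params + num_experts * expert_params + router_params
) + embedding_params
active_params = num_hidden_layers * (
    attention_params + num_experts_per_token * expert_params + router_params
) + embedding_params

# Assumed bytes per parameter: bf16 weights (2) + bf16 gradients (2)
# + two 8-bit Adam moment buffers (1 + 1). Activations are not included.
bytes_per_param = 2 + 2 + 1 + 1
train_state_gib = total_params * bytes_per_param / 1024**3

print(f"Total parameters: {total_params / 1e6:.1f}M")
print(f"Active per token: {active_params / 1e6:.1f}M")
print(f"Weights + grads + optimizer state: {train_state_gib:.2f} GiB")
```

Under these assumptions, memory scales with the total parameter count, while per-token compute scales with the much smaller active count.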

## Set Up Training

Let's set up the optimizer, learning rate scheduler, and trainer.

In [None]:
# Create optimizer
optimizer = create_optimizer(
    model=model,
    learning_rate=training_config.learning_rate,
    weight_decay=training_config.weight_decay,
    optimizer_type=training_config.optimizer_type
)

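# Calculate number of training steps
# max_steps is already the total step count (assuming the trainer stops after max_steps),
# so the schedule should span exactly that many steps. Multiplying it by the dataloader
# length would stretch the cosine decay far beyond the end of training.
num_training_steps = training_config.max_steps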

# Create learning rate scheduler
lr_scheduler = create_lr_scheduler(
    optimizer=optimizer,
    num_training_steps=num_training_steps,
    warmup_steps=training_config.warmup_steps,
    lr_scheduler_type=training_config.lr_scheduler_type
)

# Initialize trainer
trainer = MoETrainer(
    model=model,
    train_dataloader=dataloaders["train"],
    eval_dataloader=dataloaders.get("validation"),
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    config=training_config,
    use_wandb=False  # Set to True to use Weights & Biases
)
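
The internals of `create_lr_scheduler` are not shown here. The sketch below illustrates the usual shape of a `"cosine"` schedule with warmup: a linear ramp from zero to the base learning rate, followed by a half-cosine decay to zero. The function `lr_at_step` is a hypothetical stand-in, and the repository's scheduler may differ in details such as a minimum learning rate.

```python
import math

def lr_at_step(step, base_lr=5e-5, warmup_steps=500, total_steps=10000):
    """Learning rate at a given step for linear warmup plus cosine decay."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 250, 500, 5250, 10000):
    print(step, lr_at_step(step))
```

If `total_steps` is set far larger than the number of steps actually run, the decay phase barely begins and the learning rate stays near its peak for the whole run.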

## Train the Model

Now, let's train the model.

In [None]:
# Train the model
trainer.train()

## Evaluate the Model

Let's evaluate the trained model.

In [None]:
# Evaluate the model
eval_metrics = trainer.evaluate()
print(f"Evaluation metrics: {eval_metrics}")
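
Language-model evaluation loss is often easier to interpret as perplexity. The keys returned by `trainer.evaluate()` are not documented in this notebook, so the sketch below assumes a hypothetical `"loss"` entry holding the mean per-token cross-entropy in nats, and uses a made-up value.

```python
import math

def perplexity(mean_nll):
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll)

# Made-up metrics dict; the real keys of eval_metrics may differ.
example_metrics = {"loss": 2.0}
print(f"Perplexity: {perplexity(example_metrics['loss']):.2f}")
```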

## Save the Model

Let's save the trained model for inference.

In [None]:
# Save the model
from utils.checkpoint import save_model_for_inference

save_model_for_inference(
    model=model,
    tokenizer=tokenizer,
    output_dir="model",
    save_format="pytorch",
    quantization="int8"  # Apply quantization for smaller model size
)
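
The quantization scheme behind `quantization="int8"` is not specified in this notebook. One simple possibility is symmetric per-tensor quantization, where each weight is divided by a shared scale and rounded to an integer in [-127, 127]. The helpers `quantize_int8` and `dequantize_int8` below are hypothetical illustrations of that idea, not the repository's implementation.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (int8 values, scale)."""
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float values from int8 values and their scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q, scale)
print(restored)
```

Each weight then takes one byte instead of two (bf16) or four (fp32), at the cost of a rounding error of at most half the scale per value.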

## Generate Text with the Model

Let's generate some text with the trained model.

In [None]:
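# Tokenize the prompt and move it to the model's device
# (read the device from the parameters, since a plain nn.Module has no .device attribute)
model.eval()  # disable dropout for generation
device = next(model.parameters()).device
prompt = "Once upon a time, there was a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate text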
with torch.no_grad():
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# Decode output
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
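
The `temperature` and `top_p` arguments control how the next token is sampled. How `model.generate` implements them is not shown, so the following is a minimal sketch of the usual approach: divide the logits by the temperature, apply softmax, keep the smallest set of tokens whose cumulative probability reaches `top_p`, and sample from that set. The function `sample_top_p` is hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample within the nucleus, weighted by the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

print(sample_top_p([2.0, 1.5, 0.3, -1.0]))
```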

## Conclusion

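In this notebook, we trained a Mixture of Experts (MoE) Large Language Model in Google Colab, relying on 8-bit Adam, gradient checkpointing, bf16 mixed precision, and gradient accumulation to fit within Colab's memory. We then evaluated the model, saved an int8-quantized copy for inference, and generated text from a prompt. The model can be further improved by training on larger datasets and with more resources.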