<a href="https://colab.research.google.com/github/nassro199/SimpleLLM/blob/main/notebooks/SimpleLLM_colab_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training SimpleLLM in Google Colab

This notebook walks through an end-to-end workflow for training SimpleLLM, a Mixture of Experts (MoE) language model, in Google Colab. The configuration is scaled down to fit Colab's resource limits and relies on memory-saving techniques (gradient checkpointing, mixed precision, and an 8-bit optimizer) to make training feasible on a single GPU.

## Overview

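1. **Setup Environment**: Clone the repository and install dependencies
2. **Check Available Resources**: Verify GPU and memory availability
3. **Configure Model**: Set up model architecture and training parameters
4. **Prepare Data**: Download and preprocess training data
5. **Initialize Model**: Build the model and apply memory optimizations
6. **Set Up Training**: Create the optimizer, learning rate scheduler, and trainer
7. **Train the Model**: Train the model with memory-efficient techniques
8. **Generate Text**: Test the model with simple text generation
9. **Save the Model**: Save the model for future use
10. **Download the Model**: Export the saved model to your local machine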

Let's get started!

## 1. Setup Environment

First, let's check what GPU we have available and set up our environment.

In [None]:
# Check GPU availability
!nvidia-smi

Now, let's clone the repository and install the required dependencies:

In [None]:
# Clone the repository
!git clone https://github.com/nassro199/SimpleLLM.git
%cd SimpleLLM

In [None]:
# Install dependencies
!pip install -r requirements.txt

# Install additional dependencies for Colab
# (gputil provides the GPUtil module imported in the resource check)
!pip install accelerate bitsandbytes sentencepiece datasets wandb gputil

## 2. Check Available Resources

Let's check the available resources in detail to help us configure our model appropriately:

In [None]:
import torch
import psutil
import os
import GPUtil

# Check CPU resources
print(f"CPU Count: {psutil.cpu_count()}")
print(f"Available Memory: {psutil.virtual_memory().available / (1024**3):.2f} GB")

# Check GPU resources
if torch.cuda.is_available():
    print(f"\nGPU Information:")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.2f} GB")
    
    # Get more detailed GPU info
    gpus = GPUtil.getGPUs()
    for i, gpu in enumerate(gpus):
        print(f"\nGPU {i}: {gpu.name}")
        print(f"Memory Free: {gpu.memoryFree} MB")
        print(f"Memory Used: {gpu.memoryUsed} MB")
        print(f"Memory Total: {gpu.memoryTotal} MB")
        print(f"GPU Utilization: {gpu.load*100:.2f}%")
else:
    print("No GPU available. Please change runtime type to include a GPU.")
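
The GPU reported above also determines which mixed-precision mode is usable. Native bf16 support arrived with CUDA compute capability 8.0 (Ampere), so older Colab GPUs such as the T4 (compute capability 7.5) are limited to fp16. The following is a minimal sketch of that decision. `pick_mixed_precision` and `detect_capability` are hypothetical helper names, not part of the repository, and the `"no"` fallback is an assumption about what the training configuration accepts.

```python
from typing import Optional, Tuple


def pick_mixed_precision(capability: Optional[Tuple[int, int]]) -> str:
    """Map a CUDA compute capability (major, minor) to a precision mode.

    None means no GPU was found. The "no" return value is an assumed
    placeholder for "disable mixed precision".
    """
    if capability is None:
        return "no"
    major, _minor = capability
    # Native bf16 requires compute capability 8.0 (Ampere) or newer.
    return "bf16" if major >= 8 else "fp16"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


print(f"Suggested mixed precision: {pick_mixed_precision(detect_capability())}")
```

The returned string could then be used as the `mixed_precision` value when building the training configuration.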

With the available resources in view, we can configure the model. First, let's define a helper function to monitor GPU memory usage during training and import the repository's memory utilities:

In [None]:
def print_gpu_memory_summary():
    """Print a summary of GPU memory usage."""
    if torch.cuda.is_available():
        print("\nGPU Memory Summary:")
        print(f"Allocated: {torch.cuda.memory_allocated() / (1024**3):.2f} GB")
        print(f"Reserved: {torch.cuda.memory_reserved() / (1024**3):.2f} GB")
        print(f"Max Allocated: {torch.cuda.max_memory_allocated() / (1024**3):.2f} GB")
        print(f"Max Reserved: {torch.cuda.max_memory_reserved() / (1024**3):.2f} GB")
    else:
        print("No GPU available.")

# Import our memory utilities
from training.memory_utils import get_memory_stats, print_memory_stats, clear_memory

# Clear memory and print initial stats
clear_memory()
print_gpu_memory_summary()
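
Beyond a one-off summary, it can be useful to know the peak memory of a specific step, such as model initialization or a single training batch. The sketch below wraps a block of code in a context manager that resets PyTorch's peak counter on entry and reads it on exit. `track_peak_gpu_memory` is a hypothetical helper, not one of the repository's utilities, and it reports 0.0 GB when no GPU (or no PyTorch install) is present.

```python
import contextlib

try:
    import torch
    _CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:  # allows the sketch to run without PyTorch installed
    torch = None
    _CUDA_AVAILABLE = False


@contextlib.contextmanager
def track_peak_gpu_memory(label):
    """Measure peak allocated GPU memory (in GB) inside a `with` block."""
    stats = {"label": label, "peak_gb": 0.0}
    if _CUDA_AVAILABLE:
        # Reset the peak counter so earlier allocations are not counted.
        torch.cuda.reset_peak_memory_stats()
    try:
        yield stats
    finally:
        if _CUDA_AVAILABLE:
            stats["peak_gb"] = torch.cuda.max_memory_allocated() / (1024 ** 3)
        print(f"[{label}] peak allocated: {stats['peak_gb']:.2f} GB")


# Usage: wrap any step whose memory footprint you want to inspect.
with track_peak_gpu_memory("example step") as step_stats:
    pass  # e.g. build the model or run one training batch here
```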

## 3. Configure Model

Now, let's configure our model architecture and training parameters based on the available resources. We'll use a smaller configuration for Colab's constraints.

In [None]:
# Import our configuration classes
from model.config import MoEConfig, TrainingConfig

# Define model configuration
# Adjust these parameters based on your available GPU memory
model_config = MoEConfig(
    vocab_size=32000,
    hidden_size=768,  # Reduced for Colab
    intermediate_size=2048,  # Reduced for Colab
    num_hidden_layers=12,  # Reduced for Colab
    num_attention_heads=12,  # Reduced for Colab
    num_experts=8,
    num_experts_per_token=2,
    expert_capacity=0,  # Auto-calculate
    router_jitter_noise=0.1,
    router_z_loss_coef=0.001,
    router_aux_loss_coef=0.001,
    max_position_embeddings=2048,  # Reduced for Colab
    max_sequence_length=2048,  # Reduced for Colab
    hidden_dropout_prob=0.1,
    attention_dropout_prob=0.1,
    use_rms_norm=True,
    position_embedding_type="rotary",
    rotary_dim=64,  # Reduced for Colab
    use_mla=True,  # Multi-head Latent Attention
    mla_dim=64,  # Reduced for Colab
    use_mtp=True,  # Multi-Token Prediction
    mtp_num_tokens=2  # Reduced for Colab
)

# Define training configuration
training_config = TrainingConfig(
    batch_size=2,  # Small batch size for Colab
    gradient_accumulation_steps=8,  # Accumulate gradients to simulate larger batch
    learning_rate=5e-5,
    weight_decay=0.01,
    max_steps=5000,  # Reduced for Colab
    warmup_steps=200,
    optimizer_type="8bit-adam",  # Memory-efficient optimizer
    lr_scheduler_type="cosine",
    use_gradient_checkpointing=True,  # Memory optimization
    mixed_precision="bf16",  # Memory optimization; bf16 needs an Ampere or newer GPU, so use "fp16" on a T4 or V100
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    max_seq_length=2048,  # Reduced for Colab
    preprocessing_num_workers=2  # Reduced for Colab
)

# Print configurations
print("Model Configuration:")
for key, value in vars(model_config).items():
    print(f"  {key}: {value}")

print("\nTraining Configuration:")
for key, value in vars(training_config).items():
    print(f"  {key}: {value}")
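
When adjusting these parameters to fit GPU memory, a rough parameter count helps before the model is actually built. The sketch below is a back-of-the-envelope estimate under stated assumptions: plain multi-head attention with four `hidden_size x hidden_size` projections, two weight matrices per expert, one linear router per layer, and tied input/output embeddings. It ignores normalization weights, biases, and the MLA and MTP components, so the real count will differ. `estimate_moe_parameters` is a hypothetical helper, not part of the repository.

```python
def estimate_moe_parameters(vocab_size, hidden_size, intermediate_size,
                            num_layers, num_experts, experts_per_token,
                            matrices_per_expert=2):
    """Rough parameter count for an MoE transformer (see assumptions above)."""
    attention = 4 * hidden_size * hidden_size          # Q, K, V, output projections
    expert = matrices_per_expert * hidden_size * intermediate_size
    router = hidden_size * num_experts                 # one linear gate per layer
    embeddings = vocab_size * hidden_size              # assumed tied with the output head

    total = embeddings + num_layers * (attention + num_experts * expert + router)
    # Only `experts_per_token` experts run for any given token.
    active = embeddings + num_layers * (attention + experts_per_token * expert + router)
    return {"total": total, "active_per_token": active}


estimate = estimate_moe_parameters(
    vocab_size=32000, hidden_size=768, intermediate_size=2048,
    num_layers=12, num_experts=8, experts_per_token=2,
)
print(f"Estimated total parameters:  {estimate['total'] / 1e6:.1f}M")
print(f"Estimated active per token:  {estimate['active_per_token'] / 1e6:.1f}M")
```

The gap between the two numbers is the point of the MoE design: all experts must fit in memory, but only a fraction of them contribute compute for each token.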

## 4. Prepare Data

Now, let's prepare the data for training. We'll use a small dataset for demonstration purposes, but you can replace it with your own.

In [None]:
# Import data utilities
from data.tokenizer import get_tokenizer
from data.dataset import load_and_prepare_datasets, create_dataloaders
from datasets import load_dataset

# Load tokenizer
# We'll use an existing tokenizer for simplicity
tokenizer = get_tokenizer(
    tokenizer_name_or_path="EleutherAI/gpt-neo-1.3B",  # Using an existing tokenizer
    use_fast=True
)

# Update model config with tokenizer vocab size
model_config.vocab_size = len(tokenizer)
print(f"Updated vocab size to {model_config.vocab_size}")

Let's load a small dataset for training. For this example, we'll use the TinyStories dataset, which is small enough for Colab but still useful for training language models.

In [None]:
# Load a small dataset for demonstration
# You can replace this with your own dataset
print("Loading dataset...")
dataset_paths = ["roneneldan/TinyStories"]

datasets = load_and_prepare_datasets(
    tokenizer=tokenizer,
    dataset_paths=dataset_paths,
    max_seq_length=training_config.max_seq_length,
    streaming=False,  # Set to True for larger datasets
    text_column="text",
    preprocessing_num_workers=training_config.preprocessing_num_workers
)

# Create dataloaders
dataloaders = create_dataloaders(
    datasets=datasets,
    batch_size=training_config.batch_size,
    num_workers=training_config.preprocessing_num_workers
)

# Print dataset information
print("\nDataset Information:")
for split, dataset in datasets.items():
    print(f"  {split}: {len(dataset)} examples")

# Show a sample from the dataset
print("\nSample from dataset:")
sample = datasets["train"][0]
print(f"Input IDs shape: {sample['input_ids'].shape}")
print(f"Attention Mask shape: {sample['attention_mask'].shape}")
print(f"Labels shape: {sample['labels'].shape}")

# Decode a sample to show the text
decoded_text = tokenizer.decode(sample['input_ids'][:50])  # Show first 50 tokens
print(f"\nDecoded sample (first 50 tokens): {decoded_text}...")
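
Each sample printed above carries `input_ids`, `attention_mask`, and `labels`. The internals of `load_and_prepare_datasets` are not shown in this notebook, so the following is only a sketch of one common way such examples are built: split a token stream into fixed-length blocks, pad the last block, and mask the padding out of the loss with the label value -100 (the default `ignore_index` of PyTorch's cross-entropy loss). `chunk_token_ids` and its `pad_id` default are hypothetical.

```python
def chunk_token_ids(token_ids, block_size, pad_id=0):
    """Split a list of token ids into padded, fixed-length training examples."""
    examples = []
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        num_real = len(block)
        num_pad = block_size - num_real
        examples.append({
            "input_ids": block + [pad_id] * num_pad,
            # 1 marks real tokens, 0 marks padding.
            "attention_mask": [1] * num_real + [0] * num_pad,
            # -100 tells the loss function to skip padded positions.
            "labels": block + [-100] * num_pad,
        })
    return examples


# Five tokens with a block size of four yield one full and one padded example.
for example in chunk_token_ids([11, 12, 13, 14, 15], block_size=4):
    print(example)
```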

## 5. Initialize Model

Now, let's initialize our SimpleLLM model with the configured parameters.

In [None]:
# Import model
from model.model import MoELLM
from training.memory_utils import get_model_size, optimize_memory_efficiency

# Initialize model
print("Initializing model...")
model = MoELLM(model_config)

# Apply memory optimizations
model = optimize_memory_efficiency(model)

# Get model size information
model_size_info = get_model_size(model)
print("\nModel Size Information:")
print(f"  Total Parameters: {model_size_info['total_parameters'] / 1e6:.2f}M")
print(f"  Trainable Parameters: {model_size_info['trainable_parameters'] / 1e6:.2f}M")
print(f"  Parameter Size: {model_size_info['parameter_size_mb']:.2f} MB")
print(f"  Estimated Total Size: {model_size_info['estimated_total_size_mb']:.2f} MB")

# Check memory usage after model initialization
print_gpu_memory_summary()
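
The model initialized above is configured with `num_experts=8` and `num_experts_per_token=2`, meaning a router picks two of the eight experts for every token. The sketch below illustrates the usual top-k formulation in plain Python: softmax over the router logits, keep the k most probable experts, and renormalize their weights. Whether SimpleLLM renormalizes in exactly this way is not shown in this notebook, so treat it as an illustration rather than the repository's implementation. `route_token` and the example logits are made up for the demonstration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / mass) for i in top]


# One token's (made-up) router logits over 8 experts.
example_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
for expert_index, weight in route_token(example_logits, k=2):
    print(f"expert {expert_index}: weight {weight:.3f}")
```

The token's output is then the weighted sum of the selected experts' outputs, which is why only a fraction of the parameters is active per token.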

## 6. Set Up Training

Now, let's set up the optimizer, learning rate scheduler, and trainer.

In [None]:
# Import training utilities
from training.trainer import MoETrainer, create_optimizer, create_lr_scheduler
from utils.logging import configure_logging

# Configure logging
logger = configure_logging(log_level="INFO", log_file="logs/training.log")

# Create optimizer
optimizer = create_optimizer(
    model=model,
    learning_rate=training_config.learning_rate,
    weight_decay=training_config.weight_decay,
    optimizer_type=training_config.optimizer_type
)

# Calculate number of training steps
# max_steps is already the total step budget, so the schedule should span
# exactly that many steps (multiplying by the dataloader length would stretch
# the cosine decay far beyond the end of training)
num_training_steps = training_config.max_steps

# Create learning rate scheduler
lr_scheduler = create_lr_scheduler(
    optimizer=optimizer,
    num_training_steps=num_training_steps,
    warmup_steps=training_config.warmup_steps,
    lr_scheduler_type=training_config.lr_scheduler_type
)

# Initialize trainer
trainer = MoETrainer(
    model=model,
    train_dataloader=dataloaders["train"],
    eval_dataloader=dataloaders.get("validation"),
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    config=training_config,
    use_wandb=False  # Set to True if you want to use Weights & Biases
)
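
To see what the scheduler settings imply, the sketch below computes the learning rate at a given step for linear warmup followed by cosine decay to zero. This is the common definition of a "cosine" schedule; whether `create_lr_scheduler` matches it exactly is an assumption, and `lr_at_step` is a hypothetical helper. The numbers mirror the configuration used in this notebook (peak 5e-5, 200 warmup steps, 5000 total steps).

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at the final value past the end
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 200, 2600, 5000):
    lr = lr_at_step(step, base_lr=5e-5, warmup_steps=200, total_steps=5000)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Note how the decay depends on `total_steps`: if that value is much larger than the number of steps actually run, training ends while the learning rate is still close to its peak.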

## 7. Train the Model

Now, let's train the model. This will take some time, so be patient. You can adjust the number of steps based on your available time and resources.

In [None]:
# Optional: Connect to Weights & Biases for experiment tracking
# Uncomment the lines below if you want to use W&B, and set use_wandb=True
# when creating the trainer
# import wandb
# wandb.login()
# wandb.init(
#     project="simplellm-colab",
#     name="simplellm-training",
#     config={
#         "model_config": vars(model_config),
#         "training_config": vars(training_config)
#     }
# )

In [None]:
# Train the model
print("Starting training...")
trainer.train()
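
When adjusting the number of steps, it helps to know how much data a run actually covers. The sketch below works through the arithmetic for this notebook's settings. It assumes that `max_steps` counts optimizer updates (not micro-batches) and that every sequence is filled to `max_seq_length`; if the trainer counts micro-batches instead, divide the totals by `gradient_accumulation_steps`. `training_budget` is a hypothetical helper.

```python
def training_budget(batch_size, gradient_accumulation_steps, seq_length, max_steps):
    """Effective batch size and token counts for a training run."""
    effective_batch = batch_size * gradient_accumulation_steps
    tokens_per_update = effective_batch * seq_length
    return {
        "effective_batch_size": effective_batch,
        "tokens_per_update": tokens_per_update,
        "total_tokens": tokens_per_update * max_steps,
    }


budget = training_budget(
    batch_size=2, gradient_accumulation_steps=8, seq_length=2048, max_steps=5000
)
print(f"Effective batch size: {budget['effective_batch_size']} sequences")
print(f"Tokens per update:    {budget['tokens_per_update']:,}")
print(f"Total tokens seen:    {budget['total_tokens']:,}")
```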

## 8. Generate Text with the Model

Let's generate some text with our trained model to see how it performs.

In [None]:
# Set model to evaluation mode
model.eval()

# Define prompts for text generation
prompts = [
    "Once upon a time, there was a",
    "The best way to learn is to",
    "In the future, artificial intelligence will"
]

# Generate text for each prompt
for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    
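    # Tokenize prompt and move it to the model's device
    # (looked up from the parameters, since a plain nn.Module has no .device attribute)
    device = next(model.parameters()).device
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)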
    
    # Generate text
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=input_ids,
            max_length=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2
        )
    
    # Decode output
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(f"Generated: {generated_text}")
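
The `temperature` and `top_p` arguments used above control how the next token is sampled. The following plain-Python sketch shows the usual meaning of both: the logits are divided by the temperature before the softmax, and nucleus (top-p) filtering keeps the smallest set of most likely tokens whose probabilities sum to at least `top_p`. How `model.generate` implements this internally is not shown in this notebook, and the repetition penalty is left out here. All function names are hypothetical.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax of logits / temperature (lower temperature = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token id from the temperature-scaled, top-p filtered distribution."""
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    token_ids = list(nucleus)
    weights = [nucleus[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Made-up logits over a 4-token vocabulary.
print(sample_next_token([2.0, 1.0, 0.1, -3.0], rng=random.Random(0)))
```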

## 9. Save the Model

Finally, let's save the trained model for future use.

In [None]:
# Import checkpoint utilities
from utils.checkpoint import save_model_for_inference

# Save the model
print("Saving model...")
output_dir = "trained_model"
save_model_for_inference(
    model=model,
    tokenizer=tokenizer,
    output_dir=output_dir,
    save_format="pytorch",
    quantization="int8"  # Apply quantization for smaller model size
)
print(f"Model saved to {output_dir}")
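
The `quantization="int8"` option trades some precision for a smaller file. The exact scheme used by `save_model_for_inference` is not shown in this notebook; the sketch below illustrates one simple and widely used variant, symmetric absmax quantization, where each weight is scaled by the largest absolute value and rounded into the range [-127, 127]. Function names and example weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to int8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.41]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

for original, q, approx in zip(weights, quantized, restored):
    print(f"{original:+.4f} -> {q:+4d} -> {approx:+.4f}")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision given up in exchange for storing one byte per weight instead of four.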

## 10. Download the Model

You can download the trained model to your local machine for future use.

In [None]:
# Zip the model directory for easier download
!zip -r trained_model.zip trained_model

# Download the model
from google.colab import files
files.download('trained_model.zip')

## 11. Conclusion

Congratulations! You've successfully trained the SimpleLLM model in Google Colab. Here's a summary of what we've accomplished:

1. Set up the environment and installed dependencies
2. Configured a memory-efficient MoE model architecture
3. Prepared a dataset for training
4. Trained the model with memory-efficient techniques
5. Generated text with the trained model
6. Saved the model for future use

### Next Steps

- Try training on larger datasets for better performance
- Experiment with different model configurations
- Fine-tune the model on specific tasks
- Use the model for various text generation tasks
- Deploy the model for inference

### Resources

- [GitHub Repository](https://github.com/nassro199/SimpleLLM)
- [Technical Report](https://github.com/nassro199/SimpleLLM/blob/main/report/technical_report.md)