# Causal Language Model (CLM) Training Tutorial

This tutorial demonstrates how to train a causal language model using the Continual Pretraining Framework. We'll cover the following topics:

1. Understanding CLM training concepts
2. Setting up the environment
3. Creating a sample dataset
4. Loading a tokenized dataset
5. Setting up the training configuration
6. Understanding distributed training strategies
7. Training a model with the ContinualOrchestrator
8. Running training with the `execute` function
9. Monitoring training progress
10. Best practices and optimization tips
11. Integration with tokenization
12. Conclusion

This tutorial assumes you have already completed the tokenization tutorial and have a tokenized dataset available.

## 1. Understanding CLM Training Concepts

Causal language modeling (CLM) is the standard pretraining objective for decoder-only language models such as GPT-2 and LLaMA. In CLM training, the model learns to predict the next token in a sequence given all previous tokens. This is also known as autoregressive language modeling.

Key concepts in CLM training include:

- **Autoregressive Prediction**: The model predicts one token at a time, with each prediction conditioned on all previous tokens.
- **Causal Attention Mask**: Ensures that the model can only attend to previous tokens in the sequence, not future ones.
- **Next Token Prediction Loss**: The training objective is to minimize the negative log-likelihood of predicting the correct next token.
- **Distributed Training**: Large models often require training across multiple GPUs or nodes using strategies like FSDP, DDP, or DeepSpeed.
- **Gradient Accumulation**: Accumulating gradients over several batches before each optimizer step to simulate a larger batch size.
- **Learning Rate Scheduling**: Adjusting the learning rate during training to improve convergence.

The Continual Pretraining Framework provides a comprehensive implementation for CLM training with various distributed training strategies, making it easy to train large language models efficiently.
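
To make the causal mask and the next-token loss concrete, here is a minimal pure-Python sketch. It is not the framework's implementation; the helper names `causal_mask` and `next_token_nll` are hypothetical, and a real training loop would compute the same quantities on tensors.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def next_token_nll(logits, token_ids, ignore_index=-100):
    """Mean negative log-likelihood of token t+1 given the logits at position t.

    logits: one list of vocabulary scores per position.
    token_ids: token ids of the sequence, same length as logits.
    """
    total, count = 0.0, 0
    # Shift by one: the logits at position t are scored against token t + 1.
    for step_logits, target in zip(logits[:-1], token_ids[1:]):
        if target == ignore_index:
            continue  # skip positions that should not contribute to the loss
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0

mask = causal_mask(4)
for row in mask:
    print(row)

# With uniform scores over a 4-token vocabulary, the loss equals log(4).
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_nll(uniform_logits, [1, 2, 3])
print(f"loss = {loss:.4f}")
```

The shift inside `next_token_nll` is the key detail: the last position has no next token to predict, so a sequence of length N contributes at most N - 1 loss terms.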

## 2. Setting Up the Environment

First, let's import the necessary modules and set up our environment:

In [None]:
import os
import torch
import yaml
from box import Box
from datasets import Dataset, DatasetDict
from pathlib import Path

# Import the CLM training components
from src.tasks.clm_training import execute
from src.tasks.clm_training.orchestrator import ContinualOrchestrator
from src.config.config_loader import ConfigValidator

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device count: {torch.cuda.device_count()}")
    print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

## 3. Creating a Sample Dataset

For this tutorial, we'll use a small sample dataset. In a real-world scenario, you would typically use the tokenized dataset from the tokenization step. Let's first create a small sample dataset for demonstration purposes:

In [None]:
# Create a sample dataset directory
sample_dataset_dir = Path("sample_tokenized_dataset")
sample_dataset_dir.mkdir(exist_ok=True)

# Create a small sample tokenized dataset
def create_sample_tokenized_dataset():
    # Sample tokenized data (input_ids, attention_mask, and labels)
    # This simulates what would come from the tokenization step
    train_data = {
        "input_ids": [
            [101, 2023, 2003, 1037, 4937, 2361, 1012, 102] + [0] * 8,  # "This is a sample text." padded
            [101, 2023, 2003, 1037, 2828, 2361, 1012, 102] + [0] * 8   # "This is a short text." padded
        ],
        "attention_mask": [
            [1, 1, 1, 1, 1, 1, 1, 1] + [0] * 8,  # Mask for the first sequence
            [1, 1, 1, 1, 1, 1, 1, 1] + [0] * 8   # Mask for the second sequence
        ],
        "labels": [
            [101, 2023, 2003, 1037, 4937, 2361, 1012, 102] + [0] * 8,  # Same as input_ids for CLM
            [101, 2023, 2003, 1037, 2828, 2361, 1012, 102] + [0] * 8   # Same as input_ids for CLM
        ]
    }
    
    valid_data = {
        "input_ids": [
            [101, 2023, 2003, 1037, 3231, 2361, 1012, 102] + [0] * 8,  # "This is a test text." padded
        ],
        "attention_mask": [
            [1, 1, 1, 1, 1, 1, 1, 1] + [0] * 8,  # Mask for the sequence
        ],
        "labels": [
            [101, 2023, 2003, 1037, 3231, 2361, 1012, 102] + [0] * 8,  # Same as input_ids for CLM
        ]
    }
    
    # Create datasets
    train_dataset = Dataset.from_dict(train_data)
    valid_dataset = Dataset.from_dict(valid_data)
    
    # Create a DatasetDict
    dataset_dict = DatasetDict({
        "train": train_dataset,
        "valid": valid_dataset
    })
    
    # Save the dataset
    dataset_dict.save_to_disk(sample_dataset_dir)
    
    return dataset_dict

# Create and display the sample dataset
sample_dataset = create_sample_tokenized_dataset()
print("Sample dataset created:")
print(f"Train split: {len(sample_dataset['train'])} examples")
print(f"Validation split: {len(sample_dataset['valid'])} examples")
print("\nSample from train split:")
print(sample_dataset['train'][0])
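
Note that the sample labels above copy the padding id (0) from `input_ids`. Many PyTorch-based training loops instead set padded label positions to -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so that padding does not contribute to the loss. Whether the framework applies this masking itself is not shown in this tutorial, so treat the following as a sketch; `mask_padding_labels` is a hypothetical helper.

```python
def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]

example_ids = [101, 2023, 2003, 102, 0, 0]
example_mask = [1, 1, 1, 1, 0, 0]
masked_labels = mask_padding_labels(example_ids, example_mask)
print(masked_labels)  # [101, 2023, 2003, 102, -100, -100]
```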

## 4. Loading a Tokenized Dataset

In a real-world scenario, you would load the tokenized dataset created in the tokenization step. Let's see how to load a tokenized dataset from disk:

In [None]:
from datasets import load_from_disk

# Load the sample dataset we just created
loaded_dataset = load_from_disk(str(sample_dataset_dir))
print(f"Loaded dataset splits: {list(loaded_dataset.keys())}")
print(f"Number of examples in train split: {len(loaded_dataset['train'])}")
print(f"Number of examples in validation split: {len(loaded_dataset['valid'])}")

# Display the first example from the train split
print("\nFirst example from train split:")
print(loaded_dataset['train'][0])

# Verify that the dataset has the required columns for CLM training
required_columns = ["input_ids", "attention_mask", "labels"]
all_columns_present = all(col in loaded_dataset['train'].column_names for col in required_columns)
print(f"\nAll required columns present: {all_columns_present}")
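
Beyond checking column names, it can help to verify that each example is internally consistent. The sketch below works on a plain dictionary, which is what indexing a dataset split returns; `check_clm_example` is a hypothetical helper, not part of the framework.

```python
def check_clm_example(example, required=("input_ids", "attention_mask", "labels")):
    """Return a list of problems found in one tokenized example (empty if none)."""
    problems = [f"missing column: {name}" for name in required if name not in example]
    if problems:
        return problems
    # All columns must describe the same number of token positions.
    lengths = {name: len(example[name]) for name in required}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")
    # The attention mask should only contain 0 (padding) and 1 (real token).
    if any(flag not in (0, 1) for flag in example.get("attention_mask", [])):
        problems.append("attention_mask contains values other than 0 and 1")
    return problems

good = {"input_ids": [5, 6, 0], "attention_mask": [1, 1, 0], "labels": [5, 6, 0]}
bad = {"input_ids": [5, 6, 0], "attention_mask": [1, 1], "labels": [5, 6, 0]}
print(check_clm_example(good))
print(check_clm_example(bad))
```

In the notebook, the same check could be applied to `loaded_dataset['train'][0]`.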

## 5. Setting Up the Training Configuration

For CLM training, we need to create a configuration that specifies the model, training parameters, and distributed training strategy. The Continual Pretraining Framework uses YAML configuration files, but we can also create a configuration programmatically:

In [None]:
# Create a configuration for CLM training
def create_clm_training_config(dataset_path, output_dir="output", model_name="meta-llama/Llama-3.2-1B"):
    """
    Create a configuration for CLM training.
    
    Args:
        dataset_path: Path to the tokenized dataset
        output_dir: Directory to save model checkpoints and logs
        model_name: Name or path of the pretrained model to use
        
    Returns:
        Box: Configuration object
    """
    config = {
        "task": "pretraining",
        "experiment_name": "tutorial_clm_training",
        "verbose_level": 4,
        
        # Dataset configuration
        "dataset": {
            "source": "local",
            "nameOrPath": dataset_path
        },
        
        # Output directory
        "output_dir": output_dir,
        
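        # Model configuration
        "model_name": model_name,
        # bf16 needs hardware support, so fall back to fp32 when it is unavailable
        "precision": "bf16-true" if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else "32-true",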
        
        # Training parameters
        "number_epochs": 1,
        "batch_size": 2,
        
        # Validation parameters
        "validate_on_end": True,
        "validate_after_epoch": True,
        "validate_after_k_steps": None,
        
        # Gradient parameters
        "gradient_accumulation": True,
        "gradient_accumulation_steps": 2,
        "grad_clip": 1.0,
        
        # Optimizer parameters
        "lr": 2e-5,
        "lr_decay": True,
        "weight_decay": 0.01,
        "beta1": 0.9,
        "beta2": 0.95,
        
        # Scheduler parameters
        "lr_scheduler": "warmup_linear",
        "warmup_proportion": 0.06,
        
        # Distributed training strategy
        "parallelization_strategy": "none",  # For single GPU, use "none"
        "num_workers": 1,
        "gradient_checkpointing": False,
    }
    
    # Convert to Box object for dot notation access
    return Box(config)

# Create the configuration
clm_config = create_clm_training_config(
    dataset_path=str(sample_dataset_dir),
    output_dir="clm_training_output",
    model_name="gpt2"  # Using a small model for demonstration
)

# Print the configuration
print("CLM Training Configuration:")
for key, value in clm_config.items():
    if isinstance(value, dict):
        print(f"{key}:")
        for k, v in value.items():
            print(f"  {k}: {v}")
    else:
        print(f"{key}: {value}")
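
The configuration sets `lr_scheduler` to `warmup_linear` with a `warmup_proportion` of 0.06. Assuming this means linear warmup followed by linear decay to zero (the common meaning of the name; the framework's exact schedule is not shown here), the learning rate at each step could be sketched as follows. `warmup_linear_lr` is a hypothetical function.

```python
def warmup_linear_lr(step, total_steps, base_lr=2e-5, warmup_proportion=0.06):
    """Learning rate at a 0-based step: linear warmup, then linear decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_proportion))
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / decay_steps)

for step in (0, 59, 60, 530, 1000):
    print(step, warmup_linear_lr(step, total_steps=1000))
```

With 1000 total steps, the first 60 steps ramp up to `lr`, and the remaining 940 steps decay back to zero.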


## 6. Understanding Distributed Training Strategies

The Continual Pretraining Framework supports various distributed training strategies:

1. **FSDP (Fully Sharded Data Parallel)**: Shards model parameters, gradients, and optimizer states across GPUs, enabling training of very large models.
2. **DDP (Distributed Data Parallel)**: Replicates the model on each GPU and synchronizes gradients, suitable for medium-sized models.
3. **DeepSpeed**: Implements ZeRO optimization for efficient large model training with memory optimizations.
4. **DP (Data Parallel)**: Simple data parallelism for single-node multi-GPU setups.
5. **None**: Single GPU or CPU training.

Let's see how to configure each strategy:

In [None]:
# Example configurations for different distributed training strategies

# 1. FSDP Configuration
def create_fsdp_config():
    config = clm_config.copy()
    config.parallelization_strategy = "fsdp"
    config.auto_wrap_policy = "gpt2"  # or "llama" for LLaMA models
    config.sharding_strategy = "FULL_SHARD"
    config.state_dict_type = "sharded"
    config.limit_all_gathers = True
    config.cpu_offload = False
    config.gradient_checkpointing = True
    return config

# 2. DDP Configuration
def create_ddp_config():
    config = clm_config.copy()
    config.parallelization_strategy = "ddp"
    config.find_unused_parameters = False
    config.process_group_backend = "nccl"
    config.static_graph = True
    return config

# 3. DeepSpeed Configuration
def create_deepspeed_config():
    config = clm_config.copy()
    config.parallelization_strategy = "deep_speed"
    config.zero_stage = 2
    config.offload_optimizer = False
    config.offload_parameters = False
    return config

# 4. DP Configuration
def create_dp_config():
    config = clm_config.copy()
    config.parallelization_strategy = "dp"
    return config

# Print the FSDP configuration as an example
if torch.cuda.device_count() > 1:
    print("Multi-GPU setup detected. FSDP configuration example:")
    fsdp_config = create_fsdp_config()
    for key, value in fsdp_config.items():
        if key not in clm_config or fsdp_config[key] != clm_config[key]:
            print(f"{key}: {value}")
else:
    print("Single GPU or CPU setup detected. Using 'none' strategy for training.")
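
As a rough way to choose among these strategies, the sketch below encodes the rules of thumb used in this tutorial: single-device training uses `"none"`, models above roughly 1B parameters use sharding, and everything else uses DDP. `suggest_strategy` and its threshold are illustrative assumptions, not framework behavior, and `"deep_speed"` would be an equally reasonable answer wherever `"fsdp"` is returned.

```python
def suggest_strategy(num_gpus, num_parameters, shard_threshold=1_000_000_000):
    """Suggest a parallelization_strategy value from GPU count and model size."""
    if num_gpus <= 1:
        return "none"  # single GPU or CPU
    if num_parameters > shard_threshold:
        return "fsdp"  # shard parameters, gradients, and optimizer states
    return "ddp"       # replicate the model and synchronize gradients

print(suggest_strategy(num_gpus=1, num_parameters=124_000_000))    # none
print(suggest_strategy(num_gpus=4, num_parameters=124_000_000))    # ddp
print(suggest_strategy(num_gpus=8, num_parameters=7_000_000_000))  # fsdp
```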

## 7. Training a Model with ContinualOrchestrator

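Now that we have our dataset and configuration, let's set up training with the ContinualOrchestrator. For this tutorial, we'll use a small model and our sample dataset. The cell below only walks through the setup steps; a complete training run goes through the `execute` function covered in the next section: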

In [None]:
# Create output directory
os.makedirs("clm_training_output", exist_ok=True)

# Initialize the orchestrator
orchestrator = ContinualOrchestrator(clm_config)

# Load the dataset
dataset = load_from_disk(str(sample_dataset_dir))

# Train the model (this is a simplified version for demonstration)
# In a real scenario, you would use the execute function
try:
    # This is a simplified training loop for demonstration
    # It won't actually run the full training due to the small dataset and setup
    print("Setting up training environment...")
    orchestrator.setup_environment()
    
    print("Loading dataset...")
    orchestrator.load_dataset(dataset)
    
    print("Setting up training...")
    orchestrator.setup()
    
    print("\nTraining would start here in a real scenario.")
    print("For a complete training run, use the execute function with your configuration.")
except Exception as e:
    print(f"Error during training setup: {e}")
    print("\nNote: This is expected in a notebook environment without proper GPU setup.")
    print("In a real training scenario, you would use the execute function with your configuration file.")

## 8. Running Training with the Execute Function

In a real-world scenario, you would typically use the `execute` function to run the entire training pipeline. Here's how you would do it:

In [None]:
# Save the configuration to a YAML file
config_path = "clm_training_config.yaml"
with open(config_path, "w") as f:
    yaml.dump(clm_config.to_dict(), f)

print(f"Configuration saved to {config_path}")

# In a real scenario, you would run:
# execute(config_path)
print("To run training with this configuration, use:")
print("from src.tasks.clm_training import execute")
print(f"execute('{config_path}')")
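
Before launching a long run, a quick sanity check on the configuration can catch missing entries early. The list of required keys below is an assumption based on the configuration built in this tutorial, not the framework's actual schema, and `missing_config_keys` is a hypothetical helper.

```python
# Assumed minimal set of top-level keys; adjust to the framework's real schema.
REQUIRED_KEYS = ("task", "model_name", "dataset", "output_dir")

def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required top-level keys that are absent or set to None."""
    return [key for key in required if config.get(key) is None]

draft = {"task": "pretraining", "model_name": "gpt2", "dataset": {"source": "local"}}
print(missing_config_keys(draft))  # ['output_dir']
```

Because the helper only relies on `.get`, it works with plain dictionaries as well as `Box` objects.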

## 9. Monitoring Training Progress

The Continual Pretraining Framework provides several ways to monitor training progress:

1. **Console Logging**: Training metrics are logged to the console.
2. **Weights & Biases (WandB)**: Optional integration for experiment tracking.
3. **Checkpoints**: Regular model checkpoints are saved to the output directory.

Let's see how to configure logging and monitor training:

In [None]:
# Example of configuring WandB logging
def configure_wandb_logging(config):
    """Configure Weights & Biases logging"""
    config_with_wandb = config.copy()
    config_with_wandb.logging_config = "wandb"
    config_with_wandb.wandb_project = "clm_training_tutorial"
    config_with_wandb.wandb_entity = "your_wandb_entity"  # Replace with your WandB entity
    config_with_wandb.log_model = False
    config_with_wandb.log_iter_interval = 10
    return config_with_wandb

# Example of how to load and inspect a checkpoint
def load_checkpoint(checkpoint_path):
    """Load a checkpoint and print its contents"""
    if os.path.exists(checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location="cpu")
        print(f"Checkpoint keys: {list(checkpoint.keys())}")
        if "step_count" in checkpoint:
            print(f"Training step: {checkpoint['step_count']}")
        if "current_epoch" in checkpoint:
            print(f"Current epoch: {checkpoint['current_epoch']}")
        return checkpoint
    else:
        print(f"Checkpoint not found at {checkpoint_path}")
        return None

# Example checkpoint path
checkpoint_path = "clm_training_output/checkpoint.pt"
print(f"In a real training run, checkpoints would be saved to: {checkpoint_path}")
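
When reading logged metrics, two small transformations are often useful: converting the mean validation loss to perplexity, and smoothing the noisy per-step training loss. The sketch below assumes the loss is a mean per-token negative log-likelihood in nats; both helper names are hypothetical.

```python
import math

def perplexity(mean_nll):
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll)

def smoothed(values, alpha=0.1):
    """Exponential moving average of a sequence of loss values."""
    result, average = [], None
    for value in values:
        average = value if average is None else alpha * value + (1 - alpha) * average
        result.append(average)
    return result

print(f"perplexity at loss 3.0: {perplexity(3.0):.2f}")
print(smoothed([4.0, 3.0, 3.5, 2.5], alpha=0.5))
```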

## 10. Best Practices and Optimization Tips

Here are some best practices and optimization tips for CLM training:

### Model Selection and Hardware Requirements

- **Model Size**: Choose a model size appropriate for your hardware. Larger models require more memory and compute.
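- **GPU Memory**: As a rough rule of thumb, plan for at least 8GB of GPU memory for small models (125M parameters), 16GB for medium models (1B parameters), and 40GB+ for large models (7B+ parameters). Actual needs depend on precision, optimizer, batch size, and sequence length, and the larger sizes usually require sharding or offloading on top of these figures.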
- **Distributed Training**: For models larger than 1B parameters, consider using distributed training strategies like FSDP or DeepSpeed.
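
For a back-of-the-envelope estimate, mixed precision training with Adam is commonly counted at about 16 bytes per parameter: 2 for half-precision weights, 2 for gradients, and 12 for the fp32 master weights and the two Adam moment buffers. The sketch below applies that figure; it ignores activations and framework overhead, and pure half-precision setups store less. `training_state_gb` is a hypothetical helper.

```python
def training_state_gb(num_parameters, bytes_per_parameter=16):
    """Approximate memory for weights, gradients, and Adam states, in GB (1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

for name, params in [("125M", 125e6), ("1B", 1e9), ("7B", 7e9)]:
    print(f"{name}: ~{training_state_gb(params):.0f} GB before activations")
```

The 7B estimate exceeds the memory of a single GPU, which is why sharded strategies such as FSDP or DeepSpeed are recommended at that scale.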

### Training Optimization

- **Gradient Accumulation**: Use gradient accumulation to simulate larger batch sizes on limited hardware.
- **Gradient Checkpointing**: Enable gradient checkpointing to reduce memory usage at the cost of increased computation time.
- **Mixed Precision Training**: Use mixed precision training (bf16 or fp16) to reduce memory usage and speed up training.
- **Learning Rate Scheduling**: Use a learning rate scheduler with warmup to improve training stability.
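
The arithmetic behind gradient accumulation is simple enough to sketch. The example assumes that `batch_size` is the per-device batch size and that a final, partially filled accumulation window still triggers an optimizer step; both are assumptions about the framework, and the helper names are hypothetical.

```python
import math

def effective_batch_size(batch_size, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return batch_size * accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_examples, batch_size, accumulation_steps, num_devices=1):
    """Optimizer steps needed to see every example once."""
    per_step = effective_batch_size(batch_size, accumulation_steps, num_devices)
    return math.ceil(num_examples / per_step)

# Values from this tutorial's configuration: batch_size=2, gradient_accumulation_steps=2.
print(effective_batch_size(2, 2))                # 4
print(optimizer_steps_per_epoch(10_000, 2, 2))   # 2500
```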

### Dataset Preparation

- **Dataset Size**: Larger datasets generally lead to better models, but also require more training time.
- **Dataset Quality**: High-quality, diverse data is crucial for good model performance.
- **Validation Split**: Always include a validation split to monitor training progress and detect overfitting.

### Monitoring and Debugging

- **Regular Validation**: Validate the model regularly to catch issues early.
- **Gradient Norms**: Monitor gradient norms to detect exploding or vanishing gradients.
- **Learning Rate**: Start with a small learning rate and gradually increase it if training is stable.
- **Memory Usage**: Monitor GPU memory usage to detect memory leaks or inefficient memory usage.
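
To show what monitoring and clipping gradient norms involves, here is a pure-Python sketch of global-norm clipping, the operation that a `grad_clip` value of 1.0 typically controls. It mirrors the arithmetic of `torch.nn.utils.clip_grad_norm_` on nested lists; how the framework applies `grad_clip` internally is an assumption.

```python
import math

def global_grad_norm(gradients):
    """L2 norm over all gradient values, treating every tensor as one flat vector."""
    return math.sqrt(sum(g * g for tensor in gradients for g in tensor))

def clip_by_global_norm(gradients, max_norm=1.0, eps=1e-6):
    """Scale all gradients down so their global norm is at most max_norm.

    Returns the (possibly scaled) gradients and the norm before clipping,
    which is the value worth logging during training.
    """
    norm = global_grad_norm(gradients)
    scale = min(1.0, max_norm / (norm + eps))
    return [[g * scale for g in tensor] for tensor in gradients], norm

clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(f"norm before: {norm_before}, norm after: {global_grad_norm(clipped):.4f}")
```

A norm that repeatedly hits the clipping threshold, or that drifts toward zero, is a signal to revisit the learning rate or the data.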

## 11. Integration with Tokenization

The CLM training task is designed to work seamlessly with the tokenized dataset produced by the tokenization task. Here's how to integrate the two tasks:

1. **Run the Tokenization Task**: First, run the tokenization task to prepare your dataset.
2. **Configure CLM Training**: Set up your CLM training configuration to use the tokenized dataset.
3. **Run CLM Training**: Execute the CLM training task using the tokenized dataset.

Example workflow:

In [None]:
# Example workflow integrating tokenization and CLM training

# 1. Tokenization configuration
tokenization_config = {
    "task": "tokenization",
    "tokenizer_name": "gpt2",
    "context_length": 1024,
    "overlap": 0,
    "batch_size": 1000,
    "num_proc": 4,
    "dataset": {
        "source": "local",
        "nameOrPath": "path/to/raw/dataset"
    },
    "output_dir": "tokenized_dataset"
}

# 2. CLM training configuration using the tokenized dataset
clm_training_config = {
    "task": "pretraining",
    "model_name": "gpt2",
    "dataset": {
        "source": "local",
        "nameOrPath": "tokenized_dataset"  # Output from tokenization task
    },
    "output_dir": "trained_model",
    # Other training parameters...
}

# In a real scenario, you would run:
# 1. Execute tokenization
# from src.tasks.tokenization import execute as tokenize
# tokenize("path/to/tokenization_config.yaml")

# 2. Execute CLM training
# from src.tasks.clm_training import execute as train
# train("path/to/clm_training_config.yaml")

print("This example shows how to integrate tokenization and CLM training tasks.")
print("In a real scenario, you would save these configurations to YAML files and run the execute functions.")
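
The `context_length` and `overlap` settings in the tokenization configuration determine how a long token stream becomes fixed-size training examples. The framework's tokenization code is not shown here, so the following is only one plausible interpretation: consecutive windows of `context_length` tokens that share `overlap` tokens with their predecessor. `split_into_windows` is a hypothetical helper.

```python
def split_into_windows(token_ids, context_length, overlap=0):
    """Split a token list into windows of at most context_length tokens."""
    if not 0 <= overlap < context_length:
        raise ValueError("overlap must be non-negative and smaller than context_length")
    stride = context_length - overlap
    windows, start = [], 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + context_length])
        # Stop once a window has reached the end of the stream.
        if start + context_length >= len(token_ids):
            break
        start += stride
    return windows

tokens = list(range(10))
print(split_into_windows(tokens, context_length=4, overlap=0))
print(split_into_windows(tokens, context_length=4, overlap=1))
```

With `overlap=0` the last window may be shorter than `context_length`, which is where padding and the attention mask come in.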

## 12. Conclusion

In this tutorial, we've covered the basics of CLM training using the Continual Pretraining Framework. We've learned how to:

1. Load a tokenized dataset
2. Configure CLM training parameters
3. Select an appropriate distributed training strategy
4. Train a model using the ContinualOrchestrator
5. Monitor training progress and optimize training

The Continual Pretraining Framework provides a flexible and efficient way to train causal language models, with support for various distributed training strategies and optimization techniques.

For more advanced usage, refer to the framework documentation and experiment with different configurations to find what works best for your specific use case.