# Causal Language Model (CLM) Training Tutorial

This tutorial demonstrates how to train a causal language model using the Continual Pretraining Framework. We'll cover the following topics:

1. Understanding CLM training concepts
2. Setting up the training configuration
3. Loading a tokenized dataset
4. Selecting a training strategy
5. Training a model using the ContinualOrchestrator
6. Monitoring training progress
7. Evaluating the trained model
8. Best practices and optimization tips

This tutorial assumes you have already completed the tokenization tutorial and have a tokenized dataset available.

# =============================================================================

## Understanding CLM Training Concepts

Causal Language Model (CLM) training is a fundamental technique for training large language models. In CLM training, the model learns to predict the next token in a sequence given all previous tokens. This is also known as autoregressive language modeling.

Key concepts in CLM training include:

- **Autoregressive Prediction**: The model predicts one token at a time, with each prediction conditioned on all previous tokens.
- **Causal Attention Mask**: Ensures that the model can only attend to previous tokens in the sequence, not future ones.
- **Next Token Prediction Loss**: The training objective is to minimize the negative log-likelihood of predicting the correct next token.
- **Distributed Training**: Large models often require training across multiple GPUs or nodes using strategies like FSDP, DDP, or DeepSpeed.
- **Gradient Accumulation**: Accumulating gradients across multiple batches to simulate larger batch sizes.
- **Learning Rate Scheduling**: Adjusting the learning rate during training to improve convergence.
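
To make the next-token prediction loss concrete, here is a minimal pure-Python sketch of the shifted negative log-likelihood described above. The function name `next_token_nll` and the toy logits are illustrative assumptions, not part of the framework. A real training run would compute the same quantity on GPU tensors.

```python
import math

def next_token_nll(logits, input_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    logits[t] holds the unnormalized scores over the vocabulary after the
    model has read input_ids[: t + 1], so it is scored against input_ids[t + 1].
    """
    total = 0.0
    count = 0
    # Shift by one: position t predicts token t + 1; the last position has no target.
    for t in range(len(input_ids) - 1):
        scores = logits[t]
        target = input_ids[t + 1]
        # Numerically stable log of the softmax normalizer.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count

# Toy vocabulary of 3 tokens and a 3-token sequence (hypothetical values).
input_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
loss = next_token_nll(uniform_logits, input_ids)
print(round(loss, 4))  # 1.0986, i.e. ln(3) for uniform predictions
```

A model that has learned nothing scores ln(vocabulary size); training drives this value down as the correct next token receives more probability.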

The Continual Pretraining Framework provides a comprehensive implementation for CLM training with various distributed training strategies, making it easy to train large language models efficiently.

# =============================================================================

## Creating a YAML Configuration File for the clm_training Task

# ===========================
# CLM Training Configuration
# ===========================

# --- Task Info ---
task: "clm_training"                  # Causal Language Modeling training
experiment_name: "tutorial_clm_training"
verbose_level: 4

# --- Dataset ---
dataset:
  source: "local"                    # or "hf" for Hugging Face datasets
  nameOrPath: "tutorials/sample_tokenized_dataset"
  format: "hf"

# --- Model ---
model_name: "openai-community/gpt2"  # Pretrained model or custom path
precision: "bf16-true"

# --- Training Parameters ---
number_epochs: 1
batch_size: 8
gradient_accumulation: true          # Accumulate gradients over several batches before each optimizer step
gradient_accumulation_steps: 2       # Per-device effective batch size = batch_size * gradient_accumulation_steps
grad_clip: 1.0
lr: 0.00002
lr_decay: true
weight_decay: 0.01
beta1: 0.9
beta2: 0.95
lr_scheduler: "warmup_linear"        # Several schedules supported
warmup_proportion: 0.06              # Proportion of steps for warmup

# --- Validation ---
validate_after_epoch: false           # Validate after each epoch
validate_on_end: false                # Validate at end of training
validate_after_k_steps: 1000          # Validate every k steps

# --- Checkpointing ---
save_on_validate: false
save_on_end: true
output_dir: "tutorials/output"

# --- Parallelization ---
parallelization_strategy: "fsdp"      # Supported: "fsdp", "ddp"
auto_wrap_policy: "gpt2"
sharding_strategy: "FULL_SHARD"
state_dict_type: "sharded"
limit_all_gathers: true
cpu_offload: false
num_workers: 4
gradient_checkpointing: true

# --- Logging ---
logging_config: "wandb"
wandb_project: "your_wandb_project"   # Replace with your WandB project name
wandb_entity: "your_wandb_entity"     # Replace with your WandB entity
log_model: true
log_iter_interval: 10

# --- Usage ---
# Run with:
# python src/main.py --config path/to/clm_training_tutorial.yaml
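
As a rough illustration of how the training parameters above interact, the following sketch derives the effective batch size, the number of optimizer steps, and the number of warmup steps. The values `num_devices` and `num_train_examples` are hypothetical, and the rounding rules are assumptions. The framework may compute these quantities differently.

```python
import math

# Values copied from the configuration above.
batch_size = 8
gradient_accumulation_steps = 2
warmup_proportion = 0.06
number_epochs = 1

# Hypothetical values, not part of the config: adjust to your setup.
num_devices = 1
num_train_examples = 10_000

# Number of examples that contribute to one optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps * num_devices

# Optimizer steps over the whole run (a final partial batch is rounded up).
total_steps = math.ceil(num_train_examples / effective_batch_size) * number_epochs

# Steps spent ramping the learning rate up.
warmup_steps = int(warmup_proportion * total_steps)

print(effective_batch_size, total_steps, warmup_steps)  # 16 625 37
```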

# =====================================================================

## Let's load the configuration file

In [1]:
import yaml
from box import Box

# Load the config from the YAML file
with open("/workspace/tutorials/configs/clm_training_tutorial.yaml", "r") as f:
    clm_config = Box(yaml.safe_load(f), default_box=True)

print("Loaded config keys:", clm_config.keys())

Loaded config keys: dict_keys(['auto_wrap_policy', 'batch_size', 'beta1', 'beta2', 'cpu_offload', 'dataset', 'experiment_name', 'grad_clip', 'gradient_accumulation', 'gradient_accumulation_steps', 'gradient_checkpointing', 'limit_all_gathers', 'log_iter_interval', 'log_model', 'logging_config', 'lr', 'lr_decay', 'lr_scheduler', 'model_name', 'num_workers', 'number_epochs', 'output_dir', 'parallelization_strategy', 'precision', 'save_on_end', 'save_on_validate', 'sharding_strategy', 'state_dict_type', 'task', 'validate_after_epoch', 'validate_after_k_steps', 'validate_on_end', 'verbose_level', 'wandb_entity', 'wandb_project', 'warmup_proportion', 'weight_decay'])
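
Because `default_box=True` makes a missing key return an empty `Box` instead of raising an error, a misspelled key can go unnoticed. The sketch below shows one way to check a config before training. The helper `check_clm_config` and its rules are hypothetical and are not part of the framework.

```python
def check_clm_config(config):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []
    # Keys this sketch treats as mandatory (an assumption, not a framework rule).
    for key in ("task", "dataset", "model_name", "batch_size", "lr", "output_dir"):
        if key not in config:
            problems.append(f"missing key: {key}")
    if config.get("gradient_accumulation") and config.get("gradient_accumulation_steps", 1) < 2:
        problems.append("gradient_accumulation is enabled but gradient_accumulation_steps < 2")
    if not 0.0 <= config.get("warmup_proportion", 0.0) <= 1.0:
        problems.append("warmup_proportion must be between 0 and 1")
    if config.get("parallelization_strategy") not in ("fsdp", "ddp"):
        problems.append("parallelization_strategy must be 'fsdp' or 'ddp'")
    return problems

sample = {
    "task": "clm_training",
    "dataset": {"source": "local"},
    "model_name": "openai-community/gpt2",
    "batch_size": 8,
    "lr": 0.00002,
    "output_dir": "tutorials/output",
    "gradient_accumulation": True,
    "gradient_accumulation_steps": 2,
    "warmup_proportion": 0.06,
    "parallelization_strategy": "fsdp",
}
print(check_clm_config(sample))  # []
```

In the notebook, the same check could be applied to the loaded config with `check_clm_config(clm_config.to_dict())`.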


# =============================================================================

## Creating a Sample Dataset

For this tutorial, we'll use a small sample dataset. In a real-world scenario, you would typically use the tokenized dataset from the tokenization step. Let's first create a small sample dataset for demonstration purposes:

In [45]:
from datasets import Dataset, DatasetDict, load_from_disk
from pathlib import Path

# If you already have a tokenized dataset, skip this cell.
# This creates a small dummy dataset for demonstration: one example per split,
# with illustrative placeholder token IDs padded with 0 to length 16.
sample_dataset_dir = Path("sample_tokenized_dataset")
if not sample_dataset_dir.exists():
    train_data = {
        "input_ids": [[101, 2023, 2003, 1037, 4937, 2361, 1012, 102] + [0]*8],
        "attention_mask": [[1]*8 + [0]*8],
        "labels": [[101, 2023, 2003, 1037, 4937, 2361, 1012, 102] + [0]*8],
    }
    valid_data = {
        "input_ids": [[101, 2023, 2003, 1037, 3231, 2361, 1012, 102] + [0]*8],
        "attention_mask": [[1]*8 + [0]*8],
        "labels": [[101, 2023, 2003, 1037, 3231, 2361, 1012, 102] + [0]*8],
    }
    ds = DatasetDict({
        "train": Dataset.from_dict(train_data),
        "valid": Dataset.from_dict(valid_data),
    })
    ds.save_to_disk(str(sample_dataset_dir))
    print("Sample dataset created at:", sample_dataset_dir)
else:
    print("Sample dataset already exists at:", sample_dataset_dir)
    
    
# If you want to use the sample dataset, update the config in-memory
clm_config.dataset.nameOrPath = str(sample_dataset_dir)
print("Dataset path in config set to:", clm_config.dataset.nameOrPath)

Saving the dataset (1/1 shards): 100%|██████████| 1/1 [00:00<00:00, 430.72 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1/1 [00:00<00:00, 503.82 examples/s]

Sample dataset created at: sample_tokenized_dataset
Dataset path in config set to: sample_tokenized_dataset





# =============================================================================

## Loading a Tokenized Dataset

In a real-world scenario, you would load the tokenized dataset created in the tokenization step. Let's see how to load a tokenized dataset from disk:

In [46]:
loaded_dataset = load_from_disk(clm_config.dataset.nameOrPath)
print("Loaded dataset splits:", list(loaded_dataset.keys()))
print("Number of examples in train split:", len(loaded_dataset["train"]))
print("Number of examples in valid split:", len(loaded_dataset["valid"]))
print("First example from train split:", loaded_dataset["train"][0])

Loaded dataset splits: ['train', 'valid']
Number of examples in train split: 1
Number of examples in valid split: 1
First example from train split: {'input_ids': [101, 2023, 2003, 1037, 4937, 2361, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [101, 2023, 2003, 1037, 4937, 2361, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0]}
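
In the example above, the padded positions carry the label `0`, so they would be scored like real tokens. PyTorch's cross-entropy loss skips positions whose label is `-100` by default, and Hugging Face causal language models rely on that convention. The sketch below shows one way to mask padding in the labels. The helper name is hypothetical, and whether the framework already applies such masking internally is not stated in this tutorial.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Same values as the sample dataset created earlier.
input_ids = [101, 2023, 2003, 1037, 4937, 2361, 1012, 102] + [0] * 8
attention_mask = [1] * 8 + [0] * 8
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```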


# =============================================================================

## Validate Dataset Columns

In [50]:
required_columns = ["input_ids", "attention_mask", "labels"]
all_columns_present = all(col in loaded_dataset["train"].column_names for col in required_columns)
print("All required columns present:", all_columns_present)
assert all_columns_present, "Dataset is missing required columns for CLM training!"

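
# Persist the in-memory change to dataset.nameOrPath so that the training
# script, which reads the YAML file from disk, uses the sample dataset.
# Note: this overwrites the file; yaml.dump drops comments and sorts keys.
with open("/workspace/tutorials/configs/clm_training_tutorial.yaml", "w") as f:
    yaml.dump(clm_config.to_dict(), f)
print("Updated config saved to /workspace/tutorials/configs/clm_training_tutorial.yaml")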

All required columns present: True
Updated config saved to /workspace/tutorials/configs/clm_training_tutorial.yaml
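
Checking column names does not guarantee that the columns line up within each example. As a complementary sketch, the hypothetical helper below reports examples whose `input_ids`, `attention_mask`, and `labels` differ in length. It works on any iterable of dict-like rows, which is what iterating over a Hugging Face `Dataset` yields.

```python
def find_length_mismatches(examples, columns=("input_ids", "attention_mask", "labels")):
    """Return the indices of examples whose columns differ in length."""
    bad = []
    for index, example in enumerate(examples):
        lengths = {len(example[col]) for col in columns}
        if len(lengths) != 1:
            bad.append(index)
    return bad

# Toy rows: the second one has a label sequence that is too short.
rows = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [1, 2, 3]},
    {"input_ids": [4, 5, 6], "attention_mask": [1, 1, 0], "labels": [4, 5]},
]
print(find_length_mismatches(rows))  # [1]
```

In the notebook, this could be called as `find_length_mismatches(loaded_dataset["train"])`.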


# =============================================================================

## Run Training 

In [51]:
## Run Training from the Notebook

# If not already in the workspace root, change directory
import os
os.chdir('/workspace')

# Now run the training script
!python src/main.py --config /workspace/tutorials/configs/clm_training_tutorial.yaml

2025-06-16 11:11:06 - utils.orchestrator - INFO - Found 1 CUDA devices available for training
2025-06-16 11:11:06 - utils.orchestrator - INFO - Starting training pipeline
2025-06-16 11:11:06 - utils.orchestrator - DEBUG - Orchestrator config gradient_accumulation_steps: value=2, type=<class 'int'>
2025-06-16 11:11:06 - utils.orchestrator - DEBUG - Orchestrator config validate_after_k_steps: value

# =============================================================================

## Understanding Distributed Training Strategies

The Continual Pretraining Framework is designed to support the distributed training strategies provided by Lightning Fabric. FSDP and DDP are available today; the others are planned:

1. **FSDP (Fully Sharded Data Parallel)**: Shards model parameters, gradients, and optimizer states across GPUs, enabling training of very large models.
2. **DDP (Distributed Data Parallel)**: Replicates the model on each GPU and synchronizes gradients, suitable for medium-sized models.
3. **DeepSpeed**: Implements ZeRO optimization for efficient large model training with memory optimizations. (Not implemented yet)
4. **DP (Data Parallel)**: Simple data parallelism for single-node multi-GPU setups. (Not implemented yet)
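
To illustrate why sharding matters, here is a back-of-the-envelope estimate of the per-GPU memory needed for model state under DDP (full replica on every GPU) and FSDP with `FULL_SHARD` (state split across GPUs). The figure of 16 bytes per parameter is an assumption (fp32 weights, fp32 gradients, and two Adam moments). It ignores activations and temporary communication buffers, and the `bf16-true` precision used in this tutorial would lower it.

```python
def model_state_gib(num_params, bytes_per_param, world_size, sharded):
    """Rough per-GPU memory for parameters, gradients, and optimizer state."""
    total_bytes = num_params * bytes_per_param
    if sharded:
        # FULL_SHARD splits parameters, gradients, and optimizer state across GPUs.
        total_bytes /= world_size
    return total_bytes / 2**30

GPT2_PARAMS = 124_000_000  # approximate size of the smallest GPT-2 checkpoint
BYTES_PER_PARAM = 16       # assumption: 4 (weights) + 4 (gradients) + 8 (Adam moments)

ddp = model_state_gib(GPT2_PARAMS, BYTES_PER_PARAM, world_size=4, sharded=False)
fsdp = model_state_gib(GPT2_PARAMS, BYTES_PER_PARAM, world_size=4, sharded=True)
print(f"DDP:  {ddp:.2f} GiB per GPU")
print(f"FSDP: {fsdp:.2f} GiB per GPU")
```

For a model as small as GPT-2 either strategy fits comfortably on one GPU; the gap becomes decisive once the replicated state no longer fits in a single device's memory.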

# =============================================================================

## Best Practices and Optimization Tips

Here are some best practices and optimization tips for CLM training:

### Model Selection and Hardware Requirements

- **Model Size**: Choose a model size appropriate for your hardware. Larger models require more memory and compute.
- **Distributed Training**: For models larger than 1B parameters, consider a sharded strategy such as FSDP. DeepSpeed is another common choice, but it is not yet implemented in this framework.

### Training Optimization

- **Gradient Accumulation**: Use gradient accumulation to simulate larger batch sizes on limited hardware.
- **Gradient Checkpointing**: Enable gradient checkpointing to reduce memory usage at the cost of increased computation time.
- **Mixed Precision Training**: Use mixed precision training (bf16 or fp16) to reduce memory usage and speed up training.
- **Learning Rate Scheduling**: Use a learning rate scheduler with warmup to improve training stability.
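
The warmup schedule mentioned above can be sketched as follows. This assumes that `warmup_linear` means a linear ramp from zero to the base learning rate followed by a linear decay back to zero, which is a common definition. The framework's exact schedule may differ, and the function name is hypothetical.

```python
def warmup_linear_lr(step, total_steps, base_lr, warmup_proportion):
    """Learning rate at an optimizer step: linear ramp up, then linear decay to zero."""
    warmup_steps = int(warmup_proportion * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: grow proportionally to the step index.
        return base_lr * step / warmup_steps
    # Decay phase: shrink linearly until total_steps is reached.
    remaining = max(total_steps - warmup_steps, 1)
    return base_lr * max(total_steps - step, 0) / remaining

# Values from the tutorial config (lr, warmup_proportion) and a hypothetical run length.
for step in (0, 30, 60, 500, 1000):
    print(step, warmup_linear_lr(step, total_steps=1000, base_lr=0.00002, warmup_proportion=0.06))
```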

### Dataset Preparation

- **Dataset Size**: Larger datasets generally lead to better models, but also require more training time.
- **Dataset Quality**: High-quality, diverse data is crucial for good model performance.
- **Validation Split**: Always include a validation split to monitor training progress and prevent overfitting.

### Monitoring and Debugging

- **Regular Validation**: Validate the model regularly to catch issues early.
- **Gradient Norms**: Monitor gradient norms to detect exploding or vanishing gradients.
- **Learning Rate**: Start with a small learning rate and gradually increase it if training is stable.
- **Memory Usage**: Monitor GPU memory usage to detect memory leaks or inefficient memory usage.
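
Gradient norm monitoring and the `grad_clip: 1.0` setting from the configuration are closely related: clipping rescales all gradients when their combined L2 norm exceeds the threshold. The pure-Python sketch below mirrors the formula used by PyTorch's `clip_grad_norm_` on toy values. The helper name is hypothetical, and how the framework applies clipping internally is an assumption.

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # The small epsilon avoids division by zero when all gradients are zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # two toy parameter tensors, flattened
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # 5.0
```

Logging `norm_before` at each step gives the signal to watch: a sudden spike suggests exploding gradients, while values that stay near zero suggest vanishing gradients.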

# =============================================================================

## Integration with Tokenization

The CLM training task is designed to work seamlessly with the tokenized dataset produced by the tokenization task. Here's how to integrate the two tasks:

1. **Run the Tokenization Task**: First, run the tokenization task to prepare your dataset.
2. **Configure CLM Training**: Set up your CLM training configuration to use the tokenized dataset.
3. **Run CLM Training**: Execute the CLM training task using the tokenized dataset.

Example workflow:

In [None]:
import os
os.chdir("/workspace")


# 1. Run tokenization
!python src/main.py --config tutorials/configs/tokenization_tutorial.yaml

# 2. Run CLM training
print("Running CLM training...")
!python src/main.py --config tutorials/configs/clm_training_tutorial.yaml

2025-06-16 13:54:14 - src.utils.orchestrator - INFO - Starting tokenization workflow
2025-06-16 13:54:14 - src.utils.orchestrator - INFO - Loading dataset from files at dir 'tutorials/data/raw_text_data'
2025-06-16 13:54:14 - src.utils.dataset.storage - INFO - Processing files from 'tutorials/data/raw_text_data' and grouping by file extension.
2025-06-16 13:54:14 - src.utils.dataset.storage - INFO - Starting directory scan in: tutorials/data/raw_text_data
2025-06-16 13:54:14 - src.utils.dataset.storage - DEBUG - Directory: tutorials/data/raw_text_data - Found 1/1 files with supported extensions ['txt', 'csv', 'json', 'jsonl']
2025-06-16 13:54:14 - src.utils.dataset.storage - DEBUG - Scan completed: Found 1 matching files across 1 directories
2025-06-16 13:54:14 - src.utils.dataset.storage - DEBUG - Grouped files by extensions: ['txt (1)']
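
A notebook `!` command does not raise an error when the process exits with a failure, so the training cell above would start even if tokenization failed. The sketch below shows one way to chain the two tasks so that the pipeline stops at the first failure. The helper names are hypothetical; the entry point and config paths are the ones used in this tutorial.

```python
import subprocess
import sys

def build_commands(config_paths, entry_point="src/main.py"):
    """Build one command per config, in the order the tasks should run."""
    return [[sys.executable, entry_point, "--config", path] for path in config_paths]

def run_pipeline(commands, runner=subprocess.run):
    """Run commands in sequence; check=True raises CalledProcessError on the first failure."""
    for command in commands:
        runner(command, check=True)

commands = build_commands([
    "tutorials/configs/tokenization_tutorial.yaml",
    "tutorials/configs/clm_training_tutorial.yaml",
])
# run_pipeline(commands)  # uncomment inside /workspace to launch both tasks
```

The `runner` parameter exists only to make the sketch easy to dry-run or test without launching real processes.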

# =============================================================================

## Conclusion

In this tutorial, we've covered the basics of CLM training using the Continual Pretraining Framework. We've learned how to:

1. Load a tokenized dataset
2. Configure CLM training parameters
3. Select an appropriate distributed training strategy
4. Train a model using the ContinualOrchestrator
5. Monitor training progress and optimize training

The Continual Pretraining Framework provides a flexible and efficient way to train causal language models, with support for various distributed training strategies and optimization techniques.

For more advanced usage, refer to the framework documentation and experiment with different configurations to find what works best for your specific use case.