# Transformer Language Model Training on Databricks

This notebook trains a Transformer language model with **PyTorch DistributedDataParallel (DDP)** across **multiple GPUs** on a single node.

## Cluster Requirements

- **Runtime**: DBR 13.3 ML (GPU) or higher; the code needs PyTorch 2.0+, so confirm the version printed in Step 0
- **Driver**: Instance with multiple GPUs (e.g., `p3.16xlarge` with 8x V100)
- **Workers**: 0 (single-node multi-GPU)
- **Libraries**: Extra packages (`einops`, `pyyaml`, `tensorboard`, `tqdm`) are installed in Step 1
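
To make "single-node multi-GPU" concrete: DDP usually runs one process per GPU, and each process finds its peers through a few environment variables when `torch.distributed` is initialized with the `env://` method. The sketch below only builds those variables; the helper name `single_node_ddp_env` and the port number are assumptions for illustration, and `databricks_train.py` may set things up differently.

```python
def single_node_ddp_env(rank, world_size, port=29500):
    """Environment variables for one DDP process on a single machine.

    Hypothetical helper: the port is an arbitrary choice and must be free.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return {
        "MASTER_ADDR": "127.0.0.1",     # every process runs on the driver node
        "MASTER_PORT": str(port),       # rendezvous port shared by all processes
        "WORLD_SIZE": str(world_size),  # total number of processes (one per GPU)
        "RANK": str(rank),              # global index of this process
        "LOCAL_RANK": str(rank),        # same as RANK when there is only one node
    }


# Show the layout for a 4-GPU driver.
for rank in range(4):
    env = single_node_ddp_env(rank, world_size=4)
    print(rank, env["MASTER_ADDR"], env["MASTER_PORT"], env["LOCAL_RANK"])
```

With zero workers, every rank shares the same address and differs only in `RANK`/`LOCAL_RANK`, which is typically also the index of the GPU that process uses.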

## GPU Cluster Options

### For 8 GPUs:
- `p3.16xlarge` - 8x V100 (16GB each)
- `p4d.24xlarge` - 8x A100 (40GB each)
- `g5.48xlarge` - 8x A10G (24GB each)

### For 4 GPUs:
- `p3.8xlarge` - 4x V100 (16GB each)
- `g5.12xlarge` - 4x A10G (24GB each)
- `g4dn.12xlarge` - 4x T4 (16GB each)
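
If you want to pick an instance type programmatically, the lists above can be expressed as a small lookup table. This is only a sketch: the table repeats the values listed above, and the names `GPU_INSTANCES` and `candidate_instances` are made up for this example.

```python
# Instance table copied from the lists above: (GPU count, GPU model, GB per GPU).
GPU_INSTANCES = {
    "p3.16xlarge": (8, "V100", 16),
    "p4d.24xlarge": (8, "A100", 40),
    "g5.48xlarge": (8, "A10G", 24),
    "p3.8xlarge": (4, "V100", 16),
    "g5.12xlarge": (4, "A10G", 24),
    "g4dn.12xlarge": (4, "T4", 16),
}


def candidate_instances(num_gpus, min_gb_per_gpu=0):
    """Return instance types with exactly num_gpus GPUs and enough memory per GPU."""
    return sorted(
        name
        for name, (count, _model, gb) in GPU_INSTANCES.items()
        if count == num_gpus and gb >= min_gb_per_gpu
    )


# 8-GPU instances with at least 24 GB per GPU.
print(candidate_instances(8, min_gb_per_gpu=24))
```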


## Step 0: Check GPU Availability


In [None]:
import torch
import os

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"\nGPU {i}: {props.name}")
        print(f"  Memory: {props.total_memory / 1024**3:.2f} GB")

print(f"\nCurrent directory: {os.getcwd()}")
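
Since the cluster requirements ask for PyTorch 2.0+, it can help to turn the printed version into an explicit check. The following is a minimal sketch; `version_tuple` and `meets_minimum` are hypothetical helpers, and in the notebook you would pass `torch.__version__` instead of the literal strings used here.

```python
import re


def version_tuple(version):
    """Parse the leading 'major.minor' of a version string such as '2.0.1+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(2, 0)):
    """Return True if the parsed (major, minor) is at least `minimum`."""
    return version_tuple(version) >= minimum


# Example strings only; replace with torch.__version__ on the cluster.
for candidate in ("2.0.1+cu118", "1.13.1+cu117"):
    print(candidate, "OK" if meets_minimum(candidate) else "too old")
```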


## Step 1: Set Up Project & Install Dependencies


In [None]:
# Install dependencies
%pip install einops pyyaml tensorboard tqdm --quiet

print("Packages installed successfully!")


In [None]:
# Set project directory
import os
from pathlib import Path

# Modify this path based on where you uploaded the project files
PROJECT_DIR = "/dbfs/tmp/transformer-ddp-lm"  # Change as needed

print(f"Project directory: {PROJECT_DIR}")
os.makedirs(PROJECT_DIR, exist_ok=True)
os.chdir(PROJECT_DIR)
print(f"Changed to: {os.getcwd()}")
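
Later steps call several files by relative path, so a quick existence check right after changing directory can catch a missing upload early. This is a sketch: the file list is collected from the commands used in Steps 2 to 4, and `missing_project_files` is a hypothetical helper name.

```python
from pathlib import Path

# Paths referenced by later steps of this notebook, relative to the project root.
REQUIRED_FILES = [
    "databricks_train.py",
    "inference.py",
    "configs/default_config.yaml",
    "data/prepare_dataset.py",
]


def missing_project_files(project_dir, required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in required if not (root / rel).is_file()]


# Check the current working directory (the project directory after os.chdir).
missing = missing_project_files(".")
if missing:
    print("Missing project files:", ", ".join(missing))
else:
    print("All project files found.")
```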


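**Note**: The cell above only creates `PROJECT_DIR` and switches into it; it does not clone or copy any code. Upload the project files (or clone the repository) into that directory before running the next steps.

If the directory is a Git checkout, you can update the code later with: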
```python
%sh
cd /dbfs/tmp/transformer-ddp-lm
git pull
```


## Step 2: Prepare Dataset


In [None]:
# Create data directory and prepare toy dataset
!mkdir -p data
!python data/prepare_dataset.py --output data/toy_dataset.txt --repeat 100

# Verify dataset
import os
if os.path.exists("data/toy_dataset.txt"):
    size = os.path.getsize("data/toy_dataset.txt")
    print(f"\nDataset created: {size:,} bytes")
else:
    print("\nDataset creation failed!")
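
A file-size check confirms that something was written, but not what. As a rough sketch, the hypothetical helper `summarize_text_file` below reports a few basic statistics and a short preview; it assumes the dataset is a UTF-8 text file, which the `.txt` name suggests but the source does not state.

```python
from pathlib import Path


def summarize_text_file(path, preview_lines=3):
    """Return basic statistics and a short preview for a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    lines = text.splitlines()
    return {
        "characters": len(text),
        "lines": len(lines),
        "unique_characters": len(set(text)),
        "preview": lines[:preview_lines],
    }


# Only summarize if the dataset from the cell above exists.
dataset = Path("data/toy_dataset.txt")
if dataset.exists():
    for key, value in summarize_text_file(dataset).items():
        print(f"{key}: {value}")
```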


## Step 3: Train with Multiple GPUs (DDP)


In [None]:
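# Get number of available GPUs and fail early if there are none
import torch
num_gpus = torch.cuda.device_count()
if num_gpus == 0:
    raise RuntimeError("No CUDA GPUs detected; attach a GPU cluster before training.")

print(f"Starting DDP training with {num_gpus} GPUs...")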
print("="*80)

# Launch multi-GPU training
!python databricks_train.py --num-gpus {num_gpus} --config configs/default_config.yaml

print("\n" + "="*80)
print("Training command finished. Check the log above for errors.")
print("="*80)
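
The `!python` launch above does not raise a Python exception when the script exits with an error, so later cells would still run. One alternative, sketched here under the assumption that the script accepts exactly the flags shown above, is to launch it through `subprocess` and raise on a non-zero exit code. `run_checked` and `training_command` are hypothetical helper names.

```python
import subprocess
import sys


def run_checked(args):
    """Run a command, echo its output line by line, and raise on a non-zero exit."""
    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in process.stdout:
        print(line, end="")
    returncode = process.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, args)


def training_command(num_gpus, config="configs/default_config.yaml"):
    """Build the same command line used in the cell above."""
    return [
        sys.executable, "databricks_train.py",
        "--num-gpus", str(num_gpus),
        "--config", config,
    ]


# Print the command; call run_checked(...) on it to actually start training.
print(" ".join(training_command(num_gpus=8)))
```

Reading the output line by line keeps long training logs visible while the job runs, and the raised `CalledProcessError` stops a "Run all" before the inference step.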


## Step 4: Inference & Text Generation


In [None]:
# Generate text with trained model
!python inference.py --checkpoint checkpoints/best_model.pt --prompt "The Transformer is" --max-length 300 --temperature 0.8
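
For intuition about the `--temperature 0.8` flag: in most text generators, temperature divides the model's logits before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The sketch below shows that standard technique in plain Python. It is an assumption that `inference.py` works this way, and the logits are made-up numbers.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
for t in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures put more probability on the highest-scoring token, which tends to give more repetitive but more coherent text.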
