A Python project for training custom Large Language Models (LLMs) from scratch using a Transformer-based architecture with PyTorch.
This project provides a complete pipeline to:
- Download and preprocess datasets from Hugging Face
- Train a custom tokenizer (BPE-based)
- Train a Transformer-based language model from scratch
- Generate text using the trained model
- Resume training from checkpoints
- Track model performance with evaluation metrics
- Custom Transformer Architecture - Causal self-attention with multi-head attention
- Flexible Configuration - Easily adjustable model hyperparameters via JSON
- Dataset Management - Support for multiple datasets from Hugging Face
- Checkpoint System - Save and resume training at any epoch
- Distributed Training Ready - Multi-GPU support with mixed precision (fp16)
- Tokenizer Training - BPE tokenizer training for both English and French
- Evaluation Metrics - Per-epoch validation and prompt generation tracking
Install

```bash
git clone https://github.com/levashi/makeyourownllm
cd makeyourownllm
python3 -m venv venv
source venv/bin/activate
```

Note: install PyTorch yourself if you want GPU support, matching your CUDA version.

```bash
pip install -r requirements.txt
```

Dependencies:

- transformers (4.43.0) - Hugging Face transformers library
- torch - PyTorch deep learning framework
- accelerate (1.11.0) - Training acceleration utilities
- datasets (2.15.0) - Hugging Face datasets library
- tokenizers (>=0.19) - Fast tokenizer implementation
- bitsandbytes (0.40.0) - 8-bit quantization support
- numpy, scipy, tqdm - Numerical and utility libraries
- beautifulsoup4 - Web scraping utilities
```
{
    "embed_dim": 400,       # Embedding dimension
    "num_heads": 16,        # Number of attention heads
    "num_layers": 9,        # Number of transformer blocks
    "block_size": 256,      # Context window size
    "vocab_size": 30000,    # Vocabulary size
    "mlp_ratio": 5.0,       # MLP hidden dimension ratio
    "dropout": 0.2,         # Dropout probability
    "batch_size": 32,       # Training batch size
    "lr": 0.0001,           # Learning rate
    "epochs": 10            # Number of training epochs
}
```

Define the datasets to download and use for training.
```bash
python train.py mymodel \
    --generate-tokenizer \
    --import-datasets \
    --merge-datasets \
    --tokenizer-lang en
```

Arguments:

- `--generate-tokenizer` - Train a new BPE tokenizer
- `--import-datasets` - Download datasets from Hugging Face
- `--merge-datasets` - Merge datasets into train/val/test splits
- `--tokenizer-lang` - Language for the tokenizer (`en` or `fr`)
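A BPE tokenizer like this can be trained with the `tokenizers` library. The sketch below is illustrative only; the function name and special tokens are assumptions, not the project's exact setup:

```python
# Sketch: train a byte-level BPE tokenizer with the `tokenizers` library.
# Function name and special tokens are illustrative, not the project's exact code.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe_tokenizer(texts, vocab_size=30000):
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    )
    # train_from_iterator accepts any iterable of strings (files work too,
    # via tokenizer.train(files, trainer))
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

The resulting tokenizer can be persisted with `tokenizer.save("tokenizer.json")` and reloaded with `Tokenizer.from_file(...)`.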
```bash
python train.py mymodel \
    --tokenizer-path tokenizer.json \
    --config-path config.json \
    --eval-prompt "Once upon a time" \
    --train-seed 42
```

Training Arguments:

- `--tokenizer-path` - Path to the tokenizer file
- `--config-path` - Path to the model configuration
- `--eval-prompt` - Prompt for evaluation during training
- `--eval-max-new-tokens` - Max tokens for eval generation (default: 50)
- `--train-seed` - Random seed (default: 42)
- `--workers` - DataLoader workers (default: 0)
Resume from the last checkpoint:

```bash
python train.py mymodel --resume-training last
```

Resume from a specific epoch:

```bash
python train.py mymodel --resume-training 5
```

Resume from a checkpoint directory:

```bash
python train.py mymodel --resume-training path/to/checkpoint
```

Inference

```bash
python run.py "Your prompt here" \
    -m mymodel \
    --device cuda \
    --tokenizer-path mymodel/tokenizer.json \
    -o output.txt
```

Inference Arguments:

- `prompt` - Text prompt for generation
- `-m, --model-path` - Path to the trained model directory
- `--device` - Device to use (`cuda` or `cpu`)
- `-t, --tokenizer-path` - Path to the tokenizer file
- `--use-auto-tokenizer` - Use the Hugging Face auto tokenizer
- `-o, --output` - Output file path (optional)
Generation Parameters (config-run.json):

```json
{
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 50,
    "num_beams": 1,
    "max_new_tokens": 50,
    "do_sample": true,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "early_stopping": false,
    "seed": 42
}
```

CausalSelfAttention:
- Multi-head causal attention mechanism
- Prevents tokens from attending to future positions
- Supports configurable number of attention heads
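A minimal version of such a module might look like this. It is a sketch, not the project's exact implementation, and assumes `embed_dim` is divisible by `num_heads`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Illustrative multi-head causal self-attention."""
    def __init__(self, embed_dim, num_heads, block_size, dropout=0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "num_heads must evenly divide embed_dim"
        self.num_heads = num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # fused Q, K, V projection
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.drop = nn.Dropout(dropout)
        # lower-triangular mask blocks attention to future positions
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(t):  # (B, T, C) -> (B, heads, T, head_dim)
            return t.view(B, T, self.num_heads, C // self.num_heads).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = (q @ k.transpose(-2, -1)) / (C // self.num_heads) ** 0.5
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))
```

Because of the triangular mask, perturbing a later token never changes the output at earlier positions.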
TransformerBlock:
- Layer normalization before attention and MLP
- Feed-forward network with GELU activation
- Residual connections and dropout
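A pre-LayerNorm block with those ingredients can be sketched as follows (illustrative only; the attention module is injected to keep the example short):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of a pre-LayerNorm transformer block with a GELU MLP."""
    def __init__(self, embed_dim, attn, mlp_ratio=5.0, dropout=0.2):
        super().__init__()
        hidden = int(embed_dim * mlp_ratio)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = attn                      # e.g. a causal self-attention module
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))        # residual around attention
        x = x + self.mlp(self.ln2(x))         # residual around MLP
        return x
```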
myTransformer:
- Token and positional embeddings
- Stack of transformer blocks
- Language modeling head for next-token prediction
- Generation method with temperature, top-k, and top-p sampling
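The top-p (nucleus) part of that sampling can be sketched as a logit filter (an illustrative helper, not the project's actual code):

```python
import torch

def top_p_filter(logits, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability exceeds
    top_p; all other logits are set to -inf before sampling."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    # drop tokens whose cumulative probability *before* them already exceeds top_p
    remove = cumulative - probs > top_p
    sorted_logits[remove] = float("-inf")
    filtered = torch.full_like(logits, float("-inf"))
    filtered.scatter_(-1, sorted_idx, sorted_logits)
    return filtered
```

After filtering, one would divide by `temperature`, apply softmax, and draw from the result with `torch.multinomial`; top-k is the analogous filter keeping only the k largest logits.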
- Automatic mixed precision (fp16) using PyTorch's autocast
- Gradient scaling for stable training
- Warmup phase with linear increase
- Cosine annealing decay
- Configurable warmup steps and total steps
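That schedule can be expressed as a small closed-form function (a sketch; the project may implement it via a PyTorch scheduler instead):

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```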
- Save model, optimizer, and scheduler states
- Track best validation loss
- Save every N epochs
- Metadata tracking (epoch, global step, best loss)
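A checkpoint-saving helper in that spirit might look like this (illustrative; `save_checkpoint` is a hypothetical name, writing one file per state dict plus a metadata file):

```python
import json
import os
import torch

def save_checkpoint(dirpath, model, optimizer, scheduler,
                    epoch, global_step, best_val_loss):
    """Sketch: persist model/optimizer/scheduler state dicts plus meta.json."""
    os.makedirs(dirpath, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(dirpath, "model.pt"))
    torch.save(optimizer.state_dict(), os.path.join(dirpath, "optimizer.pt"))
    torch.save(scheduler.state_dict(), os.path.join(dirpath, "scheduler.pt"))
    with open(os.path.join(dirpath, "meta.json"), "w") as f:
        json.dump({"epoch": epoch, "global_step": global_step,
                   "best_val_loss": best_val_loss}, f)
```

Resuming is the mirror image: `load_state_dict(torch.load(...))` for each component, then read `meta.json` to restore the epoch counter and best loss.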
- Streaming dataset for memory efficiency
- Multi-worker data loading
- Per-worker seed management for reproducibility
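Per-worker seeding usually follows the standard PyTorch recipe (a sketch, assuming the dataset code uses NumPy and `random` internally):

```python
import random

import numpy as np
import torch

def seed_worker(worker_id):
    """Derive each DataLoader worker's NumPy and `random` seeds from torch's
    per-worker base seed, so shuffling/augmentation is reproducible."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

It is wired in via `DataLoader(..., num_workers=n, worker_init_fn=seed_worker)`.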
- `logs/{model_name}.log` - Training log file
- `{model_name}/eval_prompts.txt` - Generated text samples during training
- `{model_name}/checkpoints/{epoch}/` - Checkpoint directory containing:
  - `model.pt` - Model state dict
  - `optimizer.pt` - Optimizer state dict
  - `scheduler.pt` - Learning rate scheduler state dict
  - `meta.json` - Metadata (epoch, global_step, best_val_loss)
- `{model_name}/checkpoints/{model_name}.pt` - Final trained model weights
```bash
# 1. Create tokenizer and download datasets
python train.py my_custom_llm \
    --generate-tokenizer \
    --import-datasets \
    --merge-datasets \
    --tokenizer-lang en \
    --reset

# 2. Train for 10 epochs
python train.py my_custom_llm \
    --tokenizer-path tokenizer.json \
    --config-path config.json \
    --eval-prompt "The future of AI is"

# 3. Generate text
python run.py "The future of AI is" \
    -m my_custom_llm \
    --device cuda \
    -o generated_text.txt
```

- GPU Memory: Adjust `batch_size` in config.json based on available GPU VRAM
- Sequence Length: `block_size` affects memory use; start small and increase
- Num Heads: `num_heads` must evenly divide `embed_dim`
- Multi-GPU: Modify the code to use `torch.nn.DataParallel` or `DistributedDataParallel`
NaN/Inf in loss:
- Reduce learning rate
- Increase warmup steps
- Check for gradient overflow
Out of Memory:
- Reduce batch size
- Reduce block_size (sequence length)
- Use gradient accumulation
Poor generation quality:
- Train longer (increase epochs)
- Use larger model (increase embed_dim, num_layers)
- Increase dataset size/quality
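The gradient-accumulation suggestion above can be sketched like this (an illustrative helper; it divides the loss by the accumulation count so the accumulated gradients average correctly):

```python
import torch
import torch.nn as nn

def train_epoch_accum(model, loader, optimizer, accum_steps=4):
    """Sketch: accumulate gradients over accum_steps micro-batches to emulate
    a larger effective batch size without extra memory."""
    model.train()
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        logits = model(x)
        loss = nn.functional.cross_entropy(logits, y) / accum_steps
        loss.backward()                      # gradients add up across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

With `batch_size` 8 and `accum_steps` 4, the update is statistically close to a single batch of 32 at a quarter of the activation memory.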
License: MIT

Author: levashi
Note: This is a learning project for understanding transformer-based language models from first principles.