Train tiny language models entirely from scratch using native PyTorch!
This open-source project provides a complete implementation for training tiny language models from the ground up. Most of the core algorithmic code is written from scratch in native PyTorch. The project is both an open-source recreation of the full LLM training pipeline (pretraining, supervised fine-tuning, LoRA, and DPO) and an introductory tutorial to LLM development.
✨ Pure PyTorch Implementation: Core components are built from scratch using native PyTorch
- Multi-head attention mechanism
- Transformer blocks with RMSNorm
- Positional encodings (RoPE)
- Custom training loop and optimization
Complete Training Pipeline:
- Data loading and preprocessing
- Training with gradient clipping and learning rate scheduling
- Checkpointing and model saving
- Evaluation loop
Educational: Extensively documented code suitable for learning
Requirements:
- Python 3.10
- PyTorch 2.6.0
- transformers 4.57.1
- NumPy 1.26.4
pip install -r requirements.txt

cd ./trainer
torchrun --nproc_per_node=4 train_pretrain.py

cd ./trainer
torchrun --nproc_per_node=4 train_full_sft.py --model_path /home/xxx/data/models/Tiny-Language-Model/pretrain/pretrain_768_epoch2.pth

torchrun --nproc_per_node=4 train_lora.py --model_path /home/xxx/data/models/Tiny-Language-Model/full_sft/full_sft_768_epoch1.pth

torchrun --nproc_per_node=4 train_dpo.py --model_path /home/xxx/data/models/Tiny-Language-Model/full_sft/full_sft_768_epoch1.pth

python inference.py --model_path /home/xxx/data/models/Tiny-Language-Model/pretrain/pretrain_768_epoch2.pth --weight pretrain
python inference.py --model_path /home/xxx/data/models/Tiny-Language-Model/full_sft/full_sft_768_epoch1.pth --weight full_sft
python inference.py --model_path /home/xxx/data/models/Tiny-Language-Model/dpo/dpo_768_epoch0.pth --weight dpo
The model implements a decoder-only transformer architecture:
- Token Embeddings: Maps input tokens to dense vectors
- Positional Encodings: Rotary position embeddings (RoPE) applied to queries and keys
- Transformer Blocks: Multiple layers of:
  - Grouped query attention
  - SwiGLU feedforward networks
  - RMS normalization
  - Residual connections
- Language Model Head: Projects to vocabulary logits
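As a rough illustration of how the RMS normalization, SwiGLU feedforward, and residual pieces listed above fit together, here is a minimal sketch. The class names, the `eps` value, and the pre-norm ordering are assumptions made for this example and are not taken from `model_tinylm.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Rescale activations by their root-mean-square (no centering, no bias)."""

    def __init__(self, dim: int, eps: float = 1e-6):  # eps is an assumed default
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)


class SwiGLU(nn.Module):
    """Gated feedforward: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class FeedForwardBlock(nn.Module):
    """Hypothetical pre-norm residual wrapper: x + ffn(norm(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.norm(x))
```

The attention sub-layer would be wrapped in the same residual pattern; it is left out here to keep the sketch short.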
Tiny-Language-Model/
├── model/                      # Model architecture implementation
│   └── model_tinylm.py         # Tiny Language Model neural network
├── tokenizer/                  # Tokenization utilities
│   ├── tokenizer_config.json   # Tokenizer configuration file
│   └── tokenizer.json          # Tokenizer vocabulary and rules
├── dataset/                    # Data handling utilities
│   └── dataset.py              # Dataset loading and preprocessing
├── trainer/                    # Training utilities
│   ├── trainer_pretrain.py     # Pre-training script
│   ├── trainer_full_sft.py     # Supervised Fine-Tuning script
│   ├── trainer_lora.py         # LoRA script
│   ├── trainer_dpo.py          # DPO script
│   └── trainer_utils.py        # Training helper functions
├── inference.py                # Inference script
├── requirements.txt            # Python dependencies
└── README.md                   # Project documentation
GQA is an attention mechanism that bridges the gap between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). It provides a balance between model quality and inference efficiency by grouping multiple query heads to share the same key and value heads.
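The head sharing described above can be sketched in a few lines. This is an illustrative function under stated assumptions, not the project's implementation: the name `grouped_query_attention` and the tensor layout are hypothetical, and causal masking is omitted to keep the focus on the grouping.

```python
import math
import torch


def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    group_size = n_q_heads // n_kv_heads

    # Repeat each K/V head so that `group_size` consecutive query heads share it.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention, and with `n_kv_heads == 1` to multi-query attention. Only the smaller K/V tensors need to be cached at inference time, which is where the memory saving comes from.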
RoPE is a positional encoding method that encodes absolute positional information using rotation matrices, allowing the model to naturally capture relative positional relationships through the attention mechanism.
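To make the rotation concrete, here is a small sketch using the common "rotate half" formulation. The function names and the base of 10000 are assumptions for illustration; the project's own code may lay out the rotated pairs differently.

```python
import torch


def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute cos/sin tables of shape (seq_len, head_dim)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, head_dim/2)
    angles = torch.cat([angles, angles], dim=-1)                   # (seq, head_dim)
    return angles.cos(), angles.sin()


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map (x1, x2) to (-x2, x1), pairing dimension i with i + head_dim/2."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """x: (..., seq, head_dim). Rotates each dimension pair by a position-dependent angle."""
    return x * cos + rotate_half(x) * sin
```

Because queries and keys are both rotated by angles proportional to their own positions, the dot product between a query at position m and a key at position n depends only on the offset m - n. That is how an absolute encoding yields relative position information.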
For language modeling, we use causal (autoregressive) masking to ensure the model can only attend to previous tokens, not future ones.
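One common way to implement such a mask is to set the scores for future positions to negative infinity before the softmax. The sketch below assumes raw attention scores of shape `(..., seq, seq)`; the function name is hypothetical.

```python
import torch


def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """scores: (..., seq, seq) raw attention logits; returns masked softmax weights."""
    seq_len = scores.shape[-1]
    # Lower-triangular mask: position i may attend to positions 0..i only.
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device))
    masked = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(masked, dim=-1)
```

After the softmax, every entry above the diagonal is exactly zero, so each row is a probability distribution over the current and earlier tokens.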
The model generates text one token at a time, using its own predictions as input for the next step.
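A minimal decoding loop might look like the following. It assumes a hypothetical `model` callable that maps token ids of shape `(batch, seq)` to logits of shape `(batch, seq, vocab)`; the actual interface in `inference.py` may differ, and this sketch recomputes the full prefix at every step instead of using a KV cache.

```python
import torch


@torch.no_grad()
def generate(model, input_ids, max_new_tokens, temperature=1.0, eos_token_id=None):
    """Append up to `max_new_tokens` tokens to `input_ids`, one step at a time."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)[:, -1, :]  # logits for the next token only
        if temperature == 0:
            # Greedy decoding: always pick the most likely token.
            next_id = logits.argmax(dim=-1, keepdim=True)
        else:
            # Temperature sampling: higher values flatten the distribution.
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        # Feed the prediction back in as input for the next step.
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if eos_token_id is not None and (next_id == eos_token_id).all():
            break
    return input_ids
```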
- Start Small: Begin with a small model to verify your pipeline works
- Monitor Loss: Watch training and evaluation loss to detect overfitting
- Learning Rate: Use learning rate warmup and cosine decay for stable training
- Gradient Clipping: Clip the gradient norm to guard against exploding gradients
- Checkpointing: Save checkpoints regularly in case training is interrupted
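The warmup and cosine decay tip can be expressed as a small pure-Python function. This is one common formulation, not necessarily the schedule the training scripts use; `lr_at_step` and its parameters are names chosen for this example.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup to `max_lr`, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch loop, the returned value would typically be written to each `param_group["lr"]` of the optimizer before the update, with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` called between `loss.backward()` and `optimizer.step()`.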
Common hyperparameter ranges:
- Learning rate: 1e-4 to 5e-4
- Batch size: 8 to 64 (depending on GPU memory)
- Model dimension: 512 or 768
- Number of layers: 8 or 16
- Number of heads: 8
- Dropout: 0
Model size is fixed by the configuration; training speed depends on your hardware:
- Small model (512 dim, 8 layers): ~25M parameters, GPU required
- Large model (768 dim, 16 layers): ~104M parameters, GPU required
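The parameter counts above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below relies on several assumptions that are not stated in this README: a vocabulary of 6400 tokens, 2 key/value heads, an FFN hidden size of 8/3 of the model dimension rounded up to a multiple of 64, no bias terms, and input/output embeddings that share weights. Under those assumptions the totals land close to the figures quoted.

```python
def estimate_params(dim, n_layers, n_heads=8, n_kv_heads=2, vocab_size=6400):
    """Rough parameter count for a decoder-only model (all defaults except n_heads are assumed)."""
    head_dim = dim // n_heads
    kv_dim = head_dim * n_kv_heads

    # Attention: Q and output projections are dim x dim; K and V are dim x kv_dim (GQA).
    attn = 2 * dim * dim + 2 * dim * kv_dim

    # SwiGLU FFN: three projections with hidden size ~ 8/3 * dim, rounded up to a multiple of 64.
    hidden = (8 * dim + 2) // 3           # ceil(8 * dim / 3)
    hidden = 64 * ((hidden + 63) // 64)
    ffn = 3 * dim * hidden

    norms = 2 * dim                       # two RMSNorm weight vectors per layer
    per_layer = attn + ffn + norms

    embedding = vocab_size * dim          # shared with the LM head (assumed weight tying)
    final_norm = dim
    return n_layers * per_layer + embedding + final_norm


print(f"{estimate_params(512, 8):,}")    # 25,829,888
print(f"{estimate_params(768, 16):,}")   # 104,030,976
```

If the real configuration uses a different vocabulary size or number of key/value heads, the totals shift accordingly; the embedding table and the FFN projections dominate at this scale.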
Contributions are welcome! This project aims to be educational and accessible. Please:
- Keep implementations clear and well-documented
- Avoid adding unnecessary dependencies
- Include examples for new features
- Follow the existing code style
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation is designed for learning. Here are some concepts covered:
- Transformer architecture
- Self-attention mechanisms
- Tokenization algorithms (BPE)
- Training deep neural networks
- Text generation strategies
- Optimization techniques
If you use this code for research or educational purposes, please cite:
@software{tiny_language_model,
author = {Zilong Liu},
title = {Tiny Language Model: Train Small Language Models from Scratch},
year = {2026},
url = {https://github.com/lyuzlion/Tiny-Language-Model}
}

This project is inspired by:
- "Attention is All You Need" (Vaswani et al., 2017)
- Various open-source implementations of transformers
- The PyTorch community
Built with ❤️ for learning and education