This repository contains models, datasets, and source code for our paper "ViLegalLM: Language Models for Vietnamese Legal Text."
```
├── Checkpoints/                     # Pre-trained model checkpoints (Hugging Face format)
│   ├── ViLegalBERT/                 # 135M encoder-only model
│   ├── ViLegalQwen2.5-1.5B-Base/    # 1.54B decoder-only model
│   └── ViLegalQwen3-1.7B-Base/      # 1.72B decoder-only model
│
├── Datasets/                        # Synthetic datasets (CSV format)
│   ├── ViLegalTF/                   # True/False QA (13,032 train / 388 val)
│   ├── ViLegalMCQ/                  # Multiple Choice QA (14,920 train / 300 val)
│   └── ViLegalNLI/                  # Natural Language Inference (7,660 train / 150 val)
│
└── Source codes/
    ├── Pre-training ViLegalLM/      # Continual pre-training scripts
    │   ├── ViLegalBERT/             # MLM pre-training (Masked Language Modeling)
    │   └── ViLegalQwen/             # CLM pre-training (Causal Language Modeling)
    │
    └── Fine-tuning ViLegalLM/       # Task-specific fine-tuning (Jupyter Notebooks)
        ├── Information Retrieval/
        ├── Natural Language Inference/
        ├── Question Answering/
        │   ├── True/False/
        │   ├── Multiple-choice/
        │   ├── Multiple-choice Legal Knowledge/
        │   ├── Extractive QA/
        │   └── Abstractive QA/
        └── Syllogism Reasoning/
```
All models are in Hugging Face format and can be loaded with `transformers`:
| Model | Parameters | Context Length | Pre-training Objective | Checkpoint |
|---|---|---|---|---|
| ViLegalBERT | 135M | 256 | MLM (15% masking) | ViLegalBERT |
| ViLegalQwen2.5-1.5B-Base | 1.54B | 2,048 | CLM | ViLegalQwen2.5-1.5B-Base |
| ViLegalQwen3-1.7B-Base | 1.72B | 4,096 | CLM | ViLegalQwen3-1.7B-Base |
Usage:

```python
# For ViLegalBERT (note: the tokenizer is PhoBERT's)
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
```

```python
# For ViLegalQwen2.5-1.5B-Base
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
```

```python
# For ViLegalQwen3-1.7B-Base
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
```

Three synthetic datasets were created via LLM-based generation and hard negative mining:
- ViLegalTF: True/False questions with 4 difficulty levels
- ViLegalMCQ: Multiple choice questions (4 options) with 3 difficulty levels
- ViLegalNLI: Hard negative pairs from ALQAC and ZALO benchmarks
All datasets are in CSV format with train/validation splits.
All pre-training code is provided as Jupyter Notebooks in Source codes/Pre-training ViLegalLM/.
All fine-tuning code is provided as Jupyter Notebooks in Source codes/Fine-tuning ViLegalLM/.
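The table above lists ViLegalBERT's objective as MLM with 15% masking. As a concrete illustration, the sketch below implements the standard BERT-style masking recipe (80% `[MASK]`, 10% random token, 10% unchanged) in plain Python; the 80/10/10 split is the common BERT default and is an assumption here, not taken from the paper.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mlm_prob=0.15, seed=0):
    """BERT-style MLM masking: select ~15% of positions; of those,
    80% become mask_id, 10% a random token, 10% stay unchanged.
    Labels hold the original id at selected positions, -100 elsewhere
    (the ignore index used by cross-entropy in transformers)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok          # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

ids = list(range(100, 120))
inp, lab = mask_tokens(ids, vocab_size=64000, mask_id=4)
# Positions with label -100 are never altered; masked labels equal the originals.
assert all(l == -100 or l == t for l, t in zip(lab, ids))
```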
Notation:
- [FT]: Discriminative fine-tuning (encoder models)
- [IFT]: Instruction fine-tuning with QLoRA (decoder models)
- [FT-CV]: 5-fold cross-validation
ViLegalLM achieves state-of-the-art results across 10 Vietnamese legal benchmarks. See paper Section 6 for complete results across all benchmarks.
Pre-training (see Appendix A for complete hyperparameters):
- ViLegalBERT: 410K steps, A100 40GB, ~149 GPU hours
- ViLegalQwen2.5-1.5B: 22K steps, A100 40GB, ~175 GPU hours
- ViLegalQwen3-1.7B: 25.5K steps, H100 80GB, ~28 GPU hours
Fine-tuning (see Appendix B.3 for task-specific configurations):
- Encoder models: Discriminative fine-tuning on P100 16GB
- Decoder models: QLoRA (4-bit, rank=16, α=32) on P100 16GB
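A minimal configuration sketch matching the stated QLoRA setup (4-bit, rank=16, α=32), using `transformers` and `peft`. The `target_modules`, dropout, and quantization details beyond "4-bit" are illustrative assumptions, not the paper's exact configuration (see Appendix B.3 for that).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights (NF4 is an assumption)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ntphuc149/ViLegalQwen2.5-1.5B-Base",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,            # rank, as stated above
    lora_alpha=32,   # alpha, as stated above
    lora_dropout=0.05,                                        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```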
- Research use only: Models are NOT intended for production legal advice without expert validation.
- Hardware: Pre-training requires A100/H100 GPUs. Fine-tuning works on P100 16GB.