This repository contains models, datasets, and source code for our paper "ViLegalLM: Language Models for Vietnamese Legal Text."
```
├── Checkpoints/                     # Pre-trained model checkpoints (Hugging Face format)
│   ├── ViLegalBERT/                 # 135M encoder-only model
│   ├── ViLegalQwen2.5-1.5B-Base/    # 1.54B decoder-only model
│   └── ViLegalQwen3-1.7B-Base/      # 1.72B decoder-only model
│
├── Datasets/                        # Synthetic datasets (CSV format)
│   ├── ViLegalTF/                   # True/False QA (13,032 train / 388 val)
│   ├── ViLegalMCQ/                  # Multiple Choice QA (14,920 train / 300 val)
│   └── ViLegalNLI/                  # Natural Language Inference (7,660 train / 150 val)
│
└── Source codes/
    ├── Pre-training ViLegalLM/      # Continual pre-training scripts
    │   ├── ViLegalBERT/             # MLM pre-training (Masked Language Modeling)
    │   └── ViLegalQwen/             # CLM pre-training (Causal Language Modeling)
    │
    └── Fine-tuning ViLegalLM/       # Task-specific fine-tuning (Jupyter Notebooks)
        ├── Information Retrieval/
        ├── Natural Language Inference/
        ├── Question Answering/
        │   ├── True/False/
        │   ├── Multiple-choice/
        │   ├── Multiple-choice Legal Knowledge/
        │   ├── Extractive QA/
        │   └── Abstractive QA/
        └── Syllogism Reasoning/
```
All models are in Hugging Face format and can be loaded with `transformers`:
| Model | Parameters | Context Length | Pre-training Objective | Checkpoint |
|---|---|---|---|---|
| ViLegalBERT | 135M | 256 | MLM (15% masking) | ViLegalBERT |
| ViLegalQwen2.5-1.5B-Base | 1.54B | 2,048 | CLM | ViLegalQwen2.5-1.5B-Base |
| ViLegalQwen3-1.7B-Base | 1.72B | 4,096 | CLM | ViLegalQwen3-1.7B-Base |
Usage:

```python
# For ViLegalBERT (note: the tokenizer is PhoBERT's)
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
```

```python
# For ViLegalQwen2.5-1.5B-Base
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
```

```python
# For ViLegalQwen3-1.7B-Base
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
```

Three synthetic datasets were created via LLM-based generation and hard negative mining:
- ViLegalTF: True/False questions with 4 difficulty levels
- ViLegalMCQ: Multiple choice questions (4 options) with 3 difficulty levels
- ViLegalNLI: Hard negative pairs from ALQAC and ZALO benchmarks
All datasets are in CSV format with train/validation splits.
All pre-training code is provided as Jupyter Notebooks in Source codes/Pre-training ViLegalLM/.
All fine-tuning code is provided as Jupyter Notebooks in Source codes/Fine-tuning ViLegalLM/.
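The table above lists ViLegalBERT's objective as MLM with 15% masking. As a concrete illustration, the sketch below implements the standard BERT-style masking recipe (80% `[MASK]`, 10% random token, 10% unchanged) in plain Python; the 80/10/10 split is the common BERT default and is an assumption here, not taken from the paper.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mlm_prob=0.15, seed=0):
    """BERT-style MLM masking: select ~15% of positions; of those,
    80% become mask_id, 10% a random token, 10% stay unchanged.
    Labels hold the original id at selected positions, -100 elsewhere
    (the ignore index used by cross-entropy in transformers)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok          # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

ids = list(range(100, 120))
inp, lab = mask_tokens(ids, vocab_size=64000, mask_id=4)
# Positions with label -100 are never altered; masked labels equal the originals.
assert all(l == -100 or l == t for l, t in zip(lab, ids))
```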
Notation:
- [FT]: Discriminative fine-tuning (encoder models)
- [IFT]: Instruction fine-tuning with QLoRA (decoder models)
- [FT-CV]: 5-fold cross-validation
ViLegalLM achieves state-of-the-art results across 10 Vietnamese legal benchmarks. See paper Section 6 for complete results across all benchmarks.
Pre-training (see Appendix A for complete hyperparameters):
- ViLegalBERT: 410K steps, A100 40GB, ~149 GPU hours
- ViLegalQwen2.5-1.5B: 22K steps, A100 40GB, ~175 GPU hours
- ViLegalQwen3-1.7B: 25.5K steps, H100 80GB, ~28 GPU hours
Fine-tuning (see Appendix B.3 for task-specific configurations):
- Encoder models: Discriminative fine-tuning on P100 16GB
- Decoder models: QLoRA (4-bit, rank=16, α=32) on P100 16GB
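A minimal configuration sketch matching the stated QLoRA setup (4-bit, rank=16, α=32), using `transformers` and `peft`. The `target_modules`, dropout, and quantization details beyond "4-bit" are illustrative assumptions, not the paper's exact configuration (see Appendix B.3 for that).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights (NF4 is an assumption)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ntphuc149/ViLegalQwen2.5-1.5B-Base",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,            # rank, as stated above
    lora_alpha=32,   # alpha, as stated above
    lora_dropout=0.05,                                        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```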
- Research use only: Models are NOT intended for production legal advice without expert validation.
- Hardware: Pre-training requires A100/H100 GPUs. Fine-tuning works on P100 16GB.