ViLegalLM: Language Models for Vietnamese Legal Text [ACL 2026]

This repository contains models, datasets, and source code for our paper "ViLegalLM: Language Models for Vietnamese Legal Text."

📁 Repository Structure

├── Checkpoints/                         # Pre-trained model checkpoints (Hugging Face format)
│   ├── ViLegalBERT/                     # 135M encoder-only model
│   ├── ViLegalQwen2.5-1.5B-Base/        # 1.54B decoder-only model
│   └── ViLegalQwen3-1.7B-Base/          # 1.72B decoder-only model
│
├── Datasets/                            # Synthetic datasets (CSV format)
│   ├── ViLegalTF/                       # True/False QA (13,032 train / 388 val)
│   ├── ViLegalMCQ/                      # Multiple Choice QA (14,920 train / 300 val)
│   └── ViLegalNLI/                      # Natural Language Inference (7,660 train / 150 val)
│
└── Source codes/
    ├── Pre-training ViLegalLM/          # Continual pre-training scripts
    │   ├── ViLegalBERT/                 # MLM pre-training (Masked Language Modeling)
    │   └── ViLegalQwen/                 # CLM pre-training (Causal Language Modeling)
    │
    └── Fine-tuning ViLegalLM/           # Task-specific fine-tuning (Jupyter Notebooks)
        ├── Information Retrieval/
        ├── Natural Language Inference/
        ├── Question Answering/
        │   ├── True/False/
        │   ├── Multiple-choice/
        │   ├── Multiple-choice Legal Knowledge/
        │   ├── Extractive QA/
        │   └── Abstractive QA/
        └── Syllogism Reasoning/

🤖 Models

All models are in Hugging Face format and can be loaded with transformers:

| Model | Parameters | Context Length | Pre-training Objective | Checkpoint |
|---|---|---|---|---|
| ViLegalBERT | 135M | 256 | MLM (15% masking) | ViLegalBERT |
| ViLegalQwen2.5-1.5B-Base | 1.54B | 2,048 | CLM | ViLegalQwen2.5-1.5B-Base |
| ViLegalQwen3-1.7B-Base | 1.72B | 4,096 | CLM | ViLegalQwen3-1.7B-Base |
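The table notes that ViLegalBERT was pre-trained with MLM at a 15% masking rate. As background, here is a minimal sketch of the standard BERT-style masking procedure (of the selected positions, 80% become the mask token, 10% a random token, 10% stay unchanged). This is the conventional recipe, not necessarily the exact implementation in this repo's notebooks:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption. Selects ~mask_prob of positions; of those,
    80% -> mask token, 10% -> random token, 10% -> unchanged.
    Returns (corrupted inputs, labels) with -100 at unselected positions,
    the value ignored by cross-entropy losses in common frameworks."""
    rng = rng or random.Random(0)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok          # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id  # replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # random token
            # else: keep the original token (10% of selections)
    return inputs, labels
```

The -100 label convention matches what Hugging Face data collators emit, so a function like this slots directly into a masked-LM training loop.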

Usage:

```python
# For ViLegalBERT
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# For ViLegalQwen2.5-1.5B-Base
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen2.5-1.5B-Base")

# For ViLegalQwen3-1.7B-Base
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
model = AutoModelForCausalLM.from_pretrained("ntphuc149/ViLegalQwen3-1.7B-Base")
```

📊 Datasets

Three synthetic datasets created via LLM-based generation and hard negative mining:

  • ViLegalTF: True/False questions with 4 difficulty levels
  • ViLegalMCQ: Multiple choice questions (4 options) with 3 difficulty levels
  • ViLegalNLI: Hard negative pairs from ALQAC and ZALO benchmarks

All datasets are in CSV format with train/validation splits.
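The CSV splits can be read with the standard library (or pandas / Hugging Face datasets). A minimal sketch using an inline stand-in file; the column names `question` and `label` are assumptions for illustration, so inspect each file's header row for the actual schema:

```python
import csv
import io

# Toy stand-in for e.g. Datasets/ViLegalTF/train.csv. The real files may
# use different column names -- check the header row before relying on them.
sample = io.StringIO(
    "question,label\n"
    '"Điều 1 có hiệu lực từ năm 2020?",True\n'
    '"Mức phạt tối đa là 5 triệu đồng?",False\n'
)

rows = list(csv.DictReader(sample))
print(len(rows), rows[0]["label"])  # 2 True
```

For real use, replace the `StringIO` with `open("Datasets/ViLegalTF/train.csv", encoding="utf-8")`; UTF-8 matters for Vietnamese text.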

🚀 Quick Start

Pre-training

All pre-training code is provided as Jupyter Notebooks in Source codes/Pre-training ViLegalLM/.

Fine-tuning

All fine-tuning code is provided as Jupyter Notebooks in Source codes/Fine-tuning ViLegalLM/.

Notation:

  • [FT]: Discriminative fine-tuning (encoder models)
  • [IFT]: Instruction fine-tuning with QLoRA (decoder models)
  • [FT-CV]: 5-fold cross-validation

📈 Key Results

ViLegalLM achieves state-of-the-art results across 10 Vietnamese legal benchmarks. See paper Section 6 for complete results across all benchmarks.

⚙️ Training Configuration

Pre-training (see Appendix A for complete hyperparameters):

  • ViLegalBERT: 410K steps, A100 40GB, ~149 GPU hours
  • ViLegalQwen2.5-1.5B: 22K steps, A100 40GB, ~175 GPU hours
  • ViLegalQwen3-1.7B: 25.5K steps, H100 80GB, ~28 GPU hours

Fine-tuning (see Appendix B.3 for task-specific configurations):

  • Encoder models: Discriminative fine-tuning on P100 16GB
  • Decoder models: QLoRA (4-bit, rank=16, α=32) on P100 16GB
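The QLoRA hyperparameters above (4-bit quantization, rank=16, α=32) translate to a configuration along these lines. Field names mirror common `peft`/`bitsandbytes` conventions purely as an illustration; the quantization type and dropout value are assumptions not stated in this README, so consult the fine-tuning notebooks for the exact settings:

```python
# QLoRA settings stated in this README: 4-bit base weights, rank 16, alpha 32.
qlora_config = {
    "load_in_4bit": True,          # quantize frozen base weights to 4-bit
    "bnb_4bit_quant_type": "nf4",  # assumption: NF4 is the usual QLoRA default
    "lora_r": 16,                  # LoRA rank (adapter bottleneck dimension)
    "lora_alpha": 32,              # LoRA scaling numerator
    "lora_dropout": 0.05,          # assumption: not specified in this README
}

# The LoRA update is scaled by alpha / r before being added to the base weights.
scaling = qlora_config["lora_alpha"] / qlora_config["lora_r"]  # 2.0
```

With α = 2r, the adapter update is scaled by 2.0, a common default that keeps the update magnitude stable if the rank is changed.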

⚠️ Important Notes

  • Research use only: Models are NOT intended for production legal advice without expert validation.
  • Hardware: Pre-training requires A100/H100 GPUs. Fine-tuning works on P100 16GB.
