News: Accepted to ACL 2026 Main Conference.
Authors: Anh Trac Duc Dinh, Khang Hoang Nhat Vo, Tai Tien Ta, Vinh Cong Doan, Tho Quan
Camera-ready version is not available yet. We will update the paper and citation information after the conference.
This repository implements a morpheme-aware Transformer architecture that enhances pretrained encoders with explicit morphological structure for isolating languages.
By introducing two lightweight inductive biases:

- Adaptive Boundary-Token Fusion
- Morpheme-Aware Attention Bias

the model effectively captures compound cohesion and morpheme boundaries that standard Transformers often overlook, while remaining optimized for Vietnamese.
The design is portable to other isolating languages like Mandarin Chinese, consistently improving performance on syntactic tasks such as POS tagging and Named Entity Recognition (NER).
Subword Alignment:

- Syncing Labels: This step aligns word-structure tags (BMES) with the smaller sub-units of text created during tokenization:
  - B (Begin): Marks the first syllable or character of a multi-syllable word.
  - M (Middle): Applied to the internal syllables or characters of a multi-syllable word.
  - E (End): Marks the final syllable or character of a multi-syllable word.
  - S (Single): Used for standalone words that consist of only one syllable or character.
- Maintaining Structure: By expanding these tags, the model ensures that multi-syllable words keep their linguistic meaning even when broken into pieces.
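For intuition, the tag-syncing step above can be sketched as a small expansion routine. This is a hypothetical sketch: the function name and the exact expansion rule are assumptions for illustration, not the repository's actual code.

```python
def expand_bmes(tags, pieces_per_token):
    """Expand one BMES tag per syllable into one tag per subword piece.

    tags:             BMES tag for each original syllable/character.
    pieces_per_token: number of subword pieces each syllable was split into.
    (Illustrative assumption of how tags propagate when the tokenizer splits
    a syllable into several pieces.)
    """
    expanded = []
    for tag, n in zip(tags, pieces_per_token):
        if n == 1:
            expanded.append(tag)
        elif tag == "B":   # first piece keeps B, the rest become M
            expanded += ["B"] + ["M"] * (n - 1)
        elif tag == "M":   # interior syllable: every piece is M
            expanded += ["M"] * n
        elif tag == "E":   # last piece keeps E, earlier pieces are M
            expanded += ["M"] * (n - 1) + ["E"]
        else:              # "S": a single word split into pieces becomes a B..E span
            expanded += ["B"] + ["M"] * (n - 2) + ["E"]
    return expanded
```

For example, a two-syllable word tagged `["B", "E"]` whose second syllable is split into two subword pieces would expand to `["B", "M", "E"]`, keeping the word's span intact.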
Adaptive Interpolation Layer

- Blending Information: This module combines standard word data with specific "boundary" information that marks where words start and end.
- Smart Filtering: A "gate" automatically decides how much boundary information is needed for each word based on its context.
- Rich Representation: The result is a more complete representation of the text that respects the natural boundaries of the language.
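The gating idea above can be sketched as a small PyTorch module: a learned sigmoid gate decides, per token and per dimension, how much boundary signal to blend in. The class name, shapes, and gate parameterization are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class GatedBoundaryFusion(nn.Module):
    """Hypothetical sketch of adaptive interpolation between token states
    and boundary embeddings via a learned sigmoid gate."""

    def __init__(self, hidden_size):
        super().__init__()
        # Gate conditioned on both the token state and the boundary embedding.
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, token_states, boundary_states):
        g = torch.sigmoid(self.gate(torch.cat([token_states, boundary_states], dim=-1)))
        # Convex combination: g near 1 keeps the token state,
        # g near 0 leans on the boundary signal.
        return g * token_states + (1 - g) * boundary_states
```

Because the gate is a sigmoid, the output always stays on the line segment between the two inputs, which keeps the fused representation close to the pretrained geometry.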
This module guides the model's focus by injecting a fixed structural prior into the early self-attention layers. It ensures that the "attention" mass respects the natural boundaries of compounds rather than spreading too thin across unrelated words.
The bias is controlled by a matrix using four key parameters to modulate relationship scores:

- Alpha ($\alpha$): Strengthens focus between tokens that belong to the same compound phrase.
- Beta ($\beta$): Penalizes or "mutes" attention between tokens that belong to different compounds.
- Gamma ($\gamma$): Highlights and adjusts the importance of single-word units.
- Delta ($\delta$): Controls the strength of a token's focus on itself (self-attention bias).
By reweighting these connections, the model maintains a stable internal geometry while gaining a clearer understanding of linguistic structure. This method works not only with Vietnamese but also with other isolating languages such as Mandarin Chinese and Thai.
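For intuition, the four parameters could be assembled into a bias matrix along these lines. This is a hypothetical sketch: the function name, the use of per-token word ids, and the construction order are assumptions for illustration, not the repository's implementation.

```python
import torch


def structural_bias(word_ids, alpha=0.5, beta=-0.3, gamma=0.0, delta=0.0):
    """Build an (n x n) additive attention-bias matrix from word ids.

    word_ids: 1-D integer tensor assigning each token to a word/compound;
    tokens of the same multi-token compound share an id. (Assumed input
    format for illustration.)
    """
    n = word_ids.numel()
    same = word_ids[:, None] == word_ids[None, :]
    bias = torch.full((n, n), beta)              # beta: cross-compound penalty
    bias[same] = alpha                           # alpha: within-compound boost
    diag = torch.arange(n)
    singles = torch.bincount(word_ids)[word_ids] == 1
    bias[diag[singles], diag[singles]] = gamma   # gamma: single-word units
    bias[diag, diag] += delta                    # delta: self-attention strength
    return bias
```

In this sketch the matrix would be added to the pre-softmax attention scores of the targeted heads, so positive entries sharpen within-compound attention and negative entries dampen cross-compound attention.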
HuTieuBERT depends on VnCoreNLP for Vietnamese word segmentation. Before running any tokenizer or model example in this repository, please download and set up VnCoreNLP following the official repository: vncorenlp/VnCoreNLP.
Recommended setup:

```shell
pip install py_vncorenlp
```

```python
import py_vncorenlp

py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")
```

Notes:
- Java 1.8 or newer is required by VnCoreNLP.
- Set `vncorenlp_dir` to the same directory you used in `download_model(...)`.
```python
import torch
import torch.nn as nn
from transformers.models.roberta.modeling_roberta import RobertaModel, RobertaEncoder, RobertaLayer
from transformers import RobertaConfig
from model.tokenizer import MorphemeAwareTokenizer
from model.embeddings import BoundaryAwareEmbeddings
from model.model import MorphemeAwareRobertaModel, MorphemeAwareRobertaForSequenceClassification

tokenizer = MorphemeAwareTokenizer.from_pretrained(
    "ducanhdinh/HuTieuBert",
    vncorenlp_dir="/content/vncorenlp",
    return_tensors="pt"
)

config = RobertaConfig.from_pretrained("ducanhdinh/HuTieuBert")

# Apply the structural bias matrix to layers 1 and 2, across all 12 heads
target_heads = {
    1: list(range(config.num_attention_heads)),
    2: list(range(config.num_attention_heads)),
}

label_num = 2  # set to the number of labels in your classification task

model = MorphemeAwareRobertaForSequenceClassification(
    config,
    num_labels=label_num,
    target_heads=target_heads,
    alpha=0.5,
    beta=-0.3,
    gamma=0.0,
    delta=0.0,
)
model.roberta = MorphemeAwareRobertaModel.from_pretrained(
    "ducanhdinh/HuTieuBert",
    target_heads=target_heads,
    alpha=0.5,
    beta=-0.3,
    gamma=0.0,
    delta=0.0,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

The full paper describing our method and experimental results will be released soon. If you find this work useful, please consider citing our paper:
```bibtex
@article{,
  title   = {Coming Soon},
  author  = {Coming Soon},
  journal = {Coming Soon},
  year    = {Coming Soon},
}
```

This work makes use of VnCoreNLP, a Vietnamese natural language processing toolkit.
Copyright (C) 2018-2019 VnCoreNLP
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.
Repository: https://github.com/vncorenlp/VnCoreNLP




