
HoogBERTa

This repository provides the Thai pretrained language representation model (HoogBERTa_base) and a fine-tuned model for multi-task sequence labeling.

Installation

$ pip install -r requirements.txt
$ pip install --editable .

To download the models, use

>>> import hoogberta
>>> hoogberta.download() # or hoogberta.download('/home/user/.hoogberta/')

Usage

See test.py for a usage example.

Documentation

To annotate POS, NE, and clause boundaries, use the following commands:

from hoogberta.multitagger import HoogBERTaMuliTaskTagger
tagger = HoogBERTaMuliTaskTagger(cuda=False) # or cuda=True
output = tagger.nlp("วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ")

Pass the "base_path" parameter if you have moved the "models" directory to a location other than the current one, for example:

tagger = HoogBERTaMuliTaskTagger(cuda=False, base_path="/home/user/.hoogberta/")

The output is a list of annotations (token, POS, NE, MARK). "MARK" is the annotation for a single white space, which can be either PUNC (not a clause boundary) or MARK (a clause boundary). Note that, for clause boundary classification, the current pretrained model works well with inputs containing two clauses. For a more precise result, we recommend running tagger.nlp iteratively over shorter segments.
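
A minimal sketch of consuming the output from the call above, and of the iterative pattern (the segment list below is a hypothetical pre-segmentation, not a library helper):

# Each entry of the output carries (token, POS, NE, MARK).
for annotation in output:
    print(annotation)

# Run tagger.nlp iteratively over shorter, pre-segmented inputs.
segments = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
annotations = [tagger.nlp(segment) for segment in segments]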

To extract token features based on the RoBERTa architecture, use the following commands:

from hoogberta.encoder import HoogBERTaEncoder
encoder = HoogBERTaEncoder(cuda=False) # or cuda=True
token_ids, features = encoder.extract_features("วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ")

For batch processing,

inputText = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
token_ids, features = encoder.extract_features_batch(inputText)

To use HoogBERTa as an embedding layer, use

tokens, features = encoder.extract_features_from_tensor(token_ids) # where token_ids is a tensor with type "long".
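
A short sketch tying these calls together, reusing the token_ids produced by the batch call as input to the embedding-layer call (variable names are illustrative):

from hoogberta.encoder import HoogBERTaEncoder

encoder = HoogBERTaEncoder(cuda=False)

# Tokenize a batch of sentences; token_ids is a long tensor.
inputText = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
token_ids, features = encoder.extract_features_batch(inputText)

# Feed the same long tensor back through the encoder, treating
# HoogBERTa as an embedding layer.
tokens, embeddings = encoder.extract_features_from_tensor(token_ids)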

Hugging Face Models

  1. HoogBERTaEncoder
  • HoogBERTa: Feature Extraction and Masked Language Modeling
  2. HoogBERTaMuliTaskTagger
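
As a hedged sketch of loading the encoder through the transformers library (the model ID lst-nectec/HoogBERTa and any input preprocessing conventions are assumptions; check the model cards on the Hugging Face hub):

from transformers import AutoModel, AutoTokenizer

# The model ID below is an assumption, not stated in this README.
model_id = "lst-nectec/HoogBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("วันที่ 12 มีนาคมนี้", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state holds token features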

Citation

Please cite as:

@inproceedings{porkaew2021hoogberta,
  title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year = {2021},
  address = {Online}
}

Download full-text PDF