Add tokenisation support for the Tetun language #224

raphaelmerx · 2021-03-13T10:44:24Z

Description

Adds tokenisation support for the Tetun Dili language (code tdt). Tetun has words that contain apostrophes (e.g., "I" in Tetun is "ha'u"). The tokeniser keeps an apostrophe when it is part of a word, but splits it off as a separate token when it is used as a quotation mark.
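The rule above can be illustrated with a minimal Python sketch. It is not the Perl code from this PR: the function name `split_quote_apostrophes` is hypothetical, and it assumes that "part of a word" means the apostrophe has a letter on both sides.

```python
import re

# Hypothetical helper for illustration only; not the logic in tokenizer.perl.
# Assumption: an apostrophe is word-internal when a letter sits on both sides.
# [^\W\d_] matches any Unicode letter (a word character that is not a digit or "_").
QUOTE_APOSTROPHE = re.compile(r"(?<![^\W\d_])'|'(?![^\W\d_])")


def split_quote_apostrophes(text: str) -> str:
    """Pad quote-like apostrophes with spaces, leave word-internal ones alone."""
    padded = QUOTE_APOSTROPHE.sub(" ' ", text)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(padded.split())


if __name__ == "__main__":
    print(split_quote_apostrophes("dehan 'oh kadeira sa'e ona'"))
```

Under this assumption "ha'u" and "sa'e" stay intact, while the apostrophes around a quoted span become separate tokens. A closing quote followed directly by a letter would be misread as word-internal, so a real tokeniser may need extra rules.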

Test

echo "bainhira sa'e aviaun husi Austrália ba Timor-Leste dehan 'oh kadeira sa'e ona'" | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 1 -l tdt -no-escape
Tokenizer Version 1.1
Language: tdt
Number of threads: 1
bainhira sa'e aviaun husi Austrália ba Timor-Leste dehan ' oh kadeira sa'e ona '

hieuhoang · 2021-03-15T03:37:04Z

thanks!

Add tokenisation support for the Tetun language

75d4c67

hieuhoang merged commit 0036c6c into moses-smt:master Mar 15, 2021

raphaelmerx mentioned this pull request Mar 27, 2021

Add tokenization for Tetun (tdt) hplt-project/sacremoses#114

Open

jelmervdl mentioned this pull request Sep 27, 2023

Add tokenization for Tetun Dili (tdt) hplt-project/sacremoses#144

Open
