prepare-tokenizer

Prepare SentencePiece (T5, Llama2) and byte-level BPE (GPT2, RoBERTa) tokenizers on Malaysian texts (Jawi, Melayu, Manglish, Mandarin, Tamil).

datasets used

  1. https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset
  2. https://huggingface.co/datasets/mesolitica/translated-code-instructions-122k
  3. https://huggingface.co/datasets/mesolitica/translated-unnatural_code_instructions_20M
  4. https://huggingface.co/datasets/mesolitica/translated-python-evol-instruct-51k
  5. https://huggingface.co/datasets/mesolitica/google-translate-ms-pa
  6. https://huggingface.co/datasets/mesolitica/google-translate-ms-ta
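
All of these corpora are hosted on the Hugging Face Hub, so they can be streamed straight into a tokenizer trainer. A minimal sketch using the datasets library, assuming streaming access and a 'text' column; the subset name below is a placeholder, not necessarily a real config of dedup-text-dataset:

from datasets import load_dataset

# Placeholder subset name; dedup-text-dataset ships multiple subsets,
# pick one from the dataset card.
subset = 'SUBSET_NAME'
dataset = load_dataset('malaysia-ai/dedup-text-dataset', subset, split='train', streaming=True)

def batch_iterator(batch_size=1000):
    # Yield batches of raw strings; both SentencePiece and BPE trainers
    # can consume an iterator like this.
    batch = []
    for row in dataset:
        batch.append(row['text'])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch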

how-to

SentencePiece

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/sentencepiece-tokenizer')
tokenizer.encode('husein comel')  # Melayu: "husein is cute"
tokenizer.encode('husein cute')  # Manglish / English
tokenizer.encode('حسين چوميل')  # Jawi script of "husein comel"
tokenizer.encode('侯赛因很可爱')  # Mandarin: "Hussein is very cute"
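
To check that the vocabulary actually covers each script instead of falling back to unknown pieces, the ids can be mapped back to their subword pieces; this only uses standard tokenizer methods:

ids = tokenizer.encode('حسين چوميل')
tokenizer.convert_ids_to_tokens(ids)  # inspect the subword pieces for the Jawi input
tokenizer.decode(ids)  # should round-trip back to (roughly) the original string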

BPE

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/bpe-tokenizer')
tokenizer.encode('husein comel')  # Melayu: "husein is cute"
tokenizer.encode('husein cute')  # Manglish / English
tokenizer.encode('حسين چوميل')  # Jawi script of "husein comel"
tokenizer.encode('侯赛因很可爱')  # Mandarin: "Hussein is very cute"
tokenizer.encode('ஹுசைன் அழகாக இருக்கிறார்')  # Tamil: "Hussein is cute"
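
A quick way to compare the two tokenizers is to count how many tokens each produces for the same sentence; fewer tokens usually indicates better coverage of that language. A small sketch (per-script coverage may differ between the two vocabularies):

from transformers import AutoTokenizer

sp = AutoTokenizer.from_pretrained('malaysia-ai/sentencepiece-tokenizer')
bpe = AutoTokenizer.from_pretrained('malaysia-ai/bpe-tokenizer')
texts = ['husein comel', 'حسين چوميل', '侯赛因很可爱', 'ஹுசைன் அழகாக இருக்கிறார்']
for t in texts:
    print(t, len(sp.encode(t)), len(bpe.encode(t)))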

how-to train

  1. Train SentencePiece,
python3 train-sentencepiece.py

When training SentencePiece,

  • Always partition long texts into shorter chunks first (a chunking sketch is given after this list).

We use an Azure Standard_HB60-15rs instance to train.

  2. Train BPE,
python3 train-bpe.py

We use an Azure Standard_HB60-15rs instance to train; sketches of both training steps follow below.
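
The exact settings inside train-sentencepiece.py and train-bpe.py are not reproduced here; the two sketches below only illustrate the steps above under stated assumptions.

First, the partitioning note for SentencePiece: long documents are chunked into shorter lines before training, since the trainer treats each input line as a sentence. File names, vocab size and character coverage are assumptions:

import sentencepiece as spm

def partition(text, max_chars=2000):
    # Chunk a long document so every line stays well under the
    # trainer's per-sentence length limit.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Write one chunk per line; replace the list below with the real corpora.
with open('partitioned.txt', 'w') as fout:
    for doc in ['a very long Malaysian document ...']:
        for chunk in partition(doc):
            fout.write(chunk + '\n')

spm.SentencePieceTrainer.train(
    input='partitioned.txt',
    model_prefix='sentencepiece',  # assumed output prefix
    vocab_size=32000,              # assumed vocabulary size
    character_coverage=0.9995,     # keep rare Jawi / Mandarin / Tamil characters
)

Second, a GPT2/RoBERTa-style byte-level BPE can be trained with the tokenizers library; the vocabulary size, special tokens and text iterator below are likewise assumptions:

from tokenizers import ByteLevelBPETokenizer

def text_iterator():
    # Replace with an iterator over the Malaysian corpora listed above.
    yield 'husein comel'
    yield 'حسين چوميل'

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    text_iterator(),
    vocab_size=32000,      # assumed
    min_frequency=2,       # assumed
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'],  # RoBERTa-style, assumed
)
tokenizer.save_model('.')  # writes vocab.json and merges.txt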
