Tokenizer for languages written in scripts of the Brahmic family.
cd tokenizer
maturin develop

The file telugu.json contains the trained vocabulary for the Telugu language. It is used to initialize the Tokenizer class from the brahmi_script module.
import brahmi_script
tokenizer = brahmi_script.Tokenizer("telugu", "telugu.json")
encoded_string = tokenizer.encode("తెలుగు")
encoded_file = tokenizer.encode_file("telugu_text.txt")
decoded_string = tokenizer.decode(encoded_string)
assert decoded_string == "తెలుగు"
# Read the original text and compare it with the decoded file contents.
with open("telugu_text.txt", "r", encoding="utf-8") as f:
    file_text = f.read()
decoded_file = tokenizer.decode(encoded_file)
assert decoded_file == file_text

import sentencepiece as spm
import brahmi_script
brahmi = brahmi_script.Tokenizer("telugu", "telugu.json")
unigram = spm.SentencePieceProcessor(model_file='sp/uni-8000.model')
transformed = brahmi.transform_encode("తెలుగు")
encoded = unigram.encode(transformed, out_type=int)
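
The round-trip assertions in the first example can be generalised into a small check that runs over several sample strings. The following is a minimal sketch, not part of brahmi_script: the helper `failed_round_trips` and the stand-in `CodePointTokenizer` are hypothetical names introduced here, and the only assumption about the real tokenizer is that it exposes `encode` and `decode` as shown above.

```python
from typing import Iterable, List


def failed_round_trips(tokenizer, samples: Iterable[str]) -> List[str]:
    """Return the samples for which decode(encode(s)) differs from s.

    `tokenizer` is any object with encode() and decode() methods
    (an assumption; brahmi_script.Tokenizer is used this way above).
    """
    return [s for s in samples if tokenizer.decode(tokenizer.encode(s)) != s]


class CodePointTokenizer:
    """Hypothetical stand-in so the sketch runs without brahmi_script."""

    def encode(self, text: str) -> List[int]:
        # One id per Unicode code point.
        return [ord(ch) for ch in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(chr(i) for i in ids)


if __name__ == "__main__":
    samples = ["తెలుగు", "భాష"]
    # An empty list means every sample survived the round trip.
    print(failed_round_trips(CodePointTokenizer(), samples))
```

With the real module installed, passing `brahmi_script.Tokenizer("telugu", "telugu.json")` in place of the stand-in should work as long as its `encode` and `decode` behave as in the first example.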