
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 55: invalid start byte #2579

Closed
fkurushin opened this issue Apr 9, 2024 · 1 comment


fkurushin commented Apr 9, 2024

I have trained a BPE model with Google's SentencePiece spm_train tool. When I try to build the vocabulary with the onmt_build_vocab tool, the error from the title is raised. The training command:

spm_train --input=/data/translator/parallel/zh_train.txt \
    --model_prefix=/data/translator/code/zh \
    --train_extremely_large_corpus=True \
    --minloglevel=1 \
    --num_threads=40 \
    --vocab_size=30000 \
    --model_type=bpe

I did the same for the ru model, and then ran:

onmt_build_vocab -config zh-ru-translator.yaml -n_sample -1

This is the configuration:

# zh-ru-translator.yaml
## Where the samples will be written
save_data: run/opennmt_data
## Where the vocab(s) will be written
src_vocab: /data/translator/code/zh.vocab
tgt_vocab: /data/translator/code/ru.vocab


# Should match the vocab size for SentencePiece
src_vocab_size: 30000
tgt_vocab_size: 30000

share_vocab: False

# Corpus opts:
data:
    corpus_1:
        path_src: /data/translator/parallel/zh_train.txt
        path_tgt: /data/translator/parallel/ru_train.txt
        weight: 1
        transforms: [bpe, filtertoolong]
    valid:
        path_src: /data/translator/parallel/zh_valid.txt
        path_tgt: /data/translator/parallel/ru_valid.txt
        transforms: [bpe, filtertoolong]


### Transform related opts:
#### Subword
src_subword_model: /data/translator/code/zh.model
tgt_subword_model: /data/translator/code/ru.model
#### Filter
src_seq_length: 150
tgt_seq_length: 150

Has anybody faced this issue before?

PS: Previously I tried the OpenNMT BPE tool, but it was too slow for me; it ran for about 2 days without producing any result.

fkurushin (Author) commented:

Using python3 OpenNMT-py/tools/spm_to_vocab.py solved my issue.
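
For context, a SentencePiece .vocab file has one token per line with a tab-separated score in the second column, while onmt_build_vocab expects token/count pairs, so the SentencePiece vocab has to be converted before it can be used as src_vocab/tgt_vocab. Below is a minimal sketch of that kind of conversion, not the exact spm_to_vocab.py code; the special-token names and the count scaling are assumptions of the sketch. It reads the .vocab file on stdin and writes an OpenNMT-style vocab on stdout:

import math
import sys

# Special tokens that OpenNMT-py adds on its own, so they are skipped here
# (assumed names; adjust to the tokens your model actually uses).
OMIT = ("<unk>", "<s>", "</s>")

def convert(lines):
    for line in lines:
        token, score = line.rstrip("\n").split("\t")
        if token in OMIT:
            continue
        # Turn the SentencePiece score (a log-probability for unigram
        # models) into a positive pseudo-count; the scaling factor is
        # an assumption of this sketch.
        count = int(math.exp(float(score)) * 1_000_000) + 1
        yield f"{token}\t{count}\n"

if __name__ == "__main__":
    sys.stdout.writelines(convert(sys.stdin))

Assuming the sketch is saved as spm_convert.py, usage would be python3 spm_convert.py < zh.vocab > zh.onmt.vocab, after which src_vocab in the YAML points at the converted file.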
