
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 55: invalid start byte #2579

Closed
fkurushin opened this issue Apr 9, 2024 · 1 comment


fkurushin commented Apr 9, 2024

I have trained a BPE model with Google's SentencePiece spm_train tool. When I try to build the vocabulary with the onmt_build_vocab tool, the error from the title is raised. The training command:

spm_train --input=/data/translator/parallel/zh_train.txt \
    --model_prefix=/data/translator/code/zh \
    --train_extremely_large_corpus=True \
    --minloglevel=1 \
    --num_threads=40 \
    --vocab_size=30000 \
    --model_type=bpe

I did the same for the ru model, and then ran:

onmt_build_vocab -config zh-ru-translator.yaml -n_sample -1

This is the configuration:

# zh-ru-translator.yaml
## Where the samples will be written
save_data: run/opennmt_data
## Where the vocab(s) will be written
src_vocab: /data/translator/code/zh.vocab
tgt_vocab: /data/translator/code/ru.vocab


# Should match the vocab size for SentencePiece
src_vocab_size: 30000
tgt_vocab_size: 30000

share_vocab: False

# Corpus opts:
data:
    corpus_1:
        path_src: /data/translator/parallel/zh_train.txt
        path_tgt: /data/translator/parallel/ru_train.txt
        weight: 1
        transforms: [bpe, filtertoolong]
    valid:
        path_src: /data/translator/parallel/zh_valid.txt
        path_tgt: /data/translator/parallel/ru_valid.txt
        transforms: [bpe, filtertoolong]


### Transform related opts:
#### Subword
src_subword_model: /data/translator/code/zh.model
tgt_subword_model: /data/translator/code/ru.model
#### Filter
src_seq_length: 150
tgt_seq_length: 150

Has anybody faced this issue before?

PS: Previously I tried the OpenNMT BPE tool, but it was too slow for me; it ran for about 2 days without producing any result.

fkurushin (Author) commented:

Using python3 OpenNMT-py/tools/spm_to_vocab.py solved my issue.
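
For context, a SentencePiece .vocab file has one token per line with a tab-separated score in the second column, while onmt_build_vocab expects token/count pairs, so the SentencePiece vocab has to be converted before it can be used as src_vocab/tgt_vocab. Below is a minimal sketch of that kind of conversion, not the exact spm_to_vocab.py code; the special-token names and the count scaling are assumptions of the sketch. It reads the .vocab file on stdin and writes an OpenNMT-style vocab on stdout:

import math
import sys

# Special tokens that OpenNMT-py adds on its own, so they are skipped here
# (assumed names; adjust to the tokens your model actually uses).
OMIT = ("<unk>", "<s>", "</s>")

def convert(lines):
    for line in lines:
        token, score = line.rstrip("\n").split("\t")
        if token in OMIT:
            continue
        # Turn the SentencePiece score (a log-probability for unigram
        # models) into a positive pseudo-count; the scaling factor is
        # an assumption of this sketch.
        count = int(math.exp(float(score)) * 1_000_000) + 1
        yield f"{token}\t{count}\n"

if __name__ == "__main__":
    sys.stdout.writelines(convert(sys.stdin))

Assuming the sketch is saved as spm_convert.py, usage would be python3 spm_convert.py < zh.vocab > zh.onmt.vocab, after which src_vocab in the YAML points at the converted file.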
