
Issues with Custom SentencePiece Models and Pretrained Embeddings in Training #2582

Closed
HURIMOZ opened this issue Apr 14, 2024 · 0 comments

HURIMOZ commented Apr 14, 2024

Hello OpenNMT-py Community,
I've been working on training a bilingual model using OpenNMT-py and have encountered some challenges related to using custom SentencePiece models and pretrained embeddings. I'm seeking guidance or suggestions on how to resolve these issues.

Background:
I'm training a bilingual translation model with the transformer architecture. Given the linguistic characteristics of the target language, I initially attempted to implement custom tokenization rules using SentencePiece.

Issue:
Despite following the documentation and ensuring the config.yaml file is correctly set up for using SentencePiece models (src_spm.model and tgt_spm.model) and vocabularies (src_spm.vocab and tgt_spm.vocab), I encountered an error when running the onmt_train command:
onmt_train: error: the following arguments are required: -src_vocab/--src_vocab
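One workaround I sketched for the vocab requirement, assuming OpenNMT-py expects vocab files in `token<TAB>count` format and that the log-probabilities in a SentencePiece `.vocab` file can be replaced with dummy counts (both assumptions on my part, not confirmed by the docs):

```python
# Hypothetical helper: convert SentencePiece .vocab lines ("token<TAB>log-prob")
# into the "token<TAB>count" format I believe src_vocab/tgt_vocab expect.
def spm_vocab_to_onmt(lines):
    """Take an iterable of 'token\\tlogprob' lines, return 'token\\tcount' lines."""
    out = []
    for line in lines:
        token = line.rstrip("\n").split("\t")[0]
        # Skip SentencePiece specials; OpenNMT-py adds its own special tokens.
        if token in ("<unk>", "<s>", "</s>"):
            continue
        # The log-probability is dropped; a dummy count of 1 stands in,
        # since (as far as I can tell) only the token order matters here.
        out.append(f"{token}\t1")
    return out


if __name__ == "__main__":
    sample = ["<unk>\t0\n", "▁the\t-3.2\n", "s\t-4.1\n"]
    print(spm_vocab_to_onmt(sample))
```

I then pointed `-src_vocab`/`-tgt_vocab` at the converted files, which got past the first error.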

After explicitly specifying the -src_vocab and -tgt_vocab arguments, I encountered another error related to the use of pretrained embeddings:
AssertionError: -save_data should be set if use pretrained embeddings.
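For reference, here is a sketch of the relevant part of my `config.yaml` after adding the vocab and `save_data` entries (file names and paths are from my setup and purely illustrative):

```yaml
# Sketch of the relevant config.yaml sections (illustrative paths)
save_data: run/example            # apparently required when pretrained embeddings are used
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt

data:
  corpus_1:
    path_src: data/train.src
    path_tgt: data/train.tgt
    transforms: [sentencepiece]
  valid:
    path_src: data/valid.src
    path_tgt: data/valid.tgt
    transforms: [sentencepiece]

src_subword_model: src_spm.model
tgt_subword_model: tgt_spm.model

embeddings_type: "GloVe"
src_embeddings: embeddings/src_vectors.txt
```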

Configuration:
OpenNMT-py version: 3.5.1
Model architecture: Transformer
Tokenization: SentencePiece subword tokenization for both source and target languages
Pretrained embeddings for the source language

Questions:

  • How can I resolve the errors encountered when using custom SentencePiece models and pretrained embeddings with OpenNMT-py?

  • Is there a recommended approach to integrating custom tokenization rules with OpenNMT-py's built-in SentencePiece functionality?

  • Are there specific configurations or steps required to use pretrained embeddings with SentencePiece tokenization in OpenNMT-py?

Any insights or suggestions from the community would be greatly appreciated.

@HURIMOZ HURIMOZ closed this as completed Apr 15, 2024