Hello OpenNMT-py Community,
I've been working on training a bilingual model using OpenNMT-py and have encountered some challenges related to using custom SentencePiece models and pretrained embeddings. I'm seeking guidance or suggestions on how to resolve these issues.
Background:
I'm training a bilingual translation model with the transformer architecture. Given the linguistic characteristics of the target language, I initially attempted to implement custom tokenization rules using SentencePiece.
Issue:
Despite following the documentation and ensuring the config.yaml file is correctly set up to use the SentencePiece models (src_spm.model and tgt_spm.model) and vocabularies (src_spm.vocab and tgt_spm.vocab), I encountered an error when running the onmt_train command:
onmt_train: error: the following arguments are required: -src_vocab/--src_vocab
After explicitly specifying the -src_vocab and -tgt_vocab arguments, I encountered another error, this time related to the pretrained embeddings:
AssertionError: -save_data should be set if use pretrained embeddings.
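For context, here is a sketch of the relevant part of my config.yaml. The file names and paths are placeholders from my setup, and the option names (save_data, src_subword_model, embeddings_type, etc.) reflect my reading of the docs, so any of them may be where I went wrong:

```yaml
# Sketch of the relevant parts of my config.yaml (paths are illustrative).
save_data: run/example            # apparently required once pretrained embeddings are used
src_vocab: run/src_spm.onmt_vocab
tgt_vocab: run/tgt_spm.onmt_vocab

# SentencePiece subword transform
transforms: [sentencepiece]
src_subword_model: src_spm.model
tgt_subword_model: tgt_spm.model

# Pretrained embeddings (source side only)
src_embeddings: embeddings/src_vectors.txt
embeddings_type: word2vec
```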
Configuration:
OpenNMT-py version: 3.5.1
Model architecture: Transformer
Tokenization: SentencePiece subword tokenization for both source and target languages
Pretrained embeddings for the source language
Questions:
How can I resolve the errors encountered when using custom SentencePiece models and pretrained embeddings with OpenNMT-py?
Is there a recommended approach to integrating custom tokenization rules with OpenNMT-py's built-in SentencePiece functionality?
Are there specific configurations or steps required to use pretrained embeddings with SentencePiece tokenization in OpenNMT-py?
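On the vocabulary question specifically: since the SentencePiece .vocab files store log-probabilities rather than the token-count format that -src_vocab/-tgt_vocab seem to expect, I have been converting them with a small script like the one below. This is my own stopgap, not an official tool (though I believe the OpenNMT-py repo ships something similar under tools/), so corrections are welcome:

```python
import math

def spm_vocab_to_onmt(spm_vocab_path, onmt_vocab_path):
    """Convert a SentencePiece .vocab file (token<TAB>log-prob per line)
    into the token<TAB>count format expected by -src_vocab/-tgt_vocab."""
    with open(spm_vocab_path, encoding="utf-8") as fin, \
         open(onmt_vocab_path, "w", encoding="utf-8") as fout:
        for line in fin:
            token, logprob = line.rstrip("\n").split("\t")
            # SentencePiece stores log-probabilities; scale them into
            # pseudo-counts so more probable pieces get larger counts.
            count = max(1, round(math.exp(float(logprob)) * 1_000_000))
            fout.write(f"{token}\t{count}\n")
```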
Any insights or suggestions from the community would be greatly appreciated.