
Issues with Custom SentencePiece Models and Pretrained Embeddings in Training #2582

Closed
HURIMOZ opened this issue Apr 14, 2024 · 0 comments

HURIMOZ commented Apr 14, 2024

Hello OpenNMT-py Community,
I've been working on training a bilingual model using OpenNMT-py and have encountered some challenges related to using custom SentencePiece models and pretrained embeddings. I'm seeking guidance or suggestions on how to resolve these issues.

Background:
I'm training a bilingual translation model with the transformer architecture. Given the linguistic characteristics of the target language, I initially attempted to implement custom tokenization rules using SentencePiece.

Issue:
Despite following the documentation and ensuring the config.yaml file is correctly set up for using SentencePiece models (src_spm.model and tgt_spm.model) and vocabularies (src_spm.vocab and tgt_spm.vocab), I encountered an error when running the onmt_train command:
onmt_train: error: the following arguments are required: -src_vocab/--src_vocab
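One workaround I sketched for the vocab requirement, assuming OpenNMT-py expects vocab files in `token<TAB>count` format and that the log-probabilities in a SentencePiece `.vocab` file can be replaced with dummy counts (both assumptions on my part, not confirmed by the docs):

```python
# Hypothetical helper: convert SentencePiece .vocab lines ("token<TAB>log-prob")
# into the "token<TAB>count" format I believe src_vocab/tgt_vocab expect.
def spm_vocab_to_onmt(lines):
    """Take an iterable of 'token\\tlogprob' lines, return 'token\\tcount' lines."""
    out = []
    for line in lines:
        token = line.rstrip("\n").split("\t")[0]
        # Skip SentencePiece specials; OpenNMT-py adds its own special tokens.
        if token in ("<unk>", "<s>", "</s>"):
            continue
        # The log-probability is dropped; a dummy count of 1 stands in,
        # since (as far as I can tell) only the token order matters here.
        out.append(f"{token}\t1")
    return out


if __name__ == "__main__":
    sample = ["<unk>\t0\n", "▁the\t-3.2\n", "s\t-4.1\n"]
    print(spm_vocab_to_onmt(sample))
```

I then pointed `-src_vocab`/`-tgt_vocab` at the converted files, which got past the first error.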

After explicitly specifying the -src_vocab and -tgt_vocab arguments, I encountered another error related to the use of pretrained embeddings:
AssertionError: -save_data should be set if use pretrained embeddings.
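For reference, here is a sketch of the relevant part of my `config.yaml` after adding the vocab and `save_data` entries (file names and paths are from my setup and purely illustrative):

```yaml
# Sketch of the relevant config.yaml sections (illustrative paths)
save_data: run/example            # apparently required when pretrained embeddings are used
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt

data:
  corpus_1:
    path_src: data/train.src
    path_tgt: data/train.tgt
    transforms: [sentencepiece]
  valid:
    path_src: data/valid.src
    path_tgt: data/valid.tgt
    transforms: [sentencepiece]

src_subword_model: src_spm.model
tgt_subword_model: tgt_spm.model

embeddings_type: "GloVe"
src_embeddings: embeddings/src_vectors.txt
```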

Configuration:
OpenNMT-py version: 3.5.1
Model architecture: Transformer
Tokenization: SentencePiece subword tokenization for both source and target languages
Pretrained embeddings for the source language

Questions:

  • How can I resolve the errors encountered when using custom SentencePiece models and pretrained embeddings with OpenNMT-py?

  • Is there a recommended approach to integrating custom tokenization rules with OpenNMT-py's built-in SentencePiece functionality?

  • Are there specific configurations or steps required to use pretrained embeddings with SentencePiece tokenization in OpenNMT-py?

Any insights or suggestions from the community would be greatly appreciated.

@HURIMOZ HURIMOZ closed this as completed Apr 15, 2024