Question about SentencePiece [SOS] and [EOS] ID. #12

Closed
nooralahzadeh opened this issue Aug 14, 2020 · 4 comments

Comments

@nooralahzadeh

Hi,
I saw that in SentencePieceTrainer, as shown below, you set the EOS, BOS, MASK and PAD tokens equal to zero:
" --bos_id=-1 --eos_id=-1" " --control_symbols=[SOS],[EOS],[MASK]"
However, during captioning, you define
sos_index: int = 1, eos_index: int = 2,
Do these settings have any effect?

@kdexd
Owner

kdexd commented Aug 16, 2020

By default, SentencePieceTrainer assigns ID 1 to <s> and ID 2 to </s>. Check here

I prefer [SOS] and [EOS] in text instead of <s> and </s>, so I passed my custom symbols as --control_symbols. Internally, SentencePieceTrainer reserves ID 0 for <unk> (which cannot be changed as far as I know), and other control symbols are assigned starting from ID 3 (when the default <s> and </s> are present) or from ID 1 (when they are absent).

So I turned off the default <s> and </s> (with --bos_id=-1 --eos_id=-1) and provided [SOS] and [EOS] instead, so they get IDs 1 and 2 respectively.
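
The ID layout described above can be sketched in a few lines of plain Python. This is only a simplified model of the behaviour explained in this comment, not SentencePiece's actual code; the function name `reserved_ids` and its parameters are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def reserved_ids(
    control_symbols: List[str],
    unk_id: int = 0,
    bos_id: int = 1,
    eos_id: int = 2,
) -> Dict[str, int]:
    """Model the reserved-ID layout: a negative ID disables that default piece."""
    vocab: Dict[str, int] = {}
    # Default pieces keep their configured IDs unless disabled with -1.
    for piece, idx in (("<unk>", unk_id), ("<s>", bos_id), ("</s>", eos_id)):
        if idx >= 0:
            vocab[piece] = idx

    # Control symbols fill the lowest free IDs, in the order given.
    taken = set(vocab.values())
    next_id = 0
    for symbol in control_symbols:
        while next_id in taken:
            next_id += 1
        vocab[symbol] = next_id
        taken.add(next_id)
    return vocab


symbols = ["[SOS]", "[EOS]", "[MASK]"]
with_defaults = reserved_ids(symbols)                           # <s> and </s> kept
without_defaults = reserved_ids(symbols, bos_id=-1, eos_id=-1)  # as in this repo
print(with_defaults)
print(without_defaults)
```

With the defaults disabled, [SOS] and [EOS] land on IDs 1 and 2, which matches `sos_index: int = 1, eos_index: int = 2` in the captioning code. On a trained model, the same thing can be checked with `SentencePieceProcessor.piece_to_id("[SOS]")`.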

kdexd changed the title from "question" to "Question about SentencePiece [SOS] and [EOS] ID." on Aug 16, 2020
@kdexd
Owner

kdexd commented Aug 16, 2020

Edited title for others to search easily. :-)

@nooralahzadeh
Author

nooralahzadeh commented Aug 16, 2020

Thanks. How about the ID of [PAD]? Is it zero by default?

@kdexd
Owner

kdexd commented Aug 16, 2020

[PAD] and <unk> are the same: we use a single token both to right-pad captions and to represent out-of-vocabulary tokens, similar to recent image captioning models. It has ID 0 by default. As far as I remember, I always refer to this token as <unk>, and the corresponding variable name in code is padding_idx.
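
As a rough illustration of what sharing ID 0 means in practice, here is a minimal sketch of right-padding a batch of tokenized captions. The helper `pad_captions` and the example token IDs are hypothetical and not taken from the repository; only `padding_idx = 0`, `sos_index = 1` and `eos_index = 2` come from this thread.

```python
from typing import List, Tuple

PADDING_IDX = 0  # same ID as <unk>, per the comment above
SOS_INDEX = 1
EOS_INDEX = 2


def pad_captions(
    captions: List[List[int]], padding_idx: int = PADDING_IDX
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad token-ID lists to a common length and return the true lengths."""
    # Lengths are recorded before padding: since ID 0 also stands for <unk>,
    # counting non-zero IDs afterwards could undercount captions that
    # contain out-of-vocabulary tokens.
    lengths = [len(caption) for caption in captions]
    max_len = max(lengths, default=0)
    padded = [
        caption + [padding_idx] * (max_len - len(caption)) for caption in captions
    ]
    return padded, lengths


# Example IDs 17, 42 and 8 are arbitrary; the 0 in the second caption is an <unk>.
batch, lengths = pad_captions(
    [[SOS_INDEX, 17, 42, EOS_INDEX], [SOS_INDEX, 0, 8][:2] + [EOS_INDEX]]
)
print(batch)
print(lengths)
```

The second caption contains an out-of-vocabulary token (ID 0) and is then padded with the same ID, so the returned lengths are the only reliable way to tell the two apart.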
