This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

Confused about vocab and encoder #31

Open
weiguowilliam opened this issue Sep 23, 2019 · 0 comments


I'm reading the source code and have two questions about the vocab and the encoder. Any help would be appreciated. Thank you in advance.

  1. In vocab.bpe, take the second row (Ġ t) as an example. "Ġ" also appears in many other rows (the third row, for example). Why isn't there a one-to-one correspondence?
  2. Are the entries in encoder.json the subword tokens produced by BPE? Take "\u0120regress" as an example: why does "\u0120" appear there?
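
For context on both questions, here is a minimal sketch, assuming these files follow the GPT-2 byte-level BPE convention (the issue does not confirm this). Under that convention, every byte is first mapped to a printable Unicode character, and the space byte (0x20) ends up as "Ġ" (U+0120). Each row of vocab.bpe would then be a merge rule over a pair of symbols rather than a single vocabulary entry, which is why "Ġ" can appear in many rows. The names `bytes_to_unicode`, `byte_encoder`, `space_symbol`, and `word_symbols` are illustrative, not taken from this repository.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable Unicode character.

    Assumption: this mirrors the GPT-2 style byte-level BPE mapping.
    """
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Remaining bytes (control characters, space, etc.) are shifted to
    # code points starting at 256 so that they become visible symbols.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_encoder = bytes_to_unicode()

# The space byte is the 33rd shifted byte (bytes 0..32), so it maps to
# code point 256 + 32 = 288, which is U+0120 ("Ġ").
space_symbol = byte_encoder[ord(" ")]

# A word preceded by a space therefore starts with "Ġ" after the mapping.
word_symbols = "".join(byte_encoder[b] for b in " regress".encode("utf-8"))

print(repr(space_symbol), word_symbols)
```

Under this assumption, "\u0120regress" in encoder.json would simply be the token for " regress" (the word with a leading space), and a row such as `Ġ t` in vocab.bpe would be the rule that merges the space symbol with a following `t`.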