
[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. #1136

Merged: 7 commits into huggingface:main on Dec 27, 2022
Conversation

SeongBeomLEE (Contributor)

Same error as #1120: when CharBPETokenizer is built without a vocab and merges, the underlying BPE model is created without its unk_token, so out-of-vocabulary characters are silently dropped during encoding.

Thanks.

before

char_level_bpe.py

if vocab is not None and merges is not None:
    tokenizer = Tokenizer(
        BPE(
            vocab,
            merges,
            dropout=dropout,
            unk_token=str(unk_token),
            end_of_word_suffix=suffix,
        )
    )
else:
    tokenizer = Tokenizer(BPE())

main.py

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer(
    unk_token="[UNK]",
    suffix="</w>",
)

tokenizer.train(
    files='./vocab.txt',
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]", "[SEP]", "[CLS]", "[MASK]"],
)

line = "나는 😀 😃 😄 축구를 😀 😃 😄 좋아한다."
pieces = tokenizer.encode(line)
print("CharBPETokenizer", pieces.tokens, tokenizer.decode(pieces.ids))

output:

CharBPETokenizer ['나는</w>', '축구를</w>', '좋아한다</w>', '.</w>'] 나는 축구를 좋아한다 .
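Because the else branch builds the BPE model without an unk_token, the emoji (which never make it into the trained vocabulary) are dropped silently from both tokens and ids instead of being mapped to [UNK]. A minimal way to observe this on the trained tokenizer from the script above (a sketch, not part of the PR):

print(tokenizer.token_to_id("[UNK]"))       # the special token itself is in the vocab
print(tokenizer.token_to_id("😀"))          # expected: None, the emoji was never learned
print(tokenizer.encode("😀 😃 😄").tokens)  # expected: [], unknown characters vanish silently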

after

char_level_bpe.py

if vocab is not None and merges is not None:
    tokenizer = Tokenizer(
        BPE(
            vocab,
            merges,
            dropout=dropout,
            unk_token=str(unk_token),
            end_of_word_suffix=suffix,
        )
    )
else:
    tokenizer = Tokenizer(BPE(unk_token=str(unk_token)))

main.py

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer(
    unk_token="[UNK]",
    suffix="</w>",
)

tokenizer.train(
    files='./vocab.txt',
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]", "[SEP]", "[CLS]", "[MASK]"],
)

line = "나는 😀 😃 😄 축구를 😀 😃 😄 좋아한다."
pieces = tokenizer.encode(line)
print("CharBPETokenizer", pieces.tokens, tokenizer.decode(pieces.ids))

output:

CharBPETokenizer ['나는</w>', '[UNK]', '[UNK]', '[UNK]', '축구를</w>', '[UNK]', '[UNK]', '[UNK]', '좋아한다</w>', '.</w>'] 나는 축구를 좋아한다 .
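With unk_token forwarded to the model, the same out-of-vocabulary characters now come back as [UNK]. A quick check on the trained tokenizer (a sketch, not part of the PR), using the line defined in the script above:

unk_id = tokenizer.token_to_id("[UNK]")
pieces = tokenizer.encode(line)
print(pieces.ids.count(unk_id))  # expected: 6, one [UNK] id per emoji in the sample line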

SeongBeomLEE and others added 5 commits, starting December 12, 2022:

* In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used.
* …e_bpe.py (Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>)
* In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used.
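The first commit above applies the same fix to SentencePieceBPETokenizer. A rough usage sketch of what that change enables, assuming a post-fix version of the library (the training file and parameters are placeholders mirroring the example above):

from tokenizers import SentencePieceBPETokenizer

# Sketch: created without vocab/merges, the tokenizer should now forward its
# unk_token ("<unk>" by default) to the underlying BPE model as well.
sp_tokenizer = SentencePieceBPETokenizer(unk_token="<unk>")
sp_tokenizer.train(
    files='./vocab.txt',
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["<unk>"],
)
print(sp_tokenizer.encode("나는 😀 축구를").tokens)  # emoji expected to come back as "<unk>"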
HuggingFaceDocBuilderDev commented Dec 25, 2022

The documentation is not available anymore as the PR was closed or merged.

Two more commits were added:

* …pe.py (Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>)
* …pe.py (Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>)
Narsil (Collaborator) left a comment:

LGTM. Thank you for this.

Narsil merged commit 9b155b5 into huggingface:main on Dec 27, 2022
Narsil added a commit that referenced this pull request Jan 16, 2023
…nnot be used. (#1136)

* [fix] Use unk_token

In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used.

* [fix] If unk_token is None, this case is also considered.

* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* [FIX] In CharBPETokenizer, Use unk_token.

In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used.

* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>