
Tokenizer has partial token suffix instead of prefix #65

Open
guustfranssensEY opened this issue Jan 19, 2022 · 1 comment
guustfranssensEY commented Jan 19, 2022

Following your guide for identifying the model configuration:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "vinai/bertweet-base"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, normalization=True, use_fast=False)
ids = tokenizer('tokenization')
ids

returns:

{'input_ids': [0, 969, 6186, 6680, 2], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

Then

tokenizer.convert_ids_to_tokens(ids['input_ids'])

returns:

['<s>', 'to@@', 'ken@@', 'ization', '</s>']

Here I noticed that the tokenizer marks partial tokens with a suffix ('@@') instead of a prefix. Having a suffix instead of a prefix is not configurable in the config.
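For reference, a minimal sketch of how the suffix convention could be mapped onto the more common prefix convention before display (the suffix_to_prefix helper below is hypothetical; it assumes the '@@' marker shown in the output above and the '##' prefix used by WordPiece-style tokenizers):

def suffix_to_prefix(tokens, suffix="@@", prefix="##"):
    """Convert suffix-marked partial tokens (e.g. 'to@@') to prefix-marked ones (e.g. '##ken')."""
    converted = []
    carry = False  # True if the previous token was partial, i.e. this token continues the same word
    for tok in tokens:
        is_partial = tok.endswith(suffix)
        base = tok[:-len(suffix)] if is_partial else tok
        converted.append(prefix + base if carry else base)
        carry = is_partial
    return converted

suffix_to_prefix(['<s>', 'to@@', 'ken@@', 'ization', '</s>'])
# ['<s>', 'to', '##ken', '##ization', '</s>']

This doesn't change the tokenizer itself; it only normalizes the markers so that downstream code expecting a partial-token prefix can consume them.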

guustfranssensEY changed the title from "Tokenizer has token suffix instead of prefix" to "Tokenizer has partial token suffix instead of prefix" on Jan 19, 2022
jalammar (Owner) commented:
Oh wow I've never come across such a tokenizer. That's interesting..
