
How to remove a token? #4827

Closed
LeslieOverfitting opened this issue Jun 7, 2020 · 8 comments

@LeslieOverfitting

I only know how to add tokens, but how do I remove some special tokens?

@Aktsvigun
Contributor

From what I can observe, there are two types of tokens in your tokenizer: the base tokens, which you can access via tokenizer.encoder, and the added ones, in tokenizer.added_tokens_encoder. Depending on which kind of token you want to remove, delete its entry with del tokenizer.encoder[token] or del tokenizer.added_tokens_encoder[token].
NB! Do not forget to resize the embedding layer of your model with model.resize_token_embeddings(len(tokenizer)).
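A minimal sketch of what that could look like, assuming a slow (Python) GPT-2 tokenizer and an older transformers release (contemporary with this issue) in which encoder and added_tokens_encoder are plain mutable dicts; the token name is just an example:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # slow tokenizer: vocab dicts are plain Python dicts
model = GPT2LMHeadModel.from_pretrained("gpt2")

# an added token goes into added_tokens_encoder, not into the base vocab
tokenizer.add_tokens(["<my_marker>"])
model.resize_token_embeddings(len(tokenizer))

# removing it again: delete the key (not the whole dict) on both the encoder and decoder side,
# then shrink the embedding matrix back down
removed_id = tokenizer.added_tokens_encoder.pop("<my_marker>")
tokenizer.added_tokens_decoder.pop(removed_id, None)
model.resize_token_embeddings(len(tokenizer))

In newer transformers versions these attributes are derived properties, so this kind of direct surgery may not persist; it is a sketch, not an official API.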

@stale

stale bot commented Aug 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Aug 6, 2020
stale bot closed this as completed Aug 13, 2020
@mitramir55

mitramir55 commented May 10, 2021

Hi, I can't seem to remove tokens from the main vocabulary with tokenizer.encoder. I get AttributeError: 'BertTokenizerFast' object has no attribute 'encoder'.

Also, if we remove some tokens from the middle of the vocabulary file, can the model assign the right embeddings to the new token ids? Will the corresponding token ids and embeddings actually be removed from our vocab file and from the model?

What I currently do:

del tokenizer.vocab[unwanted_words]   # remove a token string from the vocab dict
model.resize_token_embeddings(len(tokenizer))

We're decreasing vocabulary size here, but will my model understand which tokens were removed?

@avi-jit

avi-jit commented May 27, 2021

@mitramir55 I can't imagine how the model would know which tokens were removed from the vocabulary. I have the same question. Perhaps we would have to remove weight elements one by one from the model's lookup embeddings. Any other ideas?
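A rough sketch of that idea, not an official API: rebuild the input embedding matrix keeping only the rows for the ids you want to keep. Here ids_to_keep is a hypothetical list, and a tied or untied output head plus the tokenizer's id mapping would need the same surgery, which is why in practice this usually means retraining:

import torch

def shrink_embeddings(model, ids_to_keep):
    """Keep only the embedding rows whose ids are in ids_to_keep (hypothetical helper)."""
    old_emb = model.get_input_embeddings()                  # nn.Embedding of shape [vocab_size, hidden_dim]
    kept_rows = old_emb.weight.data[ids_to_keep].clone()    # select the surviving rows
    new_emb = torch.nn.Embedding(len(ids_to_keep), kept_rows.size(1))
    new_emb.weight.data.copy_(kept_rows)
    model.set_input_embeddings(new_emb)
    model.config.vocab_size = len(ids_to_keep)              # keep the config in sync
    return model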

@snoop2head

@mitramir55

Does del delete the token from the tokenizer? It didn't seem to work for me.

@mitramir55

mitramir55 commented Jan 4, 2022

Hi @snoop2head and @avi-jit,
No, I did not delete any word from the vocabulary. If you think about it, it isn't really logical to delete a word, i.e. an id, from the input or output of an already trained model. All I did was add the words I wanted to the vocabulary before training, and then, when using the model to predict the next word, set the probability of the words I didn't want to minus infinity. This way the model won't choose any of them and will fall back to the next most probable option.

### Adding words before training
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'HooshvareLab/distilbert-fa-zwnj-base'

model = AutoModelForMaskedLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# add the new words and grow the embedding matrix to the new vocab size
tokenizer.add_tokens(['this', 'that', 'those'])
model.resize_token_embeddings(len(tokenizer))

# then the training...

Now let's say we want our trained transformer to suggest a word for an incomplete sentence without considering some specific "banned" words:

### setting the probability of some words being generated to -inf
import math
import torch

all_banned_tokens = ['«', ':', '،', '/', '*', ']', '[', '؟', '…', 'ی', tokenizer.unk_token]
all_banned_tokens = [i.strip() for i in all_banned_tokens]

# look up the id of each banned token (without adding special tokens)
banned_ids = [i[0] for i in tokenizer.batch_encode_plus(all_banned_tokens, add_special_tokens=False).input_ids]

def get_transformer_suggestions(sequence, model, tokenizer, top_k=5, banned_ids=banned_ids):
    """Takes a sequence containing a mask token and returns the top_k suggested words."""
    suggestion = []
    ids_main = tokenizer.encode(sequence, return_tensors="pt", add_special_tokens=True)

    ids_ = ids_main.detach().clone()
    position = torch.where(ids_main == tokenizer.mask_token_id)
    positions_list = position[1].numpy().tolist()

    # logits for the (first) masked position
    model_logits = model(ids_)['logits'][0][positions_list[0]]
    model_logits[banned_ids] = -math.inf   # banned tokens can never end up in the top k

    top_k_tokens = torch.topk(model_logits, top_k, dim=0).indices.tolist()

    for j in range(len(top_k_tokens)):
        suggestion.append(tokenizer.decode(top_k_tokens[j]))

    return suggestion

candidates = get_transformer_suggestions(sequence=f'this is an amazing {tokenizer.mask_token}',
                                         model=model, tokenizer=tokenizer,
                                         top_k=5, banned_ids=banned_ids)

I hope this was helpful. Let me know if there is anything else I can explain to make it clearer.

@snoop2head

@mitramir55
There are occasions where you want to delete tokens from the tokenizer and resize the embedding layer accordingly.

As I stated in issue #15032, there are tokens such as [unused363].

I am still figuring out how to remove that surplus of 500 tokens from the tokenizer.

Thank you for your kind explanation though!

@mitramir55

Hi @snoop2head,
I'm not sure exactly what you want to do, but I think this post and this one could be helpful.

Basically, what you need to know is that you cannot simply change the embedding layer of a model: it is part of a trained transformer with specific weights and layers, so if you change the embedding you need to (re)train the model. Similarly, each BPE tokenizer has a vocab.json and a merges.txt file that were created during training (with byte-level BPE), and if you want to change the tokenizer you need to modify those. However, with a little searching I found this post, where the author changed those files (I think with another model's files). Maybe you can get some help from that.
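For reference, a small sketch of where those files live; the model name and paths here are just examples (a byte-level BPE tokenizer such as GPT-2's), and editing the files by hand is at your own risk:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
tokenizer.save_pretrained("./my_tokenizer")    # writes vocab.json and merges.txt (plus config files)

# ...edit ./my_tokenizer/vocab.json and ./my_tokenizer/merges.txt, or swap in another model's files...

modified_tokenizer = AutoTokenizer.from_pretrained("./my_tokenizer")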
