
How to remove a token? #4827

Closed
LeslieOverfitting opened this issue Jun 7, 2020 · 8 comments

@LeslieOverfitting

I only know how to add tokens, but how do I remove some special tokens?

@Aktsvigun
Contributor

From what I can observe, there are two types of tokens in your tokenizer: the base tokens, which you can access via tokenizer.encoder, and the added ones, in tokenizer.added_tokens_encoder. Depending on which kind of token you want to remove, delete its entry with del tokenizer.encoder[token] or del tokenizer.added_tokens_encoder[token].
NB! Do not forget to resize the embedding layer of your model with model.resize_token_embeddings(len(tokenizer)).
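A minimal sketch of what that could look like, assuming a slow (Python) GPT-2 tokenizer and an older transformers release (contemporary with this issue) in which encoder and added_tokens_encoder are plain mutable dicts; the token name is just an example:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # slow tokenizer: vocab dicts are plain Python dicts
model = GPT2LMHeadModel.from_pretrained("gpt2")

# an added token goes into added_tokens_encoder, not into the base vocab
tokenizer.add_tokens(["<my_marker>"])
model.resize_token_embeddings(len(tokenizer))

# removing it again: delete the key (not the whole dict) on both the encoder and decoder side,
# then shrink the embedding matrix back down
removed_id = tokenizer.added_tokens_encoder.pop("<my_marker>")
tokenizer.added_tokens_decoder.pop(removed_id, None)
model.resize_token_embeddings(len(tokenizer))

In newer transformers versions these attributes are derived properties, so this kind of direct surgery may not persist; it is a sketch, not an official API.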

@stale

stale bot commented Aug 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Aug 6, 2020
stale bot closed this as completed Aug 13, 2020
@mitramir55

mitramir55 commented May 10, 2021

Hi, I can't seem to remove tokens from the main vocabulary with tokenizer.encoder. I get AttributeError: 'BertTokenizerFast' object has no attribute 'encoder'.

Also, if we remove some tokens from the middle of the vocabulary file, can the model assign the right embeddings to the new token ids? Will the corresponding token ids and embeddings actually be removed from our vocab file and from the model?

What I currently do:

del tokenizer.vocab[unwanted_words]   # remove a token string from the vocab dict
model.resize_token_embeddings(len(tokenizer))

We're decreasing vocabulary size here, but will my model understand which tokens were removed?

@avi-jit

avi-jit commented May 27, 2021

@mitramir55 I can't imagine how the model would know which tokens were removed from the vocabulary. I have the same question. Perhaps we would have to remove weight elements one by one from the model's lookup embeddings. Any other ideas?
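A rough sketch of that idea, not an official API: rebuild the input embedding matrix keeping only the rows for the ids you want to keep. Here ids_to_keep is a hypothetical list, and a tied or untied output head plus the tokenizer's id mapping would need the same surgery, which is why in practice this usually means retraining:

import torch

def shrink_embeddings(model, ids_to_keep):
    """Keep only the embedding rows whose ids are in ids_to_keep (hypothetical helper)."""
    old_emb = model.get_input_embeddings()                  # nn.Embedding of shape [vocab_size, hidden_dim]
    kept_rows = old_emb.weight.data[ids_to_keep].clone()    # select the surviving rows
    new_emb = torch.nn.Embedding(len(ids_to_keep), kept_rows.size(1))
    new_emb.weight.data.copy_(kept_rows)
    model.set_input_embeddings(new_emb)
    model.config.vocab_size = len(ids_to_keep)              # keep the config in sync
    return model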

@snoop2head

@mitramir55

Does del delete the token from the tokenizer? It didn't seem to work for me.

@mitramir55

mitramir55 commented Jan 4, 2022

Hi @snoop2head and @avi-jit,
No, I did not delete any word from the vocabulary. If you think about it, it isn't really logical to delete a word, i.e. an id, from the input or output of an already trained model. All I did was add the words I wanted to the vocabulary before training, and then, when using the model to predict the next word, set the probability of the words I didn't want to minus infinity. This way the model won't choose any of them and will fall back to the next most probable option.

### Adding words before training
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'HooshvareLab/distilbert-fa-zwnj-base'

model = AutoModelForMaskedLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# add the new words and grow the embedding matrix to the new vocab size
tokenizer.add_tokens(['this', 'that', 'those'])
model.resize_token_embeddings(len(tokenizer))

# then the training...

Now let's say we want our trained transformer to suggest a word for an incomplete sentence without considering some specific "banned" words:

### setting the probability of some words being generated to -inf
import math
import torch

all_banned_tokens = ['«', ':', '،', '/', '*', ']', '[', '؟', '…', 'ی', tokenizer.unk_token]
all_banned_tokens = [i.strip() for i in all_banned_tokens]

# look up the id of each banned token (without adding special tokens)
banned_ids = [i[0] for i in tokenizer.batch_encode_plus(all_banned_tokens, add_special_tokens=False).input_ids]

def get_transformer_suggestions(sequence, model, tokenizer, top_k=5, banned_ids=banned_ids):
    """Takes a sequence containing a mask token and returns the top_k suggested words."""
    suggestion = []
    ids_main = tokenizer.encode(sequence, return_tensors="pt", add_special_tokens=True)

    ids_ = ids_main.detach().clone()
    position = torch.where(ids_main == tokenizer.mask_token_id)
    positions_list = position[1].numpy().tolist()

    # logits for the (first) masked position
    model_logits = model(ids_)['logits'][0][positions_list[0]]
    model_logits[banned_ids] = -math.inf   # banned tokens can never end up in the top k

    top_k_tokens = torch.topk(model_logits, top_k, dim=0).indices.tolist()

    for j in range(len(top_k_tokens)):
        suggestion.append(tokenizer.decode(top_k_tokens[j]))

    return suggestion

candidates = get_transformer_suggestions(sequence=f'this is an amazing {tokenizer.mask_token}',
                                         model=model, tokenizer=tokenizer,
                                         top_k=5, banned_ids=banned_ids)

I hope this was helpful. Let me know if there is anything else I can explain to make it clearer.

@snoop2head

@mitramir55
There are occasions where you want to delete tokens from the tokenizer and resize the embedding layer accordingly.

As I stated in issue #15032, there are tokens such as [unused363].

I am still figuring out how to remove that surplus of 500 tokens from the tokenizer.

Thank you for your kind explanation though!

@mitramir55

Hi @snoop2head,
I'm not sure exactly what you want to do, but I think this post and this one could be helpful.

Basically, what you need to know is that you cannot simply change the embedding layer of a model: it is part of a trained transformer with specific weights and layers, so if you change the embedding you need to (re)train the model. Similarly, each BPE tokenizer has a vocab.json and a merges.txt file that were created during training (with byte-level BPE), and if you want to change the tokenizer you need to modify those. However, with a little searching I found this post, where the author changed those files (I think with another model's files). Maybe you can get some help from that.
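For reference, a small sketch of where those files live; the model name and paths here are just examples (a byte-level BPE tokenizer such as GPT-2's), and editing the files by hand is at your own risk:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
tokenizer.save_pretrained("./my_tokenizer")    # writes vocab.json and merges.txt (plus config files)

# ...edit ./my_tokenizer/vocab.json and ./my_tokenizer/merges.txt, or swap in another model's files...

modified_tokenizer = AutoTokenizer.from_pretrained("./my_tokenizer")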
