How to remove a token? #4827
From what I can observe, there are two types of tokens in your tokenizer: base tokens, which can be derived with
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I can't seem to remove tokens from the main vocabulary with `tokenizer.encoder` (I get an error). Also, if we remove some tokens from the middle of the whole vocabulary file, can the model set the right embeddings for the new token ids? Will the specific token ids and embeddings be removed from our vocab file and model? What I currently do:
We're decreasing the vocabulary size here, but will my model understand which tokens were removed?
@mitramir55 I can't imagine how the model would know which tokens were removed from the vocabulary. I have the same question. Perhaps we would have to remove weight rows one by one from the model's embedding lookup table. Any other ideas?
Does `del` delete the token from the tokenizer? It didn't seem to work for me.
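One thing worth seeing concretely is why deleting vocab entries alone is not enough: token ids must stay contiguous (they index rows of the embedding matrix), so every id above a removed token shifts down. Here is a minimal, hypothetical sketch (the `remove_tokens` helper and the toy vocab are illustrations, not transformers API):

```python
# Hypothetical sketch: dropping entries from a vocab dict leaves gaps in
# the id space, so the remaining ids must be re-packed to stay contiguous.
# The model's embedding rows would have to be pruned in the same order.

def remove_tokens(vocab, tokens_to_remove):
    """Return a new vocab with the given tokens dropped and ids re-packed."""
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
            if tok not in tokens_to_remove]
    return {tok: new_id for new_id, tok in enumerate(kept)}

vocab = {"<unk>": 0, "hello": 1, "world": 2, "foo": 3, "bar": 4}
new_vocab = remove_tokens(vocab, {"foo"})
# "bar" slides from id 4 down to id 3; this is exactly why the model
# cannot "know" about the removal unless its embeddings are remapped too.
```

This is also why `del tokenizer.encoder[token]` alone appears to do nothing useful: even if the entry disappears from the dict, the model's embedding table is untouched.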
Hi @snoop2head and @avi-jit,
Now let's say we want our trained transformer to suggest a word for an incomplete sentence without considering some specific "banned" words:
I hope this was helpful. Tell me if there is anything else I can explain to make it clear.
@mitramir55 Just like I stated in issue #15032, there are tokens such as … I am figuring out how to remove the surplus of 500 tokens from the tokenizer. Thank you for your kind explanation though!
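For the "banned words" use case above, note that `model.generate` in transformers accepts a `bad_words_ids` argument, which is usually the right tool. The underlying idea can be sketched without a model: mask the banned ids' scores to minus infinity before picking the next token (greedy decoding; the function and toy scores below are hypothetical illustrations):

```python
import math

def pick_next_token(logits, banned_ids):
    """Choose the highest-scoring token id, skipping banned ids,
    by masking their scores to -inf (greedy-decoding sketch)."""
    masked = [(-math.inf if i in banned_ids else score)
              for i, score in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

logits = [0.1, 2.5, 1.7, 3.0]            # toy next-token scores
best = pick_next_token(logits, {3})      # id 3 would win, so id 1 is chosen
```

The point is that banning words at generation time never requires touching the vocabulary or the embedding layer.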
Hi @snoop2head, Basically, what you need to know is that you cannot simply change the embedding layer of a model, because it is part of a trained transformer with specific weights and layers. If you want to change the embedding, then you need to retrain the model. This is because each tokenizer has a
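To make the coupling concrete: the embedding matrix has one row per token id, so shrinking the vocabulary means selecting the surviving rows in their new id order (transformers' `resize_token_embeddings` only grows or truncates the matrix at the end, so removing ids from the middle needs manual row selection). A minimal sketch with a toy matrix, assuming hypothetical `old_vocab`/`new_vocab` dicts:

```python
# Hypothetical sketch: prune embedding rows to match a reduced vocab.
# Row i of `embeddings` is the vector for the token whose old id is i.

def prune_embeddings(embeddings, old_vocab, new_vocab):
    """Keep only rows for tokens present in new_vocab, ordered by new id."""
    order = sorted(new_vocab, key=new_vocab.get)
    return [embeddings[old_vocab[tok]] for tok in order]

old_vocab = {"a": 0, "b": 1, "c": 2}
emb = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # toy 2-dim embeddings
new_vocab = {"a": 0, "c": 1}                  # "b" removed
pruned = prune_embeddings(emb, old_vocab, new_vocab)  # rows 0 and 2 survive
```

Even after such pruning the surviving vectors keep their trained values, but any downstream weights tied to the output vocabulary (e.g. a tied LM head) must be pruned consistently as well.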
I only know how to add a token, but how do I remove a special token?