Description
I'd like to propose an idea analogous to #369.
The current textual inversion fine-tuning script initialises the new `placeholder_token`'s embedding from an existing `initializer_token` (and enforces that the initializer token is exactly one token).
diffusers/examples/textual_inversion/textual_inversion.py, lines 409 to 411 at 84b9df5:

```python
# Initialise the newly added placeholder token with the embeddings of the initializer token
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]
```
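For context, the single-token constraint comes from a check earlier in the same script, roughly along these lines (paraphrased, not an exact excerpt):

```python
# Convert the initializer token to ids; the script rejects multi-token initializers.
token_ids = tokenizer.encode(args.initializer_token, add_special_tokens=False)
if len(token_ids) > 1:
    raise ValueError("The initializer token must be a single token.")
initializer_token_id = token_ids[0]
```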
I was curious whether we could initialise a new token from multiple existing ones. Let me give an example of a use case. Say I'm trying to add the concept of a "low camera angle". The existing model does have some semblance of this concept, but it's far from concrete. However, its existing knowledge is not captured by any single token in isolation.
My first thought was to take the embeddings of each token from `tokenizer.encode("low camera angle", add_special_tokens=False)` and average them, but that doesn't quite smell right. As I understand it, it's the `text_encoder` that's responsible for modelling the relationships between sequences of tokens. I wonder what the best strategy might be to initialise a new token from multiple existing ones; a rough sketch of the averaging idea follows.
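To make the idea concrete, here is a minimal sketch of the averaging approach, assuming a CLIP text encoder from a Stable Diffusion checkpoint. The checkpoint name and the placeholder string `<low-angle>` are purely illustrative, not what the script uses:

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint; the script takes this via --pretrained_model_name_or_path.
model_name = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")

# Add the new placeholder token and make room for its embedding row.
placeholder_token = "<low-angle>"
tokenizer.add_tokens(placeholder_token)
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

# Encode the multi-word phrase and average the corresponding embedding rows.
initializer_ids = tokenizer.encode("low camera angle", add_special_tokens=False)
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_ids].mean(dim=0)
```

The mean stays in the input-embedding space, which is what textual inversion optimises, so the shapes line up; whether it's actually a better starting point than a single hand-picked token is exactly what I'm unsure about.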
Thanks!