PCC: questions regarding additional token initialization (PCC Decoder) #33

@BierOne

Description

Hi there!

Thanks for sharing this interesting work. I have a few questions regarding the patch model. I implemented the creation script as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

NEW_TOKENS = [
    '<mem>',  # <mem> marks the start of a memory block.
    '<ae>',   # <ae> is the hint token for the reconstruction task.
    '</mem>', # </mem> marks the end of a memory block.
]
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)
num_added = tokenizer.add_special_tokens({
    "additional_special_tokens": NEW_TOKENS
})
assert num_added == len(NEW_TOKENS), "Error adding tokens (some tokens already exist!!)."

print(f"Added {num_added} tokens.")
model.resize_token_embeddings(len(tokenizer))

token_ids = tokenizer.convert_tokens_to_ids(NEW_TOKENS)
print("Token IDs:", token_ids)
with torch.no_grad():
    input_emb_matrix = model.get_input_embeddings().weight.detach()
    lm_head_matrix = model.get_output_embeddings().weight.detach()
    embedding_vectors = input_emb_matrix[token_ids].clone()
    lm_head_vectors = lm_head_matrix[token_ids].clone()

patch = {
    "tokens": NEW_TOKENS,
    "embedding": embedding_vectors,
    "lm_head": lm_head_vectors
}
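For completeness, here is a sketch of how I would save this patch dict and re-apply it to a freshly resized model. The re-application logic is my assumption about how the patch is meant to be consumed, not code from the repository, and the toy matrices stand in for the real 8B checkpoint:

```python
import torch

def apply_patch(patch, input_emb, lm_head):
    # Copy the saved vectors back into the last rows of the resized matrices.
    # Assumes the new tokens occupy the final len(patch["tokens"]) rows,
    # as produced by resize_token_embeddings after add_special_tokens.
    n = len(patch["tokens"])
    with torch.no_grad():
        input_emb[-n:] = patch["embedding"]
        lm_head[-n:] = patch["lm_head"]

# Tiny stand-in matrices instead of a real checkpoint:
vocab, dim, n_new = 8, 4, 3
emb = torch.zeros(vocab, dim)
head = torch.zeros(vocab, dim)
patch = {
    "tokens": ["<mem>", "<ae>", "</mem>"],
    "embedding": torch.ones(n_new, dim),
    "lm_head": torch.full((n_new, dim), 2.0),
}
# torch.save(patch, ...) would persist it; here we just apply it in memory.
apply_patch(patch, emb, head)
```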

Based on this snippet, I have three questions:

  1. In the injection code in model.py, why do you explicitly set eos_token_id to 128001 and pad_token_id to 128002? By default, these values are 128009 and None, respectively:
self.tokenizer.eos_token_id = 128001
self.tokenizer.pad_token_id = 128002  # <|reserved_special_token_0|>
patch = torch.load("model/patch/llama3_8b_special_token_patch.pt")
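For context on why a dedicated pad id can matter (this is my own illustration, not the authors' stated reasoning): if padding reused the EOS id, the standard label masking that ignores pad positions would also hide the genuine end-of-sequence target from the loss. A minimal sketch:

```python
EOS = 128001  # eos_token_id as set in model.py
PAD = 128002  # <|reserved_special_token_0|>, used as pad

seq = [10, 11, EOS, PAD, PAD]

# Standard label masking: padding positions are ignored by the loss (-100).
labels = [t if t != PAD else -100 for t in seq]
assert labels == [10, 11, EOS, -100, -100]  # EOS is still a training target

# If padding reused the EOS id instead, masking pads would also mask the
# real EOS:
seq_bad = [10, 11, EOS, EOS, EOS]  # padded with the EOS id
labels_bad = [t if t != EOS else -100 for t in seq_bad]
assert labels_bad == [10, 11, -100, -100, -100]  # EOS target lost
```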
  2. For pretraining and finetuning, the decoder loads the same patch .pt file. According to your previous response, this patch file should be created as in the code above (which initializes the embeddings randomly, though in line with the model's existing distribution). Could you confirm that this snippet is correct?

  3. I noticed that the decoder always sets _set_grad_mode to False. Does this mean that the added token embeddings are not updated even during the warmup stage? According to your README, the embeddings should be optimized during warmup for both pretraining and finetuning. Am I misunderstanding something here?
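On question 3, one common way to optimize only the newly added embedding rows while keeping all pre-existing rows frozen is a gradient hook that zeroes gradients everywhere except the new-token rows. Whether the repository implements warmup this way is an assumption on my part; this is just the pattern I would expect:

```python
import torch

emb = torch.nn.Embedding(10, 4)   # toy vocab of 10; rows 8-9 are the "new tokens"
new_ids = torch.tensor([8, 9])

# Zero the gradient for every row except the new-token rows.
mask = torch.zeros(emb.num_embeddings, 1)
mask[new_ids] = 1.0
emb.weight.register_hook(lambda grad: grad * mask)

loss = emb(torch.tensor([1, 8])).sum()
loss.backward()

assert emb.weight.grad[1].abs().sum() == 0   # old row: stays frozen
assert emb.weight.grad[8].abs().sum() > 0    # new row: receives gradient
```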

Thanks in advance for your clarification!
