
lm_head and v_head, why re-initialize and why dropout? #43

Closed
clam004 opened this issue Aug 14, 2022 · 4 comments

Comments

clam004 commented Aug 14, 2022

First off, thank you for building this! 3 questions regarding the two heads of the policy model:

  1. Why re-initialize the weights in the language model head in class GPT2HeadWithValueModel

     self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

when a trained lm_head already exists in GPT2LMHeadModel?

  2. Why does the model still speak coherently before training even though the lm_head weights of the model are random?

from 01-gpt2-with-value-head.ipynb

My most favourite movie is Captain America: Civil War, which moved into the
My least favourite movie is Jon Favreau's Log Horizon, complete with psychedelic
  3. Why use dropout on your value? The value is not like an entire layer of a neural network, where you don't want the model to rely too heavily on one activation; the value is the one and only signal you get from that layer, so why drop it out? (See the short dropout illustration after the module printout below.)
  (v_head): ValueHead(
    (summary): Linear(in_features=768, out_features=1, bias=True)
    (activation): Identity()
    (first_dropout): Dropout(p=0.1, inplace=False)
    (last_dropout): Identity()
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
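
To make the concern concrete, here is a tiny illustration (plain PyTorch, not the trl code) of what dropout does when it is applied directly to the scalar value estimate:

```python
import torch
import torch.nn as nn

# Hypothetical illustration: dropout applied directly to a scalar value estimate.
dropout = nn.Dropout(p=0.1)
dropout.train()  # dropout only does anything in training mode

value = torch.tensor([[2.5]])  # a single value prediction
samples = torch.stack([dropout(value) for _ in range(10_000)])

# Roughly 10% of the samples are zeroed out entirely; the survivors are
# scaled by 1 / (1 - 0.1) so the expectation stays the same.
print((samples == 0).float().mean())   # ~0.10
print(samples[samples != 0].unique())  # tensor([2.7778])
```

So with p=0.1, about one training step in ten sees its only value signal for that token replaced by zero.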

Thanks again!

clam004 (author) commented Aug 30, 2022

So I did some research on my own and basically my first 2 questions can be answered by looking at the huggingface transformers repository: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py

danjohnvelasco commented:

> So I did some research on my own and basically my first 2 questions can be answered by looking at the huggingface transformers repository: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py

Hi @clam004, do you mind explaining your answer/understanding on why they do it? Thanks!

clam004 (author) commented Dec 14, 2022

@danjohnvelasco as long as you use the same name self.lm_head, when you load the pretrained model from the dictionary of parameters, these linear parameters will be replaced with the trained ones. That's why the model still works (question 2). Also, regarding question 3, I suspect it somehow doesn't matter, although I'm not sure why, because when I run this repo without the dropout layer it behaves the same, as expected.
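
Here is a rough sketch of that mechanism (a toy model, not the actual modeling_utils.py code): from_pretrained builds the model with freshly initialized modules and then loads the checkpoint state dict into it, matching parameters by attribute name, so anything called lm_head gets its random initialization overwritten.

```python
import torch
import torch.nn as nn

# Toy sketch of the idea behind from_pretrained: parameters are matched by name,
# so a freshly re-initialized lm_head is overwritten by the checkpoint weights.
class TinyLM(nn.Module):
    def __init__(self, n_embd=8, vocab_size=16):
        super().__init__()
        self.transformer = nn.Linear(n_embd, n_embd)               # stand-in for the GPT-2 body
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # randomly initialized here

# Pretend this is the pretrained checkpoint on disk.
pretrained = TinyLM()
checkpoint = pretrained.state_dict()

# A new model with its own randomly initialized lm_head, as in GPT2HeadWithValueModel.__init__.
model = TinyLM()
model.load_state_dict(checkpoint)  # keys match by name, so lm_head.weight is replaced

print(torch.equal(model.lm_head.weight, pretrained.lm_head.weight))  # True
```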

lvwerra (member) commented Jan 13, 2023

Regarding 3, I agree, and we moved the dropout before the linear layer in #70.
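
For anyone landing here later, a minimal sketch of that arrangement (illustrative only, names assumed, not the exact code from #70): the dropout then regularizes the hidden states feeding the value projection rather than the scalar value itself.

```python
import torch.nn as nn

# Illustrative sketch (not the exact #70 code): dropout on the hidden states
# feeding the value projection, instead of on the scalar value output.
class ValueHeadSketch(nn.Module):
    def __init__(self, hidden_size=768, dropout_p=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)      # applied before the linear layer
        self.summary = nn.Linear(hidden_size, 1)  # projects each hidden state to a scalar value

    def forward(self, hidden_states):
        return self.summary(self.dropout(hidden_states))
```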

lvwerra closed this as completed Jan 13, 2023