GPT2 model does not have attention mask #808
Comments
Indeed, I will remove this docstring; there is no attention_mask on GPT-2.
But what should I do if I want to avoid computing attention on the padding in the input sequences?
GPT-2 is a model with absolute position embeddings (like BERT), so you should always pad on the right to get the best performance from this model (I will add this information to the docstring). Since it is a causal model (it only attends to the left context), this also means that the model will never attend to the padding tokens (which are on the right) from any real token anyway. So, in conclusion, there is no need to take special care to avoid attention on the padding. Just don't use the outputs at the padded positions for anything, since they don't contain any reliable information (which I hope is obvious).
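A minimal sketch of that right-padding recipe, assuming the library's GPT2Tokenizer and GPT2Model classes (imported from transformers here; the import path differed in the pytorch-transformers era). The padding id and variable names are illustrative, and the value used for padding does not matter because the padded positions are simply discarded:

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

texts = ["a short sentence", "a somewhat longer example sentence for padding"]
encoded = [tokenizer.encode(t) for t in texts]
lengths = [len(ids) for ids in encoded]
max_len = max(lengths)

# Pad on the RIGHT: GPT-2 uses absolute position embeddings, and because it is
# causal (left-to-right), real tokens never attend to the right-side padding.
pad_id = 0  # arbitrary filler; outputs at these positions are discarded below
input_ids = torch.tensor(
    [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
)

with torch.no_grad():
    hidden_states = model(input_ids)[0]  # (batch, max_len, hidden_size)

# Only keep outputs at real-token positions; outputs at padded positions
# carry no reliable information and should simply be ignored.
last_token_states = torch.stack(
    [hidden_states[i, lengths[i] - 1] for i in range(len(texts))]
)
print(last_token_states.shape)  # (batch, hidden_size)
```

Because the padding sits to the right of every real token, the causal mask already keeps it from influencing the real positions, which is why no explicit attention_mask is needed here.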
@thomwolf Thanks so much, and great job!
Hello, in the docstring of the GPT2 model, it says there is an optional input called attention_mask to avoid computing attention on padding. But I cannot find the implementation, and there is no such argument either.