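You can right-pad the token tensor with any token. The choice of padding token does not affect training, because the causal (autoregressive) attention mask in the decoder stops every position from attending to the positions after it, so the real tokens never see the padding. It is also important to mask the loss accordingly, so that the padded positions do not contribute to it.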
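
As a rough illustration of the loss masking (not taken from the Whisper code; the sizes, `n_real`, and the use of `-100` as the ignore label are assumptions for this sketch), one common approach is to set the padded target positions to the `ignore_index` of `F.cross_entropy`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 10, 6  # hypothetical sizes for the sketch
n_real = 4                   # assumption: the first 4 targets are real, the rest is padding

logits = torch.randn(1, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (1, seq_len))

# Replace the padded target positions with the ignore label.
labels = targets.clone()
labels[:, n_real:] = -100

# Positions labelled -100 are skipped and do not enter the mean.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
)

# Reference: the loss computed over the real positions only.
reference = F.cross_entropy(
    logits[:, :n_real].reshape(-1, vocab_size), targets[:, :n_real].reshape(-1)
)
```

Here `loss` and `reference` should agree, since the ignored positions are excluded from both the sum and the averaging.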

In the code, the mask is implemented as follows:

mask = torch.empty(n_ctx, n_ctx).fill_(-np.inf).triu_(1)
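
To see why the padding token does not matter, here is a minimal sketch that applies the same kind of mask to a toy single-head attention. The attention function has no learned projections, and `n_real`, `d`, and the random embeddings are assumptions made up for the example; `float("-inf")` stands in for `-np.inf` only to avoid the NumPy import.

```python
import torch

torch.manual_seed(0)
n_ctx, d = 6, 8  # hypothetical context length and embedding width
n_real = 4       # assumption: positions 0..3 are real tokens, 4..5 are padding

# Same construction as above: -inf strictly above the diagonal, 0 elsewhere.
mask = torch.empty(n_ctx, n_ctx).fill_(float("-inf")).triu_(1)

def causal_attention(x):
    # Toy single-head self-attention without learned projections.
    scores = x @ x.T / d ** 0.5 + mask
    return torch.softmax(scores, dim=-1) @ x

x_a = torch.randn(n_ctx, d)
x_b = x_a.clone()
x_b[n_real:] = torch.randn(n_ctx - n_real, d)  # a different "padding" embedding

out_a = causal_attention(x_a)
out_b = causal_attention(x_b)

# Outputs at the real positions are identical regardless of the padding.
same_on_real = torch.allclose(out_a[:n_real], out_b[:n_real])
```

Only the outputs at the padded positions change, and those are the positions that the loss mask removes.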

Answer selected by jongwook