
Fix token counting to allow there to be no attention mask #818

Merged: 3 commits merged into mosaicml:main on Dec 22, 2023

Conversation

@dakinggg (Collaborator) commented Dec 21, 2023

When we pretokenize, we pass raw tensors straight to the collator, which does not add an attention mask. This is fine, because we pretokenize without padding, but it would crash the token counting function if your tokenizer had a pad token.
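The fix described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual llm-foundry implementation (the function name and batch layout are assumptions): count non-padded tokens via the attention mask when the collator provided one, and fall back to the raw tensor size when it did not, since unpadded pretokenized batches contain only real tokens.

```python
import torch


def count_tokens(batch: dict) -> int:
    """Count real tokens in a batch, tolerating a missing attention mask."""
    # Padded batches carry an attention mask; sum it to skip pad positions.
    if 'attention_mask' in batch:
        return int(batch['attention_mask'].sum().item())
    # Pretokenized batches are unpadded and have no mask, so every
    # position in input_ids is a real token.
    return batch['input_ids'].numel()


# Padded batch: 5 real tokens out of 8 positions.
batch_padded = {
    'input_ids': torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]]),
    'attention_mask': torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]]),
}
# Pretokenized batch: no attention mask, all 8 positions are real tokens.
batch_pretokenized = {'input_ids': torch.ones(2, 4, dtype=torch.long)}

print(count_tokens(batch_padded))        # 5
print(count_tokens(batch_pretokenized))  # 8
```

Checking for the mask's presence, rather than for a pad token on the tokenizer, keeps the two code paths independent of tokenizer configuration.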

[Screenshot: IFT token count is the same before and after the change]

[Screenshot: pretokenized token count as expected (40x2048x960)]

@dakinggg dakinggg marked this pull request as ready for review December 22, 2023 00:01
@alextrott16 (Contributor) left a comment


"I knew you'd come crawling back"
-- input_ids, probably

@dakinggg dakinggg merged commit 836ab95 into mosaicml:main Dec 22, 2023
10 checks passed
@dakinggg dakinggg deleted the fix-no-attn-mask branch February 10, 2024 07:30

3 participants