Commit
Add dialogue data collator tests (#2485)
Add a unit test for the dialogue data collator. Things to note on this PR:
- Is it correct that we mask the last occurrence of `<|endoftext|>` of the assistant? See the example in the test: there is one occurrence where this token is present and one where it is not. See the todo in the code.
- I built a dummy tokenizer from the pythia-70m one using `tokenizer = old_tokenizer.train_new_from_iterator(training_iter, vocab_size)` to keep the size minimal. It is trained only on the text that appears in the test.
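To make the masking question concrete, here is a minimal, hypothetical sketch of the behavior being asked about. It is not the collator's actual code: the function `build_label_mask`, the `mask_last_eos` flag, the role names, and the assumption that every assistant turn is followed by `<|endoftext|>` are all illustrative. Tokens are plain strings so the sketch needs no tokenizer.

```python
EOS = "<|endoftext|>"


def build_label_mask(turns, mask_last_eos=True):
    """Flatten dialogue turns and mark which tokens contribute to the loss.

    turns: list of (role, tokens) pairs, role is "prompter" or "assistant".
    Returns (tokens, label_mask); label_mask[i] is True when token i is a
    training target. Assumption: only assistant tokens are targets, and an
    EOS token is appended after every assistant turn.
    """
    tokens, label_mask = [], []
    for role, turn_tokens in turns:
        is_assistant = role == "assistant"
        tokens.extend(turn_tokens)
        label_mask.extend([is_assistant] * len(turn_tokens))
        if is_assistant:
            tokens.append(EOS)
            label_mask.append(True)
    # The behavior in question: exclude the final assistant EOS from the loss.
    if mask_last_eos and tokens and tokens[-1] == EOS:
        label_mask[-1] = False
    return tokens, label_mask


turns = [
    ("prompter", ["Hi"]),
    ("assistant", ["Hello", "!"]),
    ("prompter", ["Bye"]),
    ("assistant", ["See", "you"]),
]
tokens, label_mask = build_label_mask(turns)
# EOS tokens that remain training targets after masking
kept_eos = [t for t, keep in zip(tokens, label_mask) if keep and t == EOS]
```

With two assistant turns, the first `<|endoftext|>` stays a training target and the last one is masked, which mirrors the "one with the token, one without" situation described in the note. Passing `mask_last_eos=False` keeps both.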