Add dialogue data collator tests (#2485)
Add a unit test for the dialogue data collator. Things to note on this PR:
- Is it correct that we mask the last occurrence of `<|endoftext|>` of
the assistant? See the example in the test: there is one case where we
have this token and one where there is none. See the TODO in the code.
- I built a dummy tokenizer from the pythia-70m one using `tokenizer =
old_tokenizer.train_new_from_iterator(training_iter, vocab_size)` to
keep the size minimal. It is trained only on the text that appears in
the test.
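
To make the masking question above concrete, here is a minimal sketch of one possible reading of it. It is not the collator from this PR: `assistant_label_mask` and its `mask_final_eos` flag are hypothetical names, and tokens are plain strings rather than ids. It only shows how the label mask differs depending on whether the trailing `<|endoftext|>` of the assistant turn is masked.

```python
from typing import List

EOS = "<|endoftext|>"
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"


def assistant_label_mask(tokens: List[str], mask_final_eos: bool) -> List[bool]:
    """Return True for each token that would contribute to the loss.

    Hypothetical helper: only tokens inside an assistant turn are targets,
    and the role markers themselves are never targets.
    """
    mask: List[bool] = []
    in_assistant = False
    for tok in tokens:
        if tok == ASSISTANT:
            in_assistant = True
            mask.append(False)
        elif tok == PROMPTER:
            in_assistant = False
            mask.append(False)
        else:
            mask.append(in_assistant)
    # The open question: should the last <|endoftext|> be excluded as well?
    if mask_final_eos and tokens and tokens[-1] == EOS:
        mask[-1] = False
    return mask


tokens = [PROMPTER, "Hi", EOS, ASSISTANT, "Hello", EOS]
kept = assistant_label_mask(tokens, mask_final_eos=False)
dropped = assistant_label_mask(tokens, mask_final_eos=True)
```

With `mask_final_eos=False` the model is trained to emit the end-of-text token after its reply; with `True` it is not, which is the behavior the note asks reviewers to confirm.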
CloseChoice committed Apr 20, 2023
1 parent b9c60ed commit 969a3ba
Showing 4 changed files with 1,563 additions and 0 deletions.
@@ -0,0 +1,14 @@
{
"additional_special_tokens": [
"<|prompter|>",
"<|assistant|>",
"<|system|>",
"<|prefix_begin|>",
"<|prefix_end|>"
],
"bos_token": "<|endoftext|>",
"eos_token": "<|endoftext|>",
"pad_token": "<|padding|>",
"sep_token": "<|endoftext|>",
"unk_token": "<|endoftext|>"
}
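
The mapping above assigns the same string to several roles. As a quick illustration, the sketch below groups the roles by token using only the standard library. The JSON is copied from the hunk; `roles_by_token` is a hypothetical helper and not part of the PR.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Copied from the diff hunk above.
SPECIAL_TOKENS = json.loads("""
{
  "additional_special_tokens": [
    "<|prompter|>", "<|assistant|>", "<|system|>",
    "<|prefix_begin|>", "<|prefix_end|>"
  ],
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|padding|>",
  "sep_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
""")


def roles_by_token(mapping: dict) -> Dict[str, List[str]]:
    """Group single-token roles (bos, eos, ...) by the token string they use."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for role, token in mapping.items():
        if isinstance(token, str):  # skip the additional_special_tokens list
            groups[token].append(role)
    return dict(groups)


groups = roles_by_token(SPECIAL_TOKENS)
```

Here `groups["<|endoftext|>"]` lists four roles, so code that masks or strips the end-of-text token also affects the beginning-of-sequence, separator, and unknown-token roles, while padding uses its own token.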
