multi-turn dialog format #17
There ARE special tokens (<|endoftext|>, id=50256) between dialogue turns in the multi-turn setup. Your input format should look like this:

Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

Let us know if you have any further concerns.
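A minimal sketch of that format using the Hugging Face transformers tokenizer; the checkpoint name and the example turns are assumptions for illustration, not taken from this thread:

```python
# Sketch only: joins dialogue turns with <|endoftext|> (id 50256) as
# described above. Checkpoint name and turns are placeholders.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-medium")
assert tokenizer.eos_token == "<|endoftext|>"

turns = [
    "Does money buy happiness?",
    "Depends how much money you spend on it.",
    "What is the best way to buy happiness?",
]

# Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN
text = tokenizer.eos_token.join(turns)
input_ids = tokenizer.encode(text)
print(input_ids)  # id 50256 appears between the turns
```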
Thanks for the clarification, that makes sense.
What are these? (Line 107 in 5688864)
They look different for the different turns in a session and get passed to the model during training. That would be very odd, as the GPT2 model from the transformers library treats token_type_ids as if they were regular input_ids and uses the same embedding for them.
Yes, they are different for different turns in the data we prepared. I believe we ran some experiments on this and ended up using the default token_type_id (by setting token_type_id = None) during training.
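For context, a hedged sketch of why per-turn type ids behave oddly in the Hugging Face GPT2 implementation: token_type_ids are looked up in the same word-token embedding table (wte) as input_ids, so leaving them at the default None simply skips that lookup. The checkpoint name and the toy ids are assumptions:

```python
# Sketch: in transformers' GPT2Model.forward, token_type_ids (when given)
# are embedded with self.wte, the same table used for input_ids, so type
# id 1 retrieves the embedding of vocabulary token 1. Passing
# token_type_ids=None (the default) adds no type embedding at all.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-medium")

input_ids = torch.tensor([[50256, 1, 2, 3]])  # toy ids, illustrative only
type_ids = torch.zeros_like(input_ids)        # per-turn ids would go here

with torch.no_grad():
    default = model(input_ids)                           # token_type_ids=None
    typed = model(input_ids, token_type_ids=type_ids)

# Even all-zero type ids add the embedding of vocabulary token 0 to every
# position, so the logits differ from the default run.
print(torch.allclose(default.logits, typed.logits))  # False
```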
I see it now: lines 279 to 280 in 5688864.
@dreasysnail: should there be spaces around the <|endoftext|> or not?
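The spacing matters for tokenization either way. A hedged sketch of the difference (checkpoint name is an assumption; this does not settle which variant DialoGPT was trained on):

```python
# Sketch: GPT-2's byte-level BPE folds a leading space into the following
# token, so spacing around <|endoftext|> changes the ids of the
# neighboring words, while the separator itself stays id 50256.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-medium")

print(tokenizer.encode("hello<|endoftext|>world"))
print(tokenizer.encode("hello <|endoftext|> world"))
# Both contain 50256 between the turns, but the surrounding word ids differ.
```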
@dreasysnail, was DialoGPT trained with any kind of padding (e.g., if the entire dialogue doesn't fill up the max length)? Or did the multi-turn dialogue always fill up the entire max length (as in GPT2 training)?
Section 3.1 of the paper states that dialogue turns from the same session are concatenated into one long text, ending with the end-of-text token.
Does this mean that there are no special tokens in between dialog turns?
How do I separate dialog turns?