multi-turn dialog format #17

Closed · LHolten opened this issue Nov 22, 2019 · 7 comments

@LHolten commented Nov 22, 2019

Section 3.1 of the paper states that dialog turns of the same session are concatenated into a long text, ended by the end-of-text token.

Does this mean that there are no special tokens in between dialog turns?

How do I separate dialog turns?

@dreasysnail (Contributor)
There ARE special tokens (<|endoftext|>, id = 50256) between dialogue turns in the multi-turn setup. Your input format should look like this:

Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

Let us know if you have any further concerns.
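For concreteness, here is a minimal sketch (not code from this repo) of how such an input could be assembled with the Hugging Face transformers tokenizer; the checkpoint name and the example turns are placeholders.

# Minimal sketch; "microsoft/DialoGPT-medium" and the example turns are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

turns = ["Does money buy happiness?",
         "Depends how much money you spend on it.",
         "What is the best way to buy happiness?"]

# Build: Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN
input_ids = []
for turn in turns[:-1]:
    input_ids += tokenizer.encode(turn) + [tokenizer.eos_token_id]  # eos id is 50256
input_ids += tokenizer.encode(turns[-1])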

@LHolten (Author) commented Nov 23, 2019

Thanks for the clarification, that makes sense.

@LHolten (Author) commented Nov 23, 2019

What are these token_type_ids for?

token_type_ids += [i] * (len(s) + 1)

It looks like they are different for the different turns in a session and get passed to the model during training.
This would be very strange, because the GPT-2 model from the transformers library looks up token_type_ids in the same embedding table as regular input_ids, so the turn indices would reuse ordinary token embeddings.
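To illustrate the concern, a small sketch (not repo code): in the transformers implementation, GPT-2 embeds token_type_ids with the word-token embedding table (model.wte), so turn indices 0, 1, 2, ... pick up the embeddings of the tokens that happen to have those ids.

import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

input_ids = torch.tensor([[50256, 318, 428, 257, 1332]])  # arbitrary token ids
turn_ids = torch.tensor([[0, 0, 1, 1, 1]])                # per-token turn index

with torch.no_grad():
    out_with = model(input_ids, token_type_ids=turn_ids).last_hidden_state
    out_plain = model(input_ids).last_hidden_state

# The outputs differ, because the turn indices are embedded with the same
# matrix (model.wte) that embeds the input tokens.
print(torch.allclose(out_with, out_plain))  # False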

@intersun (Contributor)
Yes, they are different for different turns in the data we prepared. I believe we ran some experiments on this and ended up using the default token type ids (by setting token_type_ids = None) during training.

@LHolten (Author) commented Nov 25, 2019

I see it now:

DialoGPT/LSP_train.py

Lines 279 to 280 in 5688864

if args.no_token_id:
    token_ids = None
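Putting the two quoted snippets together, a rough self-contained sketch (the variable names and example token ids are hypothetical, not taken from the repo):

turns = [[3666, 1110, 373, 1049], [10248, 284, 3285]]  # token ids per turn (made-up values)
no_token_id = True                                     # stands in for args.no_token_id

eos = 50256
input_ids, token_type_ids = [], []
for i, s in enumerate(turns):
    input_ids += s + [eos]                  # turns separated by <|endoftext|>
    token_type_ids += [i] * (len(s) + 1)    # one turn index per token, including the EOS

if no_token_id:
    token_type_ids = None                   # fall back to the library default during training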

@abisee commented Oct 20, 2020

Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

@dreasysnail: should there be spaces around the <|endoftext|> or not?

@abisee commented Oct 20, 2020

@dreasysnail, was DialoGPT trained with any kind of padding (e.g., if the entire dialogue doesn't fill up the max length)? Or did the multi-turn dialogue always fill up the entire max length (as in GPT2 training)?
