Tokens in multi-turn setting #30

Open
ferdinando17 opened this issue Feb 6, 2020 · 12 comments

Comments

@ferdinando17

ferdinando17 commented Feb 6, 2020

Hi,
thanks for making the work available and for the explanations.

From the paper I understand that a training instance is a dialogue session, made up of several dialogue turns concatenated and ended by the end-of-text token.

Based on this and on what dreasysnail says in Issue #17:

> There ARE special tokens (<|endoftext|>, id=50256) between dialogue turns in multi-turn setup. Your input format should be like this:
>
> Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

my question is:

are the tokens between different dialogue turns the same as the tokens separating whole dialogue sessions?

Thank you

@liehtman

liehtman commented Feb 13, 2020

> are the tokens between different dialogue turns the same as the tokens separating whole dialogue sessions?

If I understand right, there are NO tokens between dialogue sessions, because one dialogue session is one training example and contains a source (utt1 <|eos|> utt2 <|eos|> utt3) and a target (utt4). The next session is passed to the model as another training sample.
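A minimal sketch of that reading (session_to_tsv_line is a hypothetical helper, not code from this repository): each session becomes one .tsv line, and nothing but the row boundary separates sessions.

```python
# Sketch only: one dialogue session -> one .tsv training line.
# session_to_tsv_line is a hypothetical helper, not the repo's prepro code.
def session_to_tsv_line(turns):
    """turns: utterances of one session; the last turn is the target."""
    source = " <|eos|> ".join(turns[:-1])   # utt1 <|eos|> utt2 <|eos|> utt3
    target = turns[-1]                      # utt4
    return source + "\t" + target

sessions = [
    ["utt1", "utt2", "utt3", "utt4"],
    ["uttA", "uttB", "uttC"],
]

# Each session goes on its own line; no special token is inserted between sessions.
with open("train.tsv", "w", encoding="utf-8") as f:
    for session in sessions:
        f.write(session_to_tsv_line(session) + "\n")
```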

@ferdinando17

ferdinando17 commented Feb 13, 2020

Thank you liehtman, this is very helpful.

My current, updated understanding is that the .tsv file must be in the format you described,
with a \t between the source (utt1 <|eos|> utt2 <|eos|> utt3) and the target (utt4).

Then prepro.py will create the features, which end with an <|endoftext|> token (id=50256).
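For what it's worth, the id can be checked with the Hugging Face GPT-2 tokenizer (DialoGPT reuses the GPT-2 vocabulary); this is only an illustration, not what prepro.py actually does:

```python
# Illustration only: verify the end-of-text id with the GPT-2 tokenizer
# (DialoGPT shares GPT-2's vocabulary); not the repo's prepro.py.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.eos_token)                               # <|endoftext|>
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # 50256

# A feature sequence for one target, ended by the end-of-text token:
ids = tokenizer.encode("i am fine thanks") + [tokenizer.eos_token_id]
print(ids[-1])                                           # 50256
```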

@GraphGrailAi

Here, interested too.

@ferdinando17

I successfully managed to fine-tune the model with input data in this form:
each line of the .tsv file is a dialogue, with each turn separated by <|eos|> and a tab that separates the target from the rest of the dialogue.

A sample training instance is therefore:
utt1 <|eos|> utt2 <|eos|> utt3 \t target \n
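A quick sanity check of that line format (illustrative only), splitting a line back into its context turns and target:

```python
# Illustrative sanity check for the line format described above.
line = "utt1 <|eos|> utt2 <|eos|> utt3\ttarget utterance\n"

source, target = line.rstrip("\n").split("\t")
context_turns = [turn.strip() for turn in source.split("<|eos|>")]

print(context_turns)   # ['utt1', 'utt2', 'utt3']
print(target)          # target utterance
```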

@LooperXX

LooperXX commented Apr 28, 2020

Hi @ferdinando17. I am trying to fine-tune the model with my own dataset. I failed to run python demo.py --data small, so I can't see the exact format of the .tsv file. After reading some of the code, I agree with your opinion. Could you please confirm whether the format of my dataset (.tsv file) is correct:

0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n

Hope to get your reply. Thanks.

@ferdinando17

ferdinando17 commented Apr 28, 2020

Hi,
you are missing the tab; it should be
"0.0 utt1 0.0 EOS utt2 0.0 EOS utt3 \t 1.0 i am a admin .\n"

to ask DialoGPT to predict "i am a admin. "
Look at my example.

Also, the zeros mean you are not training on the utterances that follow them; is that what you want?

@LooperXX

LooperXX commented Apr 29, 2020

Hi @ferdinando17, this is what bothers me. In a multi-turn dialogue we have several previous turns as context, one user turn as the question, and one system turn as the answer. From your explanation, I realized that it should be

0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n

as the example format in the training/fine-tuning dataset, where only the first sentence is 0.0 and the remaining sentences are 1.0, so the model is trained/fine-tuned on every turn regardless of whether it is a user or a system turn.
(Actually, I am not sure whether I should distinguish between user and system turns, giving 0.0 to user turns and 1.0 to system turns so that the model only needs to predict the system turns, since at evaluation time the model only has to predict the system utterances. But maybe all 1.0 will help train the model with more data.)
Is that correct? Hope to get your reply. Thanks. 🙏
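To make the two options concrete, here is a rough sketch of how the two weighting schemes would look when building a line (to_line is a hypothetical helper, and the weight-prefix format follows the examples in this thread, not verified against prepro.py):

```python
# Sketch of the two weighting schemes discussed above. to_line is a hypothetical
# helper; the weight-prefix format follows the examples in this thread.
def to_line(turns, weights):
    """Prefix each turn with its weight; a tab separates context from target."""
    pieces = [f"{w} {t}" for w, t in zip(weights, turns)]
    return " EOS ".join(pieces[:-1]) + "\t" + pieces[-1]

turns = ["utt1", "utt2", "utt3", "i am a admin ."]

# Scheme A: only the first turn is context-only (0.0); everything else is trained on.
print(to_line(turns, [0.0, 1.0, 1.0, 1.0]))
# -> 0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 <TAB> 1.0 i am a admin .

# Scheme B: user turns 0.0, system turns 1.0 (train only on system responses).
print(to_line(turns, [0.0, 1.0, 0.0, 1.0]))
# -> 0.0 utt1 EOS 1.0 utt2 EOS 0.0 utt3 <TAB> 1.0 i am a admin .
```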

@ferdinando17

Are you applying it to task-oriented dialogue?

I understand that the 0.0 is for sentences you want to filter out; the authors used it to avoid training on offensive language.
I used all 1.0, and my training instances were of the form I specified, where the target was always a system turn.

I hope it makes sense.

@LooperXX

LooperXX commented May 6, 2020

Hi, @ferdinando17. Thank you for your reply.
Yes, I am trying to apply it to task-oriented dialogue. My understanding is that 0.0 means the model is not trained to predict that sentence, and 1.0 means it is. So I think it is fine to train the model by marking the first sentence of each multi-turn dialogue 0.0, as context information, and marking the rest 1.0. Alternatively, we can mark every user turn 0.0 and every system turn 1.0.
Maybe more experiments on the two different settings are needed.
Thanks again for your reply.

@ferdinando17

ferdinando17 commented May 6, 2020

Ok, I see. I disagree, but of course I might be wrong.
In this issue, another user says 0.0 causes the sentences to be ignored during training. They refer to the Hugging Face docs too.

Let me know if you find evidence of the contrary.
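If that reading is right, the practical effect would be the same as masking the labels of 0.0-weighted positions so they contribute nothing to the loss, similar to the -100 ignore-index convention used by PyTorch's CrossEntropyLoss and Hugging Face models. A rough illustration of that interpretation (not this repository's training code):

```python
# Rough illustration of "0.0 means the turn is ignored in training": positions
# belonging to 0.0-weighted turns get label -100, which
# CrossEntropyLoss(ignore_index=-100) skips. This is an interpretation of the
# discussion above, not the repository's actual training loop.
import torch

token_ids = torch.tensor([11, 12, 13, 21, 22, 50256])  # toy ids: a 0.0 turn, a 1.0 turn + eos
weights = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]               # per-token weight, expanded from per-turn

labels = torch.tensor(
    [tid if w > 0 else -100 for tid, w in zip(token_ids.tolist(), weights)]
)
print(labels)  # tensor([ -100,  -100,  -100,    21,    22, 50256])
```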

@dreasysnail dreasysnail pinned this issue Oct 5, 2020
@minmummax

Hi guys, how do I deal with datasets like this:
person1: utt1, person2: utt2, person1: utt3 ...
Referring to what you all said, I think it should look like this:
1.0 utt1 EOS 1.0 utt2 EOS \t 1.0 utt3

Is this correct?
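One possible way to expand such a transcript, if you want more than one training instance per dialogue, is to let the context grow and take each person2 turn as the target. This is only a sketch of that idea (dialogue_to_lines is a hypothetical helper, with all weights set to 1.0):

```python
# Sketch: expand one person1/person2 dialogue into one training line per
# person2 (responding) turn. dialogue_to_lines is a hypothetical helper.
def dialogue_to_lines(turns):
    lines = []
    for i in range(1, len(turns), 2):                 # odd indices = person2 turns
        context = ["1.0 " + t for t in turns[:i]]
        target = "1.0 " + turns[i]
        lines.append(" EOS ".join(context) + "\t" + target)
    return lines

for line in dialogue_to_lines(["utt1", "utt2", "utt3", "utt4"]):
    print(line)
# -> 1.0 utt1 <TAB> 1.0 utt2
# -> 1.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 <TAB> 1.0 utt4
```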

@minmummax

Also, I just wonder what the validation set should look like.
