Tokens in multi-turn setting #30

Open
ferdinando17 opened this issue Feb 6, 2020 · 12 comments

Comments

@ferdinando17

ferdinando17 commented Feb 6, 2020

Hi,
thanks for making the work available and for the explanations.

From the paper I understand that a training instance is a dialogue session, made up of several dialogue turns concatenated and ended by the end-of-text token.

Based on this and on what dreasysnail says in Issue #17:

> There ARE special tokens (<|endoftext|>, id=50256) between dialogue turns in multi-turn setup. Your input format should be like this:
>
> Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN

my question is:

are the tokens between different dialogue turns the same as the tokens separating whole dialogue sessions?

Thank you

@liehtman

liehtman commented Feb 13, 2020

> are the tokens between different dialogue turns the same as the tokens separating whole dialogue sessions?

If I understand right, there are NO tokens between dialogue sessions, because one dialogue session is one training example and contains a source (utt1 <|eos|> utt2 <|eos|> utt3) and a target (utt4). The next session is passed to the model as another training sample.
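A minimal sketch of that reading (session_to_tsv_line is a hypothetical helper, not code from this repository): each session becomes one .tsv line, and nothing but the row boundary separates sessions.

```python
# Sketch only: one dialogue session -> one .tsv training line.
# session_to_tsv_line is a hypothetical helper, not the repo's prepro code.
def session_to_tsv_line(turns):
    """turns: utterances of one session; the last turn is the target."""
    source = " <|eos|> ".join(turns[:-1])   # utt1 <|eos|> utt2 <|eos|> utt3
    target = turns[-1]                      # utt4
    return source + "\t" + target

sessions = [
    ["utt1", "utt2", "utt3", "utt4"],
    ["uttA", "uttB", "uttC"],
]

# Each session goes on its own line; no special token is inserted between sessions.
with open("train.tsv", "w", encoding="utf-8") as f:
    for session in sessions:
        f.write(session_to_tsv_line(session) + "\n")
```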

@ferdinando17

ferdinando17 commented Feb 13, 2020

Thank you liehtman, this is very helpful.

My current, updated understanding is that the .tsv file must be in the format you described,
with a \t between the source (utt1 <|eos|> utt2 <|eos|> utt3) and the target (utt4).

Then prepro.py will create the features, which end with an <|endoftext|> token (id=50256).
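For what it's worth, the id can be checked with the Hugging Face GPT-2 tokenizer (DialoGPT reuses the GPT-2 vocabulary); this is only an illustration, not what prepro.py actually does:

```python
# Illustration only: verify the end-of-text id with the GPT-2 tokenizer
# (DialoGPT shares GPT-2's vocabulary); not the repo's prepro.py.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.eos_token)                               # <|endoftext|>
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # 50256

# A feature sequence for one target, ended by the end-of-text token:
ids = tokenizer.encode("i am fine thanks") + [tokenizer.eos_token_id]
print(ids[-1])                                           # 50256
```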

@GraphGrailAi

Here, interested too.

@ferdinando17

I successfully managed to fine-tune the model with input data in this form:
each line of the .tsv file is a dialogue, with each turn separated by <|eos|> and a tab that separates the target from the rest of the dialogue.

A sample training instance is therefore:
utt1 <|eos|> utt2 <|eos|> utt3 \t target \n
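A quick sanity check of that line format (illustrative only), splitting a line back into its context turns and target:

```python
# Illustrative sanity check for the line format described above.
line = "utt1 <|eos|> utt2 <|eos|> utt3\ttarget utterance\n"

source, target = line.rstrip("\n").split("\t")
context_turns = [turn.strip() for turn in source.split("<|eos|>")]

print(context_turns)   # ['utt1', 'utt2', 'utt3']
print(target)          # target utterance
```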

@LooperXX

LooperXX commented Apr 28, 2020

Hi @ferdinando17. I am trying to fine-tune the model with my own dataset. I failed to run python demo.py --data small, so I can't see the exact format of the .tsv file. After reading some of the code, I agree with your opinion. Could you please confirm whether the format of my dataset (.tsv file) is correct:

0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n

Hope to get your reply. Thanks.

@ferdinando17

ferdinando17 commented Apr 28, 2020

Hi,
you are missing the tab; it should be
"0.0 utt1 0.0 EOS utt2 0.0 EOS utt3 \t 1.0 i am a admin .\n"

to ask DialoGPT to predict "i am a admin. "
Look at my example.

Also, the zeros mean you are not training on the utterances that follow them; is that what you want?

@LooperXX

LooperXX commented Apr 29, 2020

Hi @ferdinando17, this is what bothers me. In a multi-turn dialogue we have several previous turns as context, one user turn as the question, and one system turn as the answer. From your explanation, I realized that it should be

0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n

as the example format in the training/fine-tuning dataset, where only the first sentence is 0.0 and the remaining sentences are 1.0, so the model is trained/fine-tuned on every turn regardless of whether it is a user or a system turn.
(Actually, I am not sure whether I should distinguish between user and system turns, giving 0.0 to user turns and 1.0 to system turns so that the model only needs to predict the system turns, since at evaluation time the model only has to predict the system utterances. But maybe all 1.0 will help train the model with more data.)
Is that correct? Hope to get your reply. Thanks. 🙏
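To make the two options concrete, here is a rough sketch of how the two weighting schemes would look when building a line (to_line is a hypothetical helper, and the weight-prefix format follows the examples in this thread, not verified against prepro.py):

```python
# Sketch of the two weighting schemes discussed above. to_line is a hypothetical
# helper; the weight-prefix format follows the examples in this thread.
def to_line(turns, weights):
    """Prefix each turn with its weight; a tab separates context from target."""
    pieces = [f"{w} {t}" for w, t in zip(weights, turns)]
    return " EOS ".join(pieces[:-1]) + "\t" + pieces[-1]

turns = ["utt1", "utt2", "utt3", "i am a admin ."]

# Scheme A: only the first turn is context-only (0.0); everything else is trained on.
print(to_line(turns, [0.0, 1.0, 1.0, 1.0]))
# -> 0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 <TAB> 1.0 i am a admin .

# Scheme B: user turns 0.0, system turns 1.0 (train only on system responses).
print(to_line(turns, [0.0, 1.0, 0.0, 1.0]))
# -> 0.0 utt1 EOS 1.0 utt2 EOS 0.0 utt3 <TAB> 1.0 i am a admin .
```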

@ferdinando17

Are you applying it to task-oriented dialogue?

I understand that the 0.0 is for sentences you want to filter out; the authors used it to avoid training on offensive language.
I used all 1.0, and my training instances were of the form I specified, where the target was always a system turn.

I hope it makes sense.

@LooperXX

LooperXX commented May 6, 2020

Hi, @ferdinando17. Thank you for your reply.
Yes, I am trying to apply it to task-oriented dialogue. My understanding is that 0.0 means the model is not trained to predict that sentence, and 1.0 means it is. So I think it is fine to train the model by marking the first sentence of each multi-turn dialogue 0.0, as context information, and marking the rest 1.0. Alternatively, we can mark every user turn 0.0 and every system turn 1.0.
Maybe more experiments on the two different settings are needed.
Thanks again for your reply.

@ferdinando17

ferdinando17 commented May 6, 2020

Ok, I see. I disagree, but of course I might be wrong.
In this issue, another user says 0.0 causes the sentences to be ignored during training. They refer to the Hugging Face docs too.

Let me know if you find evidence of the contrary.
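If that reading is right, the practical effect would be the same as masking the labels of 0.0-weighted positions so they contribute nothing to the loss, similar to the -100 ignore-index convention used by PyTorch's CrossEntropyLoss and Hugging Face models. A rough illustration of that interpretation (not this repository's training code):

```python
# Rough illustration of "0.0 means the turn is ignored in training": positions
# belonging to 0.0-weighted turns get label -100, which
# CrossEntropyLoss(ignore_index=-100) skips. This is an interpretation of the
# discussion above, not the repository's actual training loop.
import torch

token_ids = torch.tensor([11, 12, 13, 21, 22, 50256])  # toy ids: a 0.0 turn, a 1.0 turn + eos
weights = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]               # per-token weight, expanded from per-turn

labels = torch.tensor(
    [tid if w > 0 else -100 for tid, w in zip(token_ids.tolist(), weights)]
)
print(labels)  # tensor([ -100,  -100,  -100,    21,    22, 50256])
```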

@dreasysnail dreasysnail pinned this issue Oct 5, 2020
@minmummax

Hi guys, how do I deal with datasets like this:
person1: utt1, person2: utt2, person1: utt3 ...
Referring to what you all said, I think it should look like this:
1.0 utt1 EOS 1.0 utt2 EOS \t 1.0 utt3

Is this correct?
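One possible way to expand such a transcript, if you want more than one training instance per dialogue, is to let the context grow and take each person2 turn as the target. This is only a sketch of that idea (dialogue_to_lines is a hypothetical helper, with all weights set to 1.0):

```python
# Sketch: expand one person1/person2 dialogue into one training line per
# person2 (responding) turn. dialogue_to_lines is a hypothetical helper.
def dialogue_to_lines(turns):
    lines = []
    for i in range(1, len(turns), 2):                 # odd indices = person2 turns
        context = ["1.0 " + t for t in turns[:i]]
        target = "1.0 " + turns[i]
        lines.append(" EOS ".join(context) + "\t" + target)
    return lines

for line in dialogue_to_lines(["utt1", "utt2", "utt3", "utt4"]):
    print(line)
# -> 1.0 utt1 <TAB> 1.0 utt2
# -> 1.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 <TAB> 1.0 utt4
```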

@minmummax

Also, I just wonder what the validation set should look like.
