Understanding the train.tsv file #28

wise-east · 2020-01-31T04:24:35Z

After running python demo.py --data small and looking at the resulting train.tsv file, I want to make sure I have the correct understanding of the format and what the float values indicate.

For example, the first two examples look like:

t3_17830,t1_c24,t1_c40	0.0 On the bright side , despite kidnapping and cruelly abandoning him , it doesn't sound like he was tortured ...	1.0 We didn't torture somebody ! USA
t3_17844,t1_c88,t1_c95	1.0 will comments dissapear if ranked low enough ? I can just see the pages with 5000 comments now ..	1.0 not yet , but we'll play around with it

From the paper, I see that there was some heavy preprocessing and filtering done, such as removing offensive and bland training instances. Are the sequences prepended with 0.0 the filtered instances that will not be used to update the weights during training? Based on my understanding of the code, the weight 0.0 ensures this by setting the language modeling labels to -1:

DialoGPT/prepro.py

Lines 108 to 110 in 18d91ce

    if w == 0.0:
        lm_labels += [-1] * (len(s) + 1)
        weights += [0.0] * (len(s) + 1)
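To make the masking concrete, here is a minimal sketch of how per-turn weights could translate into labels. It is an illustration under my own assumptions, not the repository's code: `build_labels`, `IGNORE`, and the toy token IDs are hypothetical, and the real prepro.py may treat the first turn and truncation differently.

```python
from typing import List, Tuple

IGNORE = -1  # label value the loss is assumed to skip (hypothetical constant)


def build_labels(turns: List[Tuple[float, List[int]]], eos_id: int):
    """Return (lm_labels, weights) for a list of (weight, token_ids) turns."""
    lm_labels: List[int] = []
    weights: List[float] = []
    for w, ids in turns:
        n = len(ids) + 1  # +1 for the end-of-turn token
        if w == 0.0:
            # Zero-weight turn: positions are kept, but every label is masked.
            lm_labels += [IGNORE] * n
            weights += [0.0] * n
        else:
            # Trainable turn: labels are the tokens themselves plus EOS.
            lm_labels += ids + [eos_id]
            weights += [w] * n
    return lm_labels, weights


# Toy token IDs; a real run would use the GPT-2 tokenizer output.
labels, weights = build_labels([(0.0, [11, 12, 13]), (1.0, [21, 22])], eos_id=0)
print(labels)   # [-1, -1, -1, -1, 21, 22, 0]
print(weights)  # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

The zero-weight turn still occupies positions in the sequence, so it can serve as context; it just contributes no targets.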

What I'm confused about is that I can't find how the training process ignores the identifiers prepended to each line (e.g., t3_17830,t1_c24,t1_c40). How does this part of the training data get ignored?


intersun · 2020-02-02T05:05:57Z

Thanks for pointing out the bug. The identifier at the start of each line (e.g., t3_17830,t1_c24,t1_c40) is supposed to be removed before running prepro.py.
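As a workaround, something like the following sketch could strip the identifier before preprocessing. It assumes the identifier is the first tab-separated field and that every real turn starts with a float weight, as in the examples above; `strip_identifier` is a hypothetical helper, not part of the repository.

```python
def strip_identifier(line: str) -> str:
    """Drop the first tab-separated field unless it already starts with a weight."""
    first, sep, rest = line.partition("\t")
    if not sep:
        return line  # only one field, nothing to strip
    head = first.split(" ", 1)[0]
    try:
        float(head)
        return line  # first field is already a weighted turn
    except ValueError:
        return rest  # first field was an identifier such as t3_...,t1_...


raw = "t3_17844,t1_c88,t1_c95\t1.0 will comments dissapear ?\t1.0 not yet"
print(strip_identifier(raw))
```

Because lines that already start with a weight are returned unchanged, running the helper twice on the same file does no harm.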

Please leave a comment if you have more questions.

ferdinando17 · 2020-02-05T09:36:54Z

Hello,
thanks for releasing the work.

What do 0.0 and 1.0 mean? Can you confirm that wise-east's interpretation is correct?

wise-east · 2020-04-02T05:15:48Z

@ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.
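For illustration, here is a plain-Python sketch of a loss that skips masked labels. PyTorch's cross-entropy loss exposes an `ignore_index` argument for the same purpose; the function below only mirrors that idea and is not the code used in this repository. `masked_nll` and the toy probabilities are hypothetical.

```python
import math
from typing import List


def masked_nll(log_probs: List[List[float]], labels: List[int],
               ignore_index: int = -1) -> float:
    """Mean negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp[y] for lp, y in zip(log_probs, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


# Two positions over a two-token vocabulary; the first position is masked.
log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
]
loss = masked_nll(log_probs, labels=[-1, 1])
print(loss)  # equals -log(0.75): only the second position contributes
```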

Aman-4-Real · 2022-08-19T07:19:07Z

> @ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.

I had a look at the sessions containing 0.0 utterances and found that most of them contain violence, pornography, inappropriate expressions, etc. I agree with you, so I kept only the sessions without any 0.0 utterances, which left around 100 million sessions (out of the 146,846,215 mentioned in the README).
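A rough sketch of that filter is shown below. It assumes tab-separated turns that each start with a float weight and skips any field that does not, such as the identifier column; `turn_weights` and `is_clean` are hypothetical names.

```python
from typing import List


def turn_weights(line: str) -> List[float]:
    """Collect the leading float weight of every tab-separated turn."""
    weights = []
    for field in line.rstrip("\n").split("\t"):
        head = field.split(" ", 1)[0]
        try:
            weights.append(float(head))
        except ValueError:
            continue  # not a weighted turn, e.g. the identifier column
    return weights


def is_clean(line: str) -> bool:
    """True if the session has at least one turn and no zero-weight turn."""
    ws = turn_weights(line)
    return bool(ws) and all(w != 0.0 for w in ws)


sessions = [
    "t3_17830,t1_c24,t1_c40\t0.0 On the bright side ...\t1.0 We didn't torture somebody ! USA",
    "t3_17844,t1_c88,t1_c95\t1.0 will comments dissapear ?\t1.0 not yet",
]
kept = [s for s in sessions if is_clean(s)]
print(len(kept))  # 1
```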

wise-east closed this as completed Apr 2, 2020

ferdinando17 mentioned this issue May 6, 2020

Tokens in multi-turn setting #30

Open
