Understanding the train.tsv file #28

Closed
wise-east opened this issue Jan 31, 2020 · 4 comments

@wise-east

After running python demo.py --data small and looking at the resulting train.tsv file, I want to make sure I have the correct understanding of the format and what the float values indicate.

For example, the first two examples look like:

t3_17830,t1_c24,t1_c40	0.0 On the bright side , despite kidnapping and cruelly abandoning him , it doesn't sound like he was tortured ...	1.0 We didn't torture somebody ! USA
t3_17844,t1_c88,t1_c95	1.0 will comments dissapear if ranked low enough ? I can just see the pages with 5000 comments now ..	1.0 not yet , but we'll play around with it
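To make the layout of these lines concrete, here is a minimal parsing sketch. It assumes the fields are tab-separated, that the first field is the identifier, and that every remaining field is one turn written as a weight, a space, and the text. The name `parse_session` is hypothetical and is not part of the DialoGPT code.

```python
from typing import List, Tuple


def parse_session(line: str) -> Tuple[str, List[Tuple[float, str]]]:
    """Split one raw line into its identifier and a list of (weight, text) turns.

    Assumption: fields are tab-separated and each turn starts with a float weight.
    """
    fields = line.rstrip("\n").split("\t")
    session_id, turn_fields = fields[0], fields[1:]
    turns = []
    for field in turn_fields:
        # Everything before the first space is the weight, the rest is the text.
        weight_str, _, text = field.partition(" ")
        turns.append((float(weight_str), text))
    return session_id, turns


example = (
    "t3_17844,t1_c88,t1_c95\t"
    "1.0 will comments dissapear if ranked low enough ?\t"
    "1.0 not yet , but we'll play around with it"
)
session_id, turns = parse_session(example)
for weight, text in turns:
    print(weight, "|", text)
```

Under these assumptions the example yields the identifier plus two turns, both with weight 1.0.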

From the paper, I see that there was some heavy preprocessing and filtering done, such as removing offensive and bland training instances. Are the sequences prepended with 0.0 the filtered instances that will not be used to update the weights during training? Based on my understanding of the code, the weight 0.0 ensures this by setting the language modeling labels to -1:

DialoGPT/prepro.py

Lines 108 to 110 in 18d91ce

if w == 0.0:
    lm_labels += [-1] * (len(s) + 1)
    weights += [0.0] * (len(s) + 1)
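To illustrate the masking idea in the snippet above, here is a self-contained sketch. Only the `w == 0.0` branch mirrors the quoted lines. The `else` branch, the `eos_id` parameter, the reading of `+ 1` as an end-of-turn token, and the `mean_loss` helper are assumptions made for illustration and are not taken from prepro.py or the training code.

```python
from typing import List, Tuple

IGNORE_LABEL = -1  # label value the quoted snippet assigns to masked positions


def build_labels(
    turns: List[Tuple[float, List[int]]], eos_id: int
) -> Tuple[List[int], List[float]]:
    """Build per-token labels and weights from (weight, token_ids) turns."""
    lm_labels: List[int] = []
    weights: List[float] = []
    for w, s in turns:
        if w == 0.0:
            # Same shape as the quoted snippet: one masked label per token,
            # plus one extra position (assumed here to be an end-of-turn token).
            lm_labels += [IGNORE_LABEL] * (len(s) + 1)
            weights += [0.0] * (len(s) + 1)
        else:
            # Assumed branch: the turn's own tokens become the targets.
            lm_labels += s + [eos_id]
            weights += [w] * (len(s) + 1)
    return lm_labels, weights


def mean_loss(token_losses: List[float], lm_labels: List[int]) -> float:
    """Average per-token losses, skipping positions labeled IGNORE_LABEL."""
    kept = [loss for loss, y in zip(token_losses, lm_labels) if y != IGNORE_LABEL]
    return sum(kept) / len(kept) if kept else 0.0


labels, weights = build_labels([(0.0, [5, 6]), (1.0, [7])], eos_id=0)
print(labels)   # the first turn is fully masked
print(mean_loss([9.0, 9.0, 9.0, 1.0, 3.0], labels))
```

In this sketch the three positions of the 0.0-weighted turn never contribute to the averaged loss, which is the behavior the question describes.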

What confuses me is that I can't find how the training process ignores the prepended identifier on each line (e.g., t3_17830,t1_c24,t1_c40). How does this part of the training data get ignored?

@intersun
Contributor

intersun commented Feb 2, 2020

Thanks for pointing out the bug. The identifier on each line (e.g., t3_17830,t1_c24,t1_c40) is supposed to be removed before running prepro.py.

Please leave a comment if you have more questions.
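A minimal sketch of the identifier removal described above, assuming the identifier is the first tab-separated field and begins with `t3_`, as in the two example lines. The function name `strip_identifier` is hypothetical and not part of the repository.

```python
def strip_identifier(line: str) -> str:
    """Drop a leading identifier field such as 't3_17830,t1_c24,t1_c40'.

    Assumption: identifiers start with 't3_' and are followed by a tab.
    Lines that do not match are returned unchanged.
    """
    first, sep, rest = line.partition("\t")
    if sep and first.startswith("t3_"):
        return rest
    return line


raw = "t3_17830,t1_c24,t1_c40\t0.0 first turn\t1.0 second turn"
cleaned = strip_identifier(raw)
print(cleaned)
```

Applying this to every line of train.tsv and writing the result to a new file would leave only the weighted turns. Lines that were already cleaned pass through untouched, so the step is safe to repeat.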

@ferdinando17

ferdinando17 commented Feb 5, 2020

Hello,
thanks for releasing the work.

What is the meaning of 0.0 and 1.0? Can you confirm that it is what wise-east says?

@wise-east
Author

@ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.

@Aman-4-Real

> @ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.

I had a look at the sessions containing 0.0 utterances and found that most of them contain violence, pornography, inappropriate expressions, etc. I agree with you, so I kept only the sessions without any 0.0 utterances, which left around 100 million sessions (out of the 146,846,215 mentioned in the README).
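The selection described above could look roughly like the sketch below. It assumes tab-separated fields where each turn starts with its weight. The helper names `has_zero_weight` and `keep_clean_sessions` are hypothetical, and a field whose first token is not a number (such as the identifier column) is simply skipped.

```python
from typing import Iterable, Iterator


def has_zero_weight(line: str) -> bool:
    """Return True if any tab-separated field starts with the weight 0.0."""
    for field in line.rstrip("\n").split("\t"):
        head = field.split(" ", 1)[0]
        try:
            if float(head) == 0.0:
                return True
        except ValueError:
            # Not a weight, e.g. an identifier like 't3_17830,t1_c24,t1_c40'.
            continue
    return False


def keep_clean_sessions(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the sessions that contain no 0.0-weighted turn."""
    return (line for line in lines if not has_zero_weight(line))


sessions = [
    "t3_17830,t1_c24,t1_c40\t0.0 masked turn\t1.0 reply",
    "t3_17844,t1_c88,t1_c95\t1.0 question\t1.0 answer",
]
clean = list(keep_clean_sessions(sessions))
print(len(clean))
```

Because `keep_clean_sessions` returns a generator, it can be fed a file object directly and does not need to hold the whole dataset in memory.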
