Understanding the train.tsv file #28
Comments
Thanks for pointing out the bug. The identifier on each line, i.e., t3_17830,t1_c24,t1_c40, etc., is supposed to be removed before running prepro.py. Please leave a comment if you have more questions.
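The identifier-stripping step described above can be sketched as follows. This is a minimal illustration, not code from the repo: it assumes the identifier chain (e.g. `t3_17830,t1_c24,t1_c40`) is the first tab-separated field of each train.tsv line, and the function name is made up here.

```python
def strip_identifier(line: str) -> str:
    """Drop a leading comma-separated Reddit id chain from a train.tsv line.

    Hypothetical helper: assumes the id chain is the first tab-separated
    column and that ids start with "t1_" or "t3_"; adjust if your dump
    is laid out differently.
    """
    first, sep, rest = line.partition("\t")
    if sep and all(tok.startswith(("t1_", "t3_")) for tok in first.split(",")):
        return rest  # keep only the weighted utterance columns
    return line      # no id chain found; leave the line untouched
```

Lines without a leading id chain pass through unchanged, so the function is safe to run over an already-cleaned file.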
Hello, what do 0.0 and 1.0 mean? Can you confirm that it is what wise-east says?
@ferdinando17 I'm pretty sure my understanding is correct, given Hugging Face's documentation.
I had a look at the sessions containing 0.0 utterances, and I found that most of them contain violence, pornography, inappropriate expressions, etc. I agree with you, so I kept only the sessions without any 0.0 utterances, which resulted in around 100 million sessions (out of the 146,846,215 mentioned in the README).
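The filter described in that comment can be sketched in a few lines. This is an illustrative reconstruction, not the commenter's actual script: it assumes each session has already been parsed into a list of (weight, text) pairs.

```python
def keep_session(turns):
    """Return True if no turn in the session carries weight 0.0.

    Hypothetical helper mirroring the filter described above;
    `turns` is assumed to be a list of (weight, text) pairs with
    weights already parsed as floats.
    """
    return all(weight != 0.0 for weight, _ in turns)

# Example: a session with any 0.0-weighted turn is dropped entirely.
clean = [(1.0, "how are you?"), (1.0, "fine, thanks")]
flagged = [(1.0, "how are you?"), (0.0, "offensive reply")]
```

Dropping the whole session (rather than just the 0.0 turn) keeps the dialogue history coherent, at the cost of discarding the clean turns around it.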
After running
python demo.py --data small
and looking at the resulting train.tsv file, I want to make sure I have the correct understanding of the format and what the float values indicate. For example, the first two examples look like:
From the paper, I see that there was some heavy preprocessing and filtering done, such as removing offensive and bland training instances. Are the sequences prepended with 0.0 the filtered instances that will not be used to update the weights during training? Based on my understanding of the code, the weight 0.0 ensures this by setting the language modeling labels to -1:
DialoGPT/prepro.py
Lines 108 to 110 in 18d91ce
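The masking mechanism described above can be sketched as follows. This is not the actual prepro.py snippet linked above, just a minimal illustration of the idea: turns with weight 0.0 get every language-modeling label set to -1, which the loss (e.g. PyTorch's CrossEntropyLoss with ignore_index=-1) then skips, so those tokens never contribute to the gradient.

```python
def build_labels(token_ids, weight):
    """Sketch of weight-based label masking (hypothetical helper).

    A turn with weight 0.0 has all its LM labels replaced by -1 so the
    loss ignores it; a weight of 1.0 keeps the token ids as labels.
    """
    if weight == 0.0:
        return [-1] * len(token_ids)  # masked: excluded from the loss
    return list(token_ids)            # trained on normally
```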
What I'm confused about is that I can't seem to find where the training process ignores the prepended identifiers on each line (e.g. t3_17830,t1_c24,t1_c40). How does this part of the training data get ignored?