A question about data set and labeling #26
Comments
Hi @DUT-lujunyu, thanks for your interest in our work, and sorry for the delayed response. Here are some answers:
Thanks for your detailed answers!
Sorry for the slow response. This is a strange problem. The annotated_train.csv file indeed has that
Hi @Thartvigsen, I have downloaded the dataset from HuggingFace. The paper reports a total of 274186 generated prompts. Also, from your previous responses, I do not understand a few things:
Could you clarify? Thank you.
Hi @AmenRa, thanks for your interest in our work! I believe the 274k vs 260k discrepancy comes from duplicate removal, but the original resources were made unavailable, so unfortunately I can't go back and check to be certain.
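For anyone who wants to test the duplicate-removal hypothesis on their own copy of the data, here is a minimal sketch of how such a check could look. It is not the authors' deduplication procedure, which is not documented in this thread. The helper name `duplicate_summary` and the whitespace-stripping normalization are assumptions made for illustration, and the sample strings are toy stand-ins rather than real ToxiGen prompts.

```python
from collections import Counter


def duplicate_summary(texts):
    """Return (total, unique, extra_copies) for a list of strings.

    Hypothetical helper: strings are compared after stripping surrounding
    whitespace, which is an assumed normalization, not the authors' rule.
    """
    normalized = [t.strip() for t in texts]
    counts = Counter(normalized)
    total = len(normalized)
    unique = len(counts)
    return total, unique, total - unique


# Toy stand-in rows; these are not real ToxiGen prompts.
sample = ["prompt a", "prompt b", "prompt a ", "prompt c"]
total, unique, extra = duplicate_summary(sample)
print(total, unique, extra)
```

Running the same function over the text column of the downloaded file would show how many rows are exact repeats, which can then be compared against the gap between the reported and downloaded counts.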
Thanks for the fast reply! The paper reports "We selected 792 statements from TOXIGEN to include in our test set". Could you clarify? Thanks.
This is a good question, and I'm not sure. I don't have access to some of the original internal docs, so this confusion is likely irreducible for us both, but I will try to hunt this down. I suspect the root issue is that at the time of the original submission we had annotations for <1k samples, and by the time of paper acceptance we had annotations for ~10k samples, so we conducted splits on two versions of the dataset. The 792 may be an artifact of the original numbers, not the larger annotated set. The 8960
Ok, thanks!
Hello, your email has been received. Thank you.
Dear project managers:
When I downloaded the original dataset from the Hugging Face link "https://huggingface.co/datasets/skg/toxigen-data", I noticed two other files named "annotated_train.csv" and "annotated_test.csv" besides the file "toxigen.csv". I have two questions:
Several examples are as follows:
Maybe I missed something. I am sincerely looking forward to your reply. Thank you.