Created dataset card snli.md #663
Conversation
First draft of a dataset card using the SNLI corpus as an example
Adding a direct link to the rendered markdown:
It would be amazing if we ended up with this much information on all of our datasets :) I don't think there's too much repetition; everything that is in here is relevant. The main challenge will be to figure out how to structure the sheet so that all of the information can be presented without overwhelming the reader. We'll also want to have as much of it as possible in structured form so it can be easily navigated.
@mcmillanmajora for now can you remove the prompts / quoted blocks so we can see what the datasheet would look like on its own? Would also love to hear if @sgugger has some first impressions
Removed section prompts for clarity
I removed the prompts. It's definitely a little easier to read without them!
Should we name the file
This looks great to me, thanks a lot for all the work!
I don't feel like there are too many repetitions (if anything, the opposite: I feel there are slightly too many "see this section for more information" pointers).
I just think the "Tasks supported" section could be expanded a bit and moved a bit more toward the beginning.
datasets/snli/snli.md
Outdated
1 | Two women are embracing while holding to go packages. | Two woman are holding packages. | 0
2 | Two women are embracing while holding to go packages. | The men are fighting outside a deli. | 2

## Tasks supported:
I feel this section should be presented before as it's probably the information a user wants to know first.
Agreed. It's also short enough that it doesn't push the author information too far down.
I agree!
Furthermore it'd be cool to have a section in yaml with the tasks. It would allow moon-landing to parse it and allow users to do dataset search by task :) cc @julien-c
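For illustration, a machine-parsable task section like the one proposed here might look something like the following (a sketch only; the field names and tag values are hypothetical, since the tag vocabulary was still being designed at this point in the discussion):

```yaml
# Hypothetical sketch of a YAML task section for dataset search;
# field names and values are placeholders, not a final schema.
task:
- text-classification
- natural-language-inference
```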
datasets/snli/snli.md
Outdated
## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification
Do we have pages that explain what those are in detail? Otherwise a brief sentence explaining what both are would be super useful.
We should make some structured tags for the task types / families (and have a page somewhere explaining what the tags mean)
In that case, it would be great to have links here to that page.
This is a really amazing example of what a dataset card could/should be 😍
Maybe when it starts to contain so much information we could even have a table of contents at the top?
Love it!
Renamed to README.md. Task section moved to top. TOC added. YAML content added. Multilingual section removed because it seemed repetitive.
Asked @sleepinyourhat for some insights too :)
Thanks for doing this!
datasets/snli/README.md
Outdated
## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification
What makes this structured? That would usually make me think of structured prediction—i.e., predicting something more complex than a label from a small fixed set.
It was in contrast with text-to-text, but we're still working on the tag set. Is there something you think would fit better?
Just cut 'structured'?
It was supported by a Google Faculty Research Award, a gift from Bloomberg L.P., the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040, the National Science Foundation under grant no. IIS 1159679, and the Department of the Navy, Office of Naval Research, under grant no. N00014-10-1-0109.

### Who are the language producers (who wrote the text / created the base content)?
It could be worth flagging here that the annotations themselves are also free-form text, so the rest of this document is also relevant to this question.
datasets/snli/README.md
Outdated
### Who are the annotators?

The annotators of the validation task were a closed set of about 30 trusted crowdworkers on Amazon Mechanical Turk. It is unknown if any demographic information was collected or how they were compensated.
No demographic information was collected.
We gave $1 bonuses in cases where annotator labels agreed with our labels for 250 randomly distributed examples.
Otherwise, annotators were paid per HIT under similar terms to the above.
datasets/snli/README.md
Outdated
### Who are the language producers (who wrote the text / created the base content)?

A large portion of the premises (160k) were produced in the [Flickr 30k corpus](http://shannon.cs.illinois.edu/DenotationGraph/) by an unknown number of crowdworkers. About 2,500 crowdworkers from Amazon Mechanical Turk produced the associated hypotheses. The premises from the Flickr 30k project describe people and animals whose photos were collected and presented to the Flickr 30k crowdworkers, but the SNLI corpus did not present the photos to the hypotheses creators. Neither report crowdworker or photo subject demographic information or crowdworker compensation.
The 'neither' here makes it sound like you might be discussing the SNLI hypothesis creators, but it's not totally clear.
If this is the place to talk about compensation, I sadly don't think I have the full documentation that I'd need to give useful numbers. Here's what I recall, with the help of some vaguely-relevant emails. Feel free to use any of this if it seems formal enough:
- Writing-phase workers were paid per HIT with no special incentives. We rejected (didn't pay for) work only in one or two extreme cases where a single annotator submitted a gigantic amount of junk data in a way that was clearly automated. We disqualified workers who clearly ignored the guidelines over many HITs.
- I believe we targeted an hourly rate in the $10-15/hour range, based on our own in-house estimates of task difficulty, though we have no reliable way to check those estimates, and it's quite plausible that they were too low once you account for time spent switching HITs.
- The pay rate varied over time (as we fine-tuned our estimate of how hard the task was), and I believe we also separated very long premises into batches that paid slightly more. I believe the rates were all between $0.10 and $0.50, and clustered toward the middle of that range.
datasets/snli/README.md
Outdated
### Annotation process

56,941 of the total sentence pairs were further annotated in a validation task. Four annotators each labeled a premise-hypothesis pair for entailment, contradiction, or neither, resulting in 5 total judgements including the original hypothesis author judgement. See Section 2.2 for more details (Bowman et al., 2015).
for => as?
datasets/snli/README.md
Outdated
Premise | 14.1
Hypothesis | 8.3

The _label_ has 4 possible values, _0_, _1_, _2_, _-_, which correspond to _entailment_, _neutral_, _contradiction_, and _no label_ respectively. The dataset was developed so that the first three values would be evenly distributed across the splits. See the Annotation Process section for details on _no label_.
I assume this is specific to the `datasets` format—both original formats use string-valued labels.
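To make the label convention concrete, here is a minimal sketch in plain Python of the integer-to-string mapping described in the card (the `datasets` library's actual `ClassLabel` feature handles this conversion; the unlabeled `-` case may be represented differently there, so treat this as illustrative only):

```python
# Sketch of the label scheme described above: integer codes in the
# datasets format vs. the original string labels.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def label_name(code):
    """Return the string label for an integer code; anything outside
    the three-class set is treated as the '-' / 'no label' case."""
    return LABEL_NAMES.get(code, "no label")

print(label_name(2))   # contradiction
print(label_name(-1))  # no label
```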
datasets/snli/README.md
Outdated
### Example ID

The ID is an integer starting from 0. It has no inherent meaning.
The IDs in the original dataset correspond to identifiers from Flickr30k or (the draft version of) VisualGenome, suffixed with an internal identifier.
datasets/snli/README.md
Outdated
## Known Limitations
### Known social biases

The language reflects the content of the photos collected from Flickr, as described in the Data Collection section.
There's a paper that quantifies some kinds of bias: https://www.aclweb.org/anthology/W17-1609/
datasets/snli/README.md
Outdated
### Other known limitations

[Gururangan et al (2018)](https://www.aclweb.org/anthology/N18-2017.pdf) showed that the SNLI corpus had a number of annotation artifacts. Using a simple classifier, they correctly predicted the label of the hypothesis 67% of the time without using the premise.
This was roughly simultaneous with two other very similar papers by Tsuchiya and by Poliak et al, so it's worth mentioning all three if you mention one. The Gururangan paper cites the other two.
datasets/snli/README.md
Outdated
## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification
FWIW, it has also been used for entailment generation, though I'm not sure that it's actually useful for that task (or if that task is ever useful):
Updated compensation information, dataset structure, and bias literature based on review from @sleepinyourhat
Thank you for taking the time to look through the card and for all your comments @sleepinyourhat ! I've incorporated them in the latest update.
language:
- en
task:
- text-classification
purpose:
- NLI
size:
- ">100k"
language producers:
- crowdsourced
annotation:
- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
I love it, this will be very useful.
One thing about `size`: maybe we should use a range like "500k-1M"
Do we need to support multiple values of `size` for a single dataset? Otherwise it should probably be `size: ">100k"` (just a value, not an array)
Some datasets have sub-datasets (glue, wikipedia, etc.) with different sizes so we can have several sizes
does it make sense to have a list (not a map) of all sub-dataset sizes in that case though?
We should probably have a map from sub-datasets names to specific features (at least size, language for Wikipedia, purpose for KILT, etc...) so people can search for either the full dataset or any sub-dataset. E.g. filtering by portuguese and LM tags returns the Portuguese subset of Wikipedia.
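A sketch of what such a map might look like (the config names and field values here are made up for illustration; this was not an agreed schema at this point):

```yaml
# Hypothetical per-config (sub-dataset) metadata map, so tooling could
# filter by sub-dataset features, e.g. the Portuguese subset of Wikipedia.
configs:
  wikipedia.pt:
    language: pt
    size: 100k-1M
    task: language-modeling
  wikipedia.en:
    language: en
    size: ">1M"
    task: language-modeling
```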
- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
- license: "CC BY-SA 4.0"
+ license: cc-by-4.0
For models (documented at https://huggingface.co/docs#what-metadata-can-i-add-to-my-model-card) we use the License keywords listed by GitHub at https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/licensing-a-repository#searching-github-by-license-type
(Hopefully we'll plug some sort of form validation for users at some point)
task:
- text-classification
purpose:
- NLI
size:
- ">100k"
Maybe not in this particular PR's scope, but do we already have a sense of the taxonomy i.e. the possible values for those tags? (to display them in the website)
@mcmillanmajora is working on that next (after taking a look at more of our datasets to get a sense of what we need)
@sleepinyourhat You're right, wrong copy/paste
- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
- license: "CC BY-SA 4.0"
+ license: cc-by-sa-4.0
Initial tags
…entation
Updating remote branch to be up to date with HF commits
@@ -0,0 +1,17 @@
---
language:
- BCP-47: fr
this doesn't have the same scheme as the other card (and won't be compatible with our website filtering out of the box), was this intended?
No, I included this by mistake. Sorry about that!
@sleepinyourhat The schema is definitely drawing from Data Statements and Datasheets for Datasets but we also wanted to include some more general information to introduce the dataset to new users. If you have any suggestions for changes to the schema itself, please let us know!
First draft of a dataset card using the SNLI corpus as an example.
This is mostly based on the Google Doc draft, but I added a few sections and moved some things around.
I moved Who Was Involved to follow Language, both because I thought the authors should be presented more towards the front and because I think it makes sense to present the speakers close to the language so it doesn't have to be repeated.
I created a section I called Data Characteristics by pulling some things out of the other sections. I was thinking that this would be more about the language use in context of the specific task construction. That name isn't very descriptive though and could probably be improved.
-- Domain and language type out of Language. I particularly wanted to keep the Language section as simple and as abstracted from the task as possible.
-- 'How was the data collected' out of Who Was Involved
-- Normalization out of Features/Dataset Structure
-- I also added an annotation process section.
I kept the Features section mostly the same as the Google Doc, but I renamed it Dataset Structure to more clearly separate it from the language use, and added some links to the documentation pages.
I also kept Tasks Supported, Known Limitations, and Licensing Information mostly the same. Looking at it again though, maybe Tasks Supported should come before Data Characteristics?
The trickiest part about writing a dataset card for the SNLI corpus specifically is that it's built on datasets which are themselves built on datasets so I had to dig in a lot of places to find information. I think this will be easier with other datasets and once there is more uptake of dataset cards so they can just link to each other. (Maybe that needs to be an added section?)
I also made an effort not to repeat information across the sections or to refer to a previous section if the information was relevant in a later one. Is there too much repetition still?