
Created dataset card snli.md #663

Merged (6 commits) on Oct 12, 2020

Conversation

@mcmillanmajora (Contributor):

First draft of a dataset card using the SNLI corpus as an example.

This is mostly based on the Google Doc draft, but I added a few sections and moved some things around.

  • I moved Who Was Involved to follow Language, both because I thought the authors should be presented more towards the front and because I think it makes sense to present the speakers close to the language so it doesn't have to be repeated.

  • I created a section I called Data Characteristics by pulling some things out of the other sections. I was thinking that this would be more about the language use in context of the specific task construction. That name isn't very descriptive though and could probably be improved.
    -- Domain and language type out of Language. I particularly wanted to keep the Language section as simple and as abstracted from the task as possible.
    -- 'How was the data collected' out of Who Was Involved
    -- Normalization out of Features/Dataset Structure
    -- I also added an annotation process section.

  • I kept the Features section mostly the same as the Google Doc, but I renamed it Dataset Structure to more clearly separate it from the language use, and added some links to the documentation pages.

  • I also kept Tasks Supported, Known Limitations, and Licensing Information mostly the same. Looking at it again though, maybe Tasks Supported should come before Data Characteristics?

The trickiest part about writing a dataset card for the SNLI corpus specifically is that it's built on datasets which are themselves built on datasets so I had to dig in a lot of places to find information. I think this will be easier with other datasets and once there is more uptake of dataset cards so they can just link to each other. (Maybe that needs to be an added section?)

I also made an effort not to repeat information across the sections, referring back to an earlier section instead when the information was relevant again later. Is there too much repetition still?

First draft of a dataset card using the SNLI corpus as an example
@mcmillanmajora added the "Dataset discussion" label on Sep 22, 2020
@yjernite self-assigned this on Sep 22, 2020
@yjernite (Member) commented Sep 23, 2020:
It would be amazing if we ended up with this much information on all of our datasets :)

I don't think there's too much repetition, everything that is in here is relevant. The main challenge will be to figure out how to structure the sheet so that all of the information can be presented without overwhelming the reader. We'll also want to have as much of it as possible in structured form so it can be easily navigated.

@yjernite (Member):

@mcmillanmajora for now can you remove the prompts / quoted blocks so we can see what the datasheet would look like on its own?

Would also love to hear if @sgugger has some first impressions

Removed section prompts for clarity
@mcmillanmajora (Contributor, Author):

I removed the prompts. It's definitely a little easier to read without them!

@julien-c (Member):

Should we name the file README.md for consistency with models?

@sgugger (Contributor) left a comment:

This looks great to me, thanks a lot for all the work!

I don't feel like there are too many repetitions (almost the opposite end of the scale: I feel there are slightly too many "see this section for more information" pointers).

I just think the "Tasks supported" section could be expanded a bit and moved a bit more toward the beginning.

1 | Two women are embracing while holding to go packages. | Two woman are holding packages. | 0
2 | Two women are embracing while holding to go packages. | The men are fighting outside a deli. | 2

## Tasks supported:
Contributor:

I feel this section should be presented earlier, as it's probably the information a user wants to know first.

Member:

Agreed. It's also short enough that it doesn't push the author information too far down.

@lhoestq (Member), Sep 25, 2020:

I agree!

Furthermore, it'd be cool to have a section in YAML with the tasks. It would allow moon-landing to parse it and allow users to do dataset search by task :) cc @julien-c
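For illustration, a minimal sketch of what such a machine-readable front-matter block might look like at the top of the card (the field names and values here are placeholders rather than a settled schema; the YAML actually added to the card later in this PR is quoted further down the thread):

```yaml
---
# Hypothetical front-matter sketch: a tagging block that the Hub website (moon-landing)
# could parse to let users filter datasets by task. Field names are illustrative only.
task:
  - text-classification
language:
  - en
license: cc-by-sa-4.0
---
```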

## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification
Contributor:

Do we have pages that explain what those are in detail? Otherwise, a brief sentence explaining what both are would be super useful.

@yjernite (Member), Sep 24, 2020:

We should make some structured tags for the task types / families (and have a page somewhere explaining what the tags mean)

Contributor:

In that case, it would be great to have links here to that page.

@thomwolf (Member) left a comment:

This is a really amazing example of what a dataset card could/should be 😍

Maybe when it starts to contain so much information we could even have a table of contents at the top?

@lhoestq (Member) left a comment:

Love it!


Renamed to README.md. Task section moved to top. TOC added. YAML content added. Multilingual section removed because it seemed repetitive.
@yjernite (Member):

Asked @sleepinyourhat for some insights too :)

@sleepinyourhat left a comment:

Thanks for doing this!

## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification


What makes this structured? That would usually make me think of structured prediction—i.e., predicting something more complex than a label from a small fixed set.

Contributor (Author):

It was in contrast with text-to-text, but we're still working on the tag set. Is there something you think would fit better?


Just cut 'structured'?


It was supported by a Google Faculty Research Award, a gift from Bloomberg L.P., the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040, the National Science Foundation under grant no. IIS 1159679, and the Department of the Navy, Office of Naval Research, under grant no. N00014-10-1-0109.

### Who are the language producers (who wrote the text / created the base content)?


It could be worth flagging here that the annotations themselves are also free-form text, so the rest of this document is also relevant to this question.


### Who are the annotators?

The annotators of the validation task were a closed set of about 30 trusted crowdworkers on Amazon Mechanical Turk. It is unknown if any demographic information was collected or how they were compensated.


No demographic information was collected.


We gave $1 bonuses in cases where annotator labels agreed with our labels for 250 randomly distributed examples.


Otherwise, annotators were paid per HIT under similar terms to the above.


### Who are the language producers (who wrote the text / created the base content)?

A large portion of the premises (160k) were produced in the [Flickr 30k corpus](http://shannon.cs.illinois.edu/DenotationGraph/) by an unknown number of crowdworkers. About 2,500 crowdworkers from Amazon Mechanical Turk produced the associated hypotheses. The premises from the Flickr 30k project describe people and animals whose photos were collected and presented to the Flickr 30k crowdworkers, but the SNLI corpus did not present the photos to the hypotheses creators. Neither report crowdworker or photo subject demographic information or crowdworker compensation.


The 'neither' here makes it sound like you might be discussing the SNLI hypothesis creators, but it's not totally clear.


If this is the place to talk about compensation, I sadly don't think I have the full documentation that I'd need to give useful numbers. Here's what I recall, with the help of some vaguely-relevant emails. Feel free to use any of this if it seems formal enough:

  • Writing-phase workers were paid per HIT with no special incentives. We rejected (didn't pay for) work only in one or two extreme cases where a single annotator submitted a gigantic amount of junk data in a way that was clearly automated. We disqualified workers who clearly ignored the guidelines over many HITs.
  • I believe we targeted an hourly rate in the $10-15/hour range, based on our own in-house estimates of task difficulty, though we have no reliable way to check those estimates, and it's quite plausible that they were too low once you account for time spent switching HITs.
  • The pay rate varied over time (as we fine-tuned our estimate of how hard the task was), and I believe we also separated very long premises into batches that paid slightly more. I believe the rates were all between $0.10 and $0.50, and clustered toward the middle of that range.


### Annotation process

56,941 of the total sentence pairs were further annotated in a validation task. Four annotators each labeled a premise-hypothesis pair for entailment, contradiction, or neither, resulting in 5 total judgements including the original hypothesis author judgement. See Section 2.2 for more details (Bowman et al., 2015).


for => as?

Premise | 14.1
Hypothesis | 8.3

The _label_ has 4 possible values, _0_, _1_, _2_, and _-_, which correspond to _entailment_, _neutral_, _contradiction_, and _no label_, respectively. The dataset was developed so that the first three values would be evenly distributed across the splits. See the Annotation Process section for details on _no label_.


I assume this is specific to datasets format—both original formats use string-valued labels.


### Example ID

The ID is an integer starting from 0. It has no inherent meaning.


The IDs in the original dataset correspond to identifiers from Flickr30k or (the draft version of) VisualGenome, suffixed with an internal identifier.

## Known Limitations
### Known social biases

The language reflects the content of the photos collected from Flickr, as described in the Data Collection section.


There's a paper that quantifies some kinds of bias: https://www.aclweb.org/anthology/W17-1609/


### Other known limitations

[Gururangan et al (2018)](https://www.aclweb.org/anthology/N18-2017.pdf) showed that the SNLI corpus had a number of annotation artifacts. Using a simple classifier, they correctly predicted the label of the hypothesis 67% of the time without using the premise.


This was roughly simultaneous with two other very similar papers by Tsuchiya and by Poliak et al, so it's worth mentioning all three if you mention one. The Gururangan paper cites the other two.

## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification


FWIW, it has also been used for entailment generation, though I'm not sure that it's actually useful for that task (or if that task is ever useful):

https://arxiv.org/abs/1606.01404

Updated compensation information, dataset structure, and bias literature based on review from @sleepinyourhat
@mcmillanmajora (Contributor, Author):

Thank you for taking the time to look through the card and for all your comments @sleepinyourhat ! I've incorporated them in the latest update.

Comment on lines +2 to +16
language:
- en
task:
- text-classification
purpose:
- NLI
size:
- ">100k"
language producers:
- crowdsourced
annotation:
- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
Member:

I love it, this will be very useful.

One thing about size: maybe we should use a range like "500k-1M"

Member:

Do we need to support multiple values of size for a single dataset? Otherwise it should probably be

size: ">100k"

(just a value, not an array)

Member:

Some datasets have sub-datasets (glue, wikipedia, etc.) with different sizes, so we can have several sizes.

Member:

Does it make sense to have a list (not a map) of all sub-dataset sizes in that case, though?

Member:

We should probably have a map from sub-dataset names to specific features (at least size, plus language for Wikipedia, purpose for KILT, etc.) so people can search for either the full dataset or any sub-dataset. E.g. filtering by Portuguese and LM tags returns the Portuguese subset of Wikipedia.
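For illustration, a minimal sketch of what such a map could look like for a dataset with sub-datasets (the `configs` key and the per-config fields are hypothetical, not an agreed schema, and the Wikipedia config names are made up for the example):

```yaml
# Hypothetical per-config metadata: each sub-dataset (config) keeps its own
# size/language fields so the website could filter on either the parent dataset
# or an individual config (e.g. the Portuguese subset of Wikipedia).
configs:
  wikipedia.pt:
    language:
      - pt
    size: "100k-500k"
  wikipedia.en:
    language:
      - en
    size: ">1M"
```

Keying on config names (a map rather than a flat list of sizes) would keep each entry searchable on its own while still grouping it under the parent dataset.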

- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
Member:

Suggested change:
- license: "CC BY-SA 4.0"
+ license: cc-by-4.0

For models (documented at https://huggingface.co/docs#what-metadata-can-i-add-to-my-model-card) we use the License keywords listed by GitHub at https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/licensing-a-repository#searching-github-by-license-type

(Hopefully we'll plug some sort of form validation for users at some point)

Comment on lines +4 to +9
task:
- text-classification
purpose:
- NLI
size:
- ">100k"
Member:

Maybe not in this particular PR's scope, but do we already have a sense of the taxonomy, i.e. the possible values for those tags (to display them on the website)?

Member:

@mcmillanmajora is working on that next (after taking a look at more of our datasets to get a sense of what we need).

@sleepinyourhat commented Oct 1, 2020 via email

@julien-c (Member) commented Oct 1, 2020:

@sleepinyourhat You're right, wrong copy/paste

- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
Member:

Suggested change:
- license: "CC BY-SA 4.0"
+ license: cc-by-sa-4.0

mcmillanmajora and others added 2 commits October 12, 2020 11:02
Initial tags
…entation

Updating remote branch to be up to date with HF commits
@yjernite merged commit c57e66e into huggingface:master on Oct 12, 2020
@@ -0,0 +1,17 @@
---
language:
- BCP-47: fr
Member:

This doesn't have the same scheme as the other card (and won't be compatible with our website filtering out of the box). Was this intended?

Contributor (Author):

No, I included this by mistake. Sorry about that!

@mcmillanmajora (Contributor, Author):

@sleepinyourhat The schema is definitely drawing from Data Statements and Datasheets for Datasets but we also wanted to include some more general information to introduce the dataset to new users. If you have any suggestions for changes to the schema itself, please let us know!
