
Created dataset card snli.md #663

Merged (6 commits) on Oct 12, 2020

Conversation

@mcmillanmajora (Contributor):

First draft of a dataset card using the SNLI corpus as an example.

This is mostly based on the Google Doc draft, but I added a few sections and moved some things around.

  • I moved Who Was Involved to follow Language, both because I thought the authors should be presented more towards the front and because I think it makes sense to present the speakers close to the language so it doesn't have to be repeated.

  • I created a section I called Data Characteristics by pulling some things out of the other sections. I was thinking that this would be more about the language use in context of the specific task construction. That name isn't very descriptive though and could probably be improved.
    -- Domain and language type out of Language. I particularly wanted to keep the Language section as simple and as abstracted from the task as possible.
    -- 'How was the data collected' out of Who Was Involved
    -- Normalization out of Features/Dataset Structure
    -- I also added an annotation process section.

  • I kept the Features section mostly the same as the Google Doc, but I renamed it Dataset Structure to more clearly separate it from the language use, and added some links to the documentation pages.

  • I also kept Tasks Supported, Known Limitations, and Licensing Information mostly the same. Looking at it again though, maybe Tasks Supported should come before Data Characteristics?

The trickiest part about writing a dataset card for the SNLI corpus specifically is that it's built on datasets which are themselves built on datasets so I had to dig in a lot of places to find information. I think this will be easier with other datasets and once there is more uptake of dataset cards so they can just link to each other. (Maybe that needs to be an added section?)

I also made an effort not to repeat information across the sections, referring back to an earlier section instead when the information was relevant again later. Is there too much repetition still?

First draft of a dataset card using the SNLI corpus as an example
@mcmillanmajora added the "Dataset discussion" label on Sep 22, 2020
@yjernite self-assigned this on Sep 22, 2020
@yjernite (Member) commented Sep 23, 2020:
It would be amazing if we ended up with this much information on all of our datasets :)

I don't think there's too much repetition, everything that is in here is relevant. The main challenge will be to figure out how to structure the sheet so that all of the information can be presented without overwhelming the reader. We'll also want to have as much of it as possible in structured form so it can be easily navigated.

@yjernite (Member):

@mcmillanmajora for now can you remove the prompts / quoted blocks so we can see what the datasheet would look like on its own?

Would also love to hear if @sgugger has some first impressions

Removed section prompts for clarity
@mcmillanmajora (Contributor, Author):

I removed the prompts. It's definitely a little easier to read without them!

@julien-c (Member):

Should we name the file README.md for consistency with models?

@sgugger (Contributor) left a comment:

This looks great to me, thanks a lot for all the work!

I don't feel like there are too many repetitions (almost the opposite end of the scale: I feel there are slightly too many "see this section for more information" pointers).

I just think the "Tasks supported" section could be expanded a bit and moved a bit more toward the beginning.

1 | Two women are embracing while holding to go packages. | Two woman are holding packages. | 0
2 | Two women are embracing while holding to go packages. | The men are fighting outside a deli. | 2

## Tasks supported:
Contributor:

I feel this section should be presented earlier, as it's probably the information a user wants to know first.

Member:

Agreed. It's also short enough that it doesn't push the author information too far down.

@lhoestq (Member), Sep 25, 2020:

I agree!

Furthermore, it'd be cool to have a section in YAML with the tasks. It would allow moon-landing to parse it and allow users to do dataset search by task :) cc @julien-c
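For illustration, a minimal sketch of what such a machine-readable front-matter block might look like at the top of the card (the field names and values here are placeholders rather than a settled schema; the YAML actually added to the card later in this PR is quoted further down the thread):

```yaml
---
# Hypothetical front-matter sketch: a tagging block that the Hub website (moon-landing)
# could parse to let users filter datasets by task. Field names are illustrative only.
task:
  - text-classification
language:
  - en
license: cc-by-sa-4.0
---
```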

## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification
Contributor:

Do we have pages that explain what those are in detail? Otherwise, a brief sentence explaining what both are would be super useful.

@yjernite (Member), Sep 24, 2020:

We should make some structured tags for the task types / families (and have a page somewhere explaining what the tags mean)

Contributor:

In that case, it would be great to have links here to that page.

@thomwolf (Member) left a comment:

This is a really amazing example of what a dataset card could/should be 😍

Maybe when it starts to contain so much information we could even have a table of contents at the top?

@lhoestq (Member) left a comment:

Love it!


Renamed to README.md. Task section moved to top. TOC added. YAML content added. Multilingual section removed because it seemed repetitive.
@yjernite (Member):

Asked @sleepinyourhat for some insights too :)

@sleepinyourhat left a comment:

Thanks for doing this!

## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification


What makes this structured? That would usually make me think of structured prediction—i.e., predicting something more complex than a label from a small fixed set.

Contributor (Author):

It was in contrast with text-to-text, but we're still working on the tag set. Is there something you think would fit better?


Just cut 'structured'?


It was supported by a Google Faculty Research Award, a gift from Bloomberg L.P., the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040, the National Science Foundation under grant no. IIS 1159679, and the Department of the Navy, Office of Naval Research, under grant no. N00014-10-1-0109.

### Who are the language producers (who wrote the text / created the base content)?


It could be worth flagging here that the annotations themselves are also free-form text, so the rest of this document is also relevant to this question.


### Who are the annotators?

The annotators of the validation task were a closed set of about 30 trusted crowdworkers on Amazon Mechanical Turk. It is unknown if any demographic information was collected or how they were compensated.


No demographic information was collected.


We gave $1 bonuses in cases where annotator labels agreed with our labels for 250 randomly distributed examples.


Otherwise, annotators were paid per HIT under similar terms to the above.


### Who are the language producers (who wrote the text / created the base content)?

A large portion of the premises (160k) were produced in the [Flickr 30k corpus](http://shannon.cs.illinois.edu/DenotationGraph/) by an unknown number of crowdworkers. About 2,500 crowdworkers from Amazon Mechanical Turk produced the associated hypotheses. The premises from the Flickr 30k project describe people and animals whose photos were collected and presented to the Flickr 30k crowdworkers, but the SNLI corpus did not present the photos to the hypotheses creators. Neither report crowdworker or photo subject demographic information or crowdworker compensation.


The 'neither' here makes it sound like you might be discussing the SNLI hypothesis creators, but it's not totally clear.


If this is the place to talk about compensation, I sadly don't think I have the full documentation that I'd need to give useful numbers. Here's what I recall, with the help of some vaguely-relevant emails. Feel free to use any of this if it seems formal enough:

  • Writing-phase workers were paid per HIT with no special incentives. We rejected (didn't pay for) work only in one or two extreme cases where a single annotator submitted a gigantic amount of junk data in a way that was clearly automated. We disqualified workers who clearly ignored the guidelines over many HITs.
  • I believe we targeted an hourly rate in the $10-15/hour range, based on our own in-house estimates of task difficulty, though we have no reliable way to check those estimates, and it's quite plausible that they were too low once you account for time spent switching HITs.
  • The pay rate varied over time (as we fine-tuned our estimate of how hard the task was), and I believe we also separated very long premises into batches that paid slightly more. I believe the rates were all between $0.10 and $0.50, and clustered toward the middle of that range.


### Annotation process

56,941 of the total sentence pairs were further annotated in a validation task. Four annotators each labeled a premise-hypothesis pair for entailment, contradiction, or neither, resulting in 5 total judgements including the original hypothesis author judgement. See Section 2.2 for more details (Bowman et al., 2015).


for => as?

Premise | 14.1
Hypothesis | 8.3

The _label_ has 4 possible values, _0_, _1_, _2_, and _-_, which correspond to _entailment_, _neutral_, _contradiction_, and _no label_, respectively. The dataset was developed so that the first three values would be evenly distributed across the splits. See the Annotation Process section for details on _no label_.


I assume this is specific to datasets format—both original formats use string-valued labels.


### Example ID

The ID is an integer starting from 0. It has no inherent meaning.


The IDs in the original dataset correspond to identifiers from Flickr30k or (the draft version of) VisualGenome, suffixed with an internal identifier.

## Known Limitations
### Known social biases

The language reflects the content of the photos collected from Flickr, as described in the Data Collection section.


There's a paper that quantifies some kinds of bias: https://www.aclweb.org/anthology/W17-1609/


### Other known limitations

[Gururangan et al (2018)](https://www.aclweb.org/anthology/N18-2017.pdf) showed that the SNLI corpus had a number of annotation artifacts. Using a simple classifier, they correctly predicted the label of the hypothesis 67% of the time without using the premise.


This was roughly simultaneous with two other very similar papers by Tsuchiya and by Poliak et al, so it's worth mentioning all three if you mention one. The Gururangan paper cites the other two.

## Tasks supported:
### Task categorization / tags

Text to structured, three-way text classification


FWIW, it has also been used for entailment generation, though I'm not sure that it's actually useful for that task (or if that task is ever useful):

https://arxiv.org/abs/1606.01404

Updated compensation information, dataset structure, and bias literature based on review from @sleepinyourhat
@mcmillanmajora (Contributor, Author):

Thank you for taking the time to look through the card and for all your comments @sleepinyourhat ! I've incorporated them in the latest update.

Comment on lines +2 to +16
language:
- en
task:
- text-classification
purpose:
- NLI
size:
- ">100k"
language producers:
- crowdsourced
annotation:
- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
Member:

I love it, this will be very useful.

One thing about size: maybe we should use a range like "500k-1M"

Member:

Do we need to support multiple values of size for a single dataset? Otherwise it should probably be

size: ">100k"

(just a value, not an array)

Member:

Some datasets have sub-datasets (glue, wikipedia, etc.) with different sizes, so we can have several sizes.

Member:

Does it make sense to have a list (not a map) of all sub-dataset sizes in that case, though?

Member:

We should probably have a map from sub-dataset names to specific features (at least size, plus language for Wikipedia, purpose for KILT, etc.) so people can search for either the full dataset or any sub-dataset. E.g. filtering by Portuguese and LM tags returns the Portuguese subset of Wikipedia.
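For illustration, a minimal sketch of what such a map could look like for a dataset with sub-datasets (the `configs` key and the per-config fields are hypothetical, not an agreed schema, and the Wikipedia config names are made up for the example):

```yaml
# Hypothetical per-config metadata: each sub-dataset (config) keeps its own
# size/language fields so the website could filter on either the parent dataset
# or an individual config (e.g. the Portuguese subset of Wikipedia).
configs:
  wikipedia.pt:
    language:
      - pt
    size: "100k-500k"
  wikipedia.en:
    language:
      - en
    size: ">1M"
```

Keying on config names (a map rather than a flat list of sizes) would keep each entry searchable on its own while still grouping it under the parent dataset.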

- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
Member:

Suggested change:
- license: "CC BY-SA 4.0"
+ license: cc-by-4.0

For models (documented at https://huggingface.co/docs#what-metadata-can-i-add-to-my-model-card) we use the License keywords listed by GitHub at https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/licensing-a-repository#searching-github-by-license-type

(Hopefully we'll plug some sort of form validation for users at some point)

Comment on lines +4 to +9
task:
- text-classification
purpose:
- NLI
size:
- ">100k"
Member:

Maybe not in this particular PR's scope, but do we already have a sense of the taxonomy, i.e. the possible values for those tags (to display them on the website)?

Member:

@mcmillanmajora is working on that next (after taking a look at more of our datasets to get a sense of what we need).

@sleepinyourhat commented Oct 1, 2020 via email

@julien-c (Member) commented Oct 1, 2020:

@sleepinyourhat You're right, wrong copy/paste

- crowdsourced
tags:
- extended-from-other-datasets
license: "CC BY-SA 4.0"
Member:

Suggested change:
- license: "CC BY-SA 4.0"
+ license: cc-by-sa-4.0

mcmillanmajora and others added 2 commits October 12, 2020 11:02
Initial tags
…entation

Updating remote branch to be up to date with HF commits
@yjernite merged commit c57e66e into huggingface:master on Oct 12, 2020
@@ -0,0 +1,17 @@
---
language:
- BCP-47: fr
Member:

This doesn't have the same scheme as the other card (and won't be compatible with our website filtering out of the box). Was this intended?

Contributor (Author):

No, I included this by mistake. Sorry about that!

@mcmillanmajora (Contributor, Author):

@sleepinyourhat The schema is definitely drawing from Data Statements and Datasheets for Datasets but we also wanted to include some more general information to introduce the dataset to new users. If you have any suggestions for changes to the schema itself, please let us know!
