Commit

add dataset card title (#2381)
* add dataset card title

* YAML tags for multi_nli_mismatch

* extra info added in s2orc and multi_nli

* minor change
bhavitvyamalik committed May 20, 2021
1 parent f3dc890 commit e8abb4a
Showing 8 changed files with 42 additions and 11 deletions.
2 changes: 1 addition & 1 deletion datasets/circa/README.md
@@ -20,7 +20,7 @@ task_ids:
- text-classification-other-question-answer-pair-classification
---

-# Dataset Card Creation Guide
+# Dataset Card for CIRCA

## Table of Contents
- [Dataset Description](#dataset-description)
12 changes: 9 additions & 3 deletions datasets/multi_nli/README.md
@@ -23,7 +23,7 @@ task_ids:
- semantic-similarity-scoring
---

# Dataset Card for "multi_nli"
# Dataset Card for Multi-Genre Natural Language Inference (MultiNLI)

## Table of Contents
- [Dataset Description](#dataset-description)
@@ -127,17 +127,23 @@ They constructed MultiNLI so as to make it possible to explicitly evaluate model

### Source Data

#### Initial Data Collection and Normalization

They created each sentence pair by selecting a premise sentence from a preexisting text source and asked a human annotator to compose a novel sentence to pair with it as a hypothesis.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

-[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+[More Information Needed]

#### Who are the annotators?

-[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+[More Information Needed]

### Personal and Sensitive Information

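The text added to this card describes how each example pairs a found premise with an annotator-written hypothesis. For context, a minimal sketch of inspecting one such pair, assuming the `datasets` library is installed; the split and field names follow the `multi_nli` loader:

```python
from datasets import load_dataset

# The matched validation split is small (~10k rows); "train" has ~393k.
mnli = load_dataset("multi_nli", split="validation_matched")

example = mnli[0]
print(example["genre"])       # genre of the source the premise was drawn from
print(example["premise"])     # sentence selected from a preexisting text source
print(example["hypothesis"])  # novel sentence composed by a human annotator
print(example["label"])       # 0 = entailment, 1 = neutral, 2 = contradiction
```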
24 changes: 23 additions & 1 deletion datasets/multi_nli_mismatch/README.md
@@ -1,7 +1,29 @@
---
+annotations_creators:
+- crowdsourced
+language_creators:
+- crowdsourced
+- found
+languages:
+- en
+licenses:
+- cc-by-3.0
+- cc-by-sa-3.0-at
+- mit
+- other-Open Portion of the American National Corpus
+multilinguality:
+- monolingual
+size_categories:
+- 100K<n<1M
+source_datasets:
+- original
+task_categories:
+- text-scoring
+task_ids:
+- semantic-similarity-scoring
---

# Dataset Card for "multi_nli_mismatch"
# Dataset Card for Multi-Genre Natural Language Inference (Mismatched only)

## Table of Contents
- [Dataset Description](#dataset-description)
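The YAML block added above sits between `---` markers at the top of the README and carries the card's machine-readable tags. A minimal sketch of reading those tags back out, assuming only that front-matter layout (the helper name and path are illustrative):

```python
import yaml  # requires pyyaml


def read_card_tags(readme_path: str) -> dict:
    """Parse the YAML front matter at the top of a dataset card."""
    with open(readme_path, encoding="utf-8") as f:
        text = f.read()
    # The tags sit between the first two '---' lines of the file.
    _, front_matter, _ = text.split("---", 2)
    return yaml.safe_load(front_matter)


tags = read_card_tags("datasets/multi_nli_mismatch/README.md")
print(tags["languages"])        # ['en']
print(tags["size_categories"])  # ['100K<n<1M']
```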
2 changes: 1 addition & 1 deletion datasets/para_pat/README.md
@@ -34,7 +34,7 @@ task_ids:
- language-modeling
---

-# Dataset Card Creation Guide
+# Dataset Card for ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

## Table of Contents
- [Dataset Description](#dataset-description)
2 changes: 1 addition & 1 deletion datasets/paws-x/README.md
@@ -30,7 +30,7 @@ task_ids:
- text-scoring-other-paraphrase-identification
---

-# Dataset Card Creation Guide
+# Dataset Card for PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

## Table of Contents
- [Dataset Description](#dataset-description)
2 changes: 1 addition & 1 deletion datasets/paws/README.md
@@ -32,7 +32,7 @@ task_ids:
- text-scoring-other-paraphrase-identification
---

-# Dataset Card Creation Guide
+# Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling

## Table of Contents
- [Dataset Description](#dataset-description)
2 changes: 1 addition & 1 deletion datasets/re_dial/README.md
@@ -21,7 +21,7 @@ task_ids:
- text-classification-other-dialogue-sentiment-classification
---

-# Dataset Card Creation Guide
+# Dataset Card for ReDial (Recommendation Dialogues)

## Table of Contents
- [Dataset Description](#dataset-description)
7 changes: 5 additions & 2 deletions datasets/s2orc/README.md
@@ -24,7 +24,7 @@ task_ids:
- other-other-citation-recommendation
---

-# Dataset Card Creation Guide
+# Dataset Card for S2ORC: The Semantic Scholar Open Research Corpus

## Table of Contents
- [Dataset Description](#dataset-description)
@@ -242,4 +242,7 @@ Semantic Scholar Open Research Corpus is licensed under ODC-BY.
archivePrefix={arXiv},
primaryClass={cs.CL}
}
-```
+```
+### Contributions
+
+Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.

1 comment on commit e8abb4a

@github-actions

PyArrow==1.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.022793 / 0.011353 (0.011441) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016088 / 0.011008 (0.005080) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.046370 / 0.038508 (0.007862) |
| read_batch_unformated after write_array2d | 0.035086 / 0.023109 (0.011977) |
| read_batch_unformated after write_flattened_sequence | 0.342783 / 0.275898 (0.066885) |
| read_batch_unformated after write_nested_sequence | 0.394128 / 0.323480 (0.070648) |
| read_col_formatted_as_numpy after write_array2d | 0.011050 / 0.007986 (0.003064) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005356 / 0.004328 (0.001027) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010677 / 0.004250 (0.006427) |
| read_col_unformated after write_array2d | 0.046402 / 0.037052 (0.009349) |
| read_col_unformated after write_flattened_sequence | 0.348490 / 0.258489 (0.090001) |
| read_col_unformated after write_nested_sequence | 0.396384 / 0.293841 (0.102543) |
| read_formatted_as_numpy after write_array2d | 0.166216 / 0.128546 (0.037670) |
| read_formatted_as_numpy after write_flattened_sequence | 0.127299 / 0.075646 (0.051653) |
| read_formatted_as_numpy after write_nested_sequence | 0.411002 / 0.419271 (-0.008270) |
| read_unformated after write_array2d | 0.400628 / 0.043533 (0.357096) |
| read_unformated after write_flattened_sequence | 0.385507 / 0.255139 (0.130368) |
| read_unformated after write_nested_sequence | 0.404449 / 0.283200 (0.121250) |
| write_array2d | 1.591353 / 0.141683 (1.449671) |
| write_flattened_sequence | 1.775193 / 1.452155 (0.323038) |
| write_nested_sequence | 1.884147 / 1.492716 (0.391431) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.008439 / 0.018006 (-0.009567) |
| get_batch_of_1024_rows | 0.471459 / 0.000490 (0.470969) |
| get_first_row | 0.002282 / 0.000200 (0.002082) |
| get_last_row | 0.000071 / 0.000054 (0.000016) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.042347 / 0.037411 (0.004936) |
| shard | 0.025441 / 0.014526 (0.010915) |
| shuffle | 0.027082 / 0.176557 (-0.149475) |
| sort | 0.043413 / 0.737135 (-0.693723) |
| train_test_split | 0.027741 / 0.296338 (-0.268598) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.490720 / 0.215209 (0.275511) |
| read 50000 | 4.813860 / 2.077655 (2.736206) |
| read_batch 50000 10 | 2.483999 / 1.504120 (0.979879) |
| read_batch 50000 100 | 1.909035 / 1.541195 (0.367840) |
| read_batch 50000 1000 | 1.869181 / 1.468490 (0.400691) |
| read_formatted numpy 5000 | 7.102323 / 4.584777 (2.517546) |
| read_formatted pandas 5000 | 6.747079 / 3.745712 (3.001367) |
| read_formatted tensorflow 5000 | 8.766189 / 5.269862 (3.496327) |
| read_formatted torch 5000 | 7.764401 / 4.565676 (3.198724) |
| read_formatted_batch numpy 5000 10 | 0.734412 / 0.424275 (0.310137) |
| read_formatted_batch numpy 5000 1000 | 0.010719 / 0.007607 (0.003112) |
| shuffled read 5000 | 0.683428 / 0.226044 (0.457383) |
| shuffled read 50000 | 6.585449 / 2.268929 (4.316521) |
| shuffled read_batch 50000 10 | 3.212886 / 55.444624 (-52.231738) |
| shuffled read_batch 50000 100 | 2.283878 / 6.876477 (-4.592598) |
| shuffled read_batch 50000 1000 | 2.362790 / 2.142072 (0.220717) |
| shuffled read_formatted numpy 5000 | 7.395375 / 4.805227 (2.590147) |
| shuffled read_formatted_batch numpy 5000 10 | 4.862597 / 6.500664 (-1.638067) |
| shuffled read_formatted_batch numpy 5000 1000 | 9.021729 / 0.075469 (8.946260) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 10.973504 / 1.841788 (9.131716) |
| map fast-tokenizer batched | 13.040236 / 8.074308 (4.965928) |
| map identity | 40.602557 / 10.191392 (30.411165) |
| map identity batched | 1.009016 / 0.680424 (0.328592) |
| map no-op batched | 0.685848 / 0.534201 (0.151647) |
| map no-op batched numpy | 0.792323 / 0.579283 (0.213040) |
| map no-op batched pandas | 0.640699 / 0.434364 (0.206335) |
| map no-op batched pytorch | 0.714434 / 0.540337 (0.174097) |
| map no-op batched tensorflow | 1.529983 / 1.386936 (0.143047) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.021895 / 0.011353 (0.010542) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016083 / 0.011008 (0.005074) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.048996 / 0.038508 (0.010488) |
| read_batch_unformated after write_array2d | 0.034880 / 0.023109 (0.011771) |
| read_batch_unformated after write_flattened_sequence | 0.317014 / 0.275898 (0.041116) |
| read_batch_unformated after write_nested_sequence | 0.345712 / 0.323480 (0.022232) |
| read_col_formatted_as_numpy after write_array2d | 0.011084 / 0.007986 (0.003098) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005568 / 0.004328 (0.001239) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010881 / 0.004250 (0.006630) |
| read_col_unformated after write_array2d | 0.051393 / 0.037052 (0.014341) |
| read_col_unformated after write_flattened_sequence | 0.312082 / 0.258489 (0.053593) |
| read_col_unformated after write_nested_sequence | 0.348911 / 0.293841 (0.055070) |
| read_formatted_as_numpy after write_array2d | 0.160419 / 0.128546 (0.031873) |
| read_formatted_as_numpy after write_flattened_sequence | 0.134119 / 0.075646 (0.058472) |
| read_formatted_as_numpy after write_nested_sequence | 0.414192 / 0.419271 (-0.005080) |
| read_unformated after write_array2d | 0.412109 / 0.043533 (0.368577) |
| read_unformated after write_flattened_sequence | 0.355149 / 0.255139 (0.100010) |
| read_unformated after write_nested_sequence | 0.338320 / 0.283200 (0.055121) |
| write_array2d | 1.573061 / 0.141683 (1.431378) |
| write_flattened_sequence | 1.716055 / 1.452155 (0.263900) |
| write_nested_sequence | 1.771535 / 1.492716 (0.278819) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.009840 / 0.018006 (-0.008166) |
| get_batch_of_1024_rows | 0.505619 / 0.000490 (0.505129) |
| get_first_row | 0.002918 / 0.000200 (0.002718) |
| get_last_row | 0.000084 / 0.000054 (0.000029) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.043922 / 0.037411 (0.006510) |
| shard | 0.028487 / 0.014526 (0.013961) |
| shuffle | 0.031224 / 0.176557 (-0.145332) |
| sort | 0.054971 / 0.737135 (-0.682165) |
| train_test_split | 0.033013 / 0.296338 (-0.263326) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.450873 / 0.215209 (0.235664) |
| read 50000 | 4.474106 / 2.077655 (2.396451) |
| read_batch 50000 10 | 2.020914 / 1.504120 (0.516794) |
| read_batch 50000 100 | 1.708487 / 1.541195 (0.167292) |
| read_batch 50000 1000 | 1.700633 / 1.468490 (0.232143) |
| read_formatted numpy 5000 | 6.835161 / 4.584777 (2.250384) |
| read_formatted pandas 5000 | 6.140094 / 3.745712 (2.394382) |
| read_formatted tensorflow 5000 | 8.569432 / 5.269862 (3.299570) |
| read_formatted torch 5000 | 7.635942 / 4.565676 (3.070266) |
| read_formatted_batch numpy 5000 10 | 0.713023 / 0.424275 (0.288748) |
| read_formatted_batch numpy 5000 1000 | 0.010164 / 0.007607 (0.002557) |
| shuffled read 5000 | 0.637439 / 0.226044 (0.411395) |
| shuffled read 50000 | 6.203184 / 2.268929 (3.934256) |
| shuffled read_batch 50000 10 | 2.676527 / 55.444624 (-52.768097) |
| shuffled read_batch 50000 100 | 2.061355 / 6.876477 (-4.815122) |
| shuffled read_batch 50000 1000 | 2.359748 / 2.142072 (0.217676) |
| shuffled read_formatted numpy 5000 | 7.174909 / 4.805227 (2.369682) |
| shuffled read_formatted_batch numpy 5000 10 | 4.630621 / 6.500664 (-1.870043) |
| shuffled read_formatted_batch numpy 5000 1000 | 5.770101 / 0.075469 (5.694632) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 10.634990 / 1.841788 (8.793202) |
| map fast-tokenizer batched | 13.028324 / 8.074308 (4.954016) |
| map identity | 41.498924 / 10.191392 (31.307532) |
| map identity batched | 0.861674 / 0.680424 (0.181250) |
| map no-op batched | 0.595123 / 0.534201 (0.060922) |
| map no-op batched numpy | 0.780564 / 0.579283 (0.201281) |
| map no-op batched pandas | 0.621599 / 0.434364 (0.187235) |
| map no-op batched pytorch | 0.702952 / 0.540337 (0.162615) |
| map no-op batched tensorflow | 1.579572 / 1.386936 (0.192636) |
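Each cell above reads `new / old (diff)`: the timing for this commit, the reference timing, and their difference in seconds. A minimal sketch of how one such report row could be rendered, assuming each benchmark JSON simply maps metric names to timings in seconds (the file layout and reference-file name are assumptions, not the actual benchmark tooling):

```python
import json


def report(new_path: str, old_path: str) -> None:
    """Print one 'metric: new / old (diff)' line per shared metric."""
    with open(new_path) as f:
        new = json.load(f)
    with open(old_path) as f:
        old = json.load(f)
    for metric in sorted(set(new) & set(old)):
        diff = new[metric] - old[metric]
        print(f"{metric}: {new[metric]:.6f} / {old[metric]:.6f} ({diff:.6f})")


# Hypothetical file names for the current and reference runs.
report("benchmark_array_xd.json", "benchmark_array_xd_reference.json")
```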
