fix duplicated tag in wikicorpus dataset card
lhoestq committed Dec 20, 2021
1 parent ad69b3e commit cfaf17e
Showing 1 changed file with 0 additions and 1 deletion.
1 change: 0 additions & 1 deletion datasets/wikicorpus/README.md
@@ -83,7 +83,6 @@ task_ids:
- part-of-speech-tagging
- text-classification-other-word-sense-disambiguation
paperswithcode_id: null
pretty_name: Wikicorpus
---

# Dataset Card for Wikicorpus
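The commit title refers to a duplicated tag in the card's YAML front matter. As a rough illustration of how repeated top-level keys could be spotted, here is a minimal sketch that assumes simple `key: value` lines. The helper `find_duplicate_keys` is hypothetical and not part of the datasets library, and the sample text is invented for illustration: the page as captured does not show which line the commit removed.

```python
from collections import Counter
from typing import List


def find_duplicate_keys(front_matter: str) -> List[str]:
    """Return top-level keys that occur more than once (hypothetical helper)."""
    keys = []
    for line in front_matter.splitlines():
        # Skip blank lines, indented lines, list items, "---" fences, and comments.
        if not line or line[0] in " \t-#":
            continue
        key, sep, _ = line.partition(":")
        if sep:
            keys.append(key.strip())
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


# Invented sample: the duplication here is an assumption, not the actual diff.
sample = """---
task_ids:
- part-of-speech-tagging
paperswithcode_id: null
pretty_name: Wikicorpus
pretty_name: Wikicorpus
---
"""

print(find_duplicate_keys(sample))
```

A real check would more likely use a YAML parser configured to reject duplicate keys; the line-based scan above only keeps the sketch dependency-free.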

1 comment on commit cfaf17e

@github-actions

Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.012679 / 0.011353 (0.001326) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005536 / 0.011008 (-0.005472) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.040429 / 0.038508 (0.001921) |
| read_batch_unformated after write_array2d | 0.040505 / 0.023109 (0.017396) |
| read_batch_unformated after write_flattened_sequence | 0.392570 / 0.275898 (0.116672) |
| read_batch_unformated after write_nested_sequence | 0.424710 / 0.323480 (0.101230) |
| read_col_formatted_as_numpy after write_array2d | 0.010326 / 0.007986 (0.002340) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006623 / 0.004328 (0.002295) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011550 / 0.004250 (0.007299) |
| read_col_unformated after write_array2d | 0.042316 / 0.037052 (0.005263) |
| read_col_unformated after write_flattened_sequence | 0.376035 / 0.258489 (0.117546) |
| read_col_unformated after write_nested_sequence | 0.419833 / 0.293841 (0.125992) |
| read_formatted_as_numpy after write_array2d | 0.046588 / 0.128546 (-0.081959) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014642 / 0.075646 (-0.061005) |
| read_formatted_as_numpy after write_nested_sequence | 0.352393 / 0.419271 (-0.066879) |
| read_unformated after write_array2d | 0.065066 / 0.043533 (0.021533) |
| read_unformated after write_flattened_sequence | 0.394185 / 0.255139 (0.139046) |
| read_unformated after write_nested_sequence | 0.403652 / 0.283200 (0.120452) |
| write_array2d | 0.096463 / 0.141683 (-0.045220) |
| write_flattened_sequence | 2.248619 / 1.452155 (0.796464) |
| write_nested_sequence | 2.362822 / 1.492716 (0.870106) |
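
Each benchmark cell has the form `new / old (diff)`, and the reported values are consistent with diff being new minus old, rounded to six decimals. The following is a minimal sketch for parsing such cells and checking that relationship. The names `parse_cells` and `check_row` and the tolerance are assumptions made for illustration; they are not part of the benchmark tooling.

```python
import re

# One "new / old (diff)" cell, e.g. "0.012679 / 0.011353 (0.001326)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str):
    """Return a list of (new, old, diff) float triples found in one row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL.finditer(row)]


def check_row(row: str, tol: float = 2e-6) -> bool:
    """True if every cell satisfies diff == new - old within a rounding tolerance."""
    return all(abs((new - old) - diff) <= tol for new, old, diff in parse_cells(row))


# First two cells of the benchmark_array_xd.json row for PyArrow==3.0.0.
row = "new / old (diff) 0.012679 / 0.011353 (0.001326) 0.005536 / 0.011008 (-0.005472)"

for new, old, diff in parse_cells(row):
    print(f"new={new:.6f} old={old:.6f} diff={diff:+.6f}")
print(check_row(row))
```

The tolerance is slightly above one unit in the sixth decimal because some reported diffs differ from `new - old` by that amount, presumably from rounding before display.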

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.357643 / 0.018006 (0.339637) |
| get_batch_of_1024_rows | 0.532755 / 0.000490 (0.532265) |
| get_first_row | 0.017897 / 0.000200 (0.017697) |
| get_last_row | 0.000481 / 0.000054 (0.000426) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.046562 / 0.037411 (0.009151) |
| shard | 0.028238 / 0.014526 (0.013712) |
| shuffle | 0.036431 / 0.176557 (-0.140126) |
| sort | 0.082580 / 0.737135 (-0.654555) |
| train_test_split | 0.040453 / 0.296338 (-0.255886) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.624863 / 0.215209 (0.409654) |
| read 50000 | 6.429333 / 2.077655 (4.351679) |
| read_batch 50000 10 | 2.422050 / 1.504120 (0.917930) |
| read_batch 50000 100 | 2.076697 / 1.541195 (0.535502) |
| read_batch 50000 1000 | 2.136731 / 1.468490 (0.668241) |
| read_formatted numpy 5000 | 0.790056 / 4.584777 (-3.794721) |
| read_formatted pandas 5000 | 6.771682 / 3.745712 (3.025970) |
| read_formatted tensorflow 5000 | 5.282852 / 5.269862 (0.012991) |
| read_formatted torch 5000 | 1.524022 / 4.565676 (-3.041654) |
| read_formatted_batch numpy 5000 10 | 0.093594 / 0.424275 (-0.330681) |
| read_formatted_batch numpy 5000 1000 | 0.015552 / 0.007607 (0.007944) |
| shuffled read 5000 | 0.819060 / 0.226044 (0.593016) |
| shuffled read 50000 | 8.234293 / 2.268929 (5.965365) |
| shuffled read_batch 50000 10 | 3.234014 / 55.444624 (-52.210610) |
| shuffled read_batch 50000 100 | 2.431298 / 6.876477 (-4.445178) |
| shuffled read_batch 50000 1000 | 2.532354 / 2.142072 (0.390281) |
| shuffled read_formatted numpy 5000 | 0.994084 / 4.805227 (-3.811143) |
| shuffled read_formatted_batch numpy 5000 10 | 0.219354 / 6.500664 (-6.281310) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.079675 / 0.075469 (0.004206) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.978660 / 1.841788 (0.136872) |
| map fast-tokenizer batched | 15.164211 / 8.074308 (7.089902) |
| map identity | 44.358521 / 10.191392 (34.167129) |
| map identity batched | 1.002496 / 0.680424 (0.322072) |
| map no-op batched | 0.683417 / 0.534201 (0.149216) |
| map no-op batched numpy | 0.657508 / 0.579283 (0.078224) |
| map no-op batched pandas | 0.741649 / 0.434364 (0.307285) |
| map no-op batched pytorch | 0.434269 / 0.540337 (-0.106069) |
| map no-op batched tensorflow | 0.453614 / 1.386936 (-0.933322) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010771 / 0.011353 (-0.000581) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004808 / 0.011008 (-0.006201) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.041553 / 0.038508 (0.003045) |
| read_batch_unformated after write_array2d | 0.038303 / 0.023109 (0.015194) |
| read_batch_unformated after write_flattened_sequence | 0.371385 / 0.275898 (0.095487) |
| read_batch_unformated after write_nested_sequence | 0.438592 / 0.323480 (0.115112) |
| read_col_formatted_as_numpy after write_array2d | 0.007882 / 0.007986 (-0.000103) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005806 / 0.004328 (0.001478) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009065 / 0.004250 (0.004815) |
| read_col_unformated after write_array2d | 0.048205 / 0.037052 (0.011152) |
| read_col_unformated after write_flattened_sequence | 0.362419 / 0.258489 (0.103930) |
| read_col_unformated after write_nested_sequence | 0.423306 / 0.293841 (0.129465) |
| read_formatted_as_numpy after write_array2d | 0.048085 / 0.128546 (-0.080461) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014913 / 0.075646 (-0.060733) |
| read_formatted_as_numpy after write_nested_sequence | 0.337152 / 0.419271 (-0.082119) |
| read_unformated after write_array2d | 0.073520 / 0.043533 (0.029988) |
| read_unformated after write_flattened_sequence | 0.385812 / 0.255139 (0.130673) |
| read_unformated after write_nested_sequence | 0.423273 / 0.283200 (0.140074) |
| write_array2d | 0.091043 / 0.141683 (-0.050640) |
| write_flattened_sequence | 2.237717 / 1.452155 (0.785562) |
| write_nested_sequence | 2.280709 / 1.492716 (0.787992) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.320578 / 0.018006 (0.302571) |
| get_batch_of_1024_rows | 0.562446 / 0.000490 (0.561956) |
| get_first_row | 0.021938 / 0.000200 (0.021738) |
| get_last_row | 0.000455 / 0.000054 (0.000401) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.043446 / 0.037411 (0.006035) |
| shard | 0.027619 / 0.014526 (0.013094) |
| shuffle | 0.033231 / 0.176557 (-0.143325) |
| sort | 0.074208 / 0.737135 (-0.662927) |
| train_test_split | 0.035268 / 0.296338 (-0.261070) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.643144 / 0.215209 (0.427935) |
| read 50000 | 6.455019 / 2.077655 (4.377365) |
| read_batch 50000 10 | 2.451910 / 1.504120 (0.947790) |
| read_batch 50000 100 | 2.064949 / 1.541195 (0.523754) |
| read_batch 50000 1000 | 2.177615 / 1.468490 (0.709125) |
| read_formatted numpy 5000 | 0.773020 / 4.584777 (-3.811757) |
| read_formatted pandas 5000 | 7.011327 / 3.745712 (3.265615) |
| read_formatted tensorflow 5000 | 5.275260 / 5.269862 (0.005398) |
| read_formatted torch 5000 | 1.564441 / 4.565676 (-3.001236) |
| read_formatted_batch numpy 5000 10 | 0.085331 / 0.424275 (-0.338944) |
| read_formatted_batch numpy 5000 1000 | 0.014705 / 0.007607 (0.007098) |
| shuffled read 5000 | 0.809584 / 0.226044 (0.583540) |
| shuffled read 50000 | 8.257684 / 2.268929 (5.988755) |
| shuffled read_batch 50000 10 | 3.332339 / 55.444624 (-52.112285) |
| shuffled read_batch 50000 100 | 2.529540 / 6.876477 (-4.346936) |
| shuffled read_batch 50000 1000 | 2.564972 / 2.142072 (0.422900) |
| shuffled read_formatted numpy 5000 | 0.986753 / 4.805227 (-3.818474) |
| shuffled read_formatted_batch numpy 5000 10 | 0.197303 / 6.500664 (-6.303361) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.080438 / 0.075469 (0.004969) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.057832 / 1.841788 (0.216044) |
| map fast-tokenizer batched | 15.202811 / 8.074308 (7.128503) |
| map identity | 43.851022 / 10.191392 (33.659630) |
| map identity batched | 1.104621 / 0.680424 (0.424197) |
| map no-op batched | 0.733174 / 0.534201 (0.198973) |
| map no-op batched numpy | 0.628639 / 0.579283 (0.049356) |
| map no-op batched pandas | 0.742644 / 0.434364 (0.308280) |
| map no-op batched pytorch | 0.432707 / 0.540337 (-0.107630) |
| map no-op batched tensorflow | 0.458694 / 1.386936 (-0.928242) |
