Deprecate task api #5865

mariosasko · 2023-05-15T16:48:24Z

The task API is not widely adopted in the ecosystem, so this PR deprecates it. As far as I can tell, train_eval_index is a newer, more flexible solution that should be used instead.
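
For readers unfamiliar with how a deprecation like this is usually surfaced to users, here is a minimal sketch, assuming a decorator-based approach. The `deprecated` helper and the `prepare_for_task_stub` function are hypothetical names introduced only for illustration; they are not taken from this PR's diff, and the actual implementation may differ.

```python
import functools
import warnings


def deprecated(reason: str):
    """Hypothetical helper: wrap a callable so each call emits a FutureWarning."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not at this wrapper.
            warnings.warn(
                f"'{func.__name__}' is deprecated and will be removed "
                f"in a future version. {reason}",
                FutureWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated("Use the train_eval_index metadata instead.")
def prepare_for_task_stub(task: str) -> str:
    # Hypothetical stand-in for a task API entry point (assumption, not library code).
    return f"prepared for {task}"
```

Calling `prepare_for_task_stub("image-classification")` still returns its normal result, but also emits a `FutureWarning` that points users at the replacement.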

These are the projects that still use the task API:

the image classification example in Transformers: here and here
autotrain: here
api-inference-community: here (but the rest of the code does not call the resolve_dataset function)

So we need to update these files after the merge.

cc @lewtun

HuggingFaceDocBuilderDev · 2023-05-15T16:52:52Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq · 2023-05-15T17:42:10Z

If it's easy to keep supporting it, we can keep it, no? Many datasets on the Hub implement task templates in their dataset scripts, and it may be easier to keep task templates than to open PRs to those datasets.

polinaeterna · 2023-05-15T18:14:37Z

Do we know if people use the task API?

Edit: I mean, I'm fine with removing it if it's not used much, especially since it's not well documented.

mariosasko · 2023-07-07T15:38:02Z

@lhoestq

Fewer than 80 public datasets (all canonical) implement task_templates, so updating them should be easy.

PS: I skipped gated datasets when checking for the presence of task_templates, but it's safe to assume their contribution to the total count is insignificant.
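
As an illustration of how such a check could be done, here is a minimal sketch, assuming each dataset's loading script is available as a source string. It is not the script that was actually used for the count above; `uses_task_templates` is a hypothetical helper that looks for a `task_templates` keyword argument in any call within the script.

```python
import ast


def uses_task_templates(script_source: str) -> bool:
    """Return True if any call in the script passes a `task_templates` keyword.

    Hypothetical helper for illustration; the source is parsed, never executed.
    """
    tree = ast.parse(script_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # kw.arg is None for **kwargs expansions, so those are skipped naturally.
            if any(kw.arg == "task_templates" for kw in node.keywords):
                return True
    return False


# Made-up script snippets, used only to exercise the helper.
with_templates = "info = DatasetInfo(description='d', task_templates=[template])"
without_templates = "info = DatasetInfo(description='d')"
```

Here `uses_task_templates(with_templates)` is true and `uses_task_templates(without_templates)` is false. A syntactic check like this would miss templates assigned in other ways, so it should be read as a rough filter only.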

github-actions · 2023-07-07T15:39:47Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006480 / 0.011353 (-0.004872)	0.003904 / 0.011008 (-0.007104)	0.084287 / 0.038508 (0.045779)	0.071438 / 0.023109 (0.048329)	0.309823 / 0.275898 (0.033925)	0.341038 / 0.323480 (0.017558)	0.005163 / 0.007986 (-0.002822)	0.003291 / 0.004328 (-0.001037)	0.064473 / 0.004250 (0.060222)	0.053385 / 0.037052 (0.016332)	0.323561 / 0.258489 (0.065072)	0.346332 / 0.293841 (0.052491)	0.030588 / 0.128546 (-0.097958)	0.008342 / 0.075646 (-0.067305)	0.287205 / 0.419271 (-0.132067)	0.051953 / 0.043533 (0.008420)	0.310925 / 0.255139 (0.055786)	0.344443 / 0.283200 (0.061244)	0.022754 / 0.141683 (-0.118928)	1.459648 / 1.452155 (0.007494)	1.528413 / 1.492716 (0.035697)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.206404 / 0.018006 (0.188398)	0.461864 / 0.000490 (0.461374)	0.004501 / 0.000200 (0.004302)	0.000080 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026891 / 0.037411 (-0.010520)	0.081206 / 0.014526 (0.066680)	0.093648 / 0.176557 (-0.082908)	0.148491 / 0.737135 (-0.588645)	0.093874 / 0.296338 (-0.202464)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.401715 / 0.215209 (0.186506)	4.018597 / 2.077655 (1.940943)	2.029735 / 1.504120 (0.525615)	1.860069 / 1.541195 (0.318875)	1.935712 / 1.468490 (0.467222)	0.485896 / 4.584777 (-4.098881)	3.638177 / 3.745712 (-0.107535)	5.124058 / 5.269862 (-0.145804)	3.099666 / 4.565676 (-1.466011)	0.057173 / 0.424275 (-0.367102)	0.007240 / 0.007607 (-0.000367)	0.478758 / 0.226044 (0.252713)	4.798471 / 2.268929 (2.529543)	2.502980 / 55.444624 (-52.941645)	2.170650 / 6.876477 (-4.705827)	2.381394 / 2.142072 (0.239321)	0.578766 / 4.805227 (-4.226462)	0.132342 / 6.500664 (-6.368322)	0.059759 / 0.075469 (-0.015710)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.249238 / 1.841788 (-0.592549)	19.224673 / 8.074308 (11.150365)	13.786894 / 10.191392 (3.595502)	0.164633 / 0.680424 (-0.515791)	0.018065 / 0.534201 (-0.516136)	0.390589 / 0.579283 (-0.188694)	0.408993 / 0.434364 (-0.025370)	0.457001 / 0.540337 (-0.083336)	0.625327 / 1.386936 (-0.761609)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006827 / 0.011353 (-0.004526)	0.004007 / 0.011008 (-0.007001)	0.065239 / 0.038508 (0.026731)	0.079829 / 0.023109 (0.056719)	0.400323 / 0.275898 (0.124425)	0.434158 / 0.323480 (0.110678)	0.005314 / 0.007986 (-0.002671)	0.003354 / 0.004328 (-0.000974)	0.065044 / 0.004250 (0.060794)	0.060315 / 0.037052 (0.023262)	0.401513 / 0.258489 (0.143024)	0.441119 / 0.293841 (0.147278)	0.031783 / 0.128546 (-0.096763)	0.008608 / 0.075646 (-0.067038)	0.071755 / 0.419271 (-0.347517)	0.048816 / 0.043533 (0.005283)	0.393896 / 0.255139 (0.138757)	0.412156 / 0.283200 (0.128956)	0.024410 / 0.141683 (-0.117272)	1.515159 / 1.452155 (0.063005)	1.562217 / 1.492716 (0.069501)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229993 / 0.018006 (0.211987)	0.449898 / 0.000490 (0.449409)	0.000376 / 0.000200 (0.000176)	0.000056 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030297 / 0.037411 (-0.007115)	0.086737 / 0.014526 (0.072212)	0.098312 / 0.176557 (-0.078244)	0.152890 / 0.737135 (-0.584246)	0.099335 / 0.296338 (-0.197003)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.415786 / 0.215209 (0.200577)	4.137606 / 2.077655 (2.059952)	2.120082 / 1.504120 (0.615963)	1.943984 / 1.541195 (0.402789)	2.040821 / 1.468490 (0.572331)	0.479273 / 4.584777 (-4.105504)	3.563854 / 3.745712 (-0.181858)	3.396071 / 5.269862 (-1.873790)	2.011302 / 4.565676 (-2.554374)	0.057202 / 0.424275 (-0.367073)	0.007338 / 0.007607 (-0.000269)	0.488378 / 0.226044 (0.262333)	4.881615 / 2.268929 (2.612686)	2.669685 / 55.444624 (-52.774939)	2.258236 / 6.876477 (-4.618241)	2.343303 / 2.142072 (0.201230)	0.606762 / 4.805227 (-4.198466)	0.133190 / 6.500664 (-6.367475)	0.062971 / 0.075469 (-0.012498)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.345215 / 1.841788 (-0.496573)	20.023713 / 8.074308 (11.949405)	14.555777 / 10.191392 (4.364385)	0.162388 / 0.680424 (-0.518036)	0.018528 / 0.534201 (-0.515673)	0.393055 / 0.579283 (-0.186229)	0.411820 / 0.434364 (-0.022544)	0.461705 / 0.540337 (-0.078633)	0.629395 / 1.386936 (-0.757541)

lhoestq · 2023-07-07T16:12:39Z

OK! I also know of https://huggingface.co/datasets/hf-internal-testing/cats_vs_dogs_sample/blob/main/cats_vs_dogs_sample.py, which needs to be updated as well.

docs/source/package_reference/task_templates.mdx

github-actions · 2023-07-07T17:19:26Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009100 / 0.011353 (-0.002253)	0.005158 / 0.011008 (-0.005850)	0.109291 / 0.038508 (0.070782)	0.086053 / 0.023109 (0.062943)	0.469859 / 0.275898 (0.193961)	0.476142 / 0.323480 (0.152662)	0.006739 / 0.007986 (-0.001247)	0.005077 / 0.004328 (0.000748)	0.078193 / 0.004250 (0.073943)	0.065956 / 0.037052 (0.028904)	0.490323 / 0.258489 (0.231834)	0.497418 / 0.293841 (0.203577)	0.060562 / 0.128546 (-0.067984)	0.016321 / 0.075646 (-0.059325)	0.379703 / 0.419271 (-0.039568)	0.087335 / 0.043533 (0.043802)	0.488240 / 0.255139 (0.233101)	0.497391 / 0.283200 (0.214191)	0.040699 / 0.141683 (-0.100984)	1.778925 / 1.452155 (0.326770)	1.856436 / 1.492716 (0.363720)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.236428 / 0.018006 (0.218422)	0.551950 / 0.000490 (0.551460)	0.007400 / 0.000200 (0.007201)	0.000120 / 0.000054 (0.000066)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028461 / 0.037411 (-0.008950)	0.093441 / 0.014526 (0.078915)	0.103868 / 0.176557 (-0.072688)	0.176269 / 0.737135 (-0.560867)	0.107760 / 0.296338 (-0.188578)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.593382 / 0.215209 (0.378173)	5.863711 / 2.077655 (3.786057)	2.493777 / 1.504120 (0.989657)	2.088547 / 1.541195 (0.547352)	2.173147 / 1.468490 (0.704656)	0.875661 / 4.584777 (-3.709116)	5.209023 / 3.745712 (1.463310)	4.483261 / 5.269862 (-0.786600)	2.843288 / 4.565676 (-1.722388)	0.098488 / 0.424275 (-0.325787)	0.008371 / 0.007607 (0.000764)	0.668413 / 0.226044 (0.442368)	6.709802 / 2.268929 (4.440873)	3.132453 / 55.444624 (-52.312172)	2.428736 / 6.876477 (-4.447741)	2.560867 / 2.142072 (0.418794)	0.983550 / 4.805227 (-3.821677)	0.207072 / 6.500664 (-6.293592)	0.073786 / 0.075469 (-0.001683)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.625871 / 1.841788 (-0.215917)	23.481015 / 8.074308 (15.406707)	20.556677 / 10.191392 (10.365285)	0.238147 / 0.680424 (-0.442277)	0.029453 / 0.534201 (-0.504748)	0.464589 / 0.579283 (-0.114695)	0.599129 / 0.434364 (0.164765)	0.550146 / 0.540337 (0.009808)	0.794646 / 1.386936 (-0.592290)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008613 / 0.011353 (-0.002739)	0.004979 / 0.011008 (-0.006030)	0.078095 / 0.038508 (0.039587)	0.080285 / 0.023109 (0.057176)	0.482881 / 0.275898 (0.206983)	0.520442 / 0.323480 (0.196962)	0.006241 / 0.007986 (-0.001744)	0.003964 / 0.004328 (-0.000364)	0.080027 / 0.004250 (0.075777)	0.065209 / 0.037052 (0.028157)	0.476113 / 0.258489 (0.217623)	0.535383 / 0.293841 (0.241542)	0.053084 / 0.128546 (-0.075462)	0.014284 / 0.075646 (-0.061362)	0.083859 / 0.419271 (-0.335413)	0.061024 / 0.043533 (0.017492)	0.477810 / 0.255139 (0.222671)	0.508718 / 0.283200 (0.225518)	0.036602 / 0.141683 (-0.105081)	1.810422 / 1.452155 (0.358267)	1.832833 / 1.492716 (0.340117)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.281443 / 0.018006 (0.263437)	0.568249 / 0.000490 (0.567760)	0.000493 / 0.000200 (0.000293)	0.000077 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033302 / 0.037411 (-0.004110)	0.100433 / 0.014526 (0.085907)	0.105465 / 0.176557 (-0.071091)	0.161986 / 0.737135 (-0.575149)	0.115736 / 0.296338 (-0.180603)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.622892 / 0.215209 (0.407683)	6.144361 / 2.077655 (4.066706)	2.849443 / 1.504120 (1.345323)	2.544097 / 1.541195 (1.002902)	2.579859 / 1.468490 (1.111369)	0.826078 / 4.584777 (-3.758699)	5.021808 / 3.745712 (1.276096)	4.694784 / 5.269862 (-0.575077)	2.796263 / 4.565676 (-1.769413)	0.090983 / 0.424275 (-0.333292)	0.008445 / 0.007607 (0.000838)	0.744675 / 0.226044 (0.518631)	7.662989 / 2.268929 (5.394060)	3.665611 / 55.444624 (-51.779013)	2.942836 / 6.876477 (-3.933641)	2.874402 / 2.142072 (0.732329)	1.010097 / 4.805227 (-3.795130)	0.218008 / 6.500664 (-6.282656)	0.087359 / 0.075469 (0.011890)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.655631 / 1.841788 (-0.186157)	23.539596 / 8.074308 (15.465288)	20.909512 / 10.191392 (10.718120)	0.202092 / 0.680424 (-0.478332)	0.029807 / 0.534201 (-0.504394)	0.487591 / 0.579283 (-0.091692)	0.573719 / 0.434364 (0.139355)	0.531168 / 0.540337 (-0.009170)	0.742375 / 1.386936 (-0.644561)

github-actions · 2023-07-07T17:27:30Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006247 / 0.011353 (-0.005106)	0.003650 / 0.011008 (-0.007358)	0.079655 / 0.038508 (0.041147)	0.060279 / 0.023109 (0.037170)	0.309033 / 0.275898 (0.033135)	0.338479 / 0.323480 (0.014999)	0.004651 / 0.007986 (-0.003335)	0.002849 / 0.004328 (-0.001480)	0.062852 / 0.004250 (0.058602)	0.049230 / 0.037052 (0.012178)	0.312502 / 0.258489 (0.054012)	0.354558 / 0.293841 (0.060717)	0.027497 / 0.128546 (-0.101049)	0.007885 / 0.075646 (-0.067762)	0.260232 / 0.419271 (-0.159040)	0.045459 / 0.043533 (0.001926)	0.311629 / 0.255139 (0.056490)	0.367806 / 0.283200 (0.084606)	0.020875 / 0.141683 (-0.120808)	1.423802 / 1.452155 (-0.028352)	1.497729 / 1.492716 (0.005013)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.185629 / 0.018006 (0.167623)	0.441421 / 0.000490 (0.440931)	0.004847 / 0.000200 (0.004647)	0.000074 / 0.000054 (0.000020)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022428 / 0.037411 (-0.014984)	0.073375 / 0.014526 (0.058849)	0.083194 / 0.176557 (-0.093363)	0.143984 / 0.737135 (-0.593151)	0.084128 / 0.296338 (-0.212211)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.397220 / 0.215209 (0.182010)	3.954394 / 2.077655 (1.876740)	1.920638 / 1.504120 (0.416518)	1.744284 / 1.541195 (0.203089)	1.802623 / 1.468490 (0.334133)	0.501988 / 4.584777 (-4.082789)	3.096071 / 3.745712 (-0.649642)	4.648267 / 5.269862 (-0.621595)	2.770995 / 4.565676 (-1.794682)	0.057513 / 0.424275 (-0.366762)	0.006315 / 0.007607 (-0.001292)	0.467683 / 0.226044 (0.241639)	4.683959 / 2.268929 (2.415031)	2.384980 / 55.444624 (-53.059645)	2.030894 / 6.876477 (-4.845583)	2.148374 / 2.142072 (0.006302)	0.585142 / 4.805227 (-4.220085)	0.123173 / 6.500664 (-6.377491)	0.059140 / 0.075469 (-0.016329)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.244707 / 1.841788 (-0.597080)	18.176043 / 8.074308 (10.101735)	13.742770 / 10.191392 (3.551378)	0.149692 / 0.680424 (-0.530732)	0.016591 / 0.534201 (-0.517610)	0.342138 / 0.579283 (-0.237145)	0.353931 / 0.434364 (-0.080433)	0.392317 / 0.540337 (-0.148020)	0.524011 / 1.386936 (-0.862925)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005937 / 0.011353 (-0.005416)	0.003609 / 0.011008 (-0.007399)	0.061729 / 0.038508 (0.023221)	0.057844 / 0.023109 (0.034735)	0.418051 / 0.275898 (0.142153)	0.453014 / 0.323480 (0.129534)	0.004530 / 0.007986 (-0.003456)	0.002861 / 0.004328 (-0.001468)	0.062236 / 0.004250 (0.057986)	0.048612 / 0.037052 (0.011560)	0.418487 / 0.258489 (0.159998)	0.455114 / 0.293841 (0.161273)	0.027419 / 0.128546 (-0.101127)	0.007919 / 0.075646 (-0.067728)	0.066940 / 0.419271 (-0.352331)	0.041816 / 0.043533 (-0.001717)	0.419788 / 0.255139 (0.164649)	0.439682 / 0.283200 (0.156483)	0.020902 / 0.141683 (-0.120781)	1.473993 / 1.452155 (0.021838)	1.532438 / 1.492716 (0.039722)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.228766 / 0.018006 (0.210760)	0.412189 / 0.000490 (0.411699)	0.000371 / 0.000200 (0.000171)	0.000054 / 0.000054 (-0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026139 / 0.037411 (-0.011272)	0.076626 / 0.014526 (0.062100)	0.088262 / 0.176557 (-0.088295)	0.143096 / 0.737135 (-0.594039)	0.089642 / 0.296338 (-0.206696)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.423030 / 0.215209 (0.207821)	4.218333 / 2.077655 (2.140679)	2.280943 / 1.504120 (0.776823)	2.051746 / 1.541195 (0.510551)	2.101085 / 1.468490 (0.632595)	0.495860 / 4.584777 (-4.088917)	3.108065 / 3.745712 (-0.637647)	2.944188 / 5.269862 (-2.325673)	1.833693 / 4.565676 (-2.731984)	0.057509 / 0.424275 (-0.366766)	0.006406 / 0.007607 (-0.001201)	0.497208 / 0.226044 (0.271164)	4.974972 / 2.268929 (2.706044)	2.786639 / 55.444624 (-52.657985)	2.423815 / 6.876477 (-4.452662)	2.446377 / 2.142072 (0.304305)	0.584521 / 4.805227 (-4.220706)	0.124129 / 6.500664 (-6.376535)	0.061373 / 0.075469 (-0.014096)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.307076 / 1.841788 (-0.534711)	18.443873 / 8.074308 (10.369565)	13.835730 / 10.191392 (3.644338)	0.159795 / 0.680424 (-0.520629)	0.016643 / 0.534201 (-0.517558)	0.334300 / 0.579283 (-0.244983)	0.347136 / 0.434364 (-0.087228)	0.394633 / 0.540337 (-0.145704)	0.552445 / 1.386936 (-0.834491)

github-actions · 2023-07-10T12:33:58Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007273 / 0.011353 (-0.004080)	0.004704 / 0.011008 (-0.006304)	0.105857 / 0.038508 (0.067349)	0.062493 / 0.023109 (0.039384)	0.325704 / 0.275898 (0.049806)	0.355795 / 0.323480 (0.032315)	0.005552 / 0.007986 (-0.002433)	0.003543 / 0.004328 (-0.000785)	0.068098 / 0.004250 (0.063848)	0.049563 / 0.037052 (0.012511)	0.362956 / 0.258489 (0.104467)	0.376047 / 0.293841 (0.082206)	0.039272 / 0.128546 (-0.089275)	0.011521 / 0.075646 (-0.064125)	0.291899 / 0.419271 (-0.127373)	0.056916 / 0.043533 (0.013383)	0.365352 / 0.255139 (0.110213)	0.357251 / 0.283200 (0.074051)	0.031670 / 0.141683 (-0.110013)	1.533294 / 1.452155 (0.081140)	1.566580 / 1.492716 (0.073864)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219812 / 0.018006 (0.201805)	0.499808 / 0.000490 (0.499318)	0.000343 / 0.000200 (0.000143)	0.000066 / 0.000054 (0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024011 / 0.037411 (-0.013400)	0.079686 / 0.014526 (0.065161)	0.087925 / 0.176557 (-0.088631)	0.149065 / 0.737135 (-0.588071)	0.088514 / 0.296338 (-0.207824)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.495003 / 0.215209 (0.279794)	5.106371 / 2.077655 (3.028717)	2.285497 / 1.504120 (0.781377)	2.056052 / 1.541195 (0.514858)	2.024913 / 1.468490 (0.556423)	0.726048 / 4.584777 (-3.858729)	4.873945 / 3.745712 (1.128233)	7.488671 / 5.269862 (2.218809)	4.361208 / 4.565676 (-0.204469)	0.089014 / 0.424275 (-0.335261)	0.007178 / 0.007607 (-0.000429)	0.633669 / 0.226044 (0.407625)	6.328154 / 2.268929 (4.059226)	3.071598 / 55.444624 (-52.373026)	2.416077 / 6.876477 (-4.460399)	2.431033 / 2.142072 (0.288961)	0.918167 / 4.805227 (-3.887060)	0.193829 / 6.500664 (-6.306836)	0.073446 / 0.075469 (-0.002023)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.344994 / 1.841788 (-0.496793)	19.911699 / 8.074308 (11.837391)	17.182697 / 10.191392 (6.991305)	0.216932 / 0.680424 (-0.463492)	0.025415 / 0.534201 (-0.508786)	0.416806 / 0.579283 (-0.162477)	0.524934 / 0.434364 (0.090570)	0.510783 / 0.540337 (-0.029554)	0.687856 / 1.386936 (-0.699081)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008469 / 0.011353 (-0.002884)	0.003797 / 0.011008 (-0.007211)	0.067276 / 0.038508 (0.028768)	0.066825 / 0.023109 (0.043716)	0.394976 / 0.275898 (0.119078)	0.432563 / 0.323480 (0.109083)	0.006003 / 0.007986 (-0.001982)	0.003399 / 0.004328 (-0.000930)	0.070899 / 0.004250 (0.066649)	0.050940 / 0.037052 (0.013887)	0.378291 / 0.258489 (0.119802)	0.429889 / 0.293841 (0.136048)	0.043245 / 0.128546 (-0.085302)	0.012182 / 0.075646 (-0.063465)	0.074560 / 0.419271 (-0.344711)	0.065290 / 0.043533 (0.021757)	0.371209 / 0.255139 (0.116070)	0.389731 / 0.283200 (0.106532)	0.045729 / 0.141683 (-0.095954)	1.451785 / 1.452155 (-0.000370)	1.598539 / 1.492716 (0.105822)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.261357 / 0.018006 (0.243351)	0.520142 / 0.000490 (0.519653)	0.008305 / 0.000200 (0.008105)	0.000089 / 0.000054 (0.000034)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026492 / 0.037411 (-0.010919)	0.082430 / 0.014526 (0.067904)	0.095979 / 0.176557 (-0.080578)	0.151752 / 0.737135 (-0.585383)	0.090086 / 0.296338 (-0.206252)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.535967 / 0.215209 (0.320758)	5.228605 / 2.077655 (3.150950)	2.395078 / 1.504120 (0.890959)	2.185500 / 1.541195 (0.644306)	2.219456 / 1.468490 (0.750966)	0.764794 / 4.584777 (-3.819983)	4.796617 / 3.745712 (1.050905)	4.143450 / 5.269862 (-1.126411)	2.527391 / 4.565676 (-2.038286)	0.081418 / 0.424275 (-0.342857)	0.007170 / 0.007607 (-0.000437)	0.706071 / 0.226044 (0.480026)	6.501060 / 2.268929 (4.232131)	3.176315 / 55.444624 (-52.268309)	2.443245 / 6.876477 (-4.433232)	2.517832 / 2.142072 (0.375759)	0.916254 / 4.805227 (-3.888973)	0.184282 / 6.500664 (-6.316382)	0.062613 / 0.075469 (-0.012857)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.444283 / 1.841788 (-0.397504)	20.227311 / 8.074308 (12.153003)	17.512856 / 10.191392 (7.321464)	0.219556 / 0.680424 (-0.460868)	0.024705 / 0.534201 (-0.509496)	0.423215 / 0.579283 (-0.156068)	0.513103 / 0.434364 (0.078739)	0.473853 / 0.540337 (-0.066485)	0.738165 / 1.386936 (-0.648771)

mariosasko added 2 commits May 12, 2023 13:54

Deprecate Task API

6afcd33

Typo

857ec44

mariosasko requested review from albertvillanova, polinaeterna and lhoestq May 15, 2023 16:48

Merge branch 'main' into deprecate-task-api

4f54f2f

lhoestq reviewed Jul 7, 2023

docs/source/package_reference/task_templates.mdx (outdated)

lhoestq approved these changes Jul 7, 2023

mariosasko added 2 commits July 7, 2023 19:08

Update task_templates.mdx

aa231a7

Update task_templates.mdx

8cfc026

mariosasko merged commit b65660b into main Jul 10, 2023
13 checks passed

mariosasko deleted the deprecate-task-api branch July 10, 2023 12:24
