[docs] Custom decoding transforms #5836

stevhliu · 2023-05-09T21:21:41Z

Adds custom decoding transform solution to the docs to fix #5782.
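
For context, a decoding transform is a function applied to a batch at access time, so the stored data stays in its raw form and is decoded only when a row is read. The sketch below illustrates that idea with the standard library only. `LazyDataset`, `decode_wav`, and `make_wav_bytes` are hypothetical names made up for illustration; they are not part of `datasets` and not the snippet added to `process.mdx`. In `datasets`, the analogous hook is `Dataset.set_transform`.

```python
import io
import struct
import wave


def make_wav_bytes(samples, sampling_rate=8000):
    """Encode 16-bit mono PCM samples as an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


def decode_wav(batch):
    """Hypothetical decoding transform: raw WAV bytes -> samples and rate."""
    decoded = []
    for raw in batch["audio"]:
        with wave.open(io.BytesIO(raw), "rb") as wav_file:
            frames = wav_file.readframes(wav_file.getnframes())
            count = len(frames) // 2  # two bytes per 16-bit sample
            decoded.append(
                {
                    "array": list(struct.unpack(f"<{count}h", frames)),
                    "sampling_rate": wav_file.getframerate(),
                }
            )
    return {"audio": decoded}


class LazyDataset:
    """Hypothetical stand-in for a dataset with an on-access transform."""

    def __init__(self, columns):
        self.columns = columns
        self.transform = None

    def set_transform(self, transform):
        # Store the function; nothing is decoded until a row is read.
        self.transform = transform

    def __getitem__(self, index):
        # Build a one-row batch, apply the transform, then unwrap the row.
        batch = {name: [values[index]] for name, values in self.columns.items()}
        if self.transform is not None:
            batch = self.transform(batch)
        return {name: values[0] for name, values in batch.items()}


dataset = LazyDataset({"audio": [make_wav_bytes([0, 1000, -1000])]})
dataset.set_transform(decode_wav)
example = dataset[0]
print(example["audio"]["sampling_rate"], example["audio"]["array"])
```

The transform takes and returns a dictionary of columns, which keeps the decoding logic independent of how rows are stored; swapping in a different decoder only means passing a different function.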

HuggingFaceDocBuilderDev · 2023-05-09T21:27:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq

Nice addition :) thanks

mariosasko

Thanks!

docs/source/process.mdx

mariosasko

LGTM :)

docs/source/process.mdx

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

mariosasko · 2023-05-10T20:03:36Z

The error seems unrelated to the changes, so feel free to merge.

github-actions · 2023-05-10T20:30:18Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006562 / 0.011353 (-0.004791)	0.004568 / 0.011008 (-0.006440)	0.098151 / 0.038508 (0.059643)	0.028117 / 0.023109 (0.005008)	0.305442 / 0.275898 (0.029544)	0.338288 / 0.323480 (0.014808)	0.005012 / 0.007986 (-0.002973)	0.003415 / 0.004328 (-0.000913)	0.075022 / 0.004250 (0.070771)	0.036869 / 0.037052 (-0.000183)	0.301427 / 0.258489 (0.042937)	0.348485 / 0.293841 (0.054644)	0.030761 / 0.128546 (-0.097785)	0.011461 / 0.075646 (-0.064185)	0.321987 / 0.419271 (-0.097285)	0.042885 / 0.043533 (-0.000648)	0.300691 / 0.255139 (0.045552)	0.333208 / 0.283200 (0.050008)	0.090203 / 0.141683 (-0.051480)	1.459744 / 1.452155 (0.007590)	1.522960 / 1.492716 (0.030243)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.213219 / 0.018006 (0.195213)	0.408118 / 0.000490 (0.407629)	0.003716 / 0.000200 (0.003516)	0.000077 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023060 / 0.037411 (-0.014351)	0.097423 / 0.014526 (0.082897)	0.103988 / 0.176557 (-0.072568)	0.162793 / 0.737135 (-0.574343)	0.108282 / 0.296338 (-0.188056)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.431628 / 0.215209 (0.216419)	4.300881 / 2.077655 (2.223226)	2.058853 / 1.504120 (0.554733)	1.897910 / 1.541195 (0.356715)	1.991723 / 1.468490 (0.523233)	0.699686 / 4.584777 (-3.885091)	3.395004 / 3.745712 (-0.350708)	1.841613 / 5.269862 (-3.428248)	1.152347 / 4.565676 (-3.413330)	0.082517 / 0.424275 (-0.341758)	0.012323 / 0.007607 (0.004715)	0.535812 / 0.226044 (0.309767)	5.374103 / 2.268929 (3.105174)	2.429662 / 55.444624 (-53.014962)	2.097199 / 6.876477 (-4.779277)	2.172625 / 2.142072 (0.030552)	0.810156 / 4.805227 (-3.995071)	0.151629 / 6.500664 (-6.349035)	0.066528 / 0.075469 (-0.008941)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.220667 / 1.841788 (-0.621121)	13.696976 / 8.074308 (5.622668)	14.042916 / 10.191392 (3.851524)	0.129626 / 0.680424 (-0.550798)	0.016593 / 0.534201 (-0.517607)	0.383747 / 0.579283 (-0.195536)	0.386872 / 0.434364 (-0.047492)	0.456524 / 0.540337 (-0.083813)	0.545033 / 1.386936 (-0.841903)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006361 / 0.011353 (-0.004992)	0.004516 / 0.011008 (-0.006493)	0.077155 / 0.038508 (0.038647)	0.027239 / 0.023109 (0.004130)	0.359892 / 0.275898 (0.083994)	0.391994 / 0.323480 (0.068514)	0.004950 / 0.007986 (-0.003036)	0.003379 / 0.004328 (-0.000949)	0.077057 / 0.004250 (0.072806)	0.039562 / 0.037052 (0.002509)	0.364244 / 0.258489 (0.105755)	0.416033 / 0.293841 (0.122192)	0.031049 / 0.128546 (-0.097497)	0.011479 / 0.075646 (-0.064167)	0.086479 / 0.419271 (-0.332793)	0.039381 / 0.043533 (-0.004151)	0.372143 / 0.255139 (0.117004)	0.388569 / 0.283200 (0.105369)	0.090954 / 0.141683 (-0.050728)	1.540957 / 1.452155 (0.088802)	1.596841 / 1.492716 (0.104125)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.221130 / 0.018006 (0.203123)	0.403728 / 0.000490 (0.403238)	0.003172 / 0.000200 (0.002972)	0.000078 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024963 / 0.037411 (-0.012449)	0.101065 / 0.014526 (0.086539)	0.110846 / 0.176557 (-0.065710)	0.158578 / 0.737135 (-0.578557)	0.112235 / 0.296338 (-0.184104)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.457320 / 0.215209 (0.242111)	4.548094 / 2.077655 (2.470439)	2.175376 / 1.504120 (0.671256)	1.964755 / 1.541195 (0.423561)	2.008128 / 1.468490 (0.539638)	0.702448 / 4.584777 (-3.882329)	3.437595 / 3.745712 (-0.308117)	3.009871 / 5.269862 (-2.259990)	1.558181 / 4.565676 (-3.007496)	0.082568 / 0.424275 (-0.341707)	0.012371 / 0.007607 (0.004764)	0.550688 / 0.226044 (0.324644)	5.534210 / 2.268929 (3.265282)	2.649605 / 55.444624 (-52.795020)	2.317293 / 6.876477 (-4.559184)	2.351525 / 2.142072 (0.209453)	0.808971 / 4.805227 (-3.996256)	0.152737 / 6.500664 (-6.347927)	0.068416 / 0.075469 (-0.007053)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.340219 / 1.841788 (-0.501569)	13.903388 / 8.074308 (5.829080)	13.063477 / 10.191392 (2.872085)	0.130216 / 0.680424 (-0.550208)	0.016522 / 0.534201 (-0.517679)	0.398946 / 0.579283 (-0.180337)	0.382450 / 0.434364 (-0.051914)	0.491007 / 0.540337 (-0.049330)	0.577747 / 1.386936 (-0.809189)

github-actions · 2023-05-15T07:36:12Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007812 / 0.011353 (-0.003541)	0.005563 / 0.011008 (-0.005446)	0.099372 / 0.038508 (0.060864)	0.035629 / 0.023109 (0.012520)	0.301457 / 0.275898 (0.025559)	0.339136 / 0.323480 (0.015656)	0.006152 / 0.007986 (-0.001834)	0.005843 / 0.004328 (0.001515)	0.075280 / 0.004250 (0.071030)	0.052789 / 0.037052 (0.015736)	0.301805 / 0.258489 (0.043316)	0.347918 / 0.293841 (0.054078)	0.036182 / 0.128546 (-0.092364)	0.012655 / 0.075646 (-0.062991)	0.334428 / 0.419271 (-0.084844)	0.062746 / 0.043533 (0.019213)	0.296932 / 0.255139 (0.041793)	0.314115 / 0.283200 (0.030916)	0.121291 / 0.141683 (-0.020392)	1.453252 / 1.452155 (0.001097)	1.564714 / 1.492716 (0.071997)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.243810 / 0.018006 (0.225804)	0.547129 / 0.000490 (0.546640)	0.004666 / 0.000200 (0.004466)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028214 / 0.037411 (-0.009197)	0.108878 / 0.014526 (0.094352)	0.122313 / 0.176557 (-0.054243)	0.182412 / 0.737135 (-0.554723)	0.127014 / 0.296338 (-0.169324)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.423946 / 0.215209 (0.208737)	4.207112 / 2.077655 (2.129457)	2.048658 / 1.504120 (0.544538)	1.843593 / 1.541195 (0.302398)	1.952426 / 1.468490 (0.483936)	0.712098 / 4.584777 (-3.872679)	3.824971 / 3.745712 (0.079258)	3.507141 / 5.269862 (-1.762721)	1.868866 / 4.565676 (-2.696810)	0.087895 / 0.424275 (-0.336380)	0.012783 / 0.007607 (0.005176)	0.524087 / 0.226044 (0.298042)	5.246498 / 2.268929 (2.977570)	2.495944 / 55.444624 (-52.948680)	2.126779 / 6.876477 (-4.749698)	2.315545 / 2.142072 (0.173472)	0.859546 / 4.805227 (-3.945681)	0.173457 / 6.500664 (-6.327208)	0.067483 / 0.075469 (-0.007986)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.173851 / 1.841788 (-0.667937)	15.091913 / 8.074308 (7.017605)	14.640035 / 10.191392 (4.448643)	0.168498 / 0.680424 (-0.511926)	0.017513 / 0.534201 (-0.516688)	0.425770 / 0.579283 (-0.153513)	0.434248 / 0.434364 (-0.000116)	0.504204 / 0.540337 (-0.036134)	0.616885 / 1.386936 (-0.770051)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007775 / 0.011353 (-0.003578)	0.005153 / 0.011008 (-0.005855)	0.075461 / 0.038508 (0.036953)	0.034994 / 0.023109 (0.011885)	0.372389 / 0.275898 (0.096491)	0.397911 / 0.323480 (0.074431)	0.006572 / 0.007986 (-0.001413)	0.005549 / 0.004328 (0.001220)	0.075101 / 0.004250 (0.070851)	0.054014 / 0.037052 (0.016962)	0.368964 / 0.258489 (0.110475)	0.425353 / 0.293841 (0.131512)	0.035546 / 0.128546 (-0.093001)	0.012707 / 0.075646 (-0.062939)	0.087418 / 0.419271 (-0.331853)	0.046425 / 0.043533 (0.002893)	0.363982 / 0.255139 (0.108843)	0.376421 / 0.283200 (0.093221)	0.105369 / 0.141683 (-0.036314)	1.494408 / 1.452155 (0.042253)	1.596783 / 1.492716 (0.104067)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.258780 / 0.018006 (0.240773)	0.533373 / 0.000490 (0.532883)	0.000432 / 0.000200 (0.000232)	0.000058 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030687 / 0.037411 (-0.006725)	0.110231 / 0.014526 (0.095705)	0.123738 / 0.176557 (-0.052819)	0.171999 / 0.737135 (-0.565137)	0.127673 / 0.296338 (-0.168665)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.448058 / 0.215209 (0.232849)	4.459381 / 2.077655 (2.381726)	2.234020 / 1.504120 (0.729900)	2.038616 / 1.541195 (0.497421)	2.123795 / 1.468490 (0.655305)	0.702664 / 4.584777 (-3.882113)	3.837133 / 3.745712 (0.091420)	2.138574 / 5.269862 (-3.131287)	1.375955 / 4.565676 (-3.189722)	0.086996 / 0.424275 (-0.337280)	0.012461 / 0.007607 (0.004854)	0.557978 / 0.226044 (0.331934)	5.648613 / 2.268929 (3.379685)	2.777829 / 55.444624 (-52.666796)	2.392424 / 6.876477 (-4.484052)	2.482823 / 2.142072 (0.340750)	0.851891 / 4.805227 (-3.953336)	0.171335 / 6.500664 (-6.329329)	0.065041 / 0.075469 (-0.010428)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.319697 / 1.841788 (-0.522091)	15.748688 / 8.074308 (7.674380)	13.397042 / 10.191392 (3.205650)	0.166424 / 0.680424 (-0.514000)	0.017755 / 0.534201 (-0.516446)	0.424989 / 0.579283 (-0.154294)	0.424705 / 0.434364 (-0.009659)	0.494190 / 0.540337 (-0.046147)	0.588315 / 1.386936 (-0.798622)

add custom decoding transforms

9d0004c

stevhliu requested a review from mariosasko May 9, 2023 21:28

lhoestq approved these changes May 10, 2023

mariosasko reviewed May 10, 2023

docs/source/process.mdx (outdated, resolved)

docs/source/process.mdx (outdated, resolved)

apply feedback

50d2e64

mariosasko approved these changes May 10, 2023

docs/source/process.mdx (outdated, resolved)

Update docs/source/process.mdx

4439334

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

stevhliu merged commit 15c37ed into huggingface:main May 10, 2023
10 of 13 checks passed

stevhliu deleted the custom-decoding-transform branch May 10, 2023 20:23
