Make tiktoken tokenizers hashable #5552

mariosasko · 2023-02-20T16:50:09Z

Fix for https://discord.com/channels/879548962464493619/1075729627546406912/1075729627546406912
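
For context, `datasets` fingerprints the function passed to `.map()` by serializing it, so any object the function references that cannot be serialized breaks hashing and caching. The snippet below is a minimal, standard-library-only sketch of that general idea, not the code from this PR. `FakeEncoding`, `_reduce_fake_encoding`, and `fingerprint` are hypothetical names, and `copyreg` with `hashlib` stand in for whatever mechanism the library actually uses.

```python
import copyreg
import hashlib
import pickle
import threading


class FakeEncoding:
    """Hypothetical stand-in for a tokenizer that wraps an unpicklable handle."""

    def __init__(self, name, vocab):
        self.name = name
        self.vocab = dict(vocab)
        self._handle = threading.Lock()  # lock objects cannot be pickled

    def encode(self, text):
        return [self.vocab[token] for token in text.split()]


def _reduce_fake_encoding(enc):
    # Describe the object only by the arguments needed to rebuild it,
    # leaving out the unpicklable handle. Sorting makes the result
    # independent of dict insertion order.
    return FakeEncoding, (enc.name, sorted(enc.vocab.items()))


def fingerprint(obj):
    # Assumption for this sketch: a fingerprint is a hash of the pickled bytes.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


enc = FakeEncoding("demo", {"hello": 0, "world": 1})

# Without a reducer, pickling fails because of the lock attribute.
try:
    fingerprint(enc)
    failed_before = False
except TypeError:
    failed_before = True

# Register the reducer; pickle consults copyreg's dispatch table by type.
copyreg.pickle(FakeEncoding, _reduce_fake_encoding)

fp = fingerprint(enc)
print(failed_before, len(fp))
```

Reducing the object to its constructor arguments means two tokenizers built from the same data get the same fingerprint, which is the property a cache key needs.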

HuggingFaceDocBuilderDev · 2023-02-20T16:55:44Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-02-20T16:59:38Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.011635 / 0.011353 (0.000282)	0.005446 / 0.011008 (-0.005562)	0.111044 / 0.038508 (0.072536)	0.034243 / 0.023109 (0.011134)	0.357560 / 0.275898 (0.081662)	0.403940 / 0.323480 (0.080460)	0.008532 / 0.007986 (0.000546)	0.004327 / 0.004328 (-0.000002)	0.084659 / 0.004250 (0.080408)	0.040914 / 0.037052 (0.003861)	0.367142 / 0.258489 (0.108653)	0.381651 / 0.293841 (0.087810)	0.053865 / 0.128546 (-0.074681)	0.019060 / 0.075646 (-0.056587)	0.371994 / 0.419271 (-0.047277)	0.058417 / 0.043533 (0.014884)	0.357740 / 0.255139 (0.102601)	0.367423 / 0.283200 (0.084224)	0.104336 / 0.141683 (-0.037347)	1.632128 / 1.452155 (0.179974)	1.676216 / 1.492716 (0.183499)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.199649 / 0.018006 (0.181642)	0.490945 / 0.000490 (0.490455)	0.001598 / 0.000200 (0.001398)	0.000094 / 0.000054 (0.000039)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024541 / 0.037411 (-0.012871)	0.104713 / 0.014526 (0.090187)	0.119438 / 0.176557 (-0.057118)	0.160854 / 0.737135 (-0.576281)	0.127323 / 0.296338 (-0.169016)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.586483 / 0.215209 (0.371274)	5.771689 / 2.077655 (3.694034)	2.378962 / 1.504120 (0.874842)	1.998787 / 1.541195 (0.457592)	1.993016 / 1.468490 (0.524526)	1.199169 / 4.584777 (-3.385608)	5.281648 / 3.745712 (1.535936)	5.589235 / 5.269862 (0.319373)	2.715162 / 4.565676 (-1.850514)	0.153312 / 0.424275 (-0.270963)	0.014302 / 0.007607 (0.006695)	0.761185 / 0.226044 (0.535140)	7.602517 / 2.268929 (5.333589)	3.095271 / 55.444624 (-52.349354)	2.407394 / 6.876477 (-4.469083)	2.519074 / 2.142072 (0.377002)	1.459270 / 4.805227 (-3.345957)	0.259578 / 6.500664 (-6.241086)	0.077356 / 0.075469 (0.001887)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.502123 / 1.841788 (-0.339665)	16.254010 / 8.074308 (8.179702)	19.971713 / 10.191392 (9.780321)	0.221491 / 0.680424 (-0.458933)	0.043959 / 0.534201 (-0.490242)	0.512566 / 0.579283 (-0.066717)	0.594724 / 0.434364 (0.160360)	0.573855 / 0.540337 (0.033518)	0.680503 / 1.386936 (-0.706433)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008543 / 0.011353 (-0.002810)	0.005828 / 0.011008 (-0.005180)	0.083696 / 0.038508 (0.045188)	0.036186 / 0.023109 (0.013077)	0.379777 / 0.275898 (0.103879)	0.437361 / 0.323480 (0.113881)	0.006788 / 0.007986 (-0.001197)	0.005110 / 0.004328 (0.000782)	0.106075 / 0.004250 (0.101824)	0.048770 / 0.037052 (0.011718)	0.390770 / 0.258489 (0.132281)	0.420813 / 0.293841 (0.126972)	0.050622 / 0.128546 (-0.077924)	0.019939 / 0.075646 (-0.055707)	0.106890 / 0.419271 (-0.312382)	0.070800 / 0.043533 (0.027267)	0.406094 / 0.255139 (0.150955)	0.419796 / 0.283200 (0.136597)	0.107237 / 0.141683 (-0.034446)	1.687894 / 1.452155 (0.235739)	1.735680 / 1.492716 (0.242963)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.216403 / 0.018006 (0.198397)	0.495002 / 0.000490 (0.494512)	0.004841 / 0.000200 (0.004641)	0.000117 / 0.000054 (0.000063)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.043774 / 0.037411 (0.006363)	0.119144 / 0.014526 (0.104618)	0.143694 / 0.176557 (-0.032862)	0.195548 / 0.737135 (-0.541587)	0.151426 / 0.296338 (-0.144912)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.617694 / 0.215209 (0.402485)	6.216237 / 2.077655 (4.138582)	2.578341 / 1.504120 (1.074221)	2.184868 / 1.541195 (0.643673)	2.244954 / 1.468490 (0.776464)	1.236072 / 4.584777 (-3.348705)	5.257919 / 3.745712 (1.512207)	4.634682 / 5.269862 (-0.635180)	2.722579 / 4.565676 (-1.843097)	0.131433 / 0.424275 (-0.292843)	0.012928 / 0.007607 (0.005321)	0.768315 / 0.226044 (0.542270)	7.625277 / 2.268929 (5.356349)	3.146364 / 55.444624 (-52.298260)	2.577886 / 6.876477 (-4.298590)	2.572626 / 2.142072 (0.430554)	1.468160 / 4.805227 (-3.337067)	0.252524 / 6.500664 (-6.248140)	0.083264 / 0.075469 (0.007794)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.452614 / 1.841788 (-0.389174)	15.906162 / 8.074308 (7.831853)	17.803630 / 10.191392 (7.612238)	0.210769 / 0.680424 (-0.469655)	0.024672 / 0.534201 (-0.509529)	0.486486 / 0.579283 (-0.092797)	0.545256 / 0.434364 (0.110892)	0.598736 / 0.540337 (0.058399)	0.689083 / 1.386936 (-0.697853)

lhoestq

Nice, thanks! Feel free to merge main into your branch to re-run the CI, and then merge when everything is green :)

github-actions · 2023-02-21T13:01:03Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008806 / 0.011353 (-0.002547)	0.004947 / 0.011008 (-0.006061)	0.098559 / 0.038508 (0.060051)	0.034293 / 0.023109 (0.011183)	0.311924 / 0.275898 (0.036026)	0.377501 / 0.323480 (0.054021)	0.007916 / 0.007986 (-0.000069)	0.004131 / 0.004328 (-0.000197)	0.074934 / 0.004250 (0.070684)	0.043396 / 0.037052 (0.006344)	0.344788 / 0.258489 (0.086299)	0.369943 / 0.293841 (0.076102)	0.036846 / 0.128546 (-0.091700)	0.011803 / 0.075646 (-0.063843)	0.331306 / 0.419271 (-0.087965)	0.047015 / 0.043533 (0.003483)	0.305890 / 0.255139 (0.050751)	0.332658 / 0.283200 (0.049459)	0.101134 / 0.141683 (-0.040549)	1.485615 / 1.452155 (0.033461)	1.510230 / 1.492716 (0.017514)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.274272 / 0.018006 (0.256266)	0.514739 / 0.000490 (0.514250)	0.003433 / 0.000200 (0.003234)	0.000078 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027054 / 0.037411 (-0.010357)	0.106416 / 0.014526 (0.091890)	0.118761 / 0.176557 (-0.057796)	0.156115 / 0.737135 (-0.581021)	0.123801 / 0.296338 (-0.172537)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.403121 / 0.215209 (0.187912)	4.008806 / 2.077655 (1.931151)	1.891253 / 1.504120 (0.387133)	1.698523 / 1.541195 (0.157328)	1.778533 / 1.468490 (0.310043)	0.688207 / 4.584777 (-3.896570)	3.674350 / 3.745712 (-0.071362)	1.848438 / 5.269862 (-3.421423)	1.202380 / 4.565676 (-3.363297)	0.073490 / 0.424275 (-0.350785)	0.010655 / 0.007607 (0.003048)	0.446939 / 0.226044 (0.220894)	4.478489 / 2.268929 (2.209560)	1.992281 / 55.444624 (-53.452343)	1.684077 / 6.876477 (-5.192400)	1.715435 / 2.142072 (-0.426638)	0.731454 / 4.805227 (-4.073773)	0.143679 / 6.500664 (-6.356985)	0.053415 / 0.075469 (-0.022054)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.060583 / 1.841788 (-0.781205)	13.730462 / 8.074308 (5.656153)	13.038976 / 10.191392 (2.847583)	0.144168 / 0.680424 (-0.536256)	0.025788 / 0.534201 (-0.508413)	0.393332 / 0.579283 (-0.185951)	0.409495 / 0.434364 (-0.024869)	0.523745 / 0.540337 (-0.016592)	0.601595 / 1.386936 (-0.785341)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006369 / 0.011353 (-0.004983)	0.005019 / 0.011008 (-0.005990)	0.065226 / 0.038508 (0.026718)	0.029634 / 0.023109 (0.006524)	0.302871 / 0.275898 (0.026972)	0.331055 / 0.323480 (0.007575)	0.005470 / 0.007986 (-0.002516)	0.005372 / 0.004328 (0.001043)	0.064930 / 0.004250 (0.060680)	0.046979 / 0.037052 (0.009927)	0.305633 / 0.258489 (0.047144)	0.345305 / 0.293841 (0.051464)	0.032951 / 0.128546 (-0.095596)	0.011447 / 0.075646 (-0.064199)	0.077054 / 0.419271 (-0.342218)	0.045744 / 0.043533 (0.002211)	0.303446 / 0.255139 (0.048307)	0.319837 / 0.283200 (0.036637)	0.098631 / 0.141683 (-0.043052)	1.266593 / 1.452155 (-0.185562)	1.355388 / 1.492716 (-0.137328)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.291301 / 0.018006 (0.273295)	0.537848 / 0.000490 (0.537359)	0.006697 / 0.000200 (0.006497)	0.000110 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027677 / 0.037411 (-0.009734)	0.099633 / 0.014526 (0.085107)	0.110626 / 0.176557 (-0.065931)	0.144724 / 0.737135 (-0.592412)	0.114955 / 0.296338 (-0.181383)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.375344 / 0.215209 (0.160135)	3.717490 / 2.077655 (1.639835)	1.845886 / 1.504120 (0.341766)	1.713274 / 1.541195 (0.172079)	1.761286 / 1.468490 (0.292796)	0.627924 / 4.584777 (-3.956853)	3.628154 / 3.745712 (-0.117558)	3.261851 / 5.269862 (-2.008011)	1.701008 / 4.565676 (-2.864669)	0.076703 / 0.424275 (-0.347572)	0.010839 / 0.007607 (0.003231)	0.459193 / 0.226044 (0.233148)	4.589066 / 2.268929 (2.320137)	2.193972 / 55.444624 (-53.250653)	1.892115 / 6.876477 (-4.984362)	1.892453 / 2.142072 (-0.249619)	0.745727 / 4.805227 (-4.059500)	0.150232 / 6.500664 (-6.350432)	0.057245 / 0.075469 (-0.018224)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.114657 / 1.841788 (-0.727131)	13.595215 / 8.074308 (5.520907)	12.267177 / 10.191392 (2.075785)	0.151362 / 0.680424 (-0.529061)	0.015609 / 0.534201 (-0.518591)	0.379151 / 0.579283 (-0.200132)	0.386125 / 0.434364 (-0.048238)	0.470037 / 0.540337 (-0.070301)	0.562340 / 1.386936 (-0.824596)

github-actions · 2023-02-21T13:20:42Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009847 / 0.011353 (-0.001505)	0.005609 / 0.011008 (-0.005399)	0.101951 / 0.038508 (0.063443)	0.038082 / 0.023109 (0.014972)	0.299933 / 0.275898 (0.024035)	0.377081 / 0.323480 (0.053601)	0.008900 / 0.007986 (0.000915)	0.004608 / 0.004328 (0.000279)	0.077723 / 0.004250 (0.073473)	0.048592 / 0.037052 (0.011540)	0.310789 / 0.258489 (0.052300)	0.345627 / 0.293841 (0.051787)	0.038716 / 0.128546 (-0.089830)	0.012653 / 0.075646 (-0.062993)	0.336885 / 0.419271 (-0.082387)	0.048715 / 0.043533 (0.005182)	0.295336 / 0.255139 (0.040197)	0.316735 / 0.283200 (0.033536)	0.115142 / 0.141683 (-0.026541)	1.480332 / 1.452155 (0.028177)	1.604972 / 1.492716 (0.112256)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.299516 / 0.018006 (0.281510)	0.525892 / 0.000490 (0.525402)	0.002246 / 0.000200 (0.002046)	0.000095 / 0.000054 (0.000040)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031547 / 0.037411 (-0.005864)	0.120611 / 0.014526 (0.106085)	0.124516 / 0.176557 (-0.052041)	0.166036 / 0.737135 (-0.571100)	0.131689 / 0.296338 (-0.164650)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.400728 / 0.215209 (0.185519)	4.007027 / 2.077655 (1.929372)	1.793922 / 1.504120 (0.289803)	1.596709 / 1.541195 (0.055514)	1.752130 / 1.468490 (0.283640)	0.717464 / 4.584777 (-3.867313)	3.798844 / 3.745712 (0.053132)	3.685088 / 5.269862 (-1.584774)	1.914041 / 4.565676 (-2.651636)	0.086181 / 0.424275 (-0.338094)	0.012753 / 0.007607 (0.005146)	0.507984 / 0.226044 (0.281940)	5.086255 / 2.268929 (2.817326)	2.280650 / 55.444624 (-53.163974)	1.929294 / 6.876477 (-4.947183)	2.057884 / 2.142072 (-0.084188)	0.852863 / 4.805227 (-3.952364)	0.165497 / 6.500664 (-6.335168)	0.063356 / 0.075469 (-0.012113)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.212593 / 1.841788 (-0.629194)	16.270507 / 8.074308 (8.196199)	15.708406 / 10.191392 (5.517014)	0.162346 / 0.680424 (-0.518078)	0.029702 / 0.534201 (-0.504499)	0.447685 / 0.579283 (-0.131598)	0.449361 / 0.434364 (0.014997)	0.530536 / 0.540337 (-0.009801)	0.613439 / 1.386936 (-0.773497)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007741 / 0.011353 (-0.003612)	0.005752 / 0.011008 (-0.005256)	0.076600 / 0.038508 (0.038092)	0.034841 / 0.023109 (0.011732)	0.345106 / 0.275898 (0.069208)	0.385685 / 0.323480 (0.062205)	0.006466 / 0.007986 (-0.001519)	0.005806 / 0.004328 (0.001478)	0.075110 / 0.004250 (0.070860)	0.052936 / 0.037052 (0.015883)	0.343576 / 0.258489 (0.085087)	0.408749 / 0.293841 (0.114908)	0.037345 / 0.128546 (-0.091201)	0.012807 / 0.075646 (-0.062839)	0.087732 / 0.419271 (-0.331540)	0.050218 / 0.043533 (0.006685)	0.338963 / 0.255139 (0.083824)	0.361629 / 0.283200 (0.078429)	0.107488 / 0.141683 (-0.034195)	1.465284 / 1.452155 (0.013130)	1.562218 / 1.492716 (0.069502)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.322496 / 0.018006 (0.304489)	0.522782 / 0.000490 (0.522292)	0.006680 / 0.000200 (0.006480)	0.000144 / 0.000054 (0.000090)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031801 / 0.037411 (-0.005611)	0.116839 / 0.014526 (0.102313)	0.127552 / 0.176557 (-0.049005)	0.167670 / 0.737135 (-0.569465)	0.134170 / 0.296338 (-0.162168)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425449 / 0.215209 (0.210240)	4.229367 / 2.077655 (2.151713)	2.014663 / 1.504120 (0.510543)	1.812981 / 1.541195 (0.271787)	1.964039 / 1.468490 (0.495549)	0.703454 / 4.584777 (-3.881323)	3.786985 / 3.745712 (0.041273)	2.262377 / 5.269862 (-3.007485)	1.404868 / 4.565676 (-3.160808)	0.086234 / 0.424275 (-0.338041)	0.012616 / 0.007607 (0.005009)	0.525784 / 0.226044 (0.299739)	5.268295 / 2.268929 (2.999366)	2.496674 / 55.444624 (-52.947950)	2.177773 / 6.876477 (-4.698704)	2.313677 / 2.142072 (0.171605)	0.846202 / 4.805227 (-3.959026)	0.170152 / 6.500664 (-6.330513)	0.066772 / 0.075469 (-0.008698)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.254719 / 1.841788 (-0.587069)	16.017627 / 8.074308 (7.943319)	14.560583 / 10.191392 (4.369191)	0.168275 / 0.680424 (-0.512149)	0.017935 / 0.534201 (-0.516266)	0.430806 / 0.579283 (-0.148477)	0.428737 / 0.434364 (-0.005626)	0.532001 / 0.540337 (-0.008336)	0.633680 / 1.386936 (-0.753256)

mariosasko added 3 commits February 20, 2023 17:35

Make tiktoken tokenizers hashable

1f8f0f5

Fix for direction creation

a69a87f

Missing comma

189a870

Merge branch 'main' of github.com:huggingface/datasets into hashable-…

526578c

…tiktoken

lhoestq approved these changes Feb 21, 2023

mariosasko merged commit c2c75df into main Feb 21, 2023

mariosasko deleted the hashable-tiktoken branch February 21, 2023 13:13

AJDERS pushed a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Make tiktoken tokenizers hashable (huggingface#5552)

a484696

* Make tiktoken tokenizers hashable * Fix for direction creation * Missing comma

AJDERS added a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Revert "Make tiktoken tokenizers hashable (huggingface#5552)"

bf26600

This reverts commit a484696.

lhoestq mentioned this pull request Feb 22, 2023

Failure to hash function when using .map() #5536

Closed

jklj077 mentioned this pull request Aug 4, 2023

Does tiktoken not support multi-threaded tokenization? QwenLM/Qwen#36

Closed
