No gzip encoding from github #6076

lhoestq · 2023-07-26T12:46:07Z

Don't accept gzip encoding from GitHub, otherwise some files are not streamable and seekable.

fix https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans/discussions/2#64c0e0c1a04a514ba6303e84

and making sure #2918 works as well
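A minimal sketch of the idea, not the actual change in this PR: ask GitHub for an uncompressed body by sending `Accept-Encoding: identity`. The helper name `build_request_headers` and the host list `GITHUB_HOSTS` are hypothetical and only for illustration.

```python
from urllib.parse import urlparse

# Assumed host list for illustration; the real check in the library may differ.
GITHUB_HOSTS = {"github.com", "raw.githubusercontent.com"}


def build_request_headers(url, headers=None):
    """Return a copy of `headers` that asks GitHub hosts for an unencoded body."""
    merged = dict(headers or {})
    host = urlparse(url).hostname or ""
    if host in GITHUB_HOSTS:
        # "identity" tells the server not to apply a content encoding such as
        # gzip, so the bytes received are the bytes of the file itself.
        merged["Accept-Encoding"] = "identity"
    return merged


github_headers = build_request_headers(
    "https://raw.githubusercontent.com/user/repo/main/data.txt",
    {"User-Agent": "example"},
)
other_headers = build_request_headers("https://example.com/data.txt")
```

The returned dictionary could then be passed as the `headers` argument of whichever HTTP client performs the download; URLs on other hosts are left untouched.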

HuggingFaceDocBuilderDev · 2023-07-26T12:51:44Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-07-26T12:55:54Z


PyArrow==8.0.0


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008191 / 0.011353 (-0.003162)	0.004669 / 0.011008 (-0.006339)	0.101315 / 0.038508 (0.062807)	0.090235 / 0.023109 (0.067126)	0.381265 / 0.275898 (0.105367)	0.418266 / 0.323480 (0.094786)	0.006292 / 0.007986 (-0.001693)	0.003979 / 0.004328 (-0.000349)	0.075946 / 0.004250 (0.071696)	0.070678 / 0.037052 (0.033625)	0.378006 / 0.258489 (0.119517)	0.425825 / 0.293841 (0.131984)	0.036325 / 0.128546 (-0.092221)	0.009814 / 0.075646 (-0.065832)	0.345687 / 0.419271 (-0.073584)	0.063846 / 0.043533 (0.020313)	0.386003 / 0.255139 (0.130864)	0.400875 / 0.283200 (0.117675)	0.027806 / 0.141683 (-0.113877)	1.814810 / 1.452155 (0.362655)	1.879897 / 1.492716 (0.387180)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.218684 / 0.018006 (0.200677)	0.501715 / 0.000490 (0.501225)	0.004808 / 0.000200 (0.004608)	0.000093 / 0.000054 (0.000039)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035494 / 0.037411 (-0.001917)	0.100949 / 0.014526 (0.086423)	0.114639 / 0.176557 (-0.061917)	0.188908 / 0.737135 (-0.548227)	0.115794 / 0.296338 (-0.180545)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.462537 / 0.215209 (0.247328)	4.612469 / 2.077655 (2.534814)	2.298065 / 1.504120 (0.793945)	2.088738 / 1.541195 (0.547543)	2.188072 / 1.468490 (0.719582)	0.565412 / 4.584777 (-4.019364)	4.180394 / 3.745712 (0.434681)	3.848696 / 5.269862 (-1.421165)	2.391381 / 4.565676 (-2.174296)	0.067647 / 0.424275 (-0.356628)	0.008847 / 0.007607 (0.001240)	0.553288 / 0.226044 (0.327243)	5.517962 / 2.268929 (3.249033)	2.866622 / 55.444624 (-52.578002)	2.439025 / 6.876477 (-4.437452)	2.740156 / 2.142072 (0.598084)	0.694796 / 4.805227 (-4.110431)	0.159022 / 6.500664 (-6.341642)	0.074471 / 0.075469 (-0.000998)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.534979 / 1.841788 (-0.306808)	23.297273 / 8.074308 (15.222965)	16.859178 / 10.191392 (6.667786)	0.207594 / 0.680424 (-0.472830)	0.021990 / 0.534201 (-0.512211)	0.472059 / 0.579283 (-0.107224)	0.497632 / 0.434364 (0.063268)	0.565672 / 0.540337 (0.025335)	0.772485 / 1.386936 (-0.614451)

PyArrow==latest


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007777 / 0.011353 (-0.003576)	0.004679 / 0.011008 (-0.006329)	0.077317 / 0.038508 (0.038809)	0.087433 / 0.023109 (0.064324)	0.437389 / 0.275898 (0.161491)	0.479562 / 0.323480 (0.156082)	0.006137 / 0.007986 (-0.001849)	0.003938 / 0.004328 (-0.000390)	0.074769 / 0.004250 (0.070518)	0.066605 / 0.037052 (0.029553)	0.454865 / 0.258489 (0.196376)	0.485103 / 0.293841 (0.191262)	0.036540 / 0.128546 (-0.092006)	0.009983 / 0.075646 (-0.065664)	0.083566 / 0.419271 (-0.335706)	0.059527 / 0.043533 (0.015994)	0.449154 / 0.255139 (0.194015)	0.462542 / 0.283200 (0.179342)	0.027581 / 0.141683 (-0.114102)	1.776720 / 1.452155 (0.324565)	1.847920 / 1.492716 (0.355204)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.246792 / 0.018006 (0.228786)	0.494513 / 0.000490 (0.494024)	0.004376 / 0.000200 (0.004176)	0.000115 / 0.000054 (0.000061)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037837 / 0.037411 (0.000426)	0.112752 / 0.014526 (0.098226)	0.121742 / 0.176557 (-0.054815)	0.189365 / 0.737135 (-0.547770)	0.124366 / 0.296338 (-0.171973)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.492890 / 0.215209 (0.277681)	4.920270 / 2.077655 (2.842615)	2.565350 / 1.504120 (1.061230)	2.378679 / 1.541195 (0.837484)	2.483794 / 1.468490 (1.015304)	0.579623 / 4.584777 (-4.005154)	4.195924 / 3.745712 (0.450212)	3.903382 / 5.269862 (-1.366479)	2.466884 / 4.565676 (-2.098793)	0.064145 / 0.424275 (-0.360130)	0.008695 / 0.007607 (0.001088)	0.579300 / 0.226044 (0.353256)	5.809064 / 2.268929 (3.540136)	3.145393 / 55.444624 (-52.299232)	2.832760 / 6.876477 (-4.043717)	3.020460 / 2.142072 (0.878388)	0.700235 / 4.805227 (-4.104992)	0.161262 / 6.500664 (-6.339402)	0.076484 / 0.075469 (0.001015)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.606504 / 1.841788 (-0.235284)	23.747863 / 8.074308 (15.673555)	17.281712 / 10.191392 (7.090320)	0.203874 / 0.680424 (-0.476550)	0.021839 / 0.534201 (-0.512362)	0.472365 / 0.579283 (-0.106918)	0.475150 / 0.434364 (0.040786)	0.571713 / 0.540337 (0.031376)	0.759210 / 1.386936 (-0.627726)

albertvillanova

Thanks for the fix.

Some questions: won't this have an impact on download time, now that we no longer compress the payload? What is the advantage of this approach over the one with block_size: 0?

See: https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans/discussions/3

lhoestq · 2023-07-27T13:58:27Z

> Some questions: won't this have an impact on download time, now that we no longer compress the payload? What is the advantage of this approach over the one with block_size: 0?

Surely, but this prevents random access, which is needed in multiple places in the code (e.g. to check the compression type).
GitHub isn't a good place for big files anyway, so we should be fine.
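To illustrate why random access matters for something like a compression check, here is a small sketch (not the library's code) that sniffs the gzip magic number and then rewinds. The function name `sniff_is_gzip` is hypothetical; the rewind step only works on a seekable file object.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_is_gzip(fileobj):
    """Peek at the first two bytes, then restore the original position."""
    start = fileobj.tell()
    magic = fileobj.read(2)
    # Rewinding requires a seekable stream; a forward-only stream would have
    # already consumed these bytes and could not hand them to a later reader.
    fileobj.seek(start)
    return magic == GZIP_MAGIC


buf = io.BytesIO(gzip.compress(b"hello"))
is_gz = sniff_is_gzip(buf)
position_after = buf.tell()
```

After the check, the stream is back at its starting position, so a decompressor or parser can read the file from the beginning.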

no gzip encoding from github

c3a7fc0

lhoestq requested a review from albertvillanova July 26, 2023 14:01

albertvillanova reviewed Jul 27, 2023


lhoestq merged commit 73fbf7d into main Jul 27, 2023
13 checks passed

lhoestq deleted the stream-from-github branch July 27, 2023 16:14

severo mentioned this pull request Jul 27, 2023

upgrade datasets to 2.14 huggingface/dataset-viewer#1550

Closed

6 tasks
