Fix commit message formatting in multi-commit uploads #6313

qgallouedec · 2023-10-19T07:53:56Z

Currently, in a multi-commit upload, each commit message accumulates the part suffixes of the previous commits:

Upload dataset (part 00000-of-00002)
Upload dataset (part 00000-of-00002) (part 00001-of-00002)

This was introduced in #6269.

This PR fixes the issue so that each commit message carries only its own part suffix:

Upload dataset (part 00000-of-00002)
Upload dataset (part 00001-of-00002)
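
For illustration, the difference between the two behaviours can be sketched as follows. This is a minimal, standalone sketch and not the code from `src/datasets/dataset_dict.py`: the function names `buggy_messages` and `fixed_messages` are hypothetical, and the suffix format is simply copied from the messages quoted above. It assumes the accumulation comes from appending the suffix to a message variable that is reused across commits.

```python
from typing import List


def buggy_messages(base: str, num_commits: int) -> List[str]:
    """Hypothetical reproduction: the suffix is appended to a running message."""
    messages = []
    commit_message = base
    for i in range(num_commits):
        # The running message already contains the earlier suffixes,
        # so every new suffix piles on top of them.
        commit_message += f" (part {i:05d}-of-{num_commits:05d})"
        messages.append(commit_message)
    return messages


def fixed_messages(base: str, num_commits: int) -> List[str]:
    """Hypothetical fix: every message is built from the unchanged base."""
    return [
        f"{base} (part {i:05d}-of-{num_commits:05d})"
        for i in range(num_commits)
    ]


if __name__ == "__main__":
    for message in buggy_messages("Upload dataset", 2):
        print(message)
    for message in fixed_messages("Upload dataset", 2):
        print(message)
```

Building each message from the untouched base string keeps the suffix of one part from leaking into the next, whichever way the actual fix is implemented.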

HuggingFaceDocBuilderDev · 2023-10-19T08:01:30Z

The documentation is not available anymore as the PR was closed or merged.

src/datasets/dataset_dict.py

mariosasko

Thanks!

github-actions · 2023-10-20T14:06:12Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006760 / 0.011353 (-0.004593)	0.003918 / 0.011008 (-0.007091)	0.084016 / 0.038508 (0.045508)	0.069927 / 0.023109 (0.046818)	0.307898 / 0.275898 (0.032000)	0.337453 / 0.323480 (0.013973)	0.004132 / 0.007986 (-0.003854)	0.003248 / 0.004328 (-0.001081)	0.064526 / 0.004250 (0.060275)	0.056424 / 0.037052 (0.019371)	0.316313 / 0.258489 (0.057824)	0.356302 / 0.293841 (0.062461)	0.030634 / 0.128546 (-0.097912)	0.008467 / 0.075646 (-0.067180)	0.286676 / 0.419271 (-0.132595)	0.051813 / 0.043533 (0.008280)	0.309874 / 0.255139 (0.054735)	0.332513 / 0.283200 (0.049313)	0.023919 / 0.141683 (-0.117764)	1.509033 / 1.452155 (0.056878)	1.549636 / 1.492716 (0.056920)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.221464 / 0.018006 (0.203458)	0.447873 / 0.000490 (0.447384)	0.002408 / 0.000200 (0.002208)	0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027634 / 0.037411 (-0.009777)	0.081802 / 0.014526 (0.067276)	0.781489 / 0.176557 (0.604933)	0.165184 / 0.737135 (-0.571951)	0.121526 / 0.296338 (-0.174813)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.408215 / 0.215209 (0.193006)	4.091192 / 2.077655 (2.013538)	2.062608 / 1.504120 (0.558488)	1.895747 / 1.541195 (0.354552)	1.873682 / 1.468490 (0.405192)	0.484184 / 4.584777 (-4.100593)	3.469096 / 3.745712 (-0.276616)	3.365325 / 5.269862 (-1.904537)	2.000333 / 4.565676 (-2.565343)	0.056661 / 0.424275 (-0.367614)	0.007100 / 0.007607 (-0.000507)	0.478587 / 0.226044 (0.252542)	4.768703 / 2.268929 (2.499774)	2.472432 / 55.444624 (-52.972192)	2.133611 / 6.876477 (-4.742865)	2.154296 / 2.142072 (0.012223)	0.582293 / 4.805227 (-4.222934)	0.131932 / 6.500664 (-6.368732)	0.060259 / 0.075469 (-0.015211)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.259167 / 1.841788 (-0.582620)	18.465604 / 8.074308 (10.391296)	14.024528 / 10.191392 (3.833136)	0.162320 / 0.680424 (-0.518104)	0.018144 / 0.534201 (-0.516057)	0.389931 / 0.579283 (-0.189352)	0.396456 / 0.434364 (-0.037908)	0.454734 / 0.540337 (-0.085603)	0.636406 / 1.386936 (-0.750530)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006565 / 0.011353 (-0.004788)	0.004008 / 0.011008 (-0.007000)	0.064526 / 0.038508 (0.026018)	0.071963 / 0.023109 (0.048854)	0.415456 / 0.275898 (0.139557)	0.441199 / 0.323480 (0.117719)	0.005619 / 0.007986 (-0.002366)	0.003261 / 0.004328 (-0.001067)	0.064817 / 0.004250 (0.060567)	0.055349 / 0.037052 (0.018296)	0.425172 / 0.258489 (0.166683)	0.452629 / 0.293841 (0.158788)	0.031676 / 0.128546 (-0.096870)	0.008432 / 0.075646 (-0.067214)	0.071752 / 0.419271 (-0.347519)	0.047176 / 0.043533 (0.003643)	0.408641 / 0.255139 (0.153502)	0.428579 / 0.283200 (0.145380)	0.021548 / 0.141683 (-0.120135)	1.495153 / 1.452155 (0.042999)	1.557933 / 1.492716 (0.065217)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.212749 / 0.018006 (0.194743)	0.441263 / 0.000490 (0.440773)	0.005831 / 0.000200 (0.005631)	0.000092 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031844 / 0.037411 (-0.005567)	0.091590 / 0.014526 (0.077064)	0.102859 / 0.176557 (-0.073697)	0.155859 / 0.737135 (-0.581276)	0.104717 / 0.296338 (-0.191622)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425924 / 0.215209 (0.210715)	4.292829 / 2.077655 (2.215174)	2.314350 / 1.504120 (0.810230)	2.163087 / 1.541195 (0.621892)	2.217310 / 1.468490 (0.748820)	0.490889 / 4.584777 (-4.093887)	3.498287 / 3.745712 (-0.247425)	3.224980 / 5.269862 (-2.044881)	1.987739 / 4.565676 (-2.577938)	0.057486 / 0.424275 (-0.366790)	0.007199 / 0.007607 (-0.000408)	0.501194 / 0.226044 (0.275149)	5.015202 / 2.268929 (2.746273)	2.816307 / 55.444624 (-52.628318)	2.474593 / 6.876477 (-4.401884)	2.649510 / 2.142072 (0.507437)	0.597167 / 4.805227 (-4.208060)	0.131199 / 6.500664 (-6.369465)	0.059532 / 0.075469 (-0.015938)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.384053 / 1.841788 (-0.457734)	18.964201 / 8.074308 (10.889893)	14.336209 / 10.191392 (4.144817)	0.187522 / 0.680424 (-0.492902)	0.020201 / 0.534201 (-0.514000)	0.394778 / 0.579283 (-0.184505)	0.408393 / 0.434364 (-0.025971)	0.470965 / 0.540337 (-0.069373)	0.667974 / 1.386936 (-0.718962)

qgallouedec added 2 commits October 19, 2023 09:49

fix commit message

d8c965b

fix the fix

5188bf8

Fix dataset too

3477bf7

qgallouedec commented Oct 19, 2023

src/datasets/dataset_dict.py (outdated, resolved)

qgallouedec added 2 commits October 19, 2023 19:33

Merge branch 'main' into fix_commit_message_push_to_hub

205cc84

Update src/datasets/dataset_dict.py

a69d636

mariosasko approved these changes Oct 20, 2023

mariosasko merged commit 3b3333d into huggingface:main Oct 20, 2023
12 checks passed
