canonicalize data dir in config ID hash #5899

kylrth · 2023-05-25T18:17:10Z

fixes #5871

The second commit is optional but improves readability.

HuggingFaceDocBuilderDev · 2023-06-01T13:17:32Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko

Thanks!

src/datasets/builder.py

This leaves the hash unchanged when the data dir changes in insubstantial ways, like adding a trailing slash or using a symlink. fixes huggingface#5871
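To illustrate the idea in the commit message, here is a minimal sketch of canonicalizing a directory path before hashing it. It is not the code from `src/datasets/builder.py`: the helper names `canonical_data_dir` and `config_id_suffix` are hypothetical, and the use of `os.path.realpath` with a truncated SHA-256 digest is an assumption chosen only to show why a trailing slash or a symlink no longer changes the result.

```python
import hashlib
import os


def canonical_data_dir(data_dir: str) -> str:
    """Return one canonical spelling for a data directory (hypothetical helper).

    os.path.realpath resolves symlinks, makes the path absolute, and
    normalizes it, which also drops a trailing separator.
    """
    return os.path.realpath(os.path.expanduser(data_dir))


def config_id_suffix(data_dir: str) -> str:
    """Hash the canonical path so that equivalent spellings share one ID.

    The digest choice and its length are illustrative assumptions, not
    what the library necessarily does.
    """
    canonical = canonical_data_dir(data_dir)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

With helpers like these, `config_id_suffix("/data/foo")` and `config_id_suffix("/data/foo/")` would agree, as would a symlink that points at `/data/foo`, whereas hashing the raw strings would give a different ID for each.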

github-actions · 2023-06-02T16:02:15Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009137 / 0.011353 (-0.002216)	0.006119 / 0.011008 (-0.004889)	0.136530 / 0.038508 (0.098022)	0.038434 / 0.023109 (0.015325)	0.427900 / 0.275898 (0.152002)	0.449757 / 0.323480 (0.126277)	0.007673 / 0.007986 (-0.000313)	0.007147 / 0.004328 (0.002818)	0.108029 / 0.004250 (0.103778)	0.055072 / 0.037052 (0.018020)	0.439245 / 0.258489 (0.180756)	0.477285 / 0.293841 (0.183444)	0.044838 / 0.128546 (-0.083708)	0.020814 / 0.075646 (-0.054832)	0.436098 / 0.419271 (0.016826)	0.067459 / 0.043533 (0.023926)	0.427470 / 0.255139 (0.172331)	0.443260 / 0.283200 (0.160060)	0.125466 / 0.141683 (-0.016216)	1.996756 / 1.452155 (0.544601)	2.100679 / 1.492716 (0.607962)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.278407 / 0.018006 (0.260401)	0.625855 / 0.000490 (0.625365)	0.005544 / 0.000200 (0.005344)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033495 / 0.037411 (-0.003916)	0.134718 / 0.014526 (0.120192)	0.150151 / 0.176557 (-0.026406)	0.221385 / 0.737135 (-0.515751)	0.150932 / 0.296338 (-0.145406)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.668845 / 0.215209 (0.453636)	6.678436 / 2.077655 (4.600781)	2.714074 / 1.504120 (1.209954)	2.275784 / 1.541195 (0.734589)	2.332852 / 1.468490 (0.864361)	1.014877 / 4.584777 (-3.569900)	6.086455 / 3.745712 (2.340743)	2.990029 / 5.269862 (-2.279832)	1.862236 / 4.565676 (-2.703441)	0.122179 / 0.424275 (-0.302096)	0.015706 / 0.007607 (0.008099)	0.873473 / 0.226044 (0.647429)	8.580109 / 2.268929 (6.311180)	3.458360 / 55.444624 (-51.986264)	2.738801 / 6.876477 (-4.137676)	2.918428 / 2.142072 (0.776356)	1.224910 / 4.805227 (-3.580317)	0.243006 / 6.500664 (-6.257658)	0.087121 / 0.075469 (0.011652)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.757802 / 1.841788 (-0.083986)	19.447999 / 8.074308 (11.373691)	24.518157 / 10.191392 (14.326765)	0.245013 / 0.680424 (-0.435411)	0.032290 / 0.534201 (-0.501911)	0.542043 / 0.579283 (-0.037240)	0.708154 / 0.434364 (0.273790)	0.660584 / 0.540337 (0.120247)	0.794868 / 1.386936 (-0.592068)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009496 / 0.011353 (-0.001857)	0.005842 / 0.011008 (-0.005166)	0.112813 / 0.038508 (0.074305)	0.039120 / 0.023109 (0.016011)	0.489717 / 0.275898 (0.213819)	0.532586 / 0.323480 (0.209107)	0.007681 / 0.007986 (-0.000304)	0.005337 / 0.004328 (0.001009)	0.107244 / 0.004250 (0.102994)	0.056847 / 0.037052 (0.019794)	0.499447 / 0.258489 (0.240958)	0.548995 / 0.293841 (0.255154)	0.058047 / 0.128546 (-0.070499)	0.015468 / 0.075646 (-0.060179)	0.124600 / 0.419271 (-0.294671)	0.060940 / 0.043533 (0.017407)	0.488370 / 0.255139 (0.233231)	0.518540 / 0.283200 (0.235341)	0.124147 / 0.141683 (-0.017536)	1.902922 / 1.452155 (0.450767)	2.033519 / 1.492716 (0.540803)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.319527 / 0.018006 (0.301521)	0.629641 / 0.000490 (0.629152)	0.000721 / 0.000200 (0.000521)	0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033150 / 0.037411 (-0.004262)	0.134250 / 0.014526 (0.119724)	0.161273 / 0.176557 (-0.015283)	0.211471 / 0.737135 (-0.525664)	0.155326 / 0.296338 (-0.141012)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.705244 / 0.215209 (0.490035)	7.043040 / 2.077655 (4.965386)	3.308948 / 1.504120 (1.804828)	2.885050 / 1.541195 (1.343855)	2.810260 / 1.468490 (1.341770)	1.027095 / 4.584777 (-3.557682)	6.111398 / 3.745712 (2.365686)	5.385545 / 5.269862 (0.115684)	2.521668 / 4.565676 (-2.044009)	0.122419 / 0.424275 (-0.301856)	0.016376 / 0.007607 (0.008768)	0.830856 / 0.226044 (0.604811)	8.952199 / 2.268929 (6.683271)	4.207875 / 55.444624 (-51.236749)	3.346624 / 6.876477 (-3.529853)	3.395316 / 2.142072 (1.253244)	1.351816 / 4.805227 (-3.453411)	0.303056 / 6.500664 (-6.197608)	0.098713 / 0.075469 (0.023244)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.841903 / 1.841788 (0.000116)	20.472125 / 8.074308 (12.397817)	23.433200 / 10.191392 (13.241808)	0.242599 / 0.680424 (-0.437825)	0.030701 / 0.534201 (-0.503500)	0.541614 / 0.579283 (-0.037669)	0.657827 / 0.434364 (0.223463)	0.652448 / 0.540337 (0.112111)	0.773743 / 1.386936 (-0.613193)
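Each cell in the benchmark rows above has the form `new / old (diff)`, and the numbers suggest that `diff` is `new - old` rounded to six decimals. The sketch below parses one cell and checks that relationship. The function names and the tolerance are assumptions made for illustration and are not part of the benchmark tooling.

```python
import re
from typing import Tuple

# Matches a cell such as "0.278407 / 0.018006 (0.260401)".
CELL = re.compile(r"^\s*(-?\d+\.\d+)\s*/\s*(-?\d+\.\d+)\s*\((-?\d+\.\d+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Split a "new / old (diff)" cell into three floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"not a benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell: str, tol: float = 2e-6) -> bool:
    """Check diff == new - old, allowing for six-decimal rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol
```

For the `get_batch_of_1024_random_rows` cell, 0.278407 - 0.018006 = 0.260401, which matches the reported diff. Some cells are off by one unit in the last digit (for example `0.436098 / 0.419271 (0.016826)`), presumably because the values were rounded after subtraction, which is why the check uses a small tolerance.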

mariosasko reviewed Jun 1, 2023

src/datasets/builder.py (two review threads, outdated and resolved)

kylrth force-pushed the main branch from 348b5b6 to eac649f · June 1, 2023 17:46

canonicalize data dir in config ID hash

7789d90

kylrth force-pushed the main branch from eac649f to 7789d90 · June 1, 2023 17:48

mariosasko approved these changes Jun 2, 2023

mariosasko merged commit 02ee418 into huggingface:main Jun 2, 2023
12 checks passed
