Don't reference self in Spark._validate_cache_dir #6024

maddiedawson · 2023-07-12T20:31:16Z

Fix for #5963
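The diff is not shown in this thread, so the following is only a rough, Spark-free sketch of what "don't reference self" generally means for a function that gets serialized and sent to workers. A nested function that reads `self._cache_dir` closes over the whole instance, while one that reads a local copy closes over just the string. The `Builder` class, its `handle` attribute, and the helper names are hypothetical and are not taken from the `datasets` code.

```python
class Builder:
    """Hypothetical stand-in for an object that lives on the driver."""

    def __init__(self, cache_dir):
        self._cache_dir = cache_dir
        # Placeholder for a driver-side resource that should not be shipped.
        self.handle = object()

    def probe_capturing_self(self):
        # The nested function reads an attribute through `self`,
        # so its closure holds the entire Builder instance.
        def probe():
            return self._cache_dir

        return probe

    def probe_capturing_local(self):
        # Copy the value into a local first; the closure then holds
        # only the string, not the instance.
        cache_dir = self._cache_dir

        def probe():
            return cache_dir

        return probe


def captured_types(fn):
    """Return the type names of the objects held in fn's closure cells."""
    return [type(cell.cell_contents).__name__ for cell in (fn.__closure__ or ())]


builder = Builder("/tmp/cache")
print(captured_types(builder.probe_capturing_self()))   # ['Builder']
print(captured_types(builder.probe_capturing_local()))  # ['str']
```

Both probes return the same value when called; they differ only in what a closure-aware serializer would have to carry along with them.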

maddiedawson · 2023-07-12T20:31:39Z

PTAL @lhoestq :) I tested this manually on a multi-node Databricks cluster.

maddiedawson · 2023-07-12T20:36:43Z

Hm, looks like the check_code_quality failures are unrelated to my change: https://github.com/huggingface/datasets/actions/runs/5536162850/jobs/10103451883?pr=6024

HuggingFaceDocBuilderDev · 2023-07-12T20:37:58Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Cool! Let me fix the check_code_quality error in another PR.

github-actions · 2023-07-13T12:46:01Z


PyArrow==8.0.0


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005952 / 0.011353 (-0.005400)	0.003585 / 0.011008 (-0.007424)	0.079163 / 0.038508 (0.040655)	0.057926 / 0.023109 (0.034817)	0.326647 / 0.275898 (0.050749)	0.383485 / 0.323480 (0.060005)	0.004530 / 0.007986 (-0.003456)	0.002821 / 0.004328 (-0.001508)	0.062071 / 0.004250 (0.057820)	0.048023 / 0.037052 (0.010971)	0.329368 / 0.258489 (0.070879)	0.390877 / 0.293841 (0.097036)	0.026959 / 0.128546 (-0.101588)	0.007911 / 0.075646 (-0.067735)	0.259956 / 0.419271 (-0.159315)	0.044582 / 0.043533 (0.001049)	0.320537 / 0.255139 (0.065398)	0.373814 / 0.283200 (0.090614)	0.020275 / 0.141683 (-0.121408)	1.532128 / 1.452155 (0.079973)	1.595031 / 1.492716 (0.102315)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.186127 / 0.018006 (0.168120)	0.428586 / 0.000490 (0.428097)	0.005180 / 0.000200 (0.004980)	0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024876 / 0.037411 (-0.012536)	0.072169 / 0.014526 (0.057643)	0.082015 / 0.176557 (-0.094542)	0.147467 / 0.737135 (-0.589668)	0.082769 / 0.296338 (-0.213570)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.410625 / 0.215209 (0.195416)	4.116742 / 2.077655 (2.039088)	2.172291 / 1.504120 (0.668171)	2.022462 / 1.541195 (0.481268)	2.048142 / 1.468490 (0.579651)	0.503152 / 4.584777 (-4.081625)	3.019135 / 3.745712 (-0.726577)	3.589451 / 5.269862 (-1.680410)	2.206876 / 4.565676 (-2.358801)	0.057687 / 0.424275 (-0.366588)	0.006560 / 0.007607 (-0.001047)	0.475585 / 0.226044 (0.249541)	4.784344 / 2.268929 (2.515416)	2.506322 / 55.444624 (-52.938302)	2.168251 / 6.876477 (-4.708225)	2.324453 / 2.142072 (0.182381)	0.590609 / 4.805227 (-4.214618)	0.124178 / 6.500664 (-6.376486)	0.059197 / 0.075469 (-0.016272)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.212359 / 1.841788 (-0.629429)	17.915843 / 8.074308 (9.841535)	13.128330 / 10.191392 (2.936938)	0.144805 / 0.680424 (-0.535618)	0.016889 / 0.534201 (-0.517312)	0.344056 / 0.579283 (-0.235227)	0.359370 / 0.434364 (-0.074994)	0.404199 / 0.540337 (-0.136138)	0.549117 / 1.386936 (-0.837819)

PyArrow==latest


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005914 / 0.011353 (-0.005439)	0.003565 / 0.011008 (-0.007443)	0.061575 / 0.038508 (0.023067)	0.057677 / 0.023109 (0.034568)	0.359753 / 0.275898 (0.083855)	0.394135 / 0.323480 (0.070655)	0.004648 / 0.007986 (-0.003338)	0.002795 / 0.004328 (-0.001534)	0.061877 / 0.004250 (0.057626)	0.049673 / 0.037052 (0.012621)	0.363120 / 0.258489 (0.104631)	0.402685 / 0.293841 (0.108844)	0.027021 / 0.128546 (-0.101525)	0.008006 / 0.075646 (-0.067641)	0.067398 / 0.419271 (-0.351874)	0.044442 / 0.043533 (0.000909)	0.364851 / 0.255139 (0.109712)	0.387219 / 0.283200 (0.104019)	0.027267 / 0.141683 (-0.114416)	1.466675 / 1.452155 (0.014520)	1.512607 / 1.492716 (0.019891)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.206156 / 0.018006 (0.188150)	0.410877 / 0.000490 (0.410387)	0.003061 / 0.000200 (0.002861)	0.000068 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024869 / 0.037411 (-0.012542)	0.075736 / 0.014526 (0.061210)	0.083922 / 0.176557 (-0.092634)	0.139510 / 0.737135 (-0.597626)	0.087685 / 0.296338 (-0.208654)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.414473 / 0.215209 (0.199264)	4.150633 / 2.077655 (2.072979)	2.132892 / 1.504120 (0.628773)	1.964072 / 1.541195 (0.422878)	2.003353 / 1.468490 (0.534863)	0.498012 / 4.584777 (-4.086765)	3.010135 / 3.745712 (-0.735577)	2.841130 / 5.269862 (-2.428732)	1.826013 / 4.565676 (-2.739664)	0.057443 / 0.424275 (-0.366832)	0.006374 / 0.007607 (-0.001234)	0.490337 / 0.226044 (0.264292)	4.889628 / 2.268929 (2.620700)	2.575626 / 55.444624 (-52.868998)	2.246522 / 6.876477 (-4.629955)	2.276183 / 2.142072 (0.134110)	0.581465 / 4.805227 (-4.223763)	0.123877 / 6.500664 (-6.376787)	0.060339 / 0.075469 (-0.015130)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.333202 / 1.841788 (-0.508585)	18.363558 / 8.074308 (10.289250)	14.109356 / 10.191392 (3.917964)	0.147358 / 0.680424 (-0.533066)	0.016813 / 0.534201 (-0.517388)	0.334815 / 0.579283 (-0.244468)	0.366576 / 0.434364 (-0.067788)	0.397223 / 0.540337 (-0.143115)	0.547893 / 1.386936 (-0.839043)

Init

5954e50

lhoestq approved these changes Jul 13, 2023


lhoestq merged commit 67ac60b into huggingface:main Jul 13, 2023
3 of 4 checks passed

maddiedawson deleted the ES-759942 branch July 13, 2023 16:58
