Rename "pattern" to "path" in YAML data_files configs #6044

lhoestq · 2023-07-17T15:41:16Z

This makes YAML data_files configs easier for users to understand.

They can use "path" to specify a single path, ~~or "paths" to use a list of paths.~~

Glob patterns are still supported, though.
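
For illustration, here is a minimal sketch of what the rename means for a config. It assumes the YAML metadata has already been parsed into a Python dict with a `configs` list whose `data_files` entries carry `split` plus the renamed key. That layout, the file names, and the helper `rename_pattern_to_path` are hypothetical and are not the library's code. `fnmatch` is only a stand-in to show that a "path" value can still be a glob; the library's own resolution may behave differently.

```python
from copy import deepcopy
from fnmatch import fnmatch

# Hypothetical metadata, written as the dict a YAML parser might produce.
# The exact layout is an assumption made for this sketch.
legacy_metadata = {
    "configs": [
        {
            "config_name": "default",
            "data_files": [
                {"split": "train", "pattern": "data/train-*.csv"},
                {"split": "test", "pattern": "data/test.csv"},
            ],
        }
    ]
}


def rename_pattern_to_path(metadata):
    """Return a copy of `metadata` where each data_files "pattern" key is renamed to "path".

    Hypothetical helper for illustration only.
    """
    updated = deepcopy(metadata)
    for config in updated.get("configs", []):
        for entry in config.get("data_files", []):
            if "pattern" in entry and "path" not in entry:
                entry["path"] = entry.pop("pattern")
    return updated


new_metadata = rename_pattern_to_path(legacy_metadata)
train_entry = new_metadata["configs"][0]["data_files"][0]

# The value under "path" can still be a glob pattern.
# fnmatch is a rough stand-in, not the resolution logic used by the library.
example_files = ["data/train-00000.csv", "data/train-00001.csv", "data/test.csv"]
matched = [name for name in example_files if fnmatch(name, train_entry["path"])]
print(train_entry)
print(matched)
```

Only the key name changes in this sketch; the values, including globs, are carried over untouched.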

github-actions · 2023-07-17T15:49:58Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006543 / 0.011353 (-0.004809)	0.004085 / 0.011008 (-0.006924)	0.083989 / 0.038508 (0.045481)	0.074733 / 0.023109 (0.051623)	0.310839 / 0.275898 (0.034941)	0.333540 / 0.323480 (0.010060)	0.005566 / 0.007986 (-0.002419)	0.003461 / 0.004328 (-0.000868)	0.065194 / 0.004250 (0.060943)	0.057007 / 0.037052 (0.019954)	0.325633 / 0.258489 (0.067144)	0.351665 / 0.293841 (0.057824)	0.030561 / 0.128546 (-0.097985)	0.008579 / 0.075646 (-0.067068)	0.287457 / 0.419271 (-0.131815)	0.063554 / 0.043533 (0.020021)	0.309182 / 0.255139 (0.054043)	0.327809 / 0.283200 (0.044609)	0.034470 / 0.141683 (-0.107213)	1.452098 / 1.452155 (-0.000057)	1.527130 / 1.492716 (0.034414)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.241736 / 0.018006 (0.223729)	0.552432 / 0.000490 (0.551943)	0.004085 / 0.000200 (0.003885)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027290 / 0.037411 (-0.010121)	0.081250 / 0.014526 (0.066724)	0.094739 / 0.176557 (-0.081818)	0.150424 / 0.737135 (-0.586711)	0.095488 / 0.296338 (-0.200851)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.377245 / 0.215209 (0.162036)	3.781021 / 2.077655 (1.703366)	1.820092 / 1.504120 (0.315972)	1.654420 / 1.541195 (0.113225)	1.751256 / 1.468490 (0.282766)	0.475161 / 4.584777 (-4.109616)	3.603462 / 3.745712 (-0.142251)	5.437837 / 5.269862 (0.167975)	3.305598 / 4.565676 (-1.260079)	0.055856 / 0.424275 (-0.368419)	0.007259 / 0.007607 (-0.000348)	0.454205 / 0.226044 (0.228161)	4.544157 / 2.268929 (2.275229)	2.296776 / 55.444624 (-53.147848)	1.951017 / 6.876477 (-4.925459)	2.128759 / 2.142072 (-0.013313)	0.590354 / 4.805227 (-4.214873)	0.129974 / 6.500664 (-6.370690)	0.059506 / 0.075469 (-0.015963)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.285866 / 1.841788 (-0.555921)	19.419446 / 8.074308 (11.345138)	13.985108 / 10.191392 (3.793716)	0.146803 / 0.680424 (-0.533620)	0.018176 / 0.534201 (-0.516025)	0.392345 / 0.579283 (-0.186938)	0.405394 / 0.434364 (-0.028970)	0.454649 / 0.540337 (-0.085688)	0.633075 / 1.386936 (-0.753861)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006497 / 0.011353 (-0.004855)	0.004092 / 0.011008 (-0.006916)	0.064908 / 0.038508 (0.026400)	0.073494 / 0.023109 (0.050385)	0.382227 / 0.275898 (0.106329)	0.407320 / 0.323480 (0.083840)	0.005653 / 0.007986 (-0.002332)	0.003500 / 0.004328 (-0.000829)	0.064570 / 0.004250 (0.060320)	0.058733 / 0.037052 (0.021681)	0.385702 / 0.258489 (0.127213)	0.426463 / 0.293841 (0.132622)	0.031073 / 0.128546 (-0.097473)	0.008710 / 0.075646 (-0.066936)	0.071378 / 0.419271 (-0.347893)	0.050141 / 0.043533 (0.006608)	0.377769 / 0.255139 (0.122630)	0.395016 / 0.283200 (0.111816)	0.025158 / 0.141683 (-0.116525)	1.470503 / 1.452155 (0.018348)	1.532742 / 1.492716 (0.040026)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.214249 / 0.018006 (0.196243)	0.583580 / 0.000490 (0.583090)	0.004027 / 0.000200 (0.003828)	0.000104 / 0.000054 (0.000050)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030186 / 0.037411 (-0.007226)	0.086927 / 0.014526 (0.072401)	0.102060 / 0.176557 (-0.074497)	0.156281 / 0.737135 (-0.580855)	0.100825 / 0.296338 (-0.195514)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.419942 / 0.215209 (0.204733)	4.183797 / 2.077655 (2.106142)	2.205079 / 1.504120 (0.700959)	2.071219 / 1.541195 (0.530024)	2.194047 / 1.468490 (0.725557)	0.478768 / 4.584777 (-4.106009)	3.584864 / 3.745712 (-0.160848)	3.371635 / 5.269862 (-1.898227)	2.022134 / 4.565676 (-2.543542)	0.056553 / 0.424275 (-0.367722)	0.007231 / 0.007607 (-0.000376)	0.493158 / 0.226044 (0.267113)	4.934370 / 2.268929 (2.665441)	2.699593 / 55.444624 (-52.745031)	2.396371 / 6.876477 (-4.480105)	2.438052 / 2.142072 (0.295979)	0.589578 / 4.805227 (-4.215649)	0.147234 / 6.500664 (-6.353430)	0.062049 / 0.075469 (-0.013420)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.318246 / 1.841788 (-0.523542)	19.829025 / 8.074308 (11.754717)	14.314825 / 10.191392 (4.123433)	0.168309 / 0.680424 (-0.512115)	0.018596 / 0.534201 (-0.515605)	0.397540 / 0.579283 (-0.181743)	0.421280 / 0.434364 (-0.013084)	0.479917 / 0.540337 (-0.060421)	0.643494 / 1.386936 (-0.743442)
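
Each cell in the tables above has the form `new / old (diff)`. As a rough aid for reading them, the sketch below parses one tab-separated header/value pair (copied from the first benchmark_getitem_100B.json table) and checks that the reported diff equals `new - old` up to rounding. The helpers `parse_cell` and `parse_row` and the tolerance are assumptions made for this example. They are not part of the benchmark tooling.

```python
import re

# Matches cells such as "0.241736 / 0.018006 (0.223729)".
CELL_RE = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\(\s*(-?[\d.]+)\s*\)\s*$")


def parse_cell(cell):
    """Split one 'new / old (diff)' cell into three floats."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(header_line, value_line):
    """Map each metric name to its (new, old, diff) tuple, skipping the label column."""
    names = header_line.split("\t")[1:]
    cells = value_line.split("\t")[1:]
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


header = "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows\tget_first_row\tget_last_row"
values = (
    "new / old (diff)\t0.241736 / 0.018006 (0.223729)\t0.552432 / 0.000490 (0.551943)"
    "\t0.004085 / 0.000200 (0.003885)\t0.000089 / 0.000054 (0.000035)"
)

row = parse_row(header, values)
for name, (new, old, diff) in row.items():
    # The printed values are rounded to six decimals, so allow a small tolerance.
    consistent = abs((new - old) - diff) < 5e-6
    print(f"{name}: new={new} old={old} diff={diff} consistent={consistent}")
```

A positive diff means the new measurement is larger than the old one, and a negative diff means it is smaller.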

github-actions · 2023-07-17T16:04:42Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008349 / 0.011353 (-0.003004)	0.005362 / 0.011008 (-0.005646)	0.100777 / 0.038508 (0.062269)	0.078719 / 0.023109 (0.055609)	0.398105 / 0.275898 (0.122207)	0.444189 / 0.323480 (0.120709)	0.006834 / 0.007986 (-0.001152)	0.004642 / 0.004328 (0.000314)	0.076284 / 0.004250 (0.072034)	0.062738 / 0.037052 (0.025685)	0.409532 / 0.258489 (0.151043)	0.447218 / 0.293841 (0.153377)	0.052996 / 0.128546 (-0.075550)	0.012977 / 0.075646 (-0.062669)	0.347687 / 0.419271 (-0.071585)	0.068076 / 0.043533 (0.024543)	0.394526 / 0.255139 (0.139387)	0.434110 / 0.283200 (0.150910)	0.041719 / 0.141683 (-0.099963)	1.759109 / 1.452155 (0.306955)	1.866049 / 1.492716 (0.373333)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.287633 / 0.018006 (0.269627)	0.611540 / 0.000490 (0.611051)	0.005388 / 0.000200 (0.005188)	0.000096 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027394 / 0.037411 (-0.010017)	0.089796 / 0.014526 (0.075270)	0.106931 / 0.176557 (-0.069625)	0.173560 / 0.737135 (-0.563575)	0.106948 / 0.296338 (-0.189391)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.575156 / 0.215209 (0.359947)	5.674170 / 2.077655 (3.596516)	2.463090 / 1.504120 (0.958971)	2.128245 / 1.541195 (0.587050)	2.118982 / 1.468490 (0.650492)	0.876976 / 4.584777 (-3.707801)	5.238229 / 3.745712 (1.492517)	4.548788 / 5.269862 (-0.721074)	2.905243 / 4.565676 (-1.660433)	0.090750 / 0.424275 (-0.333525)	0.008266 / 0.007607 (0.000659)	0.693305 / 0.226044 (0.467260)	7.126970 / 2.268929 (4.858041)	3.152131 / 55.444624 (-52.292494)	2.532118 / 6.876477 (-4.344359)	2.678442 / 2.142072 (0.536369)	0.932745 / 4.805227 (-3.872483)	0.196290 / 6.500664 (-6.304374)	0.074082 / 0.075469 (-0.001387)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.599636 / 1.841788 (-0.242152)	23.271435 / 8.074308 (15.197127)	19.696709 / 10.191392 (9.505317)	0.222668 / 0.680424 (-0.457756)	0.029088 / 0.534201 (-0.505113)	0.492477 / 0.579283 (-0.086806)	0.580578 / 0.434364 (0.146214)	0.558852 / 0.540337 (0.018514)	0.762083 / 1.386936 (-0.624853)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009021 / 0.011353 (-0.002332)	0.005011 / 0.011008 (-0.005997)	0.076504 / 0.038508 (0.037996)	0.077303 / 0.023109 (0.054193)	0.480660 / 0.275898 (0.204762)	0.493944 / 0.323480 (0.170464)	0.006339 / 0.007986 (-0.001646)	0.004302 / 0.004328 (-0.000026)	0.076228 / 0.004250 (0.071978)	0.060805 / 0.037052 (0.023753)	0.477539 / 0.258489 (0.219050)	0.496799 / 0.293841 (0.202958)	0.049495 / 0.128546 (-0.079052)	0.013333 / 0.075646 (-0.062313)	0.087217 / 0.419271 (-0.332055)	0.061451 / 0.043533 (0.017918)	0.485169 / 0.255139 (0.230030)	0.487348 / 0.283200 (0.204149)	0.035874 / 0.141683 (-0.105809)	1.829137 / 1.452155 (0.376982)	1.906151 / 1.492716 (0.413435)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.304526 / 0.018006 (0.286520)	0.627499 / 0.000490 (0.627009)	0.003786 / 0.000200 (0.003586)	0.000098 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035512 / 0.037411 (-0.001899)	0.096684 / 0.014526 (0.082158)	0.111879 / 0.176557 (-0.064678)	0.171489 / 0.737135 (-0.565647)	0.112175 / 0.296338 (-0.184164)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.604791 / 0.215209 (0.389582)	6.089137 / 2.077655 (4.011482)	2.883237 / 1.504120 (1.379117)	2.561109 / 1.541195 (1.019914)	2.542400 / 1.468490 (1.073910)	0.852828 / 4.584777 (-3.731949)	5.236812 / 3.745712 (1.491100)	4.756429 / 5.269862 (-0.513432)	2.885660 / 4.565676 (-1.680016)	0.095643 / 0.424275 (-0.328632)	0.008403 / 0.007607 (0.000796)	0.727707 / 0.226044 (0.501663)	7.428002 / 2.268929 (5.159074)	3.816051 / 55.444624 (-51.628573)	2.971057 / 6.876477 (-3.905420)	2.915965 / 2.142072 (0.773893)	1.006553 / 4.805227 (-3.798674)	0.201840 / 6.500664 (-6.298824)	0.080795 / 0.075469 (0.005326)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.794951 / 1.841788 (-0.046837)	23.624556 / 8.074308 (15.550248)	21.856195 / 10.191392 (11.664802)	0.253043 / 0.680424 (-0.427381)	0.031201 / 0.534201 (-0.503000)	0.461641 / 0.579283 (-0.117642)	0.577789 / 0.434364 (0.143425)	0.569197 / 0.540337 (0.028860)	0.780111 / 1.386936 (-0.606825)

stevhliu

Nice job (RIP bees)! 🙂

docs/source/repository_structure.mdx

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

github-actions · 2023-07-18T09:25:17Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007646 / 0.011353 (-0.003707)	0.004750 / 0.011008 (-0.006258)	0.097981 / 0.038508 (0.059473)	0.088989 / 0.023109 (0.065880)	0.377732 / 0.275898 (0.101834)	0.406805 / 0.323480 (0.083325)	0.006389 / 0.007986 (-0.001597)	0.003854 / 0.004328 (-0.000474)	0.073977 / 0.004250 (0.069727)	0.066497 / 0.037052 (0.029444)	0.371498 / 0.258489 (0.113009)	0.417352 / 0.293841 (0.123511)	0.036326 / 0.128546 (-0.092220)	0.009876 / 0.075646 (-0.065770)	0.330142 / 0.419271 (-0.089130)	0.062423 / 0.043533 (0.018890)	0.369375 / 0.255139 (0.114236)	0.406048 / 0.283200 (0.122848)	0.026564 / 0.141683 (-0.115119)	1.713295 / 1.452155 (0.261140)	1.797493 / 1.492716 (0.304777)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.231889 / 0.018006 (0.213882)	0.512497 / 0.000490 (0.512007)	0.000390 / 0.000200 (0.000190)	0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033978 / 0.037411 (-0.003433)	0.100117 / 0.014526 (0.085592)	0.112460 / 0.176557 (-0.064097)	0.179936 / 0.737135 (-0.557200)	0.114277 / 0.296338 (-0.182061)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.461320 / 0.215209 (0.246111)	4.563180 / 2.077655 (2.485526)	2.249474 / 1.504120 (0.745354)	2.100450 / 1.541195 (0.559255)	2.231080 / 1.468490 (0.762590)	0.567907 / 4.584777 (-4.016870)	4.117233 / 3.745712 (0.371521)	4.943159 / 5.269862 (-0.326703)	3.112299 / 4.565676 (-1.453377)	0.065500 / 0.424275 (-0.358775)	0.008407 / 0.007607 (0.000800)	0.545928 / 0.226044 (0.319883)	5.508058 / 2.268929 (3.239129)	2.834645 / 55.444624 (-52.609980)	2.440328 / 6.876477 (-4.436148)	2.680483 / 2.142072 (0.538410)	0.697191 / 4.805227 (-4.108036)	0.176646 / 6.500664 (-6.324018)	0.073608 / 0.075469 (-0.001861)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.451865 / 1.841788 (-0.389922)	22.752595 / 8.074308 (14.678287)	15.543338 / 10.191392 (5.351946)	0.214644 / 0.680424 (-0.465780)	0.022050 / 0.534201 (-0.512151)	0.463898 / 0.579283 (-0.115385)	0.481691 / 0.434364 (0.047327)	0.549715 / 0.540337 (0.009378)	0.773595 / 1.386936 (-0.613341)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007541 / 0.011353 (-0.003812)	0.004715 / 0.011008 (-0.006293)	0.076782 / 0.038508 (0.038274)	0.086242 / 0.023109 (0.063133)	0.458053 / 0.275898 (0.182155)	0.503097 / 0.323480 (0.179617)	0.006262 / 0.007986 (-0.001724)	0.003882 / 0.004328 (-0.000447)	0.075669 / 0.004250 (0.071419)	0.066004 / 0.037052 (0.028952)	0.469439 / 0.258489 (0.210950)	0.529744 / 0.293841 (0.235903)	0.037228 / 0.128546 (-0.091319)	0.009794 / 0.075646 (-0.065852)	0.082464 / 0.419271 (-0.336808)	0.058797 / 0.043533 (0.015264)	0.452069 / 0.255139 (0.196930)	0.488246 / 0.283200 (0.205046)	0.029324 / 0.141683 (-0.112359)	1.742237 / 1.452155 (0.290082)	1.839676 / 1.492716 (0.346959)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.228106 / 0.018006 (0.210100)	0.491632 / 0.000490 (0.491142)	0.004993 / 0.000200 (0.004793)	0.000114 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035413 / 0.037411 (-0.001999)	0.104617 / 0.014526 (0.090091)	0.121948 / 0.176557 (-0.054609)	0.186233 / 0.737135 (-0.550902)	0.121574 / 0.296338 (-0.174764)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.473849 / 0.215209 (0.258640)	4.788312 / 2.077655 (2.710657)	2.470535 / 1.504120 (0.966415)	2.270393 / 1.541195 (0.729198)	2.361096 / 1.468490 (0.892606)	0.556184 / 4.584777 (-4.028593)	4.216852 / 3.745712 (0.471140)	3.901718 / 5.269862 (-1.368143)	2.355209 / 4.565676 (-2.210467)	0.066708 / 0.424275 (-0.357567)	0.008709 / 0.007607 (0.001102)	0.571714 / 0.226044 (0.345669)	5.663150 / 2.268929 (3.394221)	3.025769 / 55.444624 (-52.418855)	2.652554 / 6.876477 (-4.223923)	2.750555 / 2.142072 (0.608483)	0.681536 / 4.805227 (-4.123691)	0.157187 / 6.500664 (-6.343477)	0.073533 / 0.075469 (-0.001936)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.604630 / 1.841788 (-0.237158)	22.735629 / 8.074308 (14.661321)	16.762347 / 10.191392 (6.570955)	0.175514 / 0.680424 (-0.504910)	0.021497 / 0.534201 (-0.512704)	0.461438 / 0.579283 (-0.117845)	0.476184 / 0.434364 (0.041820)	0.571048 / 0.540337 (0.030710)	0.747086 / 1.386936 (-0.639850)

github-actions · 2023-07-18T12:49:19Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006889 / 0.011353 (-0.004464)	0.004241 / 0.011008 (-0.006767)	0.084542 / 0.038508 (0.046034)	0.080484 / 0.023109 (0.057374)	0.309356 / 0.275898 (0.033458)	0.338548 / 0.323480 (0.015068)	0.004904 / 0.007986 (-0.003082)	0.005220 / 0.004328 (0.000892)	0.065501 / 0.004250 (0.061251)	0.062095 / 0.037052 (0.025043)	0.317332 / 0.258489 (0.058843)	0.364797 / 0.293841 (0.070956)	0.030492 / 0.128546 (-0.098054)	0.008991 / 0.075646 (-0.066656)	0.288274 / 0.419271 (-0.130998)	0.052582 / 0.043533 (0.009049)	0.310838 / 0.255139 (0.055699)	0.346304 / 0.283200 (0.063104)	0.027968 / 0.141683 (-0.113715)	1.509727 / 1.452155 (0.057573)	1.577410 / 1.492716 (0.084694)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.269725 / 0.018006 (0.251719)	0.627685 / 0.000490 (0.627195)	0.000419 / 0.000200 (0.000219)	0.000060 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031022 / 0.037411 (-0.006389)	0.081858 / 0.014526 (0.067332)	0.099477 / 0.176557 (-0.077080)	0.162981 / 0.737135 (-0.574154)	0.101987 / 0.296338 (-0.194351)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.386297 / 0.215209 (0.171088)	3.845321 / 2.077655 (1.767666)	1.834446 / 1.504120 (0.330326)	1.699730 / 1.541195 (0.158536)	1.764342 / 1.468490 (0.295852)	0.486423 / 4.584777 (-4.098354)	3.527595 / 3.745712 (-0.218117)	4.137034 / 5.269862 (-1.132827)	2.590457 / 4.565676 (-1.975219)	0.057598 / 0.424275 (-0.366677)	0.007318 / 0.007607 (-0.000289)	0.460775 / 0.226044 (0.234730)	4.627576 / 2.268929 (2.358647)	2.402566 / 55.444624 (-53.042059)	2.011392 / 6.876477 (-4.865085)	2.223915 / 2.142072 (0.081842)	0.623217 / 4.805227 (-4.182011)	0.148875 / 6.500664 (-6.351789)	0.059799 / 0.075469 (-0.015671)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.290768 / 1.841788 (-0.551020)	20.455083 / 8.074308 (12.380775)	13.469846 / 10.191392 (3.278454)	0.170329 / 0.680424 (-0.510095)	0.018409 / 0.534201 (-0.515792)	0.394356 / 0.579283 (-0.184927)	0.422685 / 0.434364 (-0.011679)	0.476241 / 0.540337 (-0.064096)	0.662682 / 1.386936 (-0.724254)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006724 / 0.011353 (-0.004629)	0.004508 / 0.011008 (-0.006500)	0.065304 / 0.038508 (0.026796)	0.080243 / 0.023109 (0.057133)	0.384545 / 0.275898 (0.108647)	0.415234 / 0.323480 (0.091754)	0.006361 / 0.007986 (-0.001624)	0.004193 / 0.004328 (-0.000135)	0.065940 / 0.004250 (0.061689)	0.063633 / 0.037052 (0.026581)	0.392799 / 0.258489 (0.134310)	0.443618 / 0.293841 (0.149777)	0.031134 / 0.128546 (-0.097412)	0.009058 / 0.075646 (-0.066588)	0.071051 / 0.419271 (-0.348221)	0.049096 / 0.043533 (0.005563)	0.379526 / 0.255139 (0.124387)	0.403370 / 0.283200 (0.120171)	0.026378 / 0.141683 (-0.115305)	1.457879 / 1.452155 (0.005724)	1.562890 / 1.492716 (0.070174)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.304416 / 0.018006 (0.286410)	0.626046 / 0.000490 (0.625557)	0.000469 / 0.000200 (0.000269)	0.000057 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032979 / 0.037411 (-0.004433)	0.086769 / 0.014526 (0.072243)	0.108188 / 0.176557 (-0.068369)	0.163077 / 0.737135 (-0.574058)	0.106276 / 0.296338 (-0.190062)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.406922 / 0.215209 (0.191713)	4.052828 / 2.077655 (1.975174)	2.084802 / 1.504120 (0.580682)	1.927263 / 1.541195 (0.386069)	1.956078 / 1.468490 (0.487587)	0.480110 / 4.584777 (-4.104667)	3.553022 / 3.745712 (-0.192691)	3.554450 / 5.269862 (-1.715411)	2.082681 / 4.565676 (-2.482995)	0.056711 / 0.424275 (-0.367564)	0.007374 / 0.007607 (-0.000234)	0.480555 / 0.226044 (0.254510)	4.795851 / 2.268929 (2.526923)	2.606675 / 55.444624 (-52.837949)	2.249964 / 6.876477 (-4.626512)	2.274234 / 2.142072 (0.132162)	0.571767 / 4.805227 (-4.233461)	0.133312 / 6.500664 (-6.367352)	0.061703 / 0.075469 (-0.013766)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.354308 / 1.841788 (-0.487479)	20.959352 / 8.074308 (12.885044)	14.158420 / 10.191392 (3.967028)	0.197959 / 0.680424 (-0.482465)	0.018412 / 0.534201 (-0.515789)	0.394307 / 0.579283 (-0.184976)	0.402455 / 0.434364 (-0.031909)	0.463314 / 0.540337 (-0.077024)	0.621050 / 1.386936 (-0.765886)

github-actions · 2023-07-18T15:12:22Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007179 / 0.011353 (-0.004174)	0.004318 / 0.011008 (-0.006690)	0.085209 / 0.038508 (0.046701)	0.089989 / 0.023109 (0.066880)	0.328188 / 0.275898 (0.052290)	0.346027 / 0.323480 (0.022547)	0.005711 / 0.007986 (-0.002275)	0.003703 / 0.004328 (-0.000625)	0.065419 / 0.004250 (0.061169)	0.065354 / 0.037052 (0.028301)	0.314531 / 0.258489 (0.056042)	0.354357 / 0.293841 (0.060516)	0.030918 / 0.128546 (-0.097628)	0.008632 / 0.075646 (-0.067015)	0.286817 / 0.419271 (-0.132455)	0.065267 / 0.043533 (0.021735)	0.310918 / 0.255139 (0.055779)	0.330497 / 0.283200 (0.047298)	0.035695 / 0.141683 (-0.105988)	1.471101 / 1.452155 (0.018947)	1.538658 / 1.492716 (0.045942)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.254314 / 0.018006 (0.236308)	0.591413 / 0.000490 (0.590923)	0.006082 / 0.000200 (0.005882)	0.000091 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031843 / 0.037411 (-0.005568)	0.089968 / 0.014526 (0.075442)	0.101838 / 0.176557 (-0.074718)	0.164401 / 0.737135 (-0.572734)	0.103785 / 0.296338 (-0.192554)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.380486 / 0.215209 (0.165277)	3.798868 / 2.077655 (1.721213)	1.824645 / 1.504120 (0.320525)	1.660804 / 1.541195 (0.119610)	1.784793 / 1.468490 (0.316303)	0.487222 / 4.584777 (-4.097555)	3.560580 / 3.745712 (-0.185132)	5.392662 / 5.269862 (0.122800)	3.295327 / 4.565676 (-1.270350)	0.057699 / 0.424275 (-0.366576)	0.007559 / 0.007607 (-0.000048)	0.459655 / 0.226044 (0.233611)	4.587583 / 2.268929 (2.318654)	2.304845 / 55.444624 (-53.139779)	1.966433 / 6.876477 (-4.910044)	2.254591 / 2.142072 (0.112519)	0.582978 / 4.805227 (-4.222250)	0.133455 / 6.500664 (-6.367210)	0.061924 / 0.075469 (-0.013546)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.275685 / 1.841788 (-0.566103)	20.814545 / 8.074308 (12.740237)	13.753567 / 10.191392 (3.562175)	0.164076 / 0.680424 (-0.516348)	0.018768 / 0.534201 (-0.515433)	0.390991 / 0.579283 (-0.188293)	0.404417 / 0.434364 (-0.029947)	0.457522 / 0.540337 (-0.082815)	0.624654 / 1.386936 (-0.762282)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007435 / 0.011353 (-0.003918)	0.004255 / 0.011008 (-0.006754)	0.066134 / 0.038508 (0.027626)	0.086035 / 0.023109 (0.062925)	0.364688 / 0.275898 (0.088790)	0.403895 / 0.323480 (0.080415)	0.005868 / 0.007986 (-0.002117)	0.003634 / 0.004328 (-0.000694)	0.065803 / 0.004250 (0.061553)	0.065113 / 0.037052 (0.028061)	0.370057 / 0.258489 (0.111568)	0.412634 / 0.293841 (0.118793)	0.031660 / 0.128546 (-0.096886)	0.008699 / 0.075646 (-0.066947)	0.070618 / 0.419271 (-0.348654)	0.050814 / 0.043533 (0.007281)	0.362320 / 0.255139 (0.107181)	0.383863 / 0.283200 (0.100663)	0.027980 / 0.141683 (-0.113703)	1.486389 / 1.452155 (0.034234)	1.595534 / 1.492716 (0.102817)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.300991 / 0.018006 (0.282985)	0.565265 / 0.000490 (0.564775)	0.000400 / 0.000200 (0.000200)	0.000053 / 0.000054 (-0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034942 / 0.037411 (-0.002470)	0.092498 / 0.014526 (0.077972)	0.106737 / 0.176557 (-0.069819)	0.165400 / 0.737135 (-0.571735)	0.107809 / 0.296338 (-0.188529)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.412156 / 0.215209 (0.196947)	4.116747 / 2.077655 (2.039092)	2.199612 / 1.504120 (0.695492)	2.049310 / 1.541195 (0.508115)	2.174342 / 1.468490 (0.705852)	0.482794 / 4.584777 (-4.101983)	3.561344 / 3.745712 (-0.184368)	3.465935 / 5.269862 (-1.803926)	2.076595 / 4.565676 (-2.489081)	0.056242 / 0.424275 (-0.368033)	0.007371 / 0.007607 (-0.000236)	0.489135 / 0.226044 (0.263091)	4.895691 / 2.268929 (2.626763)	2.626936 / 55.444624 (-52.817688)	2.306658 / 6.876477 (-4.569818)	2.421705 / 2.142072 (0.279633)	0.599547 / 4.805227 (-4.205680)	0.133627 / 6.500664 (-6.367037)	0.063830 / 0.075469 (-0.011639)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.383039 / 1.841788 (-0.458748)	21.005346 / 8.074308 (12.931038)	14.911083 / 10.191392 (4.719691)	0.190995 / 0.680424 (-0.489429)	0.018510 / 0.534201 (-0.515691)	0.396346 / 0.579283 (-0.182937)	0.411496 / 0.434364 (-0.022868)	0.470972 / 0.540337 (-0.069366)	0.615670 / 1.386936 (-0.771266)

github-actions · 2023-07-18T16:02:50Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007249 / 0.011353 (-0.004104)	0.004261 / 0.011008 (-0.006747)	0.100645 / 0.038508 (0.062137)	0.078522 / 0.023109 (0.055413)	0.423526 / 0.275898 (0.147628)	0.439541 / 0.323480 (0.116061)	0.005812 / 0.007986 (-0.002173)	0.003615 / 0.004328 (-0.000713)	0.075908 / 0.004250 (0.071658)	0.062490 / 0.037052 (0.025437)	0.414941 / 0.258489 (0.156452)	0.447267 / 0.293841 (0.153426)	0.035127 / 0.128546 (-0.093419)	0.009642 / 0.075646 (-0.066004)	0.354093 / 0.419271 (-0.065179)	0.060970 / 0.043533 (0.017437)	0.418579 / 0.255139 (0.163440)	0.427972 / 0.283200 (0.144772)	0.025838 / 0.141683 (-0.115845)	1.778349 / 1.452155 (0.326194)	1.845965 / 1.492716 (0.353249)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227304 / 0.018006 (0.209298)	0.571833 / 0.000490 (0.571343)	0.001328 / 0.000200 (0.001128)	0.000071 / 0.000054 (0.000017)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031343 / 0.037411 (-0.006068)	0.096400 / 0.014526 (0.081875)	0.106881 / 0.176557 (-0.069676)	0.175449 / 0.737135 (-0.561686)	0.108751 / 0.296338 (-0.187588)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.480204 / 0.215209 (0.264995)	4.622063 / 2.077655 (2.544408)	2.211505 / 1.504120 (0.707385)	2.065154 / 1.541195 (0.523959)	2.159446 / 1.468490 (0.690956)	0.584571 / 4.584777 (-4.000206)	4.392449 / 3.745712 (0.646737)	4.790166 / 5.269862 (-0.479695)	2.840615 / 4.565676 (-1.725062)	0.070845 / 0.424275 (-0.353430)	0.009112 / 0.007607 (0.001505)	0.580251 / 0.226044 (0.354207)	5.660311 / 2.268929 (3.391382)	2.836136 / 55.444624 (-52.608489)	2.412859 / 6.876477 (-4.463618)	2.556710 / 2.142072 (0.414637)	0.691946 / 4.805227 (-4.113282)	0.160123 / 6.500664 (-6.340541)	0.072593 / 0.075469 (-0.002876)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.547339 / 1.841788 (-0.294448)	21.724793 / 8.074308 (13.650485)	16.315304 / 10.191392 (6.123912)	0.188733 / 0.680424 (-0.491690)	0.022109 / 0.534201 (-0.512092)	0.481623 / 0.579283 (-0.097660)	0.464316 / 0.434364 (0.029952)	0.557953 / 0.540337 (0.017615)	0.756023 / 1.386936 (-0.630913)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008637 / 0.011353 (-0.002716)	0.005286 / 0.011008 (-0.005723)	0.091387 / 0.038508 (0.052879)	0.114092 / 0.023109 (0.090983)	0.457547 / 0.275898 (0.181649)	0.506878 / 0.323480 (0.183398)	0.006849 / 0.007986 (-0.001137)	0.004255 / 0.004328 (-0.000073)	0.079556 / 0.004250 (0.075306)	0.077729 / 0.037052 (0.040677)	0.454094 / 0.258489 (0.195605)	0.515812 / 0.293841 (0.221971)	0.038271 / 0.128546 (-0.090275)	0.010110 / 0.075646 (-0.065536)	0.094254 / 0.419271 (-0.325017)	0.065392 / 0.043533 (0.021860)	0.459749 / 0.255139 (0.204610)	0.489829 / 0.283200 (0.206629)	0.040393 / 0.141683 (-0.101290)	1.810414 / 1.452155 (0.358259)	1.913212 / 1.492716 (0.420496)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.236898 / 0.018006 (0.218891)	0.513118 / 0.000490 (0.512628)	0.004432 / 0.000200 (0.004232)	0.000115 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035074 / 0.037411 (-0.002337)	0.102384 / 0.014526 (0.087858)	0.117326 / 0.176557 (-0.059231)	0.182596 / 0.737135 (-0.554539)	0.116384 / 0.296338 (-0.179955)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.514544 / 0.215209 (0.299335)	5.152930 / 2.077655 (3.075275)	2.624477 / 1.504120 (1.120357)	2.363209 / 1.541195 (0.822014)	2.436060 / 1.468490 (0.967570)	0.592523 / 4.584777 (-3.992254)	4.209668 / 3.745712 (0.463956)	6.284372 / 5.269862 (1.014511)	3.667303 / 4.565676 (-0.898374)	0.067017 / 0.424275 (-0.357259)	0.008607 / 0.007607 (0.001000)	0.600840 / 0.226044 (0.374796)	5.992630 / 2.268929 (3.723701)	3.114532 / 55.444624 (-52.330093)	2.693242 / 6.876477 (-4.183235)	2.767187 / 2.142072 (0.625115)	0.687591 / 4.805227 (-4.117636)	0.158477 / 6.500664 (-6.342187)	0.075504 / 0.075469 (0.000034)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.605039 / 1.841788 (-0.236749)	21.524730 / 8.074308 (13.450422)	17.014643 / 10.191392 (6.823251)	0.201580 / 0.680424 (-0.478843)	0.023028 / 0.534201 (-0.511173)	0.483801 / 0.579283 (-0.095482)	0.490221 / 0.434364 (0.055857)	0.589292 / 0.540337 (0.048955)	0.758532 / 1.386936 (-0.628404)

polinaeterna

thank you! makes more sense than "patterns". left just some small text suggestions

docs/source/repository_structure.mdx

src/datasets/data_files.py

Co-authored-by: Polina Kazakova <polina@huggingface.co>

github-actions · 2023-07-18T16:31:39Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008080 / 0.011353 (-0.003273)	0.004859 / 0.011008 (-0.006149)	0.101895 / 0.038508 (0.063387)	0.091168 / 0.023109 (0.068059)	0.378914 / 0.275898 (0.103016)	0.417172 / 0.323480 (0.093692)	0.006314 / 0.007986 (-0.001672)	0.004069 / 0.004328 (-0.000259)	0.076566 / 0.004250 (0.072315)	0.070986 / 0.037052 (0.033934)	0.380935 / 0.258489 (0.122446)	0.417131 / 0.293841 (0.123290)	0.036343 / 0.128546 (-0.092203)	0.009996 / 0.075646 (-0.065650)	0.346386 / 0.419271 (-0.072886)	0.063162 / 0.043533 (0.019630)	0.372620 / 0.255139 (0.117481)	0.404902 / 0.283200 (0.121702)	0.028217 / 0.141683 (-0.113466)	1.793875 / 1.452155 (0.341721)	1.836284 / 1.492716 (0.343568)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223830 / 0.018006 (0.205823)	0.503643 / 0.000490 (0.503153)	0.004957 / 0.000200 (0.004757)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035455 / 0.037411 (-0.001957)	0.108015 / 0.014526 (0.093489)	0.116887 / 0.176557 (-0.059669)	0.188174 / 0.737135 (-0.548961)	0.117217 / 0.296338 (-0.179121)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.471681 / 0.215209 (0.256472)	4.694509 / 2.077655 (2.616855)	2.369539 / 1.504120 (0.865419)	2.176839 / 1.541195 (0.635644)	2.300536 / 1.468490 (0.832045)	0.575689 / 4.584777 (-4.009088)	4.232765 / 3.745712 (0.487053)	4.766775 / 5.269862 (-0.503087)	2.864667 / 4.565676 (-1.701010)	0.069390 / 0.424275 (-0.354885)	0.008822 / 0.007607 (0.001214)	0.559620 / 0.226044 (0.333576)	5.580401 / 2.268929 (3.311472)	2.920293 / 55.444624 (-52.524331)	2.552166 / 6.876477 (-4.324311)	2.795890 / 2.142072 (0.653818)	0.687863 / 4.805227 (-4.117364)	0.159129 / 6.500664 (-6.341535)	0.073475 / 0.075469 (-0.001994)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.505892 / 1.841788 (-0.335896)	24.127650 / 8.074308 (16.053342)	16.758238 / 10.191392 (6.566846)	0.200555 / 0.680424 (-0.479869)	0.021596 / 0.534201 (-0.512605)	0.480668 / 0.579283 (-0.098615)	0.483528 / 0.434364 (0.049164)	0.571241 / 0.540337 (0.030903)	0.790547 / 1.386936 (-0.596390)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007997 / 0.011353 (-0.003356)	0.004842 / 0.011008 (-0.006166)	0.077190 / 0.038508 (0.038681)	0.092765 / 0.023109 (0.069656)	0.457475 / 0.275898 (0.181577)	0.523914 / 0.323480 (0.200434)	0.006349 / 0.007986 (-0.001637)	0.003902 / 0.004328 (-0.000427)	0.075860 / 0.004250 (0.071609)	0.069708 / 0.037052 (0.032656)	0.459612 / 0.258489 (0.201123)	0.555028 / 0.293841 (0.261187)	0.036854 / 0.128546 (-0.091692)	0.010078 / 0.075646 (-0.065568)	0.083871 / 0.419271 (-0.335400)	0.061221 / 0.043533 (0.017689)	0.435737 / 0.255139 (0.180598)	0.509700 / 0.283200 (0.226500)	0.038091 / 0.141683 (-0.103592)	1.777161 / 1.452155 (0.325006)	1.859603 / 1.492716 (0.366886)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.250020 / 0.018006 (0.232014)	0.486198 / 0.000490 (0.485708)	0.007080 / 0.000200 (0.006880)	0.000114 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.038163 / 0.037411 (0.000751)	0.110812 / 0.014526 (0.096286)	0.122489 / 0.176557 (-0.054068)	0.188215 / 0.737135 (-0.548920)	0.122375 / 0.296338 (-0.173963)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.484534 / 0.215209 (0.269325)	4.828654 / 2.077655 (2.751000)	2.545102 / 1.504120 (1.040982)	2.368867 / 1.541195 (0.827672)	2.458042 / 1.468490 (0.989552)	0.576372 / 4.584777 (-4.008404)	4.814033 / 3.745712 (1.068321)	6.175972 / 5.269862 (0.906110)	4.033422 / 4.565676 (-0.532254)	0.068544 / 0.424275 (-0.355731)	0.008906 / 0.007607 (0.001299)	0.581767 / 0.226044 (0.355723)	5.808623 / 2.268929 (3.539695)	3.120312 / 55.444624 (-52.324313)	2.774834 / 6.876477 (-4.101642)	2.770413 / 2.142072 (0.628340)	0.692715 / 4.805227 (-4.112512)	0.158883 / 6.500664 (-6.341782)	0.075894 / 0.075469 (0.000425)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.631250 / 1.841788 (-0.210538)	24.693250 / 8.074308 (16.618942)	17.434790 / 10.191392 (7.243398)	0.196456 / 0.680424 (-0.483968)	0.022505 / 0.534201 (-0.511696)	0.474788 / 0.579283 (-0.104495)	0.500947 / 0.434364 (0.066583)	0.553596 / 0.540337 (0.013259)	0.737767 / 1.386936 (-0.649169)

github-actions · 2023-07-19T16:06:56Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006629 / 0.011353 (-0.004724)	0.004115 / 0.011008 (-0.006894)	0.083934 / 0.038508 (0.045426)	0.074952 / 0.023109 (0.051843)	0.313069 / 0.275898 (0.037171)	0.345878 / 0.323480 (0.022398)	0.006034 / 0.007986 (-0.001952)	0.003413 / 0.004328 (-0.000916)	0.065130 / 0.004250 (0.060880)	0.057363 / 0.037052 (0.020310)	0.314483 / 0.258489 (0.055994)	0.352626 / 0.293841 (0.058785)	0.031325 / 0.128546 (-0.097221)	0.008577 / 0.075646 (-0.067069)	0.288137 / 0.419271 (-0.131135)	0.053651 / 0.043533 (0.010118)	0.313006 / 0.255139 (0.057867)	0.338668 / 0.283200 (0.055468)	0.023709 / 0.141683 (-0.117974)	1.481209 / 1.452155 (0.029054)	1.559801 / 1.492716 (0.067085)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.211543 / 0.018006 (0.193537)	0.452185 / 0.000490 (0.451696)	0.003177 / 0.000200 (0.002977)	0.000078 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028821 / 0.037411 (-0.008591)	0.083290 / 0.014526 (0.068765)	0.097478 / 0.176557 (-0.079079)	0.153506 / 0.737135 (-0.583629)	0.097054 / 0.296338 (-0.199284)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.385847 / 0.215209 (0.170638)	3.835629 / 2.077655 (1.757974)	1.880938 / 1.504120 (0.376819)	1.711848 / 1.541195 (0.170653)	1.785099 / 1.468490 (0.316609)	0.486256 / 4.584777 (-4.098521)	3.629026 / 3.745712 (-0.116686)	3.321578 / 5.269862 (-1.948283)	2.024314 / 4.565676 (-2.541363)	0.058097 / 0.424275 (-0.366179)	0.007724 / 0.007607 (0.000117)	0.458293 / 0.226044 (0.232249)	4.581314 / 2.268929 (2.312386)	2.314379 / 55.444624 (-53.130246)	1.966089 / 6.876477 (-4.910387)	2.203824 / 2.142072 (0.061752)	0.611581 / 4.805227 (-4.193647)	0.149166 / 6.500664 (-6.351498)	0.059825 / 0.075469 (-0.015644)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.235546 / 1.841788 (-0.606242)	19.747439 / 8.074308 (11.673131)	14.628383 / 10.191392 (4.436991)	0.193074 / 0.680424 (-0.487350)	0.020327 / 0.534201 (-0.513874)	0.397051 / 0.579283 (-0.182232)	0.418491 / 0.434364 (-0.015873)	0.462055 / 0.540337 (-0.078282)	0.637524 / 1.386936 (-0.749412)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007069 / 0.011353 (-0.004284)	0.004106 / 0.011008 (-0.006902)	0.065818 / 0.038508 (0.027310)	0.077101 / 0.023109 (0.053991)	0.363323 / 0.275898 (0.087425)	0.399463 / 0.323480 (0.075983)	0.005540 / 0.007986 (-0.002446)	0.003480 / 0.004328 (-0.000849)	0.065176 / 0.004250 (0.060926)	0.060867 / 0.037052 (0.023815)	0.365763 / 0.258489 (0.107273)	0.407789 / 0.293841 (0.113949)	0.032018 / 0.128546 (-0.096528)	0.008550 / 0.075646 (-0.067096)	0.071750 / 0.419271 (-0.347521)	0.050625 / 0.043533 (0.007092)	0.361434 / 0.255139 (0.106295)	0.384799 / 0.283200 (0.101599)	0.026104 / 0.141683 (-0.115579)	1.496093 / 1.452155 (0.043938)	1.592909 / 1.492716 (0.100193)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.185794 / 0.018006 (0.167787)	0.453379 / 0.000490 (0.452890)	0.004365 / 0.000200 (0.004165)	0.000092 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031666 / 0.037411 (-0.005746)	0.088323 / 0.014526 (0.073798)	0.104602 / 0.176557 (-0.071954)	0.159827 / 0.737135 (-0.577308)	0.103725 / 0.296338 (-0.192614)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.413509 / 0.215209 (0.198300)	4.126071 / 2.077655 (2.048416)	2.137088 / 1.504120 (0.632968)	1.981034 / 1.541195 (0.439839)	2.063660 / 1.468490 (0.595170)	0.478798 / 4.584777 (-4.105979)	3.642801 / 3.745712 (-0.102911)	3.428994 / 5.269862 (-1.840867)	2.031902 / 4.565676 (-2.533774)	0.056244 / 0.424275 (-0.368032)	0.007365 / 0.007607 (-0.000242)	0.484371 / 0.226044 (0.258327)	4.838537 / 2.268929 (2.569608)	2.559497 / 55.444624 (-52.885127)	2.251863 / 6.876477 (-4.624614)	2.339227 / 2.142072 (0.197155)	0.607228 / 4.805227 (-4.198000)	0.133877 / 6.500664 (-6.366787)	0.062049 / 0.075469 (-0.013420)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.350389 / 1.841788 (-0.491399)	20.060359 / 8.074308 (11.986051)	14.305675 / 10.191392 (4.114283)	0.165642 / 0.680424 (-0.514782)	0.018206 / 0.534201 (-0.515994)	0.396907 / 0.579283 (-0.182376)	0.431896 / 0.434364 (-0.002468)	0.475778 / 0.540337 (-0.064559)	0.644688 / 1.386936 (-0.742248)

github-actions · 2023-07-19T16:59:54Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009048 / 0.011353 (-0.002305)	0.005787 / 0.011008 (-0.005221)	0.111617 / 0.038508 (0.073109)	0.087603 / 0.023109 (0.064494)	0.446481 / 0.275898 (0.170583)	0.491726 / 0.323480 (0.168247)	0.007052 / 0.007986 (-0.000934)	0.004481 / 0.004328 (0.000152)	0.084331 / 0.004250 (0.080081)	0.072006 / 0.037052 (0.034953)	0.454238 / 0.258489 (0.195749)	0.496749 / 0.293841 (0.202908)	0.049027 / 0.128546 (-0.079520)	0.014005 / 0.075646 (-0.061641)	0.372550 / 0.419271 (-0.046722)	0.071414 / 0.043533 (0.027881)	0.459432 / 0.255139 (0.204293)	0.467332 / 0.283200 (0.184133)	0.037539 / 0.141683 (-0.104144)	1.869179 / 1.452155 (0.417024)	1.983641 / 1.492716 (0.490925)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.265426 / 0.018006 (0.247419)	0.672527 / 0.000490 (0.672037)	0.001152 / 0.000200 (0.000953)	0.000181 / 0.000054 (0.000127)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032967 / 0.037411 (-0.004445)	0.103023 / 0.014526 (0.088497)	0.115978 / 0.176557 (-0.060578)	0.191698 / 0.737135 (-0.545438)	0.117867 / 0.296338 (-0.178471)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.602208 / 0.215209 (0.386999)	6.147784 / 2.077655 (4.070129)	2.768933 / 1.504120 (1.264813)	2.415619 / 1.541195 (0.874424)	2.456159 / 1.468490 (0.987669)	0.836270 / 4.584777 (-3.748507)	5.447754 / 3.745712 (1.702042)	7.751825 / 5.269862 (2.481963)	4.591892 / 4.565676 (0.026215)	0.108269 / 0.424275 (-0.316006)	0.009626 / 0.007607 (0.002019)	0.719260 / 0.226044 (0.493216)	7.313442 / 2.268929 (5.044514)	3.490739 / 55.444624 (-51.953885)	2.743543 / 6.876477 (-4.132934)	3.035071 / 2.142072 (0.892999)	1.042791 / 4.805227 (-3.762436)	0.217080 / 6.500664 (-6.283584)	0.084286 / 0.075469 (0.008817)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.655427 / 1.841788 (-0.186361)	25.386536 / 8.074308 (17.312228)	21.740666 / 10.191392 (11.549274)	0.246388 / 0.680424 (-0.434036)	0.029723 / 0.534201 (-0.504478)	0.491537 / 0.579283 (-0.087746)	0.603495 / 0.434364 (0.169131)	0.573938 / 0.540337 (0.033600)	0.981875 / 1.386936 (-0.405061)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009664 / 0.011353 (-0.001689)	0.006446 / 0.011008 (-0.004562)	0.085113 / 0.038508 (0.046605)	0.094533 / 0.023109 (0.071424)	0.498388 / 0.275898 (0.222490)	0.540127 / 0.323480 (0.216647)	0.007316 / 0.007986 (-0.000670)	0.004252 / 0.004328 (-0.000077)	0.086292 / 0.004250 (0.082041)	0.067956 / 0.037052 (0.030903)	0.507664 / 0.258489 (0.249175)	0.554324 / 0.293841 (0.260483)	0.050107 / 0.128546 (-0.078439)	0.014277 / 0.075646 (-0.061370)	0.098838 / 0.419271 (-0.320433)	0.066053 / 0.043533 (0.022521)	0.491090 / 0.255139 (0.235951)	0.537432 / 0.283200 (0.254232)	0.035937 / 0.141683 (-0.105746)	1.820715 / 1.452155 (0.368561)	1.996268 / 1.492716 (0.503552)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.300859 / 0.018006 (0.282852)	0.610958 / 0.000490 (0.610468)	0.000474 / 0.000200 (0.000274)	0.000098 / 0.000054 (0.000044)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036372 / 0.037411 (-0.001039)	0.109115 / 0.014526 (0.094589)	0.122802 / 0.176557 (-0.053755)	0.187092 / 0.737135 (-0.550044)	0.123432 / 0.296338 (-0.172906)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.646979 / 0.215209 (0.431770)	6.577713 / 2.077655 (4.500058)	3.004606 / 1.504120 (1.500486)	2.661183 / 1.541195 (1.119989)	2.726717 / 1.468490 (1.258227)	0.889497 / 4.584777 (-3.695280)	5.485055 / 3.745712 (1.739343)	4.852043 / 5.269862 (-0.417819)	3.177392 / 4.565676 (-1.388285)	0.099796 / 0.424275 (-0.324479)	0.009868 / 0.007607 (0.002261)	0.819919 / 0.226044 (0.593874)	7.911255 / 2.268929 (5.642326)	3.839877 / 55.444624 (-51.604747)	3.088663 / 6.876477 (-3.787813)	3.371184 / 2.142072 (1.229112)	1.072762 / 4.805227 (-3.732466)	0.224536 / 6.500664 (-6.276128)	0.083415 / 0.075469 (0.007946)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.754426 / 1.841788 (-0.087361)	25.546690 / 8.074308 (17.472382)	22.998252 / 10.191392 (12.806860)	0.258019 / 0.680424 (-0.422405)	0.030104 / 0.534201 (-0.504097)	0.518406 / 0.579283 (-0.060877)	0.605753 / 0.434364 (0.171389)	0.599630 / 0.540337 (0.059292)	0.819042 / 1.386936 (-0.567894)
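
For anyone reading the benchmark rows above: each cell is printed as `new / old (diff)`, and the printed diff agrees with new minus old up to rounding in the last digit. A minimal sketch for parsing one row, assuming the cells are tab-separated as they appear here; `parse_row` and the `CELL` pattern are hypothetical helpers written for this illustration, not part of the benchmark tooling. The row is copied from the `benchmark_getitem_100B.json` table under `PyArrow==latest`.

```python
import re

# One cell looks like "0.300859 / 0.018006 (0.282852)".
CELL = re.compile(r"([-\d.]+) / ([-\d.]+) \(([-\d.]+)\)")


def parse_row(row: str):
    """Parse a 'new / old (diff)' row into a list of (new, old, diff) float tuples."""
    cells = row.split("\t")[1:]  # drop the leading "new / old (diff)" label
    return [
        tuple(float(group) for group in CELL.fullmatch(cell.strip()).groups())
        for cell in cells
    ]


names = [
    "get_batch_of_1024_random_rows",
    "get_batch_of_1024_rows",
    "get_first_row",
    "get_last_row",
]
row = (
    "new / old (diff)\t0.300859 / 0.018006 (0.282852)\t"
    "0.610958 / 0.000490 (0.610468)\t0.000474 / 0.000200 (0.000274)\t"
    "0.000098 / 0.000054 (0.000044)"
)

results = dict(zip(names, parse_row(row)))
for name, (new, old, diff) in results.items():
    # The printed diff matches new - old up to rounding of the last digit.
    assert abs((new - old) - diff) < 2e-6
    print(f"{name}: new={new} old={old} diff={diff:+.6f}")
```

A positive diff means the new number is larger than the old one; the rows themselves do not state units or which run produced the old values, so the sketch makes no assumption about either.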

lhoestq added 2 commits July 17, 2023 17:26

rename pattern to path

d8da2e7

docs

5be59be

better _raise_if_data_files_field_not_valid

4904f14

stevhliu approved these changes Jul 18, 2023

docs/source/repository_structure.mdx (outdated, resolved)

docs/source/repository_structure.mdx (outdated, resolved)

Apply suggestions from code review

6ea38fc

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

fix check

d7298d4

fix

d6d2ba4

lhoestq marked this pull request as ready for review July 18, 2023 15:20

lhoestq requested a review from polinaeterna July 18, 2023 15:20

only "path" (removed plural)

8c9c24d

polinaeterna approved these changes Jul 18, 2023

docs/source/repository_structure.mdx (outdated, resolved)

src/datasets/data_files.py (outdated, resolved)

Apply suggestions from code review

f87d6e6

Co-authored-by: Polina Kazakova <polina@huggingface.co>

style

8f6fa96

lhoestq mentioned this pull request Jul 19, 2023

Remove HfFileSystem and deprecate S3FileSystem #6052

Merged

lhoestq merged commit 350f4fd into main Jul 19, 2023
13 checks passed

lhoestq deleted the rename-pattern-to-path branch July 19, 2023 16:48
