Use new hffs #6028

lhoestq · 2023-07-13T15:41:44Z

Thanks to @janineguo for the work in #5919, which was needed to support HfFileSystem.

Switching to HfFileSystem will help implement optimizations in data files resolution.

Implementation details

I replaced every from_hf_repo and from_local_or_remote in data_files.py with a single new from_patterns, which works for any fsspec path, including hf:// paths, https:// URLs and local paths. This simplifies the codebase, since the data files resolution logic is no longer duplicated.
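To illustrate why a single entry point removes the duplication, here is a minimal sketch (not the code from this PR). The names split_protocol, resolve_pattern and the listers mapping are hypothetical; the real from_patterns relies on fsspec filesystems, whereas this sketch uses in-memory listings and fnmatch so it runs with the standard library only. Note that fnmatch lets * match across "/", which real glob resolution may not.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple


def split_protocol(path: str) -> Tuple[str, str]:
    """Return (protocol, rest); paths without '://' are treated as local files."""
    if "://" in path:
        protocol, rest = path.split("://", 1)
        return protocol, rest
    return "file", path


def resolve_pattern(pattern: str, listers: Dict[str, Callable[[], List[str]]]) -> List[str]:
    """Resolve a glob pattern with the same logic for every protocol.

    `listers` maps a protocol name to a function returning candidate paths
    (hypothetical stand-in for a filesystem listing).
    """
    protocol, glob = split_protocol(pattern)
    candidates = listers[protocol]()
    # Local paths keep no prefix; remote paths get their protocol back.
    prefix = "" if protocol == "file" else f"{protocol}://"
    return sorted(prefix + name for name in candidates if fnmatchcase(name, glob))


# Hypothetical listings standing in for a Hub repository and a local folder.
listers = {
    "hf": lambda: ["datasets/user/repo/train.csv", "datasets/user/repo/README.md"],
    "file": lambda: ["data/train.csv", "data/test.csv"],
}
hub_files = resolve_pattern("hf://datasets/user/repo/*.csv", listers)
local_files = resolve_pattern("data/*.csv", listers)
```

The only protocol-specific part is the listing function; the matching and sorting are shared, which is the property the refactoring aims for.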

I added _prepare_path_and_storage_options which returns the right storage_options to use given a path and a DownloadConfig. This is the only place where the logic depends on the filesystem type that must be used.
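As a rough sketch of the idea (not the actual implementation), a helper of this kind could look as follows. SketchDownloadConfig and sketch_prepare_path_and_storage_options are hypothetical names, the config is reduced to a single token field, and the option keys chosen per protocol are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class SketchDownloadConfig:
    """Hypothetical stand-in for DownloadConfig; only a token is modelled."""
    token: Optional[str] = None


def sketch_prepare_path_and_storage_options(
    path: str, config: SketchDownloadConfig
) -> Tuple[str, Dict[str, Any]]:
    """Return the path and the storage options for its filesystem type.

    This is the single place that branches on the protocol.
    """
    protocol = path.split("://", 1)[0] if "://" in path else "file"
    if protocol == "hf":
        # Assumption: the Hub filesystem receives the token directly.
        options: Dict[str, Any] = {"token": config.token}
    elif protocol in ("http", "https") and config.token:
        # Assumption: HTTP access passes the token as a bearer header.
        options = {"headers": {"authorization": f"Bearer {config.token}"}}
    else:
        # Local paths (and anonymous HTTP) need no extra options.
        options = {}
    return path, options


_, hf_options = sketch_prepare_path_and_storage_options(
    "hf://datasets/user/repo/train.csv", SketchDownloadConfig(token="abc")
)
_, local_options = sketch_prepare_path_and_storage_options(
    "data/train.csv", SketchDownloadConfig()
)
```

Callers then pass the returned options unchanged to whatever opens the path, so they never need to know which filesystem is involved.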

I also removed the get_metadata_data_files_list and get_patterns_and_data_files functions added recently, since data files resolution is now handled using a common interface.

New features

hf:// paths are now supported in data_files

Breaking changes

DataFilesList and DataFilesDict:

use str paths instead of Union[Path, Url]
require POSIX-style paths (forward slashes) for Windows paths
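For callers affected by this change, converting a Windows path to a POSIX-style str can be done with the standard library. The helper name to_posix_str below is hypothetical and only illustrates the conversion; it is not part of the library.

```python
from pathlib import PureWindowsPath


def to_posix_str(windows_path: str) -> str:
    """Convert a Windows-style path to a str with forward slashes."""
    return PureWindowsPath(windows_path).as_posix()


absolute = to_posix_str(r"C:\data\train.csv")   # "C:/data/train.csv"
relative = to_posix_str(r"data\train.csv")      # "data/train.csv"
```

PureWindowsPath is used rather than Path so the conversion behaves the same on any operating system.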

close #6017

HuggingFaceDocBuilderDev · 2023-07-13T15:48:11Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-07-13T15:50:24Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006665 / 0.011353 (-0.004688)	0.004376 / 0.011008 (-0.006633)	0.085529 / 0.038508 (0.047021)	0.076372 / 0.023109 (0.053263)	0.310019 / 0.275898 (0.034121)	0.341404 / 0.323480 (0.017924)	0.005666 / 0.007986 (-0.002320)	0.003763 / 0.004328 (-0.000566)	0.064678 / 0.004250 (0.060427)	0.059283 / 0.037052 (0.022231)	0.316194 / 0.258489 (0.057704)	0.349397 / 0.293841 (0.055557)	0.031199 / 0.128546 (-0.097347)	0.008724 / 0.075646 (-0.066923)	0.300236 / 0.419271 (-0.119035)	0.068872 / 0.043533 (0.025339)	0.308521 / 0.255139 (0.053382)	0.331292 / 0.283200 (0.048092)	0.028236 / 0.141683 (-0.113447)	1.501365 / 1.452155 (0.049211)	1.554334 / 1.492716 (0.061618)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.238291 / 0.018006 (0.220285)	0.565069 / 0.000490 (0.564580)	0.001626 / 0.000200 (0.001426)	0.000070 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029777 / 0.037411 (-0.007634)	0.082873 / 0.014526 (0.068347)	0.099619 / 0.176557 (-0.076937)	0.156572 / 0.737135 (-0.580563)	0.099887 / 0.296338 (-0.196452)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.401017 / 0.215209 (0.185808)	3.827192 / 2.077655 (1.749537)	1.861554 / 1.504120 (0.357434)	1.699869 / 1.541195 (0.158674)	1.720043 / 1.468490 (0.251553)	0.486757 / 4.584777 (-4.098020)	3.638125 / 3.745712 (-0.107587)	5.844959 / 5.269862 (0.575097)	3.454901 / 4.565676 (-1.110775)	0.057650 / 0.424275 (-0.366625)	0.007341 / 0.007607 (-0.000266)	0.462698 / 0.226044 (0.236654)	4.633472 / 2.268929 (2.364544)	2.287607 / 55.444624 (-53.157017)	2.057318 / 6.876477 (-4.819159)	2.203657 / 2.142072 (0.061584)	0.598136 / 4.805227 (-4.207091)	0.134012 / 6.500664 (-6.366653)	0.060824 / 0.075469 (-0.014645)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.277752 / 1.841788 (-0.564036)	20.013398 / 8.074308 (11.939089)	14.372993 / 10.191392 (4.181601)	0.169991 / 0.680424 (-0.510433)	0.018344 / 0.534201 (-0.515857)	0.396985 / 0.579283 (-0.182299)	0.416289 / 0.434364 (-0.018075)	0.458658 / 0.540337 (-0.081680)	0.692980 / 1.386936 (-0.693956)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006689 / 0.011353 (-0.004664)	0.004393 / 0.011008 (-0.006615)	0.064069 / 0.038508 (0.025561)	0.080717 / 0.023109 (0.057607)	0.370090 / 0.275898 (0.094191)	0.400432 / 0.323480 (0.076952)	0.005613 / 0.007986 (-0.002372)	0.003641 / 0.004328 (-0.000687)	0.064771 / 0.004250 (0.060520)	0.057555 / 0.037052 (0.020502)	0.392156 / 0.258489 (0.133667)	0.409842 / 0.293841 (0.116001)	0.031500 / 0.128546 (-0.097047)	0.008786 / 0.075646 (-0.066860)	0.070342 / 0.419271 (-0.348929)	0.048646 / 0.043533 (0.005113)	0.360914 / 0.255139 (0.105775)	0.387626 / 0.283200 (0.104426)	0.022787 / 0.141683 (-0.118896)	1.508915 / 1.452155 (0.056761)	1.539719 / 1.492716 (0.047002)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.257985 / 0.018006 (0.239979)	0.550990 / 0.000490 (0.550501)	0.000407 / 0.000200 (0.000207)	0.000057 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030183 / 0.037411 (-0.007228)	0.086882 / 0.014526 (0.072356)	0.102382 / 0.176557 (-0.074175)	0.154745 / 0.737135 (-0.582390)	0.104008 / 0.296338 (-0.192331)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426284 / 0.215209 (0.211075)	4.240812 / 2.077655 (2.163158)	2.261240 / 1.504120 (0.757120)	2.085905 / 1.541195 (0.544710)	2.160374 / 1.468490 (0.691883)	0.481126 / 4.584777 (-4.103651)	3.516234 / 3.745712 (-0.229478)	3.325322 / 5.269862 (-1.944539)	2.043307 / 4.565676 (-2.522369)	0.056663 / 0.424275 (-0.367612)	0.007786 / 0.007607 (0.000179)	0.497614 / 0.226044 (0.271570)	4.974529 / 2.268929 (2.705600)	2.700018 / 55.444624 (-52.744606)	2.393778 / 6.876477 (-4.482699)	2.628202 / 2.142072 (0.486130)	0.594316 / 4.805227 (-4.210911)	0.147092 / 6.500664 (-6.353572)	0.062207 / 0.075469 (-0.013262)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.315676 / 1.841788 (-0.526112)	20.749251 / 8.074308 (12.674943)	14.371553 / 10.191392 (4.180160)	0.170249 / 0.680424 (-0.510175)	0.018478 / 0.534201 (-0.515722)	0.395710 / 0.579283 (-0.183573)	0.409706 / 0.434364 (-0.024658)	0.463454 / 0.540337 (-0.076884)	0.615657 / 1.386936 (-0.771279)

github-actions · 2023-07-13T17:06:23Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007224 / 0.011353 (-0.004129)	0.004506 / 0.011008 (-0.006503)	0.096729 / 0.038508 (0.058221)	0.082394 / 0.023109 (0.059284)	0.390954 / 0.275898 (0.115056)	0.416647 / 0.323480 (0.093167)	0.005894 / 0.007986 (-0.002092)	0.003756 / 0.004328 (-0.000572)	0.075800 / 0.004250 (0.071549)	0.062683 / 0.037052 (0.025631)	0.398959 / 0.258489 (0.140470)	0.436624 / 0.293841 (0.142783)	0.034650 / 0.128546 (-0.093896)	0.009655 / 0.075646 (-0.065991)	0.315761 / 0.419271 (-0.103511)	0.060957 / 0.043533 (0.017424)	0.385649 / 0.255139 (0.130510)	0.394022 / 0.283200 (0.110822)	0.024601 / 0.141683 (-0.117082)	1.729586 / 1.452155 (0.277431)	1.724153 / 1.492716 (0.231437)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.207070 / 0.018006 (0.189063)	0.466502 / 0.000490 (0.466012)	0.010739 / 0.000200 (0.010540)	0.000214 / 0.000054 (0.000160)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031633 / 0.037411 (-0.005779)	0.095345 / 0.014526 (0.080819)	0.105399 / 0.176557 (-0.071157)	0.174173 / 0.737135 (-0.562962)	0.104207 / 0.296338 (-0.192132)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.435312 / 0.215209 (0.220103)	4.265600 / 2.077655 (2.187946)	2.056500 / 1.504120 (0.552380)	1.848023 / 1.541195 (0.306828)	1.946156 / 1.468490 (0.477666)	0.557788 / 4.584777 (-4.026989)	4.070289 / 3.745712 (0.324577)	3.608027 / 5.269862 (-1.661835)	2.214556 / 4.565676 (-2.351121)	0.062623 / 0.424275 (-0.361652)	0.008083 / 0.007607 (0.000476)	0.491782 / 0.226044 (0.265738)	4.989963 / 2.268929 (2.721035)	2.575867 / 55.444624 (-52.868757)	2.208045 / 6.876477 (-4.668431)	2.364184 / 2.142072 (0.222112)	0.633925 / 4.805227 (-4.171302)	0.144323 / 6.500664 (-6.356341)	0.067505 / 0.075469 (-0.007965)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.467219 / 1.841788 (-0.374569)	22.334967 / 8.074308 (14.260659)	15.715747 / 10.191392 (5.524355)	0.175443 / 0.680424 (-0.504980)	0.026165 / 0.534201 (-0.508036)	0.490675 / 0.579283 (-0.088608)	0.509211 / 0.434364 (0.074847)	0.586303 / 0.540337 (0.045965)	0.785052 / 1.386936 (-0.601884)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007893 / 0.011353 (-0.003460)	0.004577 / 0.011008 (-0.006431)	0.075781 / 0.038508 (0.037273)	0.095492 / 0.023109 (0.072382)	0.433259 / 0.275898 (0.157361)	0.469386 / 0.323480 (0.145906)	0.006317 / 0.007986 (-0.001669)	0.003708 / 0.004328 (-0.000621)	0.074417 / 0.004250 (0.070167)	0.068605 / 0.037052 (0.031552)	0.448701 / 0.258489 (0.190212)	0.469131 / 0.293841 (0.175290)	0.036647 / 0.128546 (-0.091899)	0.010077 / 0.075646 (-0.065570)	0.082457 / 0.419271 (-0.336815)	0.063255 / 0.043533 (0.019722)	0.428144 / 0.255139 (0.173005)	0.451872 / 0.283200 (0.168672)	0.033953 / 0.141683 (-0.107730)	1.781752 / 1.452155 (0.329597)	1.869014 / 1.492716 (0.376297)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223596 / 0.018006 (0.205590)	0.470307 / 0.000490 (0.469818)	0.005059 / 0.000200 (0.004859)	0.000104 / 0.000054 (0.000049)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.038804 / 0.037411 (0.001393)	0.117879 / 0.014526 (0.103353)	0.140701 / 0.176557 (-0.035855)	0.194672 / 0.737135 (-0.542463)	0.132806 / 0.296338 (-0.163533)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.510109 / 0.215209 (0.294900)	4.729457 / 2.077655 (2.651803)	2.512113 / 1.504120 (1.007993)	2.302553 / 1.541195 (0.761358)	2.420462 / 1.468490 (0.951972)	0.531682 / 4.584777 (-4.053095)	4.061208 / 3.745712 (0.315496)	3.588542 / 5.269862 (-1.681320)	2.203187 / 4.565676 (-2.362489)	0.065791 / 0.424275 (-0.358484)	0.008839 / 0.007607 (0.001232)	0.562041 / 0.226044 (0.335997)	5.702340 / 2.268929 (3.433412)	3.127609 / 55.444624 (-52.317015)	2.823060 / 6.876477 (-4.053417)	2.898675 / 2.142072 (0.756603)	0.659589 / 4.805227 (-4.145638)	0.148798 / 6.500664 (-6.351866)	0.070787 / 0.075469 (-0.004682)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.478317 / 1.841788 (-0.363471)	21.995400 / 8.074308 (13.921092)	16.770729 / 10.191392 (6.579337)	0.226333 / 0.680424 (-0.454091)	0.021835 / 0.534201 (-0.512366)	0.460373 / 0.579283 (-0.118910)	0.479494 / 0.434364 (0.045130)	0.529470 / 0.540337 (-0.010868)	0.718066 / 1.386936 (-0.668870)

github-actions · 2023-07-13T17:24:35Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007824 / 0.011353 (-0.003529)	0.004601 / 0.011008 (-0.006407)	0.100025 / 0.038508 (0.061517)	0.096046 / 0.023109 (0.072936)	0.376226 / 0.275898 (0.100328)	0.410905 / 0.323480 (0.087425)	0.006048 / 0.007986 (-0.001938)	0.003817 / 0.004328 (-0.000511)	0.076624 / 0.004250 (0.072374)	0.066390 / 0.037052 (0.029338)	0.380098 / 0.258489 (0.121609)	0.413603 / 0.293841 (0.119762)	0.036546 / 0.128546 (-0.092001)	0.009881 / 0.075646 (-0.065765)	0.344338 / 0.419271 (-0.074934)	0.061882 / 0.043533 (0.018350)	0.368568 / 0.255139 (0.113429)	0.397133 / 0.283200 (0.113934)	0.027255 / 0.141683 (-0.114428)	1.795099 / 1.452155 (0.342945)	1.852443 / 1.492716 (0.359727)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.247436 / 0.018006 (0.229430)	0.494119 / 0.000490 (0.493629)	0.004359 / 0.000200 (0.004159)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034765 / 0.037411 (-0.002647)	0.104541 / 0.014526 (0.090015)	0.113898 / 0.176557 (-0.062659)	0.183634 / 0.737135 (-0.553501)	0.116423 / 0.296338 (-0.179916)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.458747 / 0.215209 (0.243538)	4.555740 / 2.077655 (2.478085)	2.217240 / 1.504120 (0.713121)	2.039879 / 1.541195 (0.498684)	2.088581 / 1.468490 (0.620091)	0.588063 / 4.584777 (-3.996714)	4.238226 / 3.745712 (0.492514)	4.768060 / 5.269862 (-0.501802)	2.857117 / 4.565676 (-1.708560)	0.068742 / 0.424275 (-0.355533)	0.008667 / 0.007607 (0.001059)	0.549294 / 0.226044 (0.323249)	5.464635 / 2.268929 (3.195706)	2.744435 / 55.444624 (-52.700189)	2.347660 / 6.876477 (-4.528816)	2.616816 / 2.142072 (0.474743)	0.703701 / 4.805227 (-4.101526)	0.159749 / 6.500664 (-6.340915)	0.071990 / 0.075469 (-0.003479)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.486599 / 1.841788 (-0.355188)	22.745438 / 8.074308 (14.671130)	16.822332 / 10.191392 (6.630940)	0.184730 / 0.680424 (-0.495694)	0.021267 / 0.534201 (-0.512934)	0.467108 / 0.579283 (-0.112176)	0.472674 / 0.434364 (0.038311)	0.548094 / 0.540337 (0.007756)	0.735885 / 1.386936 (-0.651051)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007746 / 0.011353 (-0.003607)	0.004585 / 0.011008 (-0.006423)	0.076943 / 0.038508 (0.038435)	0.087473 / 0.023109 (0.064363)	0.480099 / 0.275898 (0.204201)	0.495271 / 0.323480 (0.171791)	0.006348 / 0.007986 (-0.001638)	0.003902 / 0.004328 (-0.000426)	0.077586 / 0.004250 (0.073335)	0.066467 / 0.037052 (0.029415)	0.468741 / 0.258489 (0.210252)	0.506778 / 0.293841 (0.212937)	0.036877 / 0.128546 (-0.091669)	0.010102 / 0.075646 (-0.065545)	0.084419 / 0.419271 (-0.334852)	0.058721 / 0.043533 (0.015188)	0.453633 / 0.255139 (0.198494)	0.481171 / 0.283200 (0.197971)	0.028716 / 0.141683 (-0.112967)	1.853048 / 1.452155 (0.400893)	1.885847 / 1.492716 (0.393130)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.192136 / 0.018006 (0.174130)	0.484481 / 0.000490 (0.483991)	0.002951 / 0.000200 (0.002751)	0.000098 / 0.000054 (0.000044)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037949 / 0.037411 (0.000538)	0.108364 / 0.014526 (0.093838)	0.119542 / 0.176557 (-0.057014)	0.188542 / 0.737135 (-0.548593)	0.122011 / 0.296338 (-0.174327)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.483135 / 0.215209 (0.267926)	4.849715 / 2.077655 (2.772060)	2.497736 / 1.504120 (0.993616)	2.314243 / 1.541195 (0.773048)	2.412739 / 1.468490 (0.944249)	0.564137 / 4.584777 (-4.020639)	4.242273 / 3.745712 (0.496561)	6.337843 / 5.269862 (1.067982)	3.923250 / 4.565676 (-0.642426)	0.066464 / 0.424275 (-0.357811)	0.009217 / 0.007607 (0.001610)	0.575667 / 0.226044 (0.349623)	5.746187 / 2.268929 (3.477258)	3.069655 / 55.444624 (-52.374969)	2.674798 / 6.876477 (-4.201679)	2.956535 / 2.142072 (0.814463)	0.701043 / 4.805227 (-4.104185)	0.157241 / 6.500664 (-6.343423)	0.073175 / 0.075469 (-0.002294)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.609943 / 1.841788 (-0.231844)	23.478594 / 8.074308 (15.404286)	17.454437 / 10.191392 (7.263045)	0.186422 / 0.680424 (-0.494002)	0.021703 / 0.534201 (-0.512498)	0.471704 / 0.579283 (-0.107579)	0.480553 / 0.434364 (0.046189)	0.552881 / 0.540337 (0.012544)	0.722515 / 1.386936 (-0.664421)

github-actions · 2023-07-13T17:44:05Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007542 / 0.011353 (-0.003811)	0.004692 / 0.011008 (-0.006316)	0.099155 / 0.038508 (0.060647)	0.089365 / 0.023109 (0.066256)	0.370870 / 0.275898 (0.094972)	0.422152 / 0.323480 (0.098673)	0.006223 / 0.007986 (-0.001763)	0.003852 / 0.004328 (-0.000476)	0.075438 / 0.004250 (0.071188)	0.065973 / 0.037052 (0.028921)	0.381513 / 0.258489 (0.123024)	0.416196 / 0.293841 (0.122355)	0.035483 / 0.128546 (-0.093063)	0.009884 / 0.075646 (-0.065762)	0.341290 / 0.419271 (-0.077982)	0.060546 / 0.043533 (0.017014)	0.365101 / 0.255139 (0.109962)	0.391058 / 0.283200 (0.107859)	0.026325 / 0.141683 (-0.115358)	1.815168 / 1.452155 (0.363013)	1.834711 / 1.492716 (0.341994)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222177 / 0.018006 (0.204171)	0.501151 / 0.000490 (0.500662)	0.010202 / 0.000200 (0.010002)	0.000102 / 0.000054 (0.000048)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034043 / 0.037411 (-0.003368)	0.097884 / 0.014526 (0.083358)	0.114022 / 0.176557 (-0.062534)	0.186200 / 0.737135 (-0.550935)	0.115555 / 0.296338 (-0.180783)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.485857 / 0.215209 (0.270648)	4.959263 / 2.077655 (2.881608)	2.501085 / 1.504120 (0.996965)	2.234660 / 1.541195 (0.693465)	2.238585 / 1.468490 (0.770095)	0.645431 / 4.584777 (-3.939345)	4.434311 / 3.745712 (0.688599)	4.771491 / 5.269862 (-0.498371)	2.778963 / 4.565676 (-1.786714)	0.075615 / 0.424275 (-0.348660)	0.009502 / 0.007607 (0.001895)	0.546539 / 0.226044 (0.320495)	5.464242 / 2.268929 (3.195314)	2.894101 / 55.444624 (-52.550524)	2.513761 / 6.876477 (-4.362715)	2.719843 / 2.142072 (0.577770)	0.678828 / 4.805227 (-4.126399)	0.157839 / 6.500664 (-6.342825)	0.071305 / 0.075469 (-0.004164)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.496879 / 1.841788 (-0.344909)	22.214452 / 8.074308 (14.140144)	17.707541 / 10.191392 (7.516149)	0.197008 / 0.680424 (-0.483416)	0.024883 / 0.534201 (-0.509318)	0.493611 / 0.579283 (-0.085672)	0.500677 / 0.434364 (0.066313)	0.569381 / 0.540337 (0.029044)	0.773950 / 1.386936 (-0.612986)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007337 / 0.011353 (-0.004015)	0.004572 / 0.011008 (-0.006436)	0.091123 / 0.038508 (0.052615)	0.079762 / 0.023109 (0.056652)	0.450527 / 0.275898 (0.174629)	0.525097 / 0.323480 (0.201617)	0.005873 / 0.007986 (-0.002112)	0.003797 / 0.004328 (-0.000532)	0.076259 / 0.004250 (0.072009)	0.062745 / 0.037052 (0.025692)	0.465553 / 0.258489 (0.207064)	0.546026 / 0.293841 (0.252186)	0.035638 / 0.128546 (-0.092909)	0.010086 / 0.075646 (-0.065560)	0.109269 / 0.419271 (-0.310002)	0.056765 / 0.043533 (0.013233)	0.440887 / 0.255139 (0.185748)	0.513325 / 0.283200 (0.230125)	0.027206 / 0.141683 (-0.114476)	1.863564 / 1.452155 (0.411409)	1.918206 / 1.492716 (0.425490)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.266479 / 0.018006 (0.248473)	0.487971 / 0.000490 (0.487481)	0.012246 / 0.000200 (0.012046)	0.000119 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035281 / 0.037411 (-0.002130)	0.102991 / 0.014526 (0.088465)	0.114638 / 0.176557 (-0.061919)	0.184117 / 0.737135 (-0.553018)	0.117943 / 0.296338 (-0.178396)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.497897 / 0.215209 (0.282688)	4.973806 / 2.077655 (2.896151)	2.596146 / 1.504120 (1.092026)	2.419694 / 1.541195 (0.878499)	2.525784 / 1.468490 (1.057294)	0.568021 / 4.584777 (-4.016756)	4.296431 / 3.745712 (0.550719)	3.690682 / 5.269862 (-1.579179)	2.345965 / 4.565676 (-2.219712)	0.066859 / 0.424275 (-0.357416)	0.009093 / 0.007607 (0.001486)	0.582616 / 0.226044 (0.356571)	5.826528 / 2.268929 (3.557600)	3.253222 / 55.444624 (-52.191403)	2.798447 / 6.876477 (-4.078030)	3.054609 / 2.142072 (0.912537)	0.678816 / 4.805227 (-4.126411)	0.157966 / 6.500664 (-6.342698)	0.073797 / 0.075469 (-0.001672)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.599480 / 1.841788 (-0.242308)	23.249738 / 8.074308 (15.175430)	16.965406 / 10.191392 (6.774014)	0.171390 / 0.680424 (-0.509034)	0.021810 / 0.534201 (-0.512391)	0.483339 / 0.579283 (-0.095944)	0.496615 / 0.434364 (0.062251)	0.583786 / 0.540337 (0.043448)	0.741699 / 1.386936 (-0.645237)

github-actions · 2023-07-13T18:06:24Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006054 / 0.011353 (-0.005299)	0.003706 / 0.011008 (-0.007302)	0.080060 / 0.038508 (0.041552)	0.061479 / 0.023109 (0.038370)	0.327981 / 0.275898 (0.052083)	0.356930 / 0.323480 (0.033450)	0.004671 / 0.007986 (-0.003315)	0.002901 / 0.004328 (-0.001428)	0.062425 / 0.004250 (0.058174)	0.046310 / 0.037052 (0.009258)	0.323657 / 0.258489 (0.065168)	0.370130 / 0.293841 (0.076289)	0.027151 / 0.128546 (-0.101395)	0.007850 / 0.075646 (-0.067797)	0.262300 / 0.419271 (-0.156971)	0.045456 / 0.043533 (0.001923)	0.325569 / 0.255139 (0.070430)	0.352962 / 0.283200 (0.069762)	0.020156 / 0.141683 (-0.121527)	1.429404 / 1.452155 (-0.022750)	1.615032 / 1.492716 (0.122316)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.187309 / 0.018006 (0.169303)	0.428848 / 0.000490 (0.428358)	0.003599 / 0.000200 (0.003399)	0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023260 / 0.037411 (-0.014151)	0.072467 / 0.014526 (0.057941)	0.082398 / 0.176557 (-0.094159)	0.142573 / 0.737135 (-0.594562)	0.082570 / 0.296338 (-0.213768)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426503 / 0.215209 (0.211294)	4.267875 / 2.077655 (2.190220)	2.189762 / 1.504120 (0.685642)	2.027992 / 1.541195 (0.486798)	2.053211 / 1.468490 (0.584721)	0.503850 / 4.584777 (-4.080927)	3.086444 / 3.745712 (-0.659268)	3.319492 / 5.269862 (-1.950370)	2.070714 / 4.565676 (-2.494962)	0.057591 / 0.424275 (-0.366684)	0.006407 / 0.007607 (-0.001200)	0.501145 / 0.226044 (0.275100)	5.017753 / 2.268929 (2.748825)	2.643145 / 55.444624 (-52.801479)	2.327440 / 6.876477 (-4.549037)	2.460250 / 2.142072 (0.318178)	0.589397 / 4.805227 (-4.215830)	0.124948 / 6.500664 (-6.375716)	0.060450 / 0.075469 (-0.015020)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.279870 / 1.841788 (-0.561918)	18.115908 / 8.074308 (10.041600)	13.570032 / 10.191392 (3.378640)	0.132981 / 0.680424 (-0.547442)	0.016942 / 0.534201 (-0.517259)	0.333591 / 0.579283 (-0.245692)	0.358844 / 0.434364 (-0.075520)	0.395748 / 0.540337 (-0.144590)	0.546213 / 1.386936 (-0.840723)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006062 / 0.011353 (-0.005291)	0.003673 / 0.011008 (-0.007336)	0.064726 / 0.038508 (0.026218)	0.061854 / 0.023109 (0.038745)	0.385343 / 0.275898 (0.109445)	0.441284 / 0.323480 (0.117805)	0.004830 / 0.007986 (-0.003156)	0.002909 / 0.004328 (-0.001420)	0.063874 / 0.004250 (0.059624)	0.049331 / 0.037052 (0.012278)	0.418484 / 0.258489 (0.159995)	0.451397 / 0.293841 (0.157556)	0.027665 / 0.128546 (-0.100881)	0.008088 / 0.075646 (-0.067558)	0.069625 / 0.419271 (-0.349646)	0.043437 / 0.043533 (-0.000095)	0.359789 / 0.255139 (0.104650)	0.430206 / 0.283200 (0.147007)	0.022308 / 0.141683 (-0.119375)	1.461030 / 1.452155 (0.008875)	1.513683 / 1.492716 (0.020966)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.230958 / 0.018006 (0.212952)	0.417553 / 0.000490 (0.417063)	0.000802 / 0.000200 (0.000602)	0.000066 / 0.000054 (0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025421 / 0.037411 (-0.011991)	0.077156 / 0.014526 (0.062630)	0.087533 / 0.176557 (-0.089024)	0.138048 / 0.737135 (-0.599087)	0.089358 / 0.296338 (-0.206981)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.439172 / 0.215209 (0.223963)	4.409509 / 2.077655 (2.331854)	2.491270 / 1.504120 (0.987150)	2.308446 / 1.541195 (0.767252)	2.378440 / 1.468490 (0.909950)	0.499834 / 4.584777 (-4.084943)	3.083168 / 3.745712 (-0.662544)	2.867543 / 5.269862 (-2.402318)	1.876354 / 4.565676 (-2.689323)	0.057092 / 0.424275 (-0.367183)	0.006955 / 0.007607 (-0.000653)	0.513799 / 0.226044 (0.287754)	5.126660 / 2.268929 (2.857731)	2.917348 / 55.444624 (-52.527277)	2.508035 / 6.876477 (-4.368441)	2.698089 / 2.142072 (0.556016)	0.586828 / 4.805227 (-4.218399)	0.124740 / 6.500664 (-6.375924)	0.062276 / 0.075469 (-0.013193)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.291624 / 1.841788 (-0.550164)	18.199968 / 8.074308 (10.125660)	13.888139 / 10.191392 (3.696747)	0.162955 / 0.680424 (-0.517469)	0.017343 / 0.534201 (-0.516858)	0.334683 / 0.579283 (-0.244600)	0.352708 / 0.434364 (-0.081656)	0.400629 / 0.540337 (-0.139708)	0.539497 / 1.386936 (-0.847439)

github-actions · 2023-07-13T18:29:04Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007500 / 0.011353 (-0.003853)	0.004498 / 0.011008 (-0.006510)	0.100239 / 0.038508 (0.061731)	0.083424 / 0.023109 (0.060315)	0.366664 / 0.275898 (0.090766)	0.406641 / 0.323480 (0.083161)	0.004577 / 0.007986 (-0.003409)	0.004809 / 0.004328 (0.000480)	0.076898 / 0.004250 (0.072647)	0.064021 / 0.037052 (0.026969)	0.375836 / 0.258489 (0.117347)	0.413008 / 0.293841 (0.119167)	0.036010 / 0.128546 (-0.092537)	0.009655 / 0.075646 (-0.065991)	0.342595 / 0.419271 (-0.076677)	0.061846 / 0.043533 (0.018313)	0.376543 / 0.255139 (0.121404)	0.395858 / 0.283200 (0.112659)	0.026792 / 0.141683 (-0.114891)	1.775569 / 1.452155 (0.323414)	1.865077 / 1.492716 (0.372360)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.221521 / 0.018006 (0.203514)	0.474604 / 0.000490 (0.474114)	0.004354 / 0.000200 (0.004154)	0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032947 / 0.037411 (-0.004464)	0.100454 / 0.014526 (0.085928)	0.111955 / 0.176557 (-0.064602)	0.179752 / 0.737135 (-0.557383)	0.114282 / 0.296338 (-0.182056)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.458261 / 0.215209 (0.243052)	4.563536 / 2.077655 (2.485881)	2.231928 / 1.504120 (0.727808)	2.036751 / 1.541195 (0.495556)	2.170413 / 1.468490 (0.701923)	0.570825 / 4.584777 (-4.013952)	4.505762 / 3.745712 (0.760050)	5.033461 / 5.269862 (-0.236401)	2.704989 / 4.565676 (-1.860687)	0.067011 / 0.424275 (-0.357264)	0.008568 / 0.007607 (0.000961)	0.545151 / 0.226044 (0.319106)	5.438984 / 2.268929 (3.170055)	2.771818 / 55.444624 (-52.672806)	2.393082 / 6.876477 (-4.483395)	2.467173 / 2.142072 (0.325101)	0.678849 / 4.805227 (-4.126379)	0.160480 / 6.500664 (-6.340184)	0.073681 / 0.075469 (-0.001788)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.532272 / 1.841788 (-0.309516)	22.548741 / 8.074308 (14.474433)	17.091044 / 10.191392 (6.899652)	0.172100 / 0.680424 (-0.508324)	0.022220 / 0.534201 (-0.511981)	0.467871 / 0.579283 (-0.111412)	0.491135 / 0.434364 (0.056771)	0.548433 / 0.540337 (0.008096)	0.733340 / 1.386936 (-0.653596)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007593 / 0.011353 (-0.003760)	0.004656 / 0.011008 (-0.006352)	0.076940 / 0.038508 (0.038431)	0.085183 / 0.023109 (0.062073)	0.447178 / 0.275898 (0.171280)	0.469545 / 0.323480 (0.146065)	0.006023 / 0.007986 (-0.001962)	0.003808 / 0.004328 (-0.000520)	0.076767 / 0.004250 (0.072517)	0.065713 / 0.037052 (0.028661)	0.445573 / 0.258489 (0.187084)	0.481689 / 0.293841 (0.187848)	0.036893 / 0.128546 (-0.091654)	0.009976 / 0.075646 (-0.065670)	0.084443 / 0.419271 (-0.334829)	0.058829 / 0.043533 (0.015297)	0.429291 / 0.255139 (0.174152)	0.454016 / 0.283200 (0.170816)	0.027289 / 0.141683 (-0.114394)	1.806786 / 1.452155 (0.354632)	1.887680 / 1.492716 (0.394964)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.241012 / 0.018006 (0.223006)	0.470629 / 0.000490 (0.470139)	0.003213 / 0.000200 (0.003013)	0.000107 / 0.000054 (0.000052)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036896 / 0.037411 (-0.000515)	0.106932 / 0.014526 (0.092406)	0.120333 / 0.176557 (-0.056223)	0.186271 / 0.737135 (-0.550865)	0.121581 / 0.296338 (-0.174758)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.507782 / 0.215209 (0.292573)	5.062932 / 2.077655 (2.985278)	2.689539 / 1.504120 (1.185419)	2.482978 / 1.541195 (0.941784)	2.561320 / 1.468490 (1.092830)	0.570664 / 4.584777 (-4.014113)	4.346051 / 3.745712 (0.600339)	6.479374 / 5.269862 (1.209513)	4.096483 / 4.565676 (-0.469194)	0.067564 / 0.424275 (-0.356711)	0.009147 / 0.007607 (0.001540)	0.596059 / 0.226044 (0.370015)	5.963223 / 2.268929 (3.694295)	3.201039 / 55.444624 (-52.243585)	2.816581 / 6.876477 (-4.059896)	3.047821 / 2.142072 (0.905748)	0.687749 / 4.805227 (-4.117478)	0.158174 / 6.500664 (-6.342490)	0.073329 / 0.075469 (-0.002140)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.601346 / 1.841788 (-0.240441)	23.712210 / 8.074308 (15.637902)	16.567272 / 10.191392 (6.375880)	0.224745 / 0.680424 (-0.455679)	0.021662 / 0.534201 (-0.512539)	0.471427 / 0.579283 (-0.107856)	0.498751 / 0.434364 (0.064387)	0.572047 / 0.540337 (0.031710)	0.821868 / 1.386936 (-0.565068)

github-actions · 2023-07-14T12:24:08Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006371 / 0.011353 (-0.004981)	0.003749 / 0.011008 (-0.007259)	0.084155 / 0.038508 (0.045647)	0.072450 / 0.023109 (0.049340)	0.308002 / 0.275898 (0.032104)	0.340471 / 0.323480 (0.016991)	0.005054 / 0.007986 (-0.002931)	0.003176 / 0.004328 (-0.001152)	0.064867 / 0.004250 (0.060616)	0.054305 / 0.037052 (0.017252)	0.321047 / 0.258489 (0.062558)	0.345999 / 0.293841 (0.052158)	0.030507 / 0.128546 (-0.098039)	0.008299 / 0.075646 (-0.067347)	0.287682 / 0.419271 (-0.131590)	0.052048 / 0.043533 (0.008515)	0.308322 / 0.255139 (0.053183)	0.333220 / 0.283200 (0.050020)	0.022698 / 0.141683 (-0.118985)	1.474033 / 1.452155 (0.021879)	1.544790 / 1.492716 (0.052074)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.200612 / 0.018006 (0.182606)	0.450934 / 0.000490 (0.450445)	0.005383 / 0.000200 (0.005183)	0.000200 / 0.000054 (0.000145)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027759 / 0.037411 (-0.009652)	0.080935 / 0.014526 (0.066409)	0.093041 / 0.176557 (-0.083516)	0.148643 / 0.737135 (-0.588492)	0.093463 / 0.296338 (-0.202876)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.381653 / 0.215209 (0.166444)	3.810699 / 2.077655 (1.733044)	1.866858 / 1.504120 (0.362738)	1.716985 / 1.541195 (0.175790)	1.788071 / 1.468490 (0.319581)	0.481130 / 4.584777 (-4.103647)	3.529798 / 3.745712 (-0.215914)	3.982037 / 5.269862 (-1.287824)	2.324866 / 4.565676 (-2.240811)	0.056767 / 0.424275 (-0.367508)	0.007306 / 0.007607 (-0.000301)	0.459472 / 0.226044 (0.233428)	4.602808 / 2.268929 (2.333879)	2.332014 / 55.444624 (-53.112610)	2.044858 / 6.876477 (-4.831619)	2.204165 / 2.142072 (0.062093)	0.577946 / 4.805227 (-4.227281)	0.130900 / 6.500664 (-6.369764)	0.059054 / 0.075469 (-0.016415)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.245211 / 1.841788 (-0.596576)	19.176397 / 8.074308 (11.102089)	13.995280 / 10.191392 (3.803888)	0.171743 / 0.680424 (-0.508681)	0.018038 / 0.534201 (-0.516163)	0.392338 / 0.579283 (-0.186945)	0.419370 / 0.434364 (-0.014994)	0.477829 / 0.540337 (-0.062508)	0.677409 / 1.386936 (-0.709527)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006513 / 0.011353 (-0.004840)	0.003984 / 0.011008 (-0.007024)	0.064516 / 0.038508 (0.026008)	0.070504 / 0.023109 (0.047395)	0.384509 / 0.275898 (0.108611)	0.410564 / 0.323480 (0.087084)	0.005310 / 0.007986 (-0.002675)	0.003268 / 0.004328 (-0.001061)	0.064684 / 0.004250 (0.060433)	0.055367 / 0.037052 (0.018315)	0.399108 / 0.258489 (0.140619)	0.422740 / 0.293841 (0.128900)	0.031624 / 0.128546 (-0.096922)	0.008617 / 0.075646 (-0.067030)	0.070929 / 0.419271 (-0.348342)	0.049146 / 0.043533 (0.005613)	0.385492 / 0.255139 (0.130353)	0.407434 / 0.283200 (0.124234)	0.021972 / 0.141683 (-0.119711)	1.496135 / 1.452155 (0.043980)	1.533739 / 1.492716 (0.041023)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.226218 / 0.018006 (0.208211)	0.443176 / 0.000490 (0.442686)	0.000376 / 0.000200 (0.000176)	0.000055 / 0.000054 (0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030315 / 0.037411 (-0.007097)	0.086416 / 0.014526 (0.071890)	0.097725 / 0.176557 (-0.078831)	0.150407 / 0.737135 (-0.586728)	0.099914 / 0.296338 (-0.196424)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.409807 / 0.215209 (0.194598)	4.099086 / 2.077655 (2.021431)	2.103160 / 1.504120 (0.599040)	1.927927 / 1.541195 (0.386733)	1.977751 / 1.468490 (0.509261)	0.476995 / 4.584777 (-4.107781)	3.521835 / 3.745712 (-0.223877)	3.237695 / 5.269862 (-2.032167)	1.995953 / 4.565676 (-2.569724)	0.056208 / 0.424275 (-0.368068)	0.007660 / 0.007607 (0.000053)	0.483537 / 0.226044 (0.257492)	4.833974 / 2.268929 (2.565046)	2.589115 / 55.444624 (-52.855510)	2.228076 / 6.876477 (-4.648401)	2.395271 / 2.142072 (0.253198)	0.577534 / 4.805227 (-4.227694)	0.131432 / 6.500664 (-6.369232)	0.060999 / 0.075469 (-0.014471)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.356043 / 1.841788 (-0.485745)	19.470401 / 8.074308 (11.396093)	14.091266 / 10.191392 (3.899874)	0.166809 / 0.680424 (-0.513615)	0.018782 / 0.534201 (-0.515419)	0.394916 / 0.579283 (-0.184367)	0.411378 / 0.434364 (-0.022986)	0.466886 / 0.540337 (-0.073451)	0.617369 / 1.386936 (-0.769567)

github-actions · 2023-07-14T14:31:09Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007590 / 0.011353 (-0.003762)	0.004068 / 0.011008 (-0.006941)	0.105479 / 0.038508 (0.066971)	0.085614 / 0.023109 (0.062505)	0.384325 / 0.275898 (0.108427)	0.467867 / 0.323480 (0.144387)	0.004652 / 0.007986 (-0.003333)	0.005445 / 0.004328 (0.001117)	0.079604 / 0.004250 (0.075353)	0.066031 / 0.037052 (0.028978)	0.426184 / 0.258489 (0.167695)	0.480712 / 0.293841 (0.186871)	0.037837 / 0.128546 (-0.090709)	0.009765 / 0.075646 (-0.065882)	0.351316 / 0.419271 (-0.067955)	0.063634 / 0.043533 (0.020101)	0.420297 / 0.255139 (0.165158)	0.449169 / 0.283200 (0.165969)	0.030947 / 0.141683 (-0.110736)	1.840184 / 1.452155 (0.388029)	1.934074 / 1.492716 (0.441357)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223483 / 0.018006 (0.205477)	0.521086 / 0.000490 (0.520596)	0.000379 / 0.000200 (0.000179)	0.000065 / 0.000054 (0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032011 / 0.037411 (-0.005400)	0.101474 / 0.014526 (0.086948)	0.108652 / 0.176557 (-0.067904)	0.173340 / 0.737135 (-0.563796)	0.114186 / 0.296338 (-0.182153)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.478020 / 0.215209 (0.262811)	4.645400 / 2.077655 (2.567746)	2.590763 / 1.504120 (1.086643)	2.383002 / 1.541195 (0.841807)	2.482550 / 1.468490 (1.014060)	0.572417 / 4.584777 (-4.012360)	4.233436 / 3.745712 (0.487724)	4.858823 / 5.269862 (-0.411038)	2.838913 / 4.565676 (-1.726764)	0.070010 / 0.424275 (-0.354265)	0.009602 / 0.007607 (0.001995)	0.538735 / 0.226044 (0.312691)	5.534340 / 2.268929 (3.265411)	2.915006 / 55.444624 (-52.529619)	2.625132 / 6.876477 (-4.251345)	2.537838 / 2.142072 (0.395766)	0.667870 / 4.805227 (-4.137357)	0.146330 / 6.500664 (-6.354334)	0.071631 / 0.075469 (-0.003838)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.594686 / 1.841788 (-0.247101)	22.311113 / 8.074308 (14.236804)	17.603983 / 10.191392 (7.412591)	0.195995 / 0.680424 (-0.484428)	0.022254 / 0.534201 (-0.511947)	0.479661 / 0.579283 (-0.099622)	0.463626 / 0.434364 (0.029262)	0.483465 / 0.540337 (-0.056873)	0.676141 / 1.386936 (-0.710795)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006146 / 0.011353 (-0.005207)	0.004856 / 0.011008 (-0.006152)	0.067506 / 0.038508 (0.028998)	0.073968 / 0.023109 (0.050859)	0.470013 / 0.275898 (0.194115)	0.479022 / 0.323480 (0.155542)	0.005972 / 0.007986 (-0.002014)	0.003846 / 0.004328 (-0.000483)	0.075141 / 0.004250 (0.070890)	0.058597 / 0.037052 (0.021544)	0.481454 / 0.258489 (0.222965)	0.515634 / 0.293841 (0.221793)	0.034979 / 0.128546 (-0.093567)	0.010385 / 0.075646 (-0.065261)	0.072649 / 0.419271 (-0.346622)	0.058183 / 0.043533 (0.014650)	0.462138 / 0.255139 (0.206999)	0.476093 / 0.283200 (0.192893)	0.032918 / 0.141683 (-0.108765)	1.820530 / 1.452155 (0.368375)	1.626360 / 1.492716 (0.133644)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.208970 / 0.018006 (0.190964)	0.492478 / 0.000490 (0.491988)	0.005487 / 0.000200 (0.005287)	0.000140 / 0.000054 (0.000086)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037896 / 0.037411 (0.000484)	0.089752 / 0.014526 (0.075227)	0.107445 / 0.176557 (-0.069111)	0.181260 / 0.737135 (-0.555876)	0.105700 / 0.296338 (-0.190639)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.495031 / 0.215209 (0.279821)	4.806939 / 2.077655 (2.729284)	2.227928 / 1.504120 (0.723808)	2.067117 / 1.541195 (0.525922)	2.348982 / 1.468490 (0.880492)	0.567201 / 4.584777 (-4.017576)	4.166592 / 3.745712 (0.420880)	3.654329 / 5.269862 (-1.615533)	2.331092 / 4.565676 (-2.234584)	0.062212 / 0.424275 (-0.362063)	0.008775 / 0.007607 (0.001168)	0.515413 / 0.226044 (0.289369)	5.449300 / 2.268929 (3.180371)	3.206574 / 55.444624 (-52.238050)	2.600455 / 6.876477 (-4.276022)	3.041162 / 2.142072 (0.899089)	0.681899 / 4.805227 (-4.123328)	0.155400 / 6.500664 (-6.345265)	0.073933 / 0.075469 (-0.001537)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.572329 / 1.841788 (-0.269459)	23.638519 / 8.074308 (15.564211)	17.145663 / 10.191392 (6.954271)	0.232690 / 0.680424 (-0.447734)	0.028620 / 0.534201 (-0.505581)	0.488105 / 0.579283 (-0.091178)	0.490365 / 0.434364 (0.056001)	0.599501 / 0.540337 (0.059164)	0.708101 / 1.386936 (-0.678835)

lhoestq · 2023-07-14T16:47:59Z

src/datasets/data_files.py

-                    if len(data_files) > 0:
-                        non_empty_splits.append(split)
-                        break
-            except FileNotFoundError:
-                pass
+                except FileNotFoundError:
+                    continue
+                if len(data_files) > 0:
+                    non_empty_splits.append(split)
+                    break


This is needed because resolve_pattern now raises FileNotFoundError if it can't resolve at least one file.
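To illustrate the control flow in the diff above, here is a minimal sketch. It assumes a resolver that raises FileNotFoundError when nothing matches; `resolve_pattern_stub`, `find_non_empty_splits`, and the sample file names are hypothetical stand-ins, not the actual helpers or signatures in data_files.py.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def resolve_pattern_stub(pattern: str, available: List[str]) -> List[str]:
    """Hypothetical stand-in for a resolver that raises when nothing matches."""
    matches = [name for name in available if fnmatchcase(name, pattern)]
    if not matches:
        raise FileNotFoundError(f"Unable to find files matching '{pattern}'")
    return matches


def find_non_empty_splits(patterns: Dict[str, List[str]], available: List[str]) -> List[str]:
    """Keep the splits for which at least one pattern resolves to a file."""
    non_empty_splits = []
    for split, split_patterns in patterns.items():
        for pattern in split_patterns:
            try:
                data_files = resolve_pattern_stub(pattern, available)
            except FileNotFoundError:
                # A failing pattern must not abort the split: try the next one.
                continue
            if len(data_files) > 0:
                non_empty_splits.append(split)
                break
    return non_empty_splits


available_files = ["data/train.csv"]
split_patterns = {
    "train": ["train*", "data/train*"],
    "test": ["test*", "data/test*"],
}
splits = find_non_empty_splits(split_patterns, available_files)
```

With the try block wrapped around a single pattern, a miss on the first pattern falls through to the second one instead of discarding the whole split.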

lhoestq · 2023-07-14T16:50:50Z

src/datasets/load.py

        for filepath in data_files_list[: config.DATA_FILES_MAX_NUMBER_FOR_MODULE_INFERENCE]
-        for suffix in Path(filepath).suffixes
+        for suffix in xbasename(filepath).split(".")[1:]


These kinds of changes are needed to support chained fsspec URLs.
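A small sketch of why a path-based suffix lookup can go wrong on a chained URL. The URL below is made up for illustration, and `chained_basename` is a simplified, hypothetical stand-in for xbasename (it assumes the file of interest is the part before the first `::`); it is not the library's implementation.

```python
import posixpath
from pathlib import PurePosixPath
from typing import List


def chained_basename(url: str) -> str:
    """Hypothetical helper: basename of the first hop of a chained URL."""
    first_hop = url.split("::")[0]
    return posixpath.basename(first_hop)


def suffixes_from_basename(url: str) -> List[str]:
    """Extensions of the inner file, without leading dots."""
    return chained_basename(url).split(".")[1:]


# Illustrative chained URL: a CSV inside a zip archive hosted in a repo.
chained_url = "zip://data/train.csv.gz::hf://datasets/user/repo/archive.zip"

# Treating the whole string as a path looks at the last component only,
# so it reports the archive's extension.
path_suffixes = PurePosixPath(chained_url).suffixes

# Splitting on "::" first reports the extensions of the inner file.
inner_suffixes = suffixes_from_basename(chained_url)
```

Note that the two approaches also differ in form: `suffixes` keeps the leading dot, while splitting the basename on `.` does not.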

github-actions · 2023-07-17T09:57:07Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005947 / 0.011353 (-0.005406)	0.003577 / 0.011008 (-0.007431)	0.081631 / 0.038508 (0.043122)	0.058651 / 0.023109 (0.035541)	0.342742 / 0.275898 (0.066843)	0.384130 / 0.323480 (0.060650)	0.004620 / 0.007986 (-0.003366)	0.002885 / 0.004328 (-0.001444)	0.063698 / 0.004250 (0.059448)	0.048953 / 0.037052 (0.011901)	0.367880 / 0.258489 (0.109391)	0.407050 / 0.293841 (0.113209)	0.027242 / 0.128546 (-0.101305)	0.007914 / 0.075646 (-0.067733)	0.262156 / 0.419271 (-0.157116)	0.044750 / 0.043533 (0.001218)	0.351613 / 0.255139 (0.096474)	0.380284 / 0.283200 (0.097084)	0.020080 / 0.141683 (-0.121603)	1.498101 / 1.452155 (0.045946)	1.543608 / 1.492716 (0.050892)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.180014 / 0.018006 (0.162008)	0.436172 / 0.000490 (0.435682)	0.003694 / 0.000200 (0.003494)	0.000071 / 0.000054 (0.000017)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024389 / 0.037411 (-0.013022)	0.072874 / 0.014526 (0.058348)	0.083469 / 0.176557 (-0.093088)	0.144600 / 0.737135 (-0.592536)	0.084229 / 0.296338 (-0.212110)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.391636 / 0.215209 (0.176427)	3.906941 / 2.077655 (1.829286)	1.901944 / 1.504120 (0.397825)	1.762702 / 1.541195 (0.221507)	1.817970 / 1.468490 (0.349480)	0.500345 / 4.584777 (-4.084432)	3.011351 / 3.745712 (-0.734361)	4.417763 / 5.269862 (-0.852098)	2.689744 / 4.565676 (-1.875933)	0.057765 / 0.424275 (-0.366511)	0.006412 / 0.007607 (-0.001195)	0.468156 / 0.226044 (0.242112)	4.664975 / 2.268929 (2.396047)	2.323355 / 55.444624 (-53.121270)	1.984280 / 6.876477 (-4.892197)	2.165215 / 2.142072 (0.023142)	0.586950 / 4.805227 (-4.218278)	0.124363 / 6.500664 (-6.376301)	0.060702 / 0.075469 (-0.014767)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.238870 / 1.841788 (-0.602917)	18.587360 / 8.074308 (10.513052)	13.831674 / 10.191392 (3.640282)	0.143542 / 0.680424 (-0.536882)	0.016913 / 0.534201 (-0.517288)	0.332314 / 0.579283 (-0.246969)	0.345419 / 0.434364 (-0.088945)	0.381257 / 0.540337 (-0.159081)	0.537844 / 1.386936 (-0.849092)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006294 / 0.011353 (-0.005059)	0.003714 / 0.011008 (-0.007294)	0.062684 / 0.038508 (0.024176)	0.063520 / 0.023109 (0.040411)	0.389591 / 0.275898 (0.113693)	0.444278 / 0.323480 (0.120798)	0.004825 / 0.007986 (-0.003160)	0.003010 / 0.004328 (-0.001318)	0.062767 / 0.004250 (0.058517)	0.051739 / 0.037052 (0.014686)	0.434299 / 0.258489 (0.175810)	0.452003 / 0.293841 (0.158162)	0.027375 / 0.128546 (-0.101171)	0.008135 / 0.075646 (-0.067511)	0.067401 / 0.419271 (-0.351871)	0.042752 / 0.043533 (-0.000780)	0.367633 / 0.255139 (0.112494)	0.433039 / 0.283200 (0.149840)	0.021086 / 0.141683 (-0.120597)	1.488024 / 1.452155 (0.035870)	1.507767 / 1.492716 (0.015050)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.230046 / 0.018006 (0.212040)	0.428085 / 0.000490 (0.427595)	0.002188 / 0.000200 (0.001988)	0.000070 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026705 / 0.037411 (-0.010706)	0.082466 / 0.014526 (0.067940)	0.089378 / 0.176557 (-0.087179)	0.147287 / 0.737135 (-0.589849)	0.090426 / 0.296338 (-0.205913)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.430882 / 0.215209 (0.215672)	4.296224 / 2.077655 (2.218569)	2.229982 / 1.504120 (0.725862)	2.048506 / 1.541195 (0.507311)	2.129514 / 1.468490 (0.661024)	0.502964 / 4.584777 (-4.081813)	3.048125 / 3.745712 (-0.697587)	4.208636 / 5.269862 (-1.061226)	2.594015 / 4.565676 (-1.971661)	0.057967 / 0.424275 (-0.366308)	0.006875 / 0.007607 (-0.000732)	0.513872 / 0.226044 (0.287828)	5.126435 / 2.268929 (2.857506)	2.691278 / 55.444624 (-52.753346)	2.361723 / 6.876477 (-4.514754)	2.511213 / 2.142072 (0.369141)	0.593558 / 4.805227 (-4.211670)	0.129332 / 6.500664 (-6.371332)	0.064051 / 0.075469 (-0.011418)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.289049 / 1.841788 (-0.552739)	18.912363 / 8.074308 (10.838055)	14.226500 / 10.191392 (4.035108)	0.131392 / 0.680424 (-0.549032)	0.016750 / 0.534201 (-0.517451)	0.330078 / 0.579283 (-0.249205)	0.347588 / 0.434364 (-0.086776)	0.383234 / 0.540337 (-0.157103)	0.510967 / 1.386936 (-0.875969)

mariosasko

Great stuff 🙂!

We should also remove the legacy version of HfFileSystem, but this can be done in a subsequent PR.

src/datasets/data_files.py

src/datasets/download/streaming_download_manager.py

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

github-actions · 2023-07-17T14:01:04Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005974 / 0.011353 (-0.005379)	0.003691 / 0.011008 (-0.007317)	0.079410 / 0.038508 (0.040902)	0.061769 / 0.023109 (0.038660)	0.323310 / 0.275898 (0.047412)	0.354325 / 0.323480 (0.030845)	0.004794 / 0.007986 (-0.003191)	0.002899 / 0.004328 (-0.001430)	0.062104 / 0.004250 (0.057854)	0.048973 / 0.037052 (0.011921)	0.326497 / 0.258489 (0.068008)	0.361347 / 0.293841 (0.067506)	0.026741 / 0.128546 (-0.101805)	0.007936 / 0.075646 (-0.067710)	0.259168 / 0.419271 (-0.160104)	0.044859 / 0.043533 (0.001327)	0.319342 / 0.255139 (0.064203)	0.343711 / 0.283200 (0.060511)	0.022298 / 0.141683 (-0.119384)	1.451595 / 1.452155 (-0.000560)	1.573730 / 1.492716 (0.081014)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.173086 / 0.018006 (0.155080)	0.432400 / 0.000490 (0.431910)	0.003739 / 0.000200 (0.003539)	0.000073 / 0.000054 (0.000019)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024477 / 0.037411 (-0.012934)	0.073463 / 0.014526 (0.058937)	0.083410 / 0.176557 (-0.093146)	0.144760 / 0.737135 (-0.592376)	0.084199 / 0.296338 (-0.212140)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.388251 / 0.215209 (0.173042)	3.875375 / 2.077655 (1.797720)	1.875515 / 1.504120 (0.371395)	1.729282 / 1.541195 (0.188087)	1.784732 / 1.468490 (0.316242)	0.496985 / 4.584777 (-4.087792)	3.030276 / 3.745712 (-0.715436)	2.813192 / 5.269862 (-2.456669)	1.868647 / 4.565676 (-2.697030)	0.057376 / 0.424275 (-0.366899)	0.006463 / 0.007607 (-0.001144)	0.462153 / 0.226044 (0.236108)	4.586583 / 2.268929 (2.317654)	2.287730 / 55.444624 (-53.156894)	1.972177 / 6.876477 (-4.904299)	2.151592 / 2.142072 (0.009520)	0.587169 / 4.805227 (-4.218058)	0.127063 / 6.500664 (-6.373601)	0.060297 / 0.075469 (-0.015172)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.267651 / 1.841788 (-0.574136)	18.426011 / 8.074308 (10.351703)	14.050470 / 10.191392 (3.859078)	0.148063 / 0.680424 (-0.532361)	0.017112 / 0.534201 (-0.517089)	0.330051 / 0.579283 (-0.249232)	0.358730 / 0.434364 (-0.075634)	0.392365 / 0.540337 (-0.147972)	0.534650 / 1.386936 (-0.852286)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005936 / 0.011353 (-0.005417)	0.003652 / 0.011008 (-0.007356)	0.063066 / 0.038508 (0.024558)	0.060617 / 0.023109 (0.037507)	0.388293 / 0.275898 (0.112395)	0.411422 / 0.323480 (0.087942)	0.004691 / 0.007986 (-0.003295)	0.002857 / 0.004328 (-0.001472)	0.064198 / 0.004250 (0.059947)	0.049124 / 0.037052 (0.012071)	0.403601 / 0.258489 (0.145112)	0.413619 / 0.293841 (0.119778)	0.027279 / 0.128546 (-0.101267)	0.008072 / 0.075646 (-0.067575)	0.067890 / 0.419271 (-0.351381)	0.041866 / 0.043533 (-0.001667)	0.393438 / 0.255139 (0.138299)	0.402865 / 0.283200 (0.119666)	0.023381 / 0.141683 (-0.118302)	1.496324 / 1.452155 (0.044170)	1.538080 / 1.492716 (0.045364)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.212065 / 0.018006 (0.194059)	0.410511 / 0.000490 (0.410021)	0.001236 / 0.000200 (0.001036)	0.000067 / 0.000054 (0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026012 / 0.037411 (-0.011399)	0.076592 / 0.014526 (0.062066)	0.085963 / 0.176557 (-0.090594)	0.137803 / 0.737135 (-0.599332)	0.087594 / 0.296338 (-0.208745)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434283 / 0.215209 (0.219074)	4.345478 / 2.077655 (2.267824)	2.400954 / 1.504120 (0.896834)	2.282024 / 1.541195 (0.740829)	2.414247 / 1.468490 (0.945757)	0.501855 / 4.584777 (-4.082922)	3.059433 / 3.745712 (-0.686279)	2.811288 / 5.269862 (-2.458574)	1.856839 / 4.565676 (-2.708838)	0.058017 / 0.424275 (-0.366258)	0.006844 / 0.007607 (-0.000763)	0.515376 / 0.226044 (0.289332)	5.148775 / 2.268929 (2.879847)	2.930807 / 55.444624 (-52.513817)	2.520532 / 6.876477 (-4.355944)	2.746299 / 2.142072 (0.604227)	0.590102 / 4.805227 (-4.215125)	0.125747 / 6.500664 (-6.374917)	0.061873 / 0.075469 (-0.013597)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.306247 / 1.841788 (-0.535541)	18.366048 / 8.074308 (10.291740)	13.855617 / 10.191392 (3.664225)	0.150124 / 0.680424 (-0.530300)	0.017189 / 0.534201 (-0.517012)	0.336285 / 0.579283 (-0.242998)	0.344985 / 0.434364 (-0.089379)	0.397973 / 0.540337 (-0.142364)	0.536142 / 1.386936 (-0.850794)

github-actions · 2023-07-17T14:07:30Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006401 / 0.011353 (-0.004952)	0.003789 / 0.011008 (-0.007219)	0.079516 / 0.038508 (0.041008)	0.068279 / 0.023109 (0.045170)	0.295691 / 0.275898 (0.019793)	0.327208 / 0.323480 (0.003728)	0.005070 / 0.007986 (-0.002915)	0.003044 / 0.004328 (-0.001285)	0.061411 / 0.004250 (0.057161)	0.053227 / 0.037052 (0.016175)	0.297368 / 0.258489 (0.038879)	0.334740 / 0.293841 (0.040899)	0.029459 / 0.128546 (-0.099087)	0.008080 / 0.075646 (-0.067566)	0.267344 / 0.419271 (-0.151927)	0.049877 / 0.043533 (0.006344)	0.293853 / 0.255139 (0.038714)	0.319819 / 0.283200 (0.036620)	0.022593 / 0.141683 (-0.119089)	1.459054 / 1.452155 (0.006900)	1.471250 / 1.492716 (-0.021466)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.194326 / 0.018006 (0.176320)	0.443565 / 0.000490 (0.443075)	0.003745 / 0.000200 (0.003545)	0.000075 / 0.000054 (0.000021)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026640 / 0.037411 (-0.010772)	0.077630 / 0.014526 (0.063104)	0.089364 / 0.176557 (-0.087192)	0.147327 / 0.737135 (-0.589809)	0.089603 / 0.296338 (-0.206735)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.373758 / 0.215209 (0.158549)	3.746778 / 2.077655 (1.669123)	1.814991 / 1.504120 (0.310871)	1.645650 / 1.541195 (0.104455)	1.690752 / 1.468490 (0.222262)	0.472117 / 4.584777 (-4.112660)	3.457346 / 3.745712 (-0.288367)	3.138869 / 5.269862 (-2.130993)	1.934924 / 4.565676 (-2.630753)	0.055709 / 0.424275 (-0.368566)	0.006680 / 0.007607 (-0.000927)	0.446874 / 0.226044 (0.220829)	4.458409 / 2.268929 (2.189480)	2.253932 / 55.444624 (-53.190693)	2.007240 / 6.876477 (-4.869237)	2.081687 / 2.142072 (-0.060386)	0.563379 / 4.805227 (-4.241848)	0.128694 / 6.500664 (-6.371970)	0.057409 / 0.075469 (-0.018060)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.212231 / 1.841788 (-0.629556)	18.519121 / 8.074308 (10.444813)	13.582243 / 10.191392 (3.390851)	0.142488 / 0.680424 (-0.537936)	0.017421 / 0.534201 (-0.516780)	0.366864 / 0.579283 (-0.212419)	0.401467 / 0.434364 (-0.032897)	0.443659 / 0.540337 (-0.096679)	0.618854 / 1.386936 (-0.768082)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006121 / 0.011353 (-0.005232)	0.003690 / 0.011008 (-0.007318)	0.060340 / 0.038508 (0.021832)	0.067215 / 0.023109 (0.044106)	0.382846 / 0.275898 (0.106948)	0.415774 / 0.323480 (0.092294)	0.004868 / 0.007986 (-0.003118)	0.003108 / 0.004328 (-0.001221)	0.060572 / 0.004250 (0.056321)	0.050453 / 0.037052 (0.013401)	0.400494 / 0.258489 (0.142005)	0.424368 / 0.293841 (0.130527)	0.030279 / 0.128546 (-0.098267)	0.008151 / 0.075646 (-0.067495)	0.066707 / 0.419271 (-0.352564)	0.046118 / 0.043533 (0.002585)	0.386697 / 0.255139 (0.131558)	0.410156 / 0.283200 (0.126957)	0.020688 / 0.141683 (-0.120995)	1.418162 / 1.452155 (-0.033993)	1.463057 / 1.492716 (-0.029659)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.216081 / 0.018006 (0.198075)	0.440541 / 0.000490 (0.440051)	0.000371 / 0.000200 (0.000171)	0.000054 / 0.000054 (-0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027763 / 0.037411 (-0.009648)	0.082316 / 0.014526 (0.067791)	0.094086 / 0.176557 (-0.082471)	0.144738 / 0.737135 (-0.592398)	0.094837 / 0.296338 (-0.201501)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.396277 / 0.215209 (0.181068)	3.958791 / 2.077655 (1.881136)	2.021367 / 1.504120 (0.517247)	1.860112 / 1.541195 (0.318917)	1.886032 / 1.468490 (0.417541)	0.468536 / 4.584777 (-4.116241)	3.417950 / 3.745712 (-0.327762)	4.849991 / 5.269862 (-0.419871)	2.773935 / 4.565676 (-1.791742)	0.055813 / 0.424275 (-0.368462)	0.007053 / 0.007607 (-0.000554)	0.470167 / 0.226044 (0.244122)	4.702969 / 2.268929 (2.434041)	2.474161 / 55.444624 (-52.970464)	2.171256 / 6.876477 (-4.705220)	2.315373 / 2.142072 (0.173301)	0.589195 / 4.805227 (-4.216032)	0.128237 / 6.500664 (-6.372427)	0.058641 / 0.075469 (-0.016828)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.292947 / 1.841788 (-0.548841)	18.851300 / 8.074308 (10.776992)	14.089764 / 10.191392 (3.898372)	0.164853 / 0.680424 (-0.515571)	0.017281 / 0.534201 (-0.516920)	0.359112 / 0.579283 (-0.220171)	0.386696 / 0.434364 (-0.047668)	0.428222 / 0.540337 (-0.112115)	0.568659 / 1.386936 (-0.818277)

github-actions · 2023-07-17T17:09:39Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006051 / 0.011353 (-0.005301)	0.003654 / 0.011008 (-0.007355)	0.080081 / 0.038508 (0.041572)	0.062925 / 0.023109 (0.039815)	0.358097 / 0.275898 (0.082199)	0.405728 / 0.323480 (0.082248)	0.005359 / 0.007986 (-0.002627)	0.002820 / 0.004328 (-0.001508)	0.063108 / 0.004250 (0.058858)	0.049627 / 0.037052 (0.012575)	0.397870 / 0.258489 (0.139381)	0.437157 / 0.293841 (0.143316)	0.027707 / 0.128546 (-0.100839)	0.007911 / 0.075646 (-0.067735)	0.260991 / 0.419271 (-0.158280)	0.044771 / 0.043533 (0.001238)	0.340230 / 0.255139 (0.085091)	0.384925 / 0.283200 (0.101725)	0.021369 / 0.141683 (-0.120314)	1.431439 / 1.452155 (-0.020715)	1.478794 / 1.492716 (-0.013922)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.182626 / 0.018006 (0.164620)	0.435551 / 0.000490 (0.435061)	0.003015 / 0.000200 (0.002815)	0.000064 / 0.000054 (0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024703 / 0.037411 (-0.012708)	0.073640 / 0.014526 (0.059114)	0.084598 / 0.176557 (-0.091959)	0.145810 / 0.737135 (-0.591325)	0.085125 / 0.296338 (-0.211213)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.394539 / 0.215209 (0.179330)	3.945882 / 2.077655 (1.868227)	1.947166 / 1.504120 (0.443046)	1.763305 / 1.541195 (0.222111)	1.816208 / 1.468490 (0.347718)	0.498880 / 4.584777 (-4.085897)	3.098283 / 3.745712 (-0.647429)	2.823474 / 5.269862 (-2.446388)	1.873993 / 4.565676 (-2.691684)	0.058097 / 0.424275 (-0.366179)	0.006488 / 0.007607 (-0.001119)	0.466711 / 0.226044 (0.240667)	4.671520 / 2.268929 (2.402592)	2.363381 / 55.444624 (-53.081243)	2.052092 / 6.876477 (-4.824385)	2.209212 / 2.142072 (0.067140)	0.594650 / 4.805227 (-4.210577)	0.125604 / 6.500664 (-6.375060)	0.061511 / 0.075469 (-0.013958)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.226564 / 1.841788 (-0.615224)	18.583605 / 8.074308 (10.509297)	13.993091 / 10.191392 (3.801699)	0.146185 / 0.680424 (-0.534239)	0.016839 / 0.534201 (-0.517362)	0.334116 / 0.579283 (-0.245167)	0.360780 / 0.434364 (-0.073584)	0.386008 / 0.540337 (-0.154329)	0.643278 / 1.386936 (-0.743658)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006174 / 0.011353 (-0.005179)	0.003658 / 0.011008 (-0.007350)	0.063250 / 0.038508 (0.024742)	0.063542 / 0.023109 (0.040433)	0.366845 / 0.275898 (0.090947)	0.409794 / 0.323480 (0.086314)	0.005678 / 0.007986 (-0.002308)	0.003061 / 0.004328 (-0.001268)	0.063561 / 0.004250 (0.059311)	0.052648 / 0.037052 (0.015596)	0.378096 / 0.258489 (0.119607)	0.410706 / 0.293841 (0.116865)	0.027668 / 0.128546 (-0.100878)	0.008045 / 0.075646 (-0.067601)	0.068290 / 0.419271 (-0.350981)	0.042602 / 0.043533 (-0.000930)	0.364976 / 0.255139 (0.109837)	0.395599 / 0.283200 (0.112400)	0.022733 / 0.141683 (-0.118950)	1.522473 / 1.452155 (0.070319)	1.515891 / 1.492716 (0.023175)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.232554 / 0.018006 (0.214547)	0.420702 / 0.000490 (0.420213)	0.002161 / 0.000200 (0.001961)	0.000064 / 0.000054 (0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026276 / 0.037411 (-0.011135)	0.078504 / 0.014526 (0.063978)	0.088989 / 0.176557 (-0.087567)	0.144044 / 0.737135 (-0.593091)	0.091074 / 0.296338 (-0.205265)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.420189 / 0.215209 (0.204980)	4.189596 / 2.077655 (2.111941)	2.316425 / 1.504120 (0.812305)	2.186877 / 1.541195 (0.645682)	2.259065 / 1.468490 (0.790575)	0.502827 / 4.584777 (-4.081950)	3.135266 / 3.745712 (-0.610446)	2.838808 / 5.269862 (-2.431053)	1.876519 / 4.565676 (-2.689158)	0.057802 / 0.424275 (-0.366473)	0.006824 / 0.007607 (-0.000784)	0.500213 / 0.226044 (0.274168)	4.999798 / 2.268929 (2.730869)	2.627713 / 55.444624 (-52.816911)	2.344263 / 6.876477 (-4.532214)	2.415449 / 2.142072 (0.273376)	0.593082 / 4.805227 (-4.212145)	0.125787 / 6.500664 (-6.374877)	0.062699 / 0.075469 (-0.012770)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.308219 / 1.841788 (-0.533569)	18.703099 / 8.074308 (10.628791)	13.976234 / 10.191392 (3.784842)	0.144037 / 0.680424 (-0.536387)	0.016592 / 0.534201 (-0.517609)	0.333078 / 0.579283 (-0.246206)	0.342317 / 0.434364 (-0.092047)	0.396837 / 0.540337 (-0.143500)	0.532641 / 1.386936 (-0.854295)
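
For anyone reading these rows programmatically: each cell has the shape `new / old (diff)`, and in the rows above the diff equals new minus old, up to rounding in the last printed digit. Below is a minimal sketch for parsing one header/row pair, assuming tab-separated cells as they appear in the bot comment. `parse_cell` and `parse_row` are hypothetical helper names, not part of `datasets` or the benchmark tooling.

```python
import re

# One benchmark cell looks like "0.232554 / 0.018006 (0.214547)".
CELL = re.compile(r"^\s*(-?\d+\.\d+)\s*/\s*(-?\d+\.\d+)\s*\((-?\d+\.\d+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for a single 'new / old (diff)' cell."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"not a benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(header_line, row_line):
    """Map each metric name to its (new, old, diff) tuple.

    Both lines are assumed to be tab-separated, with a leading label
    column ('metric' / 'new / old (diff)') that is skipped.
    """
    names = header_line.rstrip("\n").split("\t")[1:]
    cells = row_line.rstrip("\n").split("\t")[1:]
    if len(names) != len(cells):
        raise ValueError("header and row have different column counts")
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


if __name__ == "__main__":
    header = "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows\tget_first_row\tget_last_row"
    row = (
        "new / old (diff)\t0.232554 / 0.018006 (0.214547)\t0.420702 / 0.000490 (0.420213)"
        "\t0.002161 / 0.000200 (0.001961)\t0.000064 / 0.000054 (0.000009)"
    )
    for name, (new, old, diff) in parse_row(header, row).items():
        # The printed diff was rounded by the bot, so compare with a tolerance.
        consistent = abs((new - old) - diff) < 2e-6
        print(f"{name}: new={new} old={old} diff={diff} consistent={consistent}")
```

With the benchmark_getitem_100B.json row above, the `get_first_row` entry parses to new=0.002161, old=0.000200, diff=0.001961.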

lhoestq and others added 11 commits July 12, 2023 12:44

use new hffs in data_files.py

944a124

add support for storage_options for load_dataset API

2cffe0e

use new HfFileSystem

de4b100

update tests

0157c71

Merge branch 'main' into use-new-hffs

17f43ec

remove get_patterns_and_data_files

0bcfbfc

handle FileNotFoundError when no data files are resolved

3592bb7

fix merge error

109a444

docstring

049bce0

override auth in extend_module_for_streaming

0d8bfb7

update tests

c5a752d

update minimum hfh to 0.14.0

9a717b8

fix

84645f8

bug fixes

7935cd2

lhoestq added 2 commits July 13, 2023 19:56

more fixes

7741410

and more

e7976db

docstring

34d0c90

fix test for windows

601ae6c

again

4a76131

lhoestq marked this pull request as ready for review July 14, 2023 16:28

lhoestq requested a review from mariosasko July 14, 2023 16:28

lhoestq commented Jul 14, 2023

minor

d7892be

mariosasko approved these changes Jul 17, 2023

src/datasets/data_files.py (outdated, resolved)

src/datasets/data_files.py (outdated, resolved)

src/datasets/download/streaming_download_manager.py (resolved)

lhoestq and others added 2 commits July 17, 2023 15:52

Apply suggestions from code review

1ae24cf

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

style

563864d

lhoestq merged commit 14f6edd into main Jul 17, 2023
13 checks passed

lhoestq deleted the use-new-hffs branch July 17, 2023 17:01

lhoestq mentioned this pull request Jul 17, 2023

add support for storage_options for load_dataset API #5919

Closed

exs-avianello mentioned this pull request Jul 26, 2023

storage_options provided to load_dataset not fully piping through since datasets 2.14.0 #6071

Closed

albertvillanova mentioned this pull request Aug 1, 2023

Fix error when loading from GCP bucket #6105

Merged
