Free the "hf" filesystem protocol for hffs (#5101)
hf:// -> hf-legacy://
lhoestq committed Oct 12, 2022
1 parent bbebe3f commit 9ec6cc7
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/filesystems/hffilesystem.py
@@ -12,7 +12,7 @@ class HfFileSystem(AbstractFileSystem):
     """Interface to files in a Hugging face repository"""
 
     root_marker = ""
-    protocol = "hf"
+    protocol = "hf-legacy"  # "hf://"" is reserved for hffs
 
     def __init__(
         self,
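
To illustrate why the rename frees the name, here is a minimal sketch of a protocol-keyed registry. It is a toy model, not fsspec's actual implementation: `ToyRegistry`, `LegacyHfFileSystem`, and `NewHfFileSystem` are hypothetical names made up for this example, and the only detail taken from the commit is the pair of protocol strings `"hf"` and `"hf-legacy"`.

```python
class ToyRegistry:
    """Toy registry mapping a URL protocol (scheme) to a filesystem class."""

    def __init__(self):
        self._impls = {}

    def register(self, cls):
        # Each class declares the protocol it serves as a class attribute,
        # mirroring the `protocol = ...` line changed in the diff.
        name = cls.protocol
        if name in self._impls:
            raise ValueError(f"protocol {name!r} is already registered")
        self._impls[name] = cls

    def resolve(self, url):
        # "hf-legacy://some/path" -> "hf-legacy"
        protocol, sep, _ = url.partition("://")
        if not sep:
            raise ValueError(f"no protocol in {url!r}")
        return self._impls[protocol]


class LegacyHfFileSystem:
    """Hypothetical stand-in for the renamed filesystem in this commit."""
    protocol = "hf-legacy"


class NewHfFileSystem:
    """Hypothetical stand-in for a separate package that claims "hf"."""
    protocol = "hf"


registry = ToyRegistry()
registry.register(LegacyHfFileSystem)
registry.register(NewHfFileSystem)  # no clash: the legacy class gave up "hf"

print(registry.resolve("hf://user/repo").__name__)
print(registry.resolve("hf-legacy://user/repo").__name__)
```

Under this model, two classes cannot both declare `protocol = "hf"`, so the older class has to move to another name before a new one can take over `hf://` URLs.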

1 comment on commit 9ec6cc7

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009689 / 0.011353 (-0.001664) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005530 / 0.011008 (-0.005478) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098906 / 0.038508 (0.060398) |
| read_batch_unformated after write_array2d | 0.036179 / 0.023109 (0.013069) |
| read_batch_unformated after write_flattened_sequence | 0.300794 / 0.275898 (0.024896) |
| read_batch_unformated after write_nested_sequence | 0.367845 / 0.323480 (0.044366) |
| read_col_formatted_as_numpy after write_array2d | 0.008321 / 0.007986 (0.000336) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004540 / 0.004328 (0.000212) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.076105 / 0.004250 (0.071854) |
| read_col_unformated after write_array2d | 0.046798 / 0.037052 (0.009746) |
| read_col_unformated after write_flattened_sequence | 0.308770 / 0.258489 (0.050281) |
| read_col_unformated after write_nested_sequence | 0.346182 / 0.293841 (0.052341) |
| read_formatted_as_numpy after write_array2d | 0.044416 / 0.128546 (-0.084130) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015854 / 0.075646 (-0.059793) |
| read_formatted_as_numpy after write_nested_sequence | 0.337584 / 0.419271 (-0.081687) |
| read_unformated after write_array2d | 0.052304 / 0.043533 (0.008771) |
| read_unformated after write_flattened_sequence | 0.298205 / 0.255139 (0.043066) |
| read_unformated after write_nested_sequence | 0.315315 / 0.283200 (0.032116) |
| write_array2d | 0.110239 / 0.141683 (-0.031444) |
| write_flattened_sequence | 1.513958 / 1.452155 (0.061804) |
| write_nested_sequence | 1.497724 / 1.492716 (0.005007) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.292055 / 0.018006 (0.274049) |
| get_batch_of_1024_rows | 0.511467 / 0.000490 (0.510978) |
| get_first_row | 0.007319 / 0.000200 (0.007119) |
| get_last_row | 0.000102 / 0.000054 (0.000048) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024447 / 0.037411 (-0.012964) |
| shard | 0.101655 / 0.014526 (0.087129) |
| shuffle | 0.116639 / 0.176557 (-0.059918) |
| sort | 0.163426 / 0.737135 (-0.573709) |
| train_test_split | 0.121695 / 0.296338 (-0.174644) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.398096 / 0.215209 (0.182887) |
| read 50000 | 3.986374 / 2.077655 (1.908719) |
| read_batch 50000 10 | 1.851549 / 1.504120 (0.347429) |
| read_batch 50000 100 | 1.668359 / 1.541195 (0.127164) |
| read_batch 50000 1000 | 1.795167 / 1.468490 (0.326677) |
| read_formatted numpy 5000 | 0.697416 / 4.584777 (-3.887361) |
| read_formatted pandas 5000 | 3.810773 / 3.745712 (0.065061) |
| read_formatted tensorflow 5000 | 2.153027 / 5.269862 (-3.116835) |
| read_formatted torch 5000 | 1.350133 / 4.565676 (-3.215543) |
| read_formatted_batch numpy 5000 10 | 0.085381 / 0.424275 (-0.338894) |
| read_formatted_batch numpy 5000 1000 | 0.012161 / 0.007607 (0.004554) |
| shuffled read 5000 | 0.503354 / 0.226044 (0.277309) |
| shuffled read 50000 | 5.052066 / 2.268929 (2.783137) |
| shuffled read_batch 50000 10 | 2.307330 / 55.444624 (-53.137295) |
| shuffled read_batch 50000 100 | 1.959908 / 6.876477 (-4.916569) |
| shuffled read_batch 50000 1000 | 2.119641 / 2.142072 (-0.022431) |
| shuffled read_formatted numpy 5000 | 0.843926 / 4.805227 (-3.961301) |
| shuffled read_formatted_batch numpy 5000 10 | 0.164990 / 6.500664 (-6.335674) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.061561 / 0.075469 (-0.013908) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.507499 / 1.841788 (-0.334289) |
| map fast-tokenizer batched | 14.015427 / 8.074308 (5.941119) |
| map identity | 25.045568 / 10.191392 (14.854176) |
| map identity batched | 0.926150 / 0.680424 (0.245727) |
| map no-op batched | 0.573311 / 0.534201 (0.039110) |
| map no-op batched numpy | 0.442273 / 0.579283 (-0.137010) |
| map no-op batched pandas | 0.439151 / 0.434364 (0.004787) |
| map no-op batched pytorch | 0.281950 / 0.540337 (-0.258388) |
| map no-op batched tensorflow | 0.286989 / 1.386936 (-1.099947) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007538 / 0.011353 (-0.003815) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005466 / 0.011008 (-0.005542) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.095802 / 0.038508 (0.057294) |
| read_batch_unformated after write_array2d | 0.034148 / 0.023109 (0.011038) |
| read_batch_unformated after write_flattened_sequence | 0.379444 / 0.275898 (0.103545) |
| read_batch_unformated after write_nested_sequence | 0.401728 / 0.323480 (0.078248) |
| read_col_formatted_as_numpy after write_array2d | 0.006105 / 0.007986 (-0.001881) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004385 / 0.004328 (0.000056) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.073074 / 0.004250 (0.068823) |
| read_col_unformated after write_array2d | 0.041847 / 0.037052 (0.004795) |
| read_col_unformated after write_flattened_sequence | 0.385479 / 0.258489 (0.126990) |
| read_col_unformated after write_nested_sequence | 0.425510 / 0.293841 (0.131669) |
| read_formatted_as_numpy after write_array2d | 0.038352 / 0.128546 (-0.090194) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012784 / 0.075646 (-0.062862) |
| read_formatted_as_numpy after write_nested_sequence | 0.332626 / 0.419271 (-0.086645) |
| read_unformated after write_array2d | 0.051411 / 0.043533 (0.007878) |
| read_unformated after write_flattened_sequence | 0.379052 / 0.255139 (0.123913) |
| read_unformated after write_nested_sequence | 0.397752 / 0.283200 (0.114553) |
| write_array2d | 0.105692 / 0.141683 (-0.035991) |
| write_flattened_sequence | 1.455533 / 1.452155 (0.003378) |
| write_nested_sequence | 1.556804 / 1.492716 (0.064088) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.327407 / 0.018006 (0.309401) |
| get_batch_of_1024_rows | 0.512158 / 0.000490 (0.511668) |
| get_first_row | 0.006009 / 0.000200 (0.005809) |
| get_last_row | 0.000099 / 0.000054 (0.000045) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024541 / 0.037411 (-0.012870) |
| shard | 0.100378 / 0.014526 (0.085852) |
| shuffle | 0.114500 / 0.176557 (-0.062057) |
| sort | 0.158140 / 0.737135 (-0.578995) |
| train_test_split | 0.119904 / 0.296338 (-0.176435) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.423971 / 0.215209 (0.208762) |
| read 50000 | 4.200842 / 2.077655 (2.123187) |
| read_batch 50000 10 | 2.036322 / 1.504120 (0.532202) |
| read_batch 50000 100 | 1.841557 / 1.541195 (0.300363) |
| read_batch 50000 1000 | 1.903635 / 1.468490 (0.435145) |
| read_formatted numpy 5000 | 0.707997 / 4.584777 (-3.876780) |
| read_formatted pandas 5000 | 3.812709 / 3.745712 (0.066997) |
| read_formatted tensorflow 5000 | 2.165952 / 5.269862 (-3.103910) |
| read_formatted torch 5000 | 1.359960 / 4.565676 (-3.205716) |
| read_formatted_batch numpy 5000 10 | 0.092841 / 0.424275 (-0.331434) |
| read_formatted_batch numpy 5000 1000 | 0.011987 / 0.007607 (0.004380) |
| shuffled read 5000 | 0.529151 / 0.226044 (0.303107) |
| shuffled read 50000 | 5.255795 / 2.268929 (2.986866) |
| shuffled read_batch 50000 10 | 2.501193 / 55.444624 (-52.943431) |
| shuffled read_batch 50000 100 | 2.158599 / 6.876477 (-4.717878) |
| shuffled read_batch 50000 1000 | 2.341377 / 2.142072 (0.199305) |
| shuffled read_formatted numpy 5000 | 0.857324 / 4.805227 (-3.947904) |
| shuffled read_formatted_batch numpy 5000 10 | 0.180381 / 6.500664 (-6.320283) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062789 / 0.075469 (-0.012681) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.533043 / 1.841788 (-0.308745) |
| map fast-tokenizer batched | 13.946954 / 8.074308 (5.872646) |
| map identity | 12.845415 / 10.191392 (2.654023) |
| map identity batched | 0.931454 / 0.680424 (0.251030) |
| map no-op batched | 0.595597 / 0.534201 (0.061396) |
| map no-op batched numpy | 0.419140 / 0.579283 (-0.160143) |
| map no-op batched pandas | 0.418518 / 0.434364 (-0.015846) |
| map no-op batched pytorch | 0.244721 / 0.540337 (-0.295617) |
| map no-op batched tensorflow | 0.263180 / 1.386936 (-1.123756) |
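
Each benchmark cell above has the form `new / old (diff)`, where the diff is the new value minus the old one. As a rough sketch for anyone post-processing these numbers, the snippet below parses one cell and checks that relationship. The helper names `parse_cell` and `is_consistent` are invented for this example and are not part of the benchmark tooling. The tolerance is an assumption: the three printed values appear to be rounded independently to six decimals, so the recomputed difference can be off by about one unit in the last place.

```python
import re

# One cell looks like "0.009689 / 0.011353 (-0.001664)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Return (new, old, diff) as floats for a 'new / old (diff)' cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check that the printed diff equals new - old up to rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


print(parse_cell("0.009689 / 0.011353 (-0.001664)"))
print(is_consistent("0.511467 / 0.000490 (0.510978)"))
```

A negative diff means the new run had the smaller value for that metric.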
