Use soundfile for mp3 decoding instead of torchaudio #5573

polinaeterna · 2023-02-23T19:19:44Z

I've removed torchaudio completely and switched to soundfile for everything. With the new version of the soundfile package this should work smoothly, because the libsndfile C library is now bundled, in the Linux wheels too.

Let me know if you think it's too harsh and we should continue to support torchaudio decoding.

I decided that we can drop it completely because:

there's always something wrong with torchaudio (for example, recently: "Error loading MP3 files from CommonVoice" #5488)
the results of mp3 decoding differ depending on the torchaudio version
soundfile is slightly faster than the latest torchaudio
in any case, users can pass a custom decoding function built on any library they want if needed (worth putting a snippet in the docs).
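
The last point could look roughly like the sketch below. It assumes the audio column was loaded without automatic decoding (for example with `Audio(decode=False)`), so that each example keeps the raw file content under `audio["bytes"]`. The helper names `decode_wav_bytes` and `decode_audio_example` are hypothetical, and the standard-library `wave` module only stands in for whichever decoding library a user prefers.

```python
import io
import sys
import wave
from array import array


def decode_wav_bytes(raw: bytes) -> dict:
    """Decode 16-bit PCM WAV bytes with the standard-library ``wave`` module."""
    with wave.open(io.BytesIO(raw), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array("h", wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        return {
            # Scale signed 16-bit integers to floats in [-1.0, 1.0)
            "array": [s / 32768.0 for s in samples],
            "sampling_rate": wav.getframerate(),
        }


def decode_audio_example(example: dict, decoder=decode_wav_bytes) -> dict:
    """Return a copy of the example whose ``audio`` entry holds decoded samples."""
    example = dict(example)
    example["audio"] = decoder(example["audio"]["bytes"])
    return example
```

Swapping in another library would only mean passing a different `decoder` callable that accepts raw bytes and returns the decoded samples with their sampling rate.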

cc @sanchit-gandhi @vaibhavad

HuggingFaceDocBuilderDev · 2023-02-23T19:23:56Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko

Good stuff!

docs/source/installation.md

setup.py

mariosasko · 2023-02-27T18:15:34Z

src/datasets/features/audio.py

@@ -243,128 +236,52 @@ def path_to_bytes(path):
        storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
        return array_cast(storage, self.pa_type)

-    def _decode_non_mp3_path_like(
-        self, path, format=None, token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None
+    def _decode_example(


I think we can put all this logic in decode_example and structure it the same way as in Image.decode_example

mariosasko · 2023-02-27T18:18:32Z

src/datasets/features/audio.py

        try:
            import librosa
            import soundfile as sf
        except ImportError as err:
            raise ImportError("To support decoding audio files, please install 'librosa' and 'soundfile'.") from err

        if format == "opus":
-            if version.parse(sf.__libsndfile_version__) < version.parse("1.0.30"):
+            if version.parse(sf.__libsndfile_version__) < version.parse("1.0.31"):


Perhaps we can introduce two module-level variables for these checks:

IS_OPUS_SUPPORTED = version.parse(sf.__libsndfile_version__) >= version.parse("1.0.31")
IS_MP3_SUPPORTED = version.parse(sf.__libsndfile_version__) >= version.parse("1.0.31")
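
A minimal sketch of how such flags might be computed once and then reused, assuming the libsndfile version is available as a plain dotted string. The function names are hypothetical. The Opus threshold comes from the diff above; the MP3 threshold of `1.1.0` is an assumption (the suggestion reuses `1.0.31` for both), and the value actually adopted in the PR may differ. Real code would more likely compare with `packaging.version.parse`, as the diff does; a tuple comparison is used here only to keep the example dependency-free.

```python
OPUS_MIN_VERSION = "1.0.31"  # threshold taken from the diff under review
MP3_MIN_VERSION = "1.1.0"    # assumed threshold, not confirmed by this PR


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '1.0.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def build_support_flags(libsndfile_version: str) -> dict:
    """Compute format-support flags from a libsndfile version string."""
    installed = parse_version(libsndfile_version)
    return {
        "IS_OPUS_SUPPORTED": installed >= parse_version(OPUS_MIN_VERSION),
        "IS_MP3_SUPPORTED": installed >= parse_version(MP3_MIN_VERSION),
    }
```

Computing the flags in one place means the decoding code can test a boolean instead of repeating the version comparison at every call site.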

polinaeterna · 2023-02-27T18:38:28Z

@mariosasko thank you for the review! Do you have any idea why test_hash_torch_tensor fails on "ubuntu-latest deps-minimum"? I removed the torchaudio<0.12.0 test dependency, so it uses the latest torch now; might that be connected?

stevhliu

Nice, thanks for also updating the docs!

docs/source/installation.md

mariosasko · 2023-02-27T20:29:39Z

@polinaeterna The failure is due to torch.from_numpy not being picklable in newer versions of PyTorch. You can replace the current definition of _save_tensor in utils/py_utils.py with the following one to fix it:

@pklregister(obj_type)
def _save_tensor(pickler, obj):
    # `torch.from_numpy` is not picklable in `torch>=1.11.0`
    def _create_tensor(np_array):
        return torch.from_numpy(np_array)

    dill_log(pickler, f"To: {obj}")
    args = (obj.detach().cpu().numpy(),)
    pickler.save_reduce(_create_tensor, args, obj=obj)
    dill_log(pickler, "# To")
    return

lhoestq

Thanks a lot! You can merge when the CI is green (either with Mario's fix, or by skipping the torch test for recent torch versions if you think we should fix it in a separate PR).

lhoestq · 2023-02-28T17:30:13Z

(doing a patch release now - please wait before merging ^^)

mariosasko

Some additional comments

src/datasets/features/audio.py

polinaeterna · 2023-02-28T18:05:19Z

@mariosasko great, thank you!! I've integrated all your changes; could you please take one more look?

lhoestq · 2023-02-28T18:13:49Z

The patch release is done (I did it from a branch other than main anyway).

mariosasko

Looks all good now!

github-actions · 2023-02-28T20:25:13Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010927 / 0.011353 (-0.000426)	0.006232 / 0.011008 (-0.004776)	0.119815 / 0.038508 (0.081307)	0.034138 / 0.023109 (0.011029)	0.349945 / 0.275898 (0.074047)	0.404967 / 0.323480 (0.081487)	0.008672 / 0.007986 (0.000687)	0.005010 / 0.004328 (0.000681)	0.091931 / 0.004250 (0.087680)	0.042534 / 0.037052 (0.005482)	0.374701 / 0.258489 (0.116212)	0.401027 / 0.293841 (0.107186)	0.053523 / 0.128546 (-0.075024)	0.019704 / 0.075646 (-0.055942)	0.384207 / 0.419271 (-0.035064)	0.065350 / 0.043533 (0.021817)	0.375074 / 0.255139 (0.119935)	0.390458 / 0.283200 (0.107259)	0.110549 / 0.141683 (-0.031134)	1.719812 / 1.452155 (0.267657)	1.748906 / 1.492716 (0.256190)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.210051 / 0.018006 (0.192045)	0.546503 / 0.000490 (0.546013)	0.004078 / 0.000200 (0.003878)	0.000111 / 0.000054 (0.000056)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030212 / 0.037411 (-0.007199)	0.121845 / 0.014526 (0.107319)	0.136309 / 0.176557 (-0.040247)	0.204667 / 0.737135 (-0.532468)	0.157327 / 0.296338 (-0.139012)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.672548 / 0.215209 (0.457339)	6.239409 / 2.077655 (4.161754)	2.462441 / 1.504120 (0.958322)	2.063985 / 1.541195 (0.522791)	2.098858 / 1.468490 (0.630368)	1.262600 / 4.584777 (-3.322177)	5.478462 / 3.745712 (1.732750)	5.454672 / 5.269862 (0.184810)	2.991866 / 4.565676 (-1.573810)	0.153415 / 0.424275 (-0.270861)	0.015061 / 0.007607 (0.007454)	0.796115 / 0.226044 (0.570071)	8.206858 / 2.268929 (5.937930)	3.226395 / 55.444624 (-52.218229)	2.503522 / 6.876477 (-4.372955)	2.547489 / 2.142072 (0.405417)	1.504776 / 4.805227 (-3.300451)	0.256536 / 6.500664 (-6.244128)	0.078543 / 0.075469 (0.003073)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.591109 / 1.841788 (-0.250678)	18.153317 / 8.074308 (10.079008)	20.465684 / 10.191392 (10.274292)	0.229808 / 0.680424 (-0.450616)	0.045263 / 0.534201 (-0.488938)	0.556760 / 0.579283 (-0.022524)	0.614985 / 0.434364 (0.180622)	0.635675 / 0.540337 (0.095337)	0.729817 / 1.386936 (-0.657119)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.011247 / 0.011353 (-0.000106)	0.006823 / 0.011008 (-0.004185)	0.101989 / 0.038508 (0.063481)	0.036077 / 0.023109 (0.012968)	0.413469 / 0.275898 (0.137571)	0.505560 / 0.323480 (0.182080)	0.007506 / 0.007986 (-0.000480)	0.006369 / 0.004328 (0.002040)	0.099597 / 0.004250 (0.095346)	0.058115 / 0.037052 (0.021063)	0.414735 / 0.258489 (0.156246)	0.466801 / 0.293841 (0.172960)	0.064771 / 0.128546 (-0.063775)	0.021100 / 0.075646 (-0.054546)	0.135407 / 0.419271 (-0.283864)	0.068784 / 0.043533 (0.025251)	0.410467 / 0.255139 (0.155328)	0.465993 / 0.283200 (0.182794)	0.119404 / 0.141683 (-0.022279)	1.767107 / 1.452155 (0.314952)	1.938342 / 1.492716 (0.445626)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227038 / 0.018006 (0.209032)	0.511389 / 0.000490 (0.510899)	0.006723 / 0.000200 (0.006523)	0.000118 / 0.000054 (0.000064)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033078 / 0.037411 (-0.004333)	0.133159 / 0.014526 (0.118633)	0.147928 / 0.176557 (-0.028629)	0.214005 / 0.737135 (-0.523130)	0.151655 / 0.296338 (-0.144683)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.634829 / 0.215209 (0.419620)	6.578640 / 2.077655 (4.500985)	2.673598 / 1.504120 (1.169478)	2.338671 / 1.541195 (0.797476)	2.389104 / 1.468490 (0.920614)	1.274938 / 4.584777 (-3.309839)	5.746524 / 3.745712 (2.000812)	5.992084 / 5.269862 (0.722222)	3.092090 / 4.565676 (-1.473587)	0.150375 / 0.424275 (-0.273900)	0.015470 / 0.007607 (0.007863)	0.792962 / 0.226044 (0.566918)	8.057491 / 2.268929 (5.788563)	3.483966 / 55.444624 (-51.960659)	2.715038 / 6.876477 (-4.161438)	2.747186 / 2.142072 (0.605114)	1.532951 / 4.805227 (-3.272276)	0.262214 / 6.500664 (-6.238450)	0.081308 / 0.075469 (0.005839)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.698448 / 1.841788 (-0.143340)	18.590002 / 8.074308 (10.515694)	20.584508 / 10.191392 (10.393116)	0.227237 / 0.680424 (-0.453187)	0.028445 / 0.534201 (-0.505756)	0.527874 / 0.579283 (-0.051409)	0.602844 / 0.434364 (0.168480)	0.672948 / 0.540337 (0.132611)	0.788103 / 1.386936 (-0.598833)

polinaeterna added 3 commits February 23, 2023 20:18

use soundfile for mp3 decoding instead of torchaudio

d7b0c3c

fix some tests

58e05e9

remove torch and torchaudio from library's requirements

80a47c6

polinaeterna added 9 commits February 24, 2023 20:00

refactor audio decoding, decode everything with soundfile

0f6130c

remove torchaudio latest test ci stage, remove libsndfile and sox binaries installation

a1c4915

get back torch test dependency

5dc4331

remove installing system audio dependencies (libsndfile, sox) from ci

1ab2bf9

remove checks for libsndfile in tests since it's bundeled in python library

1b6978c

remove instructions about installing via package manager since it's misleading

427951d

pin soundfile version to the latest

365db5c

update documentation

13891ef

fix setup

c4cbdb3

polinaeterna requested review from lhoestq, albertvillanova and stevhliu February 27, 2023 16:50

polinaeterna marked this pull request as ready for review February 27, 2023 16:53

mariosasko reviewed Feb 27, 2023

stevhliu approved these changes Feb 27, 2023

polinaeterna and others added 6 commits February 27, 2023 19:41

Update docs/source/installation.md

df53313

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

Merge branch 'huggingface:main' into remove-torchaudio-use-soundfile-for-mp3

4a6b319

refactor decoding: move all the code under the main decode_example func

85d441d

get audio format with os.path instead of string split

b6c3d4d

add module config variables for opus and mp3 support

9838ef4

apply steven's suggestion to installation docs

985d467

lhoestq approved these changes Feb 28, 2023

wrap torch.from_numpy in a func to avoid torch.from_numpy pickling error

a5d040f

mariosasko reviewed Feb 28, 2023

polinaeterna and others added 3 commits February 28, 2023 18:35

Apply suggestions from code review

9809db5

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

fix code style

b7562f4

import xsplitext

1598206

polinaeterna requested a review from mariosasko February 28, 2023 18:33

mariosasko approved these changes Feb 28, 2023

Merge branch 'huggingface:main' into remove-torchaudio-use-soundfile-for-mp3

cb7af00

polinaeterna merged commit f965477 into huggingface:main Feb 28, 2023

polinaeterna deleted the remove-torchaudio-use-soundfile-for-mp3 branch February 28, 2023 20:16

sanchit-gandhi mentioned this pull request Mar 22, 2023

[Audio] Soundfile/libsndfile requirements too stringent for decoding mp3 files #5659

Closed

severo mentioned this pull request Mar 31, 2023

Update datasets to 2.11.0 huggingface/dataset-viewer#1002

Closed

6 tasks

sanchit-gandhi mentioned this pull request May 22, 2023

TypeError: 'type' object is not subscriptable huggingface/transformers#23472

Closed

4 tasks

albertvillanova mentioned this pull request Jun 1, 2023

Use soundfile for mp3 decoding instead of torchaudio huggingface/dataset-viewer#1280

Closed
