Use soundfile for mp3 decoding instead of torchaudio #5573

Contributor

@polinaeterna polinaeterna commented Feb 23, 2023

I've removed torchaudio completely and switched to using soundfile for everything. With the new version of the soundfile package this should work smoothly because the libsndfile C library is bundled, including in the Linux wheels.

Let me know if you think it's too harsh and we should continue to support torchaudio decoding.

I decided that we can drop it completely because:

  1. there's always something wrong with torchaudio (for example, recently: Error loading MP3 files from CommonVoice #5488)
  2. the results of mp3 decoding are different depending on torchaudio version
  3. soundfile is slightly faster than the latest torchaudio
  4. anyway, users can pass any custom decoding function with any library they want if needed (worth putting a snippet in the docs).

cc @sanchit-gandhi @vaibhavad
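
Point 4 in the list above mentions passing a custom decoding function. A minimal sketch of that idea follows, assuming the audio column is left undecoded so that each example is a dict with "bytes" and "path" keys. The names decode_wav_bytes, decode_example and make_test_wav are hypothetical, and the standard-library wave module stands in for whichever decoding library is preferred:

```python
import io
import struct
import wave


def decode_wav_bytes(data: bytes):
    """Hypothetical custom decoder: 16-bit PCM WAV bytes -> (samples, sampling_rate)."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_values = wav.getnframes() * wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
        sampling_rate = wav.getframerate()
    ints = struct.unpack(f"<{n_values}h", raw)
    # Scale signed 16-bit integers to floats in [-1.0, 1.0).
    samples = [value / 32768.0 for value in ints]
    return samples, sampling_rate


def decode_example(example: dict, decoder=decode_wav_bytes) -> dict:
    """Decode one undecoded {"bytes", "path"} record with any decoder callable."""
    data = example["bytes"]
    if data is None:
        # Fall back to reading the file when only a path is stored.
        with open(example["path"], "rb") as f:
            data = f.read()
    array, sampling_rate = decoder(data)
    return {"path": example["path"], "array": array, "sampling_rate": sampling_rate}


def make_test_wav(samples, sampling_rate=16000) -> bytes:
    """Build a small mono 16-bit WAV in memory, only to exercise the sketch."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sampling_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()
```

With datasets, this would presumably be applied after casting the column with Audio(decode=False), for example through Dataset.map. Any callable that turns bytes into (array, sampling_rate) can be passed as decoder, for instance a thin wrapper around soundfile.read on an io.BytesIO object.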


@polinaeterna polinaeterna marked this pull request as ready for review February 27, 2023 16:53
Collaborator

@mariosasko mariosasko left a comment

Good stuff!

docs/source/installation.md (outdated, resolved)
setup.py (resolved)
@@ -243,128 +236,52 @@ def path_to_bytes(path):
         storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
         return array_cast(storage, self.pa_type)
 
-    def _decode_non_mp3_path_like(
-        self, path, format=None, token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None
+    def _decode_example(

Collaborator

I think we can put all this logic in decode_example and structure it the same way as in Image.decode_example

         try:
             import librosa
             import soundfile as sf
         except ImportError as err:
             raise ImportError("To support decoding audio files, please install 'librosa' and 'soundfile'.") from err
 
         if format == "opus":
-            if version.parse(sf.__libsndfile_version__) < version.parse("1.0.30"):
+            if version.parse(sf.__libsndfile_version__) < version.parse("1.0.31"):

Collaborator

Perhaps we can introduce 2 module-level variables for these checks:

IS_OPUS_SUPPORTED = version.parse(sf.__libsndfile_version__) >= version.parse("1.0.31")
IS_MP3_SUPPORTED = version.parse(sf.__libsndfile_version__) >= version.parse("1.0.31")
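
A rough sketch of how such module-level flags could be computed once at import time, without failing when soundfile is missing. The helpers parse_version, at_least and libsndfile_version are hypothetical and use only the standard library instead of packaging.version. The Opus minimum of 1.0.31 comes from the diff; the MP3 minimum of 1.1.0 is an assumption based on libsndfile adding MP3 support in that release, whereas the comment above uses 1.0.31 for both:

```python
import importlib
import importlib.util
import re


def parse_version(text: str) -> tuple:
    """Turn "1.0.31" into (1, 0, 31), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so "1.1" compares equal to "1.1.0".
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def at_least(current, minimum: str) -> bool:
    """True when a version is known and is not older than `minimum`."""
    return current is not None and parse_version(current) >= parse_version(minimum)


def libsndfile_version():
    """Return soundfile's libsndfile version string, or None if unavailable."""
    if importlib.util.find_spec("soundfile") is None:
        return None
    try:
        return importlib.import_module("soundfile").__libsndfile_version__
    except Exception:
        # soundfile is installed but the underlying C library could not be loaded.
        return None


_LIBSNDFILE_VERSION = libsndfile_version()
IS_OPUS_SUPPORTED = at_least(_LIBSNDFILE_VERSION, "1.0.31")
IS_MP3_SUPPORTED = at_least(_LIBSNDFILE_VERSION, "1.1.0")  # assumed minimum
```

Note that a flag named IS_*_SUPPORTED needs a "greater than or equal" comparison, while the check in the diff uses "less than" because it guards the error path.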

@polinaeterna
Contributor Author

@mariosasko thank you for the review! do you have any idea why test_hash_torch_tensor fails on "ubuntu-latest deps-minimum"? I removed the torchaudio<0.12.0 test dependency so it uses the latest torch now, might it be connected?

Member

@stevhliu stevhliu left a comment

Nice, thanks for also updating the docs!

docs/source/installation.md (outdated, resolved)
@mariosasko
Collaborator

mariosasko commented Feb 27, 2023

@polinaeterna The failure is due to torch.from_numpy not being picklable in newer versions of PyTorch. You can replace the current definition of _save_tensor in utils/py_utils.py with the following one to fix it:

@pklregister(obj_type)
def _save_tensor(pickler, obj):
    # `torch.from_numpy` is not picklable in `torch>=1.11.0`
    def _create_tensor(np_array):
        return torch.from_numpy(np_array)

    dill_log(pickler, f"To: {obj}")
    args = (obj.detach().cpu().numpy(),)
    pickler.save_reduce(_create_tensor, args, obj=obj)
    dill_log(pickler, "# To")
    return
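
The general pattern behind this fix is a reducer: an object that cannot be pickled directly is saved as a reconstructor callable plus a picklable payload. The sketch below is not the datasets code and uses neither dill nor torch. It shows the same idea with the standard-library pickle module, using memoryview as a stand-in for an unpicklable type; ViewPickler, _reduce_memoryview and dumps are hypothetical names:

```python
import copyreg
import io
import pickle


def _reduce_memoryview(view):
    # Return (reconstructor, args): the payload is plain bytes, which pickle
    # handles natively, and memoryview(bytes) rebuilds the object on load.
    return memoryview, (view.tobytes(),)


class ViewPickler(pickle.Pickler):
    # Per-pickler dispatch table, so the global copyreg table stays untouched.
    dispatch_table = copyreg.dispatch_table.copy()
    dispatch_table[memoryview] = _reduce_memoryview


def dumps(obj) -> bytes:
    """Pickle `obj` with the custom reducer registered above."""
    buffer = io.BytesIO()
    ViewPickler(buffer).dump(obj)
    return buffer.getvalue()
```

In the snippet above, the payload is the NumPy array obtained from the tensor and the reconstructor is the small _create_tensor wrapper, which avoids pickling torch.from_numpy itself.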

Member

@lhoestq lhoestq left a comment

Thanks a lot! You can merge when the CI is green (either with Mario's fix or by skipping the torch test for recent torch versions if you think we need to fix it in another PR)

@lhoestq
Member

lhoestq commented Feb 28, 2023

(doing a patch release now - please wait before merging ^^)

Collaborator

@mariosasko mariosasko left a comment

Some additional comments

src/datasets/features/audio.py (outdated, resolved)
src/datasets/features/audio.py (outdated, resolved)
polinaeterna and others added 3 commits February 28, 2023 18:35
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
@polinaeterna
Contributor Author

@mariosasko great, thank you!! I've integrated all your changes, can you please take a look one more time?

@lhoestq
Member

lhoestq commented Feb 28, 2023

Patch release is done (I did it from a branch other than main anyway)

Collaborator

@mariosasko mariosasko left a comment

Looks all good now!

@polinaeterna polinaeterna merged commit f965477 into huggingface:main Feb 28, 2023
@polinaeterna polinaeterna deleted the remove-torchaudio-use-soundfile-for-mp3 branch February 28, 2023 20:16
@github-actions

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010927 | 0.011353 | -0.000426 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006232 | 0.011008 | -0.004776 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.119815 | 0.038508 | 0.081307 |
| read_batch_unformated after write_array2d | 0.034138 | 0.023109 | 0.011029 |
| read_batch_unformated after write_flattened_sequence | 0.349945 | 0.275898 | 0.074047 |
| read_batch_unformated after write_nested_sequence | 0.404967 | 0.323480 | 0.081487 |
| read_col_formatted_as_numpy after write_array2d | 0.008672 | 0.007986 | 0.000687 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005010 | 0.004328 | 0.000681 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.091931 | 0.004250 | 0.087680 |
| read_col_unformated after write_array2d | 0.042534 | 0.037052 | 0.005482 |
| read_col_unformated after write_flattened_sequence | 0.374701 | 0.258489 | 0.116212 |
| read_col_unformated after write_nested_sequence | 0.401027 | 0.293841 | 0.107186 |
| read_formatted_as_numpy after write_array2d | 0.053523 | 0.128546 | -0.075024 |
| read_formatted_as_numpy after write_flattened_sequence | 0.019704 | 0.075646 | -0.055942 |
| read_formatted_as_numpy after write_nested_sequence | 0.384207 | 0.419271 | -0.035064 |
| read_unformated after write_array2d | 0.065350 | 0.043533 | 0.021817 |
| read_unformated after write_flattened_sequence | 0.375074 | 0.255139 | 0.119935 |
| read_unformated after write_nested_sequence | 0.390458 | 0.283200 | 0.107259 |
| write_array2d | 0.110549 | 0.141683 | -0.031134 |
| write_flattened_sequence | 1.719812 | 1.452155 | 0.267657 |
| write_nested_sequence | 1.748906 | 1.492716 | 0.256190 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.210051 | 0.018006 | 0.192045 |
| get_batch_of_1024_rows | 0.546503 | 0.000490 | 0.546013 |
| get_first_row | 0.004078 | 0.000200 | 0.003878 |
| get_last_row | 0.000111 | 0.000054 | 0.000056 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.030212 | 0.037411 | -0.007199 |
| shard | 0.121845 | 0.014526 | 0.107319 |
| shuffle | 0.136309 | 0.176557 | -0.040247 |
| sort | 0.204667 | 0.737135 | -0.532468 |
| train_test_split | 0.157327 | 0.296338 | -0.139012 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.672548 | 0.215209 | 0.457339 |
| read 50000 | 6.239409 | 2.077655 | 4.161754 |
| read_batch 50000 10 | 2.462441 | 1.504120 | 0.958322 |
| read_batch 50000 100 | 2.063985 | 1.541195 | 0.522791 |
| read_batch 50000 1000 | 2.098858 | 1.468490 | 0.630368 |
| read_formatted numpy 5000 | 1.262600 | 4.584777 | -3.322177 |
| read_formatted pandas 5000 | 5.478462 | 3.745712 | 1.732750 |
| read_formatted tensorflow 5000 | 5.454672 | 5.269862 | 0.184810 |
| read_formatted torch 5000 | 2.991866 | 4.565676 | -1.573810 |
| read_formatted_batch numpy 5000 10 | 0.153415 | 0.424275 | -0.270861 |
| read_formatted_batch numpy 5000 1000 | 0.015061 | 0.007607 | 0.007454 |
| shuffled read 5000 | 0.796115 | 0.226044 | 0.570071 |
| shuffled read 50000 | 8.206858 | 2.268929 | 5.937930 |
| shuffled read_batch 50000 10 | 3.226395 | 55.444624 | -52.218229 |
| shuffled read_batch 50000 100 | 2.503522 | 6.876477 | -4.372955 |
| shuffled read_batch 50000 1000 | 2.547489 | 2.142072 | 0.405417 |
| shuffled read_formatted numpy 5000 | 1.504776 | 4.805227 | -3.300451 |
| shuffled read_formatted_batch numpy 5000 10 | 0.256536 | 6.500664 | -6.244128 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.078543 | 0.075469 | 0.003073 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.591109 | 1.841788 | -0.250678 |
| map fast-tokenizer batched | 18.153317 | 8.074308 | 10.079008 |
| map identity | 20.465684 | 10.191392 | 10.274292 |
| map identity batched | 0.229808 | 0.680424 | -0.450616 |
| map no-op batched | 0.045263 | 0.534201 | -0.488938 |
| map no-op batched numpy | 0.556760 | 0.579283 | -0.022524 |
| map no-op batched pandas | 0.614985 | 0.434364 | 0.180622 |
| map no-op batched pytorch | 0.635675 | 0.540337 | 0.095337 |
| map no-op batched tensorflow | 0.729817 | 1.386936 | -0.657119 |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011247 | 0.011353 | -0.000106 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006823 | 0.011008 | -0.004185 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.101989 | 0.038508 | 0.063481 |
| read_batch_unformated after write_array2d | 0.036077 | 0.023109 | 0.012968 |
| read_batch_unformated after write_flattened_sequence | 0.413469 | 0.275898 | 0.137571 |
| read_batch_unformated after write_nested_sequence | 0.505560 | 0.323480 | 0.182080 |
| read_col_formatted_as_numpy after write_array2d | 0.007506 | 0.007986 | -0.000480 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006369 | 0.004328 | 0.002040 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.099597 | 0.004250 | 0.095346 |
| read_col_unformated after write_array2d | 0.058115 | 0.037052 | 0.021063 |
| read_col_unformated after write_flattened_sequence | 0.414735 | 0.258489 | 0.156246 |
| read_col_unformated after write_nested_sequence | 0.466801 | 0.293841 | 0.172960 |
| read_formatted_as_numpy after write_array2d | 0.064771 | 0.128546 | -0.063775 |
| read_formatted_as_numpy after write_flattened_sequence | 0.021100 | 0.075646 | -0.054546 |
| read_formatted_as_numpy after write_nested_sequence | 0.135407 | 0.419271 | -0.283864 |
| read_unformated after write_array2d | 0.068784 | 0.043533 | 0.025251 |
| read_unformated after write_flattened_sequence | 0.410467 | 0.255139 | 0.155328 |
| read_unformated after write_nested_sequence | 0.465993 | 0.283200 | 0.182794 |
| write_array2d | 0.119404 | 0.141683 | -0.022279 |
| write_flattened_sequence | 1.767107 | 1.452155 | 0.314952 |
| write_nested_sequence | 1.938342 | 1.492716 | 0.445626 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.227038 | 0.018006 | 0.209032 |
| get_batch_of_1024_rows | 0.511389 | 0.000490 | 0.510899 |
| get_first_row | 0.006723 | 0.000200 | 0.006523 |
| get_last_row | 0.000118 | 0.000054 | 0.000064 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.033078 | 0.037411 | -0.004333 |
| shard | 0.133159 | 0.014526 | 0.118633 |
| shuffle | 0.147928 | 0.176557 | -0.028629 |
| sort | 0.214005 | 0.737135 | -0.523130 |
| train_test_split | 0.151655 | 0.296338 | -0.144683 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.634829 | 0.215209 | 0.419620 |
| read 50000 | 6.578640 | 2.077655 | 4.500985 |
| read_batch 50000 10 | 2.673598 | 1.504120 | 1.169478 |
| read_batch 50000 100 | 2.338671 | 1.541195 | 0.797476 |
| read_batch 50000 1000 | 2.389104 | 1.468490 | 0.920614 |
| read_formatted numpy 5000 | 1.274938 | 4.584777 | -3.309839 |
| read_formatted pandas 5000 | 5.746524 | 3.745712 | 2.000812 |
| read_formatted tensorflow 5000 | 5.992084 | 5.269862 | 0.722222 |
| read_formatted torch 5000 | 3.092090 | 4.565676 | -1.473587 |
| read_formatted_batch numpy 5000 10 | 0.150375 | 0.424275 | -0.273900 |
| read_formatted_batch numpy 5000 1000 | 0.015470 | 0.007607 | 0.007863 |
| shuffled read 5000 | 0.792962 | 0.226044 | 0.566918 |
| shuffled read 50000 | 8.057491 | 2.268929 | 5.788563 |
| shuffled read_batch 50000 10 | 3.483966 | 55.444624 | -51.960659 |
| shuffled read_batch 50000 100 | 2.715038 | 6.876477 | -4.161438 |
| shuffled read_batch 50000 1000 | 2.747186 | 2.142072 | 0.605114 |
| shuffled read_formatted numpy 5000 | 1.532951 | 4.805227 | -3.272276 |
| shuffled read_formatted_batch numpy 5000 10 | 0.262214 | 6.500664 | -6.238450 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.081308 | 0.075469 | 0.005839 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.698448 | 1.841788 | -0.143340 |
| map fast-tokenizer batched | 18.590002 | 8.074308 | 10.515694 |
| map identity | 20.584508 | 10.191392 | 10.393116 |
| map identity batched | 0.227237 | 0.680424 | -0.453187 |
| map no-op batched | 0.028445 | 0.534201 | -0.505756 |
| map no-op batched numpy | 0.527874 | 0.579283 | -0.051409 |
| map no-op batched pandas | 0.602844 | 0.434364 | 0.168480 |
| map no-op batched pytorch | 0.672948 | 0.540337 | 0.132611 |
| map no-op batched tensorflow | 0.788103 | 1.386936 | -0.598833 |
