The processing efficiency and sampling rate problem of OPUS files #1149

Open
yangb05 opened this issue Sep 15, 2023 · 5 comments

@yangb05
Contributor

yangb05 commented Sep 15, 2023

I am trying to process a large dataset containing .wav and .opus files, and found that processing .wav files is nearly 6 times faster than processing .opus files, specifically when generating recordings and supervisions. After debugging, I found the difference is that .wav files are read with torchaudio while .opus files are read with ffmpeg.
The read_opus function in lhotse/audio/backend.py is:

def read_opus(
    path: Pathlike,
    offset: Seconds = 0.0,
    duration: Optional[Seconds] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[np.ndarray, int]:
    """
    Reads OPUS files either using torchaudio or ffmpeg.
    Torchaudio is faster, but if unavailable for some reason,
    we fallback to a slower ffmpeg-based implementation.

    :return: a tuple of audio samples and the sampling rate.
    """
    # TODO: Revisit using torchaudio backend for OPUS
    #       once it's more thoroughly benchmarked against ffmpeg
    #       and has a competitive I/O speed.
    #       See: https://github.com/pytorch/audio/issues/1994
    # try:
    #     return read_opus_torchaudio(
    #         path=path,
    #         offset=offset,
    #         duration=duration,
    #         force_opus_sampling_rate=force_opus_sampling_rate,
    #     )
    # except:
    return read_opus_ffmpeg(
        path=path,
        offset=offset,
        duration=duration,
        force_opus_sampling_rate=force_opus_sampling_rate,
    )

Although the note suggests ffmpeg is faster, in my case torchaudio is better. When I simply call read_opus_torchaudio in the code above instead, I get the speedup.
pytorch: 1.13
ffmpeg: [screenshot, image not preserved]
torchaudio: [screenshot, image not preserved]
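A comparison like the one above can be reproduced with a small timing harness. This is only a minimal sketch: `time_reader` and `compare_readers` are hypothetical helper names (not part of lhotse), and they accept any callable that takes a file path.

```python
import time
from typing import Callable, Dict, Iterable


def time_reader(reader: Callable[[str], object], paths: Iterable[str]) -> float:
    """Return the total wall-clock seconds spent calling `reader` on every path."""
    start = time.perf_counter()
    for path in paths:
        reader(path)
    return time.perf_counter() - start


def compare_readers(
    readers: Dict[str, Callable[[str], object]], paths: Iterable[str]
) -> Dict[str, float]:
    """Time each named reader on the same list of paths (hypothetical helper)."""
    paths = list(paths)  # materialize once so every reader sees the same files
    return {name: time_reader(fn, paths) for name, fn in readers.items()}
```

For this issue, one would presumably pass `read_opus_ffmpeg` and `read_opus_torchaudio` from lhotse/audio/backend.py as the two readers, with a sample of .opus paths from the dataset. Results depend on the system and on the ffmpeg and torchaudio builds, so they may not generalize.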

Also, there is another problem when using the read_opus_ffmpeg function:

def read_opus_ffmpeg(
    path: Pathlike,
    offset: Seconds = 0.0,
    duration: Optional[Seconds] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[np.ndarray, int]:
    """
    Reads OPUS files using ffmpeg in a shell subprocess.
    Unlike audioread, correctly supports offsets and durations for reading short chunks.
    Optionally, we can force ffmpeg to resample to the true sampling rate (if we know it up-front).

    :return: a tuple of audio samples and the sampling rate.
    """
    # Construct the ffmpeg command depending on the arguments passed.
    cmd = "ffmpeg -threads 1"
    sampling_rate = 48000
    # Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
    #       (-i), otherwise ffmpeg will decode everything and trim afterwards...
    if offset > 0:
        cmd += f" -ss {offset}"
    if duration is not None:
        cmd += f" -t {duration}"
    # Add the input specifier after offset and duration.
    cmd += f" -i {path}"
    # Optionally resample the output.
    if force_opus_sampling_rate is not None:
        cmd += f" -ar {force_opus_sampling_rate}"
        sampling_rate = force_opus_sampling_rate
    # Read audio samples directly as float32.
    cmd += " -f f32le -threads 1 pipe:1"
    # Actual audio reading.
    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    raw_audio = proc.stdout
    audio = np.frombuffer(raw_audio, dtype=np.float32)
    # Determine if the recording is mono or stereo and decode accordingly.
    try:
        channel_string = parse_channel_from_ffmpeg_output(proc.stderr)
        if channel_string == "stereo":
            new_audio = np.empty((2, audio.shape[0] // 2), dtype=np.float32)
            new_audio[0, :] = audio[::2]
            new_audio[1, :] = audio[1::2]
            audio = new_audio
        elif channel_string == "mono":
            audio = audio.reshape(1, -1)
        else:
            raise NotImplementedError(
                f"Unknown channel description from ffmpeg: {channel_string}"
            )
    except ValueError as e:
        raise AudioLoadingError(
            f"{e}\nThe ffmpeg command for which the program failed is: '{cmd}', error code: {proc.returncode}"
        )
    return audio, sampling_rate

It assumes all .opus files have a sampling rate of 48000, which is a problem for less typical datasets; in my case, for example, it can be 16000. If force_opus_sampling_rate is not specified, the recorded sampling_rate will be 48000 while the file is actually read at 16000, which affects the subsequent computation of num_samples and features.
I think always adding '-ar {sampling_rate}' to the cmd will solve the problem, for example:

def read_opus_ffmpeg(
    path: Pathlike,
    offset: Seconds = 0.0,
    duration: Optional[Seconds] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[np.ndarray, int]:
    """
    Reads OPUS files using ffmpeg in a shell subprocess.
    Unlike audioread, correctly supports offsets and durations for reading short chunks.
    Optionally, we can force ffmpeg to resample to the true sampling rate (if we know it up-front).

    :return: a tuple of audio samples and the sampling rate.
    """
    # Construct the ffmpeg command depending on the arguments passed.
    cmd = "ffmpeg -threads 1"
    sampling_rate = 48000
    # Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
    #       (-i), otherwise ffmpeg will decode everything and trim afterwards...
    if offset > 0:
        cmd += f" -ss {offset}"
    if duration is not None:
        cmd += f" -t {duration}"
    # Add the input specifier after offset and duration.
    cmd += f" -i {path}"
    # Optionally resample the output.
    if force_opus_sampling_rate is not None:
        sampling_rate = force_opus_sampling_rate
    cmd += f" -ar {sampling_rate}"
    # Read audio samples directly as float32.
    cmd += " -f f32le -threads 1 pipe:1"
    # Actual audio reading.
    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    raw_audio = proc.stdout
    audio = np.frombuffer(raw_audio, dtype=np.float32)
    # Determine if the recording is mono or stereo and decode accordingly.
    try:
        channel_string = parse_channel_from_ffmpeg_output(proc.stderr)
        if channel_string == "stereo":
            new_audio = np.empty((2, audio.shape[0] // 2), dtype=np.float32)
            new_audio[0, :] = audio[::2]
            new_audio[1, :] = audio[1::2]
            audio = new_audio
        elif channel_string == "mono":
            audio = audio.reshape(1, -1)
        else:
            raise NotImplementedError(
                f"Unknown channel description from ffmpeg: {channel_string}"
            )
    except ValueError as e:
        raise AudioLoadingError(
            f"{e}\nThe ffmpeg command for which the program failed is: '{cmd}', error code: {proc.returncode}"
        )
    return audio, sampling_rate
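
The command construction in the proposal can also be isolated into a pure function, which makes it easy to check that `-ar` is always present. The following is a sketch under assumptions: `build_opus_ffmpeg_cmd` and `OPUS_DEFAULT_SAMPLING_RATE` are hypothetical names, and the command is built as an argument list rather than a shell string, which differs from the code above.

```python
from typing import List, Optional, Tuple

OPUS_DEFAULT_SAMPLING_RATE = 48000  # same default as in read_opus_ffmpeg above


def build_opus_ffmpeg_cmd(
    path: str,
    offset: float = 0.0,
    duration: Optional[float] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[List[str], int]:
    """Build the ffmpeg argument list and return it with the output sampling rate."""
    cmd = ["ffmpeg", "-threads", "1"]
    # -ss and -t go before -i so that ffmpeg seeks instead of decoding everything.
    if offset > 0:
        cmd += ["-ss", str(offset)]
    if duration is not None:
        cmd += ["-t", str(duration)]
    cmd += ["-i", str(path)]
    # Always request an explicit output rate, so the returned rate matches the samples.
    sampling_rate = (
        force_opus_sampling_rate
        if force_opus_sampling_rate is not None
        else OPUS_DEFAULT_SAMPLING_RATE
    )
    cmd += ["-ar", str(sampling_rate)]
    # Raw float32 little-endian samples written to stdout.
    cmd += ["-f", "f32le", "-threads", "1", "pipe:1"]
    return cmd, sampling_rate
```

A list like this can be passed to `subprocess.run(cmd, stdout=PIPE, stderr=PIPE)` without `shell=True`, which also avoids quoting problems for paths that contain spaces.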
@pzelasko
Collaborator

Hmm, I remember disabling it because I found the reverse to be true on some systems. I think the best way forward would be to expose the control over this to the user. I'll aim to make a PR to enable this later; I was recently refactoring some of this code, so it should be easily doable.
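
One possible shape for such user control is a preference order with fallback. This is only an illustrative sketch, not the lhotse API: `read_with_preferred_backend` and its parameters are hypothetical names.

```python
from typing import Callable, Dict, Sequence, Tuple


def read_with_preferred_backend(
    path: str,
    backends: Dict[str, Callable[[str], object]],
    order: Sequence[str],
) -> Tuple[str, object]:
    """Try the backends in the user-chosen order; return (backend name, result)."""
    errors = []
    for name in order:
        try:
            return name, backends[name](path)
        except Exception as e:  # fall through to the next backend in the list
            errors.append(f"{name}: {e}")
    raise RuntimeError("All OPUS backends failed: " + "; ".join(errors))
```

With this shape, a user who finds torchaudio faster on their system would put it first in `order`, and ffmpeg would be used only if torchaudio fails.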

@pzelasko
Collaborator

Regarding 48kHz vs 16kHz, I'm not sure I got your point. OPUS is always decoded to 48kHz even if the original audio had a smaller sampling rate, unless I missed something.

@yangb05
Contributor Author

yangb05 commented Sep 19, 2023

Regarding 48kHz vs 16kHz, I'm not sure I got your point. OPUS is always decoded to 48kHz even if the original audio had a smaller sampling rate, unless I missed something.

For example, I have an .opus file in my dataset. If I use torchaudio.info() to get the sampling rate, it shows 16kHz. Also, if I use ffmpeg to read it, the log shows the input sampling rate is 16kHz. If the force_opus_sampling_rate parameter is not passed to read_opus_ffmpeg, the samples are read at 16kHz (actual), while the recording stores a sampling rate of 48kHz (default).
Assume read_opus_ffmpeg reads 30,000 samples from this .opus file, and the recorded sampling rate is 48kHz. When I try to resample it to 16kHz in the cut set, the recorded number of samples is reduced from 30,000 to 10,000. Now,

The recorded info: {sampling rate: 16kHz, num_samples: 10000}
The actual info: {sampling rate: 16kHz, num_samples: 30000}

It will cause a mismatch in the subsequent computations.
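
The arithmetic behind this mismatch can be checked with a short sketch. It assumes the recorded sample count is rescaled proportionally to the sampling rate when resampling; `rescaled_num_samples` is a hypothetical helper, not a lhotse function.

```python
def rescaled_num_samples(num_samples: int, recorded_rate: int, target_rate: int) -> int:
    """Sample count after resampling, assuming proportional rescaling (an assumption)."""
    return round(num_samples * target_rate / recorded_rate)


num_samples = 30_000   # samples actually returned by the reader
actual_rate = 16_000   # true rate of the decoded audio
recorded_rate = 48_000  # rate stored in the recording by default

# Duration implied by the recorded rate vs. the true duration, in seconds.
recorded_duration = num_samples / recorded_rate
actual_duration = num_samples / actual_rate

# Resampling the manifest to 16kHz shrinks the recorded count by a factor of 3,
# while the audio itself still holds 30,000 samples.
recorded_after_resample = rescaled_num_samples(num_samples, recorded_rate, 16_000)
```

Here the recorded duration is 0.625 s against a true duration of 1.875 s, and the recorded count after resampling is 10,000 against 30,000 actual samples.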

@pzelasko
Collaborator

If the file has 16kHz, that makes sense. I just never encountered an OPUS file that actually has a sampling rate other than 48kHz, even when I encoded WAV data into OPUS that had a smaller SR...

I think your proposed changes make sense, could you make a PR?

@yangb05
Contributor Author

yangb05 commented Sep 20, 2023

OK.
