Pass PyDub AudioSegment into Whisper Transcribe() #2048

tddouglas · 2024-02-26T05:23:13Z

Hello!
I know from the transcribe signature a numpy array should be valid input, but I am unable to get it to work.

I'm currently reading in a .wav file, passing it into audio_transcribe, slicing it, and attempting to transcribe the individual chunks:

def audio_transcribe(audio: AudioSegment, audio_time_start: float, audio_time_end: float):
    trimmed_audio = audio[(audio_time_start * 1000): (audio_time_end * 1000)]  # convert seconds to MS
    raw_audio = trimmed_audio.raw_data
    loaded_audio = load_audio(raw_audio)

    model = whisper.load_model("base")
    result = model.transcribe(loaded_audio)
    return result["text"]

I've taken the load_audio function from this discussion

def load_audio(file: (str, bytes), sr: int = 16000):
    """
    Open an audio file and read as mono waveform, resampling as necessary

    Parameters
    ----------
    file: (str, bytes)
        The audio file to open or bytes of audio file

    sr: int
        The sample rate to resample the audio if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """

    if isinstance(file, bytes):
        inp = file
        file = 'pipe:'
    else:
        inp = None

    try:
        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
        out, _ = (
            ffmpeg.input(file, threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=inp)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio:\n {e.stderr.decode()}") from e

    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

But I'm unable to get it to work. I always run into this exception when trying to use ffmpeg to preprocess the audio:

  File "/Users/tyler/Documents/dev/Playground/checkspod/audio_to_text.py", line 84, in audio_transcribe
    loaded_audio = load_audio(raw_audio)
  File "/Users/tyler/Documents/dev/Playground/checkspod/audio_file_manipulator.py", line 77, in load_audio
    raise RuntimeError(f"Failed to load audio:\n {e.stderr.decode()}") from e
RuntimeError: Failed to load audio:
 ffmpeg version 6.0 Copyright (c) 2000-2023 the FFmpeg developers
  built with Apple clang version 15.0.0 (clang-1500.0.40.1)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/6.0_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
  libpostproc    57.  1.100 / 57.  1.100
pipe:: Invalid data found when processing input

I know the audio is valid after slicing (I can export a chunk to a file and listen to it to make sure). So it seems load_audio doesn't play nicely with AudioSegment raw_data, but I have no idea why that might be. ffmpeg should work with bytes, and the .raw_data property is a byte string.

Note:
The audio I'm working with has a 48 kHz sample rate and 16 bits per sample.

Purfview · 2024-02-26T12:04:28Z

Because your input is raw, headerless audio coming from a pipe, you need to describe the input audio format to ffmpeg. For example:

ffmpeg.input('pipe:', format="s16le", acodec="pcm_s16le", ac=1, ar=48000)
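
A minimal sketch of how the input description could be derived from the segment instead of being hardcoded. It assumes the bytes come from a pydub AudioSegment, whose sample_width, channels, and frame_rate attributes describe its raw_data. The helper name raw_pcm_input_args is hypothetical, and only 16-bit and 32-bit signed little-endian samples are handled here.

```python
def raw_pcm_input_args(sample_width: int, channels: int, frame_rate: int) -> dict:
    """Build ffmpeg input options describing headerless PCM bytes.

    Hypothetical helper: maps the sample width in bytes to the matching
    ffmpeg raw format name. Other widths are rejected rather than guessed.
    """
    formats = {2: "s16le", 4: "s32le"}  # bytes per sample -> ffmpeg raw format
    if sample_width not in formats:
        raise ValueError(f"unsupported sample width: {sample_width} bytes")
    fmt = formats[sample_width]
    return {
        "format": fmt,            # container: raw samples, no header
        "acodec": f"pcm_{fmt}",   # codec matching the raw format
        "ac": channels,           # channel count of the input
        "ar": frame_rate,         # sample rate of the input in Hz
    }


# Assumed usage with ffmpeg-python and a pydub AudioSegment named seg:
#   args = raw_pcm_input_args(seg.sample_width, seg.channels, seg.frame_rate)
#   stream = ffmpeg.input("pipe:", **args)
print(raw_pcm_input_args(2, 2, 48000))
```

Reading the values from the segment means a mono or stereo file is described correctly without checking it by hand first.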

tddouglas · 2024-02-27T04:10:59Z

Thank you so much! I didn't think about needing to specify the input audio format, since the default whisper load_audio doesn't appear to do that. I'm now assuming that, for regular files, ffmpeg is able to pull the relevant info from the file metadata or has another mechanism.

Ended up solving it with the below:

out, _ = (
    ffmpeg.input(file, format="s16le", acodec="pcm_s16le", ac=2, ar=48000)
    .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
    .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=inp)
)

Importantly, I was getting gibberish transcriptions until I ran ffprobe -show_streams {filename} and realized the file was 2-channel audio, so I needed to set ac=2.
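
To illustrate why the channel count matters, here is a rough sketch (not part of the solution above) that decodes interleaved 16-bit PCM with NumPy alone. The function name pcm16_to_mono_float32 is hypothetical. Stereo raw data stores samples as left, right, left, right, so reading it as mono treats every value as a separate frame and stretches the audio to twice its length.

```python
import numpy as np


def pcm16_to_mono_float32(raw: bytes, channels: int) -> np.ndarray:
    """Convert interleaved signed 16-bit little-endian PCM to mono float32.

    Hypothetical helper for illustration. It does NOT resample, so the
    result keeps the sample rate of the input.
    """
    samples = np.frombuffer(raw, dtype="<i2")
    if samples.size % channels:
        raise ValueError("byte length does not match the channel count")
    # One row per frame, one column per channel.
    frames = samples.reshape(-1, channels).astype(np.float32)
    # Down-mix by averaging the channels, then scale to [-1.0, 1.0).
    return frames.mean(axis=1) / 32768.0


# Two stereo frames: (left=1000, right=3000) and (left=-2000, right=-4000).
raw = np.array([1000, 3000, -2000, -4000], dtype="<i2").tobytes()
print(pcm16_to_mono_float32(raw, channels=2).shape)  # two frames when read as stereo
print(pcm16_to_mono_float32(raw, channels=1).shape)  # four frames when misread as mono
```

Whisper treats a NumPy array as 16 kHz audio, so 48 kHz data would still need resampling before transcription. One possible route, assuming pydub, is to call set_frame_rate(16000) and set_channels(1) on the AudioSegment before taking raw_data.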
