Replies: 2 comments 3 replies
-
I want to know how to do streaming ASR.
-
Well, every model based on the Transformer architecture, like Whisper, must have a limited max_length for the sequence; otherwise it could, for example, run out of memory (the network is huge). But the Whisper codebase is actually quite good at processing long audio files (I have tested it). The approach, as far as I know, is based on chunking with strides (see: https://huggingface.co/blog/asr-chunking), so this is not a real limitation. The value of `N_FRAMES` determines the size of the input to the encoder, which is fixed by the architecture; you cannot change it.
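For example, the chunking-with-strides idea from the linked blog post can be tried through the Hugging Face `transformers` ASR pipeline. This is only a minimal sketch; the checkpoint name, file name, and stride values are illustrative choices, not anything prescribed by the Whisper repo:

```python
# Sketch of chunked long-form transcription via the Hugging Face ASR pipeline.
# chunk_length_s / stride_length_s follow the chunking-with-strides approach
# described in https://huggingface.co/blog/asr-chunking.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # illustrative checkpoint choice
    chunk_length_s=30,              # each chunk matches Whisper's 30-second window
    stride_length_s=(5, 5),         # overlap on both sides, merged after decoding
)

result = asr("long_audio.mp3")      # placeholder input file
print(result["text"])
```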
-
Hi! Running the code base from the README.md, I have tried to modify the hard-coded audio hyperparameters in `audio.py`, hoping to increase the default 30-second sample length. I noted that `N_FRAMES` must divide evenly (modulus of zero), but any change to `SAMPLE_RATE`, `HOP_LENGTH`, or `CHUNK_LENGTH` results in `model.py` complaining about an incorrect audio shape. How can I fix this?
Also, is it possible to perform `model.transcribe("audio.mp3")` with a `prompt`, e.g. via `DecodingOptions()`? See the sketch below for what I mean.