Select or exclude audio parts before transcribing #467
Replies: 3 comments 3 replies
-
This is a very good idea @octimot. I just looked at the colab. Well constructed and easy to follow. One immediate use case that I can see would be transcription of large audio/video where each chapter/scene might need different Whisper parameters/options. I work on Japanese anime (I'm not a techie, but I'm trying to follow the latest tech possibilities). It has been a challenge for the anime and game community outside Japan to translate those media through OCR and audio-to-text technologies, and Whisper seems to have potential there. I think your interval-based approach can help adjust Whisper parameters based on each scene's characteristics and automate some of the workflow.
-
I've put together a tool for FCPX that is less polished than yours and works through FCPXML: tag files in FCPX, export an XML, and it uses Whisper (and/or WhisperX!) to transcribe and create SRTs, and uses bolded ranges in .docx transcript files to publish favorited ranges back to FCPX. But the biggest problem is that some of our audio files have no audio on the first channel, or no audio on the left side of the first stream. By default, ffmpeg only uses the first channel of audio it finds in the stream with the most channels, which is great for transcoding movies to mkv but terrible for video production. I've put together some pretty messy, esoteric ways of iterating through audio channels using ffprobe etc., and I can at least merge ALL audio channels together to avoid getting no result at all, but I can't figure out how to reliably pull a given audio channel from a video file. There are catches with amerge, map, amix, pan, etc. in ffmpeg. @octimot do you know how to do this?
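For what it's worth, one combination that should reliably pin both the stream and the channel is `-map` with an explicit audio-stream specifier plus the `pan` filter to pick a single input channel. A minimal sketch below builds such a command from Python; the helper name and file paths are made up for illustration:

```python
def extract_channel_cmd(src, stream_index, channel_index, dst):
    """Build an ffmpeg command that pulls one channel of one audio stream.

    -map 0:a:<stream_index> selects the audio stream explicitly, instead of
    relying on ffmpeg's default "stream with the most channels" heuristic.
    The pan filter then routes input channel c<channel_index> to a mono output.
    """
    pan_expr = f"pan=mono|c0=c{channel_index}"
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-map", f"0:a:{stream_index}",
        "-af", pan_expr,
        dst,
    ]

# e.g. pull channel 1 (the second channel) of the first audio stream:
cmd = extract_channel_cmd("in.mov", 0, 1, "out.wav")
```

You would hand the resulting list to `subprocess.run(cmd, check=True)`. Pairing this with ffprobe output (stream index and channel count per stream) lets you iterate over every channel deterministically rather than trusting the default selection.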
-
Thanks for putting this together. A couple of questions though:
-
While working on our tool for Resolve (https://github.com/octimot/StoryToolkitAI), I realized that, besides a few proposals made by people in the community, there is no standardized way to select or exclude certain parts of the audio file to send to Whisper.
So, I've created a notebook as a proposal on how to select or exclude audio file parts without splitting or merging the file itself, but simply by loading the audio as an array via librosa and passing that to Whisper. This is quite efficient and I haven't seen any real performance hits.
OpenAI Whisper - transcribe only certain time intervals.ipynb
https://colab.research.google.com/drive/17cTsmfVJmpDDMURGcu8hUu1zHNAYbfa5?usp=sharing
What one can do in the notebook:
An important element of the process is that after the transcription, the script adds the time offsets to the results, so that their start and end times can be used to reference the correct portions of the original audio file.
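The two steps described above, slicing the requested intervals out of the loaded audio array and shifting the resulting segment times back onto the original file's timeline, can be sketched roughly as follows. This is a simplified illustration of the notebook's approach, not its exact code; the function names are made up:

```python
import numpy as np

SAMPLE_RATE = 16000  # the sample rate Whisper expects

def select_intervals(audio, intervals, sr=SAMPLE_RATE):
    """Keep only the given (start_sec, end_sec) intervals of a 1-D audio array.

    The file itself is never split or re-encoded; we just index into the
    array (as loaded e.g. via librosa.load(path, sr=sr)) and concatenate.
    """
    parts = [audio[int(start * sr):int(end * sr)] for start, end in intervals]
    return np.concatenate(parts) if parts else np.empty(0, dtype=audio.dtype)

def offset_segments(segments, interval_start):
    """Shift Whisper segment times so they reference the original timeline.

    Whisper reports times relative to the audio it was given; adding the
    interval's start offset maps them back to the source file.
    """
    return [dict(seg, start=seg["start"] + interval_start,
                 end=seg["end"] + interval_start)
            for seg in segments]
```

With a loaded model, usage would look something like `result = model.transcribe(select_intervals(audio, [(10.0, 25.0)]))`, followed by `offset_segments(result["segments"], 10.0)` to restore the original timings.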
The selecting/excluding of certain parts of the audio file could be useful for many applications, but in particular for:
Feel free to use the code in any way you want and let us know about possible optimizations.