Select or exclude audio parts before transcribing #467
Replies: 3 comments 3 replies
-
This is a very good idea @octimot. I just looked at the colab. Well constructed and easy to follow. One immediate use case that I can see would be transcription of large audio/video where each chapter/scene might need different Whisper parameters/options. I work on Japanese anime (I'm not a techie, but I'm trying to follow the latest tech possibilities). It has been a challenge for the anime and game community outside Japan to translate those media through OCR and audio-to-text technologies, and Whisper seems to have potential there. I think your interval-based approach can help adjust Whisper parameters based on each scene's characteristics and automate some of the workflow.
-
I've put together a tool for FCPX that is less polished than yours and works through FCPXML: tag files in FCPX, export an XML, and it uses Whisper (and/or WhisperX!) to transcribe and create SRTs, and uses bolded ranges in .docx transcript files to publish favorited ranges back to FCPX. But the biggest problem is that some of our audio files have no audio on the first channel, or no audio on the left side of the first stream. By default, ffmpeg only uses the first channel of audio it finds in the stream with the most channels, which is great for transcoding movies to mkv but terrible for video production. I've put together some pretty messy, esoteric ways of iterating through audio channels using ffprobe etc., and I can at least merge ALL audio channels together to avoid getting no result at all, but I can't figure out how to reliably pull a given audio channel from a video file. There are catches with amerge, map, amix, pan, etc. in ffmpeg. @octimot do you know how to do this?
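For what it's worth, one combination that should reliably pin both the stream and the channel is `-map` with an explicit audio-stream specifier plus the `pan` filter to pick a single input channel. A minimal sketch below builds such a command from Python; the helper name and file paths are made up for illustration:

```python
def extract_channel_cmd(src, stream_index, channel_index, dst):
    """Build an ffmpeg command that pulls one channel of one audio stream.

    -map 0:a:<stream_index> selects the audio stream explicitly, instead of
    relying on ffmpeg's default "stream with the most channels" heuristic.
    The pan filter then routes input channel c<channel_index> to a mono output.
    """
    pan_expr = f"pan=mono|c0=c{channel_index}"
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-map", f"0:a:{stream_index}",
        "-af", pan_expr,
        dst,
    ]

# e.g. pull channel 1 (the second channel) of the first audio stream:
cmd = extract_channel_cmd("in.mov", 0, 1, "out.wav")
```

You would hand the resulting list to `subprocess.run(cmd, check=True)`. Pairing this with ffprobe output (stream index and channel count per stream) lets you iterate over every channel deterministically rather than trusting the default selection.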
-
Thanks for putting this together. A couple of questions though:
-
While working on our tool for Resolve (https://github.com/octimot/StoryToolkitAI), I realized that, besides a few proposals made by people in the community, there is no standardized way to select or exclude certain parts of the audio file to send to Whisper.
So, I've created a notebook as a proposal on how to select or exclude audio file parts without splitting or merging the file itself, but simply by loading the audio as an array via librosa and passing that to Whisper. This is quite efficient and I haven't seen any real performance hits.
OpenAI Whisper - transcribe only certain time intervals.ipynb
https://colab.research.google.com/drive/17cTsmfVJmpDDMURGcu8hUu1zHNAYbfa5?usp=sharing
What one can do in the notebook:
An important element of the process is that after the transcription, the script adds the time offsets to the results, so that their start and end times can be used to reference the correct portions of the original audio file.
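The two steps described above, slicing the requested intervals out of the loaded audio array and shifting the resulting segment times back onto the original file's timeline, can be sketched roughly as follows. This is a simplified illustration of the notebook's approach, not its exact code; the function names are made up:

```python
import numpy as np

SAMPLE_RATE = 16000  # the sample rate Whisper expects

def select_intervals(audio, intervals, sr=SAMPLE_RATE):
    """Keep only the given (start_sec, end_sec) intervals of a 1-D audio array.

    The file itself is never split or re-encoded; we just index into the
    array (as loaded e.g. via librosa.load(path, sr=sr)) and concatenate.
    """
    parts = [audio[int(start * sr):int(end * sr)] for start, end in intervals]
    return np.concatenate(parts) if parts else np.empty(0, dtype=audio.dtype)

def offset_segments(segments, interval_start):
    """Shift Whisper segment times so they reference the original timeline.

    Whisper reports times relative to the audio it was given; adding the
    interval's start offset maps them back to the source file.
    """
    return [dict(seg, start=seg["start"] + interval_start,
                 end=seg["end"] + interval_start)
            for seg in segments]
```

With a loaded model, usage would look something like `result = model.transcribe(select_intervals(audio, [(10.0, 25.0)]))`, followed by `offset_segments(result["segments"], 10.0)` to restore the original timings.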
The selecting/excluding of certain parts of the audio file could be useful for many applications, but in particular for:
Feel free to use the code in any way you want and let us know about possible optimizations.