Replies: 6 comments
-
This is an incredible post. I was actually thinking the same thing about a solution that would produce FCC and DCMP-compliant captions. I love how well you’ve laid this out, and I hope to see some responses from those who might be able to answer your question and/or contribute.
-
Closest one so far is #435 and you might want to reach out to them. For .srt there has been integration into a software wrapper already, but I don't know how much automation they are looking to add: ggerganov/whisper.cpp#159
-
I have received your email and will reply to you later.
-
Re: Subtitle line length. Adding a link to a recent improvement from #1184 and the relevant discussion #314. The PR implements …
-
How do I disable the profanity filter, though?
-
Not being able to disable the profanity filter is a bummer.
-
I wanted to start a discussion to understand how researchers or app developers are wrapping Whisper to generate Closed Captioning & SDH Subtitles, since I imagine accessibility, as well as transcription, is a common use case.
Whisper generates SRT & WebVTT transcripts by default, producing Pop-on subtitles.
I see that some folks are pulling word-by-word timestamps (#3, reply in thread), which I assume is for Paint-on captions.
Can anyone share lessons learned or research projects that could be used to generate broadcast-grade captions with Whisper? By broadcast-grade, I mean not just SRT or ASS transcription or translation for home anime consumption, but captions that aim to adhere to the transcription components of Title 47, which describes the US FCC's guidelines for accuracy, synchronicity and completeness, and to DCMP-quality captions or subtitles, where the DCMP recommendations cover markup, presentation rate and time-on-screen with a focus on educational use. The NIDCD also publishes guidelines for both quality and accessibility.
Obviously, there are limits to how well an AI transcription module can address the accessibility requirements of Closed Captioning (CC) and Subtitles for the Deaf and Hard of Hearing (SDH), but accessibility goes beyond pure transcription.
I see there are active discussions and potential improvements around timing accuracy and timing offset in the repo.
Many of the challenges of the accessibility or standards compatibility elements of Closed Captioning and SDH fall into the domain of a wrapper app, and the requirements get quite complex, quite quickly...
- Speaker identification: `>>` is used to indicate a change of speaker, whereas the DCMP encourage a parenthesized `(speakername)` and discourage chevron characters for identification.
- Character sets: output can be validated with `charset-normalizer`. A safe filtering option is to use `--suppress_tokens -1`, but broadcast-grade EIA-608 requires Extended Characters.
- Timecode: broadcast formats expect SMPTE timecode, `HH:MM:SS;ff` (SMPTE drop frame) or `HH:MM:SS:ff` (SMPTE non-drop frame), rather than the tick or millisecond currency of `HH:MM:SS.mmm`. These would need to be handled in any external conversion libraries, such as pycaption. Edit: adding a link to the discussion topic: Swap timestamps for SMPTE standard timecode #1214.
- File naming: the language code belongs in the output filename, e.g. `infile.en.vtt`, thereby following the conventions established by mpv and VLC. It is assumed that Whisper's language detection is superior to langdetect. Language can also be included in a WebVTT header indicating that the timed text track type is captions and the language is English, e.g. `WEBVTT - This file has cues.; Kind: captions; Language: en`.

All of the above fall outside the direct domain of Whisper, but would fall within the domain of a wrapper service for accessibility in educational broadcast. Generation of Closed Captioning for educational purposes should aim for the same standards as the regulated broadcast television industry, since one of the tenets of open education is to increase distribution (through services like over-the-top television) as well as accessibility and inclusiveness. The open-source and FOSS communities often enable these solutions without the barrier-to-entry of commercial solutions.
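To make the speaker-identification conventions concrete, here is a minimal sketch of how a wrapper might tag caption text in either the `>>` chevron style or the DCMP parenthetical style. The function and parameter names are my own, not part of Whisper or any standard library:

```python
def tag_speaker(text: str, speaker: str = "", style: str = "cea608") -> str:
    """Prefix caption text with a speaker-change marker.

    style="cea608": use the >> chevrons that mark a change of speaker.
    style="dcmp":   use a parenthesized (speakername) identifier instead.
    """
    if style == "cea608":
        return f">> {text}"
    if style == "dcmp" and speaker:
        # DCMP examples show lowercase parenthesized identifiers.
        return f"({speaker.lower()}) {text}"
    return text
```

For example, `tag_speaker("Hello there.", style="cea608")` yields `>> Hello there.`, while `tag_speaker("Hello there.", speaker="Maria", style="dcmp")` yields `(maria) Hello there.`.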
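The SMPTE drop-frame timecodes could be derived in a wrapper from Whisper's second-based timestamps. Below is a sketch for 29.97 fps material using the standard drop-frame renumbering (two frame numbers are skipped at each minute boundary, except every tenth minute); the function names are illustrative, not from any library:

```python
FPS = 30                  # nominal frame rate for 29.97 (30000/1001) material
FRAMES_PER_MIN = 1798     # frames in a drop-frame minute (1800 minus 2 dropped numbers)
FRAMES_PER_10MIN = 17982  # frames in ten minutes (the tenth minute keeps all 1800)

def seconds_to_frames(seconds: float) -> int:
    """Real elapsed seconds -> frame count at 29.97 fps."""
    return int(seconds * 30000 / 1001)

def frames_to_dropframe(frame_number: int) -> str:
    """Frame count -> SMPTE drop-frame timecode HH:MM:SS;FF."""
    d, m = divmod(frame_number, FRAMES_PER_10MIN)
    if m < 2:
        frame_number += 18 * d                                # 9 minutes * 2 frames per 10-min block
    else:
        frame_number += 18 * d + 2 * ((m - 2) // FRAMES_PER_MIN)
    ff = frame_number % FPS
    ss = (frame_number // FPS) % 60
    mm = (frame_number // (FPS * 60)) % 60
    hh = frame_number // (FPS * 3600)
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"
```

For example, frame 1800 renumbers to `00:01:00;02` because frame numbers `;00` and `;01` are skipped at the minute boundary, while frame 17982 reads `00:10:00;00` since the tenth minute drops nothing.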
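The WebVTT header and `infile.en.vtt` naming conventions could likewise be handled along these lines. This is a sketch under my own naming; `build_vtt` and its segment tuples are hypothetical wrapper code, not a Whisper API:

```python
from pathlib import Path

def vtt_timestamp(seconds: float) -> str:
    """Seconds -> WebVTT HH:MM:SS.mmm timestamp."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def build_vtt(segments, lang: str = "en") -> str:
    """Render (start_sec, end_sec, text) tuples as a WebVTT document."""
    lines = [
        "WEBVTT - This file has cues.",
        "Kind: captions",         # timed text track type is captions
        f"Language: {lang}",      # track language
        "",
    ]
    for start, end, text in segments:
        lines += [f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}", text, ""]
    return "\n".join(lines)

def output_path(infile: str, lang: str = "en") -> Path:
    """Follow the mpv/VLC convention: infile.en.vtt."""
    return Path(infile).with_suffix(f".{lang}.vtt")
```

So `output_path("talk.mp4")` gives `talk.en.vtt`, and the header block mirrors the `Kind`/`Language` example above.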
The accuracy of OpenAI's Whisper seems to be among the best available, rivaling current commercial solutions such as AppTek Appliances, Amazon Rekognition, IBM Watson, Google Speech and Microsoft Azure Speech Service. Whisper also seems far easier to get going than Mozilla DeepSpeech, Sphinx or PocketSphinx, and its transcription accuracy and proper-noun detection seem excellent, even with the small models.
Is anyone from research, academic or app-developer communities actively developing Closed Captioning & SDH Subtitle solutions that would improve accessibility for the general public and the community as a whole?
Thanks in advance for sharing any Whisper projects which fall into the domain of Closed Captioning & Accessibility beyond that of transcription. And a huge thanks to the developers and sponsors that have contributed to an excellent engine so far. It will be exciting to watch this project develop and serve the needs of communities who rely on accessibility for educational and entertainment needs.