Replies: 6 comments
-
This is an incredible post. I was actually thinking the same thing about a solution that would produce FCC and DCMP-compliant captions. I love how well you’ve laid this out, and I hope to see some responses from those who might be able to answer your question and/or contribute.
-
Closest one so far is #435 and you might want to reach out to them. For .srt there has been integration into a software wrapper already, but I don't know how much automation they are looking to add: ggerganov/whisper.cpp#159
-
I have received your email and will reply to you later.
-
Re: Subtitle line length. Adding a link to a recent improvement from #1184 and the relevant discussion #314. The PR implements …
-
How do I disable the profanity filter, though?
-
Not being able to disable the profanity filter is a bummer.
-
I wanted to start a discussion to understand how researchers or app developers are wrapping Whisper to generate Closed Captioning & SDH Subtitles, since I imagine accessibility, as well as transcription, is a common use case.
Whisper generates SRT & WebVTT transcripts by default, producing Pop-on subtitles.
I see that some folks are pulling word-by-word timestamps (#3, reply in thread), which I assume is for Paint-on captions.
Can anyone share lessons learned or research projects that could be used to generate broadcast-grade captions with Whisper? By broadcast-grade, I mean not just SRT or ASS transcription or translation for home anime consumption, but captions that aim to adhere to the transcription components of Title 47, which describes the US FCC's guidelines for accuracy, synchronicity and completeness, and to DCMP-quality captions or subtitles, where the DCMP recommendations cover markup, presentation rate and time-on-screen with a focus on educational use. The NIDCD also publishes guidelines for both quality and accessibility.
Obviously, there are limits to how well an AI transcription module can address the accessibility requirements of Closed Captioning (CC) and Subtitles for the Deaf and Hard of Hearing (SDH), but accessibility goes beyond pure transcription.
I see there are active discussions and potential improvements around timing accuracy and timing offset in the repo.
Many of the challenges of the accessibility or standards compatibility elements of Closed Captioning and SDH fall into the domain of a wrapper app, and the requirements get quite complex, quite quickly...
- Speaker identification: `>>` is used to indicate a change of speaker, whereas the DCMP encourage a parenthesized `(speakername)` and discourage chevron characters for identification.
- Character sets: output can be validated with `charset-normalizer`. A safe filtering option is to use `--suppress_tokens -1`, but broadcast-grade EIA-608 requires Extended Characters.
- Timecode: broadcast formats expect SMPTE timecode, `HH:MM:SS;ff` (SMPTE drop frame) or `HH:MM:SS:ff` (SMPTE non-drop frame), rather than the tick or millisecond currency of `HH:MM:SS.mmm`. These would need to be handled in any external conversion libraries, such as pycaption. Edit: adding a link to the discussion topic: Swap timestamps for SMPTE standard timecode #1214.
- File naming: the language code belongs in the output filename, e.g. `infile.en.vtt`, thereby following the conventions established by mpv and VLC. It is assumed that Whisper's language detection is superior to langdetect. Language can also be included in a WebVTT header indicating that the timed text track type is captions and the language is English, e.g. `WEBVTT - This file has cues.; Kind: captions; Language: en`.

All of the above fall outside the direct domain of Whisper, but would fall within the domain of a wrapper service for accessibility in educational broadcast. Generation of Closed Captioning for educational purposes should aim for the same standards as the regulated broadcast television industry, since one of the tenets of open education is to increase distribution (through services like over-the-top television) as well as accessibility and inclusiveness. The open-source and FOSS communities often enable these solutions without the barrier-to-entry of commercial solutions.
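To make the speaker-identification conventions concrete, here is a minimal sketch of how a wrapper might tag caption text in either the `>>` chevron style or the DCMP parenthetical style. The function and parameter names are my own, not part of Whisper or any standard library:

```python
def tag_speaker(text: str, speaker: str = "", style: str = "cea608") -> str:
    """Prefix caption text with a speaker-change marker.

    style="cea608": use the >> chevrons that mark a change of speaker.
    style="dcmp":   use a parenthesized (speakername) identifier instead.
    """
    if style == "cea608":
        return f">> {text}"
    if style == "dcmp" and speaker:
        # DCMP examples show lowercase parenthesized identifiers.
        return f"({speaker.lower()}) {text}"
    return text
```

For example, `tag_speaker("Hello there.", style="cea608")` yields `>> Hello there.`, while `tag_speaker("Hello there.", speaker="Maria", style="dcmp")` yields `(maria) Hello there.`.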
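The SMPTE drop-frame timecodes could be derived in a wrapper from Whisper's second-based timestamps. Below is a sketch for 29.97 fps material using the standard drop-frame renumbering (two frame numbers are skipped at each minute boundary, except every tenth minute); the function names are illustrative, not from any library:

```python
FPS = 30                  # nominal frame rate for 29.97 (30000/1001) material
FRAMES_PER_MIN = 1798     # frames in a drop-frame minute (1800 minus 2 dropped numbers)
FRAMES_PER_10MIN = 17982  # frames in ten minutes (the tenth minute keeps all 1800)

def seconds_to_frames(seconds: float) -> int:
    """Real elapsed seconds -> frame count at 29.97 fps."""
    return int(seconds * 30000 / 1001)

def frames_to_dropframe(frame_number: int) -> str:
    """Frame count -> SMPTE drop-frame timecode HH:MM:SS;FF."""
    d, m = divmod(frame_number, FRAMES_PER_10MIN)
    if m < 2:
        frame_number += 18 * d                                # 9 minutes * 2 frames per 10-min block
    else:
        frame_number += 18 * d + 2 * ((m - 2) // FRAMES_PER_MIN)
    ff = frame_number % FPS
    ss = (frame_number // FPS) % 60
    mm = (frame_number // (FPS * 60)) % 60
    hh = frame_number // (FPS * 3600)
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"
```

For example, frame 1800 renumbers to `00:01:00;02` because frame numbers `;00` and `;01` are skipped at the minute boundary, while frame 17982 reads `00:10:00;00` since the tenth minute drops nothing.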
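The WebVTT header and `infile.en.vtt` naming conventions could likewise be handled along these lines. This is a sketch under my own naming; `build_vtt` and its segment tuples are hypothetical wrapper code, not a Whisper API:

```python
from pathlib import Path

def vtt_timestamp(seconds: float) -> str:
    """Seconds -> WebVTT HH:MM:SS.mmm timestamp."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def build_vtt(segments, lang: str = "en") -> str:
    """Render (start_sec, end_sec, text) tuples as a WebVTT document."""
    lines = [
        "WEBVTT - This file has cues.",
        "Kind: captions",         # timed text track type is captions
        f"Language: {lang}",      # track language
        "",
    ]
    for start, end, text in segments:
        lines += [f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}", text, ""]
    return "\n".join(lines)

def output_path(infile: str, lang: str = "en") -> Path:
    """Follow the mpv/VLC convention: infile.en.vtt."""
    return Path(infile).with_suffix(f".{lang}.vtt")
```

So `output_path("talk.mp4")` gives `talk.en.vtt`, and the header block mirrors the `Kind`/`Language` example above.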
The accuracy of OpenAI's Whisper seems to be among the best available, rivaling current commercial solutions such as AppTek Appliances, Amazon Rekognition, IBM Watson, Google Speech and Microsoft Azure Speech Service. Whisper also seems far easier to get going than Mozilla DeepSpeech, Sphinx or PocketSphinx, and its transcription accuracy and proper-noun detection seem excellent, even with the small models.
Is anyone from research, academic or app-developer communities actively developing Closed Captioning & SDH Subtitle solutions that would improve accessibility for the general public and the community as a whole?
Thanks in advance for sharing any Whisper projects which fall into the domain of Closed Captioning & Accessibility beyond that of transcription. And a huge thanks to the developers and sponsors that have contributed to an excellent engine so far. It will be exciting to watch this project develop and serve the needs of communities who rely on accessibility for educational and entertainment needs.