Allow specifying initial_prompt for transcription #2

Closed · johnislarry opened this issue Apr 21, 2023 · 5 comments

@johnislarry (Contributor) commented Apr 21, 2023

The `initial_prompt` passed to Whisper for each audio segment is hardcoded to `None`:

https://github.com/meronym/speaker-transcription/blob/master/predict.py#L105

It would be really great if we could provide a prompt via the Replicate API input to use for all transcriptions.

I've seen that transcribing audio with proper nouns (names in particular) doesn't work well unless a prompt is provided that spells them out.

What do you think? If you're busy I could put up a PR; however, I don't have a good setup to test an actual inference.
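
For concreteness, here's a rough, untested sketch of the kind of change I have in mind, assuming the pipeline exposes inputs through cog's `Input` API and calls openai-whisper's `transcribe()` underneath (the model size is a placeholder, and I'm eliding the diarization and per-segment handling that the real predict.py does):

```python
from cog import BasePredictor, Input, Path
import whisper


class Predictor(BasePredictor):
    def setup(self):
        # Placeholder model size; the actual pipeline may load a different one.
        self.model = whisper.load_model("medium")

    def predict(
        self,
        audio: Path = Input(description="Audio file to transcribe"),
        initial_prompt: str = Input(
            default=None,
            description="Optional text to bias the transcription, "
            "e.g. the spelling of proper nouns",
        ),
    ) -> str:
        # Pass the user-supplied prompt through instead of the hardcoded None.
        result = self.model.transcribe(str(audio), initial_prompt=initial_prompt)
        return result["text"]
```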

@meronym (Owner) commented Apr 21, 2023

@johnislarry a PR would be great! I'll test it and push an update to Replicate over the weekend.

@johnislarry (Contributor, Author)

@meronym just put up a PR: #3. Let me know what you think when you get a chance.

@arnab commented Apr 26, 2023

I just came to ask for the same feature (a mechanism to pass an `initial_prompt` to Whisper). Thanks for creating the issue and the PR, @johnislarry.

@meronym (Owner) commented Apr 27, 2023

Merged #3 🎉

A note FYI @johnislarry @arnab: internally, Whisper uses the input prompt as context for the first transcription window. This pipeline runs an independent transcription for each detected speaker segment, so for long compact speech segments I would expect the attention paid by the model to this prompt to diminish quite a bit by the end of the segment.

If this becomes a limitation, we could look into splitting large segments into smaller chunks (to make sure the `initial_prompt` is injected often enough), but this might hurt in-paragraph coherence. Feel free to open another issue if you find a better way to deal with the `initial_prompt`.

In any case, I suspect most out-of-the-box Whisper implementations suffer from the same 'vanishing attention' problem with the prompt.
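
To illustrate the chunking idea, here's a rough sketch (not the actual pipeline code; `split_evenly` is a hypothetical helper and the 30-second threshold is just an assumption):

```python
# Rough illustration only; split_evenly is a hypothetical helper.
# Each speaker segment is transcribed independently, so the prompt is
# only seen at the start of each transcribe() call.
MAX_CHUNK_SECONDS = 30  # assumed split threshold

def transcribe_segments(model, segments, initial_prompt=None):
    texts = []
    for audio, duration in segments:  # (audio array, duration in seconds)
        if duration <= MAX_CHUNK_SECONDS:
            chunks = [audio]
        else:
            # Re-inject the prompt more often by splitting the segment,
            # at the cost of possibly breaking in-paragraph coherence.
            n_chunks = int(duration // MAX_CHUNK_SECONDS) + 1
            chunks = split_evenly(audio, n_chunks)  # hypothetical helper
        for chunk in chunks:
            result = model.transcribe(chunk, initial_prompt=initial_prompt)
            texts.append(result["text"])
    return " ".join(texts)
```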

meronym closed this as completed Apr 27, 2023
@arnab commented Apr 28, 2023

> This pipeline runs an independent transcription for each detected speaker segment, so for long compact speech segments I would expect the attention paid by the model to this prompt to diminish quite a bit by the end of the segment.

Thanks for the details. Does this pipeline feed the previous segment's text as context/prompt for the subsequent ones? If not, do you think doing something like that would improve transcription quality for long-form audio?
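
Something like this rolling-context scheme is what I'm imagining (untested sketch; `transcribe()`'s `initial_prompt` parameter is real, the rest is illustrative):

```python
# Untested sketch of the rolling-context idea: seed the first segment with
# the user's prompt, then feed each segment's output as the next prompt.
def transcribe_with_rolling_context(model, segment_audios, user_prompt=None):
    prompt = user_prompt
    texts = []
    for audio in segment_audios:
        result = model.transcribe(audio, initial_prompt=prompt)
        texts.append(result["text"])
        # Whisper only conditions on roughly the last 224 tokens of the
        # prompt, so keeping the tail of the previous output is enough.
        prompt = result["text"][-1000:]
    return texts
```

(Within a single `transcribe()` call, Whisper already does something similar across its 30-second windows via `condition_on_previous_text`, which defaults to `True`; the question is whether it's worth doing across independently transcribed speaker segments.)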
