Allow specifying initial_prompt for transcription #2

Closed · johnislarry opened this issue Apr 21, 2023 · 5 comments

@johnislarry (Contributor) commented Apr 21, 2023

The `initial_prompt` passed to Whisper for each audio segment is hardcoded to `None`:

https://github.com/meronym/speaker-transcription/blob/master/predict.py#L105

It would be really great if we could provide a prompt via the Replicate API input to use for all transcriptions.

I've seen that transcribing audio with proper nouns (names in particular) doesn't work well unless a prompt is provided that spells them out.

What do you think? If you're busy I could put up a PR; however, I don't have a good setup to test an actual inference.
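
For concreteness, here's a rough, untested sketch of the kind of change I have in mind, assuming the pipeline exposes inputs through cog's `Input` API and calls openai-whisper's `transcribe()` underneath (the model size is a placeholder, and I'm eliding the diarization and per-segment handling that the real predict.py does):

```python
from cog import BasePredictor, Input, Path
import whisper


class Predictor(BasePredictor):
    def setup(self):
        # Placeholder model size; the actual pipeline may load a different one.
        self.model = whisper.load_model("medium")

    def predict(
        self,
        audio: Path = Input(description="Audio file to transcribe"),
        initial_prompt: str = Input(
            default=None,
            description="Optional text to bias the transcription, "
            "e.g. the spelling of proper nouns",
        ),
    ) -> str:
        # Pass the user-supplied prompt through instead of the hardcoded None.
        result = self.model.transcribe(str(audio), initial_prompt=initial_prompt)
        return result["text"]
```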

@meronym (Owner) commented Apr 21, 2023

@johnislarry a PR would be great! I'll test it and push an update to Replicate over the weekend.

@johnislarry (Contributor, Author)

@meronym just put up a PR: #3. Let me know what you think when you get a chance.

@arnab commented Apr 26, 2023

I just came to ask for the same feature (a mechanism to pass an `initial_prompt` to Whisper). Thanks for creating the issue and the PR, @johnislarry.

@meronym (Owner) commented Apr 27, 2023

Merged #3 🎉

A note FYI @johnislarry @arnab: internally, Whisper uses the input prompt as context for the first transcription window. This pipeline runs an independent transcription for each detected speaker segment, so for long compact speech segments I would expect the attention paid by the model to this prompt to diminish quite a bit by the end of the segment.

If this becomes a limitation, we could look into splitting large segments into smaller chunks (to make sure the `initial_prompt` is injected often enough), but this might hurt in-paragraph coherence. Feel free to open another issue if you find a better way to deal with the `initial_prompt`.

In any case, I suspect most out-of-the-box Whisper implementations suffer from the same 'vanishing attention' problem with the prompt.
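
To illustrate the chunking idea, here's a rough sketch (not the actual pipeline code; `split_evenly` is a hypothetical helper and the 30-second threshold is just an assumption):

```python
# Rough illustration only; split_evenly is a hypothetical helper.
# Each speaker segment is transcribed independently, so the prompt is
# only seen at the start of each transcribe() call.
MAX_CHUNK_SECONDS = 30  # assumed split threshold

def transcribe_segments(model, segments, initial_prompt=None):
    texts = []
    for audio, duration in segments:  # (audio array, duration in seconds)
        if duration <= MAX_CHUNK_SECONDS:
            chunks = [audio]
        else:
            # Re-inject the prompt more often by splitting the segment,
            # at the cost of possibly breaking in-paragraph coherence.
            n_chunks = int(duration // MAX_CHUNK_SECONDS) + 1
            chunks = split_evenly(audio, n_chunks)  # hypothetical helper
        for chunk in chunks:
            result = model.transcribe(chunk, initial_prompt=initial_prompt)
            texts.append(result["text"])
    return " ".join(texts)
```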

meronym closed this as completed Apr 27, 2023
@arnab commented Apr 28, 2023

> This pipeline runs an independent transcription for each detected speaker segment, so for long compact speech segments I would expect the attention paid by the model to this prompt to diminish quite a bit by the end of the segment.

Thanks for the details. Does this pipeline feed the previous segment's text as context/prompt for the subsequent ones? If not, do you think doing something like that would improve transcription quality for long-form audio?
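
Something like this rolling-context scheme is what I'm imagining (untested sketch; `transcribe()`'s `initial_prompt` parameter is real, the rest is illustrative):

```python
# Untested sketch of the rolling-context idea: seed the first segment with
# the user's prompt, then feed each segment's output as the next prompt.
def transcribe_with_rolling_context(model, segment_audios, user_prompt=None):
    prompt = user_prompt
    texts = []
    for audio in segment_audios:
        result = model.transcribe(audio, initial_prompt=prompt)
        texts.append(result["text"])
        # Whisper only conditions on roughly the last 224 tokens of the
        # prompt, so keeping the tail of the previous output is enough.
        prompt = result["text"][-1000:]
    return texts
```

(Within a single `transcribe()` call, Whisper already does something similar across its 30-second windows via `condition_on_previous_text`, which defaults to `True`; the question is whether it's worth doing across independently transcribed speaker segments.)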
