fix(asr): avoid duplicating full final transcripts #85
Problem: Some ASR providers may emit the full final transcript more than once. The transcript aggregator previously appended every Final event, which could duplicate the spoken text in the pasted result and history.

Reproduction: Use DoubaoIME with LLM correction disabled, speak a sentence such as 'hello world', and observe duplicated text when the provider emits repeated Final events.

Fix: Treat AsrEvent::Final as the best full transcript seen so far and replace the previous final text instead of appending. Update the regression tests to use neutral English examples.
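To make the difference concrete, here is a minimal sketch of the two behaviours. The `FinalText` type and its `append` / `replace` methods are hypothetical names used only for illustration; they are not the aggregator's actual API.

```rust
/// Hypothetical stand-in for the aggregator's final-text state.
#[derive(Default)]
struct FinalText {
    text: String,
}

impl FinalText {
    /// Old behaviour: every Final event is appended, so a repeated
    /// full transcript ends up in the output twice.
    fn append(&mut self, final_text: &str) {
        self.text.push_str(final_text);
    }

    /// Behaviour proposed in this PR: the latest Final event is taken
    /// as the best full transcript and overwrites the previous one.
    fn replace(&mut self, final_text: &str) {
        self.text = final_text.to_string();
    }
}
```

With two identical Final events carrying 'hello world', `append` yields the text twice back to back, while `replace` leaves a single copy.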
Thanks for the PR! I haven't been able to reproduce the duplication issue on my side, so before merging I'd like to ask you to do one more test:

Please try dictating at least three sentences in a row with DoubaoIME and see whether earlier sentences get swallowed / dropped with this change applied. I want to make sure switching [...]

For context, see the upstream bug: https://github.com/starccy/doubaoime-asr/issues/2. There's a known issue where DoubaoIME can truncate earlier content, and I want to be sure this PR doesn't interact badly with it.

If three+ sentences come through intact, I'm happy to merge. Thanks!
Thanks for the follow-up. I tested this with DoubaoIME on this branch.

When I dictated four sentences continuously in one go, the transcript came through intact and the earlier sentences were preserved, so changing [...]

I did find one related edge case, though: if I say the first sentence, pause briefly, and then continue with the next sentences, DoubaoIME can still produce duplicated / garbled output. That seems consistent with the upstream segmentation issue in [...]

So based on my testing, this PR does not appear to introduce a new regression for three+ continuous sentences, but the existing upstream pause/segmentation issue is still present.
I tested it and found that pausing while speaking does trigger the duplication you mentioned. I think we should fix this problem completely. If it is an upstream problem, could we detect repeated sentences and clean them up? People generally do not say the same sentence twice in a row.
DoubaoIME emits Final ambiguously: within one utterance it is a refreshed full transcript, but after a pause it is a new segment that may also replay earlier content. Pure replace drops earlier sentences; pure append duplicates them. Merge with a prefix check, a stale-replay skip, and longest suffix/prefix overlap trimming to handle all three cases.
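The three-way merge described above could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: the function names `merge_final` and `longest_overlap` are hypothetical, the accumulated text is modelled as a plain `String`, and a new segment is appended without inserting any separator.

```rust
/// Byte length of the longest prefix of `incoming` that is also a
/// suffix of `acc`. Only char boundaries are considered, so slicing
/// stays valid for non-ASCII text.
fn longest_overlap(acc: &str, incoming: &str) -> usize {
    let mut best = 0;
    for (idx, ch) in incoming.char_indices() {
        let end = idx + ch.len_utf8();
        if acc.ends_with(&incoming[..end]) {
            best = end;
        }
    }
    best
}

/// Merge one Final payload into the accumulated transcript.
/// (Hypothetical helper; the real aggregator may differ.)
fn merge_final(acc: &mut String, incoming: &str) {
    if incoming.is_empty() {
        return;
    }
    // Case 1, prefix check: `incoming` extends what we already have,
    // so it is a refreshed full transcript and replaces the old text.
    if incoming.starts_with(acc.as_str()) {
        *acc = incoming.to_string();
        return;
    }
    // Case 2, stale-replay skip: `incoming` is already contained in
    // the accumulated text, so there is nothing new to add.
    if acc.contains(incoming) {
        return;
    }
    // Case 3, overlap trimming: treat `incoming` as a new segment and
    // drop the part that repeats the tail of the accumulated text.
    let overlap = longest_overlap(acc.as_str(), incoming);
    acc.push_str(&incoming[overlap..]);
}
```

Starting from an empty accumulator, feeding 'hello', 'hello world', 'hello world' again, 'world', and 'world again' in that order leaves 'hello world again'. One caveat of overlap trimming in general is that a short coincidental overlap (a single shared character, for example) is trimmed as well, so a real implementation might want a minimum overlap length.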