When using the Gemini Realtime model with input_audio_transcription enabled, the user input immediately preceding a tool call is not being added to the transcript.
Cause
The RealtimeSession._handle_tool_calls method always marks the current generation as done, which discards the current (non-final) user input transcription. Because this handler runs before the Gemini model sends the generation_complete message, the input_audio_transcription_completed event is never emitted with is_final set to True, so the message is never added to the transcript.
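To make the ordering problem concrete, here is a minimal toy model of the sequence as I understand it. It is not the library's code: ToySession, its handler names, and the mark_done_on_tool_call flag are all hypothetical stand-ins, and the only assumption carried over from the real session is that marking the generation done discards the pending non-final input text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for the realtime session (not the real class)."""

    # Hypothetical flag: True models the current behavior, False models
    # a tool call handler that does not mark the generation done.
    mark_done_on_tool_call: bool = True
    pending_input: Optional[str] = None  # non-final user transcription
    transcript: List[str] = field(default_factory=list)

    def on_input_transcription(self, text: str) -> None:
        # Accumulate partial (non-final) transcription chunks.
        self.pending_input = (self.pending_input or "") + text

    def _mark_current_generation_done(self) -> None:
        # Assumption: finishing the generation drops any non-final text.
        self.pending_input = None

    def on_tool_call(self, name: str) -> None:
        if self.mark_done_on_tool_call:
            self._mark_current_generation_done()

    def on_generation_complete(self) -> None:
        # The final transcription is only recorded if the text survived.
        if self.pending_input is not None:
            self.transcript.append(self.pending_input)
        self._mark_current_generation_done()


def run(mark_done_on_tool_call: bool) -> List[str]:
    session = ToySession(mark_done_on_tool_call=mark_done_on_tool_call)
    session.on_input_transcription("what's the weather ")
    session.on_input_transcription("in Paris?")
    session.on_tool_call("get_weather")  # arrives before generation_complete
    session.on_generation_complete()
    return session.transcript


if __name__ == "__main__":
    print("mark done in tool handler:", run(True))
    print("leave it to generation_complete:", run(False))
```

With the flag set to True the user turn is lost before generation_complete arrives; with it set to False the turn is finalized and recorded once generation_complete is handled.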
Thoughts
I tested removing the _mark_current_generation_done() call at the end of the RealtimeSession._handle_tool_calls() method, and this fixed the issue without causing any other problems that I could see. Since Gemini sends the generation_complete message after the tool call is complete, isn't it redundant to mark the generation done in the tool call handler anyway?
Thanks for your time, and please let me know if there's an opportunity for a PR.