Add Support for Google Cloud Speech-To-Text V2 library in mod_google_transcribe #23

ajgolledge · 2024-03-20T15:01:55Z

This PR addresses #149 from the earlier drachtio incarnation of this repository. I noticed that the mod_google_transcribe directory in this repository and in the drachtio repository are identical so I took the PR from the old repository and grafted it onto this one, hope that's OK.

This PR offers support for the v2 version of the Speech-To-Text library whilst still supporting v1 simultaneously. The default behaviour is to use the v1 version of the library where everything works identically to the way it did in the previous version. In order to use v2 the FreeSWITCH variable GOOGLE_SPEECH_CLOUD_SERVICES_VERSION must be set to the value "v2". Setting it to "v1" or not setting it at all results in the default behaviour.

If v2 is selected then it is essential to provide a so-called recognizer parent path in the GOOGLE_SPEECH_RECOGNIZER_PARENT FreeSWITCH variable. Failure to do so will result in a failure to construct the GStreamer class. Recognizers allow commonly used streaming recognition parameters to be stored in the cloud. These stored values can be overridden with parameters passed at runtime, but every v2 streaming recognition invocation must name a recognizer. If you happen to have already created a recognizer in your Google Cloud account, its id can be passed using the GOOGLE_SPEECH_RECOGNIZER_ID variable. If this is not set then mod_google_transcribe will just use the so-called wildcard recognizer id (the "_" character), and a recognizer will be created on the fly and not stored for future use. Note that even if a persistent recognizer is not required, it is always necessary to provide at least the parent id of the recognizer in GOOGLE_SPEECH_RECOGNIZER_PARENT, otherwise even the wildcard recognizer cannot be created. This parent id is a path string which consists of the Google Cloud project id that was used to create the Google credentials file, and a geographical location. For more details about recognizers, see https://cloud.google.com/speech-to-text/v2/docs/recognizers

As long as GOOGLE_SPEECH_CLOUD_SERVICES_VERSION is set to "v2" and GOOGLE_SPEECH_RECOGNIZER_PARENT is set to a valid recognizer parent id, the "v2" library will be used and calls to uuid_google_transcribe should function as they did previously. Any configuration parameters provided at runtime will override anything already defined in a predefined recognizer.
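
The selection rule described above can be sketched as follows. This is a minimal illustration and not the module's actual code: the helper names `select_speech_version` and `has_required_variables` are hypothetical, and the exact, case-sensitive comparison against "v2" is an assumption.

```cpp
#include <string>

enum class SpeechApiVersion { V1, V2 };

// Hypothetical helper: maps the value of GOOGLE_SPEECH_CLOUD_SERVICES_VERSION
// (nullptr when the variable is unset) to an API version.
// Assumption: only the exact string "v2" selects v2; anything else means v1.
inline SpeechApiVersion select_speech_version(const char* version_var) {
  if (version_var != nullptr && std::string(version_var) == "v2") {
    return SpeechApiVersion::V2;
  }
  return SpeechApiVersion::V1;
}

// Hypothetical check: v2 additionally needs a non-empty
// GOOGLE_SPEECH_RECOGNIZER_PARENT, while v1 has no such requirement.
inline bool has_required_variables(SpeechApiVersion version,
                                   const char* parent_var) {
  if (version == SpeechApiVersion::V1) {
    return true;
  }
  return parent_var != nullptr && parent_var[0] != '\0';
}
```

With this sketch, an unset variable or "v1" yields the v1 path, and v2 is only considered usable once a parent path is present.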

Differences between v1 and v2

No single utterances in v2. That is to say, single utterance is no longer specified as a parameter. Instead it is implied by the model selected. If single utterance behaviour is required then this is supported by the short model, for example. For more details on models, see https://cloud.google.com/speech-to-text/v2/docs/streaming-recognize.
Speaker diarization does not seem to be supported yet. The code to perform this is still there in mod_google_transcribe for v2 but I didn't manage to stumble across a combination of model, language and location which supports this. See https://stackoverflow.com/questions/76779418/speaker-diarization-is-disabled-even-for-supported-languages-in-google-speech-to
Multiple Language Support. If you provide up to a maximum of three languages to the recognition request, the speech engine will determine which of the three languages is most likely to have been spoken, automatically.

There are sure to be many more differences but these are the main things I found so far.

Some Notes on the Code and Building

To avoid code duplication we placed v1 specific code in google_glue_v1.cpp and the v2 specific stuff in google_glue_v2.cpp. Generic code used by both libraries now resides in generic_google_glue.h. We use our own docker image to build the FreeSWITCH modules but our make file is based on this one:
https://github.com/drachtio/docker-drachtio-freeswitch-base/blob/main/files/Makefile.am.extra
In order to compile and link the v2 stuff we had to add the following lines to the nodist_libfreeswitch_libgoogleapis_la_SOURCES assignment:

libs/googleapis/gens/google/api/policy.pb.cc \
libs/googleapis/gens/google/cloud/speech/v1/resource.pb.cc \
libs/googleapis/gens/google/cloud/speech/v1/resource.grpc.pb.cc \
libs/googleapis/gens/google/cloud/speech/v2/cloud_speech.pb.cc \
libs/googleapis/gens/google/cloud/speech/v2/cloud_speech.grpc.pb.cc \

If you don't do this, you'll most likely get some problems linking.

Signed-off-by: Andrew Golledge <andreas.golledge@gmail.com>

davehorton · 2024-03-22T12:35:02Z

@ajgolledge can you review our contributor rules and, if you are ok to proceed, then sign off on this PR or commit?

Signed-off-by: Andrew Golledge <andreas.golledge@gmail.com>

ajgolledge · 2024-03-22T13:10:40Z

Is this OK or do I need to actually modify the previous commit? Thanks for the email btw.

davehorton · 2024-03-22T13:11:53Z

please just make another commit with the -s flag and push that

ajgolledge · 2024-03-22T13:15:51Z

I thought I'd done that. There are now two commits.

davehorton · 2024-03-22T13:17:09Z

sorry, you are right. Thanks!

davehorton · 2024-03-22T13:17:24Z

BTW have you tested this PR yourself?

ajgolledge · 2024-03-22T13:21:34Z

Yes we have been successfully running this PR on our development servers for about a month now. The way we build it is based on the way the drachtio code was built so that might differ from the way you build jambonz now.

davehorton · 2024-04-02T14:51:04Z

@ajgolledge I am having some issues in my testing of this PR, and I wonder if you can provide some insight. I am using the google credentials that work fine for V1, now I am using this as the recognizer parent:

projects/drachtio-cpaas/locations/global/recognizers/_

However, when I connect I immediately get "operation canceled from google"

{"type":"error","error_code":1,"error_message":"The operation was cancelled."}.

Any idea what might be causing this? Have you tested with using a recognizer created on the fly like this successfully?

ajgolledge · 2024-04-02T15:29:18Z

@davehorton Yes this has been tested successfully, using recognizers on the fly. Just to be clear, in case you hadn't already, you should set the GOOGLE_SPEECH_RECOGNIZER_PARENT to be:

projects/drachtio-cpaas/locations/global

The /recognizers/ part of the path is then appended followed by either the "_" wildcard or the recognizer ID, if provided. In order to be able to create recognizers you need to have certain roles or permissions to be able to do this. I'm not sure whether this is also necessary when using the wildcard recognizer because I did not test with multiple credentials. However if you're still having problems it might be worth checking what kind of permissions your credentials allow you.
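
The path composition described here can be sketched as below. This is an illustration under stated assumptions, not the code in the PR: `build_recognizer_path` is a hypothetical helper, and the check that rejects a parent already containing "/recognizers/" is an added safeguard that the module may not perform.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: builds the full recognizer name from the value of
// GOOGLE_SPEECH_RECOGNIZER_PARENT and an optional GOOGLE_SPEECH_RECOGNIZER_ID.
// An empty id falls back to the "_" wildcard recognizer.
inline std::string build_recognizer_path(const std::string& parent,
                                         const std::string& recognizer_id = "") {
  if (parent.empty()) {
    throw std::invalid_argument("recognizer parent is required for v2");
  }
  // Assumed safeguard: the parent must stop at the location segment,
  // because "/recognizers/<id>" is appended here.
  if (parent.find("/recognizers/") != std::string::npos) {
    throw std::invalid_argument("parent must not include the /recognizers/ segment");
  }
  const std::string id = recognizer_id.empty() ? "_" : recognizer_id;
  return parent + "/recognizers/" + id;
}
```

Passing projects/drachtio-cpaas/locations/global with no id produces projects/drachtio-cpaas/locations/global/recognizers/_, while passing a parent that already ends in /recognizers/_ is rejected instead of yielding a doubled segment.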

ajgolledge · 2024-04-02T17:26:26Z

At a guess I'd say that it's the grpc_read_thread function exiting without having experienced max_duration_exceeded or no_audio.

If you're still having trouble with the recognizer, I can try to reproduce the problem. Do you have any log output which you can share? Do you see this in the log output, for example:

using recognizer: projects/drachtio-cpaas/locations/global/recognizers/_

or does it not get this far?

davehorton · 2024-04-02T18:44:19Z

actually the problem seems to be in testing setting the speech start and end timeouts, and on a branch I added this code

    if (var = switch_channel_get_variable(channel, "GOOGLE_SPEECH_START_TIMEOUT_MS")) {
      auto ms = atoi(var);
      streaming_config->mutable_streaming_features()->mutable_voice_activity_timeout()->mutable_speech_start_timeout()->set_nanos(ms * 1000000);
      switch_log_printf(SWITCH_CHANNEL_SESSION_LOG(m_session), SWITCH_LOG_DEBUG, "setting speech_start_timeout to %d milliseconds\n", ms);
    }

    if (var = switch_channel_get_variable(channel, "GOOGLE_SPEECH_END_TIMEOUT_MS")) {
      auto ms = atoi(var);
      streaming_config->mutable_streaming_features()->mutable_voice_activity_timeout()->mutable_speech_end_timeout()->set_nanos(ms * 1000000);
      switch_log_printf(SWITCH_CHANNEL_SESSION_LOG(m_session), SWITCH_LOG_DEBUG, "setting speech_end_timeout to %d milliseconds\n", ms);
    }

Seems proper but intermittently when I set these timers I get the immediate cancel error. This is on the "feat/new_params_google_v2" branch if you want to try to recreate

ajgolledge · 2024-04-03T21:03:51Z

I also tried this branch and I always get the error if I set either the GOOGLE_SPEECH_START_TIMEOUT_MS or GOOGLE_SPEECH_END_TIMEOUT_MS values. I tried with various models and various other combinations of parameters but I always get the same error. Enabling the voice activity events works for me, however.
I suspect that other configuration parameters must be set in order for these configuration parameters to be accepted. Unfortunately I don't see any examples where they are used so a bit of experimentation is probably necessary.
Were you also able to get this to work if neither of the two timeout values are set?

davehorton · 2024-04-03T21:07:15Z

yes if neither is set then it does work for me. So I guess it's just a matter of figuring out how to properly use those two parameters.
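
One thing that may be worth checking, offered as an untested guess and not a confirmed diagnosis: google.protobuf.Duration documents that nanos must stay within ±999,999,999, with whole seconds carried in the seconds field. The snippet quoted earlier sets only nanos to ms * 1000000, which leaves that range for any timeout of 1000 ms or more and overflows a 32-bit int above roughly 2147 ms. A minimal sketch of splitting a millisecond value into the two fields follows; `DurationParts` and `split_millis` are hypothetical names, and clamping negative input to zero is an assumption.

```cpp
#include <cstdint>

// Hypothetical holder for the two fields of a google.protobuf.Duration.
struct DurationParts {
  std::int64_t seconds;
  std::int32_t nanos;
};

// Splits a millisecond count into whole seconds plus the remaining
// nanoseconds, so that nanos always stays below 1,000,000,000.
// Assumption: negative input is treated as zero.
inline DurationParts split_millis(std::int64_t ms) {
  if (ms < 0) {
    ms = 0;
  }
  const std::int64_t seconds = ms / 1000;
  const std::int32_t nanos = static_cast<std::int32_t>((ms % 1000) * 1000000);
  return DurationParts{seconds, nanos};
}
```

The result could then be applied with the generated set_seconds and set_nanos setters on the Duration returned by mutable_speech_start_timeout() or mutable_speech_end_timeout(). Whether this removes the cancel error has not been verified here.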

Introduce Google Speech-To-Text V2 library

b0d849b

Add sign-off to previous commit

b0015d5

Signed-off-by: Andrew Golledge <andreas.golledge@gmail.com>

davehorton approved these changes Mar 23, 2024

davehorton merged commit 4e57f73 into jambonz:main Mar 23, 2024
