Add Support for Google Cloud Speech-To-Text V2 library in mod_google_transcribe #23

Merged
merged 2 commits into from
Mar 23, 2024


ajgolledge
Contributor

This PR addresses #149 from the earlier drachtio incarnation of this repository. I noticed that the mod_google_transcribe directories in this repository and in the drachtio repository are identical, so I took the PR from the old repository and grafted it onto this one. I hope that's OK.

This PR adds support for the v2 version of the Speech-To-Text library while continuing to support v1. The default behaviour is to use the v1 version of the library, where everything works exactly as it did in the previous version. To use v2, the FreeSWITCH variable GOOGLE_SPEECH_CLOUD_SERVICES_VERSION must be set to the value "v2". Setting it to "v1" or not setting it at all results in the default behaviour.

If v2 is selected, it is essential to provide a so-called recognizer parent path in the GOOGLE_SPEECH_RECOGNIZER_PARENT FreeSWITCH variable. Failure to do so will result in a failure to construct the GStreamer class. Recognizers allow commonly used streaming recognition parameters to be stored in the cloud. These stored values can be overridden with parameters passed at runtime, but a recognizer must always be supplied to v2 streaming recognition invocations. If you have already created a recognizer in your Google Cloud account, its id can be passed using the GOOGLE_SPEECH_RECOGNIZER_ID variable. If this is not set, mod_google_transcribe uses the so-called wildcard recognizer id (the "_" character), and a recognizer is created on the fly and not stored for future use. Note that even if a persistent recognizer is not required, the parent id of the recognizer must still be provided in GOOGLE_SPEECH_RECOGNIZER_PARENT, otherwise even the wildcard recognizer cannot be created. This parent id is a path string consisting of the Google Cloud project id that was used to create the Google credentials file, and a geographical location. For more details about recognizers, see https://cloud.google.com/speech-to-text/v2/docs/recognizers

As long as GOOGLE_SPEECH_CLOUD_SERVICES_VERSION is set to "v2" and GOOGLE_SPEECH_RECOGNIZER_PARENT is set to a valid recognizer parent id, the v2 library will be used. Calls to uuid_google_transcribe should function as they did previously, and any configuration parameters provided at runtime will override anything already defined in a predefined recognizer.
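The selection rules described above could be sketched roughly as follows. This is an illustration only, not the module's code: the `Settings` struct and the `resolve_settings` function are hypothetical names, the real module reads these values from FreeSWITCH channel variables rather than plain C strings, and treating any value other than "v2" as v1 is an assumption (the description only covers "v1", "v2", and unset).

```cpp
#include <string>

// Hypothetical container for the resolved settings (not a type from the module).
struct Settings {
  bool use_v2 = false;
  std::string recognizer;  // full recognizer path, only filled in for v2
};

// Each argument is the raw variable value; nullptr means "variable not set".
// Returns false when v2 is requested without a recognizer parent.
bool resolve_settings(const char* version, const char* parent,
                      const char* recognizer_id, Settings& out) {
  out = Settings();
  // Assumption: anything other than the exact string "v2" selects v1.
  out.use_v2 = version != nullptr && std::string(version) == "v2";
  if (!out.use_v2) return true;  // v1 needs no recognizer

  // v2 always needs the parent, even for the wildcard recognizer.
  if (parent == nullptr || *parent == '\0') return false;

  // Use the supplied recognizer id, or fall back to the "_" wildcard.
  const bool has_id = recognizer_id != nullptr && *recognizer_id != '\0';
  out.recognizer = std::string(parent) + "/recognizers/" +
                   (has_id ? recognizer_id : "_");
  return true;
}
```

With a parent of `projects/my-project/locations/global` (a made-up project id) and no recognizer id, the resolved path would end in `/recognizers/_`.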

Differences between v1 and v2

There are sure to be many more differences but these are the main things I found so far.

Some Notes on the Code and Building

To avoid code duplication we placed the v1-specific code in google_glue_v1.cpp and the v2-specific code in google_glue_v2.cpp. Generic code used by both libraries now resides in generic_google_glue.h. We use our own Docker image to build the FreeSWITCH modules, but our Makefile is based on this one:
https://github.com/drachtio/docker-drachtio-freeswitch-base/blob/main/files/Makefile.am.extra
In order to compile and link the v2 stuff we had to add the following lines to the nodist_libfreeswitch_libgoogleapis_la_SOURCES assignment:

libs/googleapis/gens/google/api/policy.pb.cc \
libs/googleapis/gens/google/cloud/speech/v1/resource.pb.cc \
libs/googleapis/gens/google/cloud/speech/v1/resource.grpc.pb.cc \
libs/googleapis/gens/google/cloud/speech/v2/cloud_speech.pb.cc \
libs/googleapis/gens/google/cloud/speech/v2/cloud_speech.grpc.pb.cc \

If you don't do this, you will most likely get linker errors.

Signed-off-by: Andrew Golledge andreas.golledge@gmail.com

@davehorton
Contributor

@ajgolledge can you review our contributor rules and, if you are OK to proceed, sign off on this PR or commit?

Signed-off-by: Andrew Golledge <andreas.golledge@gmail.com>
@ajgolledge
Contributor Author

Is this OK or do I need to actually modify the previous commit? Thanks for the email btw.

@davehorton
Contributor

please just make another commit with the -s flag and push that

@ajgolledge
Contributor Author

I thought I'd done that. There are now two commits.

@davehorton
Contributor

sorry, you are right. Thanks!

@davehorton
Contributor

BTW have you tested this PR yourself?

@ajgolledge
Contributor Author

Yes, we have been running this PR successfully on our development servers for about a month now. The way we build it is based on the way the drachtio code was built, so that might differ from the way you build jambonz now.

@davehorton davehorton merged commit 4e57f73 into jambonz:main Mar 23, 2024
@davehorton
Contributor

@ajgolledge I am having some issues in my testing of this PR, and I wonder if you can provide some insight. I am using the Google credentials that work fine for v1, and I am using this as the recognizer parent:

projects/drachtio-cpaas/locations/global/recognizers/_

However, when I connect I immediately get "operation canceled from google"

{"type":"error","error_code":1,"error_message":"The operation was cancelled."}.

Any idea what might be causing this? Have you tested with using a recognizer created on the fly like this successfully?
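For reading the error payload above: assuming `error_code` carries the standard gRPC status numbering (which the message text "The operation was cancelled." suggests, though the thread does not confirm it), code 1 is CANCELLED. The small lookup below is a sketch for turning such codes into names when logging. `status_code_name` is a hypothetical helper, not something the module provides.

```cpp
#include <string>

// Maps a numeric gRPC status code (0..16) to its canonical name.
// Hypothetical helper for log output; the module may report errors differently.
std::string status_code_name(int code) {
  static const char* const names[] = {
      "OK",                 "CANCELLED",         "UNKNOWN",
      "INVALID_ARGUMENT",   "DEADLINE_EXCEEDED", "NOT_FOUND",
      "ALREADY_EXISTS",     "PERMISSION_DENIED", "RESOURCE_EXHAUSTED",
      "FAILED_PRECONDITION", "ABORTED",          "OUT_OF_RANGE",
      "UNIMPLEMENTED",      "INTERNAL",          "UNAVAILABLE",
      "DATA_LOSS",          "UNAUTHENTICATED"};
  const int count = static_cast<int>(sizeof(names) / sizeof(names[0]));
  if (code < 0 || code >= count) return "UNRECOGNIZED";
  return names[code];
}
```

CANCELLED is a generic status, so on its own it says little about whether the cause is the recognizer path, the credentials, or the streaming configuration.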

@ajgolledge
Contributor Author

@davehorton Yes, this has been tested successfully using recognizers created on the fly. Just to be clear, in case you haven't already done so, you should set GOOGLE_SPEECH_RECOGNIZER_PARENT to:

projects/drachtio-cpaas/locations/global

The /recognizers/ part of the path is then appended, followed by either the "_" wildcard or the recognizer ID, if provided. Creating recognizers requires certain roles or permissions. I'm not sure whether these are also necessary when using the wildcard recognizer because I did not test with multiple credentials. However, if you're still having problems, it might be worth checking which permissions your credentials grant.
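Since the module appends the /recognizers/ segment itself, a full recognizer path passed as the parent would end up with that segment twice. A defensive check along the following lines could catch the mistake before the request is made. This is a sketch only: `strip_recognizer_suffix` is a hypothetical helper, and nothing in this thread indicates that the module performs such normalization.

```cpp
#include <string>

// Returns the parent portion of a path, dropping "/recognizers/<id>" if the
// caller passed a full recognizer path instead of just the parent.
// Hypothetical helper; not part of mod_google_transcribe.
std::string strip_recognizer_suffix(const std::string& path) {
  static const std::string marker = "/recognizers/";
  const std::string::size_type pos = path.find(marker);
  if (pos == std::string::npos) return path;  // already a bare parent
  return path.substr(0, pos);
}
```

Applied to `projects/drachtio-cpaas/locations/global/recognizers/_`, this yields `projects/drachtio-cpaas/locations/global`, which is the form expected in GOOGLE_SPEECH_RECOGNIZER_PARENT.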

@ajgolledge
Contributor Author

ajgolledge commented Apr 2, 2024

At a guess, I'd say that it's the grpc_read_thread function exiting without having experienced max_duration_exceeded or no_audio.

If you're still having trouble with the recognizer, I can try to reproduce the problem. Do you have any log output that you can share? For example, do you see this in the log output:

using recognizer: projects/drachtio-cpaas/locations/global/recognizers/_

or does it not get this far?

@davehorton
Contributor

actually the problem seems to be in testing the speech start and end timeouts. On a branch I added this code

    if (var = switch_channel_get_variable(channel, "GOOGLE_SPEECH_START_TIMEOUT_MS")) {
      auto ms = atoi(var);
      streaming_config->mutable_streaming_features()->mutable_voice_activity_timeout()->mutable_speech_start_timeout()->set_nanos(ms * 1000000);
      switch_log_printf(SWITCH_CHANNEL_SESSION_LOG(m_session), SWITCH_LOG_DEBUG, "setting speech_start_timeout to %d milliseconds\n", ms);
    }

    if (var = switch_channel_get_variable(channel, "GOOGLE_SPEECH_END_TIMEOUT_MS")) {
      auto ms = atoi(var);
      streaming_config->mutable_streaming_features()->mutable_voice_activity_timeout()->mutable_speech_end_timeout()->set_nanos(ms * 1000000);
      switch_log_printf(SWITCH_CHANNEL_SESSION_LOG(m_session), SWITCH_LOG_DEBUG, "setting speech_end_timeout to %d milliseconds\n", ms);
    }

Seems proper, but intermittently when I set these timers I get the immediate cancel error. This is on the "feat/new_params_google_v2" branch if you want to try to recreate it.

@ajgolledge
Contributor Author

I also tried this branch and I always get the error if I set either the GOOGLE_SPEECH_START_TIMEOUT_MS or GOOGLE_SPEECH_END_TIMEOUT_MS values. I tried with various models and various other combinations of parameters but I always get the same error. Enabling the voice activity events works for me, however.
I suspect that other configuration parameters must be set in order for these configuration parameters to be accepted. Unfortunately I don't see any examples where they are used so a bit of experimentation is probably necessary.
Were you also able to get this to work if neither of the two timeout values are set?

@davehorton
Contributor

yes, if neither is set then it does work for me. So I guess it's just a matter of figuring out how to properly use those two parameters.
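One thing that may be worth checking here, offered as a guess that is not confirmed anywhere in this thread: the snippet above puts the whole timeout into the `nanos` field of the protobuf Duration. The Duration message expects `nanos` to stay below one second (at most 999,999,999), with whole seconds carried in the `seconds` field, and `ms * 1000000` also overflows a 32-bit int for values above 2147 ms. A minimal sketch of splitting a millisecond value into the two fields follows. `DurationParts` and `split_milliseconds` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Plain stand-in for the two fields of a protobuf Duration.
struct DurationParts {
  std::int64_t seconds;
  std::int32_t nanos;
};

// Splits a millisecond count into whole seconds plus leftover nanoseconds,
// so that nanos always stays in the range [0, 999999999].
// Negative input is clamped to zero (an assumption made for this sketch).
DurationParts split_milliseconds(std::int64_t ms) {
  if (ms < 0) ms = 0;
  DurationParts parts;
  parts.seconds = ms / 1000;
  parts.nanos = static_cast<std::int32_t>((ms % 1000) * 1000000);
  return parts;
}
```

The two values would then be passed to `set_seconds` and `set_nanos` on the Duration returned by `mutable_speech_start_timeout()` or `mutable_speech_end_timeout()`. Whether this resolves the cancel error is untested, and the API may also impose limits on the timeout values themselves.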
