Updating language maps for Azure STT and Translation. Updating translation lang mapping behavior. #360

hhuangMITRE · 2024-04-05T00:35:40Z

Issues:

Azure Speech and Translation: Update supported language mappings openmpf#1788

This change is

…ation mapping behavior.

jrobble

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @hhuangMITRE)

a discussion (no related file):
Azure Speech supports:

        {
          "name": "LANGUAGE",
          "description": "The language/locale to use for transcription.",
          "type": "STRING",
          "defaultValue": "en-US"
        },

Azure Translation supports:

       {
          "name": "TO_LANGUAGE",
          "description": "The BCP-47 language code for language that the properties should be translated to.",
          "type": "STRING",
          "defaultValue": "en"
        },
        {
          "name": "FROM_LANGUAGE",
          "description": "Optional property that indicates the source language of the text. When provided, it disables automatic language detection. If the text isn't actually the specified FROM_LANGUAGE, the translation endpoint returns the text unmodified.",
          "type": "STRING",
          "defaultValue": ""
        },
        {
          "name": "SUGGESTED_FROM_LANGUAGE",
          "description": "Optional property that indicates the fallback source language to use when automatic language detection fails. The value from this property is only used when automatic language detection fails.",
          "type": "STRING",
          "defaultValue": ""
        },

Each description should state the BCP / ISO format(s) that it accepts, and if no default value is provided, provide an example in the description. Among these, TO_LANGUAGE may be fine since it has both pieces of information.

If it's not already clear in the READMEs, please make sure they mention the BCP / ISO format(s) that each property accepts as well.

python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 33 at r1 (raw file):

# Supported languages can be found here:
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
ISO6393_TO_BCP47 = dict(

This component accepts:

"LANGUAGE": "es-US"

But throws an error with:

"LANGUAGE": "es-us"

Do we need to be case sensitive? If not, let's ignore case on user input. The same goes for Azure translation.

python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 42 at r1 (raw file):

    azj=["az-AZ"], # North Azerbaijani
    azb=["az-AZ"], # South Azerbaijani
    # bel=["be-BY"], # Depreciated

You switch back and forth between "deprecated" and "depreciated". Is that intentional, or do you always want to use "deprecated"?

python/AzureTranslation/acs_translation_component/convert_language_code.py line 39 at r1 (raw file):

# These cover conflicting 639-2 codes and less common variants
# A warning will be issued if these are used.
ISO639_VAR_TO_BCP47 = dict(

I think it makes sense to put the ISO639_VAR_TO_BCP47 dict after ISO6393_TO_BCP47. ISO639_VAR_TO_BCP47 is only used if there is no entry in ISO6393_TO_BCP47.

python/AzureTranslation/acs_translation_component/convert_language_code.py line 321 at r1 (raw file):

        return language_code.lower()

    lang_code = language_code.upper().strip()

Why only strip() this and not language_code when you used it above?

python/AzureTranslation/acs_translation_component/convert_language_code.py line 337 at r1 (raw file):

    elif bcp_code_var := ISO639_VAR_TO_BCP47.get(lang_code):
        logger.warning(
            f"Unable to find direct a BCP code match for {language_code}\n"

Instead of newlines, separate sentences using periods. Add a space between sentences.

hhuangMITRE

Reviewable status: 0 of 9 files reviewed, 6 unresolved discussions (waiting on @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Azure Speech supports:

        {
          "name": "LANGUAGE",
          "description": "The language/locale to use for transcription.",
          "type": "STRING",
          "defaultValue": "en-US"
        },

Azure Translation supports:

       {
          "name": "TO_LANGUAGE",
          "description": "The BCP-47 language code for language that the properties should be translated to.",
          "type": "STRING",
          "defaultValue": "en"
        },
        {
          "name": "FROM_LANGUAGE",
          "description": "Optional property that indicates the source language of the text. When provided, it disables automatic language detection. If the text isn't actually the specified FROM_LANGUAGE, the translation endpoint returns the text unmodified.",
          "type": "STRING",
          "defaultValue": ""
        },
        {
          "name": "SUGGESTED_FROM_LANGUAGE",
          "description": "Optional property that indicates the fallback source language to use when automatic language detection fails. The value from this property is only used when automatic language detection fails.",
          "type": "STRING",
          "defaultValue": ""
        },

Each description should state the BCP / ISO format(s) that it accepts, and if no default value is provided, provide an example in the description. Among these, TO_LANGUAGE may be fine since it has both pieces of information.

If it's not already clear in the READMEs, please make sure they mention the BCP / ISO format(s) that each property accepts as well.

Done! Based on the code, seems BCP-47 is used for all properties. The ISO code check is used for feed-forward language detection (which generally are in ISO format) from other components.

Note: BCP-47 (Best Current Practice) codes can be variable: it uses combinations of ISO language and script codes to distinguish particular languages/locales.

For each property, I've also added a recommendation for users to check the README/Azure service for supported languages. I've updated the list of supported codes in the README for the STT service, the Translation service contains instructions to query Azure for supported codes.

python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 33 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This component accepts:
"LANGUAGE": "es-US"
But throws an error with:
"LANGUAGE": "es-us"
Do we need to be case sensitive? If not, let's ignore case on user input. The same goes for Azure translation.

I don't believe it is necessary but the BCP-47 code page does recommend cases used for certain subtags (https://en.wikipedia.org/wiki/IETF_language_tag#:~:text=An%20IETF%20BCP%2047%20language,the%20IANA%20Language%20Subtag%20Registry).

"Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards."

That said, we can't guarantee that other components or users follow the recommendation (and the BCP overview noted that subtags aren't truly case sensitive), so I still agree with making the codes case insensitive. Azure translation already is case insensitive, and I've updated the STT component to behave similarly. I've adjusted the case of the LANGUAGE= property for the Azure STT tests, they passed after the changes submitted.

python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 42 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

You switch back and forth between "deprecated" and "depreciated". Is that intentional, or do you always want to use "deprecated"?

Good catch, they should all be deprecated, I've scanned and updated the lines.

python/AzureTranslation/acs_translation_component/convert_language_code.py line 39 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I think it makes sense to put the ISO639_VAR_TO_BCP47 dict after ISO6393_TO_BCP47. ISO639_VAR_TO_BCP47 is only used if there is no entry in ISO6393_TO_BCP47.

Done, thanks!

python/AzureTranslation/acs_translation_component/convert_language_code.py line 321 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Why only strip() this and not language_code when you used it above?

Ah, good point. I've moved the strip() call to the top so both checks use it.

python/AzureTranslation/acs_translation_component/convert_language_code.py line 337 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Instead of newlines, separate sentences using periods. Add a space between sentences.

Fixed, thanks!

* Update supported version of Azure STT.

jrobble

Reviewed 9 of 9 files at r2, 4 of 4 files at r3, all commit messages.
Dismissed @jrobble from a discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)

python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 33 at r1 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

I don't believe it is necessary but the BCP-47 code page does recommend cases used for certain subtags (https://en.wikipedia.org/wiki/IETF_language_tag#:~:text=An%20IETF%20BCP%2047%20language,the%20IANA%20Language%20Subtag%20Registry).

"Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards."

That said, we can't guarantee that other components or users follow the recommendation (and the BCP overview noted that subtags aren't truly case sensitive), so I still agree with making the codes case insensitive. Azure translation already is case insensitive, and I've updated the STT component to behave similarly. I've adjusted the case of the LANGUAGE= property for the Azure STT tests, they passed after the changes submitted.

Thanks for the explanation!

Updating language maps for Azure STT and Translation. Updating transl…

d8f17b3

…ation mapping behavior.

hhuangMITRE requested a review from jrobble April 5, 2024 00:35

Adding info, minor cleanup and lang update.

4a317ba

jrobble requested changes Apr 6, 2024

View reviewed changes

hhuangMITRE and others added 3 commits April 7, 2024 04:54

Documentation update and language check improvements.

aa1e811

Update descriptor.json

4680b04

Update descriptor.json

73c0a4d

hhuangMITRE requested a review from jrobble April 8, 2024 05:46

hhuangMITRE commented Apr 8, 2024

View reviewed changes

Formatting changes.

2ef2ed0

* Update supported version of Azure STT.

jrobble reviewed Apr 8, 2024

View reviewed changes

jrobble merged commit 70efb21 into master Apr 8, 2024

jrobble deleted the feature/azure-stt-translation-language-map-updates branch April 8, 2024 15:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Updating language maps for Azure STT and Translation. Updating translation lang mapping behavior. #360

Updating language maps for Azure STT and Translation. Updating translation lang mapping behavior. #360

Uh oh!

hhuangMITRE commented Apr 5, 2024 •

edited by jrobble

Loading

Uh oh!

jrobble left a comment

Uh oh!

hhuangMITRE left a comment

Uh oh!

jrobble left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Updating language maps for Azure STT and Translation. Updating translation lang mapping behavior. #360

Updating language maps for Azure STT and Translation. Updating translation lang mapping behavior. #360

Uh oh!

Conversation

hhuangMITRE commented Apr 5, 2024 • edited by jrobble Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

hhuangMITRE left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hhuangMITRE commented Apr 5, 2024 •

edited by jrobble

Loading