Skip to content

Conversation

hhuangMITRE
Copy link
Contributor

@hhuangMITRE hhuangMITRE commented Apr 5, 2024

@hhuangMITRE hhuangMITRE requested a review from jrobble April 5, 2024 00:35
Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @hhuangMITRE)

a discussion (no related file):
Azure Speech supports:

        {
          "name": "LANGUAGE",
          "description": "The language/locale to use for transcription.",
          "type": "STRING",
          "defaultValue": "en-US"
        },

Azure Translation supports:

       {
          "name": "TO_LANGUAGE",
          "description": "The BCP-47 language code for language that the properties should be translated to.",
          "type": "STRING",
          "defaultValue": "en"
        },
        {
          "name": "FROM_LANGUAGE",
          "description": "Optional property that indicates the source language of the text. When provided, it disables automatic language detection. If the text isn't actually the specified FROM_LANGUAGE, the translation endpoint returns the text unmodified.",
          "type": "STRING",
          "defaultValue": ""
        },
        {
          "name": "SUGGESTED_FROM_LANGUAGE",
          "description": "Optional property that indicates the fallback source language to use when automatic language detection fails. The value from this property is only used when automatic language detection fails.",
          "type": "STRING",
          "defaultValue": ""
        },

Each description should state the BCP / ISO format(s) that it accepts, and if no default value is provided, provide an example in the description. Among these, TO_LANGUAGE may be fine since it has both pieces of information.

If it's not already clear in the READMEs, please make sure they mention the BCP / ISO format(s) that each property accepts as well.



python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 33 at r1 (raw file):

# Supported languages can be found here:
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
ISO6393_TO_BCP47 = dict(

This component accepts:

"LANGUAGE": "es-US"

But throws an error with:

"LANGUAGE": "es-us"

Do we need to be case sensitive? If not, let's ignore case on user input. The same goes for Azure translation.


python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 42 at r1 (raw file):

    azj=["az-AZ"], # North Azerbaijani
    azb=["az-AZ"], # South Azerbaijani
    # bel=["be-BY"], # Depreciated

You switch back and forth between "deprecated" and "depreciated". Is that intentional, or do you always want to use "deprecated"?


python/AzureTranslation/acs_translation_component/convert_language_code.py line 39 at r1 (raw file):

# These cover conflicting 639-2 codes and less common variants
# A warning will be issued if these are used.
ISO639_VAR_TO_BCP47 = dict(

I think it makes sense to put the ISO639_VAR_TO_BCP47 dict after ISO6393_TO_BCP47. ISO639_VAR_TO_BCP47 is only used if there is no entry in ISO6393_TO_BCP47.


python/AzureTranslation/acs_translation_component/convert_language_code.py line 321 at r1 (raw file):

        return language_code.lower()

    lang_code = language_code.upper().strip()

Why only strip() this and not language_code when you used it above?


python/AzureTranslation/acs_translation_component/convert_language_code.py line 337 at r1 (raw file):

    elif bcp_code_var := ISO639_VAR_TO_BCP47.get(lang_code):
        logger.warning(
            f"Unable to find direct a BCP code match for {language_code}\n"

Instead of newlines, separate sentences using periods. Add a space between sentences.

@hhuangMITRE hhuangMITRE requested a review from jrobble April 8, 2024 05:46
Copy link
Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 9 files reviewed, 6 unresolved discussions (waiting on @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Azure Speech supports:

        {
          "name": "LANGUAGE",
          "description": "The language/locale to use for transcription.",
          "type": "STRING",
          "defaultValue": "en-US"
        },

Azure Translation supports:

       {
          "name": "TO_LANGUAGE",
          "description": "The BCP-47 language code for language that the properties should be translated to.",
          "type": "STRING",
          "defaultValue": "en"
        },
        {
          "name": "FROM_LANGUAGE",
          "description": "Optional property that indicates the source language of the text. When provided, it disables automatic language detection. If the text isn't actually the specified FROM_LANGUAGE, the translation endpoint returns the text unmodified.",
          "type": "STRING",
          "defaultValue": ""
        },
        {
          "name": "SUGGESTED_FROM_LANGUAGE",
          "description": "Optional property that indicates the fallback source language to use when automatic language detection fails. The value from this property is only used when automatic language detection fails.",
          "type": "STRING",
          "defaultValue": ""
        },

Each description should state the BCP / ISO format(s) that it accepts, and if no default value is provided, provide an example in the description. Among these, TO_LANGUAGE may be fine since it has both pieces of information.

If it's not already clear in the READMEs, please make sure they mention the BCP / ISO format(s) that each property accepts as well.

Done! Based on the code, seems BCP-47 is used for all properties. The ISO code check is used for feed-forward language detection (which generally are in ISO format) from other components.

Note: BCP-47 (Best Current Practice) codes can be variable: it uses combinations of ISO language and script codes to distinguish particular languages/locales.

For each property, I've also added a recommendation for users to check the README/Azure service for supported languages. I've updated the list of supported codes in the README for the STT service, the Translation service contains instructions to query Azure for supported codes.



python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 33 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This component accepts:

"LANGUAGE": "es-US"

But throws an error with:

"LANGUAGE": "es-us"

Do we need to be case sensitive? If not, let's ignore case on user input. The same goes for Azure translation.

I don't believe it is necessary but the BCP-47 code page does recommend cases used for certain subtags (https://en.wikipedia.org/wiki/IETF_language_tag#:~:text=An%20IETF%20BCP%2047%20language,the%20IANA%20Language%20Subtag%20Registry).

"Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards."

That said, we can't guarantee that other components or users follow the recommendation (and the BCP overview noted that subtags aren't truly case sensitive), so I still agree with making the codes case insensitive. Azure translation already is case insensitive, and I've updated the STT component to behave similarly. I've adjusted the case of the LANGUAGE= property for the Azure STT tests, they passed after the changes submitted.


python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 42 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

You switch back and forth between "deprecated" and "depreciated". Is that intentional, or do you always want to use "deprecated"?

Good catch, they should all be deprecated, I've scanned and updated the lines.


python/AzureTranslation/acs_translation_component/convert_language_code.py line 39 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I think it makes sense to put the ISO639_VAR_TO_BCP47 dict after ISO6393_TO_BCP47. ISO639_VAR_TO_BCP47 is only used if there is no entry in ISO6393_TO_BCP47.

Done, thanks!


python/AzureTranslation/acs_translation_component/convert_language_code.py line 321 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Why only strip() this and not language_code when you used it above?

Ah, good point. I've moved the strip() call to the top so both checks use it.


python/AzureTranslation/acs_translation_component/convert_language_code.py line 337 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Instead of newlines, separate sentences using periods. Add a space between sentences.

Fixed, thanks!

* Update supported version of Azure STT.
Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r2, 4 of 4 files at r3, all commit messages.
Dismissed @jrobble from a discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)


python/AzureSpeechDetection/acs_speech_component/azure_utils.py line 33 at r1 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

I don't believe it is necessary but the BCP-47 code page does recommend cases used for certain subtags (https://en.wikipedia.org/wiki/IETF_language_tag#:~:text=An%20IETF%20BCP%2047%20language,the%20IANA%20Language%20Subtag%20Registry).

"Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards."

That said, we can't guarantee that other components or users follow the recommendation (and the BCP overview noted that subtags aren't truly case sensitive), so I still agree with making the codes case insensitive. Azure translation already is case insensitive, and I've updated the STT component to behave similarly. I've adjusted the case of the LANGUAGE= property for the Azure STT tests, they passed after the changes submitted.

Thanks for the explanation!

@jrobble jrobble merged commit 70efb21 into master Apr 8, 2024
@jrobble jrobble deleted the feature/azure-stt-translation-language-map-updates branch April 8, 2024 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants