Update Azure Translation character limits. Add NLP TextSplitting Models. Update TextSplitter for input langs. #359
Conversation
Force-pushed from 028633a to c502b2f.
Reviewable status: 0 of 28 files reviewed, 18 unresolved discussions (waiting on @brosenberg42 and @jrobble)
python/AzureTranslation/LICENSE line 1 at r3 (raw file):
/*****************************************************************************
(Jeff, Howard): I need to add WtP and spaCy to the LICENSE file
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 430 at r1 (raw file):
class BreakSentenceClient:
    """
    Class to interact with Azure's "/breaksentence" endpoint. It is only used when the text to
(Old Code, Reviewed by Jeff):
Please remove this
formerly interacts with Azure's "/breaksentence" endpoint.
There's no need to mention what the component used to do. It's all there in the git history if we ever want to go back.
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 441 at r2 (raw file):
class BreakSentenceClient:
(Jeff) Looking at this:
class BreakSentenceClient:
"Client" is no longer accurate as this class no longer interacts with a server. Rename to "SentenceSplitter". Also, rename the "break-sentence" dir to "sentence-splitter". Look for other usages of BreakSentence and update them.
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 468 at r2 (raw file):
def split_input_text(self, text: str, from_lang: Optional[str],
                     from_lang_confidence: Optional[float]) -> SplitTextResult:
    """
(Jeff) Looking at this:
Breaks up the given text in to chunks that are under TRANSLATION_MAX_CHARS. Each chunk
will contain one or more complete sentences as reported by the break sentence endpoint.
There is no longer a "break sentence endpoint". Please update this comment.
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 496 at r2 (raw file):
    self._sentence_model)
chunks = list(divided_text_list)
(Jeff) Looking at this:
log.warning(f'Broke text up in to {len(chunks)} chunks. Each chunk will be sent to '
'the translation endpoint.')
log.info('Grouped sentences into %s chunks.', len(chunks))
I don't think you need the warning and the info log statements. Remove the warning.
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 606 at r2 (raw file):
class NewLineBehavior:
    """
    The Azure translation service treats newlines a separator between sentences. This results in
(Jeff): “The Azure translation service treats newlines a separator between”
Update to: "treats newlines as a separator"
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 64 at r2 (raw file):
self._update_spacy_model(model_name)
self.split = self._split_spacy
log.info(f"Setup up spaCy model: {model_name}")
Jeff:
Typo in these. Remove "up" from "Setup up":
log.info(f"Setup up WtP model: {model_name}")
log.info(f"Setup up spaCy model: {model_name}")
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 99 at r2 (raw file):
iso_lang = WtpLanguageSettings.convert_to_iso(lang)
if iso_lang:
    return self.wtp_model.split(text, lang_code=iso_lang)  # type: ignore
(Jeff) Looking at this code:
def _split_wtp(self, text: str, lang: Optional[str] = None) -> List[str]:
if lang:
iso_lang = WTPLanguageSettings.convert_to_iso(lang)
if iso_lang:
return self.wtp_model.split(text, lang_code=iso_lang) # type: ignore
return self.wtp_model.split(text) # type: ignore
If iso_lang is returned as None, should we log a warning?
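A minimal sketch of what that warning could look like, assuming the method shape quoted above (the exact wording was settled later in this review):

def _split_wtp(self, text: str, lang: Optional[str] = None) -> List[str]:
    if lang:
        iso_lang = WtpLanguageSettings.convert_to_iso(lang)
        if iso_lang:
            return self.wtp_model.split(text, lang_code=iso_lang)  # type: ignore
        # Assumed placement: warn when the language cannot be mapped to ISO.
        log.warning('Language %s was not used to train WtP model. '
                    'If text splitting is not working well with WtP, '
                    "consider trying spaCy's sentence detection model.", lang)
    return self.wtp_model.split(text)  # type: ignore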
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 100 at r2 (raw file):
if iso_lang:
    return self.wtp_model.split(text, lang_code=iso_lang)  # type: ignore
return self.wtp_model.split(text)  # type: ignore
Looking at this code:
return self.wtp_model.split(text, lang_code=iso_lang) # type: ignore
return self.wtp_model.split(text) # type: ignore
What do you mean by "type: ignore"?
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 134 at r2 (raw file):
    self.set_text(text)

def set_text(self, text: str):
(Jeff):
I'm trying to make sense of the units for the last line of code shown here:
def set_text(self, text: str):
self._text = text
self._text_full_size = self._get_text_size(text)
chars_per_size = len(text) / self._text_full_size
self._num_overhead_bytes = self._get_text_size('')
self._soft_limit = int(self._limit * chars_per_size) - self._num_overhead_bytes
<…>
How can we subtract bytes from chars? Are we assuming one byte per char since we only work with UTF-8?
[For this PR] we do this:
actual = list(TextSplitter.split(input_text, # text
BreakSentenceClient.TRANSLATION_MAX_CHARS, # limit
BreakSentenceClient.TRANSLATION_MAX_CHARS, # num_boundary_chars
get_azure_char_count, # _get_text_size() --> For azure, emojis count as 2 chars
self.wtp_model))
So here's what we have:
self._soft_limit = int(self._limit * chars_per_size) - self._num_overhead_bytes
= int(AZURE_TOKENS * CHARS/AZURE_TOKENS) - CHARS
= int(CHARS) - CHARS
Here, self._num_overhead_bytes is a bit misleading since we're dealing with a char count, not bytes (although for plain ASCII text, one UTF-8 char is one byte). Should we rename this to overhead_size?
Related, we use _get_text_size() in the following lines:
self._text_full_size = self._get_text_size(text)
self._num_overhead_bytes = self._get_text_size('')
But the first variable ends with _size while the second ends with _bytes.
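To make the units concrete, here is a small worked example (the numbers are assumed, not from the PR):

limit = 50_000          # Azure's limit, in Azure "size" units
text_len = 12           # len(text), in characters
text_full_size = 14     # _get_text_size(text); two emojis counted twice
chars_per_size = text_len / text_full_size
overhead = 0            # _get_text_size('') for a plain request
soft_limit = int(limit * chars_per_size) - overhead   # 42857 characters

So the soft limit shrinks in proportion to how much Azure over-counts the text.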
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 145 at r2 (raw file):
(Jeff)
Looking at this code:
# Recalculate overhead bytes with chars_per_size weighting.
self._soft_limit = max(1,
int((self._limit - self._num_overhead_bytes) * chars_per_size))
Here you're recalculating the soft_limit so the comment seems wrong. Consider:
Recalculate soft limit by subtracting overhead from limit before applying chars_per_size weighting.
python/AzureTranslation/nlp_text_splitter/wtp_lang_settings.py line 29 at r3 (raw file):
from typing import Optional

class WtpLanguageSettings:
(Old Code, Reviewed by Jeff): In camelcase this should be WtpLanguageSettings.
python/AzureTranslation/plugin-files/descriptor/descriptor.json line 86 at r2 (raw file):
{ "name": "SENTENCE_SPLITTER_CHAR_COUNT", "description": "Integer value specifying maximum number of characters to process through sentence splitter. Defaults to 500 characters.",
(From Jeff):
No need to mention the default in the description. Please remove it.
python/AzureTranslation/plugin-files/descriptor/descriptor.json line 92 at r3 (raw file):
{ "name": "SENTENCE_MODEL", "description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model and the Where's the Point (WtP) `wtp-bert-mini` model.",
(Old Code, reviewed by Jeff): One could read this and think that wtp-bert-mini is a spaCy model. Say:
Supported options are spaCy's multilingual 'xx_sent_ud_sm' model and the Where's the Point (WtP) 'wtp-bert-mini' model.
python/AzureTranslation/tests/test_acs_translation.py line 515 at r2 (raw file):
dict(ACS_URL='http://example.com/test?suggestedFrom=ru&category=whatever'), 'en', None, {'suggestedFrom': 'ru', 'category': 'whatever'})
Rename test_guess_breaks_all_types to test_split_short_punctuation.
Rename test_guess_breaks_actual_sentence to test_split_simple_sentence.
Rename test_sentence_end_punctuation to test_split_sentence_end_punctuation.
Rename test_split_text to test_split_wtp_known_language.
Rename test_split_text_check_wtp_unusual_lang to test_split_wtp_unknown_lang.
python/AzureTranslation/tests/test_acs_translation.py line 614 at r2 (raw file):
self.assertAlmostEqual(1.0, float(detection_props['TRANSLATION SOURCE LANGUAGE CONFIDENCE']))
def test_split_text used to have expected_chunk_lengths = [142, 137, 141], and those lengths were used to check the length of the translation requests. Your tests no longer check those lengths, only that the beginning and ending text matches what's expected. Is there a reason we're no longer checking the lengths?
Related, the test previously did self.assertEqual(sum(expected_chunk_lengths), len(break_sentence_request_text)) to ensure the input text length matched the sum of the chunk lengths. It seems like a good idea to do that.
If you reintroduce the above things, I don't think you'll need to do:
behavior = NewLineBehavior.get({})
actual = list(TextSplitter.split(behavior(text, 'zh-Hant'),
BreakSentenceClient.TRANSLATION_MAX_CHARS,
BreakSentenceClient.TRANSLATION_MAX_CHARS,
get_azure_char_count,
self.wtp_model,
'zh-Hant'))
self.assertEqual(4, len(actual))
That check is redundant. I think it's better to just check what's in the request bodies.
Note that test_split_text_check_wtp_unusual_lang should have the same kind of checks and asserts as def test_split_text.
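A hedged sketch of the length checks suggested above (translation_request_texts is a hypothetical name for the captured request bodies):

expected_chunk_lengths = [142, 137, 141]
self.assertEqual(sum(expected_chunk_lengths), len(text))
for expected_len, request_text in zip(expected_chunk_lengths, translation_request_texts):
    self.assertEqual(expected_len, len(request_text))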
python/AzureTranslation/tests/test_acs_translation.py line 851 at r2 (raw file):
@mock.patch.object(BreakSentenceClient, 'TRANSLATION_MAX_CHARS', new_callable=lambda: 5)
Remove these lines from the tests as they are no longer needed: "@mock.patch.object(BreakSentenceClient, 'TRANSLATION_MAX_CHARS', "
You can simply set the value directly in your invocation rather than rely on the mock. For example, instead of:
@mock.patch.object(BreakSentenceClient, 'TRANSLATION_MAX_CHARS', new_callable=lambda: 5)
actual = list(TextSplitter.split(input_text,
BreakSentenceClient.TRANSLATION_MAX_CHARS,
BreakSentenceClient.TRANSLATION_MAX_CHARS,
get_azure_char_count,
self.wtp_model))
Do:
actual = list(TextSplitter.split(input_text,
5,
5,
get_azure_char_count,
self.wtp_model))
python/AzureTranslation/tests/test_acs_translation.py line 864 at r2 (raw file):
self.assertEqual(6, len(actual)) # a.bc,
What do comments mean above each line here?
# a.bc,
self.assertEqual('a.bc,', actual[0])
# bc,d.
self.assertEqual('d.efg', actual[1])
# efg,h
self.assertEqual(',hij ', actual[2])
# hij k
self.assertEqual('kl\n\n', actual[3])
# kl\n\nm
self.assertEqual('mnopq', actual[4])
# mnopq
self.assertEqual('rs.tu', actual[5])
I can understand "# a.bc," since that is the first 5 chars the splitter is working on; however, I don't understand "# bc,d.". It wouldn't go back and look at "bc," again since it already made that part of the first split, right?
You reuse the same line comments for the spaCy test right after WtP one.
Reviewable status: 0 of 28 files reviewed, 18 unresolved discussions (waiting on @brosenberg42 and @jrobble)
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 100 at r2 (raw file):
(Brian)
# type: ignore
That disables the static type checker for that line. When libraries do not include type annotations, the static type checker infers them. Sometimes it isn't able to correctly determine the exact type so it will say that a function returns something more general than it does to be safe. It is also bad at inferring generics. For self.wtp_model.split(text) the inferred return value is:
List[Unknown] | Generator[List[Unknown], Any, None]
The actual return value is:
List[str] | Generator[List[str], Any, None]
If WTP.split were properly annotated it would end up being:
class WTP:
    @typing.overload
    def split(text: str) -> List[str]: ...
    @typing.overload
    def split(texts: Iterable[str]) -> Generator[List[str], Any, None]: ...
We only ever pass a single string to WTP.split, so we know that it will always return a List[str]. A slightly better option than the suppression comment is to use:
def _split_wtp(self, text: str) -> List[str]:
    return typing.cast(List[str], self.wtp_model.split(text))
It doesn't matter though because it is only for the static type checker and has no effect at runtime.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 134 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Summarizing follow-up:
Azure sometimes will treat certain characters as having size 2 (e.g. emojis are often counted as 2 characters).
We agreed that we should rename num_overhead_bytes to overhead_size for clarity. Then:
self._num_overhead_bytes = self._get_text_size('')
Becomes:
self._overhead_size = self._get_text_size('')
Which pairs nicely with:
self._text_full_size = self._get_text_size(text)
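Putting the pieces together, set_text would look roughly like this after the rename (a sketch that also folds in the soft-limit fix discussed at line 145):

def set_text(self, text: str):
    self._text = text
    self._text_full_size = self._get_text_size(text)
    chars_per_size = len(text) / self._text_full_size
    self._overhead_size = self._get_text_size('')  # renamed from _num_overhead_bytes
    # Subtract overhead from the limit before applying chars_per_size weighting.
    self._soft_limit = max(1, int((self._limit - self._overhead_size) * chars_per_size))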
python/AzureTranslation/tests/test_acs_translation.py line 864 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Brian)
What do comments mean above each line here? # a.bc,
In the original test, the comments showed the input to the sentence break guesser. The test is set up so that the break sentence endpoint accepts only up to 5 characters, so the comments always contain 5 characters.
In the original test, I check that the sentence break guesser decided that the first sentence was "a." because it should favor breaking at the sentence ending punctuation.
The updated test in the PR essentially verifies that sentence splitting is NOT working because it always outputs the 5 character string that is received as input.
Reviewable status: 0 of 28 files reviewed, 18 unresolved discussions (waiting on @brosenberg42 and @jrobble)
python/AzureTranslation/LICENSE line 1 at r3 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Jeff, Howard): I need to add WtP and spaCy to the LICENSE file
Made the requested change, thanks!
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 430 at r1 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Old Code, Reviewed by Jeff):
Please remove this
formerly interacts with Azure's "/breaksentence" endpoint.

There's no need to mention what the component used to do. It's all there in the git history if we ever want to go back.
Changes made, thank you!
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 496 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Jeff) Looking at this:
log.warning(f'Broke text up in to {len(chunks)} chunks. Each chunk will be sent to '
            'the translation endpoint.')
log.info('Grouped sentences into %s chunks.', len(chunks))

I don't think you need the warning and the info log statements. Remove the warning.
Warning removed.
python/AzureTranslation/acs_translation_component/acs_translation_component.py line 606 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Jeff): “The Azure translation service treats newlines a separator between”
Update to: "treats newlines as a separator"
Fixed, thank you.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 64 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Jeff:
Typo in these. Remove "up" from "Setup up":
log.info(f"Setup up WtP model: {model_name}") log.info(f"Setup up spaCy model: {model_name}")
Fixed, thanks.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 99 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Jeff) Looking at this code:
def _split_wtp(self, text: str, lang: Optional[str] = None) -> List[str]:
    if lang:
        iso_lang = WTPLanguageSettings.convert_to_iso(lang)
        if iso_lang:
            return self.wtp_model.split(text, lang_code=iso_lang)  # type: ignore
    return self.wtp_model.split(text)  # type: ignore

If iso_lang is returned as None, should we log a warning?
Added the log.warning below; it also suggests that users consider spaCy's model if WtP is not working well.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 100 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Thanks Brian!
In that case, I think it's best to leave out the type: ignore comment; I feel the ideal solution would be for WTP.split to have proper annotations.
As our changes have no effect at runtime, I'm leaning towards keeping self.wtp_model.split(text) as is, but I can also add in the suppression comment if that's preferred.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 134 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Changes made, thanks everyone for following up on this!
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 145 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Updated the description here and below.
python/AzureTranslation/nlp_text_splitter/wtp_lang_settings.py line 29 at r3 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Old Code, Reviewed by Jeff): In camelcase this should be WtpLanguageSettings.
Fixed, thank you!
python/AzureTranslation/plugin-files/descriptor/descriptor.json line 86 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(From Jeff):
No need to mention the default in the description. Please remove it.
Removed.
python/AzureTranslation/plugin-files/descriptor/descriptor.json line 92 at r3 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
(Old Code, reviewed by Jeff): One could read this and think that wtp-bert-mini is a spaCy model. Say:
Supported options are spaCy's multilingual 'xx_sent_ud_sm' model and the Where's the Point (WtP) 'wtp-bert-mini' model.
Fixed, thank you!
python/AzureTranslation/tests/test_acs_translation.py line 515 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Changes complete.
python/AzureTranslation/tests/test_acs_translation.py line 614 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Reintroduced the additional checks, and cleaned up the redundant text split for both tests. Thanks!
python/AzureTranslation/tests/test_acs_translation.py line 851 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Completed!
python/AzureTranslation/tests/test_acs_translation.py line 864 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Hmm... in this case, it seems that both spaCy and WtP also check sentence lengths to ensure that edge cases involving punctuation within a single sentence are properly handled.
As a result, these 5-character strings are not long enough to trigger splitting by a noticeable amount, although WtP and spaCy do seem to split one or two of the fragments.
I've updated the test to check for the following:
A sentence containing a period within it (Dr. Test).
A sentence that uses a different punctuation.
A sentence that uses an ellipsis ( ... ) and another following some newlines.
In general, I expect to see splits less than 30 chars in length.
Reviewed 10 of 20 files at r1, 9 of 14 files at r3.
Reviewable status: 19 of 28 files reviewed, 9 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)
a discussion (no related file):
Please merge develop into this so that it includes your hotfix.
python/AzureTranslation/README.md line 65 at r5 (raw file):
# Text Splitter Job Properties
I pushed an update to improve the formatting of this file.
python/AzureTranslation/README.md line 74 at r5 (raw file):
WtP models are trained to split up multilingual text by sentence without the need of an input language tag. The disadvantage is that the most accurate WtP models will need ~3.5
You mention GPU models here and in the description of SENTENCE_MODEL, but there's no way to use them with this component since "cpu" is hard-coded. This is confusing. Consider exposing a property to use GPU or CPU, but default to CPU.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 100 at r2 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
I don't have a strong preference either way. Now that I know why those comments are there I'm fine with leaving them if it helps prevent the static type checker from reporting non-issues. It's probably not worth modifying our code much to make the type checker happy.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 101 at r4 (raw file):
    return self.wtp_model.split(text, lang_code=iso_lang)
else:
    log.warning(f"Warning: Language {lang} was not used to train WtP model."
Let's rephrase this as:
"Language {lang} was not used to train WtP model. "
"If text splitting is not working well with WtP, "
"consider trying spaCy's sentence detection model."You don't need to include "Warning: " since the line will be logged using the warning level. I added a space after the first period. Also, let's not get too specific about how to use spaCy. That may change in the future.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 143 at r4 (raw file):
self._text_full_size = self._get_text_size(text)
chars_per_size = len(text) / self._text_full_size
self._overhead_size = self._get_text_size('')
Note that changes to nlp_text_splitter also need to be applied to other components that use the text splitter.
python/AzureTranslation/tests/test_acs_translation.py line 618 at r4 (raw file):
self.assertEqual(sum(expected_chunk_lengths), len(text))
# WtP will split by newlines, so some of the chunks
It also doesn't seem to care about the wide period character.
python/AzureTranslation/tests/test_acs_translation.py line 855 at r4 (raw file):
def test_guess_split_edge_cases(self):
Let's move this after test_guess_split_simple_sentence and test_split_sentence_end_punctuation since those are simpler tests.
Reviewed 2 of 20 files at r1, 2 of 3 files at r2, 2 of 14 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)
python/AzureTranslation/README.md line 77 at r4 (raw file):
Through preliminary investigation, we identified the [WtP library ("Where's the Point")](https://github.com/bminixhofer/wtpsplit) and spaCy's multilingual sentence detection model for identifying sentence breaks in a large section of text.
You linked to WtP. Also link to spaCy.
Reviewable status: 23 of 28 files reviewed, 9 unresolved discussions (waiting on @brosenberg42 and @jrobble)
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Please merge develop into this so that it includes your hotfix.
Done!
python/AzureTranslation/README.md line 77 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
You linked to WtP. Also link to spaCy.
Done
python/AzureTranslation/README.md line 65 at r5 (raw file):
Previously, jrobble (Jeff Robble) wrote…
I pushed an update to improve the formatting of this file.
Thank you, I've transferred the changes in. I did another scan to make sure these changes were not overwritten by the merge from develop.
python/AzureTranslation/README.md line 74 at r5 (raw file):
Previously, jrobble (Jeff Robble) wrote…
You mention GPU models here and in the description of SENTENCE_MODEL, but there's no way to use them with this component since "cpu" is hard-coded. This is confusing. Consider exposing a property to use GPU or CPU, but default to CPU.
Understood, I've added SENTENCE_MODEL_CPU_ONLY to toggle CPU vs GPU behavior. While doing that, I also added SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE.
More advanced WtP models (ones with the language adaptor feature) require the source text language to be provided, so if the component fails to provide a valid language, we'll need to fall back to a default language for those cases. I've added checks in the text_splitter to support this property, as well as a locally run test (I didn't want to download another WtP model for the Docker image just to support this test).
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 100 at r2 (raw file):
Previously, jrobble (Jeff Robble) wrote…
I don't have a strong preference either way. Now that I know why those comments are there I'm fine with leaving them if it helps prevent the static type checker from reporting non-issues. It's probably not worth modifying our code much to make the type checker happy.
Sounds good! Resolving this discussion then.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 101 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Done, thanks!
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 143 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Note that changes to nlp_text_splitter also need to be applied to other components that use the text splitter.
That change is coming in shortly in a separate PR.
python/AzureTranslation/tests/test_acs_translation.py line 618 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
It also doesn't seem to care about the wide period character.
It seems that the extra newlines can cause WtP to split text prematurely. When Azure properly detects Chinese and trims out the excess newlines, the splitting behavior does take the wide periods ('。') into account, but I feel it doesn't always have priority.
However, if language detection fails and the newlines are kept, it seems that WtP favors dividing by newline instead.
python/AzureTranslation/tests/test_acs_translation.py line 855 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Let's move this after test_guess_split_simple_sentence and test_split_sentence_end_punctuation since those are simpler tests.
Done!
Reviewed 4 of 5 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)
python/AzureTranslation/Dockerfile line 44 at r7 (raw file):
# Install WtP and spaCy
RUN pip install --upgrade pip && \
    pip install spacy>=3.7.4 && \
These >= lines aren't enforcing a version, rather, they are redirecting the log output:
root@7a452a7c8b0c:/home/mpf/component_src# ls -la
total 32
drwxr-xr-x 1 root root 4096 Apr 12 19:56 .
drwxr-xr-x 1 root root 4096 Apr 10 13:45 ..
-rw-r--r-- 1 root root 11673 Apr 12 19:56 '=1.3.0'
-rw-r--r-- 1 root root 11011 Apr 12 19:56 '=3.7.4'
root@7a452a7c8b0c:/home/mpf/component_src# head -5 \=1.3.0
Collecting wtpsplit
Downloading wtpsplit-1.3.0-py3-none-any.whl.metadata (497 bytes)
Collecting onnxruntime>=1.13.1 (from wtpsplit)
Downloading onnxruntime-1.17.1-cp38-cp38-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Collecting transformers>=4.22.2 (from wtpsplit)
root@7a452a7c8b0c:/home/mpf/component_src# head -5 \=3.7.4
Collecting spacy
Downloading spacy-3.7.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
root@7a452a7c8b0c:/home/mpf/component_src#
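For reference, the usual fix is to quote the requirement specifiers so the shell does not treat >= as an output redirection (a sketch; the exact package list in the Dockerfile may differ):

RUN pip install --upgrade pip && \
    pip install 'spacy>=3.7.4' && \
    pip install 'wtpsplit>=1.3.0'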
python/AzureTranslation/README.md line 38 at r7 (raw file):
not end with `/translate` because two separate endpoints are used. `ACS_URL + '/translate'` is used for translation. `ACS_URL + '/breaksentence'` is used to break up text when it is too long
The breaksentence endpoint is no longer used. Please remove this to avoid confusion.
python/AzureTranslation/README.md line 50 at r7 (raw file):
# Primary Job Properties - `TO_LANGUAGE`: The BCP-47 language code for the language that the properties - should be translated to.
This should not be formatted as a bullet.
python/AzureTranslation/README.md line 122 at r7 (raw file):
it is recommended to set SENTENCE_MODEL_CPU_ONLY=FALSE as such models can use up to to ~3.5 GB of GPU memory.
You have "to" in there twice.
Let's rephrase this as:
If using more advanced WtP models like wtp-canine-s-12l, it is recommended to set SENTENCE_MODEL_CPU_ONLY=FALSE to improve performance. That model can use up to ~3.5 GB of GPU memory.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 103 at r7 (raw file):
self.wtp_model = WtP(wtp_model_name)
if model_setting != "cpu" and model_setting != "cuda":
"cuda" is not a valid option because earlier in the code you do:
if not nlp_model_setting:
    nlp_model_setting = "gpu"

I changed the last line to use "cuda", ran this job:
{
"jobProperties": {
"SENTENCE_MODEL": "wtp-canine-s-1l",
"SENTENCE_MODEL_CPU_ONLY": "false"
},
"media": [
{
"mediaUri": "file:///opt/mpf/share/remote-media/art-of-war.txt"
}
],
"pipelineName": "AZURE TRANSLATION TEXT FILE PIPELINE"
}

And got this in the log output:
2024-04-15 18:52:24,979 WARN [text_splitter.py:98] - [Job 56:art-of-war.txt] Model wtp-canine-s-1l not found, downloading from hugging face.
2024-04-15 18:52:53,690 ERROR [acs_translation_component.py:128] - [Job 56:art-of-war.txt] Failed to complete job due to the following exception:
Traceback (most recent call last):
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py", line 122, in get_detections_from_non_composite
tc = TranslationClient(job.job_properties, sentence_model)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py", line 190, in __init__
self._sentence_splitter = SentenceSplitter(job_properties, sentence_model)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py", line 472, in __init__
self._sentence_model.update_model(nlp_model_name, nlp_model_setting, wtp_default_language)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py", line 67, in update_model
self._update_wtp_model(model_name, model_setting, default_lang)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py", line 107, in _update_wtp_model
self.wtp_model.to(model_setting)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2576, in to
return super().to(*args, **kwargs)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
2024-04-15 18:52:53,691 ERROR [org.mitre.mpf.detection:0] - [Job 56:art-of-war.txt] An error occurred while invoking the "get_detections_from_generic" method on the Python component: AssertionError: Torch not compiled with CUDA enabled
At:
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/cuda/__init__.py(324): _lazy_init
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(1150): convert
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(825): _apply
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(802): _apply
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(802): _apply
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(802): _apply
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(1152): to
/opt/mpf/plugin-venv/lib/python3.8/site-packages/transformers/modeling_utils.py(2576): to
/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py(107): _update_wtp_model
/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py(67): update_model
/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(472): __init__
/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(190): __init__
/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(129): get_detections_from_non_composite
/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(104): get_detections_from_generic
The job completed with:
"errors": [
{
"mediaId": 1,
"details": [
{
"source": "AZURETRANSLATION",
"code": "UNRECOGNIZED_DETECTION_ERROR",
"message": "An error occurred while invoking the \"get_detections_from_generic\" method on the Python component: Torch not compiled with CUDA enabled"
}
]
}
],python/AzureTranslation/nlp_text_splitter/text_splitter.py line 120 at r7 (raw file):
)
if self._mandatory_wtp_language:
    log.warning("WtP model requires a language."
Add a space between sentences.
python/AzureTranslation/plugin-files/descriptor/descriptor.json line 98 at r7 (raw file):
{ "name": "SENTENCE_MODEL_CPU_ONLY", "description": "If set to true, only use CPU resources for the sentence detection model. If set to False, allow sentence model to also use GPU resources (larger WtP models will use up to ~3.5 GB of GPU memory).",
Let's omit the "(larger WtP models will use up to ~3.5 GB of GPU memory)" part. Let's leave that level of detail to the README.
python/AzureTranslation/tests/test_acs_translation.py line 618 at r4 (raw file):
When Azure properly detects Chinese and trims out the excess newlines
Are you talking about what Azure STT does to the text before it gets to Azure Translation?
python/AzureTranslation/tests/test_acs_translation.py line 605 at r7 (raw file):
Let's rephrase this comment to:
This test should only be run manually outside of a Docker build. The WtP canine model is ~1 GB and not worth downloading and adding to the pre-built Docker image.
python/AzureTranslation/tests/test_acs_translation.py line 707 at r7 (raw file):
self.assertEqual(sum(expected_chunk_lengths), len(text))
# WtP will split by newlines (over the '。' character),
Say "(preferred over the ...`
Reviewable status: 22 of 28 files reviewed, 10 unresolved discussions (waiting on @brosenberg42 and @jrobble)
python/AzureTranslation/Dockerfile line 44 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Fixed. Thanks for spotting that.
python/AzureTranslation/README.md line 38 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
The breaksentence endpoint is no longer used. Please remove this to avoid confusion.
Done!
python/AzureTranslation/README.md line 50 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This should not be formatted as a bullet.
Fixed!
python/AzureTranslation/README.md line 122 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Done, thanks for spotting that.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 103 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
"cuda" is not a valid option because earlier in the code you do:
if not nlp_model_setting: nlp_model_setting = "gpu"I changed the last line to use "cuda", ran this job:
{ "jobProperties": { "SENTENCE_MODEL": "wtp-canine-s-1l", "SENTENCE_MODEL_CPU_ONLY": "false" }, "media": [ { "mediaUri": "file:///opt/mpf/share/remote-media/art-of-war.txt" } ], "pipelineName": "AZURE TRANSLATION TEXT FILE PIPELINE" }And got this in the log output:
2024-04-15 18:52:24,979 WARN [text_splitter.py:98] - [Job 56:art-of-war.txt] Model wtp-canine-s-1l not found, downloading from hugging face. 2024-04-15 18:52:53,690 ERROR [acs_translation_component.py:128] - [Job 56:art-of-war.txt] Failed to complete job due to the following exception: Traceback (most recent call last): File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py", line 122, in get_detections_from_non_composite tc = TranslationClient(job.job_properties, sentence_model) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py", line 190, in __init__ self._sentence_splitter = SentenceSplitter(job_properties, sentence_model) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py", line 472, in __init__ self._sentence_model.update_model(nlp_model_name, nlp_model_setting, wtp_default_language) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py", line 67, in update_model self._update_wtp_model(model_name, model_setting, default_lang) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py", line 107, in _update_wtp_model self.wtp_model.to(model_setting) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2576, in to return super().to(*args, **kwargs) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1152, in to return self._apply(convert) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply module._apply(fn) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply module._apply(fn) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply module._apply(fn) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 825, in _apply param_applied = fn(param) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1150, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) File "/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled 2024-04-15 18:52:53,691 ERROR [org.mitre.mpf.detection:0] - [Job 56:art-of-war.txt] An error occurred while invoking the "get_detections_from_generic" method on the Python component: AssertionError: Torch not compiled with CUDA enabled At: /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/cuda/__init__.py(324): _lazy_init /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(1150): convert /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(825): _apply /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(802): _apply /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(802): _apply /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(802): _apply /opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/nn/modules/module.py(1152): to /opt/mpf/plugin-venv/lib/python3.8/site-packages/transformers/modeling_utils.py(2576): to /opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py(107): _update_wtp_model 
/opt/mpf/plugin-venv/lib/python3.8/site-packages/nlp_text_splitter/text_splitter.py(67): update_model /opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(472): __init__ /opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(190): __init__ /opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(129): get_detections_from_non_composite /opt/mpf/plugin-venv/lib/python3.8/site-packages/acs_translation_component/acs_translation_component.py(104): get_detections_from_genericThe job completed with:
"errors": [ { "mediaId": 1, "details": [ { "source": "AZURETRANSLATION", "code": "UNRECOGNIZED_DETECTION_ERROR", "message": "An error occurred while invoking the \"get_detections_from_generic\" method on the Python component: Torch not compiled with CUDA enabled" } ] } ],
Thank you. Added a torch.cuda.is_available() check.
In addition, our current Dockerfile build only allows for the CPU version of PyTorch, without CUDA support.
To fully enable this option, I've added a parameter to the Dockerfile to toggle CPU vs GPU PyTorch installations and added the info for setting this parameter to the README.
Overall, the build seems to be working, but MITRE's machines are blocking the Hugging Face model, so I'm going to retest one more time. I'll ping this message again when that check is complete.
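For illustration, a hedged sketch of the guard described above (names taken from the traceback; the actual change may differ):

import torch

if model_setting == 'cuda' and not torch.cuda.is_available():
    # Fall back to CPU instead of letting wtp_model.to('cuda') raise.
    log.warning('PyTorch determined that CUDA is not available. Falling back to CPU.')
    model_setting = 'cpu'
self.wtp_model.to(model_setting)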
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 120 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Add a space between sentences.
Done, thanks!
python/AzureTranslation/plugin-files/descriptor/descriptor.json line 98 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Let's omit the "(larger WtP models will use up to ~3.5 GB of GPU memory)" part. Let's leave that level of detail to the README.
Done. I also added a suggestion to consult the README, since there's more info there that users need in order to enable this option with the Docker build.
python/AzureTranslation/tests/test_acs_translation.py line 618 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
When Azure properly detects Chinese and trims out the excess newlines
Are you talking about what Azure STT does to the text before it gets to Azure Translation?
To clarify: Azure Translation has a feature for language detection of input text (so this would usually involve text extracted from other documents). If the detected language is a variant of Chinese, the newlines in the document are replaced with an empty string ''; otherwise, they are replaced with a space ' '.
I misspoke earlier when I said the newlines are kept. Rather, the newlines are incorrectly replaced with extra whitespace.
WtP doesn't care about whitespace as much, so the behavior seen during the test is mostly due to the extra characters altering the amount of text being processed by WtP and split apart. From my testing, some of the wide space characters are being separated out, but not all of them.
When testing Chinese text with WtP outside of the component, I noticed that it's sensitive to newlines even more than the wide period character. Every time there's a newline, there seems to be another split.
In general, that newline WtP behavior should be captured in the tests, so I've added a quick one for newlines vs no-newlines (also compared spaCy and WtP) in a separate test.
# Result after passing Chinese text sample through WtP.split(), no newline replacement
['兵者,國之大事,死生之地,存亡之道,不可不察也。\n',
'故經之以五事,校之以計,而索其情:',
'一曰道,二曰天,三曰地,四曰將,五曰法。',
'道者,令民於上同意,可與之死,可與之生,\n',
'而不危也;天者,陰陽、寒暑、時制也;地者,遠近、險易、廣狹、死生也;將者,智、信、仁、勇、嚴也;法者,曲制、官道、\n',
'主用也。凡此五者,將莫不聞,知之者勝,不知之者不勝。故校之以計,而索其情,曰:主孰有道?將孰有能?天地孰得?法令孰行?\n',
'兵眾孰強?士卒孰練?賞罰孰明?吾以此知勝負矣。將聽吾計,用之必勝,留之;將不聽吾計,用之必敗,去之。計利以聽,乃為之勢,\n',
'以佐其外。勢者,因利而制權也。兵者,詭道也。故能而示之不能,用而示之不用,近而示之遠,遠而示之近。利而誘之,亂而取之,\n',
'實而備之,強而避之,怒而撓之,卑而驕之,佚而勞之,親而離之,攻其無備,出其不意。此兵家之勝,不可先傳也。\n',
'夫未戰而廟算勝者,得算多也;未戰而廟算不勝者,得算少也。多算勝少算,而況於無算乎!吾以此觀之,勝負見矣。']
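For illustration, a rough sketch of the replacement rule described above (replace_newlines is a hypothetical helper, not the component's actual NewLineBehavior implementation):

def replace_newlines(text: str, lang: Optional[str]) -> str:
    # Chinese does not use spaces between words, so newlines are removed
    # outright; for other languages they are replaced with a space.
    separator = '' if lang and lang.startswith('zh') else ' '
    return text.replace('\n', separator)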
python/AzureTranslation/tests/test_acs_translation.py line 605 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Let's rephrase this comment to:
This test should only be run manually outside of a Docker build. The WtP canine model is ~1 GB and not worth downloading and adding to the pre-built Docker image.
Done.
python/AzureTranslation/tests/test_acs_translation.py line 707 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Say "(preferred over the ...`
Quick note, a version of this statement has been transferred to the new test I've added, which demonstrates the difference between WtP and spaCy multilingual models. Also, added in a quick check regarding WtP's preference of newlines over wide period characters.
Reviewed 1 of 6 files at r8, 5 of 5 files at r9, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 90 at r9 (raw file):
This is logged as a warning, so no need to say "Warning" in the message. Also, I ran a test using a version of the Docker image with GPU support built in and got the "no cuda support for this installation of PyTorch" message in the log file, which was misleading. To find the underlying reason I did:
root@90e56804e80c:/opt/mpf/plugin-venv/bin# . ./activate
(plugin-venv) root@90e56804e80c:/opt/mpf/plugin-venv/bin# python3
Python 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/opt/mpf/plugin-venv/lib/python3.8/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11060). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Please update the message to:
"PyTorch determined that CUDA is not available. "
"You may need to update the NVIDIA driver for the host system, "
"or reinstall PyTorch with GPU support by setting "
"ARGS BUILD_TYPE=gpuin the Dockerfile when building this component."
python/AzureTranslation/tests/test_acs_translation.py line 618 at r4 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
Thanks for the explanation.
python/AzureTranslation/tests/test_acs_translation.py line 603 at r9 (raw file):
def test_split_engine_difference(self):
👍
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @hhuangMITRE)
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 103 at r7 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
The build arg for GPU support looks good to me.
Reviewable status: 27 of 28 files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @jrobble)
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 103 at r7 (raw file):
Previously, jrobble (Jeff Robble) wrote…
The build arg for GPU support looks good to me.
Thanks, I'll rerun the Jenkins build again also.
python/AzureTranslation/nlp_text_splitter/text_splitter.py line 90 at r9 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Done! I also need to get into the habit of remembering that the message will start with WARNING:
python/AzureTranslation/tests/test_acs_translation.py line 618 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Thanks for the explanation.
Of course, thanks for double checking this as well!
Reviewed 1 of 1 files at r10, 9 of 9 files at r11, all commit messages.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on @brosenberg42)
python/AzureTranslation/Dockerfile line 65 at r11 (raw file):
# Environment variables required by nvidia runtime.
ENV NVIDIA_VISIBLE_DEVICES=all
Technically, I don't think these env. vars. are required, since I was able to get the previous version (without them) to work by setting the env. var. in the service entry in the compose file:
azure_translation:
depends_on:
- workflow-manager
deploy:
mode: replicated
replicas: 1
environment:
WFM_PASSWORD: mpfadm
WFM_USER: admin
MPF_PROP_ACS_URL: [REDACTED]
MPF_PROP_ACS_SUBSCRIPTION_KEY: [REDACTED]
NVIDIA_VISIBLE_DEVICES: 2
image: someregistry/openmpf_azure_translation:deleteme-jrobble-gpu
volumes:
- shared_data:/opt/mpf/share:rw

However, we do set these in many of our other components, so let's keep them for consistency.
python/AzureTranslation/LICENSE line 30 at r11 (raw file):
MIT License
Copyright (c) 2024 Benjamin Minixhofer
Did you intend to update someone else's license?
python/AzureTranslation/LICENSE line 56 at r11 (raw file):
The MIT License (MIT)
Copyright (C) 2016-2024 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Did you intend to update someone else's license?
Reviewed 2 of 2 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on @brosenberg42)