Hotfix: Adding capability for TikaTextDetection to perform feed-forward language ID. #310

hhuangMITRE · 2022-10-25T10:39:17Z

Issues:

Update Tika Text Detection to perform language detection on TEXT tracks from upstream openmpf#1606

hhuangMITRE · 2022-10-25T10:41:37Z

The following optimizations were transferred from the ongoing Tika OpenNLP Lang ID PR:

A static instance of the Optimaize language detector.
ISO Language Mapper.

Since this is a hotfix, the other optimizations (such as OpenNLP support) are still in #305
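
For readers unfamiliar with the ISO Language Mapper mentioned above, here is a rough sketch of the idea rather than the component's actual code. It converts a detector's two-letter code into the three-letter form (for example `zh` to `zho`) using the JDK's `Locale`. The class name `IsoLanguageMapperSketch`, the method `toIso3`, and the way a region or script suffix is carried over are all hypothetical.

```java
import java.util.Locale;
import java.util.MissingResourceException;

// Hypothetical illustration only; not the mapper used in TikaTextDetection.
final class IsoLanguageMapperSketch {
    private IsoLanguageMapperSketch() {}

    // Maps a code such as "zh" or "zh-TW" to a three-letter form such as "zho" or "zho-TW".
    static String toIso3(String detectorCode) {
        if (detectorCode == null || detectorCode.isBlank()) {
            return "";
        }
        String trimmed = detectorCode.trim();
        // Split off an optional suffix (assumed to carry region or script info).
        String[] parts = trimmed.split("[-_]", 2);
        String iso3;
        try {
            iso3 = Locale.forLanguageTag(parts[0]).getISO3Language();
        } catch (MissingResourceException e) {
            // The JDK has no three-letter code for this language; pass it through.
            return trimmed;
        }
        if (iso3.isEmpty()) {
            // Malformed language tag; pass the input through unchanged.
            return trimmed;
        }
        // Keeping the suffix is an assumption about how extra info might be preserved.
        return parts.length == 2 ? iso3 + "-" + parts[1] : iso3;
    }
}
```

Unknown or malformed codes are returned unchanged so that a detection is never silently dropped by the mapping step.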

jrobble

Reviewed 1 of 6 files at r2.
Reviewable status: 1 of 6 files reviewed, 9 unresolved discussions (waiting on @hhuangMITRE)

java/TikaTextDetection/README.md line 60 at r4 (raw file):

the `LANGUAGE_DETECTOR` option:

- `LANGUAGE_DETECTOR = opennlp`: [Apache Tika OpenNLP Language Detector](https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/opennlp/OpenNLPDetector.html)

Is this a better link to use?

https://opennlp.apache.org/

java/TikaTextDetection/README.md line 69 at r4 (raw file):

  Supports almost every language in Optimaize except Aragonese.

  **PLEASE NOTE**, if `opennlp` is selected, ensure that the `MIN_LANGUAGES` property is set to 1 or greater.

I don't see a property called MIN_LANGUAGES defined.

Why is this only important for OpenNLP?

java/TikaTextDetection/README.md line 75 at r4 (raw file):

  Third party language detection project that supports 71 languages.
  Predicts target language using N-gram frequency matching between input and language profiles.
  Supports almost every language present in Tika's Language Detector except Esperanto.

Referring to "Tika's Language Detector" is confusing because it's not an option we support for LANGUAGE_DETECTOR. I'm not saying we should support it, rather, you should first describe it as a separate (old) language detector (and link to it) before mentioning OpenNLP or Optimaize. Then you can refer to it later in the README like you currently do.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 41 at r4 (raw file):
Is the isReasonablyCertain() call useful for Optimaize? Just wondering if it's doing anything for us that MIN_LANGUAGE_CONFIDENCE filtering does not.

From the docs:

Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amount of input texts.

In your experience with small amounts of text, is this true? Maybe it's a wash if small amounts of text always score below MIN_LANGUAGE_CONFIDENCE anyway.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 48 at r4 (raw file):

        {
          "name": "MIN_LANGUAGE_CONFIDENCE",
          "description": "If set to a positive value, filter any `LANGUAGE` detections with a lower confidence score. ",

Say "filter out".

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 65 at r4 (raw file):

        },
        {
          "name": "FEED_FORWARD_PROP_TO_PROCESS",

Instead of mentioning translating, phrase the description instead to talk about which properties will be used to determine the text used for language detection.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 87 at r4 (raw file):

    },
    {
      "name": "TIKA TEXT DETECTION (WITH OPENNLP LANG ID) ACTION",

Spell out LANGUAGE here and in the task and pipeline.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 103 at r4 (raw file):

    {
      "name": "TIKA TEXT LANGUAGE ID (WITH FF REGION) ACTION",
      "description": "Executes the Tika text detection algorithm using the default parameters for language detection on feed forward `TEXT,TRANSCRIPT` results.",

Instead of "TEXT,TRANSCRIPT results" say "feed forward tracks" since the value of FEED_FORWARD_PROP_TO_PROCESS may change. Please make this change in other descriptions in this file too.
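
To make the FEED_FORWARD_PROP_TO_PROCESS discussion concrete, the following is a minimal sketch of one way a comma-separated property list could determine the text used for language detection. It assumes the first listed property with non-blank content wins, which this thread does not confirm. The class `FeedForwardTextSketch` and the method `selectText` are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; the component's real selection logic may differ.
final class FeedForwardTextSketch {
    private FeedForwardTextSketch() {}

    // propList: value of FEED_FORWARD_PROP_TO_PROCESS, e.g. "TEXT,TRANSCRIPT".
    // trackProperties: detection properties of the feed-forward track.
    static Optional<String> selectText(String propList, Map<String, String> trackProperties) {
        for (String name : propList.split(",")) {
            String value = trackProperties.get(name.trim());
            // Assumption: the first property with non-blank content is used.
            if (value != null && !value.isBlank()) {
                return Optional.of(value);
            }
        }
        // No listed property is present on the track.
        return Optional.empty();
    }
}
```

Because the property list is configurable, descriptions that say "feed forward tracks" stay accurate even when the list is changed from its `TEXT,TRANSCRIPT` value.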

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 200 at r4 (raw file):

      "description": "Performs Tika text detection.",
      "tasks": [
        "TIKA TEXT OPENNLP LANGUAGE ID TASK"

This task does not exist. Also, I think the description here is incorrect or missing something.

hhuangMITRE

Reviewable status: 0 of 6 files reviewed, 9 unresolved discussions (waiting on @hhuangMITRE and @jrobble)

java/TikaTextDetection/README.md line 60 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Is this a better link to use?

https://opennlp.apache.org/

That aligns better with the Optimaize language detector intro, so I updated the link.

There are some useful documentation notes on Apache Tika's implementation of OpenNLP/Optimaize, so I've included that info as a separate link for both Optimaize and OpenNLP at the end of each description.

java/TikaTextDetection/README.md line 69 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I don't see a property called MIN_LANGUAGES defined.

Why is this only important for OpenNLP?

Ah sorry, MIN_LANGUAGES will be included after the hotfix for multi-language identification tasks.

It's analogous to the FILTER_REASONABLE_LANGUAGES property in that it tries to force at least one prediction through, if the prediction exists, regardless of whether Tika considers it to be "reasonable".

That being said, perhaps it's better to drop the FILTER_REASONABLE_LANGUAGES property and replace it with MIN_LANGUAGES altogether.

After the next update to support multi-language detection, we won't really need FILTER_REASONABLE_LANGUAGES to get additional predictions; MIN_LANGUAGES = 1 would yield the same behavior as FILTER_REASONABLE_LANGUAGES = FALSE.

Since both are new properties, I could include something like the following in the hotfix's descriptor in place of FILTER_REASONABLE_LANGUAGES:

        {
          "name": "MIN_LANGUAGES",
          "description": "Current options are '0' and '1'. Set to '0' to ignore top language detection if it is not marked as reasonable by Tika. Set to '1' to allow top language detection even if it is labeled as unreasonable by Tika. Setting to '1' is recommended for OpenNLP as its language filtering is stricter.",
          "type": "INT",
          "defaultValue": "0"
        },

Maybe also include: "In the future, this property will enable additional language detections."
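
The proposed behavior can be sketched as follows. This is an illustration of the description above, not the component's code: `MinLanguagesSketch`, its nested `Detection` class, and `selectTop` are hypothetical names, and the `reasonablyCertain` flag stands in for whatever Tika reports for the top detection.

```java
import java.util.Optional;

// Hypothetical illustration of the proposed MIN_LANGUAGES semantics.
final class MinLanguagesSketch {
    private MinLanguagesSketch() {}

    // Stand-in for a detector result; not a Tika class.
    static final class Detection {
        final String language;
        final double confidence;
        final boolean reasonablyCertain;

        Detection(String language, double confidence, boolean reasonablyCertain) {
            this.language = language;
            this.confidence = confidence;
            this.reasonablyCertain = reasonablyCertain;
        }
    }

    // minLanguages = 0: keep the top detection only if it is marked reasonable.
    // minLanguages = 1: keep the top detection whenever one exists.
    static Optional<Detection> selectTop(Detection top, int minLanguages) {
        if (top == null) {
            return Optional.empty(); // nothing was detected, so nothing can be forced through
        }
        if (top.reasonablyCertain || minLanguages >= 1) {
            return Optional.of(top);
        }
        return Optional.empty();
    }
}
```

Under this reading, `minLanguages = 1` matches FILTER_REASONABLE_LANGUAGES = FALSE, and `minLanguages = 0` matches leaving that filter on.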

java/TikaTextDetection/README.md line 75 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Referring to "Tika's Language Detector" is confusing because it's not an option we support for LANGUAGE_DETECTOR. I'm not saying we should support it, rather, you should first describe it as a separate (old) language detector (and link to it) before mentioning OpenNLP or Optimaize. Then you can refer to it later in the README like you currently do.

Added the info for Tika's original language detector and a quick note on its current status as deprecated.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 41 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Is the isReasonablyCertain() call useful for Optimaize? Just wondering if it's doing anything for us that MIN_LANGUAGE_CONFIDENCE filtering does not.

From the docs:

Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amount of input texts.

In your experience with small amounts of text, is this true? Maybe it's a wash if small amounts of text always score below MIN_LANGUAGE_CONFIDENCE anyway.

Right now isReasonablyCertain() is useful to block out erroneous predictions by Optimaize. In the past, we did notice some odd predictions by Optimaize for short sequences without this enabled.

We can try smaller amounts of text with isReasonablyCertain() disabled. I will check on that later today; I suspect there may be an issue with short text still generating high-confidence predictions.

If there are no issues there (low text = low confidence), we can switch over to using MIN_LANGUAGE_CONFIDENCE and find a general estimate that works well after the language ID study is done.
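
For reference, a threshold-only filter along the lines of MIN_LANGUAGE_CONFIDENCE might look like the sketch below. It follows the descriptor wording (a positive value filters out detections with a lower confidence score). The class `ConfidenceFilterSketch`, its nested `Detection` class, and `filterByConfidence` are hypothetical and do not come from the component.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of confidence-based filtering.
final class ConfidenceFilterSketch {
    private ConfidenceFilterSketch() {}

    // Stand-in for a detector result; not a Tika class.
    static final class Detection {
        final String language;
        final double confidence;

        Detection(String language, double confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    // Returns the detections whose confidence is at least minConfidence.
    // A zero or negative minConfidence disables filtering.
    static List<Detection> filterByConfidence(List<Detection> detections, double minConfidence) {
        if (minConfidence <= 0) {
            return new ArrayList<>(detections);
        }
        List<Detection> kept = new ArrayList<>();
        for (Detection d : detections) {
            if (d.confidence >= minConfidence) {
                kept.add(d);
            }
        }
        return kept;
    }
}
```

Whether this alone can replace isReasonablyCertain() depends on the open question above: it only helps if short inputs actually receive low confidence scores.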

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 48 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Say "filter out".

Done.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 65 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Instead of mentioning translating, phrase the description instead to talk about which properties will be used to determine the text used for language detection.

Corrected phrasing for language detection task, thanks for catching that!

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 87 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Spell out LANGUAGE here and in the task and pipeline.

Done. This change has also been transferred over to the Tesla pipeline and will be retested later today.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 103 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Instead of "TEXT,TRANSCRIPT results" say "feed forward tracks" since the value of FEED_FORWARD_PROP_TO_PROCESS may change. Please make this change in other descriptions in this file too.

Updated descriptions.

java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 200 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This task does not exist. Also, I think the description here is incorrect or missing something.

Oh thanks for catching this, this was leftover from the pipeline tests. Removed.

Hotfix: Adding capability for TikaTextDetection to perform feed-forwa…

1c92b3e

…rd language ID.

hhuangMITRE added enhancement in progress labels Oct 25, 2022

hhuangMITRE requested a review from jrobble October 25, 2022 10:39

hhuangMITRE self-assigned this Oct 25, 2022

hhuangMITRE added this to To do in OpenMPF: Development via automation Oct 25, 2022

hhuangMITRE and others added 6 commits November 7, 2022 01:56

Add support for OpenNLP, additional README updates and code cleanup.

98a98be

Add support for OpenNLP, additional README updates and code cleanup.

44ce688

Update for Tika feed forward processing.

e7f8a47

Update descriptor file with FF region action. Update error check for …

ca75c05

…missing FF tracks.

Update ISO Mapper for ZH/ZHO outputs, update to include confidence sc…

98dc3d0

…ore and filtering option.

Adding OpenNLP language ID tasks.

deaf419

hhuangMITRE force-pushed the hotfix/tika-lang-feed-forward branch 2 times, most recently from d13fd04 to 98dc3d0 Compare December 1, 2022 04:47

hhuangMITRE added 2 commits December 1, 2022 01:55

Updates for ISO/Language code check.

64e44b4

Pipeline updates and filtering behavior correction.

8e70ffc

jrobble requested changes Dec 13, 2022

hhuangMITRE added 3 commits December 14, 2022 18:11

README and descriptor file update.

1fdda42

README update.

c741b25

Descriptor update.

368f5c9

hhuangMITRE commented Dec 14, 2022

hhuangMITRE added 4 commits December 20, 2022 00:37

Descriptor tooltip update.

dbb992e

Adjusting action/task name to align.

399a591

Merge remote-tracking branch 'origin' into hotfix/tika-lang-feed-forward

791a8f9

Updating to preserve Chinese script info.

7cf8441
