Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hotfix: Adding capability for TikaTextDetection to perform feed-forward language ID. #310

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

hhuangMITRE
Copy link
Contributor

@hhuangMITRE hhuangMITRE commented Oct 25, 2022

@hhuangMITRE
Copy link
Contributor Author

hhuangMITRE commented Oct 25, 2022

The following optimizations were transferred from the ongoing Tika OpenNLP Lang ID PR:

  • A static instance of the Optimaize language detector.
  • ISO Language Mapper.

Since this is a hotfix, the other optimizations (such as OpenNLP support) are still in #305

@hhuangMITRE hhuangMITRE force-pushed the hotfix/tika-lang-feed-forward branch 2 times, most recently from d13fd04 to 98dc3d0 Compare December 1, 2022 04:47
Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r2.
Reviewable status: 1 of 6 files reviewed, 9 unresolved discussions (waiting on @hhuangMITRE)


java/TikaTextDetection/README.md line 60 at r4 (raw file):

the `LANGUAGE_DETECTOR` option:

- `LANGUAGE_DETECTOR = opennlp`: [Apache Tika OpenNLP Language Detector](https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/opennlp/OpenNLPDetector.html)

Is this a better link to use?

https://opennlp.apache.org/


java/TikaTextDetection/README.md line 69 at r4 (raw file):

  Supports almost every language in Optimaize except Aragonese.

  **PLEASE NOTE**, if `opennlp` is selected, ensure that the `MIN_LANGUAGES` property is set to 1 or greater.

I don't see a property called MIN_LANGUAGES defined.

Why is this only important for OpenNLP?


java/TikaTextDetection/README.md line 75 at r4 (raw file):

  Third party language detection project that supports 71 languages.
  Predicts target language using N-gram frequency matching between input and language profiles.
  Supports almost every language present in Tika's Language Detector except Esperanto.

Referring to "Tika's Language Detector" is confusing because it's not an option we support for LANGUAGE_DETECTOR. I'm not saying we should support it, rather, you should first describe it as a separate (old) language detector (and link to it) before mentioning OpenNLP or Optimaize. Then you can refer to it later in the README like you currently do.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 41 at r4 (raw file):
Is theisReasonablyCertain() call useful for Optimaize? Just wondering if it's doing anything for us that MIN_LANGUAGE_CONFIDENCE filtering does not.

From the docs:

Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amount of input texts.

With your experience with small amount of text, is this true? Maybe it's a wash if small amount of text are always less than MIN_LANGUAGE_CONFIDENCE anyway.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 48 at r4 (raw file):

        {
          "name": "MIN_LANGUAGE_CONFIDENCE",
          "description": "If set to a positive value, filter any `LANGUAGE` detections with a lower confidence score. ",

Say "filter out".


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 65 at r4 (raw file):

        },
        {
          "name": "FEED_FORWARD_PROP_TO_PROCESS",

Instead of mentioning translating, phrase the description instead to talk about which properties will be used to determine the text used for language detection.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 87 at r4 (raw file):

    },
    {
      "name": "TIKA TEXT DETECTION (WITH OPENNLP LANG ID) ACTION",

Spell out LANGUAGE here and in the task and pipeline.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 103 at r4 (raw file):

    {
      "name": "TIKA TEXT LANGUAGE ID (WITH FF REGION) ACTION",
      "description": "Executes the Tika text detection algorithm using the default parameters for language detection on feed forward `TEXT,TRANSCRIPT` results.",

Instead of "TEXT,TRANSCRIPT results" say "feed forward tracks" since the value of FEED_FORWARD_PROP_TO_PROCESS may change. Please make this change in other descriptions in this file too.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 200 at r4 (raw file):

      "description": "Performs Tika text detection.",
      "tasks": [
        "TIKA TEXT OPENNLP LANGUAGE ID TASK"

This task does not exist. Also, I think the description here is incorrect or missing something.

Copy link
Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 9 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


java/TikaTextDetection/README.md line 60 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Is this a better link to use?

https://opennlp.apache.org/

That aligns better with the optimaize language detector intro so I updated the link.

There are some useful documentation notes on Apache Tika's implementation of OpenNLP/Optimaize, so I've included that info as a separate link for both Optimaize and OpenNLP at the end of each description.


java/TikaTextDetection/README.md line 69 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I don't see a property called MIN_LANGUAGES defined.

Why is this only important for OpenNLP?

Ah sorry, MIN_LANGUAGES will be included after the hotfix for multi-language identification tasks.

It's analogous to the FILTER_REASONABLE_LANGUAGES property in that it tries to force at least one prediction through, if the prediction exists, regardless of whether Tika considers it to be "reasonable".

That being said....perhaps it's better to get rid of FILTER_REASONABLE_LANGUAGES property and swap it with MIN_LANGUAGES altogether.

After the next update to support multi-language detection, we won't really need FILTER_REASONABLE_LANGUAGES to get additional predictions, MIN_LANGUAGES = 1 would yield the same behavior as FILTER_REASONABLE_LANGUAGES = FALSE.

Since both are new properties. I could include in the hotfixes descriptor something as follows in place of FILTER_REASONABLE_LANGUAGES:

        {
          "name": "MIN_LANGUAGES ",
          "description": "Current options are '0' and '1'. Set to '0' to ignore top language detection if it is not marked as reasonable by Tika. Set to '1' to allow top language detection even if it is labeled as unreasonable by Tika. Setting to '1' is recommended for OpenNLP as it's language filtering is stricter."
          "type": "INT",
          "defaultValue": "0"
        },

Maybe also include In the future, this property will enable additional language detections.


java/TikaTextDetection/README.md line 75 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Referring to "Tika's Language Detector" is confusing because it's not an option we support for LANGUAGE_DETECTOR. I'm not saying we should support it, rather, you should first describe it as a separate (old) language detector (and link to it) before mentioning OpenNLP or Optimaize. Then you can refer to it later in the README like you currently do.

Added the info for Tika's original language detection and a quick note on it's current status as deprecated.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 41 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Is theisReasonablyCertain() call useful for Optimaize? Just wondering if it's doing anything for us that MIN_LANGUAGE_CONFIDENCE filtering does not.

From the docs:

Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amount of input texts.

With your experience with small amount of text, is this true? Maybe it's a wash if small amount of text are always less than MIN_LANGUAGE_CONFIDENCE anyway.

Right now isReasonablyCertain() is useful to block out erroneous predictions by Optimaize. In the past, we did notice some odd predictions by Optimaize for short sequences without this enabled.

We can try smaller amounts of text with isReeasonablyCertain() disabled. I will check on that later today, I suspect there may be an issue with smaller text still generating high confidence predictions.

If there is no issues there (low text = low confidence), we can switch over to using MIN_LANGUAGE_CONFIDENCE and find a general estimate that works well after the language ID study is done.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 48 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Say "filter out".

Done.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 65 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Instead of mentioning translating, phrase the description instead to talk about which properties will be used to determine the text used for language detection.

Corrected phrasing for language detection task, thanks for catching that!


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 87 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Spell out LANGUAGE here and in the task and pipeline.

Done. This change has also been transferred over to the Tesla pipeline and will be retested later today.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 103 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Instead of "TEXT,TRANSCRIPT results" say "feed forward tracks" since the value of FEED_FORWARD_PROP_TO_PROCESS may change. Please make this change in other descriptions in this file too.

Updated descriptions.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 200 at r4 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This task does not exist. Also, I think the description here is incorrect or missing something.

Oh thanks for catching this, this was leftover from the pipeline tests. Removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants