
Adding support for OpenNLP language detector and updating TikaTextDetection to output multiple language predictions. #305

Open · wants to merge 10 commits into develop from feature/tika-language-detection-opennlp
Conversation

@hhuangMITRE (Contributor) commented Sep 12, 2022

@hhuangMITRE changed the base branch from develop to master on September 12, 2022 15:15
@hhuangMITRE changed the base branch from master to develop on September 12, 2022 15:17
@hhuangMITRE force-pushed the feature/tika-language-detection-opennlp branch from d5a4575 to b3eecce on September 12, 2022 16:08
@jrobble added this to To do in OpenMPF: Development via automation on Sep 12, 2022
@jrobble left a comment (Member)

So far, I've only done a static code review.

Reviewed 2 of 6 files at r1, all commit messages.
Reviewable status: 2 of 6 files reviewed, 15 unresolved discussions (waiting on @hhuangMITRE)


java/TikaTextDetection/README.md line 43 at r1 (raw file):

  date and time information, "Page /", and "???".

# Language detection parameters

Capitalize every important word in the header.


java/TikaTextDetection/README.md line 61 at r1 (raw file):

Language results are stored as follows:
- `TEXT_LANGUAGE` : The primary detected language for a given text. Set to "Unknown" if no language is identified.

We need to ensure that we can use this output in Tika Text --> Translation pipelines. The custom translation component ( @cdglasz ) currently expects the language property to be DECODED_LANGUAGE or LANGUAGE. Also, the output is an ISO 639-3 code. For example:

"detectionProperties": {
   "DECODED_LANGUAGE": "SPA",
   "DECODED_LANGUAGE_CONFIDENCE": "0.9994221715043562",

Looking at the Tesseract component, I noticed that it outputs TEXT_LANGUAGE too. However, that is really a script (in all cases where OSD is successful and we're not defaulting to something - correct me if I'm wrong), for example, script/Cyrillic, so it's not suitable as a translation input. In a perfect world, I would rename that to TEXT_SCRIPT, but we can't do that right now or risk breaking compatibility with existing job parsers.

I think we should update the custom translation component to also accept TEXT_LANGUAGE as an input. If someone tries to create a Tesseract --> Translation pipeline using the custom translation component, that will cause an error saying the language (really a script) is not supported. I think that's fine.


java/TikaTextDetection/README.md line 65 at r1 (raw file):

Secondary languages and their confidence scores are listed as comma separated strings:
- `SECONDARY_TEXT_LANGUAGES` : A list of secondary languages (from greatest to least confidence) separated by ", " delimiters.

I just want to point out that our use of semi-colons for PAGE_NUM is inconsistent with our use of commas for this output and the associated confidences, but don't change it.

Our custom speech-to-text components also use commas for the detected languages and associated confidences. It's best to be consistent with that for this language detection capability. Also, I don't want to change PAGE_NUM to use commas now since it might break compatibility with existing job parsers.


java/TikaTextDetection/README.md line 135 at r1 (raw file):

- Optimaize Language Detector:
  - Every language in Tika's language detector, except Esperanto.
  - Afrikaans

Listing these in alphabetical order will help readability.


java/TikaTextDetection/README.md line 185 at r1 (raw file):

- OpenNLP Language Detector:
  - Every language in Tika and Optimaize Language Detectors except Aragonese. 
  - Bihari languages 

Listing these in alphabetical order will help readability.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 35 at r1 (raw file):

        },
        {
          "name": "LANG_DETECTOR",

Use `LANGUAGE_DETECTOR` for consistency with other parameters.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 36 at r1 (raw file):

        {
          "name": "LANG_DETECTOR",
          "description": "Specifies which Tika Language detector to use. Current options are `opennlp`, `optimaize` and `tika` (Note: `tika` is depreciated and does not work for short text). Defaults to `opennlp`.",

Use lowercase L in "Language detector".

Don't need to mention default in the description.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 140 at r1 (raw file):

        if (!supportedLangDetectors.contains(langOption))
        {
            log.warn(langOption + " language option not supported. Defaulting to `opennlp` language detector.");

Say "language detection option".


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 146 at r1 (raw file):

            identifier = new OptimaizeLangDetector();
            try {
                identifier.loadModels();

How long does this call take? Is it something we want to cache between job runs?

More generally, is the initialization time for any of the language detectors something that we want to do only once and only load a new detector if different from the last job run? Caching is not necessary if the overhead is trivial.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 154 at r1 (raw file):

        }
        else if(langOption == "tika") {
            //TODO: It appears that the legaacy tika detector has several issues procssing short sections of text.

Spelling: "procssing"


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 207 at r1 (raw file):

                if (text.length() >= charLimit) {
                    List<LanguageResult> langResultList;
                    langResultList = identifier.detectAll(text);

This line can be combined with the one above it.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 212 at r1 (raw file):

                    List<String> confidenceList = new ArrayList<String>();

                    maxLang = maxLang > langResultList.size() || maxLang < 1 ? langResultList.size(): maxLang;

Please define new variables for maxLang and minLang here so we don't overwrite what we read in as parameters. I've encountered situations where developers assume input parameters are never reset, which has led to bugs in the code.

Also, parentheses around the conditional part of the ternary operator (before the ?) will help readability.
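The local-copy and parenthesization suggestions above could look like this sketch (the `clampMax` helper and names are illustrative, not from the PR):

```java
import java.util.List;

public class LangCount {
    // Clamp the requested language count to what the detector actually returned.
    // A new local/return value is used rather than overwriting the maxLang job parameter.
    static int clampMax(int maxLang, int resultCount) {
        return (maxLang > resultCount || maxLang < 1) ? resultCount : maxLang;
    }

    public static void main(String[] args) {
        List<String> results = List.of("eng", "spa", "fra");
        int effectiveMax = clampMax(10, results.size()); // request exceeds results -> 3
        assert effectiveMax == 3;
        assert clampMax(2, results.size()) == 2;  // valid request is honored
        assert clampMax(0, results.size()) == 3;  // non-positive request -> all results
        System.out.println("effectiveMax = " + effectiveMax);
    }
}
```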


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 238 at r1 (raw file):

                    if(langList.size() < 1) {
                        genericDetectionProperties.put("TEXT_LANGUAGE", "Unknown");

Returning Unknown here could mess up the custom translator. We should discuss with @cdglasz . We may want the custom translator to ignore that rather than treat it as an actual language. I think some of the custom speech-to-text components may produce some form of unknown as the decoded language.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 241 at r1 (raw file):

                        genericDetectionProperties.put("TEXT_LANGUAGE_CONFIDENCE", "-1");
                        genericDetectionProperties.put("SECONDARY_TEXT_LANGUAGES", "Unknown");
                        genericDetectionProperties.put("SECONDARY_TEXT_LANGUAGE_CONFIDENCES", "NONE");

NONE seems odd since a list of numerical values is expected. The Tesseract component just omits the following detection properties if they're not suitable:

        if (secondary_scripts.size() > 0) {
            // ...
            detection_properties["OSD_SECONDARY_SCRIPTS"] = boost::algorithm::join(scripts, ", ");
            detection_properties["OSD_SECONDARY_SCRIPT_SCORES"] = boost::algorithm::join(scores, ", ");
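In Java, the same "omit when not suitable" pattern from the Tesseract snippet could look like the following sketch (the `buildProperties` helper is hypothetical; the property names are the ones under discussion):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SecondaryLangs {
    // Only add the secondary-language properties when there is something to report,
    // mirroring how the Tesseract component omits its OSD_SECONDARY_* properties.
    static Map<String, String> buildProperties(List<String> langs, List<String> confidences) {
        Map<String, String> props = new HashMap<>();
        if (!langs.isEmpty()) {
            props.put("SECONDARY_TEXT_LANGUAGES", String.join(", ", langs));
            props.put("SECONDARY_TEXT_LANGUAGE_CONFIDENCES", String.join(", ", confidences));
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> withResults =
                buildProperties(List.of("spa", "fra"), List.of("0.91", "0.07"));
        assert withResults.get("SECONDARY_TEXT_LANGUAGES").equals("spa, fra");

        // No secondary languages: the keys are absent entirely, not set to "NONE".
        Map<String, String> empty = buildProperties(new ArrayList<>(), new ArrayList<>());
        assert !empty.containsKey("SECONDARY_TEXT_LANGUAGES");
        System.out.println(withResults);
    }
}
```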

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 283 at r1 (raw file):

    // Map for translating from ISO 639-2 code to english description.
    private static Map<String,String> convertISO639() {

This method prepares a map for use in a later conversion (vs. performing a conversion itself at the moment it's called), so I prefer the old method name of initLangMap.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 358 at r1 (raw file):

        map.put("zh-tw", "Traditional Chinese");

        map.put("bih", "Bihari languages");

Putting these in alphabetical order will help readability.

@jrobble (Member) commented Sep 13, 2022

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 150 at r1 (raw file):

                String errorMsg = "Error loading Optimaize Language Model.";
                log.error(errorMsg, e);
                throw new MPFComponentDetectionError(MPFDetectionError.MPF_OTHER_DETECTION_ERROR_TYPE, errorMsg);

Please use MPF_DETECTION_NOT_INITIALIZED instead of MPF_OTHER_DETECTION_ERROR_TYPE when failing to load the models.

@hhuangMITRE left a comment (Contributor, Author)

Reviewable status: 2 of 6 files reviewed, 16 unresolved discussions (waiting on @cdglasz, @hhuangMITRE, and @jrobble)


java/TikaTextDetection/README.md line 43 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Capitalize every important word in the header.

Done.


java/TikaTextDetection/README.md line 61 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

We need to ensure that we can use this output in Tika Text --> Translation pipelines. The custom translation component ( @cdglasz ) currently expects the language property to be DECODED_LANGUAGE or LANGUAGE. Also, the output is an ISO 639-3 code. For example:

"detectionProperties": {
   "DECODED_LANGUAGE": "SPA",
   "DECODED_LANGUAGE_CONFIDENCE": "0.9994221715043562",

Looking at the Tesseract component, I noticed that it outputs TEXT_LANGUAGE too. However, that is really a script (in all cases where OSD is successful and we're not defaulting to something - correct me if I'm wrong), for example, script/Cyrillic, so it's not suitable as a translation input. In a perfect world, I would rename that to TEXT_SCRIPT, but we can't do that right now or risk breaking compatibility with existing job parsers.

I think we should update the custom translation component to also accept TEXT_LANGUAGE as an input. If someone tries to create a Tesseract --> Translation pipeline using the custom translation component, that will cause an error saying the language (really a script) is not supported. I think that's fine.

@jrobble Yes, the Tesseract component technically detects scripts (rather than exact language) using the OSD capability.

As a follow-up:

Discussed with Chris our options for a compatible text language field.

Currently, we've settled on ISO_LANGUAGE for Tika, with missing language detections set to UNKNOWN.

We're going to continue passing along TEXT_LANGUAGE from Tesseract, with the assumption that the component will sometimes output ISO-formatted languages (when specified by the user).


java/TikaTextDetection/README.md line 135 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Listing these in alphabetical order will help readability.

Re-sorted!


java/TikaTextDetection/README.md line 185 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Listing these in alphabetical order will help readability.

Done.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 35 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Use `LANGUAGE_DETECTOR` for consistency with other parameters.

Done.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 36 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Use lowercase L in "Language detector".

Don't need to mention default in the description.

Done.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 140 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Say "language detection option".

Done.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 146 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

How long does this call take? Is it something we want to cache between job runs?

More generally, is the initialization time for any of the language detectors something that we want to do only once and only load a new detector if different from the last job run? Caching is not necessary if the overhead is trivial.

I've done a quick timing test: around 2-3 seconds for opennlp and 0.5 seconds for optimaize.

When multiple job calls are run in succession, the initialization time drops below 1 millisecond.

Tika has reported that opennlp is initialized with a static call, and since we only need the detector to be initialized once, I've updated the component to initialize the language detector once when the component is initialized.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 150 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please use MPF_DETECTION_NOT_INITIALIZED instead of MPF_OTHER_DETECTION_ERROR_TYPE when failing to load the models.

Done. Relocated to component initialization.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 154 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Spelling: "procssing"

Whoops, thanks for catching that. I spotted another typo here and fixed that as well.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 207 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This line can be combined with the one above it.

Done.

Code quote:

langResultList = identifier.detectAll(text);

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 212 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please define new variables for maxLang and minLang here so we don't overwrite what we read in as parameters. I've encountered situations where developers assume input parameters are never reset, which has led to bugs in the code.

Also, parentheses around the conditional part of the ternary operator (before the ?) will help readability.

Thanks for the feedback. I agree that we'd want to avoid that confusion and it is good coding practice in general.
I've also updated the ternary operator code for clarity.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 238 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Returning Unknown here could mess up the custom translator. We should discuss with @cdglasz . We may want the custom translator to ignore that rather than treat it as an actual language. I think some of the custom speech-to-text components may produce some form of unknown as the decoded language.

Update: Synced with Chris and confirmed that we're using the UNKNOWN keyword for blank language detections. Set this component to return that in the ISO_LANGUAGE field.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 241 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

NONE seems odd since a list of numerical values is expected. The Tesseract component just omits the following detection properties if they're not suitable:

        if (secondary_scripts.size() > 0) {
            // ...
            detection_properties["OSD_SECONDARY_SCRIPTS"] = boost::algorithm::join(scripts, ", ");
            detection_properties["OSD_SECONDARY_SCRIPT_SCORES"] = boost::algorithm::join(scores, ", ");

I agree it's a bit strange. I can look into whether the language detectors offer more info. Right now only Optimaize appears to offer confidence results, but they've been sorted into the following labels by Tika:

"NONE, LOW, MEDIUM, HIGH"

https://tika.apache.org/1.21/api/org/apache/tika/language/detect/LanguageConfidence.html

If non-normalized confidence values are encountered, perhaps we should just leave this field out. I'll also check if OpenNLP has confidence parameters to be returned as well in a follow-up update.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 283 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This method prepares a map for use in a later conversion (vs. performing a conversion itself at the moment it's called), so I prefer the old method name of initLangMap.

Done. Also added the isoMap as well for mapping between ISO 639-1 and ISO 639-3.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 358 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Putting these in alphabetical order will help readability.

Done.

@jrobble left a comment (Member)

Reviewed 2 of 4 files at r2, all commit messages.
Reviewable status: 4 of 6 files reviewed, 7 unresolved discussions (waiting on @cdglasz and @hhuangMITRE)


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 35 at r2 (raw file):

        },
        {
          "name": "LANGUAGE_DETECTOR",

Should we just remove tika support if it's inferior and deprecated? As long as the component output for openNLP and optimaize is similar, we don't need to worry about breaking compatibility.

If we want to keep it, we need a unit test for it.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 143 at r1 (raw file):

        }
        LanguageDetector identifier;
        if (langOption == "optimaize") {

You need to use .equals() for string comparison in Java.
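For context, `==` compares object references in Java, so a string built at runtime can fail the check even when the contents match:

```java
public class StringCompare {
    public static void main(String[] args) {
        String langOption = new String("optimaize"); // runtime instance, not the interned literal
        // Reference comparison: false, even though the contents match.
        System.out.println(langOption == "optimaize");      // false
        // Content comparison: true.
        System.out.println(langOption.equals("optimaize")); // true
        assert !(langOption == "optimaize");
        assert langOption.equals("optimaize");
    }
}
```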


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 146 at r1 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

I've done a quick timing test, around 2-3 seconds for opennlp and 0.5 seconds for optimaize.

When multiple job calls are run in succession, the initialization time drops down below 1 millisecond.

Tika has reported that opennlp is initialized with a static call, and since we only need the detector to be initialized once, I've updated the component to handle the language detector calls once when the component is initialized.

Another overhead consideration is the amount of memory loading each of these takes. I ran a profiler. Note the highlighted line and the one below:

[screenshot: TikaProfile.PNG]

OpenNLP requires ~318 MiB
Optimaize requires ~263 MiB

That's a lot of memory for something that might not even be used, depending on the job properties. Please only load the required detector based on job properties, and switch to loading the other detector if the next job has a different value for that property.

We can't force Java to unload the other detector (easily), but if you have one LanguageDetector langDetector class variable in TikaTextDetectionComponent and overwrite it with the required detector, the previously loaded detector will go out of scope and be garbage collected eventually.
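A minimal sketch of the single-field swap pattern described above; `Detector` is a stand-in for Tika's LanguageDetector, and the lambda stands in for the expensive model load:

```java
public class DetectorCache {
    // Stand-in for Tika's LanguageDetector; the real component would hold an
    // org.apache.tika.language.detect.LanguageDetector in this field instead.
    interface Detector { String name(); }

    private Detector langDetector;   // single class variable, as suggested
    private String loadedOption = "";

    // Load (or reuse) the detector named by the job property. Overwriting the
    // field lets the previously loaded detector go out of scope so it can
    // eventually be garbage collected.
    Detector getDetector(String langOption) {
        if (!langOption.equals(loadedOption)) {
            langDetector = () -> langOption; // placeholder for the expensive load
            loadedOption = langOption;
        }
        return langDetector;
    }

    public static void main(String[] args) {
        DetectorCache cache = new DetectorCache();
        Detector first = cache.getDetector("opennlp");
        // Same option on the next job: the cached instance is reused.
        assert first == cache.getDetector("opennlp");
        // Different option: a new detector replaces the old one.
        assert first != cache.getDetector("optimaize");
        System.out.println(cache.getDetector("optimaize").name());
    }
}
```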


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 153 at r1 (raw file):

            }
        }
        else if(langOption == "tika") {

You need to use .equals() for string comparison in Java.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 241 at r1 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

I agree it's a bit strange. I can look into whether the language detectors offer more info. Right now only Optimaize appears to offer confidence results, but they've been sorted into the following labels by Tika:

"NONE, LOW, MEDIUM, HIGH"

https://tika.apache.org/1.21/api/org/apache/tika/language/detect/LanguageConfidence.html

If non-normalized confidence values are encountered, perhaps we should just leave this field out. I'll also check if OpenNLP has confidence parameters to be returned as well in a follow-up update.

The LanguageResult class has a getRawScore() method, which appears to have a valid value for openNLP and optimaize between 0 and 1 (looking at the decompiled .class files for how the confidence string is determined). I think it's worth returning this raw score as the detection confidence instead of the string value.

Another reason for this change is that it seems odd that when you set MIN_LANGUAGES > 0 the associated TEXT_LANGUAGE_CONFIDENCE can be NONE for a valid TEXT_LANGUAGE.


I got openNLP to return a LOW confidence (as opposed to NONE) when providing it with a lot of Chinese text (10,368 characters). Not saying that this has anything to do with Chinese specifically, rather, that's a lot of characters for a LOW confidence.

Internally, it uses this calculation:

    private static LanguageConfidence getConfidence(double confidence) {
        if (confidence > 0.9) {
            return LanguageConfidence.HIGH;
        } else if (confidence > 0.85) {
            return LanguageConfidence.MEDIUM;
        } else {
            return confidence > 0.2 ? LanguageConfidence.LOW : LanguageConfidence.NONE;
        }
    }

Note that LanguageResult.isReasonablyCertain() is defined as:

    public boolean isReasonablyCertain() {
        return this.confidence == LanguageConfidence.HIGH;
    }

What I'm saying is, I don't think we want to rely on using isReasonablyCertain() for openNLP. It's too selective. I don't know how many characters it would take to achieve a HIGH confidence. I assume a lot.

Not saying this issue requires a code change. Just saying that we always want to make sure MIN_LANGUAGES is 1 or greater when OpenNLP is selected. We should probably document that.


If we don't set a cap via MAX_REASONABLE_LANGUAGES, how many languages have you ever seen it return as reasonable? I wonder if we can simplify things and not cap this. Then we wouldn't need the property.
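To illustrate the raw-score suggestion, here is a sketch using a stand-in result type; in the real component, Tika's LanguageResult.getRawScore() would supply the scores:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RawScores {
    // Stand-in for org.apache.tika.language.detect.LanguageResult.
    record Result(String language, float rawScore) {}

    // Join languages (already sorted best-first) into the comma-separated
    // property string discussed above.
    static String languages(List<Result> results) {
        return results.stream().map(Result::language).collect(Collectors.joining(", "));
    }

    // Emit the raw numeric scores instead of NONE/LOW/MEDIUM/HIGH labels.
    static String scores(List<Result> results) {
        return results.stream()
                .map(r -> String.valueOf(r.rawScore()))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<Result> results =
                List.of(new Result("spa", 0.92f), new Result("por", 0.05f));
        assert languages(results).equals("spa, por");
        assert scores(results).startsWith("0.9");
        System.out.println(languages(results) + " | " + scores(results));
    }
}
```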


java/TikaTextDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaTextDetectionComponent.java line 159 at r2 (raw file):

        MPFGenericJob genericJob = new MPFGenericJob("TestGenericJob", mediaPath, jobProperties, mediaProperties);
        boolean debug = false;

Remove this debug line.

@jrobble left a comment (Member)

Reviewed 2 of 4 files at r2.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @cdglasz and @hhuangMITRE)


java/TikaTextDetection/README.md line 66 at r2 (raw file):

Secondary languages and their confidence scores are listed as comma separated strings:
- `SECONDARY_TEXT_LANGUAGES` : A list of secondary languages (from greatest to least confidence) separated by ", " delimiters.

Just omit SECONDARY_TEXT_LANGUAGES and SECONDARY_TEXT_LANGUAGE_CONFIDENCES if they are not used, as Tesseract does for OSD_SECONDARY_SCRIPTS and OSD_SECONDARY_SCRIPT_SCORES. I don't think I was clear before. What I mean by "omit" is "not include them in the detection property map at all".

Also, I would feel better setting TEXT_LANGUAGE_CONFIDENCE to -1 for Unknown rather than leaving it blank.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 303 at r2 (raw file):

    // Map for translating from ISO 639-2 code to english description.
    private static Map<String,String> initLangMap() {

Consider moving these mappings to a util class since they take up a lot of space.

@hhuangMITRE left a comment (Contributor, Author)

Reviewable status: 1 of 7 files reviewed, 9 unresolved discussions (waiting on @cdglasz, @hhuangMITRE, and @jrobble)


java/TikaTextDetection/README.md line 66 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Just omit SECONDARY_TEXT_LANGUAGES and SECONDARY_TEXT_LANGUAGE_CONFIDENCES if they are not used, as Tesseract does for OSD_SECONDARY_SCRIPTS and OSD_SECONDARY_SCRIPT_SCORES. I don't think I was clear before. What I mean by "omit" is "not include them in the detection property map at all".

Also, I would feel better setting TEXT_LANGUAGE_CONFIDENCE to -1 for Unknown rather than leaving it blank.

Done.


java/TikaTextDetection/plugin-files/descriptor/descriptor.json line 35 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Should we just remove tika support if it's inferior and deprecated? As long as the component output for openNLP and optimaize is similar, we don't need to worry about breaking compatibility.

If we want to keep it, we need a unit test for it.

I agree with removing it, documentation (https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/tika/TikaLanguageDetector.html) lists it as obsolete for the time being.

We can revisit tika if Apache decides to update it, but it seems that they've switched over to OpenNLP for the latest language/text processing capabilities.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 143 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

You need to use .equals() for string comparison in Java.

Thanks for the reminder, I've fixed this and the line below.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 146 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Another overhead consideration is the amount of memory loading each of these takes. I ran a profiler. Note the highlighted line and the one below:

[screenshot: TikaProfile.PNG]

OpenNLP requires ~318 MiB
Optimaize requires ~263 MiB

That's a lot of memory for something that might not even be used, depending on the job properties. Please only load the required detector based on job properties, and switch to loading the other detector if the next job has a different value for that property.

We can't force Java to unload the other detector (easily), but if you have one LanguageDetector langDetector class variable in TikaTextDetectionComponent and overwrite it with the required detector, the previously loaded detector will go out of scope and be garbage collected eventually.

Thanks for checking the memory usage. I agree with switching between detectors rather than caching both and have updated the code.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 153 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

You need to use .equals() for string comparison in Java.

Fixed, thanks!


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 241 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

The LanguageResult class has a getRawScore() method, which appears to have a valid value for openNLP and optimaize between 0 and 1 (looking at the decompiled .class files for how the confidence string is determined). I think it's worth returning this raw score as the detection confidence instead of the string value.

Another reason for this change is that it seems odd that when you set MIN_LANGUAGES > 0 the associated TEXT_LANGUAGE_CONFIDENCE can be NONE for a valid TEXT_LANGUAGE.


I got openNLP to return a LOW confidence (as opposed to NONE) when providing it with a lot of Chinese text (10,368 characters). Not saying that this has anything to do with Chinese specifically, rather, that's a lot of characters for a LOW confidence.

Internally, it uses this calculation:

    private static LanguageConfidence getConfidence(double confidence) {
        if (confidence > 0.9) {
            return LanguageConfidence.HIGH;
        } else if (confidence > 0.85) {
            return LanguageConfidence.MEDIUM;
        } else {
            return confidence > 0.2 ? LanguageConfidence.LOW : LanguageConfidence.NONE;
        }
    }

Note that LanguageResult.isReasonablyCertain() is defined as:

    public boolean isReasonablyCertain() {
        return this.confidence == LanguageConfidence.HIGH;
    }

What I'm saying is, I don't think we want to rely on using isReasonablyCertain() for openNLP. It's too selective. I don't know how many characters it would take to achieve a HIGH confidence. I assume a lot.

Not saying this issue requires a code change. Just saying that we always want to make sure MIN_LANGUAGES is 1 or greater when OpenNLP is selected. We should probably document that.


If we don't set a cap via MAX_REASONABLE_LANGUAGES, how many languages have you ever seen it return as reasonable? I wonder if we can simplify things and not cap this. Then we wouldn't need the property.

In most cases, I've only seen the top language marked as reasonable, and sometimes even that is not guaranteed if the text is short (though in several such cases, both Optimaize and OpenNLP still make the correct prediction).

I think Tika has set a high threshold in order to ensure that all reasonable predictions are as advertised.
As a result, I agree with leaving reasonably certain languages uncapped.

Also double-checked the getRawScore() method for both language detectors, and switched the language confidence properties to use it.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java line 303 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Consider moving these mappings to a util class since they take up a lot of space.

Done.


java/TikaTextDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaTextDetectionComponent.java line 159 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Remove this debug line.

Done.
