Adding support for OpenNLP language detector and updating TikaTextDetection to output multiple language predictions. #305
base: develop
Conversation
d5a4575 to b3eecce
So far, I've only done a static code review.
Reviewed 2 of 6 files at r1, all commit messages.
Reviewable status: 2 of 6 files reviewed, 15 unresolved discussions (waiting on @hhuangMITRE)
java/TikaTextDetection/README.md
line 43 at r1 (raw file):
date and time information, "Page /", and "???". # Language detection parameters
Capitalize every important word in the header.
java/TikaTextDetection/README.md
line 61 at r1 (raw file):
Language results are stored as follows: - `TEXT_LANGUAGE` : The primary detected language for a given text. Set to "Unknown" if no language is identified.
We need to ensure that we can use this output in Tika Text --> Translation pipelines. The custom translation component ( @cdglasz ) currently expects the language property to be `DECODED_LANGUAGE` or `LANGUAGE`. Also, the output is an ISO 639-3 code. For example:
"detectionProperties": {
"DECODED_LANGUAGE": "SPA",
"DECODED_LANGUAGE_CONFIDENCE": "0.9994221715043562",
Looking at the Tesseract component, I noticed that it outputs `TEXT_LANGUAGE` too. However, that is really a script (in all cases where OSD is successful and we're not defaulting to something - correct me if I'm wrong), for example, `script/Cyrillic`, so it's not suitable as a translation input. In a perfect world, I would rename that to `TEXT_SCRIPT`, but we can't do that right now without risking breaking compatibility with existing job parsers.
I think we should update the custom translation component to also accept `TEXT_LANGUAGE` as an input. If someone tries to create a Tesseract --> Translation pipeline using the custom translation component, that will cause an error saying the language (really a script) is not supported. I think that's fine.
java/TikaTextDetection/README.md
line 65 at r1 (raw file):
Secondary languages and their confidence scores are listed as comma separated strings: - `SECONDARY_TEXT_LANGUAGES` : A list of secondary languages (from greatest to least confidence) separated by ", " delimiters.
I just want to point out that our use of semi-colons for `PAGE_NUM` is inconsistent with our use of commas for this output and the associated confidences, but don't change it.
Our custom speech-to-text components also use commas for the detected languages and associated confidences. It's best to be consistent with that for this language detection capability. Also, I don't want to change `PAGE_NUM` to use commas now since it might break compatibility with existing job parsers.
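For reference, the ", "-delimited form both component families use can be produced with `String.join`. This is a generic sketch, not the component's actual code, and the sample values are made up:

```java
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        // Hypothetical detection results, ordered from greatest to least confidence.
        List<String> languages = List.of("English", "Spanish", "French");
        List<String> confidences = List.of("0.91", "0.05", "0.02");

        // ", " delimiter, matching the speech-to-text components' convention.
        System.out.println(String.join(", ", languages));    // English, Spanish, French
        System.out.println(String.join(", ", confidences));  // 0.91, 0.05, 0.02
    }
}
```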
java/TikaTextDetection/README.md
line 135 at r1 (raw file):
- Optimaize Language Detector: - Every language in Tika's language detector, except Esperanto. - Afrikaans
Listing these in alphabetical order will help readability.
java/TikaTextDetection/README.md
line 185 at r1 (raw file):
- OpenNLP Language Detector: - Every language in Tika and Optimaize Language Detectors except Aragonese. - Bihari languages
Listing these in alphabetical order will help readability.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 35 at r1 (raw file):
}, { "name": "LANG_DETECTOR",
Use `LANGUAGE_DETECTOR` for consistency with other parameters.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 36 at r1 (raw file):
{ "name": "LANG_DETECTOR", "description": "Specifies which Tika Language detector to use. Current options are `opennlp`, `optimaize` and `tika` (Note: `tika` is depreciated and does not work for short text). Defaults to `opennlp`.",
Use lowercase L in "Language detector".
Don't need to mention default in the description.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 140 at r1 (raw file):
if (!supportedLangDetectors.contains(langOption)) { log.warn(langOption + " language option not supported. Defaulting to `opennlp` language detector.");
Say "language detection option".
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 146 at r1 (raw file):
identifier = new OptimaizeLangDetector(); try { identifier.loadModels();
How long does this call take? Is it something we want to cache between job runs?
More generally, is the initialization time for any of the language detectors something that we want to do only once and only load a new detector if different from the last job run? Caching is not necessary if the overhead is trivial.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 154 at r1 (raw file):
} else if(langOption == "tika") { //TODO: It appears that the legaacy tika detector has several issues procssing short sections of text.
Spelling: "procssing"
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 207 at r1 (raw file):
if (text.length() >= charLimit) { List<LanguageResult> langResultList; langResultList = identifier.detectAll(text);
This line can be combined with the one above it.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 212 at r1 (raw file):
List<String> confidenceList = new ArrayList<String>(); maxLang = maxLang > langResultList.size() || maxLang < 1 ? langResultList.size(): maxLang;
Please define new variables for `maxLang` and `minLang` here so we don't overwrite what we read in as parameters. I've encountered situations where developers assume input parameters are never reset, which has led to bugs in the code.
Also, parentheses around the conditional part of the ternary operator (before the `?`) will help readability.
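A minimal sketch of both suggestions together, using illustrative names rather than the component's actual code:

```java
public class ClampDemo {
    // Compute a new local from the maxLang parameter instead of reassigning
    // it, with the ternary's condition parenthesized for readability.
    static int effectiveMax(int maxLang, int resultCount) {
        return (maxLang > resultCount || maxLang < 1) ? resultCount : maxLang;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMax(3, 5)); // 3: valid request, kept
        System.out.println(effectiveMax(9, 5)); // 5: capped at result count
        System.out.println(effectiveMax(0, 5)); // 5: invalid request, use all
    }
}
```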
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 238 at r1 (raw file):
if(langList.size() < 1) { genericDetectionProperties.put("TEXT_LANGUAGE", "Unknown");
Returning `Unknown` here could mess up the custom translator. We should discuss with @cdglasz . We may want the custom translator to ignore that rather than treat it as an actual language. I think some of the custom speech-to-text components may produce some form of unknown as the decoded language.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 241 at r1 (raw file):
genericDetectionProperties.put("TEXT_LANGUAGE_CONFIDENCE", "-1"); genericDetectionProperties.put("SECONDARY_TEXT_LANGUAGES", "Unknown"); genericDetectionProperties.put("SECONDARY_TEXT_LANGUAGE_CONFIDENCES", "NONE");
`NONE` seems odd since a list of numerical values is expected. The Tesseract component just omits the following detection properties if they're not suitable:
if (secondary_scripts.size() > 0) {
// ...
detection_properties["OSD_SECONDARY_SCRIPTS"] = boost::algorithm::join(scripts, ", ");
detection_properties["OSD_SECONDARY_SCRIPT_SCORES"] = boost::algorithm::join(scores, ", ");
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 283 at r1 (raw file):
// Map for translating from ISO 639-2 code to english description. private static Map<String,String> convertISO639() {
This method prepares a map for use in a later conversion (vs. performing a conversion itself at the moment it's called), so I prefer the old method name of `initLangMap`.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 358 at r1 (raw file):
map.put("zh-tw", "Traditional Chinese"); map.put("bih", "Bihari languages");
Putting these in alphabetical order will help readability.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 150 at r1 (raw file):
Please use `MPF_DETECTION_NOT_INITIALIZED` instead of `MPF_OTHER_DETECTION_ERROR_TYPE` when failing to load the models.
Reviewable status: 2 of 6 files reviewed, 16 unresolved discussions (waiting on @cdglasz, @hhuangMITRE, and @jrobble)
java/TikaTextDetection/README.md
line 43 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Capitalize every important word in the header.
Done.
java/TikaTextDetection/README.md
line 61 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
We need to ensure that we can use this output in Tika Text --> Translation pipelines. The custom translation component ( @cdglasz ) currently expects the language property to be `DECODED_LANGUAGE` or `LANGUAGE`. Also, the output is an ISO 639-3 code. For example: `"detectionProperties": { "DECODED_LANGUAGE": "SPA", "DECODED_LANGUAGE_CONFIDENCE": "0.9994221715043562",`
Looking at the Tesseract component, I noticed that it outputs `TEXT_LANGUAGE` too. However, that is really a script (in all cases where OSD is successful and we're not defaulting to something - correct me if I'm wrong), for example, `script/Cyrillic`, so it's not suitable as a translation input. In a perfect world, I would rename that to `TEXT_SCRIPT`, but we can't do that right now without risking breaking compatibility with existing job parsers.
I think we should update the custom translation component to also accept `TEXT_LANGUAGE` as an input. If someone tries to create a Tesseract --> Translation pipeline using the custom translation component, that will cause an error saying the language (really a script) is not supported. I think that's fine.
@jrobble Yes, the Tesseract component technically detects scripts (rather than exact language) using the OSD capability.
As a follow up:
Discussed with Chris our options for a compatible text language field. Currently, we've settled on `ISO_LANGUAGE` for Tika, with missing language detections set to `UNKNOWN`.
We're going to continue passing over `TEXT_LANGUAGE` from Tesseract, with the assumption that the component will sometimes output ISO formatted languages (when specified by user).
java/TikaTextDetection/README.md
line 135 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Listing these in alphabetical order will help readability.
Resorted!
java/TikaTextDetection/README.md
line 185 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Listing these in alphabetical order will help readability.
Done.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 35 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Use `LANGUAGE_DETECTOR` for consistency with other parameters.
Done.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 36 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Use lowercase L in "Language detector".
Don't need to mention default in the description.
Done.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 140 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Say "language detection option".
Done.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 146 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
How long does this call take? Is it something we want to cache between job runs?
More generally, is the initialization time for any of the language detectors something that we want to do only once and only load a new detector if different from the last job run? Caching is not necessary if the overhead is trivial.
I've done a quick timing test, around 2-3 seconds for opennlp and 0.5 seconds for optimaize.
When multiple job calls are run in succession, the initialization time drops down below 1 millisecond.
Tika has reported that opennlp is initialized with a static call, and since we only need the detector to be initialized once, I've updated the component to handle the language detector calls once when the component is initialized.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 150 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Please use
MPF_DETECTION_NOT_INITIALIZED
instead ofMPF_OTHER_DETECTION_ERROR_TYPE
when failing to load the models.
Done. Relocated to component initialization.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 154 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Spelling: "procssing"
Whoops, thanks for catching that. I spotted another typo here and fixed that as well.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 207 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This line can be combined with the one above it.
Done.
Code quote:
langResultList = identifier.detectAll(text);
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 212 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Please define new variables for `maxLang` and `minLang` here so we don't overwrite what we read in as parameters. I've encountered situations where developers assume input parameters are never reset, which has led to bugs in the code. Also, parentheses around the conditional part of the ternary operator (before the `?`) will help readability.
Thanks for the feedback. I agree that we'd want to avoid that confusion and it is good coding practice in general.
I've also updated the ternary operator code for clarity.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 238 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Returning `Unknown` here could mess up the custom translator. We should discuss with @cdglasz . We may want the custom translator to ignore that rather than treat it as an actual language. I think some of the custom speech-to-text components may produce some form of unknown as the decoded language.
Update: Synced with Chris and identified that we're using the `UNKNOWN` keyword for blank language detections. Set this component to return that in the `ISO_LANGUAGE` field.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 241 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
`NONE` seems odd since a list of numerical values is expected. The Tesseract component just omits the following detection properties if they're not suitable:
if (secondary_scripts.size() > 0) {
// ...
detection_properties["OSD_SECONDARY_SCRIPTS"] = boost::algorithm::join(scripts, ", ");
detection_properties["OSD_SECONDARY_SCRIPT_SCORES"] = boost::algorithm::join(scores, ", ");
I agree it's a bit strange; I can look into seeing if the language detectors offer more info. Right now only Optimaize appears to be offering confidence results, but they've been sorted into the following labels by Tika:
"NONE, LOW, MEDIUM, HIGH"
https://tika.apache.org/1.21/api/org/apache/tika/language/detect/LanguageConfidence.html
If non-normalized confidence values are encountered, perhaps we should just leave this field out. I'll also check if OpenNLP has confidence parameters to be returned as well in a follow-up update.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 283 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This method prepares a map for use in a later conversion (vs. performing a conversion itself at the moment it's called), so I prefer the old method name of `initLangMap`.
Done. Also added `isoMap` for mapping between ISO 639-1 and ISO 639-3.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 358 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Putting these in alphabetical order will help readability.
Done.
Reviewed 2 of 4 files at r2, all commit messages.
Reviewable status: 4 of 6 files reviewed, 7 unresolved discussions (waiting on @cdglasz and @hhuangMITRE)
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 35 at r2 (raw file):
}, { "name": "LANGUAGE_DETECTOR",
Should we just remove `tika` support if it's inferior and deprecated? As long as the component output for openNLP and optimaize is similar, we don't need to worry about breaking compatibility.
If we want to keep it, we need a unit test for it.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 143 at r1 (raw file):
} LanguageDetector identifier; if (langOption == "optimaize") {
You need to use `.equals()` for string comparison in Java.
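A quick illustration of the pitfall, in generic Java rather than the component's code:

```java
public class StringCompareDemo {
    public static void main(String[] args) {
        // Simulates a property value constructed at runtime (not interned
        // to the same object as the literal below).
        String langOption = new String("optimaize");

        // "==" compares object references, so this prints false:
        System.out.println(langOption == "optimaize");

        // ".equals()" compares character content, which is what we want:
        System.out.println(langOption.equals("optimaize"));
    }
}
```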
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 146 at r1 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
I've done a quick timing test, around 2-3 seconds for opennlp and 0.5 seconds for optimaize.
When multiple job calls are run in succession, the initialization time drops down below 1 millisecond.
Tika has reported that opennlp is initialized with a static call, and since we only need the detector to be initialized once, I've updated the component to handle the language detector calls once when the component is initialized.
Another overhead consideration is the amount of memory loading each of these takes. I ran a profiler:
OpenNLP requires ~318 MiB
Optimaize requires ~263 MiB
That's a lot of memory for something that might not even be used, depending on the job properties. Please only load the required detector based on job properties, and switch to loading the other detector if the next job has a different value for that property.
We can't force Java to unload the other detector (easily), but if you have one `LanguageDetector langDetector` class variable in `TikaTextDetectionComponent` and overwrite it with the required detector, the previously loaded detector will go out of scope and be garbage collected eventually.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 153 at r1 (raw file):
} } else if(langOption == "tika") {
You need to use `.equals()` for string comparison in Java.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 241 at r1 (raw file):
Previously, hhuangMITRE (Howard W Huang) wrote…
I agree it's a bit strange, I can look into seeing if the language detectors offer more info. Right now only Optimiaze appears to be offering confidence results, but they've been sorted into the following labels by Tika:
"NONE, LOW, MEDIUM, HIGH"
https://tika.apache.org/1.21/api/org/apache/tika/language/detect/LanguageConfidence.html
If non-normalized confidence values are encountered, perhaps we should just leave this field out. I'll also check if OpenNLP has confidence parameters to be returned as well in a follow-up update.
The `LanguageResult` class has a `getRawScore()` method, which appears to have a valid value for openNLP and optimaize between 0 and 1 (looking at the decompiled .class files for how the confidence string is determined). I think it's worth returning this raw score as the detection confidence instead of the string value.
Another reason for this change is that it seems odd that when you set `MIN_LANGUAGES > 0` the associated `TEXT_LANGUAGE_CONFIDENCE` can be `NONE` for a valid `TEXT_LANGUAGE`.
I got openNLP to return a `LOW` confidence (as opposed to `NONE`) when providing it with a lot of Chinese text (10,368 characters). Not saying that this has anything to do with Chinese specifically, rather, that's a lot of characters for a `LOW` confidence.
Internally, it uses this calculation:
private static LanguageConfidence getConfidence(double confidence) {
if (confidence > 0.9) {
return LanguageConfidence.HIGH;
} else if (confidence > 0.85) {
return LanguageConfidence.MEDIUM;
} else {
return confidence > 0.2 ? LanguageConfidence.LOW : LanguageConfidence.NONE;
}
}
Note that `LanguageResult.isReasonablyCertain()` is defined as:
public boolean isReasonablyCertain() {
return this.confidence == LanguageConfidence.HIGH;
}
What I'm saying is, I don't think we want to rely on using `isReasonablyCertain()` for openNLP. It's too selective. I don't know how many characters it would take to achieve a `HIGH` confidence. I assume a lot.
Not saying this issue requires a code change. Just saying that we always want to make sure `MIN_LANGUAGES` is 1 or greater when OpenNLP is selected. We should probably document that.
If we don't set a cap via `MAX_REASONABLE_LANGUAGES`, how many languages have you ever seen it return as reasonable? I wonder if we can simplify things and not cap this. Then we wouldn't need the property.
java/TikaTextDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaTextDetectionComponent.java
line 159 at r2 (raw file):
MPFGenericJob genericJob = new MPFGenericJob("TestGenericJob", mediaPath, jobProperties, mediaProperties); boolean debug = false;
Remove this debug line.
Reviewed 2 of 4 files at r2.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @cdglasz and @hhuangMITRE)
java/TikaTextDetection/README.md
line 66 at r2 (raw file):
Secondary languages and their confidence scores are listed as comma separated strings: - `SECONDARY_TEXT_LANGUAGES` : A list of secondary languages (from greatest to least confidence) separated by ", " delimiters.
Just omit `SECONDARY_TEXT_LANGUAGES` and `SECONDARY_TEXT_LANGUAGE_CONFIDENCES` if they are not used, as Tesseract does for `OSD_SECONDARY_SCRIPTS` and `OSD_SECONDARY_SCRIPT_SCORES`. I don't think I was clear before. What I mean by "omit" is "not include them in the detection property map at all".
Also, I would feel better setting `TEXT_LANGUAGE_CONFIDENCE` to -1 for `Unknown` rather than leaving it blank.
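That convention could look roughly like this; the key names follow the discussion, but the method and its signature are illustrative, not the component's actual code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OmitDemo {
    // Populate SECONDARY_* properties only when results exist; otherwise the
    // keys are absent from the map entirely rather than set to "NONE".
    static Map<String, String> buildProps(String primary, String confidence,
                                          List<String> secondary,
                                          List<String> secondaryConfidences) {
        Map<String, String> props = new HashMap<>();
        if (primary == null) {
            props.put("TEXT_LANGUAGE", "Unknown");
            props.put("TEXT_LANGUAGE_CONFIDENCE", "-1");
        } else {
            props.put("TEXT_LANGUAGE", primary);
            props.put("TEXT_LANGUAGE_CONFIDENCE", confidence);
        }
        if (!secondary.isEmpty()) {
            props.put("SECONDARY_TEXT_LANGUAGES", String.join(", ", secondary));
            props.put("SECONDARY_TEXT_LANGUAGE_CONFIDENCES",
                      String.join(", ", secondaryConfidences));
        }
        return props;
    }
}
```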
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 303 at r2 (raw file):
// Map for translating from ISO 639-2 code to english description. private static Map<String,String> initLangMap() {
Consider moving these mappings to a util class since they take up a lot of space.
Reviewable status: 1 of 7 files reviewed, 9 unresolved discussions (waiting on @cdglasz, @hhuangMITRE, and @jrobble)
java/TikaTextDetection/README.md
line 66 at r2 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Just omit `SECONDARY_TEXT_LANGUAGES` and `SECONDARY_TEXT_LANGUAGE_CONFIDENCES` if they are not used, as Tesseract does for `OSD_SECONDARY_SCRIPTS` and `OSD_SECONDARY_SCRIPT_SCORES`. I don't think I was clear before. What I mean by "omit" is "not include them in the detection property map at all".
Also, I would feel better setting `TEXT_LANGUAGE_CONFIDENCE` to -1 for `Unknown` rather than leaving it blank.
Done.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 35 at r2 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Should we just remove `tika` support if it's inferior and deprecated? As long as the component output for openNLP and optimaize is similar, we don't need to worry about breaking compatibility. If we want to keep it, we need a unit test for it.
I agree with removing it, documentation (https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/tika/TikaLanguageDetector.html) lists it as obsolete for the time being.
We can revisit `tika` if Apache decides to update it, but it seems that they've switched over to OpenNLP for the latest language/text processing capabilities.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 143 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
You need to use `.equals()` for string comparison in Java.
Thanks for the reminder, I've fixed this and the line below.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 146 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Another overhead consideration is the amount of memory loading each of these takes. I ran a profiler:
OpenNLP requires ~318 MiB
Optimaize requires ~263 MiB
That's a lot of memory for something that might not even be used, depending on the job properties. Please only load the required detector based on job properties, and switch to loading the other detector if the next job has a different value for that property.
We can't force Java to unload the other detector (easily), but if you have one `LanguageDetector langDetector` class variable in `TikaTextDetectionComponent` and overwrite it with the required detector, the previously loaded detector will go out of scope and be garbage collected eventually.
Thanks for checking the memory usage. I agree with switching between detectors rather than caching both and have updated the code.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 153 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
You need to use `.equals()` for string comparison in Java.
Fixed, thanks!
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 241 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
The `LanguageResult` class has a `getRawScore()` method, which appears to have a valid value for openNLP and optimaize between 0 and 1 (looking at the decompiled .class files for how the confidence string is determined). I think it's worth returning this raw score as the detection confidence instead of the string value.
Another reason for this change is that it seems odd that when you set `MIN_LANGUAGES > 0` the associated `TEXT_LANGUAGE_CONFIDENCE` can be `NONE` for a valid `TEXT_LANGUAGE`.
I got openNLP to return a `LOW` confidence (as opposed to `NONE`) when providing it with a lot of Chinese text (10,368 characters). Not saying that this has anything to do with Chinese specifically, rather, that's a lot of characters for a `LOW` confidence.
Internally, it uses this calculation:
private static LanguageConfidence getConfidence(double confidence) {
    if (confidence > 0.9) {
        return LanguageConfidence.HIGH;
    } else if (confidence > 0.85) {
        return LanguageConfidence.MEDIUM;
    } else {
        return confidence > 0.2 ? LanguageConfidence.LOW : LanguageConfidence.NONE;
    }
}
Note that `LanguageResult.isReasonablyCertain()` is defined as:
public boolean isReasonablyCertain() {
    return this.confidence == LanguageConfidence.HIGH;
}
What I'm saying is, I don't think we want to rely on using `isReasonablyCertain()` for openNLP. It's too selective. I don't know how many characters it would take to achieve a `HIGH` confidence. I assume a lot.
Not saying this issue requires a code change. Just saying that we always want to make sure `MIN_LANGUAGES` is 1 or greater when OpenNLP is selected. We should probably document that.
If we don't set a cap via `MAX_REASONABLE_LANGUAGES`, how many languages have you ever seen it return as reasonable? I wonder if we can simplify things and not cap this. Then we wouldn't need the property.
For most cases, I've also seen only the top language marked as reasonable, and sometimes even that is not guaranteed if the text is short (in several cases, both Optimaize and OpenNLP still have the correct prediction).
I think Tika has set a high threshold in order to ensure that all reasonable predictions are as advertised. As a result, I agree with leaving reasonably certain languages uncapped.
Also, I double checked the `getRawScore()` method for both language detectors, and switched the language confidence properties to this option.
java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TikaTextDetectionComponent.java
line 303 at r2 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Consider moving these mappings to a util class since they take up a lot of space.
Done.
java/TikaTextDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaTextDetectionComponent.java
line 159 at r2 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Remove this debug line.
Done.