-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hotfix: Adding capability for TikaTextDetection to perform feed-forward language ID. #310
base: master
Are you sure you want to change the base?
Conversation
The following optimizations were transferred from the ongoing Tika OpenNLP Lang ID PR:
Since this is a hotfix, the other optimizations (such as OpenNLP support) are still in #305 |
…missing FF tracks.
…ore and filtering option.
d13fd04
to
98dc3d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r2.
Reviewable status: 1 of 6 files reviewed, 9 unresolved discussions (waiting on @hhuangMITRE)
java/TikaTextDetection/README.md
line 60 at r4 (raw file):
the `LANGUAGE_DETECTOR` option: - `LANGUAGE_DETECTOR = opennlp`: [Apache Tika OpenNLP Language Detector](https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/opennlp/OpenNLPDetector.html)
Is this a better link to use?
java/TikaTextDetection/README.md
line 69 at r4 (raw file):
Supports almost every language in Optimaize except Aragonese. **PLEASE NOTE**, if `opennlp` is selected, ensure that the `MIN_LANGUAGES` property is set to 1 or greater.
I don't see a property called MIN_LANGUAGES
defined.
Why is this only important for OpenNLP?
java/TikaTextDetection/README.md
line 75 at r4 (raw file):
Third party language detection project that supports 71 languages. Predicts target language using N-gram frequency matching between input and language profiles. Supports almost every language present in Tika's Language Detector except Esperanto.
Referring to "Tika's Language Detector" is confusing because it's not an option we support for LANGUAGE_DETECTOR
. I'm not saying we should support it, rather, you should first describe it as a separate (old) language detector (and link to it) before mentioning OpenNLP or Optimaize. Then you can refer to it later in the README like you currently do.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 41 at r4 (raw file):
Is theisReasonablyCertain()
call useful for Optimaize? Just wondering if it's doing anything for us that MIN_LANGUAGE_CONFIDENCE
filtering does not.
From the docs:
Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amount of input texts.
With your experience with small amount of text, is this true? Maybe it's a wash if small amount of text are always less than MIN_LANGUAGE_CONFIDENCE
anyway.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 48 at r4 (raw file):
{ "name": "MIN_LANGUAGE_CONFIDENCE", "description": "If set to a positive value, filter any `LANGUAGE` detections with a lower confidence score. ",
Say "filter out".
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 65 at r4 (raw file):
}, { "name": "FEED_FORWARD_PROP_TO_PROCESS",
Instead of mentioning translating, phrase the description instead to talk about which properties will be used to determine the text used for language detection.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 87 at r4 (raw file):
}, { "name": "TIKA TEXT DETECTION (WITH OPENNLP LANG ID) ACTION",
Spell out LANGUAGE here and in the task and pipeline.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 103 at r4 (raw file):
{ "name": "TIKA TEXT LANGUAGE ID (WITH FF REGION) ACTION", "description": "Executes the Tika text detection algorithm using the default parameters for language detection on feed forward `TEXT,TRANSCRIPT` results.",
Instead of "TEXT,TRANSCRIPT
results" say "feed forward tracks" since the value of FEED_FORWARD_PROP_TO_PROCESS
may change. Please make this change in other descriptions in this file too.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 200 at r4 (raw file):
"description": "Performs Tika text detection.", "tasks": [ "TIKA TEXT OPENNLP LANGUAGE ID TASK"
This task does not exist. Also, I think the description here is incorrect or missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 9 unresolved discussions (waiting on @hhuangMITRE and @jrobble)
java/TikaTextDetection/README.md
line 60 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Is this a better link to use?
That aligns better with the optimaize language detector intro so I updated the link.
There are some useful documentation notes on Apache Tika's implementation of OpenNLP/Optimaize, so I've included that info as a separate link for both Optimaize and OpenNLP at the end of each description.
java/TikaTextDetection/README.md
line 69 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
I don't see a property called
MIN_LANGUAGES
defined.Why is this only important for OpenNLP?
Ah sorry, MIN_LANGUAGES
will be included after the hotfix for multi-language identification tasks.
It's analogous to the FILTER_REASONABLE_LANGUAGES
property in that it tries to force at least one prediction through, if the prediction exists, regardless of whether Tika considers it to be "reasonable".
That being said....perhaps it's better to get rid of FILTER_REASONABLE_LANGUAGES
property and swap it with MIN_LANGUAGES
altogether.
After the next update to support multi-language detection, we won't really need FILTER_REASONABLE_LANGUAGES
to get additional predictions, MIN_LANGUAGES = 1
would yield the same behavior as FILTER_REASONABLE_LANGUAGES = FALSE
.
Since both are new properties. I could include in the hotfixes descriptor something as follows in place of FILTER_REASONABLE_LANGUAGES
:
{
"name": "MIN_LANGUAGES ",
"description": "Current options are '0' and '1'. Set to '0' to ignore top language detection if it is not marked as reasonable by Tika. Set to '1' to allow top language detection even if it is labeled as unreasonable by Tika. Setting to '1' is recommended for OpenNLP as it's language filtering is stricter."
"type": "INT",
"defaultValue": "0"
},
Maybe also include In the future, this property will enable additional language detections.
java/TikaTextDetection/README.md
line 75 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Referring to "Tika's Language Detector" is confusing because it's not an option we support for
LANGUAGE_DETECTOR
. I'm not saying we should support it, rather, you should first describe it as a separate (old) language detector (and link to it) before mentioning OpenNLP or Optimaize. Then you can refer to it later in the README like you currently do.
Added the info for Tika's original language detection and a quick note on it's current status as deprecated.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 41 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Is the
isReasonablyCertain()
call useful for Optimaize? Just wondering if it's doing anything for us thatMIN_LANGUAGE_CONFIDENCE
filtering does not.From the docs:
Tries to judge whether the identification is certain enough to be trusted. WARNING: Will never return true for small amount of input texts.
With your experience with small amount of text, is this true? Maybe it's a wash if small amount of text are always less than
MIN_LANGUAGE_CONFIDENCE
anyway.
Right now isReasonablyCertain()
is useful to block out erroneous predictions by Optimaize. In the past, we did notice some odd predictions by Optimaize for short sequences without this enabled.
We can try smaller amounts of text with isReeasonablyCertain()
disabled. I will check on that later today, I suspect there may be an issue with smaller text still generating high confidence predictions.
If there is no issues there (low text = low confidence), we can switch over to using MIN_LANGUAGE_CONFIDENCE
and find a general estimate that works well after the language ID study is done.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 48 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Say "filter out".
Done.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 65 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Instead of mentioning translating, phrase the description instead to talk about which properties will be used to determine the text used for language detection.
Corrected phrasing for language detection task, thanks for catching that!
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 87 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Spell out LANGUAGE here and in the task and pipeline.
Done. This change has also been transferred over to the Tesla pipeline and will be retested later today.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 103 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Instead of "
TEXT,TRANSCRIPT
results" say "feed forward tracks" since the value ofFEED_FORWARD_PROP_TO_PROCESS
may change. Please make this change in other descriptions in this file too.
Updated descriptions.
java/TikaTextDetection/plugin-files/descriptor/descriptor.json
line 200 at r4 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This task does not exist. Also, I think the description here is incorrect or missing something.
Oh thanks for catching this, this was leftover from the pipeline tests. Removed.
Issues:
This change is![Reviewable](https://camo.githubusercontent.com/23b05f5fb48215c989e92cc44cf6512512d083132bd3daf689867c8d9d386888/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)