Adding support for OpenNLP language detector and updating TikaTextDetection to output multiple language predictions. #305

Open
wants to merge 10 commits into
base: develop
6 changes: 6 additions & 0 deletions java/TikaTextDetection/LICENSE
@@ -27,3 +27,9 @@ The "language-detector" project is licensed under the Apache 2.0 License.
Project Page: https://github.com/optimaize/language-detector

--------------------------------------------------------------------------

The "Apache OpenNLP" project is licensed under the Apache 2.0 License.
Project Page: https://github.com/apache/opennlp

--------------------------------------------------------------------------

191 changes: 190 additions & 1 deletion java/TikaTextDetection/README.md
@@ -38,4 +38,193 @@ The following format-specific behaviors were observed using Tika 1.28.1 on Ubunt
and `SECTION_NUM = 1`.

- OpenDocument Spreadsheet documents will generate one track per cell, as well as some additional tracks with
date and time information, "Page /", and "???".

# Language detection parameters

The component supports the following language detection properties:

- `MAX_REASONABLE_LANGUAGES`: Specifies the maximum number of top detected languages to return.
When set to 0 or below, any number of language results marked as reasonably certain by Tika may be returned.

- `MIN_LANGUAGES`: When set to a positive integer, the component attempts to return the specified number of top languages, even if some are not marked as reasonably certain. Non-positive values disable this property so that only reasonably certain predictions are accepted.

For instance, if `MAX_REASONABLE_LANGUAGES` is set to 5 and `MIN_LANGUAGES` is set to 2, the component will always attempt to return the top 2 predicted languages, followed by up to 3 more if they are marked as reasonably certain.

If `MAX_REASONABLE_LANGUAGES` is set to -1 and `MIN_LANGUAGES` is set to 2 (the defaults), the component will always attempt to return the top predicted language and one secondary language, even if Tika does not mark them as reasonably certain.

Please note that the behavior of `MIN_LANGUAGES` differs depending on the language detector:
- The Optimaize language detector will often produce exactly the number of languages requested by `MIN_LANGUAGES`.
- The OpenNLP language detector tends to produce only 1 language unless `MIN_LANGUAGES` is set to 2 or greater. However, low-confidence results are still sometimes discarded even if the user requested additional language predictions. This suggests that OpenNLP applies a different filtering mechanism, which could be investigated further.
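
The interaction between the two properties can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the component's actual code: `LanguageSelectionSketch`, `Prediction`, and `select` are hypothetical names, the input is assumed to be sorted from most to least likely, and selection is assumed to stop at the first entry that is neither forced by `MIN_LANGUAGES` nor reasonably certain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the MIN_LANGUAGES / MAX_REASONABLE_LANGUAGES rules.
final class LanguageSelectionSketch {

    /** Hypothetical stand-in for one detector result. */
    static final class Prediction {
        final String language;
        final boolean reasonablyCertain;

        Prediction(String language, boolean reasonablyCertain) {
            this.language = language;
            this.reasonablyCertain = reasonablyCertain;
        }
    }

    /**
     * Selects from a list that is already ranked from most to least likely.
     * The first minLanguages entries are kept unconditionally. After that,
     * entries are kept only while they are reasonably certain and the total
     * stays within maxReasonableLanguages (no cap when it is 0 or below).
     */
    static List<Prediction> select(List<Prediction> ranked,
                                   int minLanguages,
                                   int maxReasonableLanguages) {
        List<Prediction> kept = new ArrayList<>();
        for (Prediction p : ranked) {
            // Never true when minLanguages is 0 or negative.
            boolean forced = kept.size() < minLanguages;
            boolean underCap = maxReasonableLanguages <= 0
                    || kept.size() < maxReasonableLanguages;
            if (!forced && !(p.reasonablyCertain && underCap)) {
                break; // assumption: stop at the first entry that does not qualify
            }
            kept.add(p);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Prediction> ranked = Arrays.asList(
                new Prediction("en", true),
                new Prediction("fr", false),
                new Prediction("de", false));
        // Default settings: MIN_LANGUAGES = 2, MAX_REASONABLE_LANGUAGES = -1.
        for (Prediction p : select(ranked, 2, -1)) {
            System.out.println(p.language + " reasonablyCertain=" + p.reasonablyCertain);
        }
    }
}
```

How the real component behaves when `MIN_LANGUAGES` exceeds `MAX_REASONABLE_LANGUAGES` is not described here; the sketch simply lets the minimum win.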

Language results are stored as follows:
- `TEXT_LANGUAGE`: The primary detected language for a given text. Set to "Unknown" if no language is identified.
- `TEXT_LANGUAGE_CONFIDENCE`: A confidence level for the primary language, ranging from `NONE` to `HIGH`. See the [`LanguageConfidence` documentation](https://tika.apache.org/1.21/api/org/apache/tika/language/detect/LanguageConfidence.html).

Secondary languages and their confidence scores are listed as comma-separated strings:
- `SECONDARY_TEXT_LANGUAGES`: A list of secondary languages (from greatest to least confidence) separated by ", " delimiters.
Set to "Unknown" if no language is identified.
- `SECONDARY_TEXT_LANGUAGE_CONFIDENCES`: A list of confidences corresponding to the secondary detected languages, in the same order (also separated by commas).
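
As a rough illustration of the output layout described above, the following sketch builds the four properties from two parallel lists. The class and method names are hypothetical, and several details are assumptions rather than documented behavior: confidences are passed as plain strings such as `HIGH`, the confidence list uses the same ", " delimiter as the language list, and the confidence properties are simply omitted when there is nothing to report.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how ranked results could map to track properties.
final class LanguagePropertiesSketch {

    /**
     * languages and confidences are parallel lists, ordered from the most
     * to the least likely language.
     */
    static Map<String, String> toProperties(List<String> languages,
                                            List<String> confidences) {
        if (languages.size() != confidences.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        Map<String, String> props = new LinkedHashMap<>();
        if (languages.isEmpty()) {
            // No language identified at all.
            props.put("TEXT_LANGUAGE", "Unknown");
            props.put("SECONDARY_TEXT_LANGUAGES", "Unknown");
            return props;
        }
        props.put("TEXT_LANGUAGE", languages.get(0));
        props.put("TEXT_LANGUAGE_CONFIDENCE", confidences.get(0));
        if (languages.size() == 1) {
            // Assumption: no secondary language is reported as "Unknown".
            props.put("SECONDARY_TEXT_LANGUAGES", "Unknown");
            return props;
        }
        props.put("SECONDARY_TEXT_LANGUAGES",
                String.join(", ", languages.subList(1, languages.size())));
        props.put("SECONDARY_TEXT_LANGUAGE_CONFIDENCES",
                String.join(", ", confidences.subList(1, confidences.size())));
        return props;
    }
}
```

Keeping both lists in the same order lets a consumer split each string on ", " and pair entries by index.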

# Supported Language Detectors:

This component supports the following language detectors. Users can select their preferred detector using
the `LANG_DETECTOR` option:

- `LANG_DETECTOR = opennlp`: [Apache Tika OpenNLP Language Detector](https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/opennlp/OpenNLPDetector.html)

Apache Tika's latest in-house language detection capability, based on OpenNLP's language detector.
Uses machine learning (ML) models trained on the following datasets and supports 148 languages in total:
- [Leipzig corpus](https://wortschatz.uni-leipzig.de/en/download)
- [cc-100](https://data.statmt.org/cc-100/)

Supports almost every language in the Optimaize and Tika language detectors, with the exception of Aragonese.


- `LANG_DETECTOR = optimaize`: [Optimaize Language Detector](https://github.com/optimaize/language-detector)

Third-party language detection project that supports 71 languages.
Predicts the target language using N-gram frequency matching between the input and language profiles.
Supports almost every language present in Tika's language detector, with the exception of Esperanto.
Please note that Optimaize supports Punjabi/Panjabi while OpenNLP supports Western Punjabi/Panjabi.


- (Deprecated) `LANG_DETECTOR = tika`: [Apache Tika Language Detector](https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/tika/TikaLanguageDetector.html)

Apache Tika's original in-house language detection capability.
Predicts the target language using the vector distance of trigrams between the input string and language models.
Supports 28 languages (listed in the following section).
**NOTE: Developers have warned that this legacy detector is deprecated and won't work well on short snippets of text.**
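
One way to turn the `LANG_DETECTOR` value into a detector choice is sketched below. The enum, class, and method names are hypothetical; falling back to `opennlp` for a missing value mirrors the default listed in this change's `descriptor.json`, while the case-insensitive matching and the exception for unrecognized values are assumptions, not documented behavior.

```java
import java.util.Locale;

// Hypothetical sketch of parsing the LANG_DETECTOR job property.
final class DetectorOptionSketch {

    /** The three detector options named in this section. */
    enum Detector { OPENNLP, OPTIMAIZE, TIKA }

    static Detector parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Detector.OPENNLP; // default per descriptor.json
        }
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "opennlp":
                return Detector.OPENNLP;
            case "optimaize":
                return Detector.OPTIMAIZE;
            case "tika":
                return Detector.TIKA; // deprecated, weak on short text
            default:
                // Assumption: reject unrecognized values instead of guessing.
                throw new IllegalArgumentException(
                        "Unsupported LANG_DETECTOR value: " + value);
        }
    }
}
```

Each enum constant would then be mapped to the corresponding Tika detector class linked above.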


# Supported Language List:

- Tika Language Detector:
  - Belarusian
  - Catalan
  - Danish
  - German
  - Esperanto
  - Estonian
  - Greek
  - English
  - Spanish
  - Finnish
  - French
  - Persian
  - Galician
  - Hungarian
  - Icelandic
  - Italian
  - Lithuanian
  - Dutch
  - Norwegian
  - Polish
  - Portuguese
  - Romanian
  - Russian
  - Slovakian
  - Slovenian
  - Swedish
  - Thai
  - Ukrainian


- Optimaize Language Detector:
  - Every language in Tika's language detector, except Esperanto.
  - Afrikaans
  - **Aragonese** (Unique to this detector)
  - Arabic
  - Asturian
  - Breton
  - Bulgarian
  - Bengali
  - Czech
  - Welsh
  - Basque
  - Irish
  - Gujarati
  - Hebrew
  - Hindi
  - Croatian
  - Haitian
  - Indonesian
  - Japanese
  - Khmer
  - Kannada
  - Korean
  - Latvian
  - Macedonian
  - Malayalam
  - Marathi
  - Malay
  - Maltese
  - Nepali
  - Occitan
  - Punjabi
  - Slovak
  - Slovene
  - Somali
  - Albanian
  - Serbian
  - Swahili
  - Tamil
  - Telugu
  - Tagalog
  - Turkish
  - Urdu
  - Vietnamese
  - Walloon
  - Yiddish
  - Simplified Chinese
  - Traditional Chinese


- OpenNLP Language Detector:
  - Every language in Tika and Optimaize Language Detectors except Aragonese.
  - Bihari languages
  - Swiss German
  - Turkmen
  - Bashkir
  - Mongolian
  - Balinese
  - Pushto
  - Faroese
  - Swati
  - Min Nan Chinese
  - Yoruba
  - Scottish Gaelic
  - Javanese
  - Iranian Persian
  - Esperanto
  - Western Panjabi
  - Standard Latvian
  - Western Frisian
  - Burmese
  - Eastern Mari
  - Paraguayan Guaraní
  - Slovenian
  - Cebuano
  - Mandarin Chinese
  - Kurdish
  - Pedi
  - Azerbaijani
  - Uighur
  - Minangkabau
  - Tajik
  - Uzbek
  - Maori
  - Sindhi
  - Konkani
  - Armenian
  - Igbo
  - Assamese
  - Malay
  - Low German
  - Fulah
  - Xhosa
  - Standard Estonian
  - Goan Konkani
  - Lingala
  - Dhivehi
  - Zulu
19 changes: 19 additions & 0 deletions java/TikaTextDetection/plugin-files/descriptor/descriptor.json
@@ -31,6 +31,25 @@
"type": "BOOLEAN",
"defaultValue": "false"
},
{
"name": "LANG_DETECTOR",
"description": "Specifies which Tika language detector to use. Current options are `opennlp`, `optimaize` and `tika` (Note: `tika` is deprecated and does not work well on short text). Defaults to `opennlp`.",
"type": "STRING",
"defaultValue": "opennlp"
},
{
"name": "MAX_REASONABLE_LANGUAGES",
"description": "Specifies the maximum number of top detected languages. When set to 0 or below, allow any number of language results that are marked as reasonably certain by Tika.",
"type": "INT",
"defaultValue": "-1"
},
{
"name": "MIN_LANGUAGES",
"description": "When set to a positive integer, always attempt to return the specified number of top languages, even if some are not marked as reasonably certain. Non-positive values disable this property so that only reasonably certain predictions are accepted.",
"type": "INT",
"defaultValue": "2"
},
{
"name": "LIST_ALL_PAGES",
"description": "Specifies whether or not to store each page as a track, even if no text is extracted.",
13 changes: 10 additions & 3 deletions java/TikaTextDetection/pom.xml
@@ -66,13 +66,20 @@
<version>2.4.1</version>
</dependency>
<dependency>
<!-- Please note that Apache Tika versions 2.0 and above now support multiple language detection options: -->
<!-- https://github.com/apache/tika/tree/main/tika-langdetect -->
<!-- TODO: Investigate other language detection capabilities and update this library if needed.-->
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect-optimaize</artifactId>
<version>2.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect-opennlp</artifactId>
<version>2.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect-tika</artifactId>
<version>2.4.1</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>