Adding support for OpenNLP language detector and updating TikaTextDetection to output multiple language predictions. #305

Open
wants to merge 10 commits into
base: develop
6 changes: 6 additions & 0 deletions java/TikaTextDetection/LICENSE
@@ -27,3 +27,9 @@ The "language-detector" project is licensed under the Apache 2.0 License.
Project Page: https://github.com/optimaize/language-detector

--------------------------------------------------------------------------

The "Apache OpenNLP" project is licensed under the Apache 2.0 License.
Project Page: https://github.com/apache/opennlp

--------------------------------------------------------------------------

182 changes: 181 additions & 1 deletion java/TikaTextDetection/README.md
@@ -38,4 +38,184 @@ The following format-specific behaviors were observed using Tika 1.28.1 on Ubunt
and `SECTION_NUM = 1`.

- OpenDocument Spreadsheet documents will generate one track per cell, as well as some additional tracks with
date and time information, "Page /", and "???".

# Language Detection Parameters

Tika supports the following language detection properties:

- `MIN_LANGUAGES`: When set to a positive integer, the component attempts to return at least the specified number of top languages, even if some are not marked as reasonably certain. Non-positive values disable this property, so that only reasonably certain predictions are accepted.

  For instance, if `MIN_LANGUAGES` is set to 2, the component will always attempt to return the top 2 predicted languages, followed by any further languages that are also marked as reasonably certain.

  Please note that the behavior of `MIN_LANGUAGES` differs depending on the language detector:
  - The Optimaize language detector will often produce exactly the number of languages requested through `MIN_LANGUAGES`.
  - The OpenNLP language detector tends to produce only 1 language unless `MIN_LANGUAGES` is set to 2 or greater. However, low-confidence results are still sometimes discarded even if the user requested additional language predictions. (This indicates that OpenNLP uses a different filtering mechanism, which could be investigated further.)
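The selection rule described for `MIN_LANGUAGES` can be sketched roughly as follows. This is a minimal illustration, not the component's actual code: the `LanguageSelection` and `Prediction` names are hypothetical, the input is assumed to be sorted from highest to lowest confidence, and the sketch keeps any later prediction that is flagged as reasonably certain (the real component may stop at the first uncertain one).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; these names are hypothetical, not the component's API. */
final class LanguageSelection {

    /** Minimal stand-in for a detector result: language code, raw score, certainty flag. */
    static final class Prediction {
        final String language;
        final float score;
        final boolean reasonablyCertain;

        Prediction(String language, float score, boolean reasonablyCertain) {
            this.language = language;
            this.score = score;
            this.reasonablyCertain = reasonablyCertain;
        }
    }

    /**
     * Keeps the first minLanguages predictions unconditionally, then any later
     * prediction flagged as reasonably certain. A non-positive minLanguages
     * therefore keeps only the reasonably certain predictions.
     * Assumes the input is already sorted by descending confidence.
     */
    static List<Prediction> select(List<Prediction> ranked, int minLanguages) {
        List<Prediction> kept = new ArrayList<>();
        for (int i = 0; i < ranked.size(); i++) {
            Prediction p = ranked.get(i);
            if (i < minLanguages || p.reasonablyCertain) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

With three uncertain predictions and `MIN_LANGUAGES = 2`, this sketch keeps the first two; with `MIN_LANGUAGES = 0`, it keeps none of them.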

Language results are stored as follows:
- `TEXT_LANGUAGE`: The primary detected language for a given text. Set to "Unknown" if no language is identified.
- `TEXT_LANGUAGE_CONFIDENCE`: Raw confidence score for the primary language. Ranges from 0 to 1 for the OpenNLP and Optimaize language detectors, and is set to -1 if no results are found. See the [note here](https://tika.apache.org/2.4.1/api/index.html?org/apache/tika/language/detect/LanguageResult.html).
- `ISO_LANGUAGE`: The primary detected language for a given text in ISO 639-3 format. Set to "UNKNOWN" if no language is identified (rather than "UNK", which is itself an ISO code).
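One possible way to derive an `ISO_LANGUAGE` value is sketched below using only the JDK. This is an assumption-laden illustration rather than the component's implementation: the `IsoLanguage` class and `toIso3` method are hypothetical, and `Locale.getISO3Language()` returns ISO 639-2/T codes, which coincide with ISO 639-3 for most individual languages but are not guaranteed to cover every language a detector reports.

```java
import java.util.Locale;
import java.util.MissingResourceException;

/** Illustrative sketch only; this helper is hypothetical, not the component's API. */
final class IsoLanguage {

    /** Placeholder used when no language is identified ("UNK" is a real ISO code). */
    static final String UNKNOWN = "UNKNOWN";

    /**
     * Converts a detector language code to a three-letter code.
     * Two-letter codes are mapped through the JDK's locale data;
     * three-letter codes are returned unchanged by the JDK.
     */
    static String toIso3(String code) {
        if (code == null || code.trim().isEmpty()) {
            return UNKNOWN;
        }
        try {
            String iso3 = Locale.forLanguageTag(code.trim()).getISO3Language();
            return iso3.isEmpty() ? UNKNOWN : iso3;
        } catch (MissingResourceException e) {
            // The JDK has no three-letter code for this language.
            return UNKNOWN;
        }
    }
}
```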

Secondary languages and their confidence scores are listed as comma-separated strings:
- `SECONDARY_TEXT_LANGUAGES`: A list of secondary languages (from greatest to least confidence) separated by ", " delimiters.
- `SECONDARY_TEXT_LANGUAGE_CONFIDENCES`: A list of confidence scores corresponding to the secondary detected languages, in the same order (also separated by commas).

Note: the `SECONDARY_TEXT` properties are not included if no secondary languages are identified.

For secondary language predictions, ensure that the `MIN_LANGUAGES` property is set to 2 or greater. Otherwise, secondary language results are typically discarded by the component, as Tika has a high confidence threshold for reasonable predictions.
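The output properties described in this section could be assembled along the following lines. This is a sketch under stated assumptions, not the component's code: the `LanguageProperties` class is hypothetical, the inputs are assumed to be parallel lists already sorted by descending confidence, the confidence list is assumed to use the same ", " delimiter as the language list, and `ISO_LANGUAGE` is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative sketch only; this class is hypothetical, not the component's API. */
final class LanguageProperties {

    /**
     * Builds the detection properties from parallel lists of language names
     * and confidence scores, both ordered from greatest to least confidence.
     */
    static Map<String, String> build(List<String> languages, List<Float> confidences) {
        Map<String, String> props = new LinkedHashMap<>();
        if (languages.isEmpty()) {
            // No result: placeholder language and a confidence of -1.
            props.put("TEXT_LANGUAGE", "Unknown");
            props.put("TEXT_LANGUAGE_CONFIDENCE", "-1");
            return props;
        }
        props.put("TEXT_LANGUAGE", languages.get(0));
        props.put("TEXT_LANGUAGE_CONFIDENCE", String.valueOf(confidences.get(0)));

        // The SECONDARY_TEXT properties are only added when secondary languages exist.
        if (languages.size() > 1) {
            StringJoiner langs = new StringJoiner(", ");
            StringJoiner confs = new StringJoiner(", ");
            for (int i = 1; i < languages.size(); i++) {
                langs.add(languages.get(i));
                confs.add(String.valueOf(confidences.get(i)));
            }
            props.put("SECONDARY_TEXT_LANGUAGES", langs.toString());
            props.put("SECONDARY_TEXT_LANGUAGE_CONFIDENCES", confs.toString());
        }
        return props;
    }
}
```

Keeping the two secondary lists in the same order lets a consumer split both strings on the delimiter and pair entries by index.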

# Supported Language Detectors:

This component supports the following language detectors. Users can select their preferred detector using
the `LANGUAGE_DETECTOR` option:

- `LANGUAGE_DETECTOR = opennlp`: [Apache Tika OpenNLP Language Detector](https://tika.apache.org/2.4.1/api/org/apache/tika/langdetect/opennlp/OpenNLPDetector.html)

  Apache Tika's latest in-house language detection capability, based on OpenNLP's language detector.
  Uses machine learning (ML) models trained on the following datasets, and supports 148 languages in total:
  - [Leipzig corpus](https://wortschatz.uni-leipzig.de/en/download)
  - [cc-100](https://data.statmt.org/cc-100/)

  Supports almost every language in the Optimaize and Tika language detectors, except Aragonese.

  **PLEASE NOTE**: if `opennlp` is selected, ensure that the `MIN_LANGUAGES` property is set to 1 or greater.

- `LANGUAGE_DETECTOR = optimaize`: [Optimaize Language Detector](https://github.com/optimaize/language-detector)

  Third-party language detection project that supports 71 languages.
  Predicts the target language using N-gram frequency matching between the input and language profiles.
  Supports almost every language present in Tika's language detector, except Esperanto.
  Please note that Optimaize supports Punjabi/Panjabi, while OpenNLP supports Western Punjabi/Panjabi.
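The interaction between `LANGUAGE_DETECTOR` and `MIN_LANGUAGES` noted above could be checked up front, as in the sketch below. Everything here is hypothetical: the `DetectorOptions` class, the case-insensitive matching, and the choice to fail fast with an exception are assumptions, and the actual component may handle an unsupported combination differently.

```java
import java.util.Locale;

/** Illustrative sketch only; this class is hypothetical, not the component's API. */
final class DetectorOptions {

    enum Detector { OPENNLP, OPTIMAIZE }

    final Detector detector;
    final int minLanguages;

    private DetectorOptions(Detector detector, int minLanguages) {
        this.detector = detector;
        this.minLanguages = minLanguages;
    }

    /**
     * Parses the LANGUAGE_DETECTOR value and checks it against MIN_LANGUAGES.
     * Assumption: an invalid combination is rejected rather than silently adjusted.
     */
    static DetectorOptions parse(String detectorName, int minLanguages) {
        Detector detector;
        switch (detectorName.trim().toLowerCase(Locale.ROOT)) {
            case "opennlp":
                detector = Detector.OPENNLP;
                break;
            case "optimaize":
                detector = Detector.OPTIMAIZE;
                break;
            default:
                throw new IllegalArgumentException(
                        "Unsupported LANGUAGE_DETECTOR: " + detectorName);
        }
        if (detector == Detector.OPENNLP && minLanguages < 1) {
            throw new IllegalArgumentException(
                    "MIN_LANGUAGES must be 1 or greater when LANGUAGE_DETECTOR = opennlp");
        }
        return new DetectorOptions(detector, minLanguages);
    }
}
```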


# Supported Language List:

- Tika Language Detector:
  - Belarusian
  - Catalan
  - Danish
  - Dutch
  - English
  - Esperanto
  - Estonian
  - Finnish
  - French
  - Galician
  - German
  - Greek
  - Hungarian
  - Icelandic
  - Italian
  - Lithuanian
  - Norwegian
  - Persian
  - Polish
  - Portuguese
  - Romanian
  - Russian
  - Slovakian
  - Slovenian
  - Spanish
  - Swedish
  - Thai
  - Ukrainian


- Optimaize Language Detector:
  - Every language in Tika's language detector, except Esperanto.
  - Afrikaans
  - **Aragonese** (Unique to this detector)
  - Albanian
  - Arabic
  - Asturian
  - Basque
  - Bengali
  - Breton
  - Bulgarian
  - Croatian
  - Czech
  - Gujarati
  - Haitian
  - Hebrew
  - Hindi
  - Indonesian
  - Irish
  - Japanese
  - Kannada
  - Khmer
  - Korean
  - Latvian
  - Macedonian
  - Malay
  - Malayalam
  - Maltese
  - Marathi
  - Nepali
  - Occitan
  - Punjabi
  - Serbian
  - Simplified Chinese
  - Slovak
  - Slovene
  - Somali
  - Swahili
  - Tagalog
  - Tamil
  - Telugu
  - Traditional Chinese
  - Turkish
  - Urdu
  - Vietnamese
  - Walloon
  - Welsh
  - Yiddish


- OpenNLP Language Detector:
  - Every language in Tika and Optimaize Language Detectors except Aragonese.
  - Armenian
  - Assamese
  - Azerbaijani
  - Balinese
  - Bashkir
  - Bihari languages
  - Burmese
  - Cebuano
  - Dhivehi
  - Eastern Mari
  - Esperanto
  - Faroese
  - Fulah
  - Goan Konkani
  - Igbo
  - Iranian Persian
  - Javanese
  - Konkani
  - Kurdish
  - Lingala
  - Low German
  - Malay
  - Mandarin Chinese
  - Maori
  - Min Nan Chinese
  - Minangkabau
  - Mongolian
  - Paraguayan Guaraní
  - Pedi
  - Pushto
  - Scottish Gaelic
  - Sindhi
  - Slovenian
  - Standard Estonian
  - Standard Latvian
  - Swati
  - Swiss German
  - Tajik
  - Turkmen
  - Uighur
  - Uzbek
  - Western Frisian
  - Western Panjabi
  - Xhosa
  - Yoruba
  - Zulu
12 changes: 12 additions & 0 deletions java/TikaTextDetection/plugin-files/descriptor/descriptor.json
@@ -31,6 +31,18 @@
"type": "BOOLEAN",
"defaultValue": "false"
},
{
"name": "LANGUAGE_DETECTOR",
"description": "Specifies which Tika language detector to use. Current options are `opennlp` and `optimaize`. Please note, if `opennlp` is selected, ensure that the `MIN_LANGUAGES` property is set to 1 or greater.",
"type": "STRING",
"defaultValue": "opennlp"
},
{
"name": "MIN_LANGUAGES",
"description": "When set to a positive integer, attempt to always return specified number of top languages, even if some are not marked as reasonably certain. Non-positive values disable this property to only accept reasonable predictions.",
"type": "INT",
"defaultValue": "2"
},
{
"name": "LIST_ALL_PAGES",
"description": "Specifies whether or not to store each page as a track, even if no text is extracted.",
8 changes: 5 additions & 3 deletions java/TikaTextDetection/pom.xml
@@ -66,13 +66,15 @@
<version>2.4.1</version>
</dependency>
<dependency>
<!-- Please note that Apache Tika versions 2.0 and above now support multiple language detection options: -->
<!-- https://github.com/apache/tika/tree/main/tika-langdetect -->
<!-- TODO: Investigate other language detection capabilities and update this library if needed.-->
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect-optimaize</artifactId>
<version>2.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect-opennlp</artifactId>
<version>2.4.1</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>