Investigate Apache Tika's New OpenNLP Library #1545

Open
hhuangMITRE opened this issue Jul 13, 2022 · 0 comments

While searching for the Optimaize language detection module, we found that Tika offers the following language detection modules:

    <module>tika-langdetect-lingo24</module>
    <module>tika-langdetect-optimaize</module>
    <module>tika-langdetect-mitll-text</module>
    <module>tika-langdetect-opennlp</module>

In particular, the OpenNLP module appears to have received heavy investment from the Apache Tika team:

https://tika.apache.org/2.0.0/api/org/apache/tika/langdetect/opennlp/OpenNLPDetector.html
https://opennlp.apache.org/

We would like to evaluate each of these new Tika modules to see whether any of them offers better language detection than Optimaize, since the Optimaize package is dated: https://github.com/optimaize/language-detector.
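One way to frame the comparison is a small harness that scores each detector against a set of labeled text samples. The sketch below is only an illustration: `DetectorComparison`, `accuracy`, the sample sentences, and the stub detectors are all hypothetical names and data, and each detector is reduced to a plain `Function<String, String>` from text to a language code. Wiring in the real Tika modules would mean wrapping each detector's `detect` call in such a function, and probably normalizing language codes first, since the modules may not all report codes in the same format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness: compares language detectors on labeled samples.
class DetectorComparison {

    // Fraction of samples for which the detector returns the expected code.
    // Keys of `labeled` are sample texts, values are expected language codes.
    static double accuracy(Function<String, String> detector, Map<String, String> labeled) {
        if (labeled.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Map.Entry<String, String> sample : labeled.entrySet()) {
            String predicted = detector.apply(sample.getKey());
            if (sample.getValue().equals(predicted)) {
                correct++;
            }
        }
        return (double) correct / labeled.size();
    }

    public static void main(String[] args) {
        // Illustrative samples only; a real evaluation needs a much larger set.
        Map<String, String> labeled = new LinkedHashMap<>();
        labeled.put("The quick brown fox jumps over the lazy dog.", "en");
        labeled.put("Le renard brun saute par-dessus le chien paresseux.", "fr");

        // Stub detectors stand in for the Tika modules (assumption: each real
        // module would be adapted to a text -> language-code function).
        Map<String, Function<String, String>> detectors = new LinkedHashMap<>();
        detectors.put("always-en (stub)", text -> "en");
        detectors.put("always-fr (stub)", text -> "fr");

        detectors.forEach((name, detector) ->
                System.out.printf("%s: %.2f%n", name, accuracy(detector, labeled)));
    }
}
```

Keeping the detector behind a simple function type means the same samples and scoring can be reused for Optimaize, OpenNLP, and the other modules without changing the harness.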
