Improve classification accuracy and coverage by merging profiles #69

merged 22 commits into from Jun 7, 2017



yanirs commented May 31, 2017

Motivation: This set of changes improves our understanding of how the plugin performs on texts of various lengths and types, and increases the number of supported languages while improving classification accuracy.

Main changes

  • Add tests for LangdetectService that evaluate its classification accuracy (percentage of correctly-classified texts) on various text lengths and types. These tests are instantiated dynamically by the DetectLanguageAccuracyTest class using a CSV file (src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv), which contains a row for each set of parameters with the expected classification accuracy for each language. The tested parameters include:
    • Text length, with emphasis on short texts in the 5-20 characters range (simulating search queries).
    • Text type, represented by two new datasets that cover all the languages supported by the plugin: translations of the Universal Declaration of Human Rights (src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/udhr.tsv), and translations of the WordPress interface (src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/wordpress-translations.tsv).
    • Language profile: original-default, short-text, or the newly-added merged-average (more on this below).
    • Supported languages: original-default or all.
  • Add the Python code used to generate the datasets. See scripts/ for running instructions.
  • Add an option to regenerate DetectLanguageAccuracyTest's CSV input file: When the path.accuracies.out system property is set, the test class writes the accuracies to a CSV file, making it easy to update the expected results if they ever change.
  • Add merged-average, a new language profile that combines the original-default and short-text profiles by averaging the n-gram frequencies for every language. This language profile supports 55 languages (the union of the 53 languages supported by the original-default profile and the 47 languages supported by the short-text profile), while increasing classification accuracy on the 45 original-default languages (the intersection of the two existing profiles). A comparison of the performance of the different profiles is shown below. Given this comparison, the default settings have been changed to use the merged-average profile on all 55 languages.
  • Refactor LangProfile to make it immutable and allow it to read long integers.
  • Normalise Romanian and Vietnamese characters, as done by Shuyo's original library. According to tests on the UDHR and WordPress translations datasets, this improves classification accuracy on these languages, and doesn't affect performance on other languages. Original code for reference:
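To illustrate the merged-average idea from the list above, here is a minimal Python sketch. It is not the plugin's Java code, and it is not the original library code referred to in the previous bullet. The `merge_average` function, the `Profile` type, and the toy data are hypothetical. The sketch also makes three assumptions the PR does not confirm: n-gram frequencies are already relative frequencies rather than raw counts, an n-gram missing from one profile counts as zero there, and a language present in only one profile is carried over unchanged.

```python
from typing import Dict

# Hypothetical in-memory representation: language -> n-gram -> relative frequency.
Profile = Dict[str, Dict[str, float]]


def merge_average(first: Profile, second: Profile) -> Profile:
    """Average n-gram frequencies per language across two profiles.

    Assumptions (not taken from the plugin): an n-gram missing from one
    profile contributes 0.0, and a language found in only one profile
    keeps that profile's frequencies as they are.
    """
    merged: Profile = {}
    # The merged profile covers the union of the two language sets.
    for lang in sorted(set(first) | set(second)):
        sources = [p[lang] for p in (first, second) if lang in p]
        ngrams = set().union(*sources)
        merged[lang] = {
            gram: sum(src.get(gram, 0.0) for src in sources) / len(sources)
            for gram in sorted(ngrams)
        }
    return merged


# Toy data: "en" is in both profiles, "fr" and "de" in one each.
original_default = {"en": {"th": 0.5, "he": 0.25}, "fr": {"le": 0.5}}
short_text = {"en": {"th": 0.25, "in": 0.125}, "de": {"ch": 0.25}}

merged_average = merge_average(original_default, short_text)
print(sorted(merged_average))  # languages from both profiles
print(merged_average["en"])
```

In this toy case the merged profile covers three languages (the union), in the same way that the real merged-average profile covers 55 languages from profiles supporting 53 and 47.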

Summary of experiments

The following table presents the mean accuracy by dataset, profile, text length (full versus short – texts of length 5, 10 & 20), and language setting. As a reminder, the original number of default languages is 45, while the original-default, merged-average, and short-text profiles support 53, 55, and 47 languages respectively. Therefore, when testing on all languages, two numbers are reported for the original-default and short-text profiles: The first is the mean accuracy across all 55 languages (including languages they can't get right due to lack of support), and the second is the mean accuracy across only the supported languages (marked with S:). I think that the first number is more in line with the goals of many plugin users, who would want to increase coverage, but can't guarantee that they'd only try to classify texts in supported languages. The second number is provided for completeness.

| Dataset | Profile | Full texts; all 55 languages | Full texts; original-default 45 languages | Short texts; all 55 languages | Short texts; original-default 45 languages |
|---|---|---|---|---|---|
| udhr | original-default | 96.36% (S: 100%) | 100% | 75.11% (S: 77.94%) | |
| udhr | merged-average | 100% | 100% | 77.76% | 81.39% |
| udhr | short-text | 85.45% (S: 100%) | 100% | 68.32% (S: 79.95%) | |
| wordpress-translations | original-default | 95.45% (S: 99.06%) | 98.93% | 69.55% (S: 72.17%) | |
| wordpress-translations | merged-average | 99.60% | 99.69% | 73.43% | 77.25% |
| wordpress-translations | short-text | 85.16% (S: 99.66%) | 99.64% | 65.01% (S: 76.08%) | |
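
The relationship between the two numbers reported for the original-default and short-text profiles can be checked with a little arithmetic. The sketch below assumes that every text in an unsupported language is misclassified (accuracy 0%), which is how the description above characterises those languages. The function name is hypothetical and the code is not part of the plugin's test suite.

```python
def mean_over_all_languages(supported_mean: float, n_supported: int, n_total: int) -> float:
    """Mean accuracy over n_total languages when unsupported languages score 0%.

    supported_mean is the mean accuracy (in percent) over the supported
    languages only, i.e. the value marked with "S:" in the table.
    """
    return supported_mean * n_supported / n_total


# UDHR, full texts: original-default supports 53 of the 55 languages.
udhr_full_original = mean_over_all_languages(100.0, 53, 55)
# UDHR, full texts: short-text supports 47 of the 55 languages.
udhr_full_short = mean_over_all_languages(100.0, 47, 55)
# UDHR, short texts: original-default, S: 77.94% over 53 languages.
udhr_short_original = mean_over_all_languages(77.94, 53, 55)

print(round(udhr_full_original, 2))   # 96.36
print(round(udhr_full_short, 2))      # 85.45
print(round(udhr_short_original, 2))  # 75.11
```

The results match the first numbers in the corresponding table cells. Because the S: values are themselves rounded, recomputing other cells this way can differ from the table in the last decimal place.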

Manual testing

  • Run ./gradlew test --rerun-tasks --info to view the output of the new tests (all tests should pass).
  • Run ./gradlew test --rerun-tasks -Dpath.accuracies.out=accuracies.csv, and verify that the output accuracies.csv file is identical to src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv.
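
The comparison in the second step can be done with any diff tool. As one option, here is a small hypothetical Python helper (not part of this PR's scripts/ directory) that reports the rows that differ between the expected and regenerated CSV files.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def diff_rows(expected_path, actual_path):
    """Return (row_number, expected_row, actual_row) for every differing row.

    Row numbers are 1-based. A row missing from one file is reported as None.
    """
    expected = read_rows(expected_path)
    actual = read_rows(actual_path)
    differences = []
    for index in range(max(len(expected), len(actual))):
        expected_row = expected[index] if index < len(expected) else None
        actual_row = actual[index] if index < len(actual) else None
        if expected_row != actual_row:
            differences.append((index + 1, expected_row, actual_row))
    return differences
```

Calling `diff_rows("src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv", "accuracies.csv")` from the repository root should return an empty list when the regenerated file matches the checked-in one.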
yanirs and others added 22 commits Nov 26, 2016
Add tests to evaluate classification accuracy
Make classification accuracy tests more granular

Add normalisation for Romanian and Vietnamese from original library
Contributor Author

yanirs commented Jun 5, 2017

@jprante Any thoughts on this PR? I'm happy to provide more details if necessary. 🙂


jprante commented Jun 7, 2017

@yanirs sorry for the delay. That's simply marvelous work, I'm very impressed. Thank you for sharing your improvements with me and the community!

I still have to go through the details myself in a quiet hour, but I'm confident from your excellent pull request that everything will work perfectly.

@jprante jprante merged commit 070d9d6 into jprante:master Jun 7, 2017
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Contributor Author

yanirs commented Jun 7, 2017

@jprante Thank you! 😄


jprante commented Jun 7, 2017

I have released a version with the pull request included.
