Should not fail and throw an exception when given punctuation, UTF-8 chars/symbols, empty text #4

Closed
gibrown opened this Issue Oct 23, 2013 · 3 comments

gibrown commented Oct 23, 2013

There are a number of different cases where the language detector will fail:

  • Leading punctuation. Something like "----------ROMA.....sexy ragazza orientale 3888669169---------- ' Perch correre affannosamente qua e l senza motivo? Tu sei ci che l esistenza vuole che tu sia.'" will fail. Generally any text that leads off with a number of punctuation characters fails.
  • Unicode symbols, emoticons, etc. anywhere in the text cause failures: U+2000-U+2BFF (symbols), U+1F000-U+1FFFF (symbols, emoticons), and probably others.
  • Any character in the U+1780-U+17FF range (Khmer) causes a failure.
  • Any text that contains no Unicode letters (\p{L} in PCRE) fails.

This is probably an issue with the underlying library, but if so, it would be nice for this wrapper to run some checks. Currently I have the following checks implemented in my client: https://gist.github.com/gibrown/7122061

I'm running about a million language-detection API calls a day, and I think these checks catch almost all failures.
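
For readers who want a rough idea of what such client-side checks could look like, here is a minimal Python sketch. It is not the code from the linked gist and not part of the plugin; the function name `prepare_for_detection` and the decision to drop offending characters (rather than reject the whole text) are assumptions made for illustration. The code point ranges are the ones listed in the report above.

```python
import unicodedata

# Code point ranges reported above as causing failures (inclusive).
PROBLEM_RANGES = (
    (0x1780, 0x17FF),    # Khmer
    (0x2000, 0x2BFF),    # symbols
    (0x1F000, 0x1FFFF),  # symbols, emoticons
)


def _in_problem_range(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PROBLEM_RANGES)


def _is_letter(ch):
    # Unicode general category L* corresponds to \p{L} in PCRE.
    return unicodedata.category(ch).startswith("L")


def prepare_for_detection(text):
    """Hypothetical pre-check: return cleaned text, or None to skip detection."""
    # Drop every character that falls into one of the reported ranges.
    cleaned = "".join(ch for ch in text if not _in_problem_range(ch))

    # Strip leading whitespace, punctuation (P*) and symbols (S*).
    start = 0
    while start < len(cleaned) and (
        cleaned[start].isspace()
        or unicodedata.category(cleaned[start])[0] in "PS"
    ):
        start += 1
    cleaned = cleaned[start:]

    # Without at least one letter there is nothing to detect.
    if not any(_is_letter(ch) for ch in cleaned):
        return None
    return cleaned
```

A caller would only send the text to the detector when the result is not None. Note that U+2000-U+2BFF also covers general punctuation such as curly quotes and dashes, so dropping the whole range is lossy; that trade-off is an assumption of this sketch.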

Owner

jprante commented Oct 23, 2013

Very much appreciated! I will work on this to make langdetect foolproof.

Owner

jprante commented Oct 25, 2013

Should be much better in the released 2.0.0+.

With Java 7, pattern matching is fixed to work correctly with Unicode characters.

Asian languages and punctuation tests added. More tests are welcome.

@jprante jprante closed this Oct 25, 2013

gibrown commented Nov 27, 2013

@jprante Just wanted to say that this stomped out a ton of exceptions, thanks!

I'll keep an eye out for more.

jprante pushed a commit that referenced this issue Jun 7, 2017

Merge pull request #4 from Automattic/port-romanian-vietnamese-normal…
…ization

Add normalisation for Romanian and Vietnamese from original library