UTF8 strings lowercase conversion #26

Merged
merged 1 commit into from Mar 14, 2014

@p-lambert
Contributor

p-lambert commented Mar 7, 2014

Currently, process_text converts the given string to lowercase before performing the word matching, but unfortunately Ruby's String#downcase does not convert non-ASCII characters in UTF-8 strings.
I've noticed inconsistencies like this one:

require 'whatlanguage'
puts "ÂNCORA COR ÂMBAR".language # => spanish
puts "âncora cor âmbar".language # => portuguese

Thanks for the library!
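
On Ruby 2.4 or later, where String#downcase applies Unicode case mapping by default, the ASCII-only behaviour described above can still be reproduced with the :ascii option. The following is only a small sketch to make the difference visible; the variable names are illustrative and it does not use WhatLanguage itself:

```ruby
text = "ÂNCORA COR ÂMBAR"

# ASCII-only mapping: only A-Z are lowercased, so "Â" is left untouched.
# This mirrors what plain String#downcase did on Rubies older than 2.4.
ascii_only = text.downcase(:ascii)

# Default mapping on Ruby 2.4+: full Unicode case mapping.
unicode = text.downcase

puts ascii_only # => Âncora cor Âmbar
puts unicode    # => âncora cor âmbar
```

With the ASCII-only result, words such as "Âncora" no longer match a lowercase word list, which would explain the different detection results.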

Owner

peterc commented Mar 7, 2014

This is a general problem in Ruby. Do you know of any reasonable solutions?

The problem here is that WhatLanguage in its current form depends on word lists, and having all combinations of casing in those lists is impractical, so we have to normalize them somehow. Is there a better way to do this normalization?

Contributor

p-lambert commented Mar 7, 2014

I did some research on this and there is no simple solution (such as 1-to-1 mappings covering all scenarios), since several conditions have to be taken into account and, above all, some of them are locale-dependent (see, for example, the Character Properties, Case Mappings & Names FAQ).

Thus we get stuck in a circular problem: we need to normalize the string in order to identify the language, and ideally the language should be taken into account during that normalization.

Although this seems rather disappointing, I really believe results would be greatly improved if we at least performed the simple conversions (i.e., the case folding as specified by Unicode), even while disregarding the locale-dependent rules.

Of course, this casing conversion goes beyond the scope of this library, so I would propose using an external one. https://github.com/lang/unicode_utils seems to do the trick and appears to be well written, using the official specifications from Unicode. We could dynamically define a to_lowercase method which would either delegate the conversion to UnicodeUtils, if it is defined, or simply fall back to String#downcase. That way the user could optionally require the aforementioned library and it would not become a dependency. Does this sound too ugly?
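
A minimal sketch of the optional-delegation idea from the last paragraph might look like the following. It is not the code that was merged: the module name LowercaseFallback is hypothetical, and the check is done at call time rather than by defining the method dynamically. The only external call assumed is UnicodeUtils.downcase from the unicode_utils gem.

```ruby
# Hypothetical helper module; only the method name to_lowercase comes
# from the proposal above.
module LowercaseFallback
  # Lowercases text using UnicodeUtils when the user has required it,
  # otherwise falls back to Ruby's built-in String#downcase.
  def self.to_lowercase(text)
    if defined?(UnicodeUtils)
      # Assumes the unicode_utils gem was loaded by the caller.
      UnicodeUtils.downcase(text)
    else
      text.downcase
    end
  end
end

puts LowercaseFallback.to_lowercase("COR")
```

Because the constant is looked up on every call, a user can opt in simply by adding require 'unicode_utils' anywhere before the text is processed, and the library keeps working without the gem installed.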

Owner

peterc commented Mar 7, 2014

I concur. The plan for the next version of WhatLanguage mitigates this somewhat as it will include using histograms of Unicode codepoint usage, but this approach may still be useful.

I think your suggestion in the last paragraph makes sense. Do you want to have a quick attempt at it or would you prefer me to look at it?

Contributor

p-lambert commented Mar 7, 2014

I'll try something! Thanks

Contributor

p-lambert commented Mar 12, 2014

@peterc, any comments on that?

Owner

peterc commented Mar 14, 2014

I think it's a nice, gentle, mostly hands-off approach that could work for now, so thanks! I'll merge it in :-)

peterc added a commit that referenced this pull request Mar 14, 2014

Merge pull request #26 from p-lambert/unicode-lowercase
UTF8 strings lowercase conversion

@peterc peterc merged commit 4b8212e into peterc:master Mar 14, 2014
