UTF8 strings lowercase conversion #26

merged 1 commit into from

2 participants


Currently, process_text converts the given string to lowercase in order to perform the word matching, but unfortunately Ruby's String#downcase only maps ASCII letters, so non-ASCII characters in UTF-8 strings are left unchanged.
I've noticed inconsistencies like this one:

require 'whatlanguage'
puts "ÂNCORA COR ÂMBAR".language # => spanish
puts "âncora cor âmbar".language # => portuguese
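
To illustrate the underlying behaviour, here is a minimal sketch. It assumes Ruby 2.4 or later, where String#downcase accepts an :ascii option that reproduces the ASCII-only mapping older Rubies applied by default. The variable names are illustrative only.

```ruby
# encoding: utf-8
# Assumption: Ruby >= 2.4. The :ascii option restricts case mapping to A-Z,
# which is how String#downcase behaved on older Ruby versions.
upper = "ÂNCORA COR ÂMBAR"

ascii_only = upper.downcase(:ascii) # the non-ASCII "Â" is left untouched
full       = upper.downcase         # full Unicode case mapping on Ruby >= 2.4

puts ascii_only # => Âncora cor Âmbar
puts full       # => âncora cor âmbar
```

With the ASCII-only mapping, the uppercase input never matches the lowercase entries in the word lists, which would explain the different detection results.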

Thanks for the library!


This is a general problem in Ruby. Do you know of any reasonable solutions?

The problem here is that WhatLanguage, in its current form, depends on word lists, and having all combinations of casing in those lists is impractical, so we have to normalize the input somehow. Is there a better way to do this normalization?


I did some research on that and there is no simple solution (like 1-to-1 mappings covering all scenarios), since there are several conditions to be taken into account, and some of them are locale dependent (see, for example, Character Properties, Case Mappings & Names FAQ).

Thus we get stuck in a circular problem: we need to normalize the string in order to identify the language and ideally the language must be taken into account in this process of normalization.

Although this seems rather disappointing, I really believe results would be greatly improved if we at least performed those simple conversions (i.e., the case folding as specified by Unicode), even disregarding these locale dependent rules.
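
As a rough illustration of why locale matters and what simple case folding buys, here is a small sketch. It assumes Ruby 2.4 or later, where String#downcase accepts the :turkic and :fold options; it is not what the library does.

```ruby
# encoding: utf-8
# Assumption: Ruby >= 2.4 (String#downcase options).

# Locale-dependent mapping: in Turkic languages "I" lowercases to a dotless i.
default_i = "I".downcase          # => "i"
turkic_i  = "I".downcase(:turkic) # => "ı" (U+0131)

# Locale-independent Unicode case folding: not a 1-to-1 mapping either.
folded = "Straße".downcase(:fold) # => "strasse"

puts default_i, turkic_i, folded
```

The default, locale-independent mapping gives a wrong result for Turkish text, but it is still consistent, which is what the word matching needs.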

Of course this casing conversion goes beyond the scope of this library, so I would propose using an external one. UnicodeUtils seems to do the trick and appears to be well written, using official specifications from Unicode. We could dynamically define a to_lowercase method which would delegate the conversion to UnicodeUtils if it is defined, or otherwise fall back to String#downcase. That way the user could optionally require the aforementioned library and it would not be a dependency. Does this sound too ugly?
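
A minimal, self-contained sketch of that idea follows. The module name CaseNormalizer is hypothetical and used only for illustration, and the sketch assumes that UnicodeUtils.casefold is available once the unicode_utils gem is loaded.

```ruby
# Try to load the optional library; carry on silently if it is missing.
begin
  require 'unicode_utils'
rescue LoadError
  # UnicodeUtils is optional, so a missing gem is not an error.
end

# Hypothetical module name, for illustration only.
module CaseNormalizer
  # Choose the implementation once, when the module is loaded.
  if defined?(UnicodeUtils)
    def self.to_lowercase(str)
      UnicodeUtils.casefold(str)
    end
  else
    def self.to_lowercase(str)
      str.downcase
    end
  end
end

puts CaseNormalizer.to_lowercase("Hello World") # => hello world
```

Because the check runs at load time, the user has to require unicode_utils before the library for the Unicode-aware path to be picked up.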


I concur. The plan for the next version of WhatLanguage mitigates this somewhat as it will include using histograms of Unicode codepoint usage, but this approach may still be useful.
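
For reference, a codepoint histogram could look roughly like the sketch below. The helper name codepoint_histogram is hypothetical, and nothing in this thread specifies how the next version will actually build or compare such histograms.

```ruby
# encoding: utf-8
# Hypothetical helper: counts how often each Unicode codepoint occurs in text.
def codepoint_histogram(text)
  text.each_codepoint.with_object(Hash.new(0)) do |codepoint, counts|
    counts[codepoint] += 1
  end
end

histogram = codepoint_histogram("banana")
puts histogram.inspect
```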

I think your suggestion in the last paragraph makes sense. Do you want to have a quick attempt at it or would you prefer me to look at it?


I'll try something! Thanks


@peterc, any comments on that?


I think it's a nice, gentle, mostly hands-off approach that could work for now, so thanks! I'll merge it in :-)

@peterc peterc merged commit 4b8212e into peterc:master
Commits on Mar 7, 2014
  1. @p-lambert
Showing with 18 additions and 1 deletion.
  1. +7 −1 lib/whatlanguage.rb
  2. +11 −0 test/test_whatlanguage.rb
8 lib/whatlanguage.rb
@@ -33,7 +33,7 @@ def languages
def process_text(text)
results =
it = 0
- text.downcase.split.each do |word|
+ to_lowercase(text).split.each do |word|
it += 1
languages.each do |lang|
@@ -62,6 +62,12 @@ def self.filter_from_dictionary(filename) { |word| bf.add(word) }
+ if !defined? UnicodeUtils
+ define_method(:to_lowercase) { |str| str.downcase }
+ else
+ define_method(:to_lowercase) { |str| UnicodeUtils.casefold(str) }
+ end
class String
11 test/test_whatlanguage.rb
@@ -1,6 +1,12 @@
# encoding: utf-8
require "test/unit"
+# not a dependency
+begin
+  require 'unicode_utils'
+rescue LoadError
+end
require 'whatlanguage'
class TestWhatLanguage < Test::Unit::TestCase
@@ -114,4 +120,9 @@ def test_language_selection_mixed
selective_wl =, :all, :english)
assert_equal :russian, selective_wl.language("Все новости в хронологическом порядке")
+ def test_casing_conversion
+ skip unless defined? UnicodeUtils
+ assert_equal "âncora cor âmbar".language, "ÂNCORA COR ÂMBAR".language
+ end