Docs, demos and slides.

1 parent cab2ebd commit 562e2945ea9833a31b7968b6d460c5f4674c19b7 @norman committed Oct 21, 2010
Showing with 106 additions and 6 deletions.
  1. +32 −0 README.md
  2. +25 −0 databases.sql
  3. +8 −6 demos/identity.rb → equivalence.rb
  4. BIN slides.pdf
  5. +10 −0 source.rb
  6. +31 −0 strip_diacritics.rb
README.md
@@ -0,0 +1,32 @@
+# The 9th Bit: Encodings in Ruby 1.9
+
+By [Norman Clarke](http://twitter.com/compay)
+
+I hope you enjoyed my presentation at [RubyConf Brazil 2010](http://www.rubyconf.com.br/)!
+
+This repository has my slides, some code demos you can run, and some links to
+resources to get more information on encodings and Ruby.
+
+Comments? Feel free to send me an email at norman@njclarke.com.
+
+## Encoding Resources
+
+### Basic Information
+
+* Fabio Akita - [Convertendo meu Banco de Latin1 para UTF-8](http://akitaonrails.com/2010/01/01/convertendo-meu-banco-de-latin1-para-utf-8)
+* Ilya Grigorik - [Secure UTF-8 Input in Rails](http://www.igvita.com/2007/04/11/secure-utf-8-input-in-rails/)
+* Yehuda Katz - [Encodings, Unabridged](http://yehudakatz.com/2010/05/17/encodings-unabridged/)
+
+### More Advanced
+
+* James Edward Gray II - [Understanding M17N](http://blog.grayproductions.net/articles/understanding_m17n)
+* Yui Naruse - [The Design and Implementation of Ruby M17N](http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html)
+* Ben Peterson - [Unicode in Japan](http://web.archive.org/web/20080122094511/http://www.jbrowse.com/text/unij.html)
+* Brian Candler - [String19](http://github.com/candlerb/string19)
+* Otfried Cheong - [Han Unification in Unicode](http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html)
+* Ken Lunde - [CJKV Information Processing](http://oreilly.com/catalog/9780596514471) (Book)
+
+### Libraries
+
+* [Unicode](http://github.com/blackwinter/unicode)
+* [Babosa](http://github.com/norman/babosa)
databases.sql
@@ -0,0 +1,25 @@
+DROP TABLE IF EXISTS example;
+
+CREATE TABLE example (
+ note VARCHAR(20),
+ value CHAR(1)
+);
+
+-- MySQL: FAIL
+INSERT INTO example VALUES ('one byte:', 'a');
+INSERT INTO example VALUES ('two bytes:', 'ã');
+INSERT INTO example VALUES ('three bytes:', '');
+INSERT INTO example VALUES ('four bytes:', '沿');
+SELECT * FROM example;
+DELETE FROM example;
+
+-- WIN
+SET NAMES 'utf8';
+ALTER TABLE example CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
+
+INSERT INTO example VALUES ('one byte:', 'a');
+INSERT INTO example VALUES ('two bytes:', 'ã');
+INSERT INTO example VALUES ('three bytes:', '');
+INSERT INTO example VALUES ('four bytes:', '沿');
+SELECT * FROM example;
+SELECT * FROM example WHERE value = 'ã';
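The inserts above label each value by its UTF-8 width. As a rough sketch of the same idea in Ruby, the snippet below compares character count with byte count. The sample characters are hypothetical stand-ins (not the ones in `databases.sql`), written as escapes so that no editor can alter them.

```ruby
# encoding: utf-8
# Hypothetical sample characters, chosen only for their UTF-8 widths.
SAMPLES = {
  "one byte"    => "a",         # U+0061
  "two bytes"   => "\u00E3",    # precomposed a-with-tilde
  "three bytes" => "\u20AC",    # euro sign
  "four bytes"  => "\u{1F600}"  # a code point outside the Basic Multilingual Plane
}

# Returns [label, character count, byte count] for every sample.
def byte_report(samples)
  samples.map { |label, s| [label, s.length, s.bytesize] }
end

byte_report(SAMPLES).each do |label, chars, bytes|
  puts "#{label}: #{chars} character, #{bytes} byte(s)"
end
```

Every sample is a single character as far as `String#length` is concerned; only `bytesize` differs. Note that MySQL's `utf8` character set stores at most three bytes per character, so a four-byte value like the last one needs `utf8mb4`.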
demos/identity.rb → equivalence.rb
@@ -1,17 +1,19 @@
-#encoding: utf-8
+# encoding: utf-8
+# Don't edit this file and save it, or else you'll probably break the demo.
+# Run this under Ruby 1.9.
def ________________________________________________________________________________
puts "_" * 80;
end
puts "A Quick Executable Lesson on Unicode Strings"
________________________________________________________________________________
-puts "Identity with ASCII strings is pretty straightforward. If two strings *look*"
+puts "Equivalence with ASCII strings is pretty straightforward. If two strings *look*"
print "the same, they *are* the same. Here, does 'John' == 'John'? "
puts "John" == "John"
________________________________________________________________________________
-puts "But with UTF-8 it's not so straightforward, because there are often several"
-print "ways to encode non-ASCII characters. Does 'João' == 'João'? "
+puts "But with UTF-8 it's not so straightforward, because there are 2 ways"
+print "to encode some non-ASCII characters. Does 'João' == 'João'? "
puts "João" == "João"
________________________________________________________________________________
@@ -23,7 +25,7 @@ def ____________________________________________________________________________
________________________________________________________________________________
print 'But this "ã" has three bytes: '
p "a\u0303".unpack("C*")
-print 'And is two Unicode characters ("a" and "˜"): '
+print 'And is two UTF-8 characters ("a" and "˜"): '
p "a\u0303".unpack("U*")
________________________________________________________________________________
@@ -60,4 +62,4 @@ def ____________________________________________________________________________
end
________________________________________________________________________________
puts "So if you only remember one thing from this presentation, remember this:"
-puts "\n\nNormalizing your Unicode data before you save it!!!"
+puts "\n\nNormalize your Unicode data before you save it!!!"
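The demo's closing advice can be sketched without any gems, assuming Ruby 2.2 or later, where `String#unicode_normalize` is built in (the demo itself targets Ruby 1.9, which does not have it). The two strings are spelled with escapes so the composed and decomposed forms cannot be confused.

```ruby
# encoding: utf-8
# Assumption: Ruby 2.2+, which provides String#unicode_normalize.
composed   = "Jo\u00E3o"   # a-with-tilde as one code point (U+00E3)
decomposed = "Joa\u0303o"  # "a" followed by a combining tilde (U+0303)

puts composed == decomposed          # false: the code points differ
p composed.unpack("U*")              # [74, 111, 227, 111]
p decomposed.unpack("U*")            # [74, 111, 97, 771, 111]

# Normalizing both sides to the same form (NFC here) makes them comparable.
puts composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # true
```

Which form to store (NFC or NFD) matters less than picking one and applying it consistently before the data is saved.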
slides.pdf
Binary file not shown.
source.rb
@@ -0,0 +1,10 @@
+# encoding: utf-8
+# Yes, this is valid Ruby 1.9 - even though your text editor's
+# syntax highlighting will probably not think so.
+class Canção
+ GÊNEROS = [:forró, :carimbó, :afoxé]
+ attr_accessor :gênero
+end
+asa_branca = Canção.new
+asa_branca.gênero = :forró
+p asa_branca.gênero
strip_diacritics.rb
@@ -0,0 +1,31 @@
+# encoding: utf-8
+require "active_support"
+require "active_support/inflector"
+require "unicode"
+
+strings = ["ã", "ç", "ê", "ó"]
+strings2 = ["ø", "ß", "œ"]
+
+class String
+
+ def to_ascii1
+ # You'll often see this recommended as a way to "asciify" characters by
+ # stripping off accent marks. It works ok for Portuguese, but isn't a good
+ # general solution because many common Latin characters don't decompose.
+ Unicode.normalize_D(self).gsub(/[^\x00-\x7F]/, '')
+ end
+
+ def to_ascii2
+ # Instead, use a library that has transliteration tables to map the
+ # characters to a reasonable ASCII representation.
+ ActiveSupport::Inflector.transliterate(self).to_s
+ end
+end
+
+# FAIL
+p strings.map(&:to_ascii1)
+p strings2.map(&:to_ascii1)
+
+# OK
+p strings.map(&:to_ascii2)
+p strings2.map(&:to_ascii2)
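To see why the decomposition approach in `to_ascii1` falls short, here is a gem-free sketch, assuming Ruby 2.2+ for `String#unicode_normalize` in place of `Unicode.normalize_D`. The `FALLBACKS` table and both method names are hypothetical and deliberately tiny; a real transliteration library ships far larger tables.

```ruby
# encoding: utf-8
# Assumption: Ruby 2.2+, which provides String#unicode_normalize.

# Hypothetical three-entry table for characters that have no decomposition.
FALLBACKS = {
  "\u00F8" => "o",   # o with stroke
  "\u00DF" => "ss",  # sharp s
  "\u0153" => "oe"   # oe ligature
}

# Decompose, then drop everything that is not ASCII (the to_ascii1 idea).
def strip_by_decomposition(s)
  s.unicode_normalize(:nfd).gsub(/[^\x00-\x7F]/, "")
end

# Apply the table first, then fall back to decomposition for the rest.
def strip_with_table(s)
  strip_by_decomposition(s.gsub(Regexp.union(FALLBACKS.keys), FALLBACKS))
end

accented   = ["\u00E3", "\u00E7", "\u00EA", "\u00F3"]  # a-tilde, c-cedilla, e-circumflex, o-acute
undecomposable = FALLBACKS.keys

p accented.map { |s| strip_by_decomposition(s) }        # ["a", "c", "e", "o"]
p undecomposable.map { |s| strip_by_decomposition(s) }  # ["", "", ""]
p undecomposable.map { |s| strip_with_table(s) }        # ["o", "ss", "oe"]
```

The second line of output is the failure: characters without a canonical decomposition are deleted outright rather than replaced with a readable ASCII approximation.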
