Docs, demos and slides.

norman · Oct 27, 2010 · 562e294
1 parent cab2ebd

Showing 6 changed files with 106 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,32 @@
+# The 9th Bit: Encodings in Ruby 1.9
+
+By [Norman Clarke](http://twitter.com/compay)
+
+I hope you enjoyed my presentation at [RubyConf Brazil 2010](http://www.rubyconf.com.br/)!
+
+This repository has my slides, some code demos you can run, and links to
+resources for learning more about encodings and Ruby.
+
+Comments? Feel free to send me an email at norman@njclarke.com.
+
+## Encoding Resources
+
+### Basic Information
+
+* Fabio Akita - [Convertendo meu Banco de Latin1 para UTF-8](http://akitaonrails.com/2010/01/01/convertendo-meu-banco-de-latin1-para-utf-8)
+* Ilya Grigorik - [Secure UTF-8 Input in Rails](http://www.igvita.com/2007/04/11/secure-utf-8-input-in-rails/)
+* Yehuda Katz - [Encodings, Unabridged](http://yehudakatz.com/2010/05/17/encodings-unabridged/)
+
+### More Advanced
+
+* James Edward Gray II - [Understanding M17N](http://blog.grayproductions.net/articles/understanding_m17n)
+* Yui Naruse - [The Design and Implementation of Ruby M17N](http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html)
+* Ben Peterson - [Unicode in Japan](http://web.archive.org/web/20080122094511/http://www.jbrowse.com/text/unij.html)
+* Brian Candler - [String19](http://github.com/candlerb/string19)
+* Otfried Cheong - [Han Unification in Unicode](http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html)
+* Ken Lunde - [CJKV Information Processing](http://oreilly.com/catalog/9780596514471) (Book)
+
+### Libraries
+
+* [Unicode](http://github.com/blackwinter/unicode)
+* [Babosa](http://github.com/norman/babosa)
diff --git a/databases.sql b/databases.sql
@@ -0,0 +1,25 @@
+DROP TABLE IF EXISTS example;
+
+CREATE TABLE example (
+  note VARCHAR(20),
+  value CHAR(1)
+);
+
+-- MySQL: FAIL
+INSERT INTO example VALUES ('one byte:',    'a');
+INSERT INTO example VALUES ('two bytes:',   'ã');
+INSERT INTO example VALUES ('three bytes:', 'の');
+INSERT INTO example VALUES ('four bytes:',  '沿');
+SELECT * FROM example;
+DELETE FROM example;
+
+-- WIN
+SET NAMES 'utf8';
+ALTER TABLE example CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
+
+INSERT INTO example VALUES ('one byte:',    'a');
+INSERT INTO example VALUES ('two bytes:',   'ã');
+INSERT INTO example VALUES ('three bytes:', 'の');
+INSERT INTO example VALUES ('four bytes:',  '沿');
+SELECT * FROM example;
+SELECT * FROM example WHERE value = 'ã';
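
The labels in the INSERT statements refer to how many bytes each character occupies in UTF-8. A minimal Ruby sketch for checking those sizes follows. The `samples` hash and the use of U+20000 as a four-byte stand-in are assumptions for illustration and are not part of this commit.

```ruby
# encoding: utf-8
# Sketch: byte length of sample characters when encoded as UTF-8.
samples = {
  "one byte"    => "a",
  "two bytes"   => "\u00E3",    # ã
  "three bytes" => "\u306E",    # の
  "four bytes"  => "\u{20000}"  # assumed stand-in: an ideograph outside the BMP
}

samples.each do |label, char|
  puts "#{label}: #{char.bytesize}"
end
```

As rendered in this listing, the last test character is U+6CBF, which lies in the Basic Multilingual Plane and therefore encodes to three bytes. A true four-byte sequence needs a code point above U+FFFF.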
diff --git a/demos/identity.rb → equivalence.rb b/demos/identity.rb → equivalence.rb
@@ -1,17 +1,19 @@
-#encoding: utf-8
+# encoding: utf-8
+# Don't edit this file and save it, or else you'll probably break the demo.
+# Run this under Ruby 1.9.
 def ________________________________________________________________________________
   puts "_" * 80;
 end
 
 puts "A Quick Executable Lesson on Unicode Strings"
 ________________________________________________________________________________
-puts "Identity with ASCII strings is pretty straightforward. If two strings *look*"
+puts "Equivalence with ASCII strings is pretty straightforward. If two strings *look*"
 print "the same, they *are* the same. Here, does 'John' == 'John'? "
 
 puts "John" == "John"
 ________________________________________________________________________________
-puts "But with UTF-8 it's not so straightforward, because there are often several"
-print "ways to encode non-ASCII characters. Does 'João' == 'João'? "
+puts "But with UTF-8 it's not so straightforward, because there are 2 ways"
+print "to encode some non-ASCII characters. Does 'João' == 'João'? "
 puts "João" == "João"
 ________________________________________________________________________________
 
@@ -23,7 +25,7 @@ def ____________________________________________________________________________
 ________________________________________________________________________________
 print 'But this "ã" has three bytes: '
 p "ã".unpack("C*")
-print 'And is two Unicode characters ("a" and "˜"): '
+print 'And is two Unicode code points ("a" and "˜"): '
 p "ã".unpack("U*")
 
 ________________________________________________________________________________
@@ -60,4 +62,4 @@ def ____________________________________________________________________________
 end
 ________________________________________________________________________________
 puts "So if you only remember one thing from this presentation, remember this:"
-puts "\n\nNormalizing your Unicode data before you save it!!!"
+puts "\n\nNormalize your Unicode data before you save it!!!"
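
The closing advice of the demo can be sketched as follows. This assumes Ruby 2.2 or later, where `String#unicode_normalize` is part of the standard library; the demo itself targets Ruby 1.9, which has no such method. The strings are written with escapes so that an editor cannot silently change their form, and the variable names are illustrative only.

```ruby
# encoding: utf-8
# Sketch: two visually identical strings that differ at the code point level.
composed   = "Jo\u00E3o"    # "ã" as a single code point (U+00E3)
decomposed = "Joa\u0303o"   # "a" followed by a combining tilde (U+0303)

p composed == decomposed            # false
p composed.bytesize                 # 5
p decomposed.bytesize               # 6
p composed.unpack("U*").length      # 4
p decomposed.unpack("U*").length    # 5

# Normalizing both to the same form (NFC here) before comparing or saving
# makes them equal.
normalized = [composed, decomposed].map { |s| s.unicode_normalize(:nfc) }
p normalized.uniq.length            # 1
```

Either NFC or NFD works for comparison, as long as the same form is applied consistently to everything that gets stored.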
diff --git a/slides.pdf b/slides.pdf
diff --git a/source.rb b/source.rb
@@ -0,0 +1,10 @@
+# encoding: utf-8
+# Yes, this is valid Ruby 1.9 - even though your text editor's
+# syntax highlighting will probably not think so.
+class Canção
+  GÊNEROS = [:forró, :carimbó, :afoxé]
+  attr_accessor :gênero
+end
+asa_branca = Canção.new
+asa_branca.gênero = :forró
+p asa_branca.gênero
diff --git a/strip_diacritics.rb b/strip_diacritics.rb
@@ -0,0 +1,31 @@
+# encoding: utf-8
+require "active_support"
+require "active_support/inflector"
+require "unicode"
+
+strings = ["ã", "ç", "ê", "ó"]
+strings2 = ["ø", "ß", "œ"]
+
+class String
+
+  def to_ascii1
+    # You'll often see this recommended as a way to "asciify" characters by
+    # stripping off accent marks. It works ok for Portuguese, but isn't a good
+    # general solution because many common Latin characters don't decompose.
+    Unicode.normalize_D(self).gsub(/[^\x00-\x7F]/, '')
+  end
+
+  def to_ascii2
+    # Instead, use a library that has transliteration tables to map the
+    # characters to a reasonable ASCII representation.
+    ActiveSupport::Inflector.transliterate(self).to_s
+  end
+end
+
+# FAIL
+p strings.map(&:to_ascii1)
+p strings2.map(&:to_ascii1)
+
+# OK
+p strings.map(&:to_ascii2)
+p strings2.map(&:to_ascii2)
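
The difference between the two approaches can also be seen without the `unicode` gem or ActiveSupport. The sketch below assumes Ruby 2.2 or later for `String#unicode_normalize`. `FALLBACKS`, `strip_marks` and `approximate_ascii` are hypothetical names, and the three-entry table only stands in for the much larger tables a real transliteration library ships.

```ruby
# encoding: utf-8
# Hypothetical, deliberately tiny table for characters that do not decompose.
FALLBACKS = { "ø" => "o", "ß" => "ss", "œ" => "oe" }.freeze

# Decompose, then drop every non-ASCII code point (the "strip accents" trick).
def strip_marks(string)
  string.unicode_normalize(:nfd).gsub(/[^\x00-\x7F]/, "")
end

# Apply the lookup table first, then strip whatever combining marks remain.
def approximate_ascii(string)
  strip_marks(string.gsub(/[øßœ]/, FALLBACKS))
end

p %w[ã ç ê ó].map { |s| strip_marks(s) }        # ["a", "c", "e", "o"]
p %w[ø ß œ].map { |s| strip_marks(s) }          # ["", "", ""]
p %w[ø ß œ].map { |s| approximate_ascii(s) }    # ["o", "ss", "oe"]
```

Stripping marks alone silently deletes characters such as "ø", which is why a mapping table is needed for anything beyond accented letters.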