Docs, demos and slides.
norman committed Oct 27, 2010
1 parent cab2ebd commit 562e294
Showing 6 changed files with 106 additions and 6 deletions.
32 changes: 32 additions & 0 deletions README.md
@@ -0,0 +1,32 @@
# The 9th Bit: Encodings in Ruby 1.9

By [Norman Clarke](http://twitter.com/compay)

I hope you enjoyed my presentation at [RubyConf Brazil 2010](http://www.rubyconf.com.br/)!

This repository has my slides, some code demos you can run, and some links to
resources to get more information on encodings and Ruby.

Comments? Feel free to send me an email at norman@njclarke.com.

## Encoding Resources

### Basic Information

* Fabio Akita - [Convertendo meu Banco de Latin1 para UTF-8](http://akitaonrails.com/2010/01/01/convertendo-meu-banco-de-latin1-para-utf-8)
* Ilya Grigorik - [Secure UTF-8 Input in Rails](http://www.igvita.com/2007/04/11/secure-utf-8-input-in-rails/)
* Yehuda Katz - [Encodings, Unabridged](http://yehudakatz.com/2010/05/17/encodings-unabridged/)

### More Advanced

* James Edward Gray II - [Understanding M17N](http://blog.grayproductions.net/articles/understanding_m17n)
* Yui Naruse - [The Design and Implementation of Ruby M17N](http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html)
* Ben Peterson - [Unicode in Japan](http://web.archive.org/web/20080122094511/http://www.jbrowse.com/text/unij.html)
* Brian Candler - [String19](http://github.com/candlerb/string19)
* Otfried Cheong - [Han Unification in Unicode](http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html)
* Ken Lunde - [CJKV Information Processing](http://oreilly.com/catalog/9780596514471) (Book)

### Libraries

* [Unicode](http://github.com/blackwinter/unicode)
* [Babosa](http://github.com/norman/babosa)
25 changes: 25 additions & 0 deletions databases.sql
@@ -0,0 +1,25 @@
DROP TABLE IF EXISTS example;

CREATE TABLE example (
  note VARCHAR(20),
  value CHAR(1)
);

-- MySQL: FAIL
INSERT INTO example VALUES ('one byte:', 'a');
INSERT INTO example VALUES ('two bytes:', 'ã');
INSERT INTO example VALUES ('three bytes:', '');
INSERT INTO example VALUES ('four bytes:', '沿');
SELECT * FROM example;
DELETE FROM example;

-- WIN
SET NAMES 'utf8';
ALTER TABLE example CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

INSERT INTO example VALUES ('one byte:', 'a');
INSERT INTO example VALUES ('two bytes:', 'ã');
INSERT INTO example VALUES ('three bytes:', '');
INSERT INTO example VALUES ('four bytes:', '沿');
SELECT * FROM example;
SELECT * FROM example WHERE value = 'ã';
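
The labels in the SQL demo refer to how many bytes each character takes in UTF-8. A minimal Ruby sketch for checking such counts follows. The sample characters and the names `SAMPLES` and `utf8_byte_counts` are hypothetical and chosen only for illustration; they are not taken from the demo.

```ruby
# encoding: utf-8
# Hypothetical sample characters, written as escapes so the file survives
# editors and tools that mangle non-ASCII text.
SAMPLES = {
  "one byte"    => "a",
  "two bytes"   => "\u00E3",    # precomposed a-with-tilde
  "three bytes" => "\u20AC",    # euro sign
  "four bytes"  => "\u{1F600}"  # a code point outside the Basic Multilingual Plane
}

# Returns [label, byte count] pairs for each sample, measured in UTF-8.
def utf8_byte_counts(samples)
  samples.map { |label, char| [label, char.encode("UTF-8").bytesize] }
end

utf8_byte_counts(SAMPLES).each do |label, bytes|
  puts "#{label}: #{bytes}"
end
```

Running the same check on the values you plan to insert is a quick way to confirm that a label such as 'three bytes:' really matches the character next to it.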
14 changes: 8 additions & 6 deletions demos/identity.rb → equivalence.rb
@@ -1,17 +1,19 @@
-#encoding: utf-8
+# encoding: utf-8
# Don't edit this file and save it, or else you'll probably break the demo.
# Run this under Ruby 1.9.
def ________________________________________________________________________________
  puts "_" * 80
end

puts "A Quick Executable Lesson on Unicode Strings"
________________________________________________________________________________
-puts "Identity with ASCII strings is pretty straightforward. If two strings *look*"
+puts "Equivalence with ASCII strings is pretty straightforward. If two strings *look*"
print "the same, they *are* the same. Here, does 'John' == 'John'? "

puts "John" == "John"
________________________________________________________________________________
-puts "But with UTF-8 it's not so straightforward, because there are often several"
-print "ways to encode non-ASCII characters. Does 'João' == 'João'? "
+puts "But with UTF-8 it's not so straightforward, because there are 2 ways"
+print "to encode some non-ASCII characters. Does 'João' == 'João'? "
puts "João" == "João"
________________________________________________________________________________

@@ -23,7 +25,7 @@ def ____________________________________________________________________________
________________________________________________________________________________
print 'But this "ã" has three bytes: '
p "ã".unpack("C*")
-print 'And is two Unicode characters ("a" and "˜"): '
+print 'And is two UTF-8 characters ("a" and "˜"): '
p "ã".unpack("U*")

________________________________________________________________________________
@@ -60,4 +62,4 @@ def ____________________________________________________________________________
end
________________________________________________________________________________
puts "So if you only remember one thing from this presentation, remember this:"
-puts "\n\nNormalizing your Unicode data before you save it!!!"
+puts "\n\nNormalize your Unicode data before you save it!!!"
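
The demo's closing advice can be sketched in a few lines. This is only an illustration under one loud assumption: it uses `String#unicode_normalize`, which ships with Ruby 2.2 and later and is not available in the Ruby 1.9 the demo targets (there, a library such as the `unicode` gem would be needed). The names `COMPOSED`, `DECOMPOSED` and `same_text?` are hypothetical.

```ruby
# encoding: utf-8
# Assumption: Ruby 2.2 or later, for String#unicode_normalize.
COMPOSED   = "Jo\u00E3o"   # a-with-tilde as one code point (U+00E3)
DECOMPOSED = "Joa\u0303o"  # "a" followed by a combining tilde (U+0303)

# Compares two strings after bringing both to the same normalization form.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts COMPOSED == DECOMPOSED            # false: the byte sequences differ
puts same_text?(COMPOSED, DECOMPOSED)  # true: identical once normalized

p COMPOSED.unpack("U*")    # code points of the composed form
p DECOMPOSED.unpack("U*")  # code points of the decomposed form
```

Normalizing once, at the point where data is saved, means later comparisons and lookups can use plain `==` instead of normalizing on every read.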
Binary file added slides.pdf
Binary file not shown.
10 changes: 10 additions & 0 deletions source.rb
@@ -0,0 +1,10 @@
# encoding: utf-8
# Yes, this is valid Ruby 1.9 - even though your text editor's
# syntax highlighting will probably not think so.
class Canção
  GÊNEROS = [:forró, :carimbó, :afoxé]
  attr_accessor :gênero
end
asa_branca = Canção.new
asa_branca.gênero = :forró
p asa_branca.gênero
31 changes: 31 additions & 0 deletions strip_diacritics.rb
@@ -0,0 +1,31 @@
# encoding: utf-8
require "active_support"
require "active_support/inflector"
require "unicode"

strings = ["ã", "ç", "ê", "ó"]
strings2 = ["ø", "ß", "œ"]

class String

  def to_ascii1
    # You'll often see this recommended as a way to "asciify" characters by
    # stripping off accent marks. It works ok for Portuguese, but isn't a good
    # general solution because many common Latin characters don't decompose.
    Unicode.normalize_D(self).gsub(/[^\x00-\x7F]/, '')
  end

  def to_ascii2
    # Instead, use a library that has transliteration tables to map the
    # characters to a reasonable ASCII representation.
    ActiveSupport::Inflector.transliterate(self).to_s
  end
end

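# FAIL: fine for the Portuguese strings, but the characters in strings2
# don't decompose, so they are stripped entirely.
p strings.map(&:to_ascii1)
p strings2.map(&:to_ascii1)

# OK: the transliteration tables cover both sets.
p strings.map(&:to_ascii2)
p strings2.map(&:to_ascii2)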
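
The limitation of the decompose-and-strip approach can be reproduced without any gems. The sketch below is not the script's own code: it assumes Ruby 2.2 or later and swaps in the standard library's `String#unicode_normalize(:nfd)` for `Unicode.normalize_D`. The names `strip_after_decomposing`, `PORTUGUESE` and `OTHERS` are hypothetical.

```ruby
# encoding: utf-8
# Assumption: Ruby 2.2 or later; unicode_normalize(:nfd) stands in for the
# unicode gem's Unicode.normalize_D used in the script above.
PORTUGUESE = ["\u00E3", "\u00E7", "\u00EA", "\u00F3"]  # a-tilde, c-cedilla, e-circumflex, o-acute
OTHERS     = ["\u00F8", "\u00DF", "\u0153"]            # o-slash, sharp s, oe ligature

# Decomposes the string, then drops every non-ASCII character.
def strip_after_decomposing(string)
  string.unicode_normalize(:nfd).gsub(/[^\x00-\x7F]/, "")
end

p PORTUGUESE.map { |s| strip_after_decomposing(s) }  # ["a", "c", "e", "o"]
p OTHERS.map { |s| strip_after_decomposing(s) }      # ["", "", ""]
```

The second result is the failure mode: characters with no canonical decomposition have no ASCII base letter to fall back on, so they vanish. That is why a transliteration table is the safer choice for general input.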
