From a5cb71a503b730ac5e901de4d91154b27f116613 Mon Sep 17 00:00:00 2001
From: Jan Lelis
Date: Wed, 25 May 2016 15:44:36 +0200
Subject: [PATCH] Add 56: "US-ASCII-8BIT"

---
 source/categories/index.html.erb      |  3 +-
 source/posts/56-us-ascii-8bit.html.md | 40 +++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 1 deletion(-)
 create mode 100644 source/posts/56-us-ascii-8bit.html.md

diff --git a/source/categories/index.html.erb b/source/categories/index.html.erb
index 35ac849..280e42d 100644
--- a/source/categories/index.html.erb
+++ b/source/categories/index.html.erb
@@ -78,9 +78,10 @@ title: Idiosyncratic Ruby - Browse by Category
  • Ruby's Initial State
  • Constant Re-Assignment
  • Identifiers to Avoid
  • -
  • Private & Deprecated Constants
  • +
  • Private & Deprecated Constants
  • .try_convert
  • .equal?, eql?, ==, ===
  • +
  • `ASCII-8BIT` vs. `US-ASCII`
  • Miscellaneous

diff --git a/source/posts/56-us-ascii-8bit.html.md b/source/posts/56-us-ascii-8bit.html.md
new file mode 100644
index 0000000..100dbcb
--- /dev/null
+++ b/source/posts/56-us-ascii-8bit.html.md
@@ -0,0 +1,40 @@
+---
+title: US-ASCII-8BIT
+date: 2016-05-26
+tags: string, encoding, ascii
+---
+
+How come Ruby has two [ASCII](https://en.wikipedia.org/wiki/ASCII) encodings?
+
+ARTICLE
+
+    Encoding.name_list.grep(/ASCII/)
+    # => ["ASCII-8BIT", "US-ASCII"]
+
+Which one is the *normal* one you should use for ASCII?
+
+## Aliases
+
+ ASCII-8BIT | US-ASCII
+------------|----------
+ BINARY     | ASCII
+            | ANSI_X3.4-1968
+            | 646
+
+So, **US-ASCII** is aliased to **ASCII**, but then what is **ASCII-8BIT** for? [Encoding's RDoc](http://ruby-doc.org/core-2.3.1/Encoding.html) has some help:
+
+    Encoding::ASCII_8BIT is a special encoding that is usually
+    used for a byte string, not a character string. But as the name insists,
+    its characters in the range of ASCII are considered as ASCII characters.
+    This is useful when you use ASCII-8BIT characters with other ASCII
+    compatible characters.
+
+So basically, it is not a real encoding, but represents an arbitrary stream of bytes (each byte having a value between 0 and 255). It is used for raw byte streams, or when you want to make clear that you do not know a string's encoding!
+
+The ASCII charset only takes 7 bits, so in strict ASCII, the 8th bit should never be set. The allowed byte value range is from 0 to 127. This is what the **US-ASCII** encoding is all about: It is used when dealing with ASCII encoded strings. Think: **"ASCII-7BIT"**
+
+A simple example illustrating the difference:
+
+    out_of_ascii_range = 128.chr # => "\x80"
+    out_of_ascii_range.force_encoding("US-ASCII").valid_encoding? # => false
+    out_of_ascii_range.force_encoding("ASCII-8BIT").valid_encoding? # => true
\ No newline at end of file
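
Note on the post's last example: the same distinction can be seen on a multi-byte string. The following is only a minimal sketch, not part of the patch. The variable names `binary`, `ascii` and `text` are made up for illustration, and it assumes the file is read as UTF-8 (the default source encoding since Ruby 2.0).

```ruby
# Illustrative sketch only; the variable names are hypothetical.
# "café" written as raw UTF-8 bytes, then copied as a byte string.
binary = "caf\xC3\xA9".b            # String#b returns a copy tagged ASCII-8BIT

# BINARY is an alias of ASCII-8BIT, and ASCII is an alias of US-ASCII.
binary.encoding == Encoding::ASCII_8BIT   # => true
Encoding::BINARY == Encoding::ASCII_8BIT  # => true
Encoding::ASCII == Encoding::US_ASCII     # => true

# Any byte sequence is valid as ASCII-8BIT, but this one is not pure ASCII.
binary.valid_encoding?              # => true
binary.ascii_only?                  # => false

# Tagged as US-ASCII, the bytes above 127 make the string invalid.
ascii = binary.dup.force_encoding(Encoding::US_ASCII)
ascii.valid_encoding?               # => false

# Tagged as UTF-8, the same five bytes form four characters.
text = binary.dup.force_encoding(Encoding::UTF_8)
text.valid_encoding?                # => true
binary.length                       # => 5
text.length                         # => 4
```

`force_encoding` only changes the encoding tag and leaves the bytes untouched, which is why the three strings above compare differently while holding identical bytes.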