Add 56: "US-ASCII-8BIT"
janlelis committed May 25, 2016
1 parent 9438699 commit a5cb71a
Showing 2 changed files with 42 additions and 1 deletion.
3 changes: 2 additions & 1 deletion source/categories/index.html.erb
@@ -78,9 +78,10 @@ title: Idiosyncratic Ruby - Browse by Category
<li><a href="/43-new-ruby-startup.html">Ruby's Initial State</a></li>
<li><a href="/45-constant-shuffle">Constant Re-Assignment</a></li>
<li><a href="/50-naming-too-good.html">Identifiers to Avoid</a></li>
<li><a href="/52-constant-visibility.html">Private & Deprecated Constants</a></li>
<li><a href="/52-constant-visibility.html">Private &amp; Deprecated Constants</a></li>
<li><a href="/54-try-converting.html"><code>.try_convert</code></a></li>
<li><a href="/55-struggling-four-equality.html"><code>.equal?</code>, <code>eql?</code>, <code>==</code>, <code>===</code></a></li>
<li><a href="/56-us-ascii-8bit.html"><code>ASCII-8BIT</code> vs. <code>US-ASCII</code></a></li>
</ul>

<h2>Miscellaneous</h2>
40 changes: 40 additions & 0 deletions source/posts/56-us-ascii-8bit.html.md
@@ -0,0 +1,40 @@
---
title: US-ASCII-8BIT
date: 2016-05-26
tags: string, encoding, ascii
---

How come Ruby has two [ASCII](https://en.wikipedia.org/wiki/ASCII) encodings?

ARTICLE

    Encoding.name_list.grep(/ASCII/)
    # => ["ASCII-8BIT", "US-ASCII"]

Which one is the *normal* one you should use for ASCII?

## Aliases

ASCII-8BIT | US-ASCII
------------|----------
BINARY | ASCII
| ANSI_X3.4-1968
| 646
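
The alias table can be checked from Ruby itself. The following is a minimal sketch that assumes only the core `Encoding.find` method and the `Encoding` constants; the local variable names are made up for illustration:

```ruby
# Every alias resolves to the same Encoding object as the canonical name.
binary_aliases = %w[BINARY ASCII-8BIT].map { |name| Encoding.find(name) }
ascii_aliases  = %w[ASCII US-ASCII ANSI_X3.4-1968 646].map { |name| Encoding.find(name) }

binary_aliases.uniq == [Encoding::ASCII_8BIT] # => true
ascii_aliases.uniq  == [Encoding::US_ASCII]   # => true

# The aliases are also available as constants pointing at the same objects.
Encoding::BINARY.equal?(Encoding::ASCII_8BIT) # => true
Encoding::ASCII.equal?(Encoding::US_ASCII)    # => true

# The two families are distinct encodings.
Encoding::BINARY == Encoding::US_ASCII        # => false
```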

So, **US-ASCII** is aliased to **ASCII**, but then what is **ASCII-8BIT** for? [Encoding's RDoc](http://ruby-doc.org/core-2.3.1/Encoding.html) offers some help:

Encoding::ASCII_8BIT is a special encoding that is usually
used for a byte string, not a character string. But as the name insists,
its characters in the range of ASCII are considered as ASCII characters.
This is useful when you use ASCII-8BIT characters with other ASCII
compatible characters.

So basically, it is not a real encoding, but represents an arbitrary stream of bytes (each with a value between 0 and 255). It is used for raw byte streams, or to make clear that you do not know a string's encoding!
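
To see what "a byte string, not a character string" means in practice, here is a small sketch using `Array#pack` and `String#b`, two core methods that return binary strings. The variable names and the sample word are only illustrative:

```ruby
# Packing raw bytes yields a binary (ASCII-8BIT) string.
raw = [0, 127, 128, 255].pack("C*")
raw.encoding == Encoding::ASCII_8BIT # => true
raw.bytes                            # => [0, 127, 128, 255]
raw.valid_encoding?                  # => true, every byte value is allowed

# String#b returns a copy tagged as binary; the bytes themselves are unchanged.
text  = "r\u00e9sum\u00e9"           # UTF-8 string with two 2-byte characters
bytes = text.b
bytes.encoding == Encoding::BINARY   # => true
bytes.bytes == text.bytes            # => true
text.length                          # => 6 (characters)
bytes.length                         # => 8 (bytes)
```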

The ASCII charset only takes 7 bits, so in strict ASCII, the 8th bit of a byte should never be set. The allowed byte value range is from 0 to 127. This is what the **US-ASCII** encoding is all about: It is used when dealing with ASCII encoded strings. Think: **"ASCII-7BIT"**

A simple example illustrating the difference:

    out_of_ascii_range = 128.chr # => "\x80"
    out_of_ascii_range.force_encoding("US-ASCII").valid_encoding? # => false
    out_of_ascii_range.force_encoding("ASCII-8BIT").valid_encoding? # => true
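
The difference also shows up when transcoding. The sketch below is not part of the example above; it assumes the core `String#ascii_only?` and `String#encode` methods, and the variable names are hypothetical:

```ruby
high_byte = "\x80".b # one byte with the 8th bit set, tagged as binary

# ascii_only? is true only when every byte is below 128.
"abc".b.ascii_only?   # => true
high_byte.ascii_only? # => false

# Binary defines no character for bytes >= 128, so the conversion is undefined.
binary_error =
  begin
    high_byte.encode("UTF-8")
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# Tagged as US-ASCII, the same byte is an invalid byte sequence instead.
ascii_error =
  begin
    high_byte.dup.force_encoding("US-ASCII").encode("UTF-8")
  rescue Encoding::InvalidByteSequenceError => e
    e.class
  end
```

Both conversions fail, but for different reasons: the binary string is valid yet has no mapping to characters, while the US-ASCII string is simply malformed.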
