# Decoding to Unicode

In [1]:
f = open("unicode.txt", "wb")
f.write(u"€".encode("utf-8"))

In [2]:
f.close()

In [5]:
f = open("unicode.txt", "rb")
euro = f.read()
f.close()
print [ord(i) for i in euro]
print [hex(ord(i)) for i in euro]

In [6]:
type(u"€"), type(euro)

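Note that Python 2 byte strings (`str`) and `unicode` strings both have an
`encode()` and a `decode()` method.
This is unhelpful: calling `encode()` on a byte string makes Python 2
decode it implicitly first (as ASCII by default), which often leads to
confusing errors or to multiply-encoded strings that are difficult to
make sense of.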
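
As a rough illustration of how a string ends up multiply-encoded, the sketch below encodes the euro sign once, then misreads the resulting bytes as cp1252 text and encodes them again. The variable names are made up for this example, and cp1252 is only one assumed culprit; any single-byte encoding would produce a similar mess. The code uses escapes and `print(...)` with a single argument so that it should run unchanged on Python 2 and Python 3.

```python
euro_text = u"\u20ac"                    # EURO SIGN as a Unicode string
once = euro_text.encode("utf-8")         # correct: three bytes, e2 82 ac

# The mistake: treat the UTF-8 bytes as cp1252 text, then encode again.
twice = once.decode("cp1252").encode("utf-8")

print(repr(once))
print(repr(twice))                       # longer, and no longer a euro sign

# Recovery is possible only by undoing each step in reverse order.
recovered = twice.decode("utf-8").encode("cp1252").decode("utf-8")
print(recovered == euro_text)
```

The repair works here only because every byte involved happens to be defined in cp1252; in general a doubly-encoded string may not be recoverable.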

In [7]:
dir(len)

In [8]:
[n for n in dir(euro) if not n.startswith("_")]

In [9]:
help(euro.decode)

In [10]:
euro.decode("utf-8")

In [11]:
print euro.decode("utf-8")

In [12]:
print euro

Each Unicode encoding is implemented by a _codec_, which holds what is
needed to encode Unicode strings to bytes and to decode byte strings
back to Unicode.
Codecs aren't easy to inspect or understand ...
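
One way to see a codec as a single object, rather than as separate functions, is `codecs.lookup()`, which returns a `CodecInfo` record for a named encoding. The following is a minimal sketch; the variable names are illustrative only, and it should behave the same on Python 2 and Python 3.

```python
import codecs

# Look up the codec registered under the name "utf-8".
info = codecs.lookup("utf-8")
print(info.name)

# The stateless encode/decode functions return a pair:
# (result, number of input items consumed).
encoded, chars_consumed = info.encode(u"\u20ac")
decoded, bytes_consumed = info.decode(encoded)

print(repr(encoded), chars_consumed)
print(repr(decoded), bytes_consumed)
```

The second element of each pair is what makes these functions awkward to call directly, which is part of why `unicode.encode()` and `str.decode()` are normally used instead.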

In [13]:
import codecs
dir(codecs)

In [14]:
utf_16_codec = codecs.utf_16_decode

In [15]:
type(utf_16_codec)

In [16]:
dir(utf_16_codec)

In [17]:
help(utf_16_codec)

So if you need to know more, please consult the rather opaque [Python codecs documentation](http://docs.python.org/3/library/codecs.html).
But there are some things you can find out.
First of all, you can hit problems if you get the encoding wrong
when you try to decode a byte string.

In [18]:
euro

In [19]:
euro.decode("utf-8")

In [20]:
euro.decode("utf-16")
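
The failure modes of a wrong encoding differ, and the sketch below tries to show three of them side by side: a hard error, a silent wrong answer, and a lenient decode using an error handler. The byte string is the UTF-8 euro sign written out literally (so the example does not depend on `unicode.txt`), and the variable names are hypothetical.

```python
data = b"\xe2\x82\xac"   # the UTF-8 bytes for the euro sign

# 1. A hard failure: three bytes cannot be whole UTF-16 code units.
try:
    data.decode("utf-16")
    utf16_failed = False
except UnicodeDecodeError as exc:
    utf16_failed = True
    print(exc)

# 2. A silent wrong answer: cp1252 maps every one of these bytes to
#    some character, so no error is raised.
wrong = data.decode("cp1252")
print(repr(wrong))

# 3. A lenient decode: the "replace" handler substitutes U+FFFD for
#    each byte the codec cannot handle.
lenient = data.decode("ascii", "replace")
print(repr(lenient))
```

The second case is usually the most troublesome in practice, because nothing signals that the text is wrong.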

In [21]:
from encodings import cp1252

In [22]:
dir(cp1252) 

In [23]:
type(cp1252.decoding_table), type(cp1252.encoding_table)

In [24]:
len(cp1252.decoding_table)

In [25]:
[ord(c) for c in cp1252.decoding_table]

In [26]:
[ord(cp1252.encoding_table[chr(i)]) for i in range(256)]

In [27]:
dir(cp1252.encoding_table)

In [28]:
help(cp1252.encoding_table)
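
The tables inspected above can be exercised directly. As a sketch, assuming the `decoding_table` and `encoding_table` attributes explored in the cells above, the decoding table can be indexed by byte value, and both tables can be passed to the low-level `codecs.charmap_decode()` and `codecs.charmap_encode()` functions. The variable names are illustrative only.

```python
import codecs
from encodings import cp1252

table = cp1252.decoding_table

# The decoding table is a Unicode string indexed by byte value:
# byte 0x80 in cp1252 is the euro sign, U+20AC.
print(hex(ord(table[0x80])))

# The charmap functions take the table as their third argument and
# return (result, number of input items consumed).
decoded, n_bytes = codecs.charmap_decode(b"\x80", "strict", table)
encoded, n_chars = codecs.charmap_encode(
    u"\u20ac", "strict", cp1252.encoding_table)

print(repr(decoded), n_bytes)
print(repr(encoded), n_chars)
```

This is roughly what `euro.decode("cp1252")` does on your behalf, one table lookup per byte.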

### Possible Discussions

* How significant is Unicode in this environment?
* Why is Unicode handling more difficult in Python 2?

### And, of course, whatever _you_ want ...