#Encoding from Unicode

When your editing system is capable of handling Unicode (and the IPython Notebook certainly is), you can include Unicode characters directly in your string literals, but not in your byte strings. Unfortunately, the only reliable mechanism for communication with arbitrary endpoints is a stream of bytes.

The issue here is that each character (or one-byte escape sequence) is supposed to appear in the bytestring as one byte. There is, for example, no way to represent the Euro symbol in a single byte. The question then arises as to how you write Unicode strings out to the byte streams that at base are the only way to communicate with storage devices and network endpoints.
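
A quick check makes the size problem concrete. What follows is a minimal Python 3 sketch rather than part of the lesson's own cells, and the variable names are purely illustrative. It shows that the Euro sign's code point is too large for one byte, and that a one-byte-per-character encoding such as Latin-1 refuses to encode it.

```python
euro = "€"

# The code point is far above 255, the largest value a single byte can hold.
code_point = ord(euro)
print(code_point, hex(code_point))

# Latin-1 maps every character to exactly one byte, so it has no room for the Euro sign.
try:
    euro.encode("latin-1")
    fits_in_one_byte = True
except UnicodeEncodeError as exc:
    fits_in_one_byte = False
    print("cannot encode:", exc.reason)
```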

The answer is to adopt encodings, but a further complication arises: how do two parties agree on which encoding to use? Sometimes people just use the default encoding (UTF-8 in Python 3), and even that policy goes a long way. Some network protocols, such as HTTP, allow headers to specify the encoding of some or all of the message content. We will not pursue the question of agreement any further here.

Instead we will focus on how you can convert the Unicode string `"Please pay €9.99 at exit"` for storage or transmission.

So let's look at how the interpreter undertakes this fascinating task. We do so by the brute-force method of writing a text file out and then reading it back in binary.

In [None]:
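s1 = 'Please pay €9.99 at exit'
# Python 3: name the encoding when opening the file in text mode
o_f = open("unicode.txt", "w", encoding="utf-8")
print(type(o_f))

In [None]:
o_f.write(s1)  # the text wrapper encodes the string on its way to disk
o_f.close()
i_f = open("unicode.txt", "rb")

In [None]:
b1 = i_f.read()
# show the decoded text alongside the raw bytes that were stored
print(b1.decode("utf-8"), repr(b1))
i_f.close()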

The interpreter has clearly written something other than a single Euro character to the file: the Euro sign went out as the three-byte sequence `b'\xe2\x82\xac'`.
Any program that shows those three bytes as a Euro sign, as a web browser does when it is handed UTF-8 text, is interpreting them the same way.

You may wonder, if you happen to remember that the Euro sign is Unicode code point `U+20AC` (`0x20ac`), why the byte pair `\x20\xac` doesn't appear in `b1`. The answer is that the byte string `b1` represents an ___encoding___ of the (Unicode) string `'Please pay €9.99 at exit'`. In Python 3 the `io.TextIOWrapper` object that a text-mode `open()` returns takes your Unicode string and turns it into a sequence of bytes. In fact the three-byte sequence `b'\xe2\x82\xac'` is how the code point is encoded in a scheme known as ___UTF-8___.
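
To see where those three bytes come from, the UTF-8 bit layout for this range of code points can be worked through by hand. The function below is a hypothetical helper written for illustration only. It covers just the three-byte range (U+0800 to U+FFFF) and does not reject surrogates, so treat it as a sketch of the scheme rather than a replacement for the built-in codec.

```python
def utf8_three_bytes(code_point):
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes (illustrative only)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    return bytes([
        0b11100000 | (code_point >> 12),          # lead byte: 1110 + top 4 bits
        0b10000000 | ((code_point >> 6) & 0x3F),  # continuation: 10 + middle 6 bits
        0b10000000 | (code_point & 0x3F),         # continuation: 10 + low 6 bits
    ])

by_hand = utf8_three_bytes(0x20AC)
print(by_hand)

# Compare against the interpreter's own UTF-8 codec.
print(by_hand == "€".encode("utf-8"))
```

The 16 bits of `0x20AC` are split 4 + 6 + 6 across the three bytes, which is why neither `0x20` nor `0xac` survives as the leading bytes of the sequence.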

In [None]:
# the UTF-8 bytes of the Euro sign, and its code point
"€".encode("utf-8"), hex(ord("€"))

The system used to write these lessons had UTF-8 as its default encoding, which has many advantages in this role.
As discussed elsewhere, there are good reasons why many different encodings exist, and by using Unicode your programs allow users to work with their favored encodings. Encoding and decoding are performed by __codecs__. The [Python standard library](http://docs.python.org/3/library/codecs.html#standard-encodings) provides codecs for many of the more popular encodings.
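
As a rough illustration of how much the byte representation depends on the codec, the following Python 3 sketch encodes the Euro sign with four standard-library encodings that are known to include it. The choice of encodings and the name `encoded_forms` are assumptions made for this example.

```python
import codecs

euro = "€"
names = ("utf-8", "utf-16-be", "cp1252", "iso-8859-15")

# Encode the same character with each codec and keep the resulting bytes.
encoded_forms = {name: euro.encode(name) for name in names}

for name, encoded in encoded_forms.items():
    # codecs.lookup() returns the CodecInfo registered under this name or alias
    canonical = codecs.lookup(name).name
    print("{0:12} {1:12} {2}".format(name, canonical, encoded.hex()))
```

The same code point comes out as one, two, or three bytes depending on the codec, which is why both ends of a conversation need to agree on the encoding.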

In [None]:
import sys
sys.getdefaultencoding()
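
It may help to separate two different "defaults" in Python 3. The sketch below assumes a Python 3 interpreter: `sys.getdefaultencoding()` reports the encoding used for implicit string conversions, while the encoding that a text-mode `open()` uses when you pass none comes from the locale and can differ between platforms.

```python
import locale
import sys

# The interpreter-wide default for str/bytes conversions.
default_codec = sys.getdefaultencoding()
print(default_codec)

# What text-mode open() falls back to when no encoding argument is given.
# This depends on the platform and its settings.
print(locale.getpreferredencoding(False))

# str.encode() with no argument uses UTF-8.
implicit = "€".encode()
print(implicit)
```

Because the file default is platform dependent, passing `encoding=` explicitly to `open()` is the safer habit.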

When you open a file in Python 3 you can specify the encoding you want to use. With Python 2's built-in `open()` you have to perform the encoding explicitly before writing.
Not all encodings can handle the full range of Unicode code points. For example, the IBM775 encoding has never been updated to include the Euro symbol, so it cannot represent our string.

In [None]:
o_f = open("unicode.txt", "w", encoding="IBM775")
print(type(o_f))
o_f.write(s1)  # raises UnicodeEncodeError: IBM775 has no Euro sign
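
An encoding that lacks a character does not have to mean a crash. The `errors` argument to `str.encode()` (also accepted by `open()`) selects what happens instead. This is a minimal Python 3 sketch with illustrative variable names, and which policy is appropriate depends on your application.

```python
s1 = "Please pay €9.99 at exit"

# Default ("strict") behaviour: the first unencodable character raises an error.
try:
    s1.encode("IBM775")
    strict_result = "encoded"
except UnicodeEncodeError as exc:
    strict_result = "failed on {0!r}".format(exc.object[exc.start:exc.end])

# "replace" substitutes a question mark; "backslashreplace" keeps the code point visible.
replaced = s1.encode("IBM775", errors="replace")
escaped = s1.encode("IBM775", errors="backslashreplace")

print(strict_result)
print(replaced)
print(escaped)
```

Both alternatives lose the original character in the output bytes, so they trade correctness for robustness.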

In [None]:
o_f = open("unicode.txt", "w", encoding="utf-16-be")
print(type(o_f))

In [None]:
o_f.write(s1)  # encoded as big-endian UTF-16 on the way out
o_f.close()
i_f = open("unicode.txt", "rb")
print(type(i_f))
b1 = i_f.read()
In [None]:
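l1 = l2 = ""
for b in b1:  # iterating over bytes yields integers in Python 3
    l1 += "{0:02X}|".format(b)
    # show printable ASCII as-is and any other byte as a blank
    l2 += "{0:2}|".format(chr(b) if 32 <= b < 127 else " ")
i_f.close()
print(l2)
print(l1)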

In [None]:
be = b'\x20\xac'

In [None]:
be.decode('utf-16-be')
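
Decoding only recovers the original text when it uses the same codec that produced the bytes. The short Python 3 sketch below (variable names are illustrative) decodes the UTF-8 bytes of the Euro sign twice, once with the matching codec and once with `cp1252`, a mismatch chosen here purely as an example.

```python
utf8_bytes = "€".encode("utf-8")

# Same codec on both sides: the round trip is lossless.
right = utf8_bytes.decode("utf-8")

# Mismatched codec: no error is raised, but the text is wrong.
wrong = utf8_bytes.decode("cp1252")

print(right)
print(wrong)
```

A mismatch does not always raise an exception, which is what makes this kind of bug easy to miss.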

###Possible Discussions

* Which encodings might you want to use?
* Are there advantages to one encoding over another?

###And, of course, whatever _you_ want ...