# Unicode in Python

I don't have to deal with unicode issues too often but when I do, I forget this stuff every time because I don't have to deal with unicode issues too often. These are my notes to help jog that memory.

This [Ned Batchelder talk](https://nedbatchelder.com/text/unipain.html) is very helpful.

## Code points

Unicode maps "characters" to code points, or maybe it would be more accurate to say it maps code points to "characters". A code point is just a number. For example, 'A' is code point 65 and lower case Greek alpha is 945.

Code points are written U+dddd where *dddd* is the hex representation of the code point, zero-padded to at least four digits. So for the two examples, 'A' would typically be written U+0041 and the lower case alpha as U+03B1.
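To go back and forth between a character and its code point, Python has the built-ins `ord` and `chr`. A small sketch; the helper name `u_notation` is just something I made up for these notes:

```python
def u_notation(ch):
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


code_a = ord("A")    # character -> code point (an int)
alpha = chr(945)     # code point -> character
print(code_a, alpha, u_notation("A"), u_notation(alpha))
```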

The definition for "characters" is quite broad, for example

&#9730;

is a "character", the code point U+2602 (9730 in decimal; a numeric character reference in HTML and markdown can be written in decimal as `&#9730;` or in hex as `&#x2602;`).
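As a quick check of the HTML side, the standard library's `html.unescape` resolves numeric character references, both the decimal and the hex spelling. This is only a sketch to confirm the two forms name the same code point:

```python
import html

decimal_ref = html.unescape("&#9730;")   # decimal character reference
hex_ref = html.unescape("&#x2602;")      # hexadecimal character reference
print(decimal_ref, hex_ref, decimal_ref == hex_ref == "\u2602")
```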

Here it is in Python. Enter unicode code points using the `\u` escape followed by four hex digits.

In [1]:
"\u2602"

'☂'

The [www.unicode.org](http://www.unicode.org) website has all the code points meticulously catalogued. For example, all the Greek letters can be found on [this chart](http://www.unicode.org/charts/PDF/U0370.pdf).

In [2]:
"\u03b1 \u03b2"  # alpha beta

'α β'

## Encodings

Code points need to be stored on disk somehow using bytes, and this is where encodings come in.

In [3]:
"AB".encode()

b'AB'

The `b` at the beginning means it is a byte string (type `bytes`). Since we are dealing with numbers in hex, let's see it that way:

In [4]:
"AB".encode().hex()

'4142'

So what's the big deal? Different encodings produce different bytes for the same unicode, at least when the unicode contains something other than ASCII characters.

In [5]:
"\u03b1".encode()

b'\xce\xb1'

In [6]:
"\u03b1".encode("utf-8")

b'\xce\xb1'

In [7]:
"\u03b1".encode("utf-16")

b'\xff\xfe\xb1\x03'

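The default for `encode()` and `decode()` is utf-8, which is why the first two results match. The `\xff\xfe` at the start of the utf-16 result is a byte order mark (BOM): it tells a decoder that the bytes that follow are in little-endian order. Look it up if you want.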
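To see what the byte order mark is for, here is a sketch that encodes alpha with the byte order spelled out (`utf-16-le` and `utf-16-be`, which write no BOM) and then checks that plain `utf-16` starts with one of the BOM constants from the `codecs` module. Which of the two BOMs you get depends on the machine, so the check accepts either:

```python
import codecs

alpha = "\u03b1"
little = alpha.encode("utf-16-le")   # explicit byte order, no BOM written
big = alpha.encode("utf-16-be")
print(little.hex(), big.hex())

# Plain "utf-16" prepends a BOM so a decoder can tell the byte order.
with_bom = alpha.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, with_bom.decode("utf-16") == alpha)
```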

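Note that 'A' (code point U+0041) encodes to the single byte 0x41. Also note that lower case Greek alpha (code point U+03B1) encodes to the bytes 0xCE 0xB1 *when using the utf-8 encoding* and, after the two-byte byte order mark, to 0xB1 0x03 *when using the utf-16 encoding*. The utf-16 bytes are just the code point 0x03B1 stored in little-endian order.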

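This is the biggest source of confusion for me. In ASCII-compatible encodings (utf-8, latin-1, cp1252 and many others), the 7-bit ASCII code points from U+0000 up to U+007F are each encoded as a single byte that matches the code point value: 'A' (U+0041) encodes to 0x41. That is not universal, though: utf-16 and utf-32 use two and four bytes even for ASCII characters. And for anything above U+007F, the encodings can and definitely do differ.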
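A sketch comparing a handful of codecs makes this concrete. The list of encodings is my own pick, nothing special about it. Note that `utf-16-le` is the odd one out even for 'A':

```python
encodings = ["ascii", "utf-8", "latin-1", "cp1252", "utf-16-le"]

# 'A' is the single byte 0x41 in the ASCII-compatible encodings,
# but utf-16 always uses at least two bytes per character.
encoded_a = {name: "A".encode(name).hex() for name in encodings}

# U+00E9 (e with acute accent) is outside ASCII, so the encodings disagree.
# "ascii" is skipped here because it cannot encode this character at all.
encoded_e = {name: "\u00e9".encode(name).hex() for name in encodings[1:]}

print(encoded_a)
print(encoded_e)
```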

## Pain

If you have a byte string, you cannot reliably infer the encoding from the bytes alone. You must be told, or you have to guess. If you guess wrong, you will often get garbage or even an error.

In [8]:
"\u03b1".encode("utf-8").decode("utf-16")  # this should be alpha but we get something else

'뇎'

In [9]:
"\u03b1".encode("utf-16").decode("utf-8")  # utf-8 can't decode the utf-16 byte order mark

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
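When you do have to guess, `decode` takes an `errors` argument that controls what happens on bad bytes. A minimal sketch, assuming you would rather get placeholder characters than an exception:

```python
data = "\u03b1".encode("utf-16")   # bytes whose encoding we pretend not to know

try:
    text = data.decode("utf-8")    # wrong guess: raises UnicodeDecodeError
except UnicodeDecodeError:
    # errors="replace" swaps each undecodable byte for U+FFFD instead of raising
    text = data.decode("utf-8", errors="replace")

print(repr(text))
```

The result is still garbage, but it is garbage you can log and look at rather than a crash.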

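You can't mix unicode and bytes. Python 3 raises a TypeError rather than converting one to the other implicitly (the exact wording of the message depends on the Python version).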

In [10]:
"Hello " + b"World"

TypeError: Can't convert 'bytes' object to str implicitly
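The fix is to convert one side explicitly so both operands have the same type. A sketch, assuming the bytes are utf-8 (which you would have to know or be told):

```python
greeting = "Hello "
name_bytes = b"World"   # e.g. bytes read from a socket or a binary file

# Decode the bytes to get a str result...
as_text = greeting + name_bytes.decode("utf-8")
# ...or encode the str to get a bytes result.
as_bytes = greeting.encode("utf-8") + name_bytes
print(as_text, as_bytes)
```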

## Python types

Bytes are not unicode and unicode is not bytes. I'm sure that sentence has a grammatical error. Unicode in Python 3 is type `str`, by the way, and byte strings are type `bytes`.

In [11]:
type("AB")

str

In [12]:
type(b"AB")

bytes

What is an encoding? It is a rule for converting unicode to bytes, and `encode()` applies it.

In [13]:
"AB".encode()

b'AB'

You can't encode bytes--they are already encoded.

In [14]:
b"AB".encode()

AttributeError: 'bytes' object has no attribute 'encode'

And you can't decode unicode.

In [15]:
"AB".decode()

AttributeError: 'str' object has no attribute 'decode'

Decoding takes bytes and produces unicode.

In [16]:
b"AB".decode()

'AB'
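Putting the two directions together: encoding and then decoding with the same encoding gets you back where you started, and `open()` in text mode does that same work for you when reading and writing files. This is a sketch using a throwaway temp file (the file name `greek.txt` is arbitrary). I pass `encoding` explicitly because the default for `open()` can depend on the platform:

```python
import os
import tempfile

text = "\u03b1 \u03b2"

# Round trip: decoding with the same encoding recovers the original str.
round_trip = text.encode("utf-8").decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "greek.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)            # str in, utf-8 bytes written to disk
with open(path, "rb") as f:
    raw = f.read()           # the bytes as stored on disk
with open(path, encoding="utf-8") as f:
    from_file = f.read()     # decoded back to str

print(raw, from_file == text)
```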

# Misc

## Hex and such

Convert bytes to hex:

In [17]:
b'A\xbe'.hex()

'41be'

Convert hex to bytes:

In [18]:
bytes.fromhex('41be')

b'A\xbe'
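A few more conversions in the same vein, as a sketch. Note that the separator argument to `hex()` needs Python 3.8 or later:

```python
data = b"A\xbe"

byte_values = list(data)               # iterating bytes yields ints
as_int = int.from_bytes(data, "big")   # the two bytes read as one big-endian integer
spaced = data.hex(" ")                 # hex with a separator between bytes
print(byte_values, hex(as_int), spaced)
```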

## Python escape sequences

In [19]:
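def prnt(val):
    # val holds the escape sequence as literal text (note the doubled backslash
    # in the calls below); eval() wraps it in quotes and parses it the way the
    # interpreter would parse a string literal
    print(val, "evals to", eval("\"" + val + "\""))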


# \xhh: enter a character using hex where 'hh' is the hex value
prnt("\\x41")  # 0x41 = 65 which is 'A' in ASCII
prnt("\\x61")  #                    'a' in ASCII

# \ooo: enter a character as octal
prnt("\\101")  # 0101 = 65 in octal

# \uXXXX: unicode code point given as exactly 4 hex digits (16 bits)
prnt("\\u03b1")

# \N{NAME}: named unicode
prnt("\\N{GREEK SMALL LETTER ALPHA}")

\x41 evals to A
\x61 evals to a
\101 evals to A
\u03b1 evals to α
\N{GREEK SMALL LETTER ALPHA} evals to α
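Two related things that are handy here, sketched below: the `\U` escape (capital U, eight hex digits) for code points above U+FFFF, and the `unicodedata` module for going between a character and its official name. The emoji is just an arbitrary example of a code point that doesn't fit in four hex digits:

```python
import unicodedata

umbrella = "\N{UMBRELLA}"    # same character as "\u2602"
grin = "\U0001F600"          # \UXXXXXXXX: 8 hex digits, one code point
alpha_name = unicodedata.name("\u03b1")       # character -> name
looked_up = unicodedata.lookup("UMBRELLA")    # name -> character
print(umbrella, grin, alpha_name, looked_up == umbrella)
```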
