# Character representation schemes
- a representation scheme is a mapping between integers and characters

# Limitations of ASCII
- ASCII is a 7-bit code (128 characters), so 1 bit of each byte was not used, leaving some room for extensions, but one byte is not enough to represent all the characters in the world
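
As a small illustration of that limit, the sketch below uses Python's built-in `ascii` codec; the sample strings are made up for the example. A character above code point 127 cannot be encoded.

```python
# ASCII covers code points 0..127 only (7 bits).
plain = "cafe"
accented = "caf\u00e9"   # same word with U+00E9 as the last character

ascii_bytes = plain.encode("ascii")   # works: every character is < 128

try:
    accented.encode("ascii")
    encodable = True
except UnicodeEncodeError:
    # U+00E9 (decimal 233) does not fit in 7 bits
    encodable = False

print(ascii_bytes, encodable)
```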



# Unicode
- "universal character set"
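- has room for over a million code points (1,114,112); roughly 150,000 characters are assigned so far
- aims to cover the writing systems of every language on earth
    - somebody tried to add Klingon, but it was rejected
- each character is represented by a unique integer (its code point)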
- [code charts](http://www.unicode.org/charts/)
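
A minimal sketch of the "unique integer" idea, using the standard `unicodedata` module to look up the official name attached to a code point. The three sample characters are arbitrary choices.

```python
import sys
import unicodedata

# Each character has exactly one code point and one official name.
names = {}
for ch in "A\u00f8\u2602":
    names[ord(ch)] = unicodedata.name(ch)
    print(f"U+{ord(ch):04X}", names[ord(ch)])

# The code space runs from U+0000 to U+10FFFF.
code_space_size = sys.maxunicode + 1
print(code_space_size)
```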

# Python 'str' type 
- stores Unicode characters (code points), not ASCII bytes

# encodings
- 'encoding' is converting a Unicode string into a byte array or stream (in some encoding)
- 'decoding' is converting a byte stream (in some encoding) into a Unicode string
- there are several different encoding/decoding schemes
- Java uses UTF-16 for its strings
- W3C recommends web pages use UTF-8
- the UTF-8 encoding has the special property that if the Unicode string contains only ASCII characters, its UTF-8 encoding is the same as its ASCII encoding
- when you WRITE a Unicode string from Python (saving a file, writing to the network), you must ENCODE it into a sequence of bytes
- when you READ a Unicode string INTO Python, you must DECODE it from a sequence of bytes
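
The points above can be sketched in a few lines. This is an illustrative example only: the sample text and the file name `demo.txt` are assumptions, and the file is written to a temporary directory.

```python
import os
import tempfile

text = "na\u00efve"   # 5 characters, one of them non-ASCII (U+00EF)

# Encoding: str -> bytes. Decoding: bytes -> str.
data = text.encode("utf-8")
round_trip = data.decode("utf-8")

# For pure-ASCII text, UTF-8 and ASCII produce identical bytes.
same_bytes = "Python".encode("utf-8") == "Python".encode("ascii")

# File I/O: open() in text mode encodes on write and decodes on read.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        from_file = f.read()

print(len(text), len(data), round_trip == text, same_bytes, from_file == text)
```

Passing `encoding=` explicitly to `open()` avoids depending on the platform's default encoding.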


In [None]:
# 'Python' spelled in characters from different
# Unicode blocks. len is 6,
# which is the number of characters,
# not the number of bytes it takes to represent them
# \uXXXX is 16 bits written as 4 hex digits
# \UXXXXXXXX is 32 bits written as 8 hex digits

uni = '\U00002119\u01b4\u2602\u210c\xf8\u1f24'
[type(uni), uni, len(uni)]

In [None]:
# python knows how to render lots of characters!

''.join([chr(j) for j in range(17600, 18999)])

# 'ord' maps a char into its unicode integer
# 'chr' maps a unicode integer into a char

In [None]:
# 3rd char is from 'dingbats'

[ ord('A'), chr(65), chr(0x2702)]

In [None]:
uni

In [None]:
# three different encodings of unicode 

utf8, utf16, utf32 = [uni.encode(et)
                      for et in
                      ['utf-8', 'utf-16', 'utf-32']]

In [None]:
# length of unicode encoding varies 
# with different encodings

[[len(u), type(u)] for u in [utf8, utf16, utf32]]
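
To see where the different lengths come from, here is a rough breakdown for the same six-character string. It assumes nothing beyond the built-in codecs; the `-le` variants are included only to show the size of the byte order mark (BOM).

```python
uni = '\U00002119\u01b4\u2602\u210c\xf8\u1f24'

# UTF-8 is variable width: 1 to 4 bytes per character.
utf8_widths = [len(ch.encode('utf-8')) for ch in uni]

# The '-le' variants pin the byte order and therefore add no BOM.
sizes = {
    'utf-8': len(uni.encode('utf-8')),
    'utf-16': len(uni.encode('utf-16')),        # 2-byte BOM + 2 bytes per char here
    'utf-16-le': len(uni.encode('utf-16-le')),
    'utf-32': len(uni.encode('utf-32')),        # 4-byte BOM + 4 bytes per char
    'utf-32-le': len(uni.encode('utf-32-le')),
}
print(utf8_widths, sizes)
```

All six characters here fit in 16 bits, so UTF-16 needs 2 bytes each; characters above U+FFFF would take 4 bytes in UTF-16.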

In [None]:
# utf8, utf16, utf32 are type 'bytes', not str. 
# note b' prefix

[type(uni), type(utf8), utf8, utf16, utf32]

In [None]:
# decode converts bytes into unicode string

utf32.decode('utf-32')

In [None]:
utf8.decode('utf-8')

In [None]:
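# to decode, you must know which encoding was used
# selecting the wrong decoder doesn't
# always generate an error
# sometimes you will just get a bogus string
# (this particular call does raise UnicodeDecodeError,
# because the UTF-32 BOM bytes are never valid UTF-8)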

utf32.decode('utf-8')
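
Both outcomes of picking the wrong decoder can be shown side by side. This is a sketch with a made-up sample word; Latin-1 is used as the "wrong" codec because it accepts every byte value and so never raises.

```python
text = "caf\u00e9"                  # 4 characters, the last is U+00E9
utf8_bytes = text.encode("utf-8")   # b'caf\xc3\xa9', 5 bytes

# Wrong codec, no error: the two UTF-8 bytes of U+00E9
# come back as two separate characters (U+00C3 and U+00A9).
bogus = utf8_bytes.decode("latin-1")

# Wrong codec, with an error: the UTF-32 output starts with a BOM
# containing 0xFF and 0xFE, which never appear in valid UTF-8.
try:
    text.encode("utf-32").decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

print(len(text), len(bogus), failed)
```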

# ASCII vs Unicode
- ASCII is easy, because storage media and networks handle bytes, and each ASCII character is exactly one byte
- no byte order issues (big/little endian)
- Unicode is harder, because
    - writing to the network or storage from Python, the Unicode string must be ENCODED into a byte stream, in some format like UTF-8, UTF-16, etc
    - reading from the network or storage into Python, the byte stream must be DECODED into a Unicode string. The reader must somehow be told which encoding was used
- because Python's 'str' is Unicode, you are always
    - encoding as strings leave your program
    - decoding as strings enter your program
- if all you are using are ASCII characters, ASCII-compatible encodings such as UTF-8 and Latin-1 all produce the same bytes, so everything usually just works without any special effort (UTF-16 and UTF-32 are the exceptions)
- [standard text encoders](https://docs.python.org/3/library/codecs.html#standard-encodings)
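
The byte order point can be made concrete with a single character. A small sketch using only built-in codecs:

```python
ch = "A"   # U+0041

little = ch.encode("utf-16-le")   # low byte first
big = ch.encode("utf-16-be")      # high byte first
utf8 = ch.encode("utf-8")         # one byte, so there is no order to swap

# Plain 'utf-16' prepends a byte order mark (BOM) so a decoder can tell
# which order follows; the BOM itself is not part of the text.
with_bom = ch.encode("utf-16")
decoded = with_bom.decode("utf-16")

print(little, big, utf8, len(with_bom), decoded)
```

UTF-8 is defined as a sequence of single bytes, so byte order never affects it, whether the text is ASCII or not.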