# Unicode

The best way to start understanding character encodings is to cover one of the simplest, ASCII.  
So what is a more formal definition of a character encoding?  
At a very high level, it’s a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) to integers and ultimately to bits. Each character is encoded to a unique sequence of bits. The entire ASCII table contains 128 characters, grouped by code point as follows:

* 0 through 31: Control/non-printable characters
* 32 through 64: Punctuation, symbols, numbers, and space
* 65 through 90: Uppercase English alphabet letters
* 91 through 96: Additional graphemes, such as [ and \
* 97 through 122: Lowercase English alphabet letters
* 123 through 126: Additional graphemes, such as { and |
* 127: Control/non-printable character (DEL)
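
The ranges above can be checked against ord(). The following is only a minimal sketch: the function name ascii_group and its labels are hypothetical, chosen for illustration, and are not part of the standard library.

```python
def ascii_group(ch: str) -> str:
    """Return a rough label for the ASCII range that a single character falls in.

    Hypothetical helper: the labels mirror the list of ranges above.
    """
    code = ord(ch)  # integer code point of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    if code <= 31 or code == 127:
        return "control"
    if code <= 64:
        return "punctuation, symbols, digits, or space"
    if code <= 90:
        return "uppercase letter"
    if code <= 96:
        return "additional grapheme"
    if code <= 122:
        return "lowercase letter"
    return "additional grapheme"  # 123 through 126


for ch in ["\n", "7", "A", "[", "z", "~"]:
    print(repr(ch), ord(ch), ascii_group(ch))
```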

In [1]:
whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""

In [2]:
printable = digits + ascii_letters + punctuation + whitespace
print(printable)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	



In [3]:
import string
s = "What's wrong with ASCII?!?!?"
s.rstrip(string.punctuation)

"What's wrong with ASCII"

Here’s a handy way to represent ASCII strings as sequences of bits in Python. Each character from the ASCII string gets pseudo-encoded into 8 bits, with spaces in between the 8-bit sequences that each represent a single character:

In [4]:
def make_bitseq(s: str) -> str:
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(i):08b}" for i in s)

In [5]:
make_bitseq("bits")

'01100010 01101001 01110100 01110011'

The f-string f"{ord(i):08b}" uses Python’s [Format Specification Mini-Language](https://docs.python.org/3/library/string.html#formatspec).
The ord() function gives you the base-10 code point for a single str character.
The right-hand side of the colon is the format specifier: 08 means width 8, zero-padded, and b means that the number is output in base 2 (binary).
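
To see that the bit sequence really carries the same information, it can be turned back into text. The sketch below repeats make_bitseq() so that it runs on its own; from_bitseq() is a hypothetical helper name introduced here for illustration.

```python
def make_bitseq(s: str) -> str:
    """Pseudo-encode an ASCII string as space-separated 8-bit groups."""
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(i):08b}" for i in s)


def from_bitseq(bits: str) -> str:
    """Hypothetical inverse: parse each 8-bit group as base 2, then map it back with chr()."""
    return "".join(chr(int(group, 2)) for group in bits.split())


encoded = make_bitseq("bits")
decoded = from_bitseq(encoded)
print(encoded)
print(decoded)
```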

In [6]:
i = 'X'
print(f"in Hex : {ord(i):02x}")

in Hex : 58


In [7]:
make_bitseq("$25.43")

'00100100 00110010 00110101 00101110 00110100 00110011'

In [9]:
int('11', base=2)  # Binary to int

3

In [8]:
int('11', base=8)  # Octal to int

9

In [None]:
int('11', base=16)  # Hex to int

Python accepts literal forms of each of the three alternative numbering systems above:


In [None]:
0b11  # Binary literal

In [None]:
0o11  # Octal literal

In [None]:
0x11  # Hex literal

## Unicode
Unicode fundamentally serves the same purpose as ASCII, but it encompasses a vastly larger set of code points.  
Think of Unicode as a massive version of the ASCII table, one that has 1,114,112 possible code points (of which 1,111,998 remain available for characters once the 2,048 surrogates and 66 noncharacters are excluded). That’s 0 through 1,114,111, or 0 through 17 * (2^16) - 1, or 0x10ffff hexadecimal. In fact, ASCII is a perfect subset of Unicode: the first 128 code points in the Unicode table correspond precisely to the ASCII characters.
Unicode itself is not an encoding. Rather, it is an abstract standard that is implemented by different character encodings.
Unicode maps characters to code points, but it doesn’t tell you how to get actual bits from text, that is, how to convert text to binary data and vice versa. That’s where UTF-8 and other encoding schemes come into play.
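
The distinction between a code point and its encoded bytes can be made concrete with a small sketch. The helper name encodings_of is hypothetical; the codec names are standard Python codecs, and the explicit big-endian variants are used so that no byte order mark is added.

```python
def encodings_of(ch: str) -> dict:
    """Map encoding name -> hex string of the bytes for one character."""
    return {
        enc: ch.encode(enc).hex()
        for enc in ("utf-8", "utf-16-be", "utf-32-be")
    }


ch = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE
print(f"code point: U+{ord(ch):04X}")  # one code point ...
for enc, hexbytes in encodings_of(ch).items():
    print(enc, hexbytes)  # ... but different bytes per encoding
```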


The result of str.encode() is a bytes object. The default encoding for both str.encode() and bytes.decode() is UTF-8.

In [10]:
 %timeit "😘".encode("utf-8")

85 ns ± 0.746 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [11]:
b'\xf0\x9f\x98\x98'.decode("utf-8") 

'😘'

In [12]:
"é".encode("utf-8")  # the result is two bytes, 0xc3 and 0xa9 in hex

b'\xc3\xa9'

Many non-ASCII Unicode characters, such as accented letters, are usable in identifiers. Python’s re module also defaults to the re.UNICODE flag rather than re.ASCII for str patterns. This means, for instance, that r"\w" matches Unicode word characters, not just ASCII letters.

In [13]:
é = 1
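
The effect of the re.UNICODE default can be illustrated with a short comparison. This is a sketch with an arbitrary sample string; the accented letters are written as escapes so that the example does not depend on how the file itself is encoded.

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve"

# Default for str patterns: \w matches Unicode word characters
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```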

In [14]:
import locale
locale.getpreferredencoding()

'UTF-8'

A crucial feature is that UTF-8 is a variable-length encoding.

In [20]:
ibrow = "🤨"
len(ibrow)

1

In [18]:
ibrow.encode("utf-8")

b'\xf0\x9f\xa4\xa8'

In [None]:
len(ibrow.encode("utf-8"))

Calling list() on a bytes object gives you the decimal value for each byte:

In [None]:
list(b'\xf0\x9f\xa4\xa8')
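
The variable-length property can be seen by encoding characters from different parts of the Unicode table. A minimal sketch, with sample characters picked for illustration:

```python
samples = [
    "a",           # U+0061, ASCII
    "\u00e9",      # é, U+00E9
    "\u20ac",      # €, U+20AC
    "\U0001F928",  # 🤨, U+1F928
]

# Number of bytes that UTF-8 needs for each single character
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```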

Wikipedia’s [UTF-8 article](https://en.wikipedia.org/wiki/UTF-8) has some more technical detail, and there is always the official [Unicode Standard](http://www.unicode.org/versions/latest/).

## UTF-16 and UTF-32
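UTF-16 is also a variable-length encoding, using 2 or 4 bytes per character, while UTF-32 is fixed-length and always uses 4 bytes per character.
Wrong results like the ones below are possible when the same encoding isn’t used in both directions, for example when bytes that were encoded as UTF-8 are decoded as UTF-16.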

In [None]:
letters = "αβγδ"
rawdata = letters.encode("utf-8")
rawdata.decode("utf-8")

In [None]:
rawdata.decode("utf-16")  

In [None]:
rawdata
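
The mismatch can be reproduced deterministically with the following sketch. It assumes the explicit "utf-16-le" codec rather than plain "utf-16", so that the result does not depend on the platform's native byte order.

```python
letters = "\u03b1\u03b2\u03b3\u03b4"  # "αβγδ"
rawdata = letters.encode("utf-8")     # 2 bytes per Greek letter

right = rawdata.decode("utf-8")       # same encoding in both directions
wrong = rawdata.decode("utf-16-le")   # each pair of bytes is misread as one UTF-16 code unit

print(len(rawdata), right, wrong)

# No bytes were lost: re-encoding with the wrong codec gives back the original data
recovered = wrong.encode("utf-16-le").decode("utf-8")
```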

UTF-8 will not always take up less space than UTF-16. Here is an example with the [iroha poem](https://de.wikipedia.org/wiki/Iroha):

In [None]:
text = "以呂波耳本部止 千利奴流乎和加 餘多連曽津祢那 良牟有為能於久 耶万計不己衣天 阿佐伎喩女美之 恵比毛勢須"

In [None]:
len(text.encode("utf-8"))

In [None]:
len(text.encode("utf-16"))
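
A small helper makes the size comparison easier to repeat for other inputs. This is a sketch: encoded_sizes is a hypothetical name, and the "-le" codec variants are used because plain "utf-16" and "utf-32" prepend a byte order mark that would inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in three encodings (no BOM)."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


ascii_sizes = encoded_sizes("hello world")       # ASCII: 1 byte per character in UTF-8
cjk_sizes = encoded_sizes("\u4ee5\u5442\u6ce2")  # 以呂波: 3 bytes per character in UTF-8

print(ascii_sizes)
print(cjk_sizes)
```

For mostly ASCII text UTF-8 is the most compact of the three, while for text dominated by characters such as these, UTF-16 needs fewer bytes.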

## Python’s Built-In Functions

ascii(), bin(), hex(), and oct() are for obtaining a different representation of an input. Each one produces a str. The first, ascii(), produces an ASCII only representation of an object, with non-ASCII characters escaped. The remaining three give binary, hexadecimal, and octal representations of an integer, respectively. These are only representations, not a fundamental change in the input.
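
That these functions only change the representation can be checked by parsing their output back. A minimal sketch; it relies on int() with base 0, which interprets the 0b, 0o, and 0x prefixes the same way Python source code does.

```python
n = 0xc0ffee

representations = [bin(n), oct(n), hex(n)]

# Each representation is a str ...
all_strings = all(isinstance(r, str) for r in representations)

# ... and parsing it back yields the same integer
round_trips = [int(r, 0) for r in representations]

# ascii() escapes non-ASCII characters instead of dropping them
escaped = ascii("\u00f1")  # ñ

print(representations, round_trips, escaped)
```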

bytes(), str(), and int() are class constructors for their respective types, bytes, str, and int. They each offer ways of coercing the input into the desired type. For instance, as you saw earlier, while int(11.0) is probably more common, you might also see int('11', base=16).

ord() and chr() are inverses of each other: ord() converts a single str character to its base-10 code point, while chr() does the opposite.
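
The inverse relationship can be sketched as a round trip. The sample word is arbitrary; the upper bound 0x10FFFF is the largest code point that chr() accepts.

```python
word = "hello"

points = [ord(c) for c in word]            # str characters -> integer code points
rebuilt = "".join(chr(p) for p in points)  # integer code points -> str characters

print(points, rebuilt)

# chr() accepts code points up to 0x10FFFF and rejects anything larger
last = chr(0x10FFFF)
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```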

Example ascii()

In [None]:
ascii("abcdefg")


In [None]:
ascii("jalapeño")

In [None]:
ascii(0xc0ffee)  # Hex literal (int)

Example bin()

In [None]:
bin(400)

In [None]:
bin(0xc0ffee) 

In [None]:
[bin(i) for i in [1, 2, 4, 8, 16]] 

Example bytes() representing raw binary data:

In [None]:
bytes((104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100))

In [None]:
bytes(range(97, 123))

In [None]:
bytes("real 🐍", "utf-8")

chr() converts an integer code point to a single Unicode character:

In [None]:
chr(97)

In [None]:
chr(0b01100100) 

hex() gives the hexadecimal representation of an integer, with the prefix "0x":

In [None]:
[hex(i) for i in [1, 2, 4, 8, 16]]

int() coerces the input to int, optionally interpreting the input in a given base:

In [None]:
int('11', base=2)

In [None]:
int.from_bytes(b"Python", "big")

ord() converts a single Unicode character to its integer code point:

In [None]:
ord("a")

In [None]:
[ord(i) for i in "hello world"]
# [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

str() coerces the input to str, representing text:

In [None]:
str(b"\xc2\xbc cup of flour", "utf-8")

## Python String Literals

There are up to six ways that Python will allow you to type the same Unicode character.

In [None]:
"a" == "\x61" == "\N{LATIN SMALL LETTER A}" == "\u0061" == "\U00000061" == "\141"

Here is a short function to convert strings that look like "U+10346" into something Python can work with:

In [None]:
def make_uchr(code: str) -> str:
    # Strip the leading "U+" and parse the remaining hex digits as a code point
    return chr(int(code.lstrip("U+"), 16))

make_uchr("U+10346")

In [None]:
make_uchr("U+0026")

In [None]:
alef_hamza = chr(1571)
alef_hamza


In [None]:
alef_hamza.encode("unicode-escape")

## Careful of wrong assumptions
### Other Encodings
One example is Latin-1 (also called ISO-8859-1), which is technically the default for the Hypertext Transfer Protocol (HTTP).

In [None]:
data = b"\xbc cup of flour"

In [None]:
data.decode("utf-8")  # Oops! 😳 It was not UTF-8, so this raises UnicodeDecodeError

In [None]:
 %timeit data.decode("latin-1") # 😀 
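
One way to cope with input of unknown encoding is to try UTF-8 first and fall back to Latin-1. The sketch below is only an illustration of that idea: decode_with_fallback is a hypothetical helper, and the fallback is a guess, since Latin-1 maps every possible byte to a character and therefore never fails even when it is the wrong choice.

```python
def decode_with_fallback(data: bytes) -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, assume Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 decodes any byte sequence, so this cannot raise,
        # but the result is only correct if the data really was Latin-1.
        return data.decode("latin-1")


from_latin1 = decode_with_fallback(b"\xbc cup of flour")
from_utf8 = decode_with_fallback(b"\xc2\xbc cup of flour")
print(from_latin1)
print(from_utf8)
```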

## unicodedata


In [None]:
import unicodedata
unicodedata.name("€")
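
The name returned by unicodedata.name() can also be used in the other direction. A minimal sketch using the standard library functions unicodedata.lookup() and unicodedata.category():

```python
import unicodedata

euro = "\u20ac"  # €

name = unicodedata.name(euro)          # official character name
same_char = unicodedata.lookup(name)   # name -> character, the inverse of name()
category = unicodedata.category(euro)  # general category, "Sc" means currency symbol

print(name, same_char, category)
```

The same name works inside a string literal as "\N{EURO SIGN}".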

Inspired by https://realpython.com/python-encodings-guide/

In [12]:
import struct
import codecs
asd = ['e2','07']
text = ''.join(asd)
text

'e207'

In [13]:
encoded = codecs.decode(text, 'hex')
struct.unpack("<H", encoded)

(2018,)

In [14]:
struct.unpack(">H", encoded)

(57863,)

In [22]:
text = '0001'
encoded = codecs.decode(text, 'hex')
encoded

b'\x00\x01'

In [23]:
struct.unpack(">H", encoded) # big endian

(1,)

In [24]:
struct.unpack("<H", encoded) # little endian

(256,)
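
The same byte-order effect can be reproduced without struct by using int.from_bytes(), which appeared earlier. This sketch assumes the same two bytes as above and uses bytes.fromhex() in place of codecs.decode().

```python
import struct

raw = bytes.fromhex("e207")  # same two bytes as above: 0xe2, 0x07

little = int.from_bytes(raw, "little")  # least significant byte first: 0x07e2
big = int.from_bytes(raw, "big")        # most significant byte first: 0xe207

# struct.unpack with "<H" / ">H" reads the same unsigned 16-bit values
little_struct = struct.unpack("<H", raw)[0]
big_struct = struct.unpack(">H", raw)[0]

print(little, big)
```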