### Strings III

In [None]:
import string

import time
from IPython.display import YouTubeVideo
from IPython.display import clear_output

In [None]:
word = "silencio"

## Advanced: Unicode, encoding and UTF-8 

[doc (aka the gory detail)](https://docs.python.org/3/howto/unicode.html)  
[`ord()` doc](https://docs.python.org/3/library/functions.html#ord)  
[`chr()` doc](https://docs.python.org/3/library/functions.html#chr)  
[`bin()` doc](https://docs.python.org/3/library/functions.html#bin)  
[`int()` doc](https://docs.python.org/3/library/functions.html#int)

- `unicode`: a worldwide standard that assigns one **number** (a *code point*) to each **character**
- encoding: the way this is **implemented** in computers (how the 0s and 1s are organised, i.e. the binary representation, so that the computer can store and handle text properly)
- `utf-8` (Unicode Transformation Format – 8-bit): one specific encoding, which uses between one and four 8-bit bytes per character (ASCII characters take a single byte)
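
To make the difference between a code point and its encoding concrete, here is a minimal sketch that compares the two for a handful of characters. The sample characters are arbitrary choices for illustration.

```python
# Each character has exactly one code point, but its UTF-8 encoding
# can take between one and four bytes.
samples = ["a", "\u00e9", "龙", "\U0001F40D"]  # a, é, 龙, and a snake emoji

byte_counts = {}
for char in samples:
    encoded = char.encode("utf8")
    byte_counts[char] = len(encoded)
    print(f"{char!r}: code point {ord(char)}, {len(encoded)} byte(s) in UTF-8")
```

The further a character sits from the ASCII range, the more bytes UTF-8 needs for it.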


In [None]:
YouTubeVideo("MijmeoH9LT4", width=853, height=480) #  Characters, Symbols and the Unicode Miracle - Computerphile 

In [None]:
print("the Unicode code point for 'a' is:", ord("a"))

In [None]:
print("the character for Unicode code point 97:", chr(97))
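
Since `ord()` and `chr()` are inverses of each other, the same idea extends to whole strings. A small sketch, reusing the `word` defined at the top of the notebook (redefined here so the cell runs on its own):

```python
word = "silencio"

# one code point per character
code_points = [ord(char) for char in word]
print(code_points)

# and back again: one character per code point
rebuilt = "".join(chr(number) for number in code_points)
print(rebuilt)
```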

In [None]:
print("binary representation of 97 is:", bin(97))

In [None]:
print("converting from binary to integer (base 10):", int('10', 2)) # try '0', '1', '10', '11' ...
print("'converting' from base 10 to integer (also base 10):", int('10', 10)) 

In [None]:
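# `bin()` returns a string, not bytes; `int()` accepts the '0b' prefix when the base is 2
print("converting the binary string for 97 back to 97:", int(bin(97), 2))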

In [None]:
print("converting the binary string for 97 back to 'a':", chr(int(bin(97), 2)))
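
`bin()` drops leading zeros and adds the `0b` prefix. If you would rather see a full 8-bit byte, one option (a sketch, not the only way) is the built-in `format()` with the `"08b"` format specifier:

```python
# "08b": binary, zero-padded to 8 digits, no '0b' prefix
padded = format(97, "08b")
print(padded)

# int() reads it back just as well
print(int(padded, 2))
print(chr(int(padded, 2)))
```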

In [None]:
# to convert a string to binary,
# first 'encode' to bytes
byte_string = "a".encode("utf8") # try adding more letters
# then turn the bytes into binary
# (the '0b' only indicates this is a binary string)
# (see the `int()` doc for details)
list_of_binary_strings = [bin(byte) for byte in byte_string]
print(list_of_binary_strings)

In [None]:
# Chinese characters take more than one byte in UTF-8 (three for this one)!
byte_string = "龙".encode("utf8")
list_of_binary_strings = [bin(byte) for byte in byte_string]
print(list_of_binary_strings)
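
The conversion also works in the other direction. The sketch below pads each byte to 8 bits, which makes the UTF-8 pattern visible, then rebuilds the bytes from the binary strings and decodes them back to text. The variable names are only for illustration.

```python
byte_string = "龙".encode("utf8")

# zero-padded 8-bit strings, one per byte
bits = [format(byte, "08b") for byte in byte_string]
print(bits)
# in a three-byte sequence, the first byte starts with 1110
# and each following (continuation) byte starts with 10

# back from binary strings to bytes, then from bytes to text
rebuilt_bytes = bytes(int(b, 2) for b in bits)
decoded = rebuilt_bytes.decode("utf8")
print(decoded)
```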

Jörg Piringer's [unicode](https://joerg.piringer.net/index.php?href=unicode/unicode.xml), going through numbers `0 - 65536 (49571 characters)`:

In [None]:
YouTubeVideo("Z_sl99D2a18", width=853, height=480) #  unicode 

In [None]:
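# many characters are also space / invisible / etc., so won't display anything
# (at 0.09 s per character the full loop runs for over an hour and a half:
# interrupt the kernel to stop it)
for i in range(65536):
    # skip the surrogate range (U+D800 to U+DFFF): these code points are not
    # characters on their own and cannot be encoded as UTF-8
    if 0xD800 <= i <= 0xDFFF:
        continue
    print(chr(i))
    # try this if you also want the index
    # print(i, chr(i))
    time.sleep(.09)
    clear_output(wait=True)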

## Extra: `str.translate`

[translate doc](https://docs.python.org/3/library/stdtypes.html#str.translate)  
[maketrans doc](https://docs.python.org/3/library/stdtypes.html#str.maketrans)

A useful tool to translate strings character by character (e.g. all "e"s become "a"s, or all punctuation is removed, i.e. becomes `''`).

In [None]:
# translate expects Unicode code points (integers) as keys!
# (the values can be code points, strings or None)
word.translate(
    {
        ord("i"): ord("1"),
        ord("e"): ord("3"),
        ord("c"): ord("<"),
        ord("o"): ord("0"),
    }
)

In [None]:
word.translate(
    str.maketrans("ieco", "13<0")
)
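
`str.maketrans` also accepts a single dictionary, whose keys can be one-character strings (no `ord()` needed) and whose values can be strings of any length or `None` for removal. A short sketch with an arbitrary mapping:

```python
table = str.maketrans({
    "i": "1",    # one character to one character
    "e": None,   # None removes the character
    "o": "00",   # one character to several
})
translated = "silencio".translate(table)
print(translated)
```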

In [None]:
# maketrans can take a third argument:
# the first two say: translate everything as is ("" to ""),
# and the third lists *every character that needs to be removed*
# (equivalent to writing `ord(","): None` for each punctuation character)
print(str.maketrans("", "", string.punctuation))

In [None]:
# now, we can remove all punctuation from text
print("Hello there! How are you? Yes, you...".translate(
    str.maketrans("", "", string.punctuation)
))
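
Note that `string.punctuation` only contains ASCII punctuation, so characters such as `¿` or `«` survive the translation above. One possible workaround, sketched here with a hypothetical helper called `strip_unicode_punctuation`, is to drop every character whose Unicode category starts with `P` (punctuation), using the standard `unicodedata` module:

```python
import string
import unicodedata

def strip_unicode_punctuation(text):
    """Remove every character that Unicode classifies as punctuation (hypothetical helper)."""
    return "".join(
        char for char in text
        if not unicodedata.category(char).startswith("P")
    )

sample = "¿Qué tal? «Bien», gracias…"

# ASCII-only removal leaves the non-ASCII punctuation in place
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only)

# category-based removal catches it
unicode_aware = strip_unicode_punctuation(sample)
print(unicode_aware)
```

The two approaches are not equivalent: `string.punctuation` also contains characters such as `$` and `+`, which Unicode classifies as symbols rather than punctuation, so the helper leaves them in place.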