### Name Lookup using name() in Python 3

We will create a <b style='color:blue'>function that looks up and prints the name of a Unicode character</b>. For that, we need to import the unicodedata module.

In [1]:
import unicodedata
def unicode_test(unicode_character):
    name = unicodedata.name(unicode_character)
    value = unicodedata.lookup(name)
    print('value="%s", name="%s"' % (value, name))

In [7]:
# Testing simple letters
unicode_test('A')

value="A", name="LATIN CAPITAL LETTER A"


In [8]:
# Testing simple letters
unicode_test('a')

value="a", name="LATIN SMALL LETTER A"


In [9]:
# Special symbols
unicode_test('$')

value="$", name="DOLLAR SIGN"


In [13]:
# The unicode currency character for cent.
unicodedata.name('\u00a2')

'CENT SIGN'

Just to check, let's print the character itself:

In [12]:
print('\u00a2')

¢


In [24]:
# Let's invite the snowman
unicode_test('\u2603')

value="☃", name="SNOWMAN"
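One caveat: `unicodedata.name()` raises `ValueError` for characters that have no name, such as control characters. Below is a minimal sketch of a more forgiving variant. The helper name `safe_name` and the `'<unnamed>'` placeholder are illustrative choices, not part of the standard library.

```python
import unicodedata

def safe_name(character, placeholder='<unnamed>'):
    """Return the Unicode name of character, or placeholder if it has none."""
    # The optional second argument of unicodedata.name() is returned
    # instead of raising ValueError when no name is defined.
    return unicodedata.name(character, placeholder)

print(safe_name('A'))    # a named character
print(safe_name('\n'))   # a control character, which has no name
```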


### Let's print special Latin letters in a message

In [14]:
u_umlaut = '\N{LATIN SMALL LETTER U WITH DIAERESIS}'

In [15]:
latin_e = '\N{LATIN SMALL LETTER E WITH ACUTE}'
place = 'caf'+latin_e

In [16]:
drink = 'Gew' + u_umlaut + 'rztraminer'

In [17]:
print('Now I can finally have my', drink, 'in a', place)

Now I can finally have my Gewürztraminer in a café
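The `\N{...}` escape and `unicodedata.lookup()` resolve the same character names, so either can be used depending on whether the name is known when the code is written or only at run time. A small sketch with illustrative variable names:

```python
import unicodedata

# Resolved when the string literal is compiled
from_escape = '\N{LATIN SMALL LETTER E WITH ACUTE}'
# Resolved at run time from an ordinary string
from_lookup = unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')
# The same code point written directly
from_codepoint = '\u00e9'

# All three spellings produce the same one-character string
print(from_escape == from_lookup == from_codepoint)
```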


### Encoding and Decoding

When data is to be shared with other machines and over a network, we need to encode character strings to bytes and decode bytes to character strings.

Ken Thompson and Rob Pike, both known for their work on Unix and Plan 9 at Bell Labs, designed the <b style='color:blue'>UTF-8 dynamic encoding</b>. It uses one to four bytes per Unicode character:

1. One byte for ASCII
2. Two bytes for most Latin-derived alphabets with diacritics, as well as Greek, Cyrillic, Hebrew, and Arabic
3. Three bytes for the rest of the basic multilingual plane
4. Four bytes for the rest, including some Asian languages and symbols

UTF-8 is the standard text encoding in Python, Linux, and HTML. It’s fast, complete,
and works well.
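The byte counts listed above can be checked directly by encoding one sample character from each range. This is only a sketch; the sample characters and the `samples` and `byte_counts` names are chosen here for illustration.

```python
# One sample character per UTF-8 length class
samples = {
    'A': 'ASCII letter',
    '\u00e9': 'Latin letter with an accent',
    '\u2603': 'snowman, in the basic multilingual plane',
    '\U0001F600': 'emoji, outside the basic multilingual plane',
}

byte_counts = {}
for char, description in samples.items():
    count = len(char.encode('utf-8'))
    byte_counts[char] = count
    print(f'U+{ord(char):04X} ({description}): {count} byte(s)')
```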

#### Encoding
The string encode() method encodes a string to bytes. Its first argument is the encoding name. There are many choices, including ASCII and UTF-8. We are going to use UTF-8, which can encode any Unicode character.

Let’s assign the Unicode string '\u2603' to the name snowman:

In [19]:
snowman = '\u2603' 

In [20]:
print(snowman)

☃


snowman is a Python Unicode string with a single character, regardless of how many bytes might be needed to store it internally:

In [21]:
len(snowman)

1

Next let’s encode this Unicode character to a sequence of bytes:

In [22]:
# Encode the string to bytes using UTF-8
ds = snowman.encode('utf-8')

UTF-8 is a variable-length encoding. In this case, it used three bytes to encode the single snowman Unicode character:

In [23]:
ds

b'\xe2\x98\x83'

In [24]:
len(ds)

3

len() returns the number of bytes (3) because ds is a bytes variable.
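Not every encoding can represent every character. As a hedged sketch, here is what happens when the same snowman is encoded as ASCII, and how the optional second argument of `encode()` changes the failure behavior. The variable names are illustrative.

```python
snowman = '\u2603'

# The default error handling ('strict') raises an exception
try:
    snowman.encode('ascii')
except UnicodeEncodeError as error:
    print('strict raises:', error)

# Alternative error handlers trade the exception for lossy or escaped output
ignored = snowman.encode('ascii', 'ignore')              # drop the character
replaced = snowman.encode('ascii', 'replace')            # substitute '?'
escaped = snowman.encode('ascii', 'backslashreplace')    # Python-style escape
as_entity = snowman.encode('ascii', 'xmlcharrefreplace') # XML/HTML entity

print(ignored, replaced, escaped, as_entity)
```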

#### Decoding
We decode byte strings to Unicode strings. Whenever we get text from some external source (files, databases, websites, network APIs, and so on), it arrives as byte strings.

The important part is identifying the encoding that was actually used, so we can run it backward and get Unicode strings. The problem is that nothing in the byte string itself says which encoding was used.

Here is how it works:

In [25]:
# Create a string with value 'café'
place = 'caf\u00e9'

In [26]:
place

'café'

In [27]:
# Encode it in UTF-8 and assign the result to a bytes variable called place_bytes:
place_bytes = place.encode('utf-8')

In [28]:
place_bytes

b'caf\xc3\xa9'

We can see that place_bytes has five bytes. The first three, "caf", are the same as in ASCII, and the final two encode the 'é'.
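To make the five bytes visible, a bytes object can be converted to a list of integers. This small sketch rebuilds the same value independently; the name `byte_values` is illustrative.

```python
place = 'caf\u00e9'
place_bytes = place.encode('utf-8')

# Iterating over bytes yields integers in the range 0..255
byte_values = list(place_bytes)
print(byte_values)

# Four characters in the string, five bytes in its UTF-8 encoding
print(len(place), len(place_bytes))
```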

##### Decoding the String back
Now, let’s decode that byte string back to a Unicode string:

In [29]:
place_decoded = place_bytes.decode('utf-8')

In [30]:
place_decoded

'café'

In [31]:
print(place_decoded)

café


That worked because we knew the string was encoded as UTF-8, so we could decode it as UTF-8. There are times when we will have to work with strings encoded with other standards. <b style='color:blue'>Let's try to decode with ASCII</b>.

In [33]:
# Let's try to decode the bytes using ASCII
place_ascii = place_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

It raises an error, as expected, because ASCII only defines byte values from 0 to 127. As the message says, the byte value 0xc3 (195) is illegal in ASCII.
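Like `encode()`, `decode()` accepts an `errors` argument that controls what happens to undecodable bytes. The sketch below shows three of the standard handlers applied to the same bytes. None of them recovers the 'é', so they are a way to avoid the exception rather than a fix for a wrong encoding guess. The variable names are illustrative.

```python
place_bytes = b'caf\xc3\xa9'

# Drop every byte that is not valid ASCII
ignored = place_bytes.decode('ascii', errors='ignore')
# Substitute U+FFFD (the replacement character) for each invalid byte
replaced = place_bytes.decode('ascii', errors='replace')
# Keep the invalid bytes visible as backslash escapes
escaped = place_bytes.decode('ascii', errors='backslashreplace')

print(repr(ignored), repr(replaced), repr(escaped))
```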

There are some 8-bit character set encodings in which values between 128 (hex 80) and
255 (hex FF) are legal but not the same as UTF-8:

In [34]:
place_latin = place_bytes.decode('latin-1')

In [35]:
place_latin

'cafÃ©'

That worked without throwing an exception, but the result wasn't what we expected: Latin-1 maps each of the two bytes 0xc3 and 0xa9 to its own character, 'Ã' and '©'. Hence <b style='color:blue'>it is advised that we use UTF-8 everywhere</b>, as it creates uniformity and makes our lives easier.
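When a string has already been decoded with the wrong 8-bit encoding, the damage can often be reversed by re-encoding with that same encoding and decoding as UTF-8. The sketch below shows this, followed by a simple fallback strategy for bytes of unknown origin. The helper `decode_with_fallback` and its default candidate list are hypothetical, and because Latin-1 accepts every byte value, the fallback never fails but may still guess wrong.

```python
# Reproduce the garbled text, then undo the wrong decoding step
garbled = b'caf\xc3\xa9'.decode('latin-1')
repaired = garbled.encode('latin-1').decode('utf-8')
print(garbled, '->', repaired)

def decode_with_fallback(data, encodings=('utf-8', 'latin-1')):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')

utf8_result = decode_with_fallback(b'caf\xc3\xa9')   # valid UTF-8
latin_result = decode_with_fallback(b'caf\xe9')      # not valid UTF-8
print(utf8_result, latin_result)
```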

For more information, refer to:

https://docs.python.org/3/howto/unicode.html

https://nedbatchelder.com/text/unipain.html

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/