# Character encoding

## Unicode
Main takeaway from [Unicode, Pycon 2020](https://www.youtube.com/watch?v=olhKTHFYNxA&feature=youtu.be): Identify the boundaries of your Python program (reading/writing files, network connections etc.), and do the encoding and decoding there, and only there. Inside the Python program, only work with strings.
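
A minimal sketch of that rule, assuming the input arrives as ISO-8859-1 bytes and the output should be UTF-8. The `io.BytesIO` objects are hypothetical stand-ins for a real file or socket.

```python
import io

# Hypothetical stand-in for a file or socket that delivers ISO-8859-1 bytes.
incoming = io.BytesIO("Æ Ø Å".encode("iso-8859-1"))

# Boundary in: decode bytes to str exactly once.
text = incoming.read().decode("iso-8859-1")

# Inside the program: work with str only.
processed = text.casefold()

# Boundary out: encode str to bytes exactly once.
outgoing = io.BytesIO()
outgoing.write(processed.encode("utf-8"))
print(outgoing.getvalue())
```

Everything between the two boundaries is independent of the encodings used at either end.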

In [1]:
import unicodedata
import math
import binascii
import pandas as pd

In [2]:
unicodedata.name('H')
# https://www.fileformat.info/info/unicode/char/0048/index.htm

'LATIN CAPITAL LETTER H'

In [3]:
def make_bitseq(s: str) -> str:
    return " ".join(f"{ord(i):08b}" for i in s)

print(make_bitseq('H'))
print(binascii.hexlify(b'H'))
math.pow(2,3)+math.pow(2, 6)

01001000
b'48'


72.0
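
Note that `make_bitseq` formats the code point returned by `ord`, not the encoded bytes. The two coincide for ASCII but not beyond. The helper below (`make_utf8_bitseq` is a name made up for this sketch) shows the bits of the UTF-8 bytes instead.

```python
def make_utf8_bitseq(s: str) -> str:
    # Format each UTF-8 byte (not each code point) as 8 bits.
    return " ".join(f"{byte:08b}" for byte in s.encode("utf-8"))

print(make_utf8_bitseq("H"))  # 01001000
print(make_utf8_bitseq("ñ"))  # 11000011 10110001
print(int("01001000", 2))     # 72, same as 2**3 + 2**6
```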

In [4]:
print(ord('H'))
print(hex(ord('H')))
ord?

72
0x48


Signature: ord(c, /)
Docstring: Return the Unicode code point for a one-character string.
Type:      builtin_function_or_method


In [5]:
len('😅')

1
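
`len` counts code points, not bytes. A short sketch comparing the three ways of measuring this string:

```python
s = "😅"  # U+1F605, outside the Basic Multilingual Plane

code_points = len(s)
utf8_bytes = len(s.encode("utf-8"))
utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit

print(code_points, utf8_bytes, utf16_units)  # 1 4 2
```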

In [6]:
s = '\u00FC' # ü
[hex(ord(c)) for c in unicodedata.normalize('NFD', s)]

['0x75', '0x308']
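
Normalization matters when comparing strings: the composed and decomposed forms of ü look the same but are different code point sequences. A small sketch of the comparison:

```python
import unicodedata

composed = "\u00fc"      # ü as one code point
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS

print(composed == decomposed)  # False

# Normalize both sides to the same form before comparing.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)  # True
```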

In [7]:
'ABC'.casefold() == 'abc'.casefold()
# Better to use this than lower/upper! (often not possible in SQL, but should be used in Python)

True
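
For pure ASCII, `casefold` and `lower` give the same result. The difference shows up with characters such as the German ß, as in this sketch:

```python
a = "Straße"
b = "STRASSE"

print(a.lower() == b.lower())        # False: 'straße' vs 'strasse'
print(a.casefold() == b.casefold())  # True: both become 'strasse'
```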

## Various encodings
UTF-8 is a variable-length encoding (1 to 4 bytes per character) that can represent any Unicode character. ISO 8859-1 (Latin-1) is a single-byte encoding that can represent only the first 256 Unicode code points. Both encode ASCII (code points 0-127) exactly the same way.
Code points 128-255 differ: UTF-8 encodes each of them as a two-byte sequence, whereas ISO 8859-1 encodes each as a single byte. Ref. [SO](https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1)
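
A quick way to see this is to encode one ASCII character and two characters from the 128-255 range with both codecs. The sample characters are arbitrary picks for this sketch.

```python
lengths = {}
for ch in ["A", "ñ", "ÿ"]:
    utf8 = ch.encode("utf-8")
    latin1 = ch.encode("iso-8859-1")
    lengths[ch] = (len(utf8), len(latin1))
    print(f"U+{ord(ch):04X} utf-8={utf8.hex()} iso-8859-1={latin1.hex()}")
```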

Links with more information:

- https://docs.python.org/3/howto/unicode.html
- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
- https://realpython.com/python-encodings-guide/
- https://kunststube.net/encoding/
- https://mincong.io/2019/04/07/understanding-iso-8859-1-and-utf-8/
- https://unicodebook.readthedocs.io/encodings.html

In [8]:
s = "ñ"  # https://www.fileformat.info/info/unicode/char/00f1/index.htm
b1 = s.encode("utf-8")
b2 = s.encode("iso-8859-15")
print(b1)  # Note: Two bytes
print(b2)

b'\xc3\xb1'
b'\xf1'


In [9]:
print(b'\xc3\xb1'.decode("utf-8"))
print(b'\xf1'.decode("iso-8859-15"))

ñ
ñ


Examples of using the wrong encoding:

In [10]:
b'\xc3\xb1'.decode("iso-8859-15")

'Ã±'

In [11]:
try:
    b'\xf1'.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)

'utf-8' codec can't decode byte 0xf1 in position 0: unexpected end of data


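In UTF-8, ñ is encoded as the two bytes \xc3\xb1. The single byte \xf1 is not a complete UTF-8 sequence: it is the lead byte of a four-byte sequence, so the decoder reports "unexpected end of data" when no continuation bytes follow. See the UTF-8 character list at https://www.fileformat.info/info/charset/UTF-8/list.htm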
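
The "unexpected end of data" message can be explored with a sketch like the following. The four-byte sequence is an example chosen here, not something from the files above: 0xf1 followed by three continuation bytes decodes, while 0xf1 alone only decodes with a lenient `errors` handler.

```python
lone = b"\xf1"
complete = b"\xf1\x80\x80\x80"  # assumed example: 0xf1 plus three continuation bytes

decoded = complete.decode("utf-8")
print(hex(ord(decoded)))  # 0x40000

# With errors="replace" the incomplete byte becomes U+FFFD instead of raising.
replaced = lone.decode("utf-8", errors="replace")
print(replaced == "\ufffd")  # True
```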


In [12]:
s = '€'
b = s.encode("iso-8859-15")
print(b)
print(b'\xa4'.decode("iso-8859-15"))

b'\xa4'
€


In [13]:
try:
    '€'.encode("iso-8859-1")
except UnicodeEncodeError as e:
    print(e)

'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)


ISO-8859-15 is similar to ISO 8859-1, and thus also intended for “Western European” languages, but replaces some less common symbols with the euro sign and some letters that were deemed necessary, ref. [Wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-15).
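
The byte values where the two encodings disagree can be listed by decoding every possible byte with both codecs. A minimal sketch:

```python
diffs = {}
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso-8859-1")
    latin9 = raw.decode("iso-8859-15")
    if latin1 != latin9:
        diffs[value] = (latin1, latin9)

for value, (latin1, latin9) in diffs.items():
    print(f"0x{value:02x}: {latin1!r} -> {latin9!r}")
```

This prints eight entries, including 0xa4 (¤ becomes €) and 0xb4 (´ becomes Ž).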

## Linux commands

in.txt:
```
H
Æ Ø Å
ô í ó
Ž €
```

`$ file in.txt`
```
in.txt: UTF-8 Unicode text
```

`$ iconv -f utf-8 -t iso-8859-15 in.txt > out.txt`

out.txt:
```
H
Æ Ø Å
ô í ó
´ ¤
```

`$ hexdump -C in.txt`
```
00000000  48 0a c3 86 20 c3 98 20  c3 85 0a c3 b4 20 c3 ad  |H... .. ..... ..|
00000010  20 c3 b3 0a c5 bd 20 e2  82 ac 0a c2 bc 0a        | ..... .......|
```

H=48, 0a = end of line,
Æ = c3 86, Space = 20, Ø = c3 98, Space=20, Å = c3 85, 0a = end of line
etc. (https://www.utf8-chartable.de/)

`$ hexdump -C out.txt`
```
00000000  48 0a c6 20 d8 20 c5 0a  f4 20 ed 20 f3 0a b4 20  |H.. . ... . ... |
00000010  a4 0a                                             |..|
00000012
```

H=48, 0a = end of line,
Æ = c6, Space = 20, Ø = d8, Space = 20, Å = c5, 0a = end of line,
ô = f4, Space = 20, í = ed, Space = 20, ó = f3, 0a = end of line
etc. (https://en.wikipedia.org/wiki/ISO/IEC_8859-15 / http://www.columbia.edu/kermit/latin9.html)

Ž and € are encoded as the bytes b4 and a4. A viewer that assumes ISO 8859-1 displays these bytes as ´ and ¤, ref. [Wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-15).

If the character ¼ is added as a last line to in.txt (the trailing `c2 bc 0a` in the hexdump of in.txt above), the conversion fails:

```
iconv: illegal input sequence at position 27
```

(since the character is not included in iso-8859-15)
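
The same conversion can be sketched in Python without touching the file system. The string literal below is assumed to match the contents of in.txt shown above.

```python
text = "H\nÆ Ø Å\nô í ó\nŽ €\n"
utf8_bytes = text.encode("utf-8")

# Equivalent of: iconv -f utf-8 -t iso-8859-15
converted = utf8_bytes.decode("utf-8").encode("iso-8859-15")
print(converted.hex(" "))

# ¼ has no byte in ISO-8859-15, so encoding it fails like iconv does.
try:
    "¼".encode("iso-8859-15")
    failed = False
except UnicodeEncodeError as e:
    failed = True
    print(e)
```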

## Pandas
[Unicode Character 'BLACK CHESS KNIGHT'](https://www.fileformat.info/info/unicode/char/265e/index.htm)

In [14]:
df = pd.read_csv("files/in.csv", encoding="utf-8")
df

|   | a | b |
|---|---|---|
| 0 | H | ♞ |


In [15]:
with open('files/in.csv', 'rb') as f:
    content = f.read()
print(binascii.hexlify(content))

b'612c620a482ce2999e0a'


H=48, comma=2c, black chess knight = e2999e (three bytes), end of line = 0a
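
The hex string can be turned back into text to check the byte-by-byte reading. A small sketch using the value printed above:

```python
raw = bytes.fromhex("612c620a482ce2999e0a")
text = raw.decode("utf-8")

print(repr(text))
# Show the UTF-8 bytes behind each character.
print([f"{ch!r}: {ch.encode('utf-8').hex()}" for ch in text])
```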

In [16]:
df2 = pd.read_csv("files/in.csv", encoding="iso-8859-1")
df2

|   | a | b |
|---|---|---|
| 0 | H | â |


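ISO-8859-1 has a valid character mapping for every possible byte, so decoding never fails, ref. [SO](https://stackoverflow.com/questions/40029017/python2-using-decode-with-errors-replace-still-returns-errors). Here the three knight bytes e2 99 9e decode to â followed by two invisible control characters (U+0099 and U+009E), which is why only â is visible.

Some encodings are able to decode any byte sequence. ISO-8859-1 and ISO-8859-15 have this property because all 256 byte values of these 8-bit encodings are assigned, ref. [link](https://unicodebook.readthedocs.io/encodings.html#undecodable-byte-sequences). The reference states this for the whole ISO-8859 family, but some parts (e.g. ISO-8859-6) leave byte values unassigned.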
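
A short check of that property for ISO-8859-1, plus the silent mis-decoding it allows when the bytes were really UTF-8:

```python
all_bytes = bytes(range(256))

# Every byte value decodes, and the result encodes back to the same bytes.
decoded = all_bytes.decode("iso-8859-1")
round_trip = decoded.encode("iso-8859-1") == all_bytes
print(round_trip)

# UTF-8 bytes of the knight, wrongly read as ISO-8859-1: no error, wrong text.
knight = "♞".encode("utf-8")
mojibake = knight.decode("iso-8859-1")
print(repr(mojibake))
```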

In [17]:
# df.to_csv("files/out.csv", encoding="utf-8")  # No problem
try:
    df.to_csv("files/out.csv", encoding="iso-8859-1")
except UnicodeEncodeError as e:
    print(e)

'latin-1' codec can't encode character '\u265e' in position 4: ordinal not in range(256)


In [18]:
df.to_csv("files/out.csv", encoding="iso-8859-1", errors='replace', index=False)
df_out = pd.read_csv("files/out.csv", encoding="iso-8859-1")
df_out

|   | a | b |
|---|---|---|
| 0 | H | ? |


In [19]:
df.to_csv("files/out.csv", encoding="iso-8859-1", errors='backslashreplace', index=False)
df_out = pd.read_csv("files/out.csv", encoding="iso-8859-1")
df_out

|   | a | b |
|---|---|---|
| 0 | H | \u265e |


[New in pandas version 1.1.0 (July 28, 2020)](https://pandas.pydata.org/docs/whatsnew/v1.1.0.html): `DataFrame.to_csv()` and `Series.to_csv()` now accept an `errors` argument ([GH22610](https://github.com/pandas-dev/pandas/issues/22610)).

`errors` is an optional string that specifies how encoding and decoding errors are to be handled. Options are *'strict'* (raise ValueError, default), *'ignore'*, *'replace'*, *'backslashreplace'* etc., ref. [docs](https://docs.python.org/3/library/functions.html#open)
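
The same handlers can be tried directly with `str.encode`, independent of pandas. This sketch encodes the CSV data row with a few of them:

```python
results = {
    handler: "H,♞".encode("iso-8859-1", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
}

for handler, encoded in results.items():
    print(f"{handler:18} {encoded!r}")
```

Only 'backslashreplace' and 'xmlcharrefreplace' keep enough information to recover the original character later.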

## Misc database stuff

### Oracle
```
-- Character set
SELECT * FROM v$nls_parameters
 WHERE parameter LIKE '%CHARACTERSET';
 
--The queries below show that the database uses ISO-8859-15
-- (ref. https://en.wikipedia.org/wiki/ISO/IEC_8859-15)
 
select 'Adrían Błażéj Смирнов' name from dual; -- Adrían B¿a¿éj ¿¿¿¿¿¿¿
 
select '♞' from dual; -- ¿
 
select 'Ž' from dual; -- Ž
 
select '¼' from dual; -- ¿
 
select 'Ƹ' from dual; -- ¿
```
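
The results above can be reproduced in Python with a rough sketch. `simulate_latin9_storage` is a hypothetical helper written for this note; it only mimics the observed output by substituting ¿ for characters that ISO-8859-15 cannot encode, and makes no claim about how Oracle performs the conversion internally.

```python
def simulate_latin9_storage(s: str, placeholder: str = "¿") -> str:
    # Hypothetical helper: keep characters that fit in ISO-8859-15,
    # replace the rest with the placeholder seen in the query results.
    out = []
    for ch in s:
        try:
            ch.encode("iso-8859-15")
            out.append(ch)
        except UnicodeEncodeError:
            out.append(placeholder)
    return "".join(out)

for value in ["Adrían Błażéj Смирнов", "♞", "Ž", "¼", "Ƹ"]:
    print(value, "->", simulate_latin9_storage(value))
```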