In [120]:
type('abc')  # plain str (a sequence of Unicode characters)
type(b'abc')  # bytes literal: the ASCII-encoded bytes of 'abc'

str

bytes

In [57]:
list('abc')  # str is like [char, ...]
list(b'abc')  # bytes is like [int, ...]

['a', 'b', 'c']

[97, 98, 99]

### ASCII

https://www.ascii-code.com/

- standard ASCII uses 7 bits: 0-127 -> 128 chars
- extended ASCII uses 8 bits: 0-255 -> 256 chars

There are many variants (character sets) of extended ASCII, because different vendors and standards bodies assigned the extra 128 values (128-255) in their own way.

Two of the most widely used character sets are:

- Windows-1252 (CP-1252)
  - https://en.wikipedia.org/wiki/Windows-1252
  - Printable characters rather than control characters in most of the 128 to 159 range (a few positions are unassigned)
  
- ISO-8859-1 (ISO Latin-1)
  - https://en.wikipedia.org/wiki/ISO/IEC_8859-1
  - Control characters in the 128 to 159 range
  
- Both have the same characters in the 160 to 255 range
- In terms of printable characters, Windows-1252 is effectively a superset of ISO-8859-1
- Both were created to support most Western European languages by adding accented letters (like é, ü, ñ) and symbols
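
The difference between the two character sets can be seen by decoding the same bytes with each codec. This is a minimal sketch using Python's built-in `cp1252` and `latin-1` codecs; the variable names are only illustrative.

```python
# Byte 0x80 (128) lies in the 128-159 range, where the two character sets differ.
# Byte 0xE9 (233) lies in the 160-255 range, where they agree.
raw = bytes([0x80, 0xE9])

as_cp1252 = raw.decode('cp1252')   # 0x80 -> '€', 0xE9 -> 'é'
as_latin1 = raw.decode('latin-1')  # 0x80 -> control character '\x80', 0xE9 -> 'é'

print(repr(as_cp1252))  # '€é'
print(repr(as_latin1))  # '\x80é'
```

The second character is identical in both results, while the first depends on which character set is assumed.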

`ord(char)`: char -> Unicode code point

`chr(code)`: Unicode code point -> char

The two functions are inverses of each other.

The first 256 Unicode code points (U+0000 to U+00FF) correspond exactly to the 256 characters defined in the ISO-8859-1 character set, so for these characters the code point equals the ISO-8859-1 byte value.
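
One way to check this correspondence, as a small sketch: decode every possible byte value with Python's `latin-1` codec and compare the resulting code points with the byte values. The variable names here are illustrative.

```python
# All 256 possible byte values, 0 through 255.
all_bytes = bytes(range(256))

# Decode them as ISO-8859-1 (Python calls this codec 'latin-1').
decoded = all_bytes.decode('latin-1')

# Each decoded character's code point equals the original byte value.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))  # True

# The same holds in the other direction for a single character.
print('é'.encode('latin-1') == bytes([ord('é')]))  # True
```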

https://www.compart.com/en/unicode/plane

In [80]:
# char -> int
# ord() returns the Unicode code point (ordinal value) of a character
ord('A')
ord('a')

65

97

In [79]:
# int -> char
# chr() returns the Unicode character for a code point (ordinal value)
chr(65)
chr(97)

'A'

'a'

In [83]:
ord('€')  # € is byte 128 (0x80) in Windows-1252, but its Unicode code point is 8364 (U+20AC)

8364

In [84]:
# as per Unicode:

chr(128)  # is NOT €, but a non-printable control character
chr(8364)  # is €

# conclusion: the Windows-1252 characters in the range 128 to 159 are mapped elsewhere in Unicode

'\x80'

'€'

### string to bytes encoding

- `ascii` encoding: limited to the 128 characters of standard ASCII
- `utf-8` encoding: can represent the full Unicode character set, using 1 to 4 bytes per character
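
The difference between the two encodings shows up as soon as a string contains a non-ASCII character. The following is a minimal sketch; `text`, `utf8_bytes` and `ascii_error` are illustrative names.

```python
text = 'abc€'

# UTF-8 can encode any Unicode character; '€' takes three bytes.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)  # b'abc\xe2\x82\xac'

# ASCII covers only code points 0-127, so encoding '€' fails.
ascii_error = None
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    ascii_error = exc
print(type(ascii_error).__name__)  # UnicodeEncodeError
```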

In [122]:
list(b'abc')  # bytes literals accept only ASCII characters, plus escapes such as \xNN

[97, 98, 99]

In [124]:
# b'abcé'
# SyntaxError: bytes can only contain ASCII literal characters

b'abc\xc3\xa9'  # \xc3\xa9 represents the character 'é' in utf-8 encoding

b'abc\xc3\xa9'

In [127]:
# b'abc\xc3\xa9'.decode('ascii')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

b'abc\xc3\xa9'.decode('utf-8')

'abcé'

In [128]:
list(bytes('abc', 'ascii'))
list(bytes('abc', 'utf-8'))

# both give the same bytes for standard 7-bit ASCII characters (0-127)

[97, 98, 99]

[97, 98, 99]
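
Outside the 0-127 range the encodings no longer agree. As a short sketch (the variable names are illustrative), the same character `'é'` yields different byte values depending on the codec:

```python
char = 'é'  # code point 233 (U+00E9)

utf8_values = list(char.encode('utf-8'))      # two bytes in UTF-8
latin1_values = list(char.encode('latin-1'))  # one byte, equal to the code point
cp1252_values = list(char.encode('cp1252'))   # same as latin-1 in the 160-255 range

print(utf8_values)    # [195, 169]
print(latin1_values)  # [233]
print(cp1252_values)  # [233]
```

This is why bytes must be decoded with the same encoding that produced them.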