### Bits and Bytes to Text

Computers store data as bits, each of which is either `0` or `1`. A sequence of 8 bits is a byte.

Computer hardware has no direct representation of text. This necessitates a translation between bits and bytes on one side and characters (text) on the other.

The translation is provided by a mapping between bytes and characters, called a code page. When text is stored in memory or on disk, it is converted into bytes using the mapping provided by the code page. The reverse translation is used when converting bytes back to text.

Languages have different alphabets. The people of Bougainville have what is often cited as the smallest alphabet in the world; their Rotokas alphabet is composed of 12 letters: A, E, G, I, K, O, P, R, S, T, U and V.

As an exercise, we can come up with a code page for the Rotokas alphabet that uses 4 bits per character: 4 bits give 16 ($2^4$) values, and only 12 are needed.
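
The exercise above can be sketched in a few lines. This is only an illustration: the ordering of the letters and the names `ROTOKAS`, `rotokas_encode` and `rotokas_decode` are made up for this example, not part of any standard.

```python
# Hypothetical 4-bit code page for the 12-letter Rotokas alphabet.
# The letter order (and therefore each number) is an arbitrary choice.
ROTOKAS = "AEGIKOPRSTUV"
ENCODE = {letter: index for index, letter in enumerate(ROTOKAS)}
DECODE = {index: letter for letter, index in ENCODE.items()}


def rotokas_encode(text):
    """Return one 4-bit string per character of text."""
    return [format(ENCODE[ch], "04b") for ch in text]


def rotokas_decode(codes):
    """Turn a list of 4-bit strings back into text."""
    return "".join(DECODE[int(code, 2)] for code in codes)


bits = rotokas_encode("ROTOKAS")
print(bits)
print(rotokas_decode(bits))
```

Anyone who wants to read these bits back needs the same table, which is exactly the role a code page plays.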

English text typically uses the ASCII code page for encoding and decoding characters. ASCII uses 7 bits per character, which allows 128 ($2^7$) characters to be encoded and decoded. The English alphabet, digits and punctuation all fit in the range 0 to 127 (*65* is capital `A`, *97* is lowercase `a`).

Western European languages like German need more than 128 characters, so their encodings use 8 bits and extend the range to 0 through 255 ($2^8 = 256$ values). The most common encoding for these languages is cp1252, also called "windows-1252".

Then there are multibyte character sets for languages like Chinese, Korean and Japanese, which have far more characters than fit in a single byte. But different multibyte encodings have the same problem as different single-byte encodings: the same number means different things, and it can only be deciphered using the correct code page.

When interoperability is not necessary, application programs and operating systems can implement a custom code page (encoding and decoding) for text. When using a proprietary binary format, an application can optimize the encoding and decoding of text, as no other program is expected to be able to decipher the document.

However, in an internetworked world, documents and text are shared between users in different geographical locations. In addition to using the correct code page, computers must agree on a byte order (big-endian or little-endian) before encoding or decoding multibyte characters.
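
The ASCII numbers mentioned above can be checked with Python's built-in `ord` and `str.encode`. The variable names below are only for illustration.

```python
# ASCII maps each character to a number in the range 0..127.
capital_a = ord('A')
lowercase_a = ord('a')
print(capital_a, lowercase_a)

# Encoding ASCII text yields one byte per character.
ascii_bytes = 'Aa'.encode('ascii')
print(list(ascii_bytes))

# A character outside the 128 ASCII values cannot be encoded.
try:
    'ö'.encode('ascii')
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```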
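
To illustrate how the same number can mean different things, the sketch below decodes one byte with two single-byte code pages. The choice of cp1251 (a Cyrillic code page) as the second one is just an example.

```python
raw = b'\xe4'  # a single byte with the value 228

# The same byte decodes to different characters under different code pages.
as_cp1252 = raw.decode('cp1252')  # Western European
as_cp1251 = raw.decode('cp1251')  # Cyrillic
print(as_cp1252, as_cp1251)
print(as_cp1252 == as_cp1251)
```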

#### Unicode

Unicode is a system designed to represent every character from every language. Unicode represents each letter, character or ideograph as a number (a code point) that fits in 4 bytes. Each number represents a unique character used in at least one of the world's languages. Not all numbers are used, but more than 65,536 ($2^{16}$) of them are, so 2 bytes won't suffice.
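
In Python, `ord` returns the Unicode number of a character and `chr` goes the other way. A small sketch (the sample characters are arbitrary):

```python
# Print the Unicode number of a few characters, in decimal and hex.
for ch in ('A', 'ö', '深', '🐍'):
    print(ch, ord(ch), hex(ord(ch)))

# Some characters have numbers that do not fit in 2 bytes (0..0xFFFF).
beyond_two_bytes = ord('🐍') > 0xFFFF
print(beyond_two_bytes)

# chr() maps the number back to the character.
round_trip = chr(ord('深')) == '深'
print(round_trip)
```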

Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to.

The drawback of storing every character as a 4-byte number is wasted disk space and memory, since all 4 bytes are used whether or not the character needs them.

There is a Unicode encoding called UTF-32, so named because 32 bits equals 4 bytes. UTF-32 is a straightforward encoding: it stores each Unicode character as its 4-byte number. The most important advantage of a fixed-width (4-byte) encoding is that you can find the `Nth` character of a string in constant time.

UTF-16 encodes most Unicode characters in 2 bytes, because most people don't need more than 65,536 characters. Characters outside that range are encoded as a pair of 2-byte units (a surrogate pair), so they take 4 bytes.

Because UTF-32 and UTF-16 use multibyte units, programs and operating systems have to deal with endianness.

Because UTF-16 is a variable-length encoding, it is not possible to find the `Nth` character of a string in constant time.
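
A short sketch of what endianness looks like in practice, using the explicit `utf-16-le` and `utf-16-be` codecs. The plain `utf-16` codec prepends a byte order mark and uses the machine's native order, so its exact output depends on the platform.

```python
# The same character, two different byte orders.
little = 'A'.encode('utf-16-le')
big = 'A'.encode('utf-16-be')
print(little, big)

# Without an explicit order, Python adds a 2-byte byte order mark (BOM)
# so that a decoder can tell which order was used.
with_bom = 'A'.encode('utf-16')
print(with_bom)
print(with_bom.decode('utf-16'))
```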
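
The variable width of UTF-16 can be seen by counting bytes. This sketch uses `utf-16-le` so that no byte order mark is included in the counts.

```python
# A character inside the 2-byte range takes one 16-bit unit.
bmp_bytes = len('深'.encode('utf-16-le'))

# A character outside that range takes two 16-bit units.
astral_bytes = len('🐍'.encode('utf-16-le'))
print(bmp_bytes, astral_bytes)

# Three characters, but four 16-bit units: the unit index no longer
# matches the character index.
mixed = 'a🐍b'
units = len(mixed.encode('utf-16-le')) // 2
print(len(mixed), units)
```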

**UTF-8** is the solution that people came up with to reduce memory requirements; it is a variable-length encoding system for Unicode. UTF-8 uses just one byte per character for ASCII. _Extended Latin_ characters like ö take two bytes, common Chinese characters take three bytes, and the rarely used characters take four bytes.
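
The byte counts above are easy to confirm. A minimal sketch (the sample characters are arbitrary):

```python
# Number of UTF-8 bytes needed for one character of each kind.
widths = {ch: len(ch.encode('utf-8')) for ch in ('a', 'ö', '深', '🐍')}
print(widths)

# A mixed string: 2 Chinese characters (3 bytes each) plus 7 ASCII characters.
total = len('深入 Python'.encode('utf-8'))
print(total)
```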

Disadvantages: because each character can take a different number of bytes, finding the Nth character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
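
To give a feel for the bit-twiddling, here is a sketch that encodes a two-byte UTF-8 character by hand and compares it with Python's own encoder. The function name `encode_two_byte` is hypothetical, and the sketch deliberately handles only the two-byte range.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as 2 UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles two-byte code points")
    lead = 0b11000000 | (code_point >> 6)                # 110xxxxx: top 5 bits
    continuation = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, continuation])


manual = encode_two_byte(ord('ö'))
print(manual)
print(manual == 'ö'.encode('utf-8'))
```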

Advantages: lower memory footprint. Because UTF-8 is defined as a sequence of single bytes, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.

#### Diving in

In Python, all strings are sequences of Unicode characters. Bytes are not characters; bytes are bytes. Characters are an abstraction.

In [5]:
s = '深入 Python'
len(s)

9

In [6]:
s[0]

'深'

To reduce memory consumption and improve performance, Python uses three kinds of internal **fixed-length** representations for Unicode strings, and picks the narrowest one that can hold every character in the string:

* 1 byte per char (latin-1 encoding)
* 2 bytes per char (ucs-2 encoding)
* 4 bytes per char (ucs-4 encoding)

In [28]:
import sys
# 1 byte per character (the string contains only Latin-1 characters)
s = 'a'
print(sys.getsizeof(s)-sys.getsizeof(''))
# 2 bytes per character (the string contains a character that needs 2 bytes)
s = s + '深'
print(sys.getsizeof(s)-sys.getsizeof('深'))
# 4 bytes per character (the string contains an astral plane character)
s = s + '🐍'
print(sys.getsizeof(s)-sys.getsizeof('深🐍'))

1
2
4


**Why Python doesn't use UTF-8 encoding internally**

The most well-known and popular Unicode encoding is UTF-8, but Python doesn't use it internally.

When a string is stored in the UTF-8 encoding, each character is encoded using 1-4 bytes depending on the character it is representing. It's a storage-efficient encoding, but it has one significant disadvantage. Since each character can vary in its length in bytes, there is no way to randomly access an individual character by index without scanning the string. So, to perform a simple operation such as `string[5]` with UTF-8, Python would need to scan the string until it finds the required character. Fixed-length encodings don't have this problem: to locate a character by index, Python just multiplies the index by the length of one character (1, 2 or 4 bytes).
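
The scan described above can be sketched as follows. This is not how Python works internally; it only illustrates why indexing into UTF-8 bytes is O(N). The helper names `utf8_width` and `nth_char_utf8` are made up for this example, and the code assumes valid UTF-8 input.

```python
def utf8_width(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    return 4      # 11110xxx (valid input assumed)


def nth_char_utf8(data, n):
    """Return the n-th character (0-based) by walking over the earlier ones."""
    position = 0
    for _ in range(n):
        position += utf8_width(data[position])
    width = utf8_width(data[position])
    return data[position:position + width].decode('utf-8')


encoded = '深入 Python'.encode('utf-8')
print(nth_char_utf8(encoded, 0))
print(nth_char_utf8(encoded, 3))
```

With a fixed-width encoding the loop disappears: the byte offset is simply `n * width`.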

##### Formatting Strings

Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.

In [29]:
user = 'bhaskar'
password = 'testing'
"{0}'s password is {1}".format(user, password)

"bhaskar's password is testing"

In the above, `format` is a method call on a string literal. *Strings* are objects, and objects have methods. The method evaluates to a string. `{0}` and `{1}` are placeholders which are replaced by the arguments passed to the `format` method.

Replacement fields are more powerful than simply using positional arguments.

In [34]:
suffixes = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

'1000{0[1000][0]} = 1{0[1000][1]}'.format(suffixes)

'1000KB = 1MB'

This looks complicated, but it is not. `{0}` refers to the first argument passed to the `format` method, `suffixes`. `suffixes` is a dictionary with keys `1000` and `1024`, so `{0[1000]}` returns the value associated with the key `1000` in `suffixes`, the type of which is `List[str]`. The final `[0]` or `[1]` indexes into that `list`.

What this example shows is that replacement fields *can access items and properties of data structures using (almost) Python syntax*. These are called *compound field names*. The following compound field names "just work":

* Passing a list and accessing an item of the list by index
* Passing a dictionary and accessing a value of the dictionary by key
* Passing a module and accessing its variables and functions by name
* Passing a class instance and accessing its properties and methods by name
* Any combination of the above

In [44]:
import sys
# passing in the module sys to the format method
"Name of program is {0.argv[0]}".format(sys)

'Name of program is /Users/bhaskar/Library/Caches/pypoetry/virtualenvs/dive_into_python3-NDjgBark-py3.8/lib/python3.8/site-packages/ipykernel_launcher.py'
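
The class-instance case from the list above works the same way. A minimal sketch, using a hypothetical `Point` class invented for this example:

```python
class Point:
    """Hypothetical class used only to demonstrate attribute access."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(3, 4)

# Attribute access on the first positional argument.
by_attribute = "x={0.x}, y={0.y}".format(p)
print(by_attribute)

# Combination: index into a list, then access an attribute.
combined = "{0[0].x}".format([p])
print(combined)
```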

##### Format specifiers

Format specifiers allow you to print text in a variety of useful ways, similar to `printf` in C. You can add zero or space padding, align strings, control decimal precision and even convert numbers to hexadecimal.

Within a replacement field (`{}`), `:` marks the start of a format specifier. `.1f` formats the value as a fixed-point number with one digit after the decimal point, that is, *rounded to the nearest tenth*.

Thus, given a size of `698.26` and suffix of `GB` the formatted string would be `698.3 GB`.

In [46]:
"{0:.1f} {1}".format(698.26, 'GB')

'698.3 GB'
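
The other capabilities mentioned above (padding, alignment, hexadecimal) can be sketched like this. The values are arbitrary examples.

```python
# Right-align in a field of width 8, padded with spaces.
right = '{0:>8}'.format('abc')

# Left-align in a field of width 6.
left = '{0:<6}|'.format('ab')

# Zero-pad to width 8 with two digits after the decimal point.
zero_padded = '{0:08.2f}'.format(3.14159)

# Convert an integer to lowercase hexadecimal.
as_hex = '{0:x}'.format(255)

print(repr(right), repr(left), zero_padded, as_hex)
```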

##### Other common string methods

Besides formatting, strings can do a number of other useful tricks. Let's say you have a list of key-value pairs in the form `key1=value1&key2=value2` and you want to split them up to make a dictionary of the form `{key1: value1, key2: value2}`.

In [50]:
query='user=bhaskar&database=blog&password=testing'

a_list = query.split('&')
# the second argument to split specifies the maximum number of times to split
a_list_of_lists = [v.split('=', 1) for v in a_list if '=' in v]
a_dict = dict(a_list_of_lists)
a_dict

{'user': 'bhaskar', 'database': 'blog', 'password': 'testing'}
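
For real query strings, the standard library offers an alternative to splitting by hand. The sketch below uses `urllib.parse.parse_qsl`, which is not what the cell above does; it is shown only for comparison, and it also decodes percent-escapes, which the manual split does not.

```python
from urllib.parse import parse_qsl

query = 'user=bhaskar&database=blog&password=testing'

# parse_qsl returns a list of (key, value) pairs, which dict() accepts.
pairs = parse_qsl(query)
parsed = dict(pairs)
print(parsed)
```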

##### Slicing a string

Once you have defined a string, you can get any part of it as a new string. This is called *slicing*. Slicing works exactly the same as it does for *lists*, which makes sense, since strings are sequences of characters.

In [51]:
a_string = 'My alphabet starts where you alphabet ends'
a_string[3:11]

'alphabet'

In [53]:
a_string[3:-5]

'alphabet starts where you alphabet'

In [54]:
a_string[:2]

'My'

In [55]:
a_string[:18]

'My alphabet starts'
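
A few more slice forms, sketched on the same string as in the cells above: an open-ended slice, a negative start, and a step.

```python
a_string = 'My alphabet starts where you alphabet ends'

# Omitting the end index slices to the end of the string.
tail = a_string[18:]

# The two halves always reassemble into the original string.
rejoined = a_string[:18] + a_string[18:]

# A negative start counts from the end.
last_word = a_string[-4:]

# A step of -1 reverses the slice.
reversed_word = a_string[3:11][::-1]

print(repr(tail), last_word, reversed_word)
```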

##### Strings vs. Bytes

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is a string. An immutable sequence of numbers between 0 and 255 is called a *bytes* object.

To define a bytes object use the `b''` literal syntax.

In [58]:
by = b'abcd\x65'
by

b'abcde'

In [59]:
type(by)

bytes

In [60]:
len(by)

5

In [62]:
by[0]

97

In [64]:
barr = bytearray(by)
barr

bytearray(b'abcde')

In [67]:
barr[0] = 0x66
barr

bytearray(b'fbcde')
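
The cells above assign to an item of a `bytearray`. The sketch below shows, for contrast, that the same assignment fails on a `bytes` object, and that a `bytearray` item must stay in the range 0 to 255.

```python
by = b'abcde'

# bytes objects are immutable: item assignment raises TypeError.
try:
    by[0] = 0x66
    bytes_assignment_failed = False
except TypeError:
    bytes_assignment_failed = True

# bytearray is mutable, but each item must be in range(0, 256).
barr = bytearray(by)
barr[0] = 0x66
try:
    barr[1] = 256
    out_of_range_failed = False
except ValueError:
    out_of_range_failed = True

print(bytes_assignment_failed, out_of_range_failed, barr)
```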

Converting between bytes and strings can be done using the `encode` and `decode` methods. These methods take an optional character encoding, which defaults to UTF-8.

In [74]:
print(type(by.decode('ascii')))
by.decode('ascii')

<class 'str'>


'abcde'

In [75]:
"abcde".count(b'd'.decode('ascii'))

1
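
The explicit `decode` in the cell above is needed because Python does not mix strings and bytes implicitly. A short sketch of what happens without it:

```python
# Counting a bytes pattern inside a str raises TypeError.
try:
    "abcde".count(b'd')
    count_failed = False
except TypeError:
    count_failed = True

# Concatenating bytes and str fails in the same way.
try:
    b'abcde' + 'f'
    concat_failed = False
except TypeError:
    concat_failed = True

print(count_failed, concat_failed)
```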

In [89]:
a_string = '深入 Python'
len(a_string)

9

In [90]:
by = a_string.encode('utf-32')
len(by)

40

In [91]:
# Let's break this by decoding with a different encoding than the one used to encode.
# The result should be gibberish.
by.decode('utf-16')

'\x00深\x00入\x00 \x00P\x00y\x00t\x00h\x00o\x00n\x00'

In [92]:
by.decode('utf-32')

'深入 Python'
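
Decoding with the wrong encoding does not always produce gibberish silently; sometimes the bytes are simply invalid and `decode` raises an error. A minimal sketch using the optional `errors` argument of `bytes.decode` (the sample bytes are arbitrary):

```python
bad = b'abc\xff'  # 0xff can never appear in valid UTF-8

# By default, invalid input raises UnicodeDecodeError.
try:
    bad.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for each undecodable byte.
replaced = bad.decode('utf-8', errors='replace')

# errors='ignore' drops the undecodable bytes.
ignored = bad.decode('utf-8', errors='ignore')

print(strict_failed, repr(replaced), repr(ignored))
```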

In [95]:
by2 = a_string.encode('gb18030')
print(len(by2))
by2

11


b'\xc9\xee\xc8\xeb Python'

In [94]:
by2.decode('gb18030')

'深入 Python'

In [100]:
a_string == by2.decode('gb18030')

True