### Bytes vs Strings

- Bytes are bytes; characters are an abstraction. 
- An immutable sequence of Unicode characters is called a string. 
- An immutable sequence of numbers-between-0-and-255 is called a bytes object.
- To define a bytes object, use the **b'' “byte literal” syntax**. Each byte within the byte literal can be **an ASCII character** or an encoded hexadecimal number from **\x00 to \xff (0–255)**.
- The type of a bytes object is bytes.
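
The points above can be checked with a minimal sketch. The names `data`, `text` and `immutable_error` are illustrative only and do not come from the original notebook.

```python
# A bytes literal mixing ASCII characters with \x escapes.
data = b'abc\x00\xff'
print(type(data))   # <class 'bytes'>
print(len(data))    # 5
print(list(data))   # [97, 98, 99, 0, 255]: each element is an int in 0..255

# A string is a sequence of Unicode characters, not of numbers.
text = 'abc'
print(type(text))   # <class 'str'>

# bytes objects are immutable: item assignment raises TypeError.
try:
    data[0] = 0x41
except TypeError as exc:
    immutable_error = str(exc)
    print(immutable_error)
```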

### Return the hexadecimal representation of the binary data

- https://docs.python.org/3/library/binascii.html
- binascii.b2a_hex(data[, sep[, bytes_per_sep=1]]) or binascii.hexlify(data[, sep[, bytes_per_sep=1]])
- Every byte of data is converted into the corresponding 2-digit hex representation. The returned bytes object is therefore twice as long as data.
    
### Return the binary data represented by the hexadecimal string

- binascii.a2b_hex(hexstr) or binascii.unhexlify(hexstr)
- This function is the inverse of b2a_hex().
- hexstr must contain an even number of hexadecimal digits (which can be upper or lower case); otherwise a binascii.Error exception is raised.
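
As a rough sketch of the two less obvious behaviours described above: the optional `sep` argument (which, as far as I know, requires Python 3.8 or later) and the error raised for an odd number of hex digits. The variable names are illustrative.

```python
import binascii

raw = b'\xb9\x01\xef'

plain = binascii.hexlify(raw)         # two hex digits per input byte
dashed = binascii.hexlify(raw, '-')   # sep inserted between bytes (Python 3.8+)
print(plain)    # b'b901ef'
print(dashed)   # b'b9-01-ef'

# An odd number of hex digits cannot be converted back to bytes.
try:
    binascii.unhexlify('abc')
except binascii.Error as exc:
    error_message = str(exc)
    print(error_message)
```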

In [None]:
import binascii
str1 = binascii.b2a_hex(b'\xb9\x01\xef')
str1

In [None]:
by1 = binascii.a2b_hex(str1)
by1

In [None]:
str2 = binascii.b2a_hex(b'aXEk12') # a:\x61, X:\x58, E:\x45, k:\x6b, 1:\x31, 2:\x32
#str2 = binascii.b2a_hex(b'\x61\x58\x45\x6b\x31\x32')
str2

In [None]:
by2 = binascii.a2b_hex(str2)
by2

### Example
- Check whether a given image is a JPEG.
- Reference: https://www.geeksforgeeks.org/working-with-binary-data-in-python/

In [None]:
import binascii
  
# Use binascii.a2b_hex() to generate the bytes values, or
# write them directly with the bytes literal syntax (see the commented-out list below).

jpeg_signatures = [
    binascii.a2b_hex(b'FFD8FFD8'),
    binascii.a2b_hex(b'FFD8FFE0'),
    binascii.a2b_hex(b'FFD8FFE1')
]

'''
jpeg_signatures = [
    b'\xFF\xD8\xFF\xD8',
    b'\xFF\xD8\xFF\xE0',
    b'\xFF\xD8\xFF\xE1',
]
''' 

with open('metaverse.jpg', 'rb') as file:
    first_four_bytes = file.read(4)
  
    if first_four_bytes in jpeg_signatures:
        print("JPEG detected.")
    else:
        print("File does not look like a JPEG.")
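
A variation on the check above, as a sketch only: instead of listing four-byte signatures, it compares just the three-byte prefix FF D8 FF, which is shared by all three signatures in the list. The helper `looks_like_jpeg` and the in-memory test data are hypothetical, and `io.BytesIO` stands in for a real file so the example runs without `metaverse.jpg`.

```python
import io

# Prefix common to the three signatures listed above.
JPEG_PREFIX = b'\xff\xd8\xff'

def looks_like_jpeg(stream):
    """Return True if the binary stream starts with JPEG_PREFIX."""
    return stream.read(len(JPEG_PREFIX)) == JPEG_PREFIX

# In-memory stand-ins for files opened with mode 'rb'.
fake_jpeg = io.BytesIO(b'\xff\xd8\xff\xe0' + b'\x00' * 16)
fake_png = io.BytesIO(b'\x89PNG\r\n\x1a\n')

print(looks_like_jpeg(fake_jpeg))  # True
print(looks_like_jpeg(fake_png))   # False
```

With a real file, the same helper could be called as `looks_like_jpeg(open(path, 'rb'))` inside a `with` block.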

### Practice: check whether an image is a PNG

- Eight-byte signature: b'\x89PNG\r\n\x1a\n'

In [None]:
png_signature = b'\x89PNG\r\n\x1a\n'
print(binascii.b2a_hex(png_signature))  # show the signature as hex digits
#png_signature = binascii.a2b_hex('89504E470D0A1A0A')
with open('line.png','rb') as img:
    signature = img.read(8)
    if signature == png_signature:
        print('PNG file detected.')
    else:
        print('It is not a PNG image file.')

### Python bytes concatenation

- Bytes don't work quite like strings.
    - When you index with a single value (rather than a slice), you get an integer rather than a length-one bytes instance.
    - For a = b'\x14\xf6', a[0] gives you the int 20 (hex 0x14).
- The bytes constructor:
    - If you pass a single integer as the argument (rather than an iterable), you get a bytes instance that consists of that many null bytes (b'\x00').
    - bytes({a[0]}) works because the curly brackets create a set, which is iterable.
- Use slicing to concatenate: a += a[0:1]
    - A slice, unlike indexing with a single value, gives you a bytes instance that you can concatenate onto your existing value.
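
The behaviours listed above can be compared side by side in a short sketch. `bytes([a[0]])` (a one-element list) is an extra option not mentioned in the notes; it is included on the assumption that any iterable of ints works the same way as the set.

```python
a = b'\x14\xf6'

print(a[0])              # 20: indexing returns an int
print(a[0:1])            # b'\x14': slicing returns bytes
print(len(bytes(a[0])))  # 20: bytes(20) is twenty null bytes
print(bytes([a[0]]))     # b'\x14': an iterable of ints becomes those bytes

# Concatenation needs bytes on both sides, so use the slice.
extended = a + a[0:1]
print(extended)          # b'\x14\xf6\x14'
```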


In [None]:
a = b'\x14\xf6'
a += a[0]  # raises TypeError: a[0] is an int, which cannot be concatenated to bytes

In [None]:
bytes(a[0])  # a[0] == 20, so this is 20 null bytes

In [None]:
bytes({a[0]})  # the set {20} is iterable, so this is b'\x14'

In [None]:
a += a[0:1]
a

### Converting between bytes and strings

- bytes objects have a decode() method that takes a character encoding and returns a string.
- Strings have an encode() method that takes a character encoding and returns a bytes object.
- ```by.decode('ascii')``` converts a sequence of bytes in the ASCII encoding into a string of characters.
- The same process works with any encoding that supports the characters of the string, even legacy (non-Unicode) encodings.
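
To illustrate the last point, here is a small sketch using GB2312 as an example of a legacy encoding (an assumption chosen for illustration, since it covers the characters in the sample string), and ASCII as an encoding that does not support them.

```python
text = '深入 Python'

# A legacy encoding that supports these characters round-trips cleanly.
gb = text.encode('gb2312')
print(len(text), len(gb))        # character count vs. byte count
print(gb.decode('gb2312') == text)  # True

# ASCII cannot represent the Chinese characters, so encoding fails...
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print(type(exc).__name__)

# ...unless an error handler is given; 'replace' substitutes '?'.
ascii_lossy = text.encode('ascii', errors='replace')
print(ascii_lossy)               # b'?? Python'
```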


In [None]:
a_string = '深入 Python'  #\u6df1\u5165
len(a_string) 

In [None]:
# The first 4 bytes, ff fe 00 00, are the BOM (byte order mark) in little-endian order
by = a_string.encode('utf32') 
print(len(by))
by

In [None]:
# The first 2 bytes, ff fe, are the BOM (byte order mark) in little-endian order
by = a_string.encode('utf16') 
print(len(by))
by

In [None]:
for b in by:
    print(bin(b))

In [None]:
by = a_string.encode('utf8') 
print(len(by))
by

In [None]:
for b in by:
    print(f'{bin(b)} : {hex(b)}')

In [None]:
b_string = by.decode('utf8')
b_string

### Unicode Literals in Python Source Code

- In Python source code, specific Unicode code points can be written using the **\u escape sequence**, which is followed by **four** hex digits giving the code point. 
- The **\U escape sequence** is similar, but expects **eight** hex digits, not four:

In [3]:
s = "a\xac\u1234\u20ac\U00008000"
#     ^^^^ two-digit hex escape
#         ^^^^^^ four-digit Unicode escape
#                     ^^^^^^^^^^ eight-digit Unicode escape
[ord(c) for c in s]

[97, 172, 4660, 8364, 32768]

In [5]:
face = '\U0001F610'
print(face)
face_by = face.encode('utf-8')
print(face_by)

😐
b'\xf0\x9f\x98\x90'


In [7]:
#U+1F610 😐 NEUTRAL FACE
face_by = b'\xf0\x9f\x98\x90'
face = face_by.decode('utf-8')
print(face)

😐
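
As a closing sketch tying the escape sequences to the encodings above: the three escape forms can name the same code point, and the number of UTF-8 bytes grows with the code point value. The dictionary `utf8_widths` is an illustrative name, not part of the original notebook.

```python
# \x, \u and \U are just different widths for writing a code point.
same = '\xac' == '\u00ac' == '\U000000ac'
print(same)  # True

# Code points from the examples above, plus the neutral face.
sample = 'a\xac\u1234\u20ac\U00008000\U0001F610'

# Map each code point (as U+XXXX) to its length in UTF-8 bytes.
utf8_widths = {f'U+{ord(ch):04X}': len(ch.encode('utf-8')) for ch in sample}
for code_point, width in utf8_widths.items():
    print(code_point, width)
```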
