**DeapSECURE module 5: Cryptography for Privacy-Preserving Computation (part A: Data Protection)**

# Session 1: Encoding: Representing Data on Computers

Welcome to the DeapSECURE online training program!
This is a Jupyter notebook for the hands-on learning activities of the
["Cryptography" module](https://deapsecure.gitlab.io/deapsecure-lesson05-crypt/),
Episode 3: ["Encoding and Encrypting Data Using Python Libraries"](https://deapsecure.gitlab.io/deapsecure-lesson05-crypt/03-python-library/index.html).
Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.

In this session, we will learn how data is stored and represented on computers.

<a id="TOC"></a>
**Quick Links** (sections of this notebook):

* 1 [Setup](#sec-Setup)
* 2 [Integers, Bits, and Hexadecimal Numbers](#sec-Integers-Bits-Hex)
* 3 [Strings and Bytes](#sec-Strings-Bytes)
* 4 [Hexadecimal Representation](#sec-Hex-Representation)
* 5 [Encoding and Decoding Integers](#sec-Encode-Decode-Ints)
* 6 [Padding and Unpadding](#sec-Pad-Unpad)

<a id="sec-Setup"></a>
## 1. Setup Instructions

If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.

If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.

1. Make sure you have activated your HPC service.
2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
3. Create a new Jupyter session with the following parameters: Python version **3.7**, Python suite `tensorflow 2.6 + pytorch 1.10`, Number of Cores **1**, Number of GPU **0**, Partition `main`, and Number of Hours at least **4**. (See <a href="https://wiki.hpc.odu.edu/en/ood-jupyter" target="_blank">ODU HPC wiki</a> for more detailed help.)
4. From the JupyterLab launcher, start a new Terminal session. Then issue the following commands to get the necessary files:

       mkdir -p ~/CItraining/module-crypt
       cp -pr /shared/DeapSECURE/module-crypt/. ~/CItraining/module-crypt
       
The file name of this notebook is `Crypt-session-1.ipynb`.

### 1.1 Reminder

* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in something appropriate.
* To run the code in a cell, press `Shift+Enter`.
* Use `ls` to view the contents of a directory.

<a id="sec-Integers-Bits-Hex"></a>
## 2. Integers, Bits, and Hexadecimal Numbers

Encryption and decryption are low-level operations: they operate on data at its lowest level of representation, *binary bits*.
In modern computers, bits are bundled into *bytes*.
(For example, the size of a computer drive is often measured in gigabytes; one gigabyte is 1 billion bytes.)

A *byte* consists of eight bits (each bit can be 0 or 1);
these eight bits make up a short integer:

    8-bits        hexadecimal   decimal value
    00000000   =>     0x00    =>     0
    00000001   =>     0x01    =>     1
    00000010   =>     0x02    =>     2
    00000011   =>     0x03    =>     3
    ...
    11111110   =>     0xFE    =>   254
    11111111   =>     0xFF    =>   255

A byte, therefore, is equivalent to a short integer, whose value can be 0, 1, 2, ... through 255.
Integers on computers are often expressed in the *hexadecimal* (base-16) representation instead of bits (binary, or base-2 representation) because it is more concise.
The hexadecimal number system uses 16 possible digits (0 through 9, followed by A through F):

    decimal values:
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    hexadecimal:
    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F

Any byte can be represented using just two hexadecimal digits.
A hexadecimal number is often written with either the `0x` prefix or the `h` suffix to prevent confusion with the customary decimal numbers (base-10).
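
The conversions in the tables above can be checked directly in Python. The following is a minimal sketch that uses only built-in functions; the variable names are ours and are not used elsewhere in this notebook.

```python
# Convert one byte value between decimal, binary, and hexadecimal notation.
value = 254

as_bits = format(value, "08b")   # zero-padded, 8 binary digits
as_hex = format(value, "02X")    # zero-padded, 2 uppercase hex digits
print(as_bits)                   # 11111110
print(as_hex)                    # FE

# Going the other way: int() accepts the base as its second argument.
from_bits = int("11111110", 2)
from_hex = int("FE", 16)
print(from_bits, from_hex)       # 254 254

# Python literals also understand the 0x (hex) and 0b (binary) prefixes.
print(0xFF == 255 == 0b11111111) # True
```

Python uses the `0x` prefix convention; the `h` suffix is not valid Python syntax.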

In a nutshell, *at the lowest level, all types of data on computers are represented as integers* (i.e. as bits and bytes), regardless of whether we interpret them as numbers, texts, images, sounds, videos, etc.

<a id="sec-Strings-Bytes"></a>
## 3. Strings and Bytes

Let us learn some things about strings and bytes, as they will help us work correctly later in the encryption and decryption process.

In Python, a *string* is a sequence of characters as understood by humans. Internationalization efforts produced a universal character standard called *Unicode*, so a Python `str` can contain characters from arbitrary scripts (not just Latin, but also Arabic, Japanese, Chinese, etc.):

In [None]:
"""Create a "Hello world" string and save that as a variable named `S`:""";

#S = "#TODO"

In [None]:
"""Print the value of S""";

#print(#TODO)

Try to include other characters in the string, for example Arabic, Japanese, Chinese, French etc. (This may be tricky to do if your OS supports only US English language. You can browse characters on this website: http://www.unicode.org/charts/ . Alternatively, you can open your favorite word processor and use the "Insert" -> "Special Symbol" functionality.)

Here's an example with a greeting in Chinese:

>  早上好, friend!

In [None]:
"""Create a string that contains non-Latin characters, then print it:""";

#S_ZH = #TODO

Python defines a special datatype called `bytes`, which is simply a string of bytes.
Any variable and value in Python (whether an integer or a string) can be converted into its `bytes` equivalent.

Encryption works on the byte-representation of the data.
Therefore, before we try to encrypt and decrypt data, it is very important that we understand how to:

* *encode* the high-level data into the low-level `bytes`, and
* *decode* (reconstruct) `bytes` back into the high-level data

Now let us encode the characters in `S` to get their low-level byte representation:

In [None]:
"""Encode the string `S` to bytes:""";

B = bytes(S, encoding="utf-8")
print(B)

In [None]:
# This is a different way to accomplish the same result:
# the "utf-8" argument can even be omitted; it is already the default

B_alt = S.encode("utf-8")
print(B_alt)

In contrast to strings, Python will enclose the contents of a `bytes` variable with `b'...'`.

How long is the string `S` (how many characters)?
How long is `B` (how many bytes)?
(*Hint*: In both cases, use the `len()` function to find out.)

In [None]:
"""Print the length of `S` and `B`""";

#print(len(#TODO))

Now observe what happens when we encode the other string (`S_ZH`) that has non-Latin characters:

In [None]:
"""Try to encode the `S_ZH` string to bytes as well.""";

#B_ZH = #TODO 

Print the lengths of `S_ZH` and `B_ZH`:

In [None]:
#TODO

**QUESTIONS:**

* Why does `B_ZH` look different from `B`?
* Why do the lengths of `S_ZH` and `B_ZH` differ? Which one is longer?

The array of bytes can be decoded back to a string (using UTF-8 encoding, by default):

In [None]:
"""Decode bytes into string""";

B_ZH.decode()

Historically, a character was represented as a byte (each byte value corresponding to a particular character). **ASCII** was the most widely accepted standard of character mapping.
Only values 0..127 have a defined meaning in ASCII, as you can see from [the ASCII character table](https://commons.wikimedia.org/wiki/File:ASCII_Code_Chart.svg) below.
For example, the letters `A` and `a` have the ASCII codes of `0x41` and `0x61`, respectively (65 and 97 in decimal).

![ASCII character table](fig/500px-ASCII_Code_Chart.png)

(from: https://commons.wikimedia.org/wiki/File:ASCII_Code_Chart.svg)
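
The ASCII codes quoted above can be looked up with Python's built-in `ord()` and `chr()` functions. This is a small illustrative sketch; the variable names are our own.

```python
# ord() gives the integer code of a character; chr() goes the other way.
code_A = ord("A")
code_a = ord("a")
print(code_A, hex(code_A))    # 65 0x41
print(code_a, hex(code_a))    # 97 0x61

print(chr(0x41), chr(0x61))   # A a

# Upper- and lowercase ASCII letters differ by exactly 0x20 (32).
print(code_a - code_A)        # 32
```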

In Python, bytes that have the printable ASCII character representation will be printed using the corresponding characters.
The other values have no printable representation, so they appear as escape sequences, usually hexadecimal ones of the form `\xNN`.
For example, in

    b'\xe6\x97\xa9\xe4\xb8\x8a\xe5\xa5\xbd, friend!'

the substring `\xe6` stands for a byte value `0xE6` = 230 in decimal.
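
To see how Python chooses between the two display styles, here is a short sketch that builds a `bytes` object from a hand-picked list of integers (the values are arbitrary examples, not data from this lesson).

```python
# Mix printable ASCII codes with values that have no printable character.
raw = bytes([72, 105, 33, 230, 0, 10])
print(raw)            # b'Hi!\xe6\x00\n'

# Indexing a bytes object returns a plain integer, not a character.
print(raw[0])         # 72
print(raw[3])         # 230
print(hex(raw[3]))    # 0xe6
```

Note that a few control characters, such as the newline (value 10), are shown with their own short escapes (`\n`) rather than as `\x0a`.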

Today, [**UTF-8**](https://en.wikipedia.org/wiki/UTF-8) is widely used in computing systems. It extends and supersedes ASCII, accommodating far more characters than the 256 values a single byte can hold. UTF-8 is backward compatible with ASCII: every ASCII character keeps its one-byte code.
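
Backward compatibility can be checked with a quick experiment. The sketch below encodes the same ASCII-only text with both codecs, then tries a non-ASCII character; the variable names are illustrative only.

```python
text = "Hello world"

# For ASCII-only text, both encodings produce identical bytes.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)      # True

# A character outside ASCII cannot be encoded with the "ascii" codec.
e_acute = "\u00e9"                    # the letter é, written as an escape
try:
    e_acute.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                       # False

# UTF-8 handles it, using two bytes.
print(e_acute.encode("utf-8"))        # b'\xc3\xa9'
```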

For example, in UTF-8, each Chinese character in our greeting has a three-byte representation:

In [None]:
print(str(b'\xe6\x97\xa9', 'utf-8'))

Some other characters such as accented Latin letters (`Ě`, `à`, `é`, and so on) occupy two bytes per character.

In [None]:
"""print other characters""";

#print(bytes("#TODO", encoding="utf-8"))

**Lesson**: Strings that contain characters outside ASCII (accented or non-Latin letters, for example) have more bytes than characters.
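
As a rough illustration of this lesson, the sketch below counts the UTF-8 bytes used by a few sample characters. The characters are written as Unicode escapes so that the cell does not depend on your keyboard or fonts; the four-byte emoji is our own addition and is not discussed elsewhere in this notebook.

```python
# Sample characters: ASCII letter, accented Latin letter,
# a Chinese character, and an emoji.
samples = ["A", "\u00e9", "\u65e9", "\U0001F600"]

# Map each character to the number of bytes in its UTF-8 encoding.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, nbytes in sizes.items():
    print(repr(ch), "->", nbytes, "byte(s)")

# Every sample is a single character, but the byte counts differ.
print([len(ch) for ch in samples])    # [1, 1, 1, 1]
print(list(sizes.values()))           # [1, 2, 3, 4]
```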

<a id="sec-Hex-Representation"></a>
## 4. Hexadecimal Representation

Strictly speaking, a hexadecimal representation of binary data is not a separate encoding from bytes.
Instead of printing some bytes as literal characters and the rest in hexadecimal numbers, everything is simply printed as hexadecimal digits.
For example:

In [None]:
B.hex()

The hex string `48656c6c6f20776f726c64` represents a sequence of bytes with hex values `48`, `65`, `6c`, `6c`, `6f`, and so on. Here, `48` hex = 16&times;4 + 8 = 72 decimal, the character code for the letter `H`. Similarly, the subsequent hex values map to the subsequent characters (`e`, `l`, `l`, and so on). You can check the [ASCII table](https://www.ascii-code.com/).

A hex string representation is exactly twice as long as the byte representation:

In [None]:
"""Check the length of hex string""";

#len(#TODO)

In [None]:
"""Print the length of the hex string of `B_ZH` defined earlier.""";
#TODO

In [None]:
#
# More experiments with bytes and encoding
#
B2 = bytes('Good morning', 'utf-8').hex()
print(B2)

Conversely, a `bytes` object can be created from a hex string or from an array of integers:

In [None]:
S2 = bytes.fromhex("476f6f64206d6f726e696e67")
print(S2)

In [None]:
S3 = bytes([0x47, 0x6f, 0x6f, 0x64, 0x20,
            0x6d, 0x6f, 0x72, 0x6e, 0x69, 0x6e, 0x67])
print(S3)
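
The conversions also work in the opposite direction. As a minimal sketch (using our own variable names), a `bytes` object can be turned back into a list of integers or into a hex string:

```python
data = bytes.fromhex("476f6f64206d6f726e696e67")
print(data)                     # b'Good morning'

# bytes -> list of integers (one integer per byte)
as_ints = list(data)
print(as_ints[:4])              # [71, 111, 111, 100]

# list of integers -> bytes again
print(bytes(as_ints) == data)   # True

# bytes -> hex string; two hex digits per byte
as_hex = data.hex()
print(len(data), len(as_hex))   # 12 24
```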

<a id="sec-Encode-Decode-Ints"></a>
## 5. Encoding and Decoding Integers

Now let us turn to the integer data type.
Let us define two routines to help us with the mundane tasks of encoding and decoding integers:

**Function: `encode_int`**

This function converts an integer of arbitrary magnitude to a `bytes` object.
In the crypto convention, the most significant byte comes first in the array.
If the conversion results in fewer than `minlength` bytes, the result is padded from the left with zero bytes (i.e., ASCII NULL characters).
By default, we create a 16-byte string because (1) the AES routines process 16-byte blocks of data and (2) we will be using AES with 128-bit keys.

In [None]:
def encode_int(C, minlength=16):
    """Encodes an arbitrarily long non-negative integer into a `bytes` object.
    The minimum length is by default 16 bytes (128 bits)."""
    if C < 0:
        raise ValueError("encode_int supports only non-negative integers")
    # hex() adds a `0x` prefix, which must be stripped off
    C_hex = hex(C)[2:]
    # bytes.fromhex() needs an even number of hex digits
    if len(C_hex) % 2:
        C_hex = '0' + C_hex
    C_bytes = bytes.fromhex(C_hex)
    if len(C_bytes) < minlength:
        # pad the left side with NULLs
        C_bytes = C_bytes.rjust(minlength, b'\x00')
    return C_bytes

Here is an example to encode an integer to bytes:

In [None]:
A = 0x2A3749
print("A in decimal =", A)  # in decimal
A_bytes = encode_int(A)
A_bytes

In [None]:
A_bytes.hex()

> An integer object has a `to_bytes` method that does almost the same thing:

In [None]:
A.to_bytes(16, 'big')

> The `big` argument refers to the exact data layout known as [*endianness*](https://en.wikipedia.org/wiki/Endianness). We do not need to delve into that issue here, except that we are working with the *big endian* layout.

The `encode_int` function is useful if
(1) the integer may be too large for a fixed length (`to_bytes` raises `OverflowError` when the value does not fit); and/or
(2) you do not want the NULL padding (set `minlength=0`).
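
If you prefer to stay with `to_bytes`, one possible approach is to compute the required length first. The sketch below assumes a hypothetical helper named `min_byte_length`, which is not part of this notebook or of Python itself; it relies only on the built-in `int.bit_length()` method.

```python
def min_byte_length(n):
    """Hypothetical helper: fewest bytes that can hold non-negative n."""
    return max(1, (n.bit_length() + 7) // 8)

A = 0x2A3749
compact = A.to_bytes(min_byte_length(A), "big")
print(compact)                  # b'*7I'  (three bytes: 0x2A, 0x37, 0x49)

# A value that needs more than 16 bytes does not fit a fixed 16-byte length.
big = 2 ** 200
try:
    big.to_bytes(16, "big")
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)               # True

# With the computed length, the conversion succeeds.
print(len(big.to_bytes(min_byte_length(big), "big")))   # 26
```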

**Function: `decode_int`**

This function converts a `bytes` object into a long integer. This is the converse of the `encode_int` function.

In [None]:
def decode_int(B):
    """Decodes a `bytes` object into a long integer.
    This is the converse of the `encode_int` function."""
    return int(B.hex(), 16)

In [None]:
"""Here is an example to reconstitute `A`"""

decode_int(A_bytes)

In [None]:
"""Verify if the operation results in the same value:"""

decode_int(A_bytes) == A
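
Python also offers a built-in counterpart, `int.from_bytes`, which does nearly the same job as `decode_int`. The sketch below is self-contained and recreates `A_bytes` with `to_bytes` rather than relying on the cells above.

```python
A = 0x2A3749
A_bytes = A.to_bytes(16, "big")

# Big-endian decoding recovers the original integer.
A_big = int.from_bytes(A_bytes, "big")
print(A_big == A)        # True

# Reading the same bytes as little-endian gives a different number,
# which is why the byte order must match on both sides.
A_little = int.from_bytes(A_bytes, "little")
print(A_little == A)     # False

# An empty bytes object decodes to zero.
print(int.from_bytes(b"", "big"))   # 0
```

One difference worth knowing: `int(B.hex(), 16)` raises a `ValueError` for an empty `bytes` object, whereas `int.from_bytes` returns 0.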

Once we successfully encode a text string or an integer in terms of bytes, then we can proceed with encryption.

### Strings as Long Integers?

Earlier we obtained the hex representation of a byte string:

In [None]:
B

In [None]:
B.hex()

This shows that we can also represent a byte string as a very long integer:

In [None]:
"""Converts the hex representation of B to an integer"""
# 16 is the base of the number (base-16)

B_int = int(B.hex(), 16)
B_int

**CONCLUSION**: the four objects below represent the same data:

* a Unicode string: `'Hello world'`
* a byte string: `b'Hello world'`
* a hex string: `48656c6c6f20776f726c64`  (optionally prefixed with `0x`)
* a long integer: `87521618088882671231069284`

Now that we know how to convert between the different representations, we can convert a string message to a form suitable for encryption.
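
The chain of representations can be walked in both directions. The following sketch is self-contained; the variable names are ours, and the way back from the integer uses `int.to_bytes` with a computed length as one possible approach.

```python
message = "Hello world"

# Forward: string -> bytes -> hex string -> long integer
as_bytes = message.encode("utf-8")
as_hex = as_bytes.hex()
as_int = int(as_hex, 16)
print(as_hex)    # 48656c6c6f20776f726c64
print(as_int)    # 87521618088882671231069284

# Backward: long integer -> bytes -> string
nbytes = (as_int.bit_length() + 7) // 8    # bytes needed to hold the integer
back_bytes = as_int.to_bytes(nbytes, "big")
back = back_bytes.decode("utf-8")
print(back == message)    # True
```

The backward path works here because the first byte of the message is not zero; a message that starts with NULL bytes would lose them when represented as an integer.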

<a id="sec-Pad-Unpad"></a>
## 6. Padding and Unpadding

AES is a *block cipher*:
An AES cipher operates on a 16-byte block of data at once,
and it expects an input that is a multiple of 16 bytes.
If the input size does not conform to this requirement, we must pad the input data so that its size is a multiple of 16 bytes.

*Padding* addresses concerns such as:
"What if my message is shorter than 16 characters?"
"What if my message is 30 bytes long?"

In this hands-on example, we simply pad a non-conforming message with enough zeros (NULLs) from the left, so that the length of the `bytes` object becomes 16, 32, 48, and so on.
To unpad a data block, we will simply strip all the leading NULLs.
(We can do this, because our valid messages will not contain any NULL character.
For paddings used in real-world applications, please see the next notebook.)
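
The amount of padding is simple modular arithmetic. As a sketch, assuming the 16-byte block size stated above, the number of pad bytes for a few example message lengths (chosen by us) can be computed like this:

```python
BLOCK = 16   # AES block size in bytes

results = []
for n in (12, 16, 30, 45):
    pad = (-n) % BLOCK            # bytes needed to reach the next multiple of 16
    results.append((n, pad, n + pad))
    print("length", n, "-> add", pad, "-> padded length", n + pad)

# length 12 -> add 4 -> padded length 16
# length 16 -> add 0 -> padded length 16
# length 30 -> add 2 -> padded length 32
# length 45 -> add 3 -> padded length 48
```

The expression `(-n) % BLOCK` yields 0 when the length is already a multiple of 16, so no special case is needed for conforming messages.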

**Function: `leftpad16`**

This function pads the `bytes` object from the left with just enough zero bytes so that the length of the result is a multiple of 16, as the AES block cipher requires.

In [None]:
def leftpad16(B):
    """Pad a bytes array from the left with NULL chars so that
    the length is a multiple of 16 bytes."""
    # number of bytes beyond the last complete 16-byte block
    remainder = len(B) % 16
    if remainder > 0:
        return (b'\x00' * (16 - remainder)) + B
    else:
        return B

In [None]:
"""Here is an example to do leftpadding""";

msg = b'ODU is great'
msg_pad = leftpad16(msg)
msg_pad

In [None]:
"""Try padding msg2""";

msg2 = b'We need padding because AES is a block cipher'
#msg2_pad = #TODO

In [None]:
"""Print the length of bytes""";

# print("Length of original msg:", #TODO)
# print("Length of padded msg:", #TODO)
# print("Length of original msg2:", #TODO)
# print("Length of padded msg2:", #TODO)

In [None]:
"""Removal of the leading NULLs can be done using the lstrip method:"""

msg_pad.lstrip(b'\x00')

In [None]:
msg2_pad.lstrip(b'\x00')
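
Stripping NULLs is only safe under the assumption stated earlier, namely that valid messages do not contain NULL characters. The sketch below shows what can go wrong otherwise; `leftpad16_sketch` is our own compact restatement of the padding idea, not the notebook's `leftpad16` function.

```python
def leftpad16_sketch(data):
    """Illustrative restatement of left-padding with NULL bytes."""
    return b"\x00" * ((-len(data)) % 16) + data

# A message that legitimately begins with a NULL byte.
original = b"\x00\x01secret"
padded = leftpad16_sketch(original)
print(len(original), len(padded))   # 8 16

# Unpadding strips the message's own leading NULL as well.
recovered = padded.lstrip(b"\x00")
print(recovered)                    # b'\x01secret'
print(recovered == original)        # False
```

This ambiguity is one reason real-world padding schemes record how much padding was added instead of relying on stripping.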

### You Did It!

This is the end of the first notebook!
If you have reached this point, you can continue to the next notebook by opening `Crypt-session-2.ipynb` in your Jupyter session.