# Information Theory Lab 04: Source Coding, Part I: Encoding

## About

This file is designed to be viewed and run online in a browser.

This file is a Jupyter Notebook using `xeus-cling`, a Jupyter kernel for C++ based on the `cling` C++ interpreter and on `xeus`, a native implementation of the Jupyter protocol.

- GitHub repository: https://github.com/jupyter-xeus/xeus-cling/
- Online documentation: https://xeus-cling.readthedocs.io/ 


## Usage

To run a selected code cell:

- Ctrl  + Enter = Run cell and remain at current cell
- Shift + Enter = Run cell and advance to next cell



## 1. Objective

Understand source coding by implementing a basic encoding application.


## 2. Practical considerations, Part I: Working with bits

### Read individual bits in a variable

In C we can read and modify **individual bits** in a variable by applying an AND / OR / XOR mask.

Consider a number $x$ with binary representation `11101101`. To read the value of bit number 3 (counting from 0, from right to left, LSB to MSB), we apply the following **binary mask** with a single 1 on the position of the bit we want to read:
$$
\begin{align}
1110b101&&  \\
    \&  \\
00001000&&  \\
    =   &&  \\
0000b000&&    
\end{align}
$$

The result is 0 (i.e. False) if that bit in $x$ has value 0, and is a non-zero value (i.e. True) if that bit is 1.

We can do this in C as follows:

In [1]:
int x = 78745;  
int bit = ((x & (1U << 3)) != 0);  // Read bit from position 3 (right to left)

// Display the result
bit

1

Explained:
- `x` is the number
- `1U << 3` means `00000001` (number 1, unsigned) shifted to the left by 3 positions, which produces the mask `00001000`
- we have a bitwise AND (operator `&`) between `x` and this mask `00001000`
- we obtain 0 or non-zero (but not necessarily 1); we compare it with 0 so that in the end we get either a False (0) or a True (1).

To use it multiple times, we can encode this operation with a **macro** defined as:

In [2]:
#define READ_BIT(x,i)       (int)(((x) & (1U << (i))) != 0)

// Let's use this macro:
READ_BIT(78745, 3)          // read bit 3 from number 78745

1
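
Since `READ_BIT` is a macro, its arguments are substituted textually, so arguments with side effects (such as `i++`) are best avoided. As a minimal sketch, the same test can be written as a function; the name `read_bit` is hypothetical and not part of the lab's code.

```c
// Hypothetical function equivalent of READ_BIT (not part of the lab code).
// Returns 1 if bit number i of x is set, and 0 otherwise.
// Assumes i is smaller than the number of bits in an unsigned int.
static int read_bit(unsigned int x, unsigned int i)
{
    unsigned int mask = 1U << i;   // single 1 on position i
    return (x & mask) != 0;        // compare with 0 to obtain exactly 0 or 1
}
```

For the example number `11101101` (`0xED`), `read_bit(0xED, 3)` gives 1 and `read_bit(0xED, 4)` gives 0.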

#### **Exercise**: convert number to binary
Find the binary representation of number 212 (run READ_BIT 8 times and note down the results).

In [3]:
// TODO: run READ_BIT 
READ_BIT(212, 0)

0

### Set individual bits in a variable (= make 1)

We can change a certain bit in a variable to 1 ("set the bit") by using a bitwise OR with a similar mask:
$$
\begin{align}
bbbbbbbb&&  \\
    |  \\
00010000&&  \\
    =   &&  \\
bbb1bbbb&&    
\end{align}
$$

Every (bit OR 0) leaves that bit value unchanged, while OR with 1 always gives 1.

We can package this as a macro as well. The macro changes bit number $i$ of variable $x$ to 1. Because it is an assignment expression, it also evaluates to the new value of $x$, which is why the notebook prints the result.

In [4]:
// Set to 1 the bit number i from value x
#define SET_BIT(x,i)        ((x) = (x) | (1U << (i)))

int x = 15;
SET_BIT(x, 7)   // set to 1 the bit number 7. Will print the result

143

#### **Exercise**: convert binary to base-10 number
Find the value defined in binary as `10100000`. Start from `00000000` and use SET_BIT to set the corresponding bit positions.

In [5]:
int x = 0;

// TODO: set bits

    
// Display the value
x

0

### Clear individual bits in a variable (= make 0)

We can change a certain bit in a variable to 0 ("clear the bit") by using a bitwise AND with an inverted mask:
$$
\begin{align}
bbbbbbbb&&  \\
    \&  \\
11101111&&  \\
    =   &&  \\
bbb0bbbb&&    
\end{align}
$$

Every (bit AND 1) leaves that bit value unchanged, while AND with 0 always gives 0.

We can package this as a macro as well.

In [6]:
// Clear (make 0) the bit number i from value x
#define CLEAR_BIT(x,i)      ((x) = (x) & ~(1U << (i)))     // note the binary negation operator ~ which inverts every bit of the mask

int x = 240;
CLEAR_BIT(x, 7)   // clear bit number 7. Will print the result

112

### Writing individual bits in a variable (= set them to a given value)

We can combine SET_BIT and CLEAR_BIT into a common macro WRITE_BIT which takes the desired value as an argument. The macro checks the desired value, and decides:
- if we want a 1, it calls SET_BIT 
- if we want a 0, it calls CLEAR_BIT

In [7]:
#define WRITE_BIT(x,i,val)  ((val) ? SET_BIT((x),(i)) : CLEAR_BIT((x),(i)))

int x = 150;
WRITE_BIT(x, 5, 0);    // Make bit number 5 from x equal to 0
WRITE_BIT(x, 6, 1);    // And then make bit number 6 from x equal to 1

214
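
As a hedged sketch, the same set-or-clear decision can be written as a function that returns the modified value instead of changing `x` in place. The name `write_bit` is hypothetical and not part of the lab's code.

```c
// Hypothetical function form of WRITE_BIT (not part of the lab code).
// Returns a copy of x in which bit number i equals val
// (any non-zero val counts as 1).
static unsigned int write_bit(unsigned int x, unsigned int i, int val)
{
    unsigned int mask = 1U << i;   // single 1 on position i
    return val ? (x | mask)        // OR with the mask sets the bit
               : (x & ~mask);      // AND with the inverted mask clears it
}
```

With this function, `write_bit(write_bit(150, 5, 0), 6, 1)` reproduces the value 214 obtained in the cell above.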

### Toggle individual bits in a variable (= make opposite)

We can toggle a certain bit in a variable by using a bitwise XOR with a similar mask:
$$
\begin{align}
bbbbbbbb&&  \\
    \hat{}  \\
00010000&&  \\
    =   &&  \\
bbb\overline{b}bbbb&&    
\end{align}
$$

Every (bit XOR 0) leaves that bit value unchanged, while XOR with 1 gives the opposite value $\overline{b}$ of the original $b$.

We can package this as a macro as well.

In [8]:
#define TOGGLE_BIT(x,i)     ((x) = (x) ^ (1U << (i)))

int x = 150;
TOGGLE_BIT(x, 0);    // Toggle bit number 0 from x

151

### Working with bit vectors

We'll often have to work with very long sequences of bits. We cannot use a single variable for holding all of them. Instead, we will usually define a **vector of bytes** (unsigned char). Let's define a vector with 1000 bytes, capable of holding 8000 individual bits.

In [9]:
unsigned char vector[1000];   // 1000 bytes = 8000 bits

We need a quick way to read and write bit values from such a vector.

How do we read the bit on position 14 from this vector? We figure out in which element (byte) of the vector bit number 14 falls, and which of the 8 bits of that element it is:
- $14 \; / \; 8 = 1.75$, so we know position 14 is somewhere in the second element of the vector, i.e. in `vector[1]`, which is `vector[position/8]` with integer division
- $14 \; \% \; 8 = 6$, so we know it's bit number 6 (counting from 0) inside `vector[1]`

Thus, in general, we can access bit number $i$ from the whole vector by accessing bit number `i%8` from element `vector[i/8]`. We can use READ_BIT and WRITE_BIT to read and write at this position.
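
The index arithmetic above can be made explicit in a short sketch. The helper names `byte_index`, `bit_offset` and `vec_read_bit` are hypothetical and only illustrate the computation.

```c
#include <stddef.h>

// Hypothetical helpers (not part of the lab code) that spell out
// where bit number i lives inside a vector of bytes.

// Index of the byte that holds bit number i (integer division)
static size_t byte_index(size_t i)
{
    return i / 8;
}

// Position of bit number i inside that byte (0 = LSB, 7 = MSB)
static unsigned int bit_offset(size_t i)
{
    return (unsigned int)(i % 8);
}

// Read bit number i from the byte vector v: shift the wanted bit
// down to position 0 and keep only that bit
static int vec_read_bit(const unsigned char *v, size_t i)
{
    return (v[byte_index(i)] >> bit_offset(i)) & 1;
}
```

For position 14, `byte_index(14)` is 1 and `bit_offset(14)` is 6, matching the worked example.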

We define two macros VECREAD_BIT and VECWRITE_BIT to read and write bit values **in a whole vector**, as follows:

In [10]:
// Read bit number i from vector v
#define VECREAD_BIT(v,i)       (READ_BIT(((v)[(i)/8]),((i)%8)))

// Write value val in bit number i from vector v
#define VECWRITE_BIT(v,i,val)  (WRITE_BIT(((v)[(i)/8]),((i)%8),(val)))

#### **Exercise**: read / write bits in a vector

Consider a vector with 20 bytes (160 bits). Make the first 100 bit values all equal to 1, and the remaining 60 bits all equal to 0.

In [12]:
unsigned char vec[20];

// TODO: write here
for (int i = 0; i < 100; i++)
    VECWRITE_BIT(vec, i, 1);

for (int i = 100; i < 160; i++)
    VECWRITE_BIT(vec, i, 0);



//-------------------------
// Check: display vec[11], vec[12], vec[13]
// Should be 255, 15 and 0
printf("vec[11] = %d \nvec[12] = %d \nvec[13] = %d", vec[11], vec[12], vec[13]);

vec[11] = 255 
vec[12] = 15 
vec[13] = 0

## 2. Practical considerations, Part II: Encoding

Encoding a certain sequence of messages with a code means replacing each message $s_i$ with its codeword $c_i$.

In this lab, we will replace **each character** with a specially crafted code. We will build this code in Lab 6.

### Codewords

In these labs, a codeword is defined using the following structure data type:
```
typedef struct 
{
    int len;                /* length of code, in bits */
    unsigned int code;      /* the first "len" bits are the codeword */
} CODE32BIT;
```

All the codewords are available as a vector in file `codero.dat`. We can read such a vector with `fread()`, as follows:

In [16]:
// Define the structure type
typedef struct 
{
    int len;               /* length of code, in bits */
    unsigned int code;     /* the first "len" bits are the codeword */
} CODE32BIT;

// Define a vector of 256 codewords, one for each ASCII character
CODE32BIT cb[256];

// Open the file
FILE* f = fopen("codero.dat", "rb");

// Read the vector from the file
fread(&cb[0], sizeof(CODE32BIT), 256, f);   // Read from f, 256 elements, each of size "sizeof(CODE32BIT)" bytes, and place them in cb

fclose(f);
//sizeof(unsigned int)

0

From the structure definition we see that each codeword element has two components:
- `len`: length of the codeword (number of bits)
- `code`: the actual bits of the codeword (only the first `len` bits are written)

Let's check the codeword for letter `a`, which is `cb[97]` (the ASCII code of `'a'` is 97).

The codeword has length:

In [18]:
cb['a'].len    // instead of its code 97 we can use directly char 'a' 

3

The bits in `a`'s codeword are:

In [19]:
READ_BIT(cb['a'].code, 0)  // First bit

0

In [20]:
READ_BIT(cb['a'].code, 1)  // Second bit

0

In [21]:
READ_BIT(cb['a'].code, 2) // Third bit

1
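
Rather than reading one bit per cell, a whole codeword can be turned into a string of `0` and `1` characters. The following is a minimal sketch: the helper `codeword_to_string` is hypothetical, and the structure type is repeated only so that the block compiles on its own.

```c
// Same layout as the CODE32BIT type used in this lab,
// repeated here so the sketch is self-contained
typedef struct
{
    int len;               /* length of code, in bits */
    unsigned int code;     /* the first "len" bits are the codeword */
} CODE32BIT;

// Hypothetical helper: write the first cw.len bits of a codeword into buf
// as '0' / '1' characters, in reading order (bit 0 first).
// Assumes 0 <= cw.len <= 32 and that buf has room for cw.len + 1 characters.
static void codeword_to_string(CODE32BIT cw, char *buf)
{
    int i;
    for (i = 0; i < cw.len; i++)
        buf[i] = ((cw.code >> i) & 1U) ? '1' : '0';
    buf[cw.len] = '\0';    // terminate the string
}
```

A codeword with `len = 3` whose bits 0, 1 and 2 are 0, 0 and 1 (as read above for `a`) would give the string `001`.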

#### **Exercise**: compare codeword lengths

Check the codeword lengths for `a`, `b`, and `x`. Can you think of a reason why they would be different? 

In [33]:
// TODO: write here
cb['z'].len
//READ_BIT(cb['b'].code, 1)

8

#### **Exercise**: print one codeword

Write all the bits of the codeword of `z` into the vector `vec`, one by one, using a for loop:

In [34]:
unsigned char vec[20];  // vector of 20 bytes, has space for 160 bits

// TODO: write below
for (int i=0; i < cb['z'].len; i++)
{
    int bit = READ_BIT(cb['z'].code, i);  // Read every bit from codeword of 'z'
    VECWRITE_BIT(vec, i, bit);            //  and save it into vector 'vec'
}   
    
// Display vector
vec

{ '}', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00' }

### Encoding procedure

For encoding a sequence of letters, we simply write the codeword of every letter into an output vector.
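
The key detail is keeping a running bit position, so that each codeword starts where the previous one ended. A minimal sketch of that single step is shown below. The helper `append_codeword` is hypothetical and writes the bits with explicit masks instead of the lab's `VECWRITE_BIT` macro, so that the block stands on its own.

```c
#include <stddef.h>

// Same layout as the CODE32BIT type used in this lab
typedef struct
{
    int len;               /* length of code, in bits */
    unsigned int code;     /* the first "len" bits are the codeword */
} CODE32BIT;

// Hypothetical helper: append the cw.len bits of codeword cw to the bit
// vector out, starting at bit position pos. Returns the position just
// after the last bit written. The caller must make sure out is large enough.
static size_t append_codeword(unsigned char *out, size_t pos, CODE32BIT cw)
{
    int i;
    for (i = 0; i < cw.len; i++)
    {
        unsigned char mask = (unsigned char)(1U << (pos % 8));
        if ((cw.code >> i) & 1U)
            out[pos / 8] |= mask;                   // write a 1
        else
            out[pos / 8] &= (unsigned char)~mask;   // write a 0
        pos++;                                      // next free bit
    }
    return pos;
}
```

Calling the helper once per input character, each time passing the position returned by the previous call, leaves the total number of bits written in the last returned value.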

#### **Exercise**: encode a sentence

Encode the sentence *"Humpty-Dumpty sat on a wall"*, and write the binary output sequence in the vector `vec`.

Questions: 
  - how many bits did we use? 
  - how many bits would be used if we encode the letters with ASCII code (8 bits / letter)? 
  - what is the compression ratio achieved with this code? 
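
For the last question, one possible way to compute the ratio is sketched below. It assumes the compression ratio is defined as the uncompressed size divided by the compressed size, with 8 bits per character for the uncompressed text. The helper name `compression_ratio` is hypothetical.

```c
// Hypothetical helper. Assumption: compression ratio =
// (uncompressed size in bits) / (encoded size in bits),
// where the uncompressed text uses 8 bits per character.
// The caller must pass a non-zero n_bits_encoded.
static double compression_ratio(unsigned long n_chars, unsigned long n_bits_encoded)
{
    double uncompressed_bits = 8.0 * (double)n_chars;
    return uncompressed_bits / (double)n_bits_encoded;
}
```

As an illustration with made-up numbers, 100 characters encoded into 500 bits would give 800 / 500 = 1.6.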

In [None]:
const char* s = "Humpty-Dumpty sat on a wall";          // the input sequence
unsigned char vec[1000];                                // output vector for holding the bits

// TODO: write here



#### **Exercise**: encode a file

Read a text file and encode every byte (character), writing the binary output sequence in the vector `vec`. Reuse the code from the previous labs in order to open the file and read every byte.


In [None]:
const char* filename = "textro.txt";    // the text file to encode
unsigned char vec[1000000];                 // one million bytes, holds up to 1 MB of encoded data

// TODO: write here


### Writing the encoded bit sequence to an output file

Finally, once we have all the bits in the vector `vec`, let's write the data into an output file.

From the encoding above, we need **the total number of bits written**. Let's call this `len`. We save the data as follows:
1. Open the output file with `fopen()`, for writing in binary mode
2. Write the integer `len`, using `fwrite()`
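3. Write the vector `vec` using `fwrite()`, but only the bytes actually written. The number of bytes to write is $\lceil len/8 \rceil$, which in integer arithmetic is `(len + 7) / 8` (the expression `ceil(len/8)` would not round up, because `len/8` is already truncated by integer division)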
4. Close the file

In [None]:
//int len = 5;

FILE* f = fopen("textroencoded.enc", "wb");                      // open file
fwrite(&len, sizeof(len), 1, f);                           // write len
fwrite(vec, sizeof(unsigned char), (len + 7) / 8, f);      // write the encoded bit string, only the bytes actually used (len/8 rounded up)
fclose(f);                                                 // close file                 
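
The same four steps can be wrapped in a function that also checks for errors. This is a sketch under stated assumptions: the helper name `write_encoded` is hypothetical, and `len` is assumed to be non-negative.

```c
#include <stdio.h>

// Hypothetical helper: save the first len bits of vec into the file at path,
// preceded by the integer len. Returns 0 on success, -1 on any error.
// Assumes len >= 0.
static int write_encoded(const char *path, const unsigned char *vec, int len)
{
    size_t nbytes = ((size_t)len + 7) / 8;      // len/8 rounded up to whole bytes
    FILE *f = fopen(path, "wb");                // binary mode, for writing
    if (f == NULL)
        return -1;

    int ok = fwrite(&len, sizeof(len), 1, f) == 1     // bit count first
          && fwrite(vec, 1, nbytes, f) == nbytes;     // then only the used bytes

    if (fclose(f) != 0)                         // a failed close can lose data
        ok = 0;
    return ok ? 0 : -1;
}
```

Writing the bit count first lets a decoder know how many bits of the last byte are meaningful.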

## 3. Final Exercises


1. Put everything into a dedicated program `encode.c`, to encode every byte from a given data file.

   The program shall be called as follows: 

   `encode.exe codero.dat input.txt output.txt`
    
   The arguments are:
      - `codero.dat`: a file containing the code to be used (known as the "codebook" file)
      - `input.txt`: the file to encode
      - `output.txt`: the output (encoded) file
   
   
   The codebook file contains a vector of 256 elements of the following structure type:
   
   
   
        typedef struct 
        {
            int len;                /* length of code, in bits */
            unsigned int code;      /* the first "len" bits are the codeword */
        } CODE32BIT;
   
   
   The program performs the following steps:
   
    - Allocate an array named `out` of `unsigned char` of max size 10MB (i.e. 10000000 bytes);
    - Open and read the full vector from the codebook file;
    - Then, open the input file and read every byte in a loop. For each byte do the following:
    
        - Write the codeword for the byte, bit by bit, in the `out` vector. Use the `VECWRITE_BIT()` macro
        
        - Keep track of the number of bits written, in order to continue writing from where the previous codeword stopped.
        
        
    - Write the array `out` to the output file, as follows:
    
        - Open the output file for writing
        
        - Write first the total number of bits
        - Write afterwards the vector `out`, but only the number of bytes actually used for coding
        - *Note: when decoding the file, we will read back the data in the same order*.

2. Encode the file `textro.txt` with the provided codebook `codero.dat`. Check the size of the output file and compute the compression ratio.

3. Repeat 2. for `texten.txt` with codebook `codeen.dat`.

4. Encode a file with the codebook from the other language. Check the size of the output file and compute the compression ratio. Compare it with the ratio obtained using the same-language codebook. Which case is better?

## 4. Final questions

1. TBD