# Information Theory Lab 05: Source Coding, Part I: Encoding --- Episode 2

## About

This file is designed to be viewed and run online in a browser.

This file is a Jupyter Notebook using `xeus-cling`, a Jupyter kernel for C++ based on the `cling` C++ interpreter and on `xeus`, the native implementation of the Jupyter protocol.

- GitHub repository: https://github.com/jupyter-xeus/xeus-cling/
- Online documentation: https://xeus-cling.readthedocs.io/ 

<!-- <img src="images/xeus-cling.png" alt="xeus-cling logo" style="width: 100px;"/> -->

## Usage

To run a selected code cell:

- Ctrl  + Enter = Run cell and remain at current cell
- Shift + Enter = Run cell and advance to next cell



## 1. Objective

Understand source coding by implementing a basic encoding application.


## 2. Encoding data

First, let's define in one place all the useful macros developed last week.

In [1]:
#define READ_BIT(x,i)       (int)(((x) & (1U << (i))) != 0)                                 /* read bit i from x */
#define SET_BIT(x,i)        ((x) = (x) | (1U << (i)))                                       /* set bit i from x to 1 */
#define CLEAR_BIT(x,i)      ((x) = (x) & ~(1U << (i)))                                      /* clear bit i from x to 0 */
#define WRITE_BIT(x,i,val)  ((val) ? SET_BIT((x),(i)) : CLEAR_BIT((x),(i)))                 /* write 'val' in bit i from x */
#define TOGGLE_BIT(x,i)     ((x) = (x) ^ (1U << (i)))                                       /* toggle bit i from x */
#define VECREAD_BIT(v,i)       (READ_BIT((v[(i)/8]),(i)%8))                                 /* read bit i from byte vector v */
#define VECWRITE_BIT(v,i,val)  (WRITE_BIT((v[(i)/8]),((i)%8),val))                          /* write 'val' in bit i from byte vector v */
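
To make the bit order of these macros concrete, here is a small sketch that writes a few bits into a byte vector and shows where they land. The macro definitions are repeated from the cell above so that the block compiles on its own; the helper `demo_write_bits()` is a hypothetical name introduced only for this illustration.

```c
#include <stdio.h>

/* Same definitions as in the cell above, repeated so the sketch is self-contained */
#define READ_BIT(x,i)       (int)(((x) & (1U << (i))) != 0)
#define SET_BIT(x,i)        ((x) = (x) | (1U << (i)))
#define CLEAR_BIT(x,i)      ((x) = (x) & ~(1U << (i)))
#define WRITE_BIT(x,i,val)  ((val) ? SET_BIT((x),(i)) : CLEAR_BIT((x),(i)))
#define VECREAD_BIT(v,i)       (READ_BIT((v[(i)/8]),(i)%8))
#define VECWRITE_BIT(v,i,val)  (WRITE_BIT((v[(i)/8]),((i)%8),val))

/* Hypothetical helper: writes nbits bits, given as the characters '0'/'1'
   in the string bits, into byte vector v starting at bit position start.
   Returns the position right after the last bit written. */
int demo_write_bits(unsigned char *v, int start, const char *bits, int nbits)
{
    int pos = start;
    for (int k = 0; k < nbits; k++)
    {
        VECWRITE_BIT(v, pos, bits[k] == '1');   /* bit pos lives in byte pos/8, at index pos%8 */
        pos++;
    }
    return pos;
}
```

Bit `i` of the vector is stored in byte `i/8`, and bit 0 of each byte is the least significant one. For instance, writing the eight bits `10111111` starting at position 0 of a zeroed vector leaves the first byte equal to `0xFD`, not `0xBF`.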

### Reading codewords table

In these labs, a codeword is defined using the following structure data type:
```
typedef struct 
{
    int len;                /* length of code, in bits */
    unsigned int code;      /* the first "len" bits are the codeword */
} CODE32BIT;
```

All the codewords are available as a vector in the file `codero.dat`. We can read such a vector with `fread()`, as follows:

In [2]:
// Define the structure type
typedef struct 
{
    int len;                /* length of code, in bits */
    unsigned int code;     /* the first "len" bits are the codeword */
} CODE32BIT;

// Define a vector of 256 codewords, one for each ASCII character
CODE32BIT cb[256];

// Open the file
FILE* f = fopen("codero.dat", "rb");

// Read the vector from the file
fread(&cb[0], sizeof(CODE32BIT), 256, f);   // Read from f, 256 elements, each of size "sizeof(CODE32BIT)" bytes, and place them in cb

fclose(f);
//sizeof(unsigned int)

0
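
The cell above does not check whether the file opened or whether all 256 entries were read. A minimal sketch of a checked version follows, written as a standalone C fragment. The function name `load_codebook()` is hypothetical, and the structure is repeated here only to keep the block self-contained.

```c
#include <stdio.h>

typedef struct
{
    int len;                /* length of code, in bits */
    unsigned int code;      /* the first "len" bits are the codeword */
} CODE32BIT;

/* Hypothetical helper: reads count codewords from the file at path into cb.
   Returns 0 on success, -1 if the file cannot be opened or is too short. */
int load_codebook(const char *path, CODE32BIT *cb, size_t count)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;                                   /* file missing or unreadable */

    size_t got = fread(cb, sizeof(CODE32BIT), count, f);
    fclose(f);

    return (got == count) ? 0 : -1;                  /* fread() returns the number of elements read */
}
```

A call such as `load_codebook("codero.dat", cb, 256)` would then replace the `fopen()` / `fread()` / `fclose()` sequence, and a return value of -1 signals that `cb` should not be used.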

From the structure definition we see that each codeword element has two components:
- `len`: length of the codeword (number of bits)
- `code`: the actual bits of the codeword (only the first `len` bits are written)

Let's check the codeword for the letter `a`, which is `cb[97]` (the ASCII code of `'a'` is 97).

The codeword has length:

In [3]:
cb['a'].len    // instead of the code 97 we can use the char 'a' directly

3

The bits in `a`'s codeword are:

In [4]:
READ_BIT(cb['a'].code, 0)  // First bit

0

In [5]:
READ_BIT(cb['a'].code, 1)  // Second bit

0

In [6]:
READ_BIT(cb['a'].code, 2) // Third bit

1

### Print all codewords

Let's print all codewords, one codeword per line, like this:
```
a (97): 001
b (98): 01011
...
```

In [7]:
// Go through the first 128 characters (codes 0..127)

for (int i = 0; i < 128; i++)   // stops before 128 because of a bug at 128
{
    // Print the codeword for character with code i
    //
    // cb[i].len  = the length of the codeword
    // cb[i].code = contains the bits
    //
    // TODO: write below
    printf("%c: (%d)", i, i);
    for (int j = 0; j < cb[i].len; j++)
    {
        printf("%d", READ_BIT(cb[i].code, j));
    }
    printf("\n");
}


 : (0)0001100101101010001100
: (1)0001100101101010001101
: (2)0001100101101010001110
: (3)0001100101101010001111
: (4)0001100101101010010000
: (5)0001100101101010010001
: (6)0001100101101010010010
: (7)0001100101101010010011
: (8)0001100101101010010100
	: (9)1111111

: (10)11101
: (11)0001100101101010010101
: (12)0001100101101010010110
: (13)11110
: (14)0001100101101010010111
: (15)0001100101101010011000
: (16)0001100101101010011001
: (17)0001100101101010011010
: (18)0001100101101010011011
: (19)0001100101101010011100
: (20)0001100101101010011101
: (21)0001100101101010011110
: (22)0001100101101010011111
: (23)0001100101101010100000
: (24)0001100101101010100001
: (25)0001100101101010100010
: (26)0001100101101010100011
: (27)0001100101101010100100
: (28)0001100101101010100101
: (29)0001100101101010100110
: (30)0001100101101010100111
: (31)0001100101101010101000
 : (32)100
!: (33)00011000
": (34)111111000000
#: (35)00011001011011
$: (36)0001100101101010101001
%

### Encoding procedure

For encoding a sequence of letters, we simply write the codeword of every letter into an output vector.
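
The procedure can be sketched as a single function. This is only an illustration under stated assumptions: `encode_string()` is a hypothetical name, the structure is repeated so the block is self-contained, and the bits are written with plain bit operations that have the same effect as `VECWRITE_BIT()`.

```c
#include <stdio.h>
#include <string.h>

typedef struct
{
    int len;                /* length of code, in bits */
    unsigned int code;      /* the first "len" bits are the codeword */
} CODE32BIT;

/* Hypothetical helper: appends the codeword of every character of s to vec,
   starting at bit position pos. Returns the new bit position, i.e. the
   total number of bits written so far. vec must be large enough. */
int encode_string(const CODE32BIT *cb, const char *s, unsigned char *vec, int pos)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
    {
        unsigned char c = (unsigned char)s[i];       /* unsigned index into the 256-entry codebook */
        for (int j = 0; j < cb[c].len; j++)
        {
            int bit = (int)((cb[c].code >> j) & 1U); /* bit j of the codeword */
            if (bit)
                vec[pos / 8] |= (unsigned char)(1U << (pos % 8));
            else
                vec[pos / 8] &= (unsigned char)~(1U << (pos % 8));
            pos++;
        }
    }
    return pos;
}
```

With a toy codebook (not the one in `codero.dat`) where `a` is `0`, `b` is `10` and `c` is `11`, encoding `"abc"` from position 0 writes the bits `0 10 11`, returns 5, and leaves the first byte of `vec` equal to 26 (`0x1A`), because the first bit written is the least significant bit of the byte.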

#### **Exercise**: encode a sentence

Encode the sentence *"Humpty-Dumpty sat on a wall"*, and write the binary output sequence in the vector `vec`.

Questions: 
  - how many bits did we use? 
  - how many bits would be used if we encode the letters with ASCII code (8 bits / letter)? 
  - what is the compression ratio achieved with this code? 

In [10]:
const char* s = "Salut!";          // the input sequence
unsigned char vec[1000];           // output vector for holding the bits
int pos = 0;                       // position of the next bit to write in vec
// TODO: write here
for (size_t i = 0; i < strlen(s); i++)
{
    unsigned char c = (unsigned char)s[i];   // index the codebook with an unsigned value
    for (int j = 0; j < cb[c].len; j++)
    {
        printf("%d", READ_BIT(cb[c].code, j));
        VECWRITE_BIT(vec, pos, READ_BIT(cb[c].code, j));
        pos++;
    }
}


// Let's look at the vector vec
vec

101111110011011000001110000011000

{ '0xfd', 'l', 'p', '0', '0x00', '0x00', '0x00', '0x00', ... (output truncated)
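
The run above printed 33 bits for the 6-character string `"Salut!"`, so the questions from the exercise can be answered numerically for this input. The sketch below assumes the compression ratio is defined as uncompressed size divided by encoded size (some texts use the inverse); both function names are hypothetical.

```c
/* Assumption: ratio = (8 bits per character * number of characters) / encoded bits.
   A value above 1 means the encoded sequence is shorter than plain ASCII. */
double compression_ratio(long n_chars, long encoded_bits)
{
    return (8.0 * (double)n_chars) / (double)encoded_bits;
}

/* Number of whole bytes needed to hold the given number of bits,
   i.e. ceil(bits / 8) computed with integer arithmetic. */
long bytes_needed(long bits)
{
    return (bits + 7) / 8;
}
```

For `"Salut!"`, `compression_ratio(6, 33)` is 48/33, roughly 1.45, and `bytes_needed(33)` is 5, which matches the four non-zero bytes followed by one byte holding the last bit in the output above.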

#### **Exercise**: encode a file

Read a text file and encode every byte (character), writing the binary output sequence in the vector `vec`. We reuse the code from the previous labs in order to open the file and read every byte.


In [11]:
const char* filename = "textro.txt";    // the text file to encode
unsigned char vec[1000000];             // one million bytes, holds up to 1 MB of data
unsigned char c;
pos = 0;                                // restart writing at the first bit of vec

// Open file, for reading ("rb")
FILE* f = fopen(filename, "rb");

// (TODO: check if it actually opened)

// Read every character, stop when fread() returns 0
while (fread(&c, 1, 1, f))
{
    // Next character is in c, write its codeword in vector vec
    //
    // TODO: write here
    for (int j = 0; j < cb[c].len; j++)
    {
        printf("%d", READ_BIT(cb[c].code, j));
        VECWRITE_BIT(vec, pos, READ_BIT(cb[c].code, j));
        pos++;
    }
}
// We're done with the file
fclose(f);

// Let's look at the vector vec (it might take a while)
//vec

1111110101111110111101100000111010101011101010000011110101101111001000001110001000011100110101001010010110101011010110100110111111100110111001110111111000001000101000001110101110110110111110110011000001010010100101010111111110000011100111001111110110011100100000110101111101100010011010111101101000011101111011110111111101011111101111000111110011011111110011011100111011111100000110011000111110010100110111001011001110011111101111011000000011111101100011101010011100001010010101111111011100100110110000101001010111111110101111110111111011110101011101100001111001000100011101000110101110011111110010000101000110010010101111110110011100111111011111010011011011111001110011111110010001000111010001011101011111110001011100011011110011000111001000111110010101111110111111011001111000111101110110100011111101010110100001100100000110010100101110000010011111000110101111111100011010111010101010101111111100110011100000101001010110110000011111101100110011100100001101011111100011010111111110001110010111101010

0

### Writing the encoded bit sequence to an output file

Finally, once we have all the bits in the vector `vec`, let's write the data into an output file.

From the encoding above, we need **the total number of bits written**. Let's call this `len`. We save the data as follows:
1. Open the output file with `fopen()`, for writing in binary mode
2. Write the integer `len`, using `fwrite()`
3. Write the vector `vec` using `fwrite()`, but only the bytes actually written (number of bytes to write = `ceil (len/8)`)
4. Close the file

In [None]:
// int len = the total number of bits written in the vector vec
int len=pos;
FILE* f = fopen("textroencoded.enc", "wb");                // open file
fwrite(&len, sizeof(len), 1, f);                           // write len
fwrite(vec, sizeof(unsigned char), ceil(len/8.0), f);        // write the encoded bitstring, only the actual written bytes ceil(len/8)
fclose(f);                                                 // close file                 
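
To check that the output file has the expected layout, it can be read back in the same order: first the integer `len`, then `ceil(len/8)` bytes. The following is a minimal sketch of such a check; `read_encoded()` is a hypothetical helper, not part of the lab code.

```c
#include <stdio.h>

/* Hypothetical helper: reads a file laid out as [int len][ceil(len/8) bytes]
   into buf, which can hold bufsize bytes.
   Returns the number of bits (len), or -1 on any error. */
int read_encoded(const char *path, unsigned char *buf, size_t bufsize)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    int len = 0;
    if (fread(&len, sizeof(len), 1, f) != 1 || len < 0)
    {
        fclose(f);
        return -1;                                   /* header missing or invalid */
    }

    size_t nbytes = ((size_t)len + 7) / 8;           /* ceil(len/8) with integer arithmetic */
    if (nbytes > bufsize || fread(buf, 1, nbytes, f) != nbytes)
    {
        fclose(f);
        return -1;                                   /* buffer too small or file truncated */
    }

    fclose(f);
    return len;
}
```

Calling `read_encoded("textroencoded.enc", buf, sizeof(buf))` on the file written above should return the same value as `pos`, and `buf` should then hold the same bytes as the used part of `vec`.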

## 3. Final Exercises


1. Put everything into a dedicated program `encode.c`, to encode every byte from a given data file.

   The program shall be called as follows: 

   `encode.exe codero.dat input.txt output.txt`
    
   The arguments are:
      - `codero.dat`: a file containing the code to be used (known as the "codebook" file)
      - `input.txt`: the file to encode
      - `output.txt`: the output (encoded) file
   
   
   The codebook file contains a vector of 256 elements of the following structure type:
   
   
   
        typedef struct 
        {
            int len;                /* length of code, in bits */
            unsigned int code;      /* the first "len" bits are the codeword */
        } CODE32BIT;
   
   
   The program performs the following steps:
   
    - Allocate an array named `out` of `unsigned char` of max size 10MB (i.e. 10000000 bytes);
    - Open and read the full vector from the codebook file;
    - Then, open the input file and read every byte in a loop. For each byte do the following:
    
        - Write the codeword for the byte, bit by bit, in the `out` vector. Use the `VECWRITE_BIT()` macro
        
        - Keep track of the number of bits written, in order to continue writing from where the previous codeword stopped.
        
        
    - Write the array `out` to the output file, as follows:
    
        - Open the output file for writing
        
        - Write first the total number of bits
        - Write afterwards the vector `out`, but only the number of bytes actually used for coding
        - *Note: when decoding the file, we will read back the data in the same order*.

2. Encode the file `textro.txt` with the provided codebook `codero.dat`. Check the size of the output file and compute the compression ratio.

3. Repeat 2. for `texten.txt` with codebook `codeen.dat`.

4. Encode a file with the codebook from the other language. Check the size of the output file and compute the compression ratio. Compare it with the ratio obtained using the codebook of the same language. Which case is better?
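
For exercise 1, two details are easy to get wrong: checking the command-line arguments, and allocating the 10 MB array. A local array of that size may exceed the default stack size on many systems, so the sketch below allocates it on the heap instead. The names `args_ok()`, `alloc_out()` and `OUT_MAX_BYTES` are hypothetical, and the rest of the program is left as the exercise.

```c
#include <stdio.h>
#include <stdlib.h>

#define OUT_MAX_BYTES 10000000   /* 10 MB, the maximum size required by the exercise */

/* Returns 1 if the program received exactly three file arguments
   (codebook, input, output), otherwise prints a usage line and returns 0. */
int args_ok(int argc, char *argv[])
{
    if (argc != 4)
    {
        fprintf(stderr, "Usage: %s codebook input output\n",
                argc > 0 ? argv[0] : "encode");
        return 0;
    }
    return 1;
}

/* Allocates the output vector on the heap, filled with zeros.
   Returns NULL if the allocation fails; the caller must free() the result. */
unsigned char *alloc_out(void)
{
    return (unsigned char *)calloc(OUT_MAX_BYTES, 1);
}
```

In `main()`, one would typically return early when `args_ok(argc, argv)` is 0 or when `alloc_out()` returns `NULL`, and call `free()` on the array after the output file has been written.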

## 4. Final questions

1. TBD