# Information Theory Lab 06: Source Coding, Part II: Decoding

## About

This file is designed to be viewed and run online in a browser.

This file is a Jupyter Notebook that uses `xeus-cling`, a Jupyter kernel for C++ based on the `cling` C++ interpreter and on `xeus`, the native implementation of the Jupyter protocol.

- GitHub repository: https://github.com/jupyter-xeus/xeus-cling/
- Online documentation: https://xeus-cling.readthedocs.io/ 

<!-- <img src="images/xeus-cling.png" alt="xeus-cling logo" style="width: 100px;"/> -->

## Usage

To run a selected code cell:

- Ctrl  + Enter = Run cell and remain at current cell
- Shift + Enter = Run cell and advance to next cell



## Objective

Understand source coding by implementing a basic decoding application.


## Decoding data with instantaneous codes

First let's define in one place all the macros needed for working with bits.

In [1]:
#define READ_BIT(x,i)       ((int)(((x) & (1U << (i))) != 0))                               /* read bit i of x */
#define SET_BIT(x,i)        ((x) = (x) | (1U << (i)))                                       /* set bit i of x to 1 */
#define CLEAR_BIT(x,i)      ((x) = (x) & ~(1U << (i)))                                      /* clear bit i of x to 0 */
#define WRITE_BIT(x,i,val)  ((val) ? SET_BIT((x),(i)) : CLEAR_BIT((x),(i)))                 /* write 'val' in bit i of x */
#define TOGGLE_BIT(x,i)     ((x) = (x) ^ (1U << (i)))                                       /* toggle bit i of x */
#define VECREAD_BIT(v,i)       (READ_BIT(((v)[(i)/8]),(i)%8))                               /* read bit i of byte vector v */
#define VECWRITE_BIT(v,i,val)  (WRITE_BIT(((v)[(i)/8]),((i)%8),(val)))                      /* write 'val' in bit i of byte vector v */
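
Note that these macros count bits starting from the least significant bit of each byte, so bit `i` of a byte vector is bit `i % 8` of byte `i / 8`. A minimal sketch of the same indexing written as plain functions follows; the names `vec_read_bit` and `vec_write_bit` are hypothetical and not part of the lab code.

```c
/* Hypothetical function equivalents of VECREAD_BIT / VECWRITE_BIT. */

/* Return bit i (0 or 1) of byte vector v; bit 0 is the LSB of v[0]. */
int vec_read_bit(const unsigned char *v, int i)
{
    return (v[i / 8] >> (i % 8)) & 1;
}

/* Set bit i of byte vector v to val (zero clears it, nonzero sets it). */
void vec_write_bit(unsigned char *v, int i, int val)
{
    if (val)
        v[i / 8] |= (unsigned char)(1U << (i % 8));
    else
        v[i / 8] &= (unsigned char)~(1U << (i % 8));
}
```

For example, with `unsigned char v[2] = {253, 0}` (253 is `11111101` in binary), `vec_read_bit(v, 0)` is 1, `vec_read_bit(v, 1)` is 0, and bits 2 to 7 are all 1.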

### Reminder: Reading codewords table

In these labs, a codeword is defined using the following structure data type:
```
typedef struct 
{
    int len;                /* length of code, in bits */
    unsigned int code;      /* the first "len" bits are the codeword */
} CODE32BIT;
```

All the codewords are available as a vector in file `codero.dat`. We can read such a vector with `fread()`, as follows:

In [2]:
// Define the structure type
typedef struct 
{
    int len;                /* length of code, in bits */
    unsigned int code;     /* the first "len" bits are the codeword */
} CODE32BIT;

// Define a vector of 256 codewords, one for each ASCII character
CODE32BIT cb[256];

// Open the file
FILE* f = fopen("codero.dat", "rb");

// Read the vector from the file
fread(&cb[0], sizeof(CODE32BIT), 256, f);   // Read from f, 256 elements, each of size "sizeof(CODE32BIT)" bytes, and place them in cb

fclose(f);
//sizeof(unsigned int)

0

From the structure definition we see that each codeword element has two components:
- `len`: length of the codeword (number of bits)
- `code`: the actual bits of the codeword (only the first `len` bits are written)
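
The `fread()` call above assumes that the file exists and holds all 256 entries. A minimal sketch of a checked read is given here, assuming the same structure layout; the helper name `load_codebook` is hypothetical.

```c
#include <stdio.h>

typedef struct
{
    int len;              /* length of code, in bits */
    unsigned int code;    /* the first "len" bits are the codeword */
} CODE32BIT;

/* Hypothetical helper: read exactly n codewords from the file at path into cb.
   Returns 0 on success, -1 if the file cannot be opened or is too short. */
int load_codebook(const char *path, CODE32BIT *cb, size_t n)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;                      /* file missing or unreadable */

    size_t got = fread(cb, sizeof(CODE32BIT), n, f);
    fclose(f);

    return (got == n) ? 0 : -1;         /* -1 if fewer elements than expected */
}
```

A call such as `load_codebook("codero.dat", cb, 256)` would then return -1 instead of leaving `cb` partially filled when the file is missing or truncated.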

Let's check the codeword for letter `a`, which is `cb[97]` (the ASCII code of 'a' is 97).

The codeword has length:

In [3]:
cb['a'].len    // instead of its code 97 we can use directly char 'a' 

3

The bits in `a`'s codeword are:

In [4]:
READ_BIT(cb['a'].code, 0)  // First bit

0

In [5]:
READ_BIT(cb['a'].code, 1)  // Second bit

0

In [6]:
READ_BIT(cb['a'].code, 2) // Third bit

1

### Reminder: Print all codewords

Let's print all codewords, one codeword per line, like this:
```
a (97): 001
b (98): 01011
...
```

Printing the character with code 128 seems to crash the environment, so we'll stop printing before character 128.

In [7]:
// Go through all characters below 128

for (int i = 0; i < 128; i++)   // stop before 128, which crashes the environment
{
    // Print the codeword for character with code i
    //
    // cb[i].len  = the length of the codeword
    // cb[i].code = contains the bits
    //
    // TODO: write below
    printf("%c (%d): ", i, i);
    for (int j = 0; j < cb[i].len; j++)
    {
        printf("%d", READ_BIT(cb[i].code, j));
    }
    printf("\n");
}

  (0): 0001100101101010001100
 (1): 0001100101101010001101
 (2): 0001100101101010001110
 (3): 0001100101101010001111
 (4): 0001100101101010010000
 (5): 0001100101101010010001
 (6): 0001100101101010010010
 (7): 0001100101101010010011
 (8): 0001100101101010010100
	 (9): 1111111

 (10): 11101
 (11): 0001100101101010010101
 (12): 0001100101101010010110
 (13): 11110
 (14): 0001100101101010010111
 (15): 0001100101101010011000
 (16): 0001100101101010011001
 (17): 0001100101101010011010
 (18): 0001100101101010011011
 (19): 0001100101101010011100
 (20): 0001100101101010011101
 (21): 0001100101101010011110
 (22): 0001100101101010011111
 (23): 0001100101101010100000
 (24): 0001100101101010100001
 (25): 0001100101101010100010
 (26): 0001100101101010100011
 (27): 0001100101101010100100
 (28): 0001100101101010100101
 (29): 0001100101101010100110
 (30): 0001100101101010100111
 (31): 0001100101101010101000
  (32): 100
! (33): 00011000
" (34): 111111000000
# (35): 0001100101
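
As the output shows, control characters are printed raw and disturb the layout. One possible variant, sketched here under the assumption that replacing non-printable characters with `.` is acceptable, builds the bit string first and prints only printable characters. The helper names are hypothetical, and whether this also avoids the problem at character 128 has not been verified.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: write the first len bits of code into out as '0'/'1'
   characters, bit 0 first, and terminate the string. out needs len + 1 bytes. */
void codeword_to_string(unsigned int code, int len, char *out)
{
    for (int j = 0; j < len; j++)
        out[j] = ((code >> j) & 1U) ? '1' : '0';
    out[len] = '\0';
}

/* Print one table line; non-printable characters are shown as '.' */
void print_codeword_line(int ch, unsigned int code, int len)
{
    char bits[33];                       /* up to 32 bits plus terminator */
    if (len < 0 || len > 32)
        return;                          /* ignore invalid lengths */
    codeword_to_string(code, len, bits);
    printf("%c (%d): %s\n", isprint(ch) ? ch : '.', ch, bits);
}
```

With `len = 3` and a `code` whose three lowest bits are 0, 0, 1 (for instance the value 4), `codeword_to_string` produces `001`, the codeword shown earlier for `a`.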

### Decoding procedure for instantaneous codes

Decoding means the following: starting from an encoded sequence of bits, figure out which messages it contains.

For instantaneous codes this is easy, thanks to the following property: **exactly one codeword perfectly matches the beginning of the binary sequence**.

We'll use the decoding procedure:
 1. Find the codeword which matches perfectly the beginning of the binary sequence. This is the first message.
 2. Advance to the remaining part of the sequence and go to step 1.

The decoded characters shall be written in a separate text file.
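
The two steps of the procedure can be sketched on a toy code before applying them to the real codebook. This is only an illustration under stated assumptions: the function names `match_codeword` and `decode_bits` are hypothetical, bits are indexed LSB-first as in the macros above, and codewords with a non-positive length are skipped.

```c
typedef struct
{
    int len;              /* length of code, in bits */
    unsigned int code;    /* bit j of code is the j-th bit of the codeword */
} CODE32BIT;

/* Step 1: try every codeword against the bits of vec starting at position vp.
   Returns the index of the matching codeword, or -1 if none matches
   within the nbits available. */
int match_codeword(const CODE32BIT *cb, int ncodes,
                   const unsigned char *vec, int vp, int nbits)
{
    for (int ch = 0; ch < ncodes; ch++)
    {
        if (cb[ch].len <= 0 || vp + cb[ch].len > nbits)
            continue;                    /* unusable, or longer than what is left */

        int match = 1;
        for (int pos = 0; pos < cb[ch].len; pos++)
        {
            int code_bit = (int)((cb[ch].code >> pos) & 1U);
            int seq_bit  = (vec[(vp + pos) / 8] >> ((vp + pos) % 8)) & 1;
            if (code_bit != seq_bit) { match = 0; break; }
        }
        if (match)
            return ch;
    }
    return -1;
}

/* Step 2: repeat step 1, advancing by the length of each matched codeword.
   Decodes nbits bits of vec into out (at most outcap symbols).
   Returns the number of decoded symbols; stops early if nothing matches. */
int decode_bits(const CODE32BIT *cb, int ncodes,
                const unsigned char *vec, int nbits,
                unsigned char *out, int outcap)
{
    int vp = 0, n = 0;
    while (vp < nbits && n < outcap)
    {
        int ch = match_codeword(cb, ncodes, vec, vp, nbits);
        if (ch < 0)
            break;                       /* corrupted data or wrong codebook */
        out[n++] = (unsigned char)ch;
        vp += cb[ch].len;
    }
    return n;
}
```

As a usage example, take a toy code with three symbols: symbol 0 is `0`, symbol 1 is `10` and symbol 2 is `11`, i.e. `CODE32BIT toy[3] = { {1, 0}, {2, 1}, {2, 3} }`. The single byte 25 holds the bits 1, 0, 0, 1, 1, 0 in positions 0 to 5, so decoding 6 bits yields the symbols 1, 0, 2, 0.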

#### **Exercise**: decode first character

Decode the first character in the binary sequence available in vector `vec`:


In [8]:
unsigned char vec[1000] = {253, '\154', '\160', '6', '\0', '\0'};        // the encoded binary sequence

// TODO: write here, print the character

// Try every character ch
for ( int ch = 0; ch<256; ch++ )
{
    // Check codeword of character ch
    int match = 1;     // assume it matches
    for (int pos = 0; pos < cb[ch].len; pos++)                        // go through the codeword
    {
        if (READ_BIT(cb[ch].code, pos) !=  VECREAD_BIT(vec, pos))     // compare each bit from codeword with the bit from the encoded sequence
        { 
            match = 0;                                                // we found a bit which doesn't match
            break;
        }
    }
    // `match` will tell us if the codeword fully matches the sequence or not
    
    if (match == 1)        // we found it! The character is ch
    {
        printf("%c", ch);
        break;             // we can stop the search (the for loop), we found our first character
    }
    // if there was no match, the for loop will try the next character
}

S

#### **Exercise**: decode the first 5 characters

Now go on and decode the first 5 characters in the same binary sequence:

In [11]:
unsigned char vec[1000] = {253, '\154', '\160'};        // the encoded binary sequence

// TODO: write here, print the first 5 characters

int vp = 0; // the current position in the sequence

// Decode 5 characters
for (int n=0; n<5; n++)
{
    // Decode the next character
    
    for ( int ch = 0; ch<256; ch++ )
    {
        // Check codeword of character ch
        int match = 1;     // assume it matches
        for (int pos = 0; pos < cb[ch].len; pos++)                        // go through the codeword
        {
            if (READ_BIT(cb[ch].code, pos) !=  VECREAD_BIT(vec, vp+pos))  // compare each bit from codeword with the bit from the encoded sequence, following vp
            { 
                match = 0;                                                // we found a bit which doesn't match
                break;
            }
        }
        // `match` will tell us if the codeword fully matches the sequence or not

        if (match == 1)        // we found it! The character is ch
        {
            printf("%c", ch);
            vp = vp + cb[ch].len;   // advance vp with the length of the codeword we just decoded
            break;                  // we can stop the search (the inner for loop), we found the next character
        }
        // if there was no match, the for loop will try the next character
    }
}

Salut

### Loading a binary sequence from a file and decode it

We shall load a binary sequence from an encoded file from last week, *"output.enc"*, and decode it.

Remember from last week that we saved two things in the file:
 - First we wrote an integer specifying how many bits we actually used
 - Afterwards we wrote the full data vector

Now we shall read the data back, in the same order:
  - First we read an integer, specifying how many bits are actually used
  - Afterwards we read the full data vector

The full procedure is as follows:
1. Open the input encoded file with `fopen()`, for reading in binary mode
2. Read an integer `len`, using `fread()`
3. Read a vector `vec` using `fread()`
4. Close the file

In [13]:
// Load a binary sequence from a file

unsigned char vec[1000000];                                // define a vector to hold the data we read
int len;                                                   // will hold the total number of bits used from the vector

FILE* f = fopen("output.enc", "rb");                       // open file
fread(&len, sizeof(int), 1, f);                            // read one integer, place it in len
fread(vec, sizeof(unsigned char), 1000000, f);             // read the encoded bitstring, up to 1000000 bytes. fread() will stop when it reaches the file end.
fclose(f);                                                 // close file                 

0
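
The cell above trusts the file completely. A more defensive read could be sketched as follows, assuming the layout described above (one `int` followed by the data bytes); the helper name `load_encoded` is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: read an encoded file laid out as one int (number of
   used bits) followed by the data bytes. Stores at most cap bytes in vec and
   the bit count in *nbits. Returns the number of data bytes read, or -1 if
   the file cannot be opened, has no header, or claims more bits than it holds. */
long load_encoded(const char *path, unsigned char *vec, size_t cap, int *nbits)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;                       /* file missing or unreadable */

    if (fread(nbits, sizeof(int), 1, f) != 1)
    {
        fclose(f);
        return -1;                       /* header missing */
    }

    size_t nbytes = fread(vec, sizeof(unsigned char), cap, f);
    fclose(f);

    if (*nbits < 0 || (size_t)*nbits > nbytes * 8)
        return -1;                       /* header inconsistent with the data */

    return (long)nbytes;
}
```

Checking the bit count against the number of bytes actually read keeps a damaged header from sending the decoder past the end of the data.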

Now let's decode the sequence using the same procedure. Stop when you have processed `len` bits from the vector (i.e. only the bits that were actually encoded). Print the decoded characters, and also save them in a vector `decoded`.

In [17]:
// Decode all the loaded sequence

char decoded[1000000];                            // place the decoded characters in here
int j=0;                                          // the current position in the decoded vector

// TODO: decode the sequence in 'vec', print the decoded characters 
//       and also save them in vector `decoded`

int vp = 0; // the current position in the sequence

// Decode until we have processed all the encoded bits (the number is `len`)
while( vp < len )
{
    // Decode the next character
    int found = 0;     // becomes 1 when some codeword matches
    
    for ( int ch = 0; ch<256; ch++ )
    {
        // Check codeword of character ch
        int match = 1;     // assume it matches
        for (int pos = 0; pos < cb[ch].len; pos++)                        // go through the codeword
        {
            if (READ_BIT(cb[ch].code, pos) !=  VECREAD_BIT(vec, vp+pos))  // compare each bit from codeword with the bit from the encoded sequence, following vp
            { 
                match = 0;                                                // we found a bit which doesn't match
                break;
            }
        }
        // `match` will tell us if the codeword fully matches the sequence or not

        if (match == 1)        // we found it! The character is ch
        {
            printf("%c", ch);   // print the decoded character
            decoded[j] = ch;     // and also save it in a vector `decoded`
            j++;
            
            vp = vp + cb[ch].len;   // advance vp with the length of the codeword we just decoded
            found = 1;
            break;                  // we can stop the search (the for loop), we found the next character
        }
        // if there was no match, the for loop will try the next character
    }

    if (found == 0)    // no codeword matches: corrupted data or wrong codebook
    {
        break;         // stop here, otherwise the while loop would never end
    }
}


The Project Gutenberg eBook, Poezii, by Mihai Eminescu


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org




Title: Poezii

Author: Mihai Eminescu

Editor: prof. Nicolae Gutan, acad. Mihai Cimpoi, Chisinau

Release Date: February 18, 2011  [eBook #35323]

Language: Romanian


Produced by: Liviu Munteanu, Luminita Catanus, Alexandra Baciu, Anamaria Petrea, Liviu Jalba

***START OF THE PROJECT GUTENBERG EBOOK POEZII***

		POEZII 

PUBLICATE IN TIMPUL VIETII


	LA MORMANTUL LUI ARON PUMNUL

Imbraca-te in doliu, frumoasa Bucovina,
Cu cipru verde-ncinge antica fruntea ta;
C-acuma din pleiada-ti auroasa si senina
Se stinse un luceafar, se stinse o lumina,
	Se stinse-o dalba stea!

Metalica, vibranda a clopotelor jale
Vuieste in cadenta si suna intristat;
Caci, ah! geniul mare al desteptarii tal

Finally, let's write the decoded characters from vector `decoded` into an output file called `decoded_ro.txt`:
  1. Open the file for writing in binary mode
  2. Write the decoded characters (the first `j` elements of vector `decoded`) to it
  3. Close the file

In [18]:
// TODO: write here

f = fopen( "decoded_ro.txt", "wb" );
fwrite( decoded , sizeof(char) , j , f);
fclose(f);

## Final Exercises


1. Put everything into a dedicated program `decode.c`, to decode an encoded file from last week.

   The program shall be called as follows: 

   `decode.exe code.dat input.enc decoded.txt`
    
   The arguments are:
      - `code.dat`: a file containing the code to be used (known as the "codebook" file)
      - `input.enc`: the file to decode, obtained with the encoding application from last week
      - `decoded.txt`: the output decoded file
   
   The program performs the following steps:
   
    1. Read the full vector from the codebook file;
    2. Read the full input encoded file, in the same order as we wrote the data last week:
        - read first an integer
        - then read everything else into an array `vec` of type `unsigned char`, of max size 1 MB (i.e. 1,000,000 bytes);
    3. Decode the characters from the `vec` array, as follows:
         
           While we haven't processed all bits
             - Check all codewords and see which one matches the next bits;
             - When you find the codeword, print that character and write it into a decoded vector
             - Advance in the `vec` array by the length of the matched codeword;

    4. Save the decoded character vector in the output file.
    
2. Decode the file `output.enc` with the provided codebook `codero.dat`. 
Open the output and check that the data is recovered correctly.
Check the size of the input and output files and compute the compression ratio.

3. Encode and decode again the English text, with the English codebook `codeen.dat`.

4. Decode a file with the codebook from the other language. What does the output look like?

5. Open an encoded file and randomly make an error (for example, delete one character). Then attempt to decode it.
How does the error affect the decoded output?
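
For exercise 2, the file sizes can also be obtained from code. The sketch below assumes the convention that the compression ratio is the original size divided by the encoded size; the helper names `file_size` and `compression_ratio` are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: size of a file in bytes, or -1 on error. */
long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0)      /* jump to the end of the file */
        size = ftell(f);                 /* the position there is the size */
    fclose(f);
    return size;
}

/* Compression ratio as original size divided by encoded size
   (assumed convention). Returns 0.0 if either size is not positive. */
double compression_ratio(long original_bytes, long encoded_bytes)
{
    if (original_bytes <= 0 || encoded_bytes <= 0)
        return 0.0;
    return (double)original_bytes / (double)encoded_bytes;
}
```

If last week's encoder wrote the whole buffer rather than only the used bytes, the size of the encoded file is not representative; in that case the encoded size is better taken as the bit count `len` divided by 8.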

## Final questions

1. TBD