In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
import io

### 8.a) Working with bits

Text is normally stored as 8-bit ASCII codes (of which only the lowest seven bits are used to store the basic character set). For this exercise, you will create a lossy text codec that stores text using only five bits per character. In doing so, you will gain experience in reading and writing binary-coded files.

#### 1. Write a function that takes the M lowest bits from an unsigned integer and writes them starting at the Nth bit location in an array of BYTES (unsigned character variables).

In [2]:
def insert_uint_in_bytearray(i: int, m: int, n: int, byte_array: bytearray) -> bytearray:
    """Return a copy of byte_array with the m lowest bits of i written at bit n.

    Assumed convention: bits are numbered MSB-first across the whole array,
    so bit 0 is the most significant bit of byte_array[0] and bit 8 is the
    most significant bit of byte_array[1].
    """
    if m < 0 or n < 0 or n + m > 8 * len(byte_array):
        raise ValueError("bit range does not fit in byte_array")

    # treat i as a 32-bit unsigned int and keep only its m lowest bits
    value = (i & 0xffffffff) & ((1 << m) - 1)

    result = bytearray(byte_array)
    for k in range(m):
        # k-th bit to write, most significant of the m bits first
        bit = (value >> (m - 1 - k)) & 1

        # locate the target byte and the bit inside it;
        # the write may cross a byte boundary
        byte_index, bit_index = divmod(n + k, 8)
        mask = 0x80 >> bit_index

        if bit:
            result[byte_index] |= mask
        else:
            result[byte_index] &= ~mask & 0xff

    return result

In [3]:
prime_numbers = [2, 3, 5, 7]
byte_array = bytearray(prime_numbers)

In [4]:
print(byte_array)

bytearray(b'\x02\x03\x05\x07')


In [5]:
# writes the bits 01111 at bit positions 7 to 11
print(insert_uint_in_bytearray(15, 5, 7, byte_array))

bytearray(b'\x02\xf3\x05\x07')


#### 2. Define a mapping from the basic ASCII character set onto only five bits. (Obviously, you will need to sometimes map multiple ASCII characters onto the same five-bit code. For example, you will need to map both capital and small letters onto the same code.)

In [6]:
char2bits = {
    'a': '00000', 'b': '00001', 'c': '00010',
    'd': '00011', 'e': '00100', 'f': '00101',
    'g': '00110', 'h': '00111', 'i': '01000',
    'j': '01001', 'k': '01010', 'l': '01011',
    'm': '01100', 'n': '01101', 'o': '01110',
    'p': '01111', 'q': '10000', 'r': '10001',
    's': '10010', 't': '10011', 'u': '10100',
    'v': '10101', 'w': '10110', 'x': '10111',
    'y': '11000', 'z': '11001', '1': '01011',
    '2': '11010', '3': '00100', '4': '00000',
    '5': '10010', '6': '00001', '7': '11011',
    '8': '11100', '9': '11101', '0': '01110',
}

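# several characters share a code (e.g. 'l' and '1'); keep the first one
# listed so that a shared code decodes to the letter, not the digit
bits2char = {}
for char, bits in char2bits.items():
    bits2char.setdefault(bits, char)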

#### 3. Write a text encoder/decoder that allows you to read in ASCII text files (e.g., .txt files from Notepad), map the text into five-bit codes, pack the coded text into arrays of BYTES, write the packed arrays into a coded file, read in your coded files, decode your coded files back into ASCII codes, and write out your decoded text file.
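
One possible shape for such a codec is sketched below. It is not a finished solution and makes several assumptions that are not part of the exercise text: letters use codes 0 to 25 (the same letter codes as `char2bits`), code 30 stands for a space and for any character without a code, and each coded file starts with a 4-byte big-endian character count so that the padding bits in the last byte are not decoded as extra characters. The names `encode_text`, `decode_text`, `encode_file`, and `decode_file` are hypothetical.

```python
import struct

# Assumptions (hypothetical, see the text above): letters map to codes 0-25,
# code 30 is a space and the fallback for every unmapped character.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
SPACE_CODE = 30
CHAR_TO_CODE = {c: k for k, c in enumerate(ALPHABET)}
CODE_TO_CHAR = {k: c for c, k in CHAR_TO_CODE.items()}
CODE_TO_CHAR[SPACE_CODE] = " "


def pack_codes(codes):
    """Pack 5-bit codes MSB-first into bytes, zero-padding the last byte."""
    out = bytearray()
    acc, n_acc = 0, 0                      # bit accumulator and its bit count
    for code in codes:
        acc = (acc << 5) | (code & 0b11111)
        n_acc += 5
        while n_acc >= 8:                  # emit every complete byte
            n_acc -= 8
            out.append((acc >> n_acc) & 0xFF)
        acc &= (1 << n_acc) - 1            # keep only the leftover bits
    if n_acc:
        out.append((acc << (8 - n_acc)) & 0xFF)
    return bytes(out)


def unpack_codes(data, count):
    """Read `count` 5-bit codes back from packed bytes."""
    codes = []
    acc, n_acc = 0, 0
    for byte in data:
        acc = (acc << 8) | byte
        n_acc += 8
        while n_acc >= 5 and len(codes) < count:
            n_acc -= 5
            codes.append((acc >> n_acc) & 0b11111)
        acc &= (1 << n_acc) - 1
    return codes


def encode_text(text):
    """Lossy encode: lowercase everything, unmapped characters become spaces."""
    codes = [CHAR_TO_CODE.get(c, SPACE_CODE) for c in text.lower()]
    return struct.pack(">I", len(codes)) + pack_codes(codes)


def decode_text(data):
    (count,) = struct.unpack(">I", data[:4])
    codes = unpack_codes(data[4:], count)
    return "".join(CODE_TO_CHAR.get(code, "?") for code in codes)


def encode_file(src_path, dst_path):
    with open(src_path, "r", encoding="ascii", errors="replace") as f:
        text = f.read()
    with open(dst_path, "wb") as f:
        f.write(encode_text(text))


def decode_file(src_path, dst_path):
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "w", encoding="ascii") as f:
        f.write(decode_text(data))
```

With this sketch, `decode_text(encode_text("Hello World"))` gives back `"hello world"`: case is lost, and so are digits, punctuation, and line breaks, which all collapse to spaces.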

#### 4. Test your five-bit text codec on several sample text files. Check file sizes to see what compression ratio you achieved. How readable is your decoded file?
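
As a reference point for the measured file sizes, the ideal ratio can be worked out from the bit widths alone: five bits replace eight, so the ratio approaches 8/5 = 1.6 for long files. The sketch below computes the expected coded size and compares real files with `os.path.getsize`. The function names are hypothetical, and `header_bytes` is an assumption covering any extra bytes a particular codec writes (for example a length field).

```python
import os

BITS_PER_CODE = 5


def expected_coded_size(n_chars, header_bytes=0):
    """Bytes needed for n_chars five-bit codes, rounded up to whole bytes."""
    return header_bytes + (n_chars * BITS_PER_CODE + 7) // 8


def compression_ratio(original_bytes, coded_bytes):
    """Original size divided by coded size; larger means better compression."""
    return original_bytes / coded_bytes


def measured_ratio(original_path, coded_path):
    """Ratio computed from the sizes of two files on disk."""
    return compression_ratio(os.path.getsize(original_path),
                             os.path.getsize(coded_path))


n = 1000  # characters in a plain ASCII file, one byte each
print(expected_coded_size(n))                        # 625
print(compression_ratio(n, expected_coded_size(n)))  # 1.6
```

For very short files the fixed overhead dominates: an 11-character text with a 4-byte header needs 7 + 4 = 11 bytes, so nothing is saved.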

#### 5. Change your mapping to use two of your codes as control characters. Let one code signify that the next character is capitalized. Let the other code signify that the next character comes from a different set of mappings onto five-bit codes (i.e. include some important characters that weren't included in the basic mapping). How does this change impact compression ratio? How does this change impact readability?
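
A minimal sketch of the two control codes follows, working on lists of five-bit codes before any bit packing. Everything about the tables is an assumption made for illustration and differs from `char2bits` above: the primary table holds the 26 letters plus space, period, comma, and newline (codes 0 to 29), code 30 (`CAPS`) capitalizes the next character, and code 31 (`SHIFT`) takes the next character from a secondary table of digits and punctuation.

```python
import string

# Hypothetical tables, not the notebook's char2bits mapping.
CAPS, SHIFT = 30, 31                          # the two control codes
PRIMARY = string.ascii_lowercase + " .,\n"    # codes 0-29
SECONDARY = string.digits + "!?'\"-:;()"      # codes 0-18, valid after SHIFT


def to_codes(text):
    """Map text to a list of 5-bit codes, inserting control codes as needed."""
    codes = []
    for ch in text:
        if ch.isupper() and ch.lower() in PRIMARY:
            codes += [CAPS, PRIMARY.index(ch.lower())]
        elif ch in PRIMARY:
            codes.append(PRIMARY.index(ch))
        elif ch in SECONDARY:
            codes += [SHIFT, SECONDARY.index(ch)]
        else:
            codes.append(PRIMARY.index(" "))  # lossy fallback for anything else
    return codes


def from_codes(codes):
    """Invert to_codes; a control code consumes the code that follows it."""
    out = []
    it = iter(codes)
    for code in it:
        if code == CAPS:
            nxt = next(it, None)
            if nxt is not None and nxt < len(PRIMARY):
                out.append(PRIMARY[nxt].upper())
        elif code == SHIFT:
            nxt = next(it, None)
            if nxt is not None and nxt < len(SECONDARY):
                out.append(SECONDARY[nxt])
        else:
            out.append(PRIMARY[code])
    return "".join(out)


text = "Hello, World 42!"
codes = to_codes(text)
print(from_codes(codes) == text)   # True
print(len(text), len(codes))       # 16 21
```

In this scheme every capital letter and every secondary-table character costs ten bits instead of five. The 16-character example needs 21 codes, or 105 bits against 128 bits of ASCII, so the ratio falls below 1.6 in exchange for text that keeps its case, digits, and punctuation.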