# Trivial Compression

**Compression**: Taking data and encoding it in such a way that it takes less space.

**Decompression**: Reversing the compressed (encoded) data back to its original form.



> Why compress?
>
> Memory/space savings


> Why **not** compress?
>
> Compute expense

## Example: Gene storage in Python

Nucleotides have 4 possible values: **A, C, G, or T**

One character requires 8 bits of storage. A sequence of 1.2 million nucleotides stored as text therefore takes about 1.2 MB; a full human genome, at roughly 3 billion base pairs, would need about 3 GB.
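
As a rough back-of-the-envelope check, the sketch below compares one byte per nucleotide with the two bits that four distinct values actually need. The variable names are illustrative, and the sequence length is the 1.2 million figure used in the text.

```python
import math

NUCLEOTIDE_COUNT = 1_200_000  # sequence length used in the text above
ALPHABET_SIZE = 4             # A, C, G, T

# Stored as text: one 8-bit character per nucleotide.
bits_as_text = NUCLEOTIDE_COUNT * 8

# Four distinct values fit in log2(4) = 2 bits.
bits_per_symbol = math.ceil(math.log2(ALPHABET_SIZE))
bits_packed = NUCLEOTIDE_COUNT * bits_per_symbol

print(bits_as_text // 8)           # bytes when stored as text
print(bits_packed // 8)            # bytes when packed two bits per nucleotide
print(bits_as_text / bits_packed)  # compression ratio
```

This ignores any per-object overhead, so real measurements will come out slightly worse than the ideal ratio.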

Since this is still a sequence, turning it into a categorical variable is not really an option.

Another option is a _bit string_.

Python has no built-in type dedicated to bit strings of arbitrary length, although an `int` can hold arbitrarily many bits. There is also a PyPI library, `bitstring`; I will try that out later.
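
Even without a dedicated bit-string type, a plain Python `int` has arbitrary precision and can play that role, which is what the class below relies on. A minimal sketch using only built-ins (`int.bit_length` and `bin`); the variable name `bits` is illustrative:

```python
# A Python int grows as needed, so it can hold a bit string of any length.
bits = 1  # leading sentinel bit
for _ in range(100):
    bits = (bits << 2) | 0b11  # append two more bits

print(bits.bit_length())  # 201: 1 sentinel bit + 100 * 2 payload bits

# Underscores may be used to group digits in a binary literal.
print(bin(0b1_00_01_10_11))  # 0b100011011
```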

We will go the route of building our own class.

In [6]:
class CompressedGene:
    def __init__(self, gene: str) -> None:
        self._compress(gene)
        
    def _compress(self, gene: str) -> None:
        self.bit_string: int = 1  # start with sentinel
        for nucleotide in gene.upper():
            self.bit_string <<= 2  # shift left two bits to make room for the next nucleotide
            if nucleotide == "A":
                # change last two bits to 00
                self.bit_string |= 0b00
            elif nucleotide == "C":
                # change last two bits to 01
                self.bit_string |= 0b01
            elif nucleotide == "G":
                # change last two bits to 10
                self.bit_string |= 0b10
            elif nucleotide == "T":
                # change last two bits to 11
                self.bit_string |= 0b11
            else:
                raise ValueError(f"Invalid Nucleotide: {nucleotide}")
    
    def decompress(self) -> str:
        gene: str = ""
        # step through the bits two at a time; the -1 excludes the sentinel added in _compress
        for idx in range(0, self.bit_string.bit_length() - 1, 2):
            bits: int = (self.bit_string >> idx) & 0b11  # keep only the two bits at offset idx
            if bits == 0b00:
                gene += 'A'
            elif bits == 0b01:
                gene += 'C'
            elif bits == 0b10:
                gene += 'G'
            elif bits == 0b11:
                gene += 'T'
            else:
                raise ValueError(f"Invalid bits: {bits}")
        # bits were read from the low end, so reverse to restore the original order
        return gene[::-1]
    
    def __str__(self) -> str:
        return self.decompress()

In [3]:
from sys import getsizeof

original: str = "TAAAAAAAAGGTTTTAAATATTTATATAGGGGTATATAGCGCGCTATGCACACACACACA" * 100

In [5]:
print(f"Original gene is {getsizeof(original)} bytes")

compressed: CompressedGene = CompressedGene(original)

print(f"Compressed is {getsizeof(compressed.bit_string)} bytes")

Original gene is 6049 bytes
Compressed is 1628 bytes
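
The reported sizes include Python object overhead, so the ratio is not exactly 4:1. As a rough cross-check, the sketch below computes the raw payload implied by the encoding: two bits per nucleotide plus one sentinel bit. The exact `getsizeof` figures are CPython implementation details and may differ between versions, so the code deliberately does not depend on them. `stand_in` is a hypothetical value with the same bit length as the compressed gene.

```python
gene_length = 60 * 100  # the 60-character pattern above, repeated 100 times

payload_bits = 2 * gene_length + 1       # +1 for the sentinel bit
payload_bytes = (payload_bits + 7) // 8  # round up to whole bytes

# An int with the same number of bits as the compressed gene.
stand_in = 1 << (2 * gene_length)
assert stand_in.bit_length() == payload_bits

print(payload_bits)   # 12001
print(payload_bytes)  # 1501
```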


In [7]:
print(f"Original and decompressed are the same: {original == compressed.decompress()}")

Original and decompressed are the same: True
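
To see why `_compress` starts from a sentinel bit of 1, it helps to trace a tiny gene by hand. The sketch below repeats the same packing in a standalone function with an optional sentinel. `pack` and `CODES` are hypothetical helpers, not part of the class above, and a lookup dictionary stands in for the `if`/`elif` chain.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(gene: str, sentinel: int = 1) -> int:
    """Pack a gene two bits per nucleotide, starting from `sentinel`."""
    bits = sentinel
    for nucleotide in gene.upper():
        bits = (bits << 2) | CODES[nucleotide]  # make room, then set the low two bits
    return bits


print(bin(pack("ACGT")))              # 0b100011011
print(bin(pack("AACG")))              # 0b100000110
print(bin(pack("AACG", sentinel=0)))  # 0b110
```

Without the sentinel, the leading `A`s (encoded as `00`) vanish: `0b110` is also what `"CG"` packs to, so the original length could not be recovered. Note that this sketch raises `KeyError` rather than `ValueError` on an invalid nucleotide.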


## Quick Review of Bitwise operations

In [8]:
0b01

1

In [9]:
type(0b01)

int

So a binary literal is simply an `int` written in base 2.

In [18]:
0b01 == 0b1

True

### OR

In [10]:
1|1

1

In [11]:
0b01 | 0b01

1

In [12]:
0b001 | 0b010

3

In [13]:
0b001 | 0b010 == 0b011

True

So bitwise **or** (a\|b) sets every bit in the result that is set in either operand:

| a | b | a\|b |
|----|---|----|
|0b0 | 0b0 | 0b0 |
|0b0 | 0b1 | 0b1 |
|0b1 | 0b0 | 0b1 |
|0b1 | 0b1 | 0b1 |
|0b0 | 0b10 | 0b10 |
|0b10 | 0b0 | 0b10 |
|0b10 | 0b10 | 0b10 |
|0b11 | 0b10 | 0b11 |
|0b10 | 0b11 | 0b11 |
|0b11 | 0b11 | 0b11 |
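
The table can be checked mechanically. The sketch below asserts every row, then shows the idiom `_compress` depends on: after a left shift by two, the low two bits are zero, so OR-ing in a 2-bit code sets exactly those bits. The names `rows` and `value` are illustrative.

```python
# Each tuple is (a, b, expected a | b), copied from the table above.
rows = [
    (0b0, 0b0, 0b0),
    (0b0, 0b1, 0b1),
    (0b1, 0b0, 0b1),
    (0b1, 0b1, 0b1),
    (0b0, 0b10, 0b10),
    (0b10, 0b0, 0b10),
    (0b10, 0b10, 0b10),
    (0b11, 0b10, 0b11),
    (0b10, 0b11, 0b11),
    (0b11, 0b11, 0b11),
]
for a, b, expected in rows:
    assert a | b == expected

# Shift left by two, then OR: the new code lands in the freshly zeroed bits.
value = 0b1
value <<= 2    # 0b100
value |= 0b10  # 0b110
print(bin(value))  # 0b110
```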

In [21]:
0b11 | 0b10

3

### AND

In [23]:
0b11 & 0b01

1
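
Bitwise **and** (`&`) is what `decompress` uses as a mask: `x & 0b11` keeps only the two lowest bits of `x` and clears everything else. A minimal sketch, where `packed` is a hand-written example value (a sentinel followed by the codes 00, 01, 10, 11):

```python
packed = 0b1_00_01_10_11  # sentinel 1, then 00 01 10 11

# Mask off everything except the two lowest bits.
low_two = packed & 0b11
print(bin(low_two))  # 0b11

# Shift right by two first to bring the next pair down, then mask again.
next_two = (packed >> 2) & 0b11
print(bin(next_two))  # 0b10
```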

### Left shift

In [10]:
print(bin(0b10 << 0))
print(bin(0b10 << 1))
print(bin(0b10 << 2))
print(bin(0b11 << 1))
print(bin(0b11 << 2))

0b10
0b100
0b1000
0b110
0b1100


### Right shift

In [1]:
print(bin(0b10 >> 0))
print(bin(0b10 >> 1))
print(bin(0b10 >> 2))
print(bin(0b11 >> 1))
print(bin(0b11 >> 2))

0b10
0b1
0b0
0b1
0b0
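
Putting the shifts together: a left shift by two makes room for one more 2-bit value, and a right shift by `idx` brings an earlier value down to the low end where it can be masked off. The sketch below round-trips a short list of 2-bit values; `values`, `packed`, and `unpacked` are illustrative names.

```python
values = [0b11, 0b00, 0b10]

# Pack: shift left to make room, then OR the new value into the low two bits.
packed = 1  # sentinel bit
for v in values:
    packed = (packed << 2) | v

# Unpack: read 2-bit fields from the low end, skipping the sentinel bit.
unpacked = []
for idx in range(0, packed.bit_length() - 1, 2):
    unpacked.append((packed >> idx) & 0b11)
unpacked.reverse()  # fields come out last-in-first-out, so restore the order

print(bin(packed))  # 0b1110010
print(unpacked)     # [3, 0, 2]
```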


In [6]:
print(bin(0b01))

0b1
