## trivial compression example using DNA nucleotides

<br>

***

<br>

simply put, storing each nucleotide as a string character is inefficient: a character takes at least 8 bits, while four distinct values fit in just **two bits**

$$
2^2 = 4
$$

| nucleotide | bits |
| --- | --- |
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |
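to make the table concrete, here is a minimal sketch that encodes a short sequence as a string of bits and compares the bit counts. the name `NUCLEOTIDE_BITS` and the sample sequence are illustrative assumptions, and the 8 bits per character figure assumes a one-byte-per-character encoding such as ASCII.

```python
# Mapping taken from the table above; the dict name is an assumption for illustration.
NUCLEOTIDE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

sequence = "ACGT"

# Concatenate the two-bit code of every nucleotide.
encoded = "".join(NUCLEOTIDE_BITS[nucleotide] for nucleotide in sequence)

print(encoded)            # 00011011
print(len(sequence) * 8)  # 32 bits at one byte per character
print(len(encoded))       # 8 bits at two bits per nucleotide
```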

```python
class CompressedGene:
    def __init__(self, gene: str) -> None:
        self._compress(gene)

    def _compress(self, gene: str) -> None:
        # start with a sentinel bit so leading "A"s (00) are not lost
        self.bit_string: int = 1
        for nucleotide in gene.upper():
            # shift two bits to the left to make room for the next nucleotide
            self.bit_string <<= 2
            if nucleotide == "A":
                self.bit_string |= 0b00
            elif nucleotide == "C":
                self.bit_string |= 0b01
            elif nucleotide == "G":
                self.bit_string |= 0b10
            elif nucleotide == "T":
                self.bit_string |= 0b11
            else:
                raise ValueError("invalid nucleotide: {}".format(nucleotide))

    def decompress(self) -> str:
        gene: str = ""
        # bit_length() - 1 excludes the sentinel bit
        for i in range(0, self.bit_string.bit_length() - 1, 2):
            # read two bits at a time, starting from the right
            bits: int = self.bit_string >> i & 0b11
            if bits == 0b00:
                gene += "A"
            elif bits == 0b01:
                gene += "C"
            elif bits == 0b10:
                gene += "G"
            elif bits == 0b11:
                gene += "T"
            else:
                raise ValueError("invalid bits: {}".format(bits))
        return gene[::-1]  # bits were read right to left, so reverse the result

    def __str__(self) -> str:
        return self.decompress()
```
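the class starts `bit_string` at `1` rather than `0`. a likely reason is that this leading bit acts as a sentinel: since "A" maps to `00`, an integer that starts at `0` would silently drop any leading "A"s. the sketch below illustrates this with a hypothetical `pack` helper (not part of the class above) that takes the starting value as a parameter.

```python
# Same two-bit codes as the table; BITS and pack are hypothetical names for this sketch.
BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(gene: str, start: int) -> int:
    """Pack a gene into an int, two bits per nucleotide, beginning from `start`."""
    packed = start
    for nucleotide in gene:
        packed = (packed << 2) | BITS[nucleotide]
    return packed


# Starting from 0, leading "A"s vanish and two different genes collide.
print(pack("AAC", 0) == pack("C", 0))  # True

# Starting from 1, the sentinel bit keeps the leading zeros in place.
print(pack("AAC", 1) == pack("C", 1))  # False
print(bin(pack("AAC", 1)))             # 0b1000001
```

this is also why `decompress` iterates up to `bit_length() - 1`: the topmost bit is the sentinel, not data.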

```python
from sys import getsizeof

original: str = "TAGGATTATTATTATTAGGATCGATTATA" * 100
print("original: {} bytes".format(getsizeof(original)))

compressed: CompressedGene = CompressedGene(original)
print("compressed: {} bytes".format(getsizeof(compressed.bit_string)))

print("original and decompressed are identical: {}".format(original == compressed.decompress()))
```

```
original: 2941 bytes
compressed: 800 bytes
original and decompressed are identical: True
```
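note that `getsizeof` reports the size of the whole Python object, including interpreter overhead, so the exact figures depend on the Python version. as a rough cross-check, the sketch below counts only the raw payload: 2900 nucleotides at two bits each, plus the sentinel bit, rounded up to whole bytes. the names `BITS`, `packed`, and `payload` are assumptions for illustration, and the packing loop is a condensed stand-in for `_compress`.

```python
from sys import getsizeof

# Same two-bit codes as the table; names here are illustrative only.
BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

gene = "TAGGATTATTATTATTAGGATCGATTATA" * 100

# Pack two bits per nucleotide behind a leading sentinel bit.
packed = 1
for nucleotide in gene:
    packed = (packed << 2) | BITS[nucleotide]

# Convert the integer to the smallest byte string that can hold it.
payload = packed.to_bytes((packed.bit_length() + 7) // 8, "big")

print(len(gene))            # 2900 nucleotides
print(packed.bit_length())  # 5801 bits (2 * 2900 + 1 sentinel bit)
print(len(payload))         # 726 bytes of raw payload
print(getsizeof(packed))    # larger than the payload; value is interpreter-specific
```

the gap between the raw payload and the `getsizeof` result is the bookkeeping that the interpreter adds to every integer object.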
