### Trivial compression

**Compression** is the act of taking data and encoding it (changing its form) in such a way that it takes up less space. 

**Decompression** is reversing the process, returning the data to its original form.


The easiest data compression wins come about when you realize that data storage types use more bits than are strictly required for their contents.

### Nucleotides that form a gene in DNA.

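Each nucleotide can only be one of four values: A, C, G or T. Four possible values fit in just two bits (00, 01, 10, and 11), far fewer than the bits a character in a `str` occupies.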

Instead of storing our nucleotides as a `str`, they can be stored as a *bit string* (a sequence of 1s and 0s). 
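To make the potential saving concrete, here is a minimal sketch of the arithmetic. The names `NUCLEOTIDE_BITS` and `bits_needed` are illustrative only, and the one-byte-per-character figure is an assumption that holds for plain ASCII text; it ignores any per-object overhead.

```python
# Hypothetical mapping, for illustration only: each nucleotide gets a 2-bit code.
NUCLEOTIDE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def bits_needed(gene: str) -> int:
    """Bits required when every nucleotide is stored in 2 bits."""
    return 2 * len(gene)


gene = "ACGT"
print(bits_needed(gene))  # 8 bits for 4 nucleotides
print(8 * len(gene))      # 32 bits, assuming 1 byte per character
```

Under those assumptions, the two-bit encoding needs a quarter of the space.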

Unfortunately, the Python standard library contains no off-the-shelf construct for working with bit strings of arbitrary length. The following code converts a `str` composed of As, Cs, Gs, and Ts into a string of bits and back again. The string of bits is stored within an `int`. 
Because an `int` in Python can be of any length, it can be used as a bit string of any length. To convert back into a `str`, we will implement the Python `__str__()` special method. 
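As a quick illustration of that property, the snippet below builds an `int` far wider than a 64-bit machine word and inspects it with the built-in `int.bit_length()` and `bin()`. The variable names are just for this sketch.

```python
big = 1 << 200            # a 1 followed by 200 zero bits
print(big.bit_length())   # 201

small = 0b1011
print(small.bit_length())  # 4
print(bin(small))          # 0b1011
```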

In [1]:
class CompressedGene:
    def __init__(self, gene: str) -> None:
        self._compress(gene)

A `CompressedGene` is provided a `str` of characters representing the nucleotides in a gene, and it internally stores the sequence of nucleotides as a bit string. 

The `__init__()` method's main responsibility is to initialize the bit-string construct with the appropriate data. 

`__init__()` calls `_compress()` to do the dirty work of actually converting the provided `str` of nucleotides into a bit string. 

Note that `_compress()` starts with an underscore. Python has no concept of truly private methods or variables (no strict enforcement of privacy). A leading underscore is used as *a convention to indicate that the implementation of a method should not be relied on by actors outside the class.* 
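The following small sketch shows that the underscore is only a convention. The `Example` class and its methods are hypothetical and exist purely for this demonstration.

```python
class Example:
    def _helper(self) -> str:
        # Leading underscore: intended for internal use only.
        return "internal"

    def public(self) -> str:
        # The public method is free to rely on the helper.
        return self._helper().upper()


e = Example()
print(e.public())   # INTERNAL
print(e._helper())  # internal (still callable; nothing enforces privacy)
```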

### Performing the compression

In [2]:
def _compress(self, gene: str) -> None:
    self.bit_string: int = 1   #starts with sentinel
    for nucleotide in gene.upper():
        self.bit_string <<= 2   # Shift left two bits
        if nucleotide == "A":   #change last two bits to 00
            self.bit_string |= 0b00
        elif nucleotide == "C":  # change last two bits to 01
            self.bit_string |= 0b01
        elif nucleotide == "G":  # change last two bits to 10
            self.bit_string |= 0b10
        elif nucleotide == "T":  # change last two bits to 11
            self.bit_string |= 0b11
        else:
            raise ValueError('Invalid Nucleotide:{}'.format(nucleotide))        

The `_compress()` method looks at each character in the `str` of nucleotides sequentially. When it sees an A, it adds 00 to the bit string. When it sees a C, it adds 01, and so on. Remember that two bits are needed for each nucleotide. As a result, before we add each new nucleotide, we shift the bit string two bits to the left (`self.bit_string <<= 2`).

After each left shift, the two rightmost bits of the bit string are 0s. The nucleotide is then added using an "or" operation (`|`): ORing those two 0 bits with the nucleotide's two-bit code effectively replaces them with that code. 
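A short trace may help here. The sketch below repeats the same shift-then-or steps outside the class for the two-nucleotide gene "CG", printing the bit string after every operation. The loop over hard-coded `(nucleotide, code)` pairs is a simplification for illustration.

```python
bit_string = 1  # sentinel, as in _compress()

for nucleotide, code in (("C", 0b01), ("G", 0b10)):
    bit_string <<= 2        # make room: two 0 bits appear on the right
    print(bin(bit_string))
    bit_string |= code      # fill those two bits with the nucleotide's code
    print(bin(bit_string))

# Printed values, in order:
# 0b100
# 0b101
# 0b10100
# 0b10110
```

The final value reads, from left to right, as the sentinel `1`, then `01` for C, then `10` for G.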

### Decompression

Next comes `decompress()`, along with the special `__str__()` method that uses it.

In [3]:
def decompress(self) -> str:
    gene: str = ""
    for i in range(0, self.bit_string.bit_length() - 1, 2):  # -1 to exclude sentinel 
        bits : int = self.bit_string >> i & 0b11   #get just 2 relevant bits
        if bits == 0b00:   # A
            gene += 'A'
        elif bits == 0b01: #C
            gene += 'C'
        elif bits == 0b10: #G
            gene += 'G'
        elif bits == 0b11: #T
            gene += 'T'
        else:
            raise ValueError('Invalid bits:{}'.format(bits))
    return gene[::-1]    # reverse

def __str__(self) -> str:    #String representation for pretty printing
    return self.decompress()
            

`decompress()` reads two bits from the bit string at a time and uses those two bits to determine which character to add to the end of the `str` representation of the gene. Because the bits are read backward (from right to left) compared to the order in which they were compressed, the `str` representation is ultimately reversed. 
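To see the backward reading in action, here is a minimal standalone sketch that decodes the bit string `0b10110` (sentinel, then C, then G). It uses a lookup string, `letters`, in place of the `if`/`elif` chain; that substitution is a shortcut for this example, not how the class above does it.

```python
bit_string = 0b10110  # sentinel 1, then 01 (C), then 10 (G)
letters = "ACGT"      # index in this string equals the 2-bit code

gene = ""
for i in range(0, bit_string.bit_length() - 1, 2):  # -1 skips the sentinel
    bits = (bit_string >> i) & 0b11  # isolate the pair starting at bit i
    gene += letters[bits]

print(gene)        # GC (rightmost pair was read first)
print(gene[::-1])  # CG (original order restored)
```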

Finally, note how the convenient `int` method `bit_length()` aided in the development of `decompress()`. Let's test it out.
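Before running the full class, it is worth seeing why `bit_length()` depends on the sentinel. Leading 0 bits are not counted by `bit_length()`, so a gene that starts with A (code 00) would lose nucleotides without it. The helper `pack` below is a hypothetical stand-in for `_compress()` that lets the starting value be varied.

```python
def pack(gene: str, start: int) -> int:
    """Pack a gene into an int, beginning from the given start value."""
    codes = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    bits = start
    for nucleotide in gene:
        bits = (bits << 2) | codes[nucleotide]
    return bits


without_sentinel = pack("AAC", 0)
with_sentinel = pack("AAC", 1)

print(without_sentinel.bit_length())          # 1 (the two leading As vanished)
print(with_sentinel.bit_length())             # 7 (sentinel + 3 pairs)
print((with_sentinel.bit_length() - 1) // 2)  # 3 nucleotides recovered
```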

### All together:

In [18]:
class CompressedGene:
    def __init__(self, gene: str) -> None:
        self._compress(gene)
    def _compress(self, gene: str) -> None:
        self.bit_string: int = 1   #starts with sentinel
        for nucleotide in gene.upper():
            self.bit_string <<= 2   # Shift left two bits
            if nucleotide == "A":   #change last two bits to 00
                self.bit_string |= 0b00
            elif nucleotide == "C":  # change last two bits to 01
                self.bit_string |= 0b01
            elif nucleotide == "G":  # change last two bits to 10
                self.bit_string |= 0b10
            elif nucleotide == "T":  # change last two bits to 11
                self.bit_string |= 0b11
            else:
                raise ValueError('Invalid Nucleotide:{}'.format(nucleotide))    
    def decompress(self) -> str:
        gene: str = ""
        for i in range(0, self.bit_string.bit_length() - 1, 2):  # -1 to exclude sentinel 
            bits : int = self.bit_string >> i & 0b11   #get just 2 relevant bits
            if bits == 0b00:   # A
                gene += 'A'
            elif bits == 0b01: #C
                gene += 'C'
            elif bits == 0b10: #G
                gene += 'G'
            elif bits == 0b11: #T
                gene += 'T'
            else:
                raise ValueError('Invalid bits:{}'.format(bits))
        return gene[::-1]    # reverse
    def __str__(self) -> str:    #String representation for pretty printing
        return self.decompress()

### Test

In [19]:
from sys import getsizeof

original: str = "TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATA" * 100

print("Original is {} bytes".format(getsizeof(original)))
compressed: CompressedGene = CompressedGene(original)  # Compress
print("Compressed is {} bytes".format(getsizeof(compressed.bit_string)))
print(compressed)  #decompress
print("Original and decompressed are the same: {}".format(original == compressed.decompress()))

Original is 8649 bytes
Compressed is 2320 bytes
TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGA
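As a rough sanity check on the reported sizes, the arithmetic below estimates the smallest number of bytes the compressed bit string could occupy. It assumes the test gene is the 86-character string repeated 100 times, as in the test cell; the variable names are illustrative.

```python
nucleotides = 86 * 100              # length of the test gene
payload_bits = 2 * nucleotides + 1  # 2 bits each, plus the sentinel bit
min_bytes = (payload_bits + 7) // 8  # round up to whole bytes

print(payload_bits)  # 17201
print(min_bytes)     # 2151
```

The figures printed by `getsizeof()` are somewhat larger than these raw counts for both the `str` and the `int`, which is consistent with `getsizeof()` measuring the whole Python object, including its bookkeeping overhead, rather than just the payload.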