# Lab 7

## Part 1: Create binary/nucleotide files

This part creates three types of files: 100-megabyte binary files with varying proportions of 0s and 1s, a DNA file with 100 million letters, and a protein file with 100 million letters.

The binary files will have 100%, 90%, 80%, 70%, 60%, and 50% 0s, respectively.

In [12]:
import numpy as np
import pandas as pd

In [28]:
[1-.5,.5]

[0.5, 0.5]

In [49]:
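def binfile(prob, out):
    # prob is the probability of a 0 bit; 1 - prob is the probability of a 1 bit
    probs = [prob, 1 - prob]
    print(probs)
    # 800,000,000 random bits pack into 100,000,000 bytes (100 MB)
    myvar = np.random.choice([0, 1], size=800000000, replace=True, p=probs)
    myvar = np.packbits(myvar)
    # use a context manager so the file is flushed and closed
    with open(out, 'wb') as f:
        f.write(myvar)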

In [50]:
binfile(.5,'test')

[0.5, 0.5]


In [40]:
binfile(.9,'zero90')
binfile(.8,'zero80')
binfile(.7,'zero70')
binfile(.6,'zero60')
binfile(.5,'zero50')

[0.9, 0.09999999999999998]
[0.8, 0.19999999999999996]
[0.7, 0.30000000000000004]
[0.6, 0.4]
[0.5, 0.5]


In [45]:
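def seqfile(types, out):
    if 'p' in types.lower():
        # 20 standard amino acids plus the ambiguity codes B and Z (22 symbols)
        seqs=['A','R','N','D','B','C','E','Q','Z','G','H','I','L','K','M','F','P','S','T','W','Y','V']
    elif 'n' in types.lower():
        seqs=['A','C','G','T']
    else:
        # without this branch, seqs would be undefined for an unrecognized type
        raise ValueError("types must name a protein or nucleotide sequence")
    myvar = np.random.choice(seqs, size=100000000, replace=True)
    print(len(myvar))
    with open(out, 'w') as f:
        f.write(''.join(myvar))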

In [46]:
seqfile('nucleotide','nucltest.fa')
seqfile('protein','protest.fa')

100000000
100000000


## Part 2: Compression of files

I ran the following line in terminal to compress all of the files:

`time gzip -k nucl.fa && time bzip2 -k nucl.fa && ls -l && time pbzip2 -f -k nucl.fa && time ArithmeticCompress nucl.fa nucl.art`

This line runs every compression algorithm on one file (nucl.fa above); I reran it with the filename replaced by each of the other files (zero100, zero90, ...). Because pbzip2 writes its output under the same name as bzip2 (nucl.fa.bz2), I ran `ls -l` before pbzip2 to record the size of the bzip2 output before `pbzip2 -f` overwrote it.
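The measurement idea above (record wall-clock time and output size per algorithm) can also be sketched from Python. This is only an illustration using the standard-library `gzip` and `bz2` modules on a small in-memory buffer; it is not what produced the numbers in this lab, its timings will not match the command-line tools, and pbzip2 and ArithmeticCompress have no standard-library equivalent, so they are left out. The helper name `compress_stats` is made up for this example.

```python
import bz2
import gzip
import time


def compress_stats(data: bytes) -> dict:
    """Return {algorithm: (compressed size in bytes, seconds taken)}."""
    results = {}
    for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed), elapsed)
    return results


if __name__ == "__main__":
    # 1 MB of zero bytes: a small stand-in for the zero100 file
    sample = bytes(1_000_000)
    for name, (size, seconds) in compress_stats(sample).items():
        print(f"{name}: {size} bytes in {seconds:.3f} s")
```

Running it on a slice of each generated file would give a quick sanity check before starting the much slower full-size runs.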

In [42]:
d = {'File':['zero100','zero90','zero80','zero70','zero60','zero50','nucl.fa','pro.fa'],
    'gzip time':[.687,17.943,12.775,5.968,4.123,3.46,12.14,4.174],
    'bzip2 time':[.999,10.147,11.382,13.168,14.942,15.87,9.513,10.125],
    'pbzip2 time':[.105,.764,.948,1.108,1.383,1.418,.644,.774],
    'arithmetic compress time':[14.29,27.453,33.781,37.807,39.24,39.019,21.426,28.928]}
df = pd.DataFrame(data=d)
df

| | File | gzip time | bzip2 time | pbzip2 time | arithmetic compress time |
|---|---|---|---|---|---|
| 0 | zero100 | 0.687 | 0.999 | 0.105 | 14.29 |
| 1 | zero90 | 17.943 | 10.147 | 0.764 | 27.453 |
| 2 | zero80 | 12.775 | 11.382 | 0.948 | 33.781 |
| 3 | zero70 | 5.968 | 13.168 | 1.108 | 37.807 |
| 4 | zero60 | 4.123 | 14.942 | 1.383 | 39.24 |
| 5 | zero50 | 3.46 | 15.87 | 1.418 | 39.019 |
| 6 | nucl.fa | 12.14 | 9.513 | 0.644 | 21.426 |
| 7 | pro.fa | 4.174 | 10.125 | 0.774 | 28.928 |


In [44]:
m = {'File':['zero100','zero90','zero80','zero70','zero60','zero50','nucl.fa','pro.fa'],
    'original space':[100000000,100000000,100000000,100000000,100000000,100000000,100000000,100000000],
    'gzip space':[97079,56021359,77399089,89283217,97673081,100015959,29221462,61755268],
    'bzip2 space':[113,58334020,82629366,95141110,100051721,100445844,27334470,56995814],
    'pbzip2 space':[5375,58355762,82644332,95147435,100056832,100450966,27342082,57005023],
    'arithmetic compress space':[1028,46905685,72193021,88132664,97095393,100001009,25001028,55743921]}
mem = pd.DataFrame(data=m)
mem

| | File | original space | gzip space | bzip2 space | pbzip2 space | arithmetic compress space |
|---|---|---|---|---|---|---|
| 0 | zero100 | 100000000 | 97079 | 113 | 5375 | 1028 |
| 1 | zero90 | 100000000 | 56021359 | 58334020 | 58355762 | 46905685 |
| 2 | zero80 | 100000000 | 77399089 | 82629366 | 82644332 | 72193021 |
| 3 | zero70 | 100000000 | 89283217 | 95141110 | 95147435 | 88132664 |
| 4 | zero60 | 100000000 | 97673081 | 100051721 | 100056832 | 97095393 |
| 5 | zero50 | 100000000 | 100015959 | 100445844 | 100450966 | 100001009 |
| 6 | nucl.fa | 100000000 | 29221462 | 27334470 | 27342082 | 25001028 |
| 7 | pro.fa | 100000000 | 61755268 | 56995814 | 57005023 | 55743921 |


1. For all the files except for the 100% zero file, arithmetic coding saves the most space. For the all zero file, bzip2 is the best algorithm. 
2. pbzip2 is the fastest algorithm for all file types. 
3. pbzip2 is a parallel implementation of bzip2 that splits the work across processor cores, aiming for a near-linear speedup with the number of cores. I would therefore expect pbzip2 to be faster than the single-threaded bzip2.
4. Compression gets worse as the fraction of zeros decreases because the entropy per bit of the file increases as the split approaches 50/50, so more bits are required to encode the file.
5. 2
6. 4.32.
7. gzip required 2.338 bits per character for DNA and 4.940 bits per character for protein. bzip2 required 2.187 bits per character for DNA and 4.560 bits per character for protein.
8. Both get close to the ideal code length but do not reach it. bzip2 gets closer to the ideal than gzip.
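The entropy reasoning in the answers above can be checked with a few lines of Python. This is a sketch with made-up helper names (`binary_entropy`, `bits_per_char`); it only reuses sizes from the table in this part. For example, a 90% zero source has an entropy of about 0.469 bits per bit, which predicts roughly 46.9 million bytes for a 100,000,000-byte file, close to the 46,905,685 bytes that arithmetic coding achieved.

```python
import math


def binary_entropy(p_zero: float) -> float:
    """Shannon entropy, in bits per bit, of a source emitting 0 with probability p_zero."""
    if p_zero in (0.0, 1.0):
        return 0.0  # a constant source carries no information
    p_one = 1.0 - p_zero
    return -(p_zero * math.log2(p_zero) + p_one * math.log2(p_one))


def bits_per_char(compressed_bytes: int, n_chars: int) -> float:
    """Average number of compressed bits spent per original character."""
    return compressed_bytes * 8 / n_chars


if __name__ == "__main__":
    n_bytes = 100_000_000
    # Lower bound on compressed size (in bytes) for each binary file
    for p in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
        print(f"{p:.0%} zeros: at least {binary_entropy(p) * n_bytes:,.0f} bytes")

    # Uniform alphabets: 4 nucleotides, 20 amino acids, and the 22 symbols
    # that seqfile actually draws from (it includes B and Z)
    for k in (4, 20, 22):
        print(f"log2({k}) = {math.log2(k):.3f} bits per character")

    # Achieved rates, using the gzip and bzip2 sizes from the table above
    for name, dna, protein in (("gzip", 29_221_462, 61_755_268),
                               ("bzip2", 27_334_470, 56_995_814)):
        print(name, bits_per_char(dna, n_bytes), bits_per_char(protein, n_bytes))
```

Note that the generated protein file uses 22 symbols, so its uniform entropy is log2(22), slightly above the 20-letter value; the arithmetic coding result for pro.fa (55,743,921 bytes) sits almost exactly at that bound.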

## Part 3: Compression of gp120 sequences

I expect this file to compress better than the random sequence files, because random sequences carry more information per character than non-random ones, and genetic sequences are usually not completely random. Since these sequences are homologs, they share long stretches of similar sequence, so I also expect a high compression ratio: a compressor can encode the repeated regions once and reuse them.
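The expectation above can be illustrated on synthetic data. The sketch below is not the gp160 data: it builds 20 near-identical copies of one random DNA sequence as a stand-in for homologs and compares them with an unrelated random sequence of the same length, using Python's `bz2` module. The helper names, the sequence length, and the mutation count are all arbitrary choices for this example.

```python
import bz2
import random


def compression_ratio(text: str) -> float:
    """Original size divided by bzip2-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(bz2.compress(raw))


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Overwrite n_mutations random positions with a random base (may repeat the old base)."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_mutations):
        chars[pos] = rng.choice("ACGT")
    return "".join(chars)


def make_examples(seed: int = 0, length: int = 1000, copies: int = 20):
    """Return (homolog-like text, unrelated random text) of equal length."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(length))
    homologs = "".join(mutate(base, 10, rng) for _ in range(copies))
    unrelated = "".join(rng.choice("ACGT") for _ in range(length * copies))
    return homologs, unrelated


if __name__ == "__main__":
    homologs, unrelated = make_examples()
    print("homolog-like:", round(compression_ratio(homologs), 2))
    print("unrelated:   ", round(compression_ratio(unrelated), 2))
```

The homolog-like text should show a clearly higher ratio, since most of each copy repeats material the compressor has already seen.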

I couldn't find gp120 sequences, so I instead used gp160 sequences; gp160 is the precursor protein that is cleaved to produce gp120. I obtained my sequences from https://www.ncbi.nlm.nih.gov/genome/genomes/10319

In [47]:
compressed=np.array([6757,6403,8623])
og=25872
og/compressed

array([3.82891816, 4.04060597, 3.00034791])

In [48]:
# nucl.fa sizes in the same order as `compressed`: gzip, bzip2, arithmetic coding
compressedran=np.array([29221462,27334470,25001028])
ogran=100000000
ogran/compressedran

array([3.42214226, 3.65838445, 3.99983553])

original=25872

gzip=6757

bzip=6403

arithcoding=8623

Matching by algorithm and using the nucl.fa sizes from Part 2 as the random baseline, the compression ratio for the non-random sequences was better for gzip (3.83 vs. 3.42) and bzip2 (4.04 vs. 3.66). Arithmetic coding was the exception: it did worse on the non-random data (3.00) than on the random data (4.00).

## Part 4: Estimating Compression

For the genome data, I would use bzip2, because for similar, non-random sequences bzip2 performed best and had the greatest compression ratio. Since the ratio is 4.04, the 800 terabytes of data will ideally shrink by a factor of 4.04, leaving about 198 terabytes.

For the protein sequences, I would use arithmetic coding. I do not know whether the protein sequences are random, so I will assume the worst-case scenario, which is that they are random. Arithmetic coding compressed the random protein sequences by a factor of 1.79, so the 100 terabytes of data will be compressed into 55.87 terabytes.

For the image files, arithmetic coding would be the best of the tested algorithms, because I am treating these files as completely random binary files (comparable to the 50% zero file), and arithmetic coding produced the smallest output on that file. However, even that output (100,001,009 bytes) is larger than the uncompressed 100,000,000-byte file, so it makes more sense to leave the image files uncompressed. These files will therefore still occupy 100 terabytes.

The data will now occupy 198 + 55.87 + 100, or about 353.87 terabytes, down from the original 1,000 terabytes (800 + 100 + 100). This frees 64.6% of the space. The bonus for this 64.6% is $11,791,946.93, calculated by multiplying the unrounded percentage (64.613) by 500 (dollars per day per 1%) and then by 365 (days per year).
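The arithmetic in this part can be reproduced with a short script. This is a sketch with a made-up function name (`estimate_storage`); the sizes, ratios, and bonus rate are the ones quoted above, with a ratio of 1.0 standing for the uncompressed image files. Because it does not round the genome figure to 198 terabytes first, its totals differ very slightly from the ones in the text.

```python
DOLLARS_PER_PERCENT_PER_DAY = 500
DAYS_PER_YEAR = 365


def estimate_storage(sizes_tb: dict, ratios: dict):
    """Return (terabytes after compression, percent of space freed, yearly bonus in dollars)."""
    compressed = {name: sizes_tb[name] / ratios[name] for name in sizes_tb}
    total_before = sum(sizes_tb.values())
    total_after = sum(compressed.values())
    percent_freed = 100 * (total_before - total_after) / total_before
    bonus = percent_freed * DOLLARS_PER_PERCENT_PER_DAY * DAYS_PER_YEAR
    return total_after, percent_freed, bonus


if __name__ == "__main__":
    sizes_tb = {"genome": 800, "protein": 100, "image": 100}
    # bzip2 on homologous sequences, arithmetic coding on random protein,
    # and no compression for the image files
    ratios = {"genome": 4.04, "protein": 1.79, "image": 1.0}
    total_after, percent_freed, bonus = estimate_storage(sizes_tb, ratios)
    print(f"{total_after:.2f} TB used, {percent_freed:.2f}% freed, ${bonus:,.2f} bonus")
```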