### Simulating the data
Fortunately, you know that most of the data you’ll be compressing will be nucleic acid
sequences. Some of it, however, will be binary files, and some will be protein sequences. Start
by writing some code to simulate files containing random DNA, protein, and binary data.

1. Using np.random.choice, generate 100 megabytes (8 bits/byte * 1024 bytes/kilobyte * 1024
kilobytes/megabyte * 100) of random data containing 100%, 90%, 80%, 70%, 60%, and 50%
zeros.    
The number of percentage of zeros can be adjusted by changing the p values of 0 and 1.    
Packbits are used to pack up the bits that are been generated.

In [4]:
# Doing 100% zeros
import numpy as np
myvar1 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[1, 0])
myvar1 = np.packbits(myvar1)

# Doing 90% zeros
myvar2 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.9, 0.1])
myvar2 = np.packbits(myvar2)

# Doing 80% zeros
myvar3 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.8, 0.2])
myvar3 = np.packbits(myvar3)

# Doing 70% zeros
myvar4 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.7, 0.3])
myvar4 = np.packbits(myvar4)

# Doing 60% zeros
myvar5 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.6, 0.4])
myvar5 = np.packbits(myvar5)

# Doing 50% zeros
myvar6 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.5, 0.5])
myvar6 = np.packbits(myvar6)

2. Then write this data to a file in your home directory
This produces 6 files containing the random generated zeros and ones with 105MB file size for each.

In [27]:
open('zeros_100p','wb').write(myvar1)
open('zeros_90p','wb').write(myvar2)
open('zeros_80p','wb').write(myvar3)
open('zeros_70p','wb').write(myvar4)
open('zeros_60p','wb').write(myvar5)
open('zeros_50p','wb').write(myvar6)

104857600

3. Next, generate DNA and protein sequences 100 million letters long and write those to your
home directory.    
Similar to what was did above. Instead, use ATCG nucleotides instead of 0 and 1 for generating nucleotide sequences, and the 20 amino acid letter alphabets for generating the protein sequence.    
The result should be 2 files of 100 MB large random generated nucleotide and protein sequences that the sequence is 100 milion letters long.

In [21]:
import numpy as np
my_nt_seq= np.random.choice(['A','T','C','G'], size=100000000, replace=True, p=[0.25, 0.25,0.25,0.25])
my_pro_seq= np.random.choice(['A','R','N','D','C','E','Q','G','H','I','L','K','M','F','P','S','T','W','Y','V'], size=100000000, replace=True)

In [23]:
open('nt_seq.fa','w').write(''.join(my_nt_seq))
open('pro_seq.fa','w').write(''.join(my_pro_seq))

100000000

### Compressing the data
You’ll have to do this from the terminal on bioe131.com via SSH (PuTTY on Windows) or via
iPython in your web browser. On each of the files you generated above, run gzip, bzip, pbzip2
and ArithmeticCompress as follows:
```
time gzip –k zeros_100p
time bzip2 –k zeros_100p
time pbzip2 –k zeros_100p
time ArithmeticCompress zeros_100p zeros_100p.art
```

#### The 0/1 input file sizes are 105 MB, and the nucleotide and protein file size are 100 MB!!!

### Raw data
#### 100 percent 0:           
1. time gzip –k zeros_100p      

real    0m0.702s      
user    0m0.690s     
sys     0m0.012s     

file output size: 102kb

2. time bzip2 –k zeros_100p

real    0m1.051s     
user    0m1.015s     
sys     0m0.036s      

file output size: 113b

3. time pbzip2 –k zeros_100p

real    0m0.099s     
user    0m1.787s           
sys     0m0.093s        

file output size :5.62kb

4. time ArithmeticCompress zeros_100p zeros_100p.art

real    0m14.932s      
user    0m14.895s      
sys     0m0.037s    

file output size: 1.03kb

----------------
#### 90 percent 0:       
1. time gzip –k zeros_90p

real    0m19.051s     
user    0m18.756s      
sys     0m0.124s    

file output size: 58.7MB

2. time bzip2 –k zeros_90p

real    0m11.303s     
user    0m11.215s     
sys     0m0.088s     

file output size: 61.2MB

3. time pbzip2 –k zeros_90p

real    0m0.759s     
user    0m18.815s     
sys     0m0.775s     

file output size: 61.2 MB

4. time ArithmeticCompress zeros_90p zeros_90p.art

real    0m0.002s     
user    0m0.000s     
sys     0m0.002s     

file output size: 1.02 kb

----------------
#### 80 percent 0:      
1. time gzip –k zeros_80p

real    0m13.569s     
user    0m13.291s    
sys     0m0.148s     

file output size: 81.2MB

2. time bzip2 –k zeros_80p

real    0m12.141s     
user    0m12.029s     
sys     0m0.112s     

file output size: 86.6MB

3. time pbzip2 –k zeros_80p

real    0m0.933s     
user    0m23.333s    
sys     0m0.799s     

file output size: 86.7MB

4. time ArithmeticCompress zeros_80p zeros_80p.art

real    0m35.749s    
user    0m35.584s      
sys     0m0.136s    

file output size: 75.7MB

----------------
#### 70 percent 0:      
1. time gzip –k zeros_70p

real    0m6.224s    
user    0m5.989s    
sys     0m0.100s     

file output size: 93.6 MB

2. time bzip2 –k zeros_70p

real    0m14.216s    
user    0m14.095s    
sys     0m0.120s     

file output size: 99.8 MB

3. time pbzip2 –k zeros_70p

real    0m1.183s    
user    0m30.074s     
sys     0m0.977s     

file output size: 99.8 MB

4. time ArithmeticCompress zeros_70p zeros_70p.art

real    0m39.501s    
user    0m39.228s    
sys     0m0.272s      

file output size: 92.4 MB

----------------
#### 60 percent 0:      
1. time gzip -k zeros_60p          

real    0m4.314s    
user    0m4.165s     
sys     0m0.144s      

file output size: 102 MB     

2. time bzip2 -k zeros_60p     

real    0m16.093s     
user    0m15.975s     
sys     0m0.117s     

file output size: 105 MB

3. time pbzip2 –k zeros_60p     

real    0m1.377s     
user    0m35.670s     
sys     0m0.841s      

file output size: 105 MB

4. time ArithmeticCompress zeros_60p zeros_60p.art

real    0m41.795s     
user    0m41.537s      
sys     0m0.200s      

file output size: 102 MB

----------------
#### 50 percent 0:             
1. time gzip -k zeros_50p     

real    0m3.503s     
user    0m3.372s     
sys     0m0.124s     

file output size: 105 MB

2. time bzip2 -k zeros_50p      

real    0m16.777s     
user    0m16.629s      
sys     0m0.148s      

file output size: 105 MB

3. time pbzip2 -k zeros_50p    

real    0m1.470s     
user    0m38.506s     
sys     0m0.732s      

file output size: 105 MB

4. time ArithmeticCompress zeros_50p zeros_50p.art

real    0m40.964s     
user    0m40.739s     
sys     0m0.224s      

file output size: 105 MB


----------------
#### nucleotide seq:                
1. time gzip -k nt_seq.fa

real    0m12.153s     
user    0m12.085s     
sys     0m0.068s      

file output size: 29.2 MB

2. time bzip2 -k nt_seq.fa

real    0m9.500s     
user    0m9.440s     
sys     0m0.061s     

file output size: 27.3 MB

3. time pbzip2 -k nt_seq.fa

real    0m0.662s     
user    0m16.189s      
sys     0m0.805s      

file output size: 27.3 MB

4. time ArithmeticCompress nt_seq.fa nt_seq.fa.art

real    0m21.512s     
user    0m21.292s     
sys     0m0.220s     

file output size: 25 MB


----------------
#### protein seq:                
1. time gzip -k pro_seq.fa

real    0m5.358s    
user    0m5.280s     
sys     0m0.049s     

file output size: 60.6 MB

2. time bzip2 -k pro_seq.fa

real    0m10.152s   
user    0m10.059s    
sys     0m0.093s     

file output size: 55.3 MB

3. time pbzip2 -k pro_seq.fa

real    0m0.783s     
user    0m18.937s     
sys     0m0.845s     

file output size: 55.3 MB

4. time ArithmeticCompress pro_seq.fa pro_seq.fa.art

real    0m28.622s    
user    0m28.333s     
sys     0m0.288s      

file output size: 54 MB


### Table of results

In [65]:
print("   Table of contents: file output size / compression time")

print("              ")

print("Compression     gzip        bzip2       pbzip2      Arithmetic")

print("              ")

print("100% 0         102 KB      113 B       5.62 KB     1.03 KB")

print("               0.702s      1.051s      0.099s      14.932s")

print("              ")

print("90%  0         58.7 MB     61.2 MB     61.2 MB      1.02 KB")

print("               19.051s     11.303s      0.759s      0.002s")

print("              ")
  
print("80%  0         81.2 MB     86.6 MB      86.7 MB      75.7 MB")

print("               13.569s     12.141s      0.933s       35.749s")

print("              ")

print("70%  0         93.6 MB     99.8 MB      99.8 MB      92.4 MB")

print("               6.224s     14.216s       1.183s       39.501s")

print("              ")

print("60%  0         102 MB      105 MB      105 MB       102 MB")

print("               4.314s      16.093s     1.377s       41.795s")

print("              ")

print("50%  0         105 MB      105 MB      105 MB       105 MB")

print("               3.503s      16.777s     1.470s       40.964s")

print("              ")

print("nucleotide    29.2 MB      27.3 MB      27.3 MB      25 MB")

print("               12.153s     9.500s       0.662s       21.512s")

print("              ")

print("protein       60.6 MB      55.3 MB      55.3 MB      54 MB")

print("              5.358s       10.152s      0.783s       28.622s")


   Table of contents: file output size / compression time
              
Compression     gzip        bzip2       pbzip2      Arithmetic
              
100% 0         102 KB      113 B       5.62 KB     1.03 KB
               0.702s      1.051s      0.099s      14.932s
              
90%  0         58.7 MB     61.2 MB     61.2 MB      1.02 KB
               19.051s     11.303s      0.759s      0.002s
              
80%  0         81.2 MB     86.6 MB      86.7 MB      75.7 MB
               13.569s     12.141s      0.933s       35.749s
              
70%  0         93.6 MB     99.8 MB      99.8 MB      92.4 MB
               6.224s     14.216s       1.183s       39.501s
              
60%  0         102 MB      105 MB      105 MB       102 MB
               4.314s      16.093s     1.377s       41.795s
              
50%  0         105 MB      105 MB      105 MB       105 MB
               3.503s      16.777s     1.470s       40.964s
              
nucleotide    29.2 MB      27.3 MB      

#### Questions

#### Which algorithm achieves the best level of compression on each file type?
- Arithmetic compression shows to have the lest level of compression (the result file is the smallest compare to other methods)

#### Which algorithm is the fastest?
- The pbzips algorithm is the fastest for every file. 

#### What is the difference between bzip2 and pbzip2? Do you expect one to be faster and why?
- bzip2 compresses data in blocks of size between 100 and 900 kB and uses the Burrows–Wheeler transform to convert frequently-recurring character sequences into strings of identical letters.
- bzip2 and pbzips have similar compression method. Except that pbzip2 supports multi-threading, giving almost linear speed improvements on multi-CPU and multi-core computers. Therefore, pbzips shoudl run faster due to its multi-thread compression speed. 

#### How does the level of compression change as the percentage of zeros increases? Why does this happen?
- As the percentage of zeros increase, the file compressed size becomes smaller (no matter any algorithms), and the overall compression time become less (except for 100% of zeros file, the arithmetic compression shows to be a little longer). This can be due to that as there are more zeros contained in the file, this takes up less percentage of the file (since 0 = null data) thus also requires less time to compress the file.

#### What is the minimum number of bits required to store a single DNA base?
- 2 bits.

#### What is the minimum number of bits required to store an amino acid letter?
- A amino acid letter  = 3 codons = 2* 3 = 6 bits.

#### In your tests, how many bits did gzip and bzip2 actually require to store your random DNA and protein sequences?
- for DNA, the gzip compression uses 29.2 MB * 1024 * 1024 * 8 = 244,947,354 bits.    
- for DNA, the bzip compression uses 27.3 MB * 1024 * 1024 * 8 = 229,008,998 bits.     
- for protein, the gzip compression uses 60.6 MB * 1024 * 1024 * 8 = 508,349,645 bits.     
- for protein, the gzip compression uses 55.3 MB * 1024 * 1024 * 8 = 463,890,022 bits.

#### Are gzip and bzip2 performing well on DNA and proteins?
- For DNA, the methods compressed the file from 100 MB to around 20 MB. For protein, it is from 100 MB to around 60 MB. Therefore the compression shown to be better for nucleotide sequence. However, if comparing in terms of compression time (9.5s), for DNA, bzip2 has a less compression time. For protein, it is gzip that has a less compresion time (10.358s).

### Compressing real data
Now that you have a sense of how random data can be compressed, let’s have a look at some
real biological data. Using what you’ve learned about querying biological databases, find the
nucleic acid sequences of gp120 homologs from at least 10 different HIV isolates and
concatenate them together into a single multi-FASTA.

In [71]:
seq = []
from Bio import SeqIO
from Bio import Entrez
Entrez.email = 'joanne91218@berkeley.edu'
handle = Entrez.esearch(db = 'protein',
                       term = 'gp 120',
                       sort = 'relevance',
                       idtype = 'acc')
for i in Entrez.read(handle)['IdList']:
    handle = Entrez.efetch(db = 'protein', id=i, rettype = 'gb', retmode = 'text')
    temp = SeqIO.read(handle,'gb')
    print(temp.seq)
    seq.append(temp.seq)

open('gp120.fa','w').write(''.join(str(seq)))

FAIRKCNDREFNGTGPCRNVSTVQCTHGIRPVVSTQLLLNGSLAENKTMIRSENITNNAKNILVQLNXPVNITCIRPNNNTRKSVRIGPGQSFYATGDIIGDIRQAHCNVTRAKWNGTLQKVADQLRT
YCTPAGFXILKCNDKNFNGSGPCTNVSTVQCTHGIRPVVSTQLLLNGSLAEEDVVIRSENFTNNAKTILVQLNKPINITCVRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAYCTLNRTQWNDTLRKIAIKLREQFGNKTIAL
FAILKCNAKNFNGTGXCKNVSTVQCTHGIRPVVSTQLLLNGSLAEDEVVIRSKNFTDNAQNIIVQLNATVNITCTRPNNNTRRSIHMGPGRSFYTTGDIIGDIRQAHCNISRSAWNDTLEQIAAKLRE
EEEVVIRSDNFSDNAKTIIVQLNVSVQINCTRPNNNTRKSINIGPGRAFYATGDIIGDIRQAHCNLSKTQWNNTLNQIVIKLGEQFKINIVF
KLTPLCVTLNCTDGLRNATNATDSRLTNANNSSWGGEMKNCSFKITTSIRDKVQKEYALFYTLDIVPIDKDNNNSTTYRLINCNTSVITQACPKMS
KLTPLCVTLNCTDATNTTKSNTSSWETMEKGEIQNCSFNVTTSIRDRVQREYALFYKLDVVPIDNEKNTTSYRLISCNTSVITQACPKIS
KLTPLCVTLNCTDLQNVTYANSSSERMMEREEMKNCSFNITTSIRDKVQKEYALFYKLDIVPIDDNNNTNTSYRLISCNTSVITQACPKVS
KLTPLCVILNCTDLGNATNTNSSNTNSSSGEMMMEKGEIKNCSFNISTSIRGKVQKEYAFFYKLDIIPIDNDTTSYTLTSCNTSVITQACPKVS
EVQLLESGGGLVKPGGSLRLSCAASGFTLINYRMNWVRQAPGKGLEWVSSISSSSSYIHYADSVKGRFTISRDNAENSLYLQMNSLRAEDTAVYYCVREGPRATGYSMADVFDIWGQGTMVTVSSASTKG



SILAKEGDIGPNLDGLINTEIDFDPIPNTETIFDESPSFNTSTNEEQHTPPNISLTFSYFPDKNGDTAYSGENENDCDAELRIWSVQEDDLAAGLSWIPFFGPGIEGLYTAGLIKNQNNLVCRLRRLANQTAKSLELLLRVTTEERTFSLINRHAIDFLLTRWGGTCKVLGPDCCIGIEDLSKNISEQIDKIRKDEQKEETGDDDDKAGWSHPQFEKGGGSGGGSGGGSWSHPQFEK
MTVRVAINGFGRIGRLVLRAIVESKRSDIEVVGINDLGSAEANAHLLKYDSVHGVLANQVSATADSITIDGKAIRVTAERDPAKLPWKELNIDVALECTGLFTKRDKAAAHLEAGAKRVLVSAPADGADLTVVYGVNHDKLTRDHLVVSNGSCTTNCLAPVAQILNELGGIKHGFMTTVHAFTGDQRTVDTLHKDLRRARAASLSIIPTSTGAAKAIGLVLPELAGKLDGTSVRVPTPNVSMIDLVVVTERPVTAEQINQAMADAANGRLKGVLGAESAELVSVDFNHNPNSSTFDLTQTKVIDGTFVRVLSWYDNEWGFSNRMADTAVAIAKLI
MTDASKSDVQIILSIKEGRGFEQVLQPTILIGTLDGHSLQTDRIESCPTPQYAIDLVWETNKMMLRRMRSGHASLKLECFAFKENEARERIGYVLLNIRSAQIISKYQDLSPRTSWHKLLGLRSDLKIQKPELLIGLRVEDRKDINLNPTTERCPEIESHRNPGGDFGKSNFRNIPSEVKCMTEYKNINKQSVDCQCIKHQTDKITYSRSCDRMFCNHINKNYIKDVGSYHCYCLHISLVAITLISETLSGREMEFRFHHPKTEIMSTAYATMPALLNEKIKLQNIVCQFHFISAPDEIRQLLQSFPPKISMYDANKGDVPLSLLMLDIKLLFHQEKSECQYKLSLFDTDRNKIADMDIVLKLEDRGPHCILKKETFDKNLGPPILDDSLAYKIVDELETWKERQKEMFKVELKKKEERHLNMLSE

1700

In [48]:
a='FAIRKCNDREFNGTGPCRNVSTVQCTHGIRPVVSTQLLLNGSLAENKTMIRSENITNNAKNILVQLNXPVNITCIRPNNNTRKSVRIGPGQSFYATGDIIGDIRQAHCNVTRAKWNGTLQKVADQLRTYCTPAGFXILKCNDKNFNGSGPCTNVSTVQCTHGIRPVVSTQLLLNGSLAEEDVVIRSENFTNNAKTILVQLNKPINITCVRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAYCTLNRTQWNDTLRKIAIKLREQFGNKTIALFAILKCNAKNFNGTGXCKNVSTVQCTHGIRPVVSTQLLLNGSLAEDEVVIRSKNFTDNAQNIIVQLNATVNITCTRPNNNTRRSIHMGPGRSFYTTGDIIGDIRQAHCNISRSAWNDTLEQIAAKLREEEEVVIRSDNFSDNAKTIIVQLNVSVQINCTRPNNNTRKSINIGPGRAFYATGDIIGDIRQAHCNLSKTQWNNTLNQIVIKLGEQFKINIVFKLTPLCVTLNCTDGLRNATNATDSRLTNANNSSWGGEMKNCSFKITTSIRDKVQKEYALFYTLDIVPIDKDNNNSTTYRLINCNTSVITQACPKMSKLTPLCVTLNCTDATNTTKSNTSSWETMEKGEIQNCSFNVTTSIRDRVQREYALFYKLDVVPIDNEKNTTSYRLISCNTSVITQACPKISKLTPLCVTLNCTDLQNVTYANSSSERMMEREEMKNCSFNITTSIRDKVQKEYALFYKLDIVPIDDNNNTNTSYRLISCNTSVITQACPKVSKLTPLCVILNCTDLGNATNTNSSNTNSSSGEMMMEKGEIKNCSFNISTSIRGKVQKEYAFFYKLDIIPIDNDTTSYTLTSCNTSVITQACPKVSEVQLLESGGGLVKPGGSLRLSCAASGFTLINYRMNWVRQAPGKGLEWVSSISSSSSYIHYADSVKGRFTISRDNAENSLYLQMNSLRAEDTAVYYCVREGPRATGYSMADVFDIWGQGTMVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSILAKEGDIGPNLDGLINTEIDFDPIPNTETIFDESPSFNTSTNEEQHTPPNISLTFSYFPDKNGDTAYSGENENDCDAELRIWSVQEDDLAAGLSWIPFFGPGIEGLYTAGLIKNQNNLVCRLRRLANQTAKSLELLLRVTTEERTFSLINRHAIDFLLTRWGGTCKVLGPDCCIGIEDLSKNISEQIDKIRKDEQKEETGDDDDKAGWSHPQFEKGGGSGGGSGGGSWSHPQFEKMTVRVAINGFGRIGRLVLRAIVESKRSDIEVVGINDLGSAEANAHLLKYDSVHGVLANQVSATADSITIDGKAIRVTAERDPAKLPWKELNIDVALECTGLFTKRDKAAAHLEAGAKRVLVSAPADGADLTVVYGVNHDKLTRDHLVVSNGSCTTNCLAPVAQILNELGGIKHGFMTTVHAFTGDQRTVDTLHKDLRRARAASLSIIPTSTGAAKAIGLVLPELAGKLDGTSVRVPTPNVSMIDLVVVTERPVTAEQINQAMADAANGRLKGVLGAESAELVSVDFNHNPNSSTFDLTQTKVIDGTFVRVLSWYDNEWGFSNRMADTAVAIAKLIMTDASKSDVQIILSIKEGRGFEQVLQPTILIGTLDGHSLQTDRIESCPTPQYAIDLVWETNKMMLRRMRSGHASLKLECFAFKENEARERIGYVLLNIRSAQIISKYQDLSPRTSWHKLLGLRSDLKIQKPELLIGLRVEDRKDINLNPTTERCPEIESHRNPGGDFGKSNFRNIPSEVKCMTEYKNINKQSVDCQCIKHQTDKITYSRSCDRMFCNHINKNYIKDVGSYHCYCLHISLVAITLISETLSGREMEFRFHHPKTEIMSTAYATMPALLNEKIKLQNIVCQFHFISAPDEIRQLLQSFPPKISMYDANKGDVPLSLLMLDIKLLFHQEKSECQYKLSLFDTDRNKIADMDIVLKLEDRGPHCILKKETFDKNLGPPILDDSLAYKIVDELETWKERQKEMFKVELKKKEERHLNMLSEEWRMQKENLESKLAYSVEQCKTLANSLNNATEDLRKRRLKSLENEARLIKANEDLQWRYDTKLQELKDRLHATQSDLTSKVAKLEENKIALEAKVEILSYENESMKSSINKQMNELQMYQKGSLTQDQTASLLQEVKILEEKLDNAQKGKEFFREQWSKAVRELHRMKVDYQEAMQVQIKNSREELMSTDLAEILSADRKALNNDQILLNELQKEIDVIKPKQSFAPTETCTEIFTTTDDICYNFPYNGAERGTMEQIICGRFGHGILLVAKTSRHPITIVSTRKLMSGKFSRHCRPLLASPQLRHFANPHRGYAMVITRILRGALKIRYLLLGGAVGGGVTLQKKYEQWKEGLPDLEWLENLMPSEKQWQDFRESLMSIKNVADKIEIDPRIKEFGVTKYREYRNWFDQRLDDAIKAAESQNSYQENSSIQSAFDTISMVAKQTYSESSQQRMNAMQDELMQMQLKYQREIERLERENKELRKQILLRGNQKFNNKKIKKSLIDMYSDVLDELNDYDSAYSTADHLPRVVVVGDQSSGKTSVLEMIAQARIFPRGGGEMMTRAPVKVTLSEGPYHIAQFKDSSREFDLTKESELAELRREVELRMKNSVKNGKTVSPDVISMTVKGPGLQRMVLVDLPGIISTVTVDMAEDTREAIKQMSQQYMSNPNAIILCIQDGSVDAERSNVTDLVAQMDPSGKRTIFVLTKVDMAEENLTNPERLRKILSGKLFPMKALGYFAVVTGRGKQEDSIQTIKDYEEKFFRNSKLFKDGLAMSGQVTTKNLSLAVAECFWKMVRETVEQQADAFKATRFNLETEWKNNFPRLRELDRDELFERARGEILDEIVNLSQVSPRHWEEVLMTRNWEKVSMHVFENIYLPAAQSGNPNTFNTTVDIKLRHWAEQQLPARSVESGWECLQQEFQNFMSQARLSPDHDDIFDNLKNAVVNEAMRRHSWEEKASEMLRVIQLNILEDRSVNDKREWDQAVRFLETSVKEKLQSTEQILRDMLGPDRKERWMYWQSQTEEQQKRTSVKNELDKILYSDKKHAPTLTHDELTTVRKNLQRNGLEVDNEFIRETWHPVYRRFFLQQSLARAYDCRKGYYLYHTGHENEMECNDVVLFWRIQQMLKVTANALRQQIMNREARRLDKEIKEVLEDYSQDNEIKQKLLTGRRVTLAEELKRVRQIQEKLEEFIQALNKEKMEQIICGRFGHGILLVAKTSRHPITIVSTRKLMSGKFSRHCRPLLASPQLRHFANPHRGYAMVITRILRGALKIRYLLLGGAVGGGVTLQKKYEQWKEGLPDLEWLENLMPSEKQWQDFRESLMSIKNVADKIEIDPRIKEFGVTKYREYRNWFDQRLDDAIKAAESQNSYQENSSIQTETTEQLSKNVVAFARPLASDNNTDEERKKAQSSQQRMNAMQDELMQMQLKYQREIERLERENKELRKQILLRGNQKFNNKKIKKSLIDMYSDVLDELNDYDSAYSTADHLPRVVVVGDQSSGKTSVLEMIAQARIFPRGGGEMMTRAPVKVTLSEGPYHIAQFKDSSREFDLTKESELAELRREVELRMKNSVKNGKTVSPDVISMTVKGPGLQRMVLVDLPGIISTVTVDMAEDTREAIKQMSQQYMSNPNAIILCIQDGSVDAERSNVTDLVAQMDPSGKRTIFVLTKVDMAEENLTNPERLRKILSGKLFPMKALGYFAVVTGRGKQEDSIQTIKDYEEKFFRNSKLFKDGLAMSGQVTTKNLSLAVAECFWKMVRETVEQQADAFKATRFNLETEWKNNFPRLRELDRDELFERARGEILDEIVNLSQVSPRHWEEVLMTRNWEKVSMHVFENIYLPAAQSGNPNTFNTTVDIKLRHWAEQQLPARSVESGWECLQQEFQNFMSQARLSPDHDDIFDNLKNAVVNEAMRRHSWEEKASEMLRVIQLNILEDRSVNDKREWDQAVRFLETSVKEKLQSTEQILRDMLGPDRKERWMYWQSQTEEQQKRTSVKNELDKILYSDKKHAPTLTHDELTTVRKNLQRNGLEVDNEFIRETWHPVYRRFFLQQSLARAYDCRKGYYLYHTGHENEMECNDVVLFWRIQQMLKVTANALRQQIMNREARRLDKEIKEVLEDYSQDNEIKQKLLTGRRVTLAEELKRVRQIQEKLEEFIQALNKEKMTDASKSDVQIILSIKEGRGFEQVLQPTILIGTLDGHSLQTDRIESCPTPQYAIDLVWETNKMMLRRMRSGHASLKLECFAFKENEARERIGYVLLNIRSAQIISKYQDLSPRTSWHKLLGLRSDLKIQKPELLIGLRVEDRKDINLNPTTERCPEIESHRNPGGDFGKSNFRNIPSEVKCMTEYKNINKQSVDCQCIKHQTDKITYSRSCDRMFCNHINKNYIKDVGSYHCYCLHISLVAITLISETLSGREMEFRFHHPKTEIMSTAYATMPALLNEKIKLQNIVCQFHFISAPDEIRQLLQSFPPKISMYDANKGDVPLSLLMLDIKLLFHQEKSECQYKLSLFDTDRNKIADMDIVLKLEDRGPHCILKKETFDKNLGPPILDDSLAYKIVDELETWKERQKEMFKVELKKKEERHLNMLSEEWRMQKENLESKLAYSVEQCKTLANSLNNATEDLRKRRLKSLENEARLIKANEDLQWRYDTKLQELKDRLHATQSDLTSKVAKLEENKIALEAKVEILSYENESMKSSINKQMNELQMYQKGSLTQDQTASLLQEVKILEEKLDNAQKGKEFFREQWSKAVRELHRMKVDYQEAMQVQIKNSREELMSTDLAEILSADRKALNNDQILLNELQKEIDVIKPKQSFAPTETCTEIFTTTDDICYNFPYNGNKDVYDKSEKYNERLQALREERESLLRTGNYTADDIVIKKLNAEICSLLVSRMEQIICGRFGHGILLVAKTSRHPITIVSTRKLMSGKFSRHCRPLLASPQLRHFANPHRGYAMVITRILRGALKIRYLLLGGAVGGGVTLQKKYEQWKEGLPDLEWLENLMPSEKQWQDFRESLMSIKNVADKIEIDPRIKEFGVTKYREYRNWFDQRLDDAIKAAESQNSYQENSSIQKSSQQRMNAMQDELMQMQLKYQREIERLERENKELRKQILLRGNQKFNNKKIKKSLIDMYSDVLDELNDYDSAYSTADHLPRVVVVGDQSSGKTSVLEMIAQARIFPRGGGEMMTRAPVKVTLSEGPYHIAQFKDSSREFDLTKESELAELRREVELRMKNSVKNGKTVSPDVISMTVKGPGLQRMVLVDLPGIISTVTVDMAEDTREAIKQMSQQYMSNPNAIILCIQDGSVDAERSNVTDLVAQMDPSGKRTIFVLTKVDMAEENLTNPERLRKILSGKLFPMKALGYFAVVTGRGKQEDSIQTIKDYEEKFFRNSKLFKDGLAMSGQVTTKNLSLAVAECFWKMVRETVEQQADAFKATRFNLETEWKNNFPRLRELDRDELFERARGEILDEIVNLSQVSPRHWEEVLMTRNWEKVSMHVFENIYLPAAQSGNPNTFNTTVDIKLRHWAEQQLPARSVESGWECLQQEFQNFMSQARLSPDHDDIFDNLKNAVVNEAMRRHSWEEKASEMLRVIQLNILEDRSVNDKREWDQAVRFLETSVKEKLQSTEQILRDMLGPDRKERWMYWQSQTEEQQKRTSVKNELDKILYSDKKHAPTLTHDELTTVRKNLQRNGLEVDNEFIRETWHPVYRRFFLQQSLARAYDCRKGYYLYHTGHENEMECNDVVLFWRIQQMLKVTANALRQQIMNREARRLDKEIKEVLEDYSQDNEIKQKLLTGRRVTLAEELKRVRQIQEKLEEFIQALNKEKMEQIICGRFGHGILLVAKTSRHPITIVSTRKLMSGKFSRHCRPLLASPQLRHFANPHRGYAMVITRILRGALKIRYLLLGGAVGGGVTLQKKYEQWKEGLPDLEWLENLMPSEKQWQDFRESLMSIKNVADKIEIDPRIKEFGVTKYREYRNWFDQRLDDAIKAAESQNSYQENSSIQKTTEQLSKNVVAFARPLASDNNTDEERKKAQSSQQRMNAMQDELMQMQLKYQREIERLERENKELRKQILLRGNQKFNNKKIKKSLIDMYSDVLDELNDYDSAYSTADHLPRVVVVGDQSSGKTSVLEMIAQARIFPRGGGEMMTRAPVKVTLSEGPYHIAQFKDSSREFDLTKESELAELRREVELRMKNSVKNGKTVSPDVISMTVKGPGLQRMVLVDLPGIISTVTVDMAEDTREAIKQMSQQYMSNPNAIILCIQDGSVDAERSNVTDLVAQMDPSGKRTIFVLTKVDMAEENLTNPERLRKILSGKLFPMKALGYFAVVTGRGKQEDSIQTIKDYEEKFFRNSKLFKDGLAMSGQVTTKNLSLAVAECFWKMVRETVEQQADAFKATRFNLETEWKNNFPRLRELDRDELFERARGEILDEIVNLSQVSPRHWEEVLMTRNWEKVSMHVFENIYLPAAQSGNPNTFNTTVDIKLRHWAEQQLPARSVESGWECLQQEFQNFMSQARLSPDHDDIFDNLKNAVVNEAMRRHSWEEKASEMLRVIQLNILEDRSVNDKREWDQAVRFLETSVKEKLQSTEQILRDMLGPDRKERWMYWQSQTEEQQKRTSVKNELDKILYSDKKHAPTLTHDELTTVRKNLQRNGLEVDNEFIRETWHPVYRRFFLQQSLARAYDCRKGYYLYHTGHENEMECNDVVLFWRIQQMLKVTANALRQQIMNREARRLDKEIKEVLEDYSQDNEIKQKLLTGRRVTLAEELKRVRQIQEKLEEFIQALNKEKMEQIICGRFGHGILLVAKTSRHPITIVSTRKLMSGKFSRHCRPLLASPQLRHFANPHRGYAMVITRILRGALKIRYLLLGGAVGGGVTLQKKYEQWKEGLPDLEWLENLMPSEKQWQDFRESLMSIKNVADKIEIDPRIKEFGVTKYREYRNWFDQRLDDAIKAAESQNSYQENSSIQSAFDTISMVAKQTYSETTEQLSKNVVAFARPLASDNNTDEERKKAQSSQQRMNAMQDELMQMQLKYQREIERLERENKELRKQILLRGNQKFNNKKIKKSLIDMYSDVLDELNDYDSAYSTADHLPRVVVVGDQSSGKTSVLEMIAQARIFPRGGGEMMTRAPVKVTLSEGPYHIAQFKDSSREFDLTKESELAELRREVELRMKNSVKNGKTVSPDVISMTVKGPGLQRMVLVDLPGIISTVTVDMAEDTREAIKQMSQQYMSNPNAIILCIQDGSVDAERSNVTDLVAQMDPSGKRTIFVLTKVDMAEENLTNPERLRKILSGKLFPMKALGYFAVVTGRGKQEDSIQTIKDYEEKFFRNSKLFKDGLAMSGQVTTKNLSLAVAECFWKMVRETVEQQADAFKATRFNLETEWKNNFPRLRELDRDELFERARGEILDEIVNLSQVSPRHWEEVLMTRNWEKVSMHVFENIYLPAAQSGNPNTFNTTVDIKLRHWAEQQLPARSVESGWECLQQEFQNFMSQARLSPDHDDIFDNLKNAVVNEAMRRHSWEEKASEMLRVIQLNILEDRSVNDKREWDQAVRFLETSVKEKLQSTEQILRDMLGPDRKERWMYWQSQTEEQQKRTSVKNELDKILYSDKKHAPTLTHDELTTVRKNLQRNGLEVDNEFIRETWHPVYRRFFLQQSLARAYDCRKGYYLYHTGHENEMECNDVVLFWRIQQMLKVTANALRQQIMNREARRLDKEIKEVLEDYSQDNEIKQKLLTGRRVTLAEELKRVRQIQEKLEEFIQALNKEKMEQIICGRFGHGILLVAKTSRHPITIVSTRKLMSGKFSRHCRPLLASPQLRHFANPHRGYAMVITRILRGALKIRYLLLGGAVGGGVTLQKKYEQWKEGLPDLEWLENLMPSEKQWQDFRESLMSIKNVADKIEIDPRIKEFGVTKYREYRNWFDQRLDDAIKAAESQNSYQENSSIQSAFDTISMVAKQTYSAETTEQLSKNVVAFARPLASDNNTDEERKKAQSSQQRMNAMQDELMQMQLKYQREIERLERENKELRKQILLRGNQKFNNKKIKKSLIDMYSDVLDELNDYDSAYSTADHLPRVVVVGDQSSGKTSVLEMIAQARIFPRGGGEMMTRAPVKVTLSEGPYHIAQFKDSSREFDLTKESELAELRREVELRMKNSVKNGKTVSPDVISMTVKGPGLQRMVLVDLPGIISTVTVDMAEDTREAIKQMSQQYMSNPNAIILCIQDGSVDAERSNVTDLVAQMDPSGKRTIFVLTKVDMAEENLTNPERLRKILSGKLFPMKALGYFAVVTGRGKQEDSIQTIKDYEEKFFRNSKLFKDGLAMSGQVTTKNLSLAVAECFWKMVRETVEQQADAFKATRFNLETEWKNNFPRLRELDRDELFERARGEILDEIVNLSQVSPRHWEEVLMTRNWEKVSMHVFENIYLPAAQSGNPNTFNTTVDIKLRHWAEQQLPARSVESGWECLQQEFQNFMSQARLSPDHDDIFDNLKNAVVNEAMRRHSWEEKASEMLRVIQLNILEDRSVNDKREWDQAVRFLETSVKEKLQSTEQILRDMLGPDRKERWMYWQSQTEEQQKRTSVKNELDKILYSDKKHAPTLTHDELTTVRKNLQRNGLEVDNEFIRETWHPVYRRFFLQQSLARAYDCRKGYYLYHTGHENEMECNDVVLFWRIQQMLKVTANALRQQIMNREARRLDKEIKEVLEDYSQDNEIKQKLLTGRRVTLAEELKRVRQIQEKLEEFIQALNKEKMDFDAETCLKDWNDLAQDYKELEALNREYVMKLEEISELQAKCIKGISHQKYRMGIIRKSLKQPSARETREELKKSIIRREQQLQEIEQTLPKSNGTYLQIILGSVNVSILNKNDKFKYKDEYEKFKLVLSVIGFVLSVLNLFTNVRTLELSFMFLLVWYYCTLTIRESILKVNGSRIKGWWRFHHFLSTVVAAVLLVWPNTGPWYQFRTQFMWFNVYISVVQYLQFRYQRGVLYRLKALGERDNMDITIEGFHSWMWRGLSFLLPFLFAGYLFQLYNAYTLYELAYHPDATWHVSVLSAMFLVLFLGNTTTTIMVIPQKLVKERIKDHLFASKPDRRKEGRMRNNPDNKDEKSE'
open('gp1201.fa','w').write(''.join(a))

9169

### Estimating compression of 1000 terabytes
Let’s make some assumptions about the contents of the data at your biotech company. Most of
the data, say 80%, is re-sequencing of genomes and plasmids that are very similar to each
other. Another 10% might be protein sequences, and the last 10% are binary microscope
images which we’ll assume follow the worst-case scenario of being completely random.

Given the benchmarking data you obtained in this lab, which algorithm do you propose to use
for each type of data? Provide an estimate for the fraction of space you can save using your
compression scheme. How much of a bonus do you anticipate receiving this year?

#### re-sequencing of genomes and plasmids

#### protein sequences

#### binary microscope images