### 1. Simulating the data
Fortunately, you know that most of the data you’ll be compressing will be nucleic acid
sequences. Some of it, however, will be binary files, and some will be protein sequences. Start
by writing some code to simulate files containing random DNA, protein, and binary data.

1. Using np.random.choice, generate 100 megabytes (8 bits/byte * 1024 bytes/kilobyte * 1024
kilobytes/megabyte * 100) of random data containing 100%, 90%, 80%, 70%, 60%, and 50%
zeros.    
The percentage of zeros is set by the `p` argument, which gives the probabilities of drawing 0 and 1.    
`np.packbits` then packs each group of eight generated bits into one byte.

In [4]:
# Doing 100% zeros
import numpy as np
myvar1 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[1, 0])
myvar1 = np.packbits(myvar1)

# Doing 90% zeros
myvar2 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.9, 0.1])
myvar2 = np.packbits(myvar2)

# Doing 80% zeros
myvar3 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.8, 0.2])
myvar3 = np.packbits(myvar3)

# Doing 70% zeros
myvar4 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.7, 0.3])
myvar4 = np.packbits(myvar4)

# Doing 60% zeros
myvar5 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.6, 0.4])
myvar5 = np.packbits(myvar5)

# Doing 50% zeros
myvar6 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.5, 0.5])
myvar6 = np.packbits(myvar6)
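
The six cells above repeat one pattern with a different `p`. As a minimal sketch, the pattern can be wrapped in a helper. The name `random_packed_bits` and its `seed` argument are hypothetical and not part of the lab, and it uses `np.random.default_rng` rather than the `np.random.choice` call above. The demonstration uses 1 KiB so it runs quickly; the lab itself uses 100 * 1024 * 1024 bytes.

```python
import numpy as np

def random_packed_bits(n_bytes, p_zero, seed=None):
    """Return n_bytes of random data in which each bit is 0 with probability p_zero."""
    rng = np.random.default_rng(seed)  # hypothetical choice; the lab calls np.random.choice directly
    bits = rng.choice([0, 1], size=8 * n_bytes, p=[p_zero, 1 - p_zero]).astype(np.uint8)
    return np.packbits(bits)  # eight bits become one byte

# Small demonstration: 1 KiB of data at 90% zeros.
sample = random_packed_bits(1024, 0.9, seed=0)
zero_fraction = 1 - np.unpackbits(sample).mean()
print(len(sample), round(zero_fraction, 2))
```

Unpacking the bytes and checking the fraction of zeros is a quick way to confirm that `p` was applied as intended before writing a 100 MB file.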

2. Then write this data to a file in your home directory
This produces six files of randomly generated zeros and ones, each 104,857,600 bytes (about 105 MB).

In [27]:
open('zeros_100p','wb').write(myvar1)
open('zeros_90p','wb').write(myvar2)
open('zeros_80p','wb').write(myvar3)
open('zeros_70p','wb').write(myvar4)
open('zeros_60p','wb').write(myvar5)
open('zeros_50p','wb').write(myvar6)

104857600

3. Next, generate DNA and protein sequences 100 million letters long and write those to your
home directory.    
This works the same way as above, but draws from the four nucleotides A, T, C, and G instead of 0 and 1 for the nucleotide sequence, and from the 20 one-letter amino acid codes for the protein sequence.    
The result is two files of about 100 MB each, containing randomly generated nucleotide and protein sequences 100 million letters long.

In [21]:
import numpy as np
my_nt_seq= np.random.choice(['A','T','C','G'], size=100000000, replace=True, p=[0.25, 0.25,0.25,0.25])
my_pro_seq= np.random.choice(['A','R','N','D','C','E','Q','G','H','I','L','K','M','F','P','S','T','W','Y','V'], size=100000000, replace=True)

In [23]:
open('nt_seq.fa','w').write(''.join(my_nt_seq))
open('pro_seq.fa','w').write(''.join(my_pro_seq))

100000000

### 2. Compressing the data
You’ll have to do this from the terminal on bioe131.com via SSH (PuTTY on Windows) or via
iPython in your web browser. On each of the files you generated above, run gzip, bzip2, pbzip2
and ArithmeticCompress as follows:
```
time gzip -k zeros_100p
time bzip2 -k zeros_100p
time pbzip2 -k zeros_100p
time ArithmeticCompress zeros_100p zeros_100p.art
```
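
For a quick check from iPython, the same kind of comparison can be sketched with Python's standard-library `gzip` and `bz2` modules. This is only an approximation of the command-line runs above: the function name `benchmark` is hypothetical, the modules' default settings are assumed, and the timings will not match those of the command-line tools. pbzip2 and ArithmeticCompress have no standard-library equivalent and are left out.

```python
import bz2
import gzip
import time

def benchmark(data):
    """Compress `data` with gzip and bz2; return {name: (compressed_size, seconds)}."""
    results = {}
    for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        start = time.perf_counter()
        compressed = compress(data)
        results[name] = (len(compressed), time.perf_counter() - start)
    return results

demo = bytes(1_000_000)  # one million zero bytes, a small stand-in for zeros_100p
for name, (size, seconds) in benchmark(demo).items():
    print(f"{name}: {size} bytes in {seconds:.3f} s")
```

To run it on one of the generated files, read the file in binary mode, for example `benchmark(open('zeros_90p', 'rb').read())`.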

#### The 0/1 input files are 105 MB each, and the nucleotide and protein files are 100 MB each

### Table of results

Compression | gzip | bzip2 | pbzip2 | Arithmetic
--- | --- | --- | --- | ---
100% zero output size | 102 KB | 113 B | 5.62 KB | 1.03 KB
100% zero time | 0.702s | 1.051s | 0.099s | 14.932s
90% zero output size| 58.7 MB | 61.2 MB | 61.2 MB | 1.02 KB
90% zero time | 19.051s | 11.303s | 0.759s | 0.002s
80% zero output size | 81.2 MB | 86.6 MB | 86.7 MB | 75.7 MB
80% zero time | 13.569s | 12.141s | 0.933s | 35.749s
70% zero output size | 93.6 MB | 99.8 MB | 99.8 MB | 92.4 MB
70% zero time | 6.224s | 14.216s | 1.183s | 39.501s
60% zero output size | 102 MB | 105 MB | 105 MB | 102 MB
60% zero time | 4.314s | 16.093s | 1.377s | 41.795s
50% zero output size | 105 MB | 105 MB | 105 MB | 105 MB
50% zero time | 3.503s | 16.777s | 1.470s | 40.964s
nucleotide output size | 29.2 MB | 27.3 MB | 27.3 MB | 25 MB
nucleotide time | 12.153s | 9.500s | 0.662s | 21.512s
protein output size | 60.6 MB | 55.3 MB | 55.3 MB | 54 MB
protein time | 5.358s | 10.152s | 0.783s | 28.622s

#### Questions

#### Which algorithm achieves the best level of compression on each file type?
- Arithmetic compression gives the smallest output (or ties for the smallest) on every file except the 100% zeros file, where bzip2 produces the smallest output (113 B).

#### Which algorithm is the fastest?
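- pbzip2 is the fastest on nearly every file. The only lower time in the table is the 0.002 s arithmetic entry for the 90% zeros file, which is out of line with the other arithmetic runs (about 15 to 42 s) and should be re-measured.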

#### What is the difference between bzip2 and pbzip2? Do you expect one to be faster and why?
- bzip2 compresses data in blocks of between 100 and 900 kB and uses the Burrows–Wheeler transform to turn frequently recurring character sequences into runs of identical letters.
- bzip2 and pbzip2 use the same compression method, but pbzip2 supports multi-threading, which gives almost linear speed improvements on multi-CPU and multi-core computers. pbzip2 should therefore run faster, and in the table above it is roughly 10 to 15 times faster than bzip2 on every file.
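
The block-parallel idea can be illustrated with a minimal sketch. It is not pbzip2's actual implementation: the function name `parallel_bz2`, the block size, and the worker count are assumptions chosen for illustration. Each block is compressed independently with the standard-library `bz2` module and the resulting streams are concatenated. How much speed-up threads give here depends on the interpreter and machine.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def parallel_bz2(data, block_size=900_000, workers=4):
    """Compress fixed-size blocks independently and concatenate the bz2 streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, blocks))  # map preserves block order
    return b"".join(streams)

payload = b"ACGT" * 500_000
packed = parallel_bz2(payload)
# bz2.decompress accepts concatenated streams and returns the joined data.
assert bz2.decompress(packed) == payload
```

Because every block is self-contained, the blocks can be handed to different cores with no coordination beyond keeping them in order, which is why near-linear scaling is plausible.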

#### How does the level of compression change as the percentage of zeros increases? Why does this happen?
- As the percentage of zeros increases, the compressed file becomes smaller for every algorithm. A file that is mostly zeros is more predictable: each bit carries less information (lower entropy), so the compressor can describe it with fewer bits. At 50% zeros every bit is equally likely to be 0 or 1, there is no redundancy to remove, and the output stays at 105 MB. Compression time does not follow a single trend: bzip2 and pbzip2 get faster as the percentage of zeros increases, while gzip gets slower from 50% to 90% zeros and is only fast on the 100% zeros file.
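
The trend can be made quantitative with the Shannon entropy of a biased bit, H(p) = -p log2(p) - (1 - p) log2(1 - p), which bounds how small any lossless compressor can make independent random bits on average. The sketch below is an illustration rather than part of the lab. It assumes the bits are independent, as they are for `np.random.choice`, and uses the 105 MB input size reported above.

```python
import math

def binary_entropy(p_zero):
    """Shannon entropy, in bits per bit, of a source that emits 0 with probability p_zero."""
    if p_zero in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    p_one = 1.0 - p_zero
    return -(p_zero * math.log2(p_zero) + p_one * math.log2(p_one))

INPUT_MB = 105  # input size as reported in the results table
for p in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    h = binary_entropy(p)
    print(f"{p:.0%} zeros: {h:.3f} bits per bit, about {h * INPUT_MB:.1f} MB at best")
```

At 80% zeros the entropy is about 0.72 bits per bit and at 70% about 0.88, which is close to the arithmetic coder's measured outputs of 75.7 MB and 92.4 MB.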

#### What is the minimum number of bits required to store a single DNA base?
- 2 bits.

#### What is the minimum number of bits required to store an amino acid letter?
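- With 20 possible amino acids, the minimum is log2(20) ≈ 4.32 bits per letter on average, or 5 bits with a fixed-length code. Stored as plain one-letter text, each amino acid takes 8 bits.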
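
For an alphabet of k equally likely symbols, the information-theoretic lower bound is log2(k) bits per symbol, and a fixed-length code needs that value rounded up. A short sketch of the arithmetic follows; the helper name `bits_per_symbol` is hypothetical, and uniform letter frequencies are assumed, as in the simulated files.

```python
import math

def bits_per_symbol(alphabet_size):
    """Return (entropy bound, fixed-length code width) in bits for a uniform alphabet."""
    bound = math.log2(alphabet_size)
    return bound, math.ceil(bound)

for name, size in (("DNA base", 4), ("amino acid", 20)):
    bound, fixed = bits_per_symbol(size)
    print(f"{name}: {bound:.2f} bits minimum, {fixed} bits with a fixed-length code")
```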

#### In your tests, how many bits did gzip and bzip2 actually require to store your random DNA and protein sequences?
- For DNA, the gzip compression uses 29.2 MB * 1024 * 1024 * 8 = 244,947,354 bits, or about 2.45 bits per base.    
- For DNA, the bzip2 compression uses 27.3 MB * 1024 * 1024 * 8 = 229,008,998 bits, or about 2.29 bits per base.     
- For protein, the gzip compression uses 60.6 MB * 1024 * 1024 * 8 = 508,349,645 bits, or about 5.08 bits per amino acid.     
- For protein, the bzip2 compression uses 55.3 MB * 1024 * 1024 * 8 = 463,890,022 bits, or about 4.64 bits per amino acid.
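
Dividing each total by the 100 million letters in the file gives a per-letter cost that is easier to compare with the theoretical minimum. The sketch below repeats the arithmetic above with the sizes copied from the results table and the same 1024 * 1024 bytes-per-MB convention. The variable names are illustrative only.

```python
MB_BITS = 1024 * 1024 * 8   # bits per MB, same convention as the answers above
N_LETTERS = 100_000_000     # letters in each simulated sequence file

# Compressed sizes in MB, copied from the results table.
sizes_mb = {
    ("DNA", "gzip"): 29.2,
    ("DNA", "bzip2"): 27.3,
    ("protein", "gzip"): 60.6,
    ("protein", "bzip2"): 55.3,
}

bits_per_letter = {key: mb * MB_BITS / N_LETTERS for key, mb in sizes_mb.items()}
for (kind, tool), value in bits_per_letter.items():
    print(f"{kind} with {tool}: {value:.2f} bits per letter")
```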

#### Are gzip and bzip2 performing well on DNA and proteins?
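- For DNA, both tools compress the file from 100 MB to under 30 MB (29.2 MB for gzip, 27.3 MB for bzip2). For protein, they compress it from 100 MB to between 55 and 61 MB. The compression ratio is therefore better for the nucleotide sequence, as expected for a smaller alphabet. Both DNA results are still above the theoretical minimum of 2 bits per base (25 MB for 100 million bases), so neither tool is optimal on this data. In terms of compression time, bzip2 is faster for DNA (9.500 s versus 12.153 s), while gzip is faster for protein (5.358 s versus 10.152 s).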

### 3. Compressing real data
Now that you have a sense of how random data can be compressed, let’s have a look at some
real biological data. Using what you’ve learned about querying biological databases, find the
nucleic acid sequences of gp120 homologs from at least 10 different HIV isolates and
concatenate them together into a single multi-FASTA.

This is done using Entrez to query the nucleotide database for gp120. The description and sequence of each record found are appended to a list called seq, which is printed and then written to a new FASTA file called "gp120.fa" (the file can be found in the same folder, lab7_Joanne).

In [94]:
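from Bio import Entrez
from Bio import SeqIO

# NCBI asks for a contact address with every Entrez request
Entrez.email = 'joanne91218@berkeley.edu'

seq = []
# Search the nucleotide database for records matching gp120
handle = Entrez.esearch(db = 'nucleotide',
                       term = 'gp120',
                       sort = 'relevance',
                       idtype = 'acc')
id_list = Entrez.read(handle)['IdList']
handle.close()
for i in id_list:
    # Fetch each record as FASTA and store it as a header line plus a sequence line
    handle = Entrez.efetch(db = 'nucleotide', id=i, rettype = 'fasta', retmode = 'text')
    temp = SeqIO.read(handle,'fasta')
    handle.close()
    seq.append(">" + temp.description + '\n' + str(temp.seq) + '\n')
    print(seq)  # prints the whole accumulated list after every record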

open('gp120.fa','w').write(''.join(seq))

['>AF236860.1 HIV-1 strain TH90129_1 from Thailand gp120 (gp120) gene, partial cds\nGTTCCTGTGTGGAAAGATGCAGAGACCACCTTATTTTGTGCATCAGATGCCAAAGCACATGAGACAGAAGTGCACAATGTCTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAATACAACTGAAAAATGTAACAGAGAATTTTAACATGTGGAAAAATAACATGGTAGAGCAGATGCAGGAGGATGTAATCAGTTTATGGGATCAAAGTCTAAAGCCATGTGTAAAGTTAACTCCTCTCTGCGTTACTTTAAATTGTACCGATGCTACTTTGACCAATAGCACTTACATAACCAATGTCTCTAAGATAATAGGAGATATAACAGAGGAAGTAAGAAACTGTTCTTTTAATATGACCACAGAACTAAGAGATAAGAAGCAGAAGGTCCATGCACTTTTTTTATAAGCTTGATATAGTAGAAATTGAAAAGAATAGGAATGAGTATAGGTTAATAAATTGTAATACTTCGGTCATTAAGCAGGCTTGTCCAAAGATATCCTTTGATCCAATTCCTATACATTATTGTACTCCAGCTGGTTATGCGATTTTAAAGTGTAATGATAAGAATTTCAATGGGACAGGGCCATGTAAAAATGTCAGCTCAGTACAATGCACACATGGAATTAAGCCAGTGGTATCAACTCAATTGCTGTTAAATGGCAGTCTAGCAGAAGAAGAGATAATAATCAGATCTGAAAATCTCACAAACAATGCCAAAACCATAATAGTGCACCTTAATAAATCTGTAGAAATCAATTGTACCAGACCCTTCAACAATACAAGAACAAGTATAACTATAGGACCAGGACAAATGTTCTATAGAACAGGAGAGATAATAGGAGATATAACAAAAGCATATTGTGAGATTAATGGAACAAAATGGAATGAAACTTTAAAACAGGTA

9949

#### Table of results (initial file size is 9.95 KB)

Compression | gzip | bzip2 | Arithmetic
--- | --- | --- | ---
real data output size | 2.33 KB | 2.36 KB | 1.02 KB
real data time | 0.004s | 0.007s | 0.005s
real data compression ratio | 0.234 | 0.237 | 0.103
random data compression ratio | 0.292 | 0.273 | 0.25

#### Analysis
- A priori, real data should compress better than random data, since the real file contains several FASTA sequences that are similar to one another, and this redundancy lets the compressor produce a smaller file. The random data, by contrast, is drawn from a uniform distribution and has no such redundancy.    
- Comparing the compression ratios (output file size / input file size) supports this expectation: the ratio is lower for the real data than for the random data with every algorithm, meaning the real data is compressed into a proportionally smaller file.

### 4. Estimating compression of 1000 terabytes
Let’s make some assumptions about the contents of the data at your biotech company. Most of
the data, say 80%, is re-sequencing of genomes and plasmids that are very similar to each
other. Another 10% might be protein sequences, and the last 10% are binary microscope
images which we’ll assume follow the worst-case scenario of being completely random.

Given the benchmarking data you obtained in this lab, which algorithm do you propose to use
for each type of data? Provide an estimate for the fraction of space you can save using your
compression scheme. How much of a bonus do you anticipate receiving this year?

#### re-sequencing of genomes and plasmids
- Since the sequences are very similar to each other, these files can be compressed to a much smaller size (perhaps 65 percent of the bits can be saved; the gp120 test above reached a ratio of 0.234 with gzip). gzip is a reasonable choice here because it exploits the many similarities between sequences and compresses quickly.

#### protein sequences
- For the protein sequences, pbzip2 can be used to compress the files to a fairly small size at a very fast compression speed. In the benchmark it reduced the random protein file from 100 MB to 55.3 MB, so about 45 percent of the space can be saved.

#### binary microscope images
- Completely random binary data cannot be compressed: in the benchmark, the 50% zeros file stayed at 105 MB with every algorithm, including arithmetic compression. No space can be saved on this 10% of the data, so it is best stored uncompressed rather than spending compression time on it.
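
One way to turn the benchmark numbers into an overall estimate is a weighted average of the compression ratios. The sketch below is a rough calculation under stated assumptions, not a measured result. It assumes the ratios from the tables in this report carry over to the company's data: gzip on the gp120 multi-FASTA (0.234) for the re-sequencing data, pbzip2 on random protein (55.3 MB / 100 MB), and no gain on random binary data (105 MB / 105 MB).

```python
# Share of the 1000 TB archive taken by each data type (from the problem statement).
shares = {"resequencing": 0.80, "protein": 0.10, "images": 0.10}

# Assumed compressed/original ratios, taken from the tables in this report.
ratios = {"resequencing": 0.234, "protein": 0.553, "images": 1.0}

compressed_fraction = sum(shares[k] * ratios[k] for k in shares)
saved_fraction = 1 - compressed_fraction
print(f"compressed size: {compressed_fraction * 1000:.0f} TB of 1000 TB")
print(f"space saved: {saved_fraction:.1%}")
```

The result is dominated by the re-sequencing share, so the estimate is most sensitive to how similar the real genomes and plasmids are to one another.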

###### I don't know exactly how much bonus I can get but probably 500 dollars!!! (hopefully)