### Simulating the data
Fortunately, you know that most of the data you’ll be compressing will be nucleic acid
sequences. Some of it, however, will be binary files, and some will be protein sequences. Start
by writing some code to simulate files containing random DNA, protein, and binary data.

1. Using np.random.choice, generate 100 megabytes (8 bits/byte * 1024 bytes/kilobyte * 1024
kilobytes/megabyte * 100) of random data containing 100%, 90%, 80%, 70%, 60%, and 50%
zeros.    
The percentage of zeros is set by the p values given for 0 and 1.    
np.packbits then packs the generated bits into bytes, eight bits per byte.

In [4]:
# Doing 100% zeros
import numpy as np
myvar1 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[1, 0])
myvar1 = np.packbits(myvar1)

# Doing 90% zeros
myvar2 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.9, 0.1])
myvar2 = np.packbits(myvar2)

# Doing 80% zeros
myvar3 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.8, 0.2])
myvar3 = np.packbits(myvar3)

# Doing 70% zeros
myvar4 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.7, 0.3])
myvar4 = np.packbits(myvar4)

# Doing 60% zeros
myvar5 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.6, 0.4])
myvar5 = np.packbits(myvar5)

# Doing 50% zeros
myvar6 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.5, 0.5])
myvar6 = np.packbits(myvar6)
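
The same idea can be checked at a much smaller scale before generating the full 100 MB arrays. The sketch below is an illustration only: the helper name `random_bits_packed`, the 1024-byte size, and the seed are assumptions, not part of the original notebook. It draws the bits, packs them, then unpacks them again to confirm the fraction of zeros.

```python
import numpy as np

def random_bits_packed(n_bytes, p_zero, seed=None):
    """Return n_bytes of packed random bits; each bit is 0 with probability p_zero."""
    rng = np.random.default_rng(seed)
    # Draw 8 bits per output byte, stored as uint8 to keep memory use low
    bits = rng.choice([0, 1], size=8 * n_bytes, p=[p_zero, 1 - p_zero]).astype(np.uint8)
    # packbits turns every group of 8 bits into one byte
    return np.packbits(bits)

packed = random_bits_packed(n_bytes=1024, p_zero=0.9, seed=0)
# Unpack again to measure the fraction of zero bits actually produced
zero_fraction = 1 - np.unpackbits(packed).mean()
print(packed.size, round(float(zero_fraction), 2))
```

Casting to `uint8` before packing matters at full scale, since the default integer type uses eight bytes per generated bit.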

2. Then write this data to a file in your home directory
This produces 6 files of randomly generated zeros and ones, each 104,857,600 bytes (about 105 MB).

In [27]:
open('zeros_100p','wb').write(myvar1)
open('zeros_90p','wb').write(myvar2)
open('zeros_80p','wb').write(myvar3)
open('zeros_70p','wb').write(myvar4)
open('zeros_60p','wb').write(myvar5)
open('zeros_50p','wb').write(myvar6)

104857600
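
A variant of the write step that closes each file explicitly and checks the size on disk might look like the sketch below. The helper `write_packed`, the temporary directory, and the small 1024-byte array are assumptions made for illustration; the notebook itself writes the full arrays to the home directory.

```python
import os
import tempfile
import numpy as np

def write_packed(path, packed):
    """Write a uint8 array to path as raw bytes and return the byte count."""
    # The with-block closes the file even if the write fails
    with open(path, 'wb') as fh:
        return fh.write(packed.tobytes())

# Small stand-in for myvar1: 1024 bytes of all-zero bits
packed = np.packbits(np.zeros(8 * 1024, dtype=np.uint8))
path = os.path.join(tempfile.mkdtemp(), 'zeros_100p_small')
n_written = write_packed(path, packed)
print(n_written, os.path.getsize(path))
```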

3. Next, generate DNA and protein sequences 100 million letters long and write those to your
home directory.    
This works the same way as above, but samples from the four nucleotides A, T, C, and G instead of 0 and 1 for the nucleotide sequence, and from the 20 one-letter amino acid codes for the protein sequence.    
The result is 2 files of randomly generated nucleotide and protein sequence, each 100 million letters long (100 MB).

In [21]:
import numpy as np
my_nt_seq= np.random.choice(['A','T','C','G'], size=100000000, replace=True, p=[0.25, 0.25,0.25,0.25])
my_pro_seq= np.random.choice(['A','R','N','D','C','E','Q','G','H','I','L','K','M','F','P','S','T','W','Y','V'], size=100000000, replace=True)

In [23]:
open('nt_seq.fa','w').write(''.join(my_nt_seq))
open('pro_seq.fa','w').write(''.join(my_pro_seq))

100000000
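
For sequences this long, joining 100 million one-character strings is slow and memory-hungry. One possible alternative, sketched below under the assumption that a uniform letter distribution is wanted, samples byte codes directly and decodes them once. The function name `random_sequence` and the constants are illustrative, not from the original notebook.

```python
import numpy as np

NUCLEOTIDES = 'ATCG'
AMINO_ACIDS = 'ARNDCEQGHILKMFPSTWYV'

def random_sequence(alphabet, length, seed=None):
    """Return a random string of the given length, letters drawn uniformly from alphabet."""
    rng = np.random.default_rng(seed)
    # View the alphabet as an array of ASCII codes, one byte per letter
    codes = np.frombuffer(alphabet.encode('ascii'), dtype=np.uint8)
    # Pick random positions in the alphabet, then look up their codes
    picks = rng.integers(0, len(codes), size=length)
    return codes[picks].tobytes().decode('ascii')

nt_small = random_sequence(NUCLEOTIDES, 1000, seed=0)
pro_small = random_sequence(AMINO_ACIDS, 1000, seed=0)
print(nt_small[:20], pro_small[:20])
```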

### Compressing the data
You’ll have to do this from the terminal on bioe131.com via SSH (PuTTY on Windows) or via
iPython in your web browser. On each of the files you generated above, run gzip, bzip2, pbzip2
and ArithmeticCompress as follows:
```
time gzip -k zeros_100p
time bzip2 -k zeros_100p
time pbzip2 -k zeros_100p
time ArithmeticCompress zeros_100p zeros_100p.art
```
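
The timing comparison can also be rehearsed inside Python on a small sample before running the command-line tools on the full files. The sketch below uses the standard-library `gzip` and `bz2` modules as stand-ins for the `gzip` and `bzip2` programs. That substitution is an assumption: timings and sizes will not match the command-line results exactly, and there is no standard-library equivalent of pbzip2 or ArithmeticCompress. The helper name `time_compressors` and the 100,000-byte sample are illustrative.

```python
import bz2
import gzip
import time
import numpy as np

def time_compressors(data):
    """Return {name: (seconds, compressed_size)} for gzip and bz2 on the given bytes."""
    results = {}
    for name, compress in (('gzip', gzip.compress), ('bz2', bz2.compress)):
        start = time.perf_counter()
        out = compress(data)
        results[name] = (time.perf_counter() - start, len(out))
    return results

# Small sample with 90% zero bits, packed into bytes
rng = np.random.default_rng(0)
bits = rng.choice([0, 1], size=8 * 100_000, p=[0.9, 0.1]).astype(np.uint8)
sample = np.packbits(bits).tobytes()

results = time_compressors(sample)
for name, (seconds, size) in results.items():
    print(f'{name}: {seconds:.3f} s, {size} of {len(sample)} bytes')
```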

#### The 0/1 input files are 105 MB each, and the nucleotide and protein files are 100 MB each.

#### 100 percent 0:           
1. time gzip -k zeros_100p

real    0m0.702s      
user    0m0.690s     
sys     0m0.012s     

file output size: 102 KB

2. time bzip2 -k zeros_100p

real    0m1.051s     
user    0m1.015s     
sys     0m0.036s      

file output size: 113 B

3. time pbzip2 -k zeros_100p

real    0m0.099s     
user    0m1.787s           
sys     0m0.093s        

file output size: 5.62 KB

4. time ArithmeticCompress zeros_100p zeros_100p.art

real    0m14.932s      
user    0m14.895s      
sys     0m0.037s    

file output size: 1.03 KB

----------------
#### 90 percent 0:       
1. time gzip -k zeros_90p

real    0m19.051s     
user    0m18.756s      
sys     0m0.124s    

file output size: 58.7MB

2. time bzip2 -k zeros_90p

real    0m11.303s     
user    0m11.215s     
sys     0m0.088s     

file output size: 61.2MB

3. time pbzip2 -k zeros_90p

real    0m0.759s     
user    0m18.815s     
sys     0m0.775s     

file output size: 61.2 MB

4. time ArithmeticCompress zeros_90p zeros_90p.art

real    0m0.002s     
user    0m0.000s     
sys     0m0.002s     

file output size: 1.02 KB

----------------
#### 80 percent 0:      
1. time gzip -k zeros_80p

real    0m13.569s     
user    0m13.291s    
sys     0m0.148s     

file output size: 81.2MB

2. time bzip2 -k zeros_80p

real    0m12.141s     
user    0m12.029s     
sys     0m0.112s     

file output size: 86.6MB

3. time pbzip2 -k zeros_80p

real    0m0.933s     
user    0m23.333s    
sys     0m0.799s     

file output size: 86.7MB

4. time ArithmeticCompress zeros_80p zeros_80p.art

real    0m35.749s    
user    0m35.584s      
sys     0m0.136s    

file output size: 75.7MB

----------------
#### 70 percent 0:      
1. time gzip -k zeros_70p

real    0m6.224s    
user    0m5.989s    
sys     0m0.100s     

file output size: 93.6 MB

2. time bzip2 -k zeros_70p

real    0m14.216s    
user    0m14.095s    
sys     0m0.120s     

file output size: 99.8 MB

3. time pbzip2 -k zeros_70p

real    0m1.183s    
user    0m30.074s     
sys     0m0.977s     

file output size: 99.8 MB

4. time ArithmeticCompress zeros_70p zeros_70p.art

real    0m39.501s    
user    0m39.228s    
sys     0m0.272s      

file output size: 92.4 MB

----------------
#### 60 percent 0:      
1. time gzip -k zeros_60p          

real    0m4.314s    
user    0m4.165s     
sys     0m0.144s      

file output size: 102 MB     

2. time bzip2 -k zeros_60p     

real    0m16.093s     
user    0m15.975s     
sys     0m0.117s     

file output size: 105 MB

3. time pbzip2 -k zeros_60p

real    0m1.377s     
user    0m35.670s     
sys     0m0.841s      

file output size: 105 MB

4. time ArithmeticCompress zeros_60p zeros_60p.art

real    0m41.795s     
user    0m41.537s      
sys     0m0.200s      

file output size: 102 MB

----------------
#### 50 percent 0:             
1. time gzip -k zeros_50p     

real    0m3.503s     
user    0m3.372s     
sys     0m0.124s     

file output size: 105 MB

2. time bzip2 -k zeros_50p      

real    0m16.777s     
user    0m16.629s      
sys     0m0.148s      

file output size: 105 MB

3. time pbzip2 -k zeros_50p    

real    0m1.470s     
user    0m38.506s     
sys     0m0.732s      

file output size: 105 MB

4. time ArithmeticCompress zeros_50p zeros_50p.art

real    0m40.964s     
user    0m40.739s     
sys     0m0.224s      

file output size: 105 MB
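
The output sizes for the 0/1 files can be compared against the Shannon entropy of the bit source, which gives a lower bound on the average size any lossless compressor can reach for independent random bits. The sketch below computes that bound for each file. The function name `binary_entropy` is illustrative, and the file size is the 104,857,600 bytes reported when the files were written.

```python
import math

def binary_entropy(p_zero):
    """Shannon entropy, in bits per bit, of a source emitting 0 with probability p_zero."""
    if p_zero in (0.0, 1.0):
        return 0.0  # a constant source carries no information
    p_one = 1.0 - p_zero
    return -(p_zero * math.log2(p_zero) + p_one * math.log2(p_one))

FILE_BYTES = 104857600  # size of each generated 0/1 file

for p_zero in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    bound_mb = binary_entropy(p_zero) * FILE_BYTES / 1e6
    print(f'{p_zero:.0%} zeros: about {bound_mb:.1f} MB')
```

The ArithmeticCompress sizes recorded above for 80% and 70% zeros (75.7 MB and 92.4 MB) sit very close to this bound. For 90% zeros the bound is near 49 MB, so the 1.02 kb result recorded for zeros_90p, with its 0.002 s run time, is probably worth re-running rather than taking at face value.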


----------------
#### nucleotide seq:                
1. time gzip -k nt_seq.fa

real    0m12.153s     
user    0m12.085s     
sys     0m0.068s      

file output size: 29.2 MB

2. time bzip2 -k nt_seq.fa

real    0m9.500s     
user    0m9.440s     
sys     0m0.061s     

file output size: 27.3 MB

3. time pbzip2 -k nt_seq.fa

real    0m0.662s     
user    0m16.189s      
sys     0m0.805s      

file output size: 27.3 MB

4. time ArithmeticCompress nt_seq.fa nt_seq.fa.art

real    0m21.512s     
user    0m21.292s     
sys     0m0.220s     

file output size: 25 MB


----------------
#### protein seq:                
1. time gzip -k pro_seq.fa

real    0m5.358s    
user    0m5.280s     
sys     0m0.049s     

file output size: 60.6 MB

2. time bzip2 -k pro_seq.fa

real    0m10.152s   
user    0m10.059s    
sys     0m0.093s     

file output size: 55.3 MB

3. time pbzip2 -k pro_seq.fa

real    0m0.783s     
user    0m18.937s     
sys     0m0.845s     

file output size: 55.3 MB

4. time ArithmeticCompress pro_seq.fa pro_seq.fa.art

real    0m28.622s    
user    0m28.333s     
sys     0m0.288s      

file output size: 54 MB
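
A similar back-of-the-envelope bound applies to the sequence files. Assuming every letter is drawn independently and uniformly, as in the generation code, each letter carries log2(alphabet size) bits. The helper `uniform_bound_mb` below is an illustrative name, not part of the original notebook.

```python
import math

def uniform_bound_mb(alphabet_size, n_letters):
    """Entropy bound in MB for n_letters drawn uniformly from alphabet_size symbols."""
    bits = n_letters * math.log2(alphabet_size)
    return bits / 8 / 1e6  # bits -> bytes -> MB

N_LETTERS = 100_000_000
nt_bound = uniform_bound_mb(4, N_LETTERS)    # 4 nucleotides
pro_bound = uniform_bound_mb(20, N_LETTERS)  # 20 amino acids
print(f'nucleotide: {nt_bound:.1f} MB, protein: {pro_bound:.1f} MB')
```

Under these assumptions the bounds come out at about 25 MB and 54 MB, which is where the ArithmeticCompress outputs for nt_seq.fa and pro_seq.fa landed.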



#### Questions


### Compressing real data
Now that you have a sense of how random data can be compressed, let’s have a look at some
real biological data. Using what you’ve learned about querying biological databases, find the
nucleic acid sequences of gp120 homologs from at least 10 different HIV isolates and
concatenate them together into a single multi-FASTA.

In [46]:
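from Bio import SeqIO
from Bio import Entrez
Entrez.email = 'joanne91218@berkeley.edu'
# Note: this searches the protein database; db = 'nucleotide' would return nucleic acid records
handle = Entrez.esearch(db = 'protein',
                       term = 'gp 120',
                       sort = 'relevance',
                       idtype = 'acc')
records = []
for i in Entrez.read(handle)['IdList']:
    handle = Entrez.efetch(db = 'protein', id=i, rettype = 'fasta', retmode = 'text')
    temp = SeqIO.read(handle,'fasta')
    print(temp.seq)
    records.append(temp)
# Write all records once, with their headers, so the file is a multi-FASTA
# (opening the file with 'w' inside the loop kept only the last sequence)
n_written = SeqIO.write(records, 'gp120.fa', 'fasta')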

FAIRKCNDREFNGTGPCRNVSTVQCTHGIRPVVSTQLLLNGSLAENKTMIRSENITNNAKNILVQLNXPVNITCIRPNNNTRKSVRIGPGQSFYATGDIIGDIRQAHCNVTRAKWNGTLQKVADQLRT
YCTPAGFXILKCNDKNFNGSGPCTNVSTVQCTHGIRPVVSTQLLLNGSLAEEDVVIRSENFTNNAKTILVQLNKPINITCVRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAYCTLNRTQWNDTLRKIAIKLREQFGNKTIAL
FAILKCNAKNFNGTGXCKNVSTVQCTHGIRPVVSTQLLLNGSLAEDEVVIRSKNFTDNAQNIIVQLNATVNITCTRPNNNTRRSIHMGPGRSFYTTGDIIGDIRQAHCNISRSAWNDTLEQIAAKLRE
EEEVVIRSDNFSDNAKTIIVQLNVSVQINCTRPNNNTRKSINIGPGRAFYATGDIIGDIRQAHCNLSKTQWNNTLNQIVIKLGEQFKINIVF
KLTPLCVTLNCTDGLRNATNATDSRLTNANNSSWGGEMKNCSFKITTSIRDKVQKEYALFYTLDIVPIDKDNNNSTTYRLINCNTSVITQACPKMS
KLTPLCVTLNCTDATNTTKSNTSSWETMEKGEIQNCSFNVTTSIRDRVQREYALFYKLDVVPIDNEKNTTSYRLISCNTSVITQACPKIS
KLTPLCVTLNCTDLQNVTYANSSSERMMEREEMKNCSFNITTSIRDKVQKEYALFYKLDIVPIDDNNNTNTSYRLISCNTSVITQACPKVS
KLTPLCVILNCTDLGNATNTNSSNTNSSSGEMMMEKGEIKNCSFNISTSIRGKVQKEYAFFYKLDIIPIDNDTTSYTLTSCNTSVITQACPKVS
EVQLLESGGGLVKPGGSLRLSCAASGFTLINYRMNWVRQAPGKGLEWVSSISSSSSYIHYADSVKGRFTISRDNAENSLYLQMNSLRAEDTAVYYCVREGPRATGYSMADVFDIWGQGTMVTVSSASTKG

MDFDAETCLKDWNDLAQDYKELEALNREYVMKLEEISELQAKCIKGISHQKYRMGIIRKSLKQPSARETREELKKSIIRREQQLQEIEQTLPKSNGTYLQIILGSVNVSILNKNDKFKYKDEYEKFKLVLSVIGFVLSVLNLFTNVRTLELSFMFLLVWYYCTLTIRESILKVNGSRIKGWWRFHHFLSTVVAAVLLVWPNTGPWYQFRTQFMWFNVYISVVQYLQFRYQRGVLYRLKALGERDNMDITIEGFHSWMWRGLSFLLPFLFAGYLFQLYNAYTLYELAYHPDATWHVSVLSAMFLVLFLGNTTTTIMVIPQKLVKERIKDHLFASKPDRRKEGRMRNNPDNKDEKSE


In [47]:
a= "VPVWKDAETTLFCASDAKAHETEVHNVWATHACVPTDPNPQEIQLKNVTENFNMWKNNMVEQMQEDVISLWDQSLKPCVKLTPLCVTLNCTDATLTNSTYITNVSKIIGDITEEVRNCSFNMTTELRDKKQKVHALFLVPVWRDADTTLFCASDAKSHVTEAHNVWATHACVPTDPNPQEIHLENVTENFNMWKNNMVEQMQEDVISLWEQSLKPCVKLTPLCVTLNCTNANLTNANLTNANNITNVENITDEVRNCSFNVTTDLRDKQQKVHALFYRLDIVQINSKNSSDYRLINCNTSVIKQACPKISFDPIPIHYCTPAGYAILKCNDKNFNGTGPCKNVSSVQCTHGIKPVVSTQLLLNGSLAEEEIIIRSENLTNNVKTIIVHLNKSVEINCTRPSNNTRTSITIGPGQVFYRTGDIIGDIRKVSCELNGTKWNEVLKQVKEKLKEHFNKNISFQPPSGGDLEITMHHFSCRGEFFYCNTTQLFNNTYSNGTITLPCKIKQIINMWQGVGQAMYAPPISGRINCLSNITGLLLTRDGNNGTNETFRPGGGNIKDNWRSELYKCKVVQIEPLGIAPTRAKRRVVEREKKMGCLGSQLLIAILLLSVCGIYCTLYVTVFYGVPAWRNATVPLFCATENRDTWGTTQCLPDNGDYSELALNVTESFDAWNNTVTEQAIEDVWQLFETSIRPCVKLSPLCITMRCNKSETDRWGLTKSIPTTTPTTSATTSTEVDMVNETSSCIAQDNCTGLEQEQMISCKFNMTGLKRDKKKEYNETWYSADLVCEQGNNTDNESRCYMNHCNTSVIQESCDKHYWDAIRFRYCAPPGYALLRCNDTNYSGFMPNCSKVVVSSCTRMMETQTSTWFGFNGTRAENRTYIYWHGRDNRTIISLNKYYNLTMRCRRPGNKTVLPITIMSGLHFHSQPIINDRPKQAWCWFGGKWKDAIKEVKQTIVEHPRYTGTNNTDKINLTVPRGGDPEVTFMWTNCRGEFLYCKMNWFLNWVEDRNTTNQKPTERHKRNYVPCHIRQIINTWHKVGKNVYLPPREGDLTCNSTVTSLIANIDWINGNQPNFTMSAEVAELNGSLAEGEVVIRSENFTDNAKTIIVQLNESVVINCTRPNNNTRKNIHLGRGRSVYATEKIIGNVKQAHCNISRAKWNDTLKQIVEKLREQFGKNKTIVFNQSSGGDPEIASSLAEEEVMIRSENITDNTKNIIVQLKTPVNITCTRPNNNTRKGIHIAPGQALYATGDIIGDIRQAHCNISGTKWNNTLQEVVTQLGEHLNKSTIEFNHHSGGDPEIACSLAEEEVVIRSENFTDNAKTIIVQLNESVVINCTRPSNNTRKSIHLGWGRSVYATGDIIGDIRQAHCNISRAKWNDTLKQIVEKLREQFGENKTIIFNQSSGGDPEINGSLAEEGIQIRSENITNNAKTIIVQLDKAVKINCTRPNNNTRKGVRIGPGQAFYATGGIIGDIRQAHCNVSRAKWNDTLRGVAKKLREHFKNKTIIFEKSSGGDPEIACSLAEEEVVISLKISQNNAKNIIVQLKEPIKINCTRPNNNTRKSIHITPGRAFYATGDIIGDIRQAHCNLSRTRWNKTLGEIVKKLREQFKNKTIIFNSSSGGDPEIACSLAEEEVIIRSANFSNNTKTIIVQLNESVVINCTRPNNNTRRSVNIGPGRAFFTTGDIIGDIRQAHCNLSRAQWNDTLKRVVEKLKEQFVNKTIVFNQSSGGDPEINGSLAAEKVMIRSENITDNTKNIIVQLKTPVNITCARPNNNTRRSIHIGPGQAFYATGEVIGDIRQAHCNISGTKWNATLHEVVTQLGEHLNKSIIEFKPSSGGDPEINGSLAEGEIIIRSKNLTDNAKIIIVHLNESVGNVCTRPNNNTRKSIRIGPGQAFYANNDIIGDIRQAHCNITENAWNKTLQMVGKKLKEHFPNKTTIIFEPSSGGDPEIACSLAEEEVVIRSDNFTNNAKIIIVQLNASVEINCTRPNNNTRKGI
HIGPGRAFYATDIIGDIRQAYCNISRREWNNTLEKIVAKLRETFGNKTIAFKPSSGGDPEINGSLAEGNVTIRSENITNNAKTIIVQLTKPVQINCTRPNNNTRQGVHIGPGQVFYRTGDIIGDIRKAHCNVSRTEWNKTLHQVATQLRRHFGNKTIIFTNSSGGDPEIACSLAEEEVVIRSDNFTNNAKTIIVQLNESVVINCTRPNNNTRRSIPIGPGRAFYATGDITGDIRQAHCTLNGTQWNNTLKQIVIKLREQFKNKTIVFSPSSGGDPEICTRPNNNTRKSIHMGPGRAFYATGDIIGDIRQAHCCTRPNNNTGKSINIGPGRAFYATGDIIGDIRQAHCCTRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAHCCTRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAHCCTRPNNNTRKSINIGPGRAFYATGDIIGDIRQAHCCTRPNNNTRNSINIGPGRAFYATGDIIGDIRQAHC"
open('gp1201.fa','w').write(''.join(a))

2483
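
Pasting the sequences into one long string loses the record boundaries and headers that make a file a multi-FASTA. A minimal sketch of writing several records with headers, using only plain Python, is shown below. The function name `format_multifasta`, the 60-column line width, and the record names and sequences are hypothetical placeholders, not real gp120 data.

```python
def format_multifasta(records, width=60):
    """Format (name, sequence) pairs as multi-FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in records:
        lines.append('>' + name)  # header line for this record
        # Split the sequence into fixed-width lines
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return '\n'.join(lines) + '\n'

# Hypothetical placeholder records, for illustration only
example_records = [
    ('isolate_1', 'ACGT' * 20),
    ('isolate_2', 'TTGACA'),
]
fasta_text = format_multifasta(example_records)
print(fasta_text)
```

The resulting text can be written to disk with a single `write` call inside a `with open(...)` block.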