changed the compression format to zip #18

Merged into intermine:master (4 commits) on Mar 31, 2020

Conversation

niveditarufus
Contributor

@niveditarufus niveditarufus commented Mar 28, 2020

After comparing different compression formats, zip seems to be the most consistent in terms of compression time and compressed size. Other formats, such as bztar and xztar, take much longer to compress large files like the biotestmine archive.
This is in relation to issue #5.
I have included the comparison table below for reference, followed by my CPU model.
Report:

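| File        | Format | Time taken to compress (s) | Size after compression (MB) |
|:-----------:|:------:|:--------------------------:|:---------------------------:|
| biotestmine | tar    | 1.069                      | 231                         |
| biotestmine | bztar  | 9.329                      | 201                         |
| biotestmine | gztar  | 3.463                      | 202                         |
| biotestmine | xztar  | 19.930                     | 196                         |
| biotestmine | zip    | 3.79                       | 204                         |
| postgres    | tar    | 1.497                      | 567                         |
| postgres    | bztar  | 2.101                      | 77                          |
| postgres    | gztar  | 5.733                      | 96                          |
| postgres    | xztar  | 7.962                      | 50                          |
| postgres    | zip    | 2.703                      | 98                          |
| solr        | tar    | 0.0255                     | 59                          |
| solr        | bztar  | 0.0788                     | 9.8                         |
| solr        | gztar  | 0.074                      | 12                          |
| solr        | xztar  | 0.112                      | 8.0                         |
| solr        | zip    | 0.177                      | 12                          |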
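
For readers who want to repeat a comparison like the one in the table, here is a minimal sketch. It is not the script used for this PR: it assumes the archives are built with `shutil.make_archive` (whose format names match the ones in the table) and timed with `time.perf_counter`, and it reports sizes in units of 10^6 bytes. The function name `benchmark_formats` is hypothetical.

```python
import os
import shutil
import tempfile
import time

# Format names as accepted by shutil.make_archive (same names as in the table).
FORMATS = ["tar", "bztar", "gztar", "xztar", "zip"]


def benchmark_formats(source_dir, formats=FORMATS):
    """Archive source_dir once per format.

    Returns a dict mapping format name to (seconds, size_in_mb).
    Assumption: 1 MB is counted as 10**6 bytes.
    """
    results = {}
    # Write archives to a separate temporary directory so that an archive
    # is never swept up into the next archive of the same source tree.
    with tempfile.TemporaryDirectory() as out_dir:
        for fmt in formats:
            base_name = os.path.join(out_dir, "archive_" + fmt)
            start = time.perf_counter()
            archive_path = shutil.make_archive(base_name, fmt, root_dir=source_dir)
            elapsed = time.perf_counter() - start
            size_mb = os.path.getsize(archive_path) / 1e6
            results[fmt] = (elapsed, size_mb)
    return results
```

Calling `benchmark_formats("/path/to/data")` on a directory of your choice returns one `(seconds, size_in_mb)` pair per format. A single run per format is noisy, so repeating the measurement and averaging would give steadier numbers.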

CPU model:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping: 9
CPU MHz: 800.038
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 5599.85
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

@heralden
Member

Great job! I think it would be a good idea to commit this table into a markdown file for future reference (perhaps docs/compression-formats.md). At the top of the markdown file, include a short text mentioning how you tested (which Python timing function) and your CPU model name.

@niveditarufus
Contributor Author

Sure, I will do it.

Member

@heralden heralden left a comment

Thanks, this is perfect 💯

@heralden heralden merged commit 8c10769 into intermine:master Mar 31, 2020
@niveditarufus niveditarufus deleted the Compression_format branch March 31, 2020 14:47