Skip to content

Compress, decompress and benchmark using Huffman coding and LZW

License

Notifications You must be signed in to change notification settings

gotonode/compress

Repository files navigation

Compress

Compress, decompress and benchmark using Huffman coding and LZW

Travis CI BCH Codacy Codecov codebeat CodeFactor LGTM (alerts)
Build Status BCH compliance Codacy Badge codecov codebeat badge CodeFactor Total alerts

App

Please notice! The current LZW implementation works, but is very slow when using larger files (over 0.5 MB). The app will become unresponsive for the longest time if you try to compress a large file with LZW or use a large file with the benchmarking functionality. Thus, consider using smaller files for now.

In this project, you can compress files using either Huffman coding or LZW (Lempel-Ziv-Welch). Decompression, as well as comparisons between the two algorithms, are also possible.

You can download a pre-built JAR-file from here.

Documentation

General interest:

Advanced interest:

Geek interest:

  • Javadoc (code documentation, updated manually)

Weekly reports

The problem

To reduce the size of a file, and to be able to return the original file from the size-reduced file. This is also known as "lossless compression" (compared to "lossy compression", which results in irrevocable data loss such as with JPEG files).

There are two main reasons to apply compression techniques. First is that of storage; smaller-size files can more easily fit onto storage media, either HDD, SSD, optical media (CD/DVD/Blu-ray) or tape drives. The second reason is that less bandwidth is needed to transfer the data.

A compressed file must be decompressed before it's usable, and this incurs a time-delay. On slower systems, decompressing a file can take a considerable amount of time (this is why installing new software could take a long time, it's decompressing).

Algorithms

Two different algorithms are compared in this project.

algorithm GitHub (source) Wikipedia (info) time complexity space complexity
Huffman coding source info O(n) O(n)
LZW (Lempel-Ziv-Welch) source info O(n) O(n)
Algorithm "Huffman"

Huffman works by counting the instances of specific characters from the data. It then creates a tree structure that maps each character as a sequence of bits. Characters that appear more frequently get a shorter bit string representation, and uncommon/rare characters get a longer string.

Usually, each character is encoded with either 8 or 16 bits. For an example, the character 'A' is 01000001. Huffman coding can encode 'A' to be 01 or 0110 for an example, potentially saving a lot of space.

Algorithm "LZW"

LZW works differently from Huffman. It creates a dictionary of strings (preferably long strings) that map to binary sequences. For an example, if textual data contains the word "welcome" many times, each entry of that word gets a binary representation that is shorter than the original (in the optimal scenario).

Thus, LZW works best with data that has many long, repeating strings containing the same data. Usually this is the case with text files, but not the case with binary files.

Sample data

You'll find sample data that you can use in the data folder. The contents of that folder are in the public domain.

More data (not created by me) can be found online with search query "data compression corpora".

name type format size contents compresses
cities.sql SQL SQL data 6,39 KB list of cities in Finland very well
keychain.jpeg JPEG† JPEG picture 83,9 KB an old picture of my keychain very poorly
lorem_ipsum.docx DOCX Word document 19,6 KB Lorem Ipsum poorly
lorem_ipsum.pdf PDF PDF document 95,4 KB Lorem Ipsum poorly
lorem_ipsum.txt TXT text file 17,2 KB Lorem Ipsum very well
screenshot.png PNG PNG image 53,6 KB screenshot from Windows poorly

† the real name of a JPEG file is with the E ("Experts"), not just JPG without it (this was due to the 8.3 limit of older systems)

About

Compress, decompress and benchmark using Huffman coding and LZW

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages