Implement Huffman codes to compress and uncompress files via 8-bit alphabet, specially useful on UTF-8/ASCII files.
To generate executables, run:
make all
Assumes clang
compiler is installed.
./evzip "source file"
will output a compressed file with .ez
extension.
The utility would compress any file, but will achieve best compression on text files encoded as UTF-8. Due to the Huffman algorithm used on 8-bit alfabet (ASCII).
Using the command line option -v
will make the program print a summary of all the prefix codes in the tree.
For example compressing a file containing the text Mary has a little lamb
, using ./evzip -v examples/MaryLamb.txt
will output:
./evunzip "source ez file" "destination file"
Will output an identical file as the one originally compressed.
Both files equality can be checked with:
diff --report-identical-files file1 file2
Execute: ./my_diff_test
for everything in action. Compressing the full Moby Dick novel in about 43% the original size.
Use Huffman codes algorithm for recursively generating a full binary tree in which each leaf is a pre-fix code for a character, and each internal node has two children zero
and one
with a path leading to a pre-fix code.
See: https://en.wikipedia.org/wiki/Huffman_coding for further info on the algorithm rationale.
Implement a Full-Binary with prefix codes based on single character frequencies. Implements a priority_queue as a min-heap with an underliying array to define the nodes with minimum frequency. See heap/
for implementation details.
For encoding huffmantree.c
maintains a memoized array with the codes for every ASCII code, this avoids the need to look for characters in the tree. To encode compressed output as a file it is done by manipulating individual bits for writing a defined binary file, that starts with the frequencies of each ASCII code in the original uncompressed file as follows:
0xC0DE |Freq for ASCII code 0| |Freq for ASCII code 1| ..... |Freq for ASCII code 255|
0xDADA Encoded bits in a consecutive way
The program will naively assume each character in the source file is from an alphabet of 8-bits (ASCII), and will try to redefine it using shorter ones for common characters. If the source file is not made like this it will not achieve a good compression rate.
For decoding, read the compressed file, with the frequencies in the 0xC0DE
area generate the Huffman tree. The function decode_bit
is fed with consecutive bits read from the file and it returns a node in the tree, if the node is a leaf a character is written to the destination file.