Home
gtoubassi edited this page Jul 26, 2021 · 44 revisions
FemtoZip is a "shared dictionary" compression library optimized for small documents that may not compress well with traditional tools such as gzip. It targets situations where a very large number of small documents (tens to thousands of bytes each) share similar characteristics but do not compress effectively on their own.
- If gzipping 1,000 of your documents concatenated together in a single file achieves a much better compression rate than gzipping the documents individually, then your data is likely tailor-made for FemtoZip.
- Get your documents onto the file system as discrete files, and run a test using the fzip command line tool as shown in the Tutorial.
- If you have a Lucene search index and you want to see how much FemtoZip can compress your stored fields, try the IndexAnalyzer.
- Small objects serialized and stored in a database or an in-memory DHT such as memcached, using a PHP, JSON, or XML serialization format. Keys and tags are repeated across documents, but may not be repeated within a document. For example, at one large-scale consumer website, memcached user objects (via PHP serialization) were compressed to 29% of their gzipped size (8.3% of their original size).
- URLs, for example those stored in a Lucene search index. URLs often start with "http://www." and have common substrings like ".com/", ".html", and "?page=". Again, this structure is repeated across documents, but not within a document. For example, in a large-scale search engine, URLs in Lucene were compressed to 60% of their gzipped size (20% of their original size).
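
The concatenation heuristic and the shared-dictionary idea can both be tried quickly before installing anything. The sketch below is not FemtoZip and does not use its API. It uses the preset-dictionary feature (`zdict`) of Python's standard `zlib` module as a rough stand-in. The sample documents and the hand-written dictionary are made up for illustration. A real test should use your own documents.

```python
import zlib

def individual_size(docs):
    """Total bytes when each document is compressed on its own."""
    return sum(len(zlib.compress(d)) for d in docs)

def concatenated_size(docs):
    """Bytes when all documents are compressed as one blob."""
    return len(zlib.compress(b"".join(docs)))

def shared_dict_size(docs, dictionary):
    """Total bytes when each document is compressed on its own,
    but every compressor is primed with the same preset dictionary."""
    total = 0
    for d in docs:
        c = zlib.compressobj(zdict=dictionary)
        total += len(c.compress(d) + c.flush())
    return total

def roundtrip(doc, dictionary):
    """Compress and decompress one document with the shared dictionary."""
    c = zlib.compressobj(zdict=dictionary)
    blob = c.compress(doc) + c.flush()
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

# Hypothetical small documents: the keys repeat across documents,
# but hardly at all within any single document.
docs = [
    f'{{"user_id": {i}, "display_name": "user{i}", "is_active": true}}'.encode()
    for i in range(1000)
]

# Hand-written dictionary holding the substrings the documents share (an assumption
# made for this sketch; choose the contents from your own data).
dictionary = b'{"user_id": , "display_name": "user", "is_active": true}'

print("raw bytes:             ", sum(len(d) for d in docs))
print("compressed individually:", individual_size(docs))
print("compressed concatenated:", concatenated_size(docs))
print("individually + zdict:   ", shared_dict_size(docs, dictionary))
```

If the concatenated size is far below the individual total, the redundancy lives across documents rather than within them, which is the case the bullets above describe. The `zdict` figure shows how much of that gap a shared dictionary can recover while still letting each document be decompressed independently.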