gz-sort

A utility for sorting really big files. http://kmkeen.com/gz-sort/
More details at http://kmkeen.com/gz-sort/

Performs a merge sort over a multi-GB gzip-compressed file.

Quickstart

git clone https://github.com/keenerd/gz-sort; cd gz-sort; make; ./gz-sort -h

Needs the zlib headers and probably only builds on GNU/Linux.

use: gz-sort [-u] [-S n] [-P n] source.gz dest.gz

options:
   -h: help
   -u: unique
   -S n: size of presort, supports k/M/G suffix
         a traditional in-memory sort (default n=1M)
   -P n: use multiple threads (experimental, default disabled)
   -T: pass through (debugging/benchmarks)

estimating run time, crudely:
    time gzip -dc data.gz | gzip > /dev/null
    unthreaded: seconds * entropy * (log2(uncompressed_size/S)+2)
    (where 'entropy' is a fudge-factor between 1.5 for an
    already sorted file and 3 for a shuffled file)
    S and P are the corresponding settings
    multithreaded: maybe unthreaded/sqrt(P) ?

estimated disk use:
    2x source.gz

Minimum requirements to sort a terabyte:

  • 4MB ram (yes, megabyte)
  • free disk space equal to twice the size of the compressed source.gz

Known bugs to fix

Email me if you are using gz-sort and any of these omissions are causing you trouble. For that matter, email me if you find something not on this list, too.

  • Does not build on non-gnu systems.
  • Sqrt(threads) is a terrible ratio.
  • No support for uncompressed stdin streams.
  • Breaks if a line is longer than the buffer size.
  • Lacks all error handling.
  • Ugly code with lots of ways to refactor.
  • Output could use predictable flushes.

Performance tweaks to try

  • Profile!
  • Parallelize the final n-way merge. This will require adding IPC.
  • Filter unique lines during the earlier passes.
  • Try out zlib-ng; about half of the CPU time is spent on (un)gzipping.
  • Improve memory estimation; it lowballs, and that hurts the presort.
  • Byte-based seeking instead of line-counting.