Compression with bzip2 is hideously slow. This doesn't really become apparent
until you've got a 350-gigabyte XML data dump to compress and it can literally
take days. dbzip2 attacks the problem by spreading bzip2 compression across
multiple local threads and remote compression daemons.

More background at: http://www.mediawiki.org/wiki/dbzip2

== Usage ==

  dbzip2 [options] < infile > outfile.bz2

Currently input is through stdin/stdout only.
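
For example, to compress a dump using four local threads plus two
remote compression daemons (the hostnames here are placeholders;
give one -r option per remote host):

  dbzip2 -p4 -rserver1 -rserver2 < dump.xml > dump.xml.bz2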

Options:

  -1 .. -9
    Set the bzip2 block size, in 100k increments.

  -d
    [experimental] Decompression mode.
    Only works on small (single-block) files right now.

  -r<hostname>[:<port>]
    Connect to a remote compression daemon on the given host.
    The default port is 16982.
    
    If a remote host goes down, dbzip2 will attempt to reconnect
    dynamically, so a long-running compression job won't be
    interrupted by, e.g., rebooting a server for maintenance.

  -p<num>
    Set the number of local compression threads. The default is
    the number of local CPUs, or 1 if the count cannot be detected.
    Detection should work on Linux and Mac OS X; it may or may not
    work elsewhere. (A sketch of this fallback appears after this
    option list.)
    
    There is no benefit to setting this higher than the number of
    real processor cores available: hyperthreading usually gains
    nothing from doubling up, while real dual-core processors do
    benefit.
    
    If using remote threads you can set this to 0, in which case
    no local threads will be used unless all the connections fail.

  -v[v[v[v]]]
    Increase output verbosity level.
    
    By default only connection errors will be shown on stderr.
    At -v you will receive a summary of compression speed; -vv and
    higher levels will add various debugging messages.
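
As a point of reference, the "default to CPU count, else 1" fallback
described under -p might look like this in Python. This is a sketch of
the assumed logic, not necessarily how dbzip2 itself detects the count:

  import multiprocessing

  def default_thread_count():
      # Assumed logic for illustration; dbzip2's own detection may differ.
      try:
          return multiprocessing.cpu_count()
      except NotImplementedError:
          return 1  # count not detectable; fall back to one thread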

Config:

Default option values may be overridden via a config file; see the
sample dbzip2.conf for details.


== Daemon ==

The remote compression daemon listens for connections on a TCP port
and sends back compressed (or uncompressed) data.
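
To make the idea concrete, here is a minimal Python sketch of a
per-block compression server in this spirit. The length-prefixed
framing and helper names are assumptions for illustration only; the
real dbzip2d wire protocol is defined by its source code:

  import bz2
  import socket
  import struct

  def recv_exact(conn, n):
      # Read exactly n bytes, or return b'' on a clean disconnect.
      buf = b''
      while len(buf) < n:
          chunk = conn.recv(n - len(buf))
          if not chunk:
              return b''
          buf += chunk
      return buf

  def serve(host='', port=16982):
      # Simplified: one client at a time; a real daemon would serve
      # many connections concurrently.
      listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      listener.bind((host, port))
      listener.listen(5)
      while True:
          conn, _addr = listener.accept()
          try:
              while True:
                  # Hypothetical framing: 4-byte big-endian length,
                  # then that many bytes of raw data.
                  header = recv_exact(conn, 4)
                  if not header:
                      break  # client finished
                  (length,) = struct.unpack('>I', header)
                  block = recv_exact(conn, length)
                  if length and not block:
                      break  # peer disconnected mid-frame
                  compressed = bz2.compress(block, 9)
                  conn.sendall(struct.pack('>I', len(compressed)) + compressed)
          finally:
              conn.close()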

Usage:

  dbzip2d [options]

Options:

  -d
    Daemonize: disconnect from the controlling terminal and go into
    the background.

  -l<IP>
    Listen for connections on a single interface only.

  -p<port>
    Set port number. Default is 16982.

  -v[v[v[v]]]
    Increase output verbosity level.

  --pid-file=<filename>
    Output the process ID to a file; handy when daemonizing.

A sample init script is included as 'dbzip2d.service'; when using the
dbzip2d RPM this is installed as /etc/init.d/dbzip2d.


== Security considerations ==

Data sent to/from remote threads is unencrypted. Do not send sensitive
data over an unsecured network such as the Internet!


The dbzip2d daemon is a network server and has no access control.
While it provides no access to local filesystems or other information,
I cannot rule out the possibility of security bugs.

Additionally, there is little error checking for now, and denial of
service is a strong possibility if malicious parties are able to
connect to the daemon.

If running dbzip2d on a non-private or mixed public-private network,
consider port firewalling, listening on local interfaces only, running
the daemon under a limited user, and applying memory and CPU restrictions
to the daemon.
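
For example, the -l option documented above can restrict the daemon
to loopback connections only:

  dbzip2d -l127.0.0.1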


== Network considerations ==

dbzip2 was designed for use in high-speed switched local networks.
A wireless or 10-megabit ethernet LAN will likely saturate very quickly;
switched 100-megabit or gigabit should do much better.

With gigabit ethernet, speeds in excess of 20 megabytes per second should
be achievable without too much difficulty, given enough CPUs to attach to.
(About 8-10 Opterons does the trick; that works out to roughly 2-2.5
megabytes per second of input per CPU.)

Exceeding 50 megabytes per second is probably not very likely on current
hardware, but you'd need some nice disks to need that. ;)

In my initial testing I saw diminishing returns after about 10 remote
threads.

Possible bottlenecks:
* Network bandwidth on the client may saturate, given enough threads.
  Transmission collisions may reduce throughput before total saturation.
* There is some CPU overhead on the client for I/O, connection
  management, etc., even when all the compression work is done by
  remote threads. Given enough external threads, the client's own
  CPU will max out.

The client's Python code is probably not as efficient as it could be
at higher throughput rates, so some profiling and optimization would
be worthwhile.