Compression with bzip2 is hideously slow. This doesn't really become
apparent until you've got a 350 gigabyte XML data dump to compress and it
can literally take days.

More background at:
http://www.mediawiki.org/wiki/dbzip2


== Usage ==

  dbzip2 [options] < infile > outfile.bz2

Currently input and output are through stdin/stdout only.

Options:
  -1 .. -9
        Set bzip2 block size, in 100k increments.

  -d    [experimental] Decompression mode. Only works on small
        (single-block) files right now.

  -r<hostname>[:<port>]
        Connect to a remote compression daemon on the given host.
        The default port is 16982.

        If remote hosts are down, dbzip2 will attempt to reconnect
        dynamically, so a long-running compression job won't be
        interrupted by, e.g., rebooting a server for maintenance.

  -p<num>
        Set the number of local compression threads. The default is the
        number of local CPUs, or 1 if the count cannot be detected.
        Detection should work on Linux and Mac OS X; it may or may not
        work elsewhere.

        There is no benefit to setting this higher than the number of
        physical processor cores available: hyperthreaded logical CPUs
        usually do not benefit from an extra thread, while real dual-core
        processors do.

        If using remote threads you can set this to 0, in which case no
        local threads will be used unless all the connections fail.

  -v[v[v[v]]]
        Increase output verbosity level. By default only connection
        errors will be shown on stderr. At -v you will receive a summary
        of compression speed; -vv and higher levels will add various
        debugging messages.

Config:
  Default option values may be overridden via a config file; see the
  sample dbzip2.conf for details.


== Daemon ==

The remote compression daemon listens for connections on a TCP port and
sends back compressed (or uncompressed) data.

Usage:
  dbzip2d [options]

Options:
  -d    Daemonize: disconnect from the controlling terminal and go into
        the background.

  -l<IP>
        Listen for connections on a single interface.

  -p<port>
        Set port number. Default is 16982.

  -v[v[v[v]]]
        Increase output verbosity level.
  --pid-file=<filename>
        Output the process ID to a file; handy when daemonizing.

A sample init script is included as 'dbzip2d.service'; when using the
dbzip2d RPM this is installed as /etc/init.d/dbzip2d.


== Security considerations ==

Data sent to/from remote threads is unencrypted. Do not send sensitive
data over an unsecured network such as the Internet!

The dbzip2d daemon is a network server and has no access control. While
it provides no access to local filesystems or other information, I cannot
rule out the possibility of security bugs.

Additionally, there is little error checking for now, and denial of
service is a strong possibility if malicious entities are able to connect
to the daemon.

If running dbzip2d on a non-private or mixed public-private network,
consider port firewalling, listening on local interfaces only, running
the daemon under a limited user, and applying memory and CPU restrictions
to the daemon.


== Network considerations ==

dbzip2 was designed for use in high-speed switched local networks.
A wireless or 10-megabit ethernet LAN will likely saturate very quickly;
switched 100-megabit or gigabit should do much better.

With gigabit ethernet, speeds in excess of 20 megabytes per second should
be achievable without too much difficulty given enough CPUs to attach to.
(About 8-10 Opterons do the trick.)

Exceeding 50 megabytes per second is probably unlikely on current
hardware, but you'd need some nice disks to need that. ;)

In my initial testing I saw diminishing returns after about 10 remote
threads.

Possible bottlenecks:
* Network bandwidth on the client may saturate, given enough threads.
  Transmission collisions may reduce throughput before total saturation.

* There is some CPU overhead for I/O, connection management, etc., even
  when using purely remote threads for compression work. Given enough
  external threads you'll max out.

  The Python code on the client is probably not as efficient as it could
  be as the throughput rate rises, so some profiling and optimization
  would be good to do.
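
As a rough sanity check on the figures quoted in this README, the following Python sketch estimates how long a 350 gigabyte dump would take at the 20 and 50 megabytes per second rates mentioned above, and what each kind of link could carry at best. It is not part of dbzip2. The function names are made up for illustration, and the decimal units (1 gigabyte = 1000 megabytes) and the absence of protocol overhead are simplifying assumptions, so real numbers will be somewhat worse.

```python
# Back-of-the-envelope estimates only; this is not dbzip2 code.
# Assumptions: decimal units and zero protocol overhead.

DUMP_GIGABYTES = 350  # dump size quoted at the top of this README


def hours_to_compress(size_gb, rate_mb_per_s):
    """Hours needed to push size_gb of input through at rate_mb_per_s."""
    return size_gb * 1000 / rate_mb_per_s / 3600


def link_ceiling_mb_per_s(link_megabits):
    """Theoretical ceiling of a link in megabytes per second (8 bits per byte)."""
    return link_megabits / 8


if __name__ == "__main__":
    # Throughput figures taken from the text above.
    for rate in (20, 50):
        hours = hours_to_compress(DUMP_GIGABYTES, rate)
        print("%d MB/s: %.1f hours" % (rate, hours))
    # Link speeds taken from the text above: 10-megabit, 100-megabit, gigabit.
    for link in (10, 100, 1000):
        ceiling = link_ceiling_mb_per_s(link)
        print("%d Mbit/s link: at most %.2f MB/s" % (link, ceiling))
```

Under these assumptions the dump takes roughly 4.9 hours at 20 megabytes per second and 1.9 hours at 50. A 100-megabit link tops out at 12.5 megabytes per second, below the 20 megabyte figure, which is consistent with that rate being quoted for gigabit ethernet.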