
gompress

compress or archive files over distributed golem nodes

gompress solves the problem of needing to compress files when doing so on the requestor side would be prohibitively time-consuming (e.g. on a small virtual server).

at last, you can stop daydreaming on company time while waiting for a compression job to finish on your tiny virtual server and get back to work. get excited!

when gompress is given a directory or multiple files as the target, it first prepares a tar file before compressing (see ABOUT ARCHIVING below).

curious how golem's nodes can work for you? visit https://golem.network to learn more

REQUIREMENTS

  • yapapi 0.11.0
  • a golem requestor installation with the yagna client running, an app-key exported, etc.
  • python 3.9-3.10

tested on linux, windows 10, and mac os

VIDEO DEMO

gompress_nl1.mp4
gompress020.mp4

SELF DEMO

NOTE: this self demo uses tGLM nodes on the public subnet, which tend to be more compatible with non-development versions.

() $ cd gompress
(gompress) $ git checkout v0.2.2
(gompress) $ curl https://cs.fit.edu/~mmahoney/compression/enwik8.zip | funzip > enwik8
(gompress) $ md5sum enwik8 # note this for later
(gompress) $ python3 gompress.py enwik8 --subnet-tag public
(gompress) $ cd workdir/hashvaluefromoutput/final # hashvaluefromoutput: the hash gompress printed
(gompress final) $ xz -d enwik8.xz
(gompress final) $ md5sum enwik8 # compare with original enwik8

ABOUT XZ

xz is a free, well-supported, highly effective compressor distributed with virtually every linux distribution. binaries are available to download from the project's web page. additionally, xz can be compiled on mac os x with gcc installed (instructions below).

homepage: https://tukaani.org/xz

windows binary (check for latest): https://tukaani.org/xz/xz-5.2.5-windows.zip

compile xz for use on the mac os x command line

$ # first download xz-5.2.5.tar.bz2 from the homepage above
$ tar -xf xz-5.2.5.tar.bz2
$ cd xz-5.2.5
(xz-5.2.5) $ ./configure --prefix=$HOME/.local
(xz-5.2.5) $ make && make install
(xz-5.2.5) $ export PATH="$HOME/.local/bin:$PATH"  # ensure the new xz is found

MOA

gompress partitions a file into pieces of 64 MiB each, sending them to golem nodes where xz is invoked to compress them. the compressed parts are asynchronously retrieved and stitched back together into a cohesive whole that can be decompressed via xz.
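
for intuition, here is a minimal python sketch of the idea (not gompress's actual code); it leans on the fact that the .xz format permits concatenated streams, which xz -d decompresses back into the original file:

PIECE_SIZE = 64 * 1024 * 1024  # 64 MiB, matching the division model above

def partition(path):
    # yield (offset, chunk) pieces of at most PIECE_SIZE bytes
    with open(path, "rb") as f:
        offset = 0
        while chunk := f.read(PIECE_SIZE):
            yield offset, chunk
            offset += len(chunk)

def stitch(compressed_piece_paths, out_path):
    # concatenate the xz-compressed pieces in order; the result is a
    # valid multi-stream .xz file
    with open(out_path, "wb") as out:
        for piece_path in compressed_piece_paths:
            with open(piece_path, "rb") as piece:
                out.write(piece.read())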

the partition ranges are tabulated and all intermediate work is retained. this enables resuming a compression later, for example when network conditions or prices are more favorable. TRY IT on a file larger than 64 MiB: press ctrl-c after at least one task has finished, then rerun the same command to resume.
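
a hypothetical sketch of that bookkeeping (gompress's actual table layout may differ): record each piece's range with a completion flag, and on resume hand out only the unfinished ranges:

import sqlite3

con = sqlite3.connect("workdir/state.db")  # illustrative path
con.execute(
    "CREATE TABLE IF NOT EXISTS piece "
    "(start INTEGER PRIMARY KEY, stop INTEGER, done INTEGER DEFAULT 0)"
)

def pending():
    # ranges not yet compressed; a resumed run starts from these
    return con.execute("SELECT start, stop FROM piece WHERE done = 0").fetchall()

def mark_done(start):
    con.execute("UPDATE piece SET done = 1 WHERE start = ?", (start,))
    con.commit()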

ABOUT ARCHIVING

gompress will tar an input directory and all of its contents. if multiple files are given instead, it changes to the common root shared by all targets before archiving; in particular, if all the target files are in the same subdirectory, the tar file is built relative to that subdirectory so that upon decompression the files are extracted directly into the working directory (sketched below).
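
a short sketch of that behavior (again an illustration, not gompress's code): archive every target relative to the deepest common root, so extraction lands in the working directory:

import os
import tarfile

def archive(targets, out_path):
    paths = [os.path.abspath(t) for t in targets]
    # deepest directory shared by all targets
    root = os.path.commonpath(paths)
    with tarfile.open(out_path, "w") as tar:
        for p in paths:
            # store entries relative to the common root
            tar.add(p, arcname=os.path.relpath(p, root))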

ADVANCED USAGE

ask gompress to perform light local compression first via --xfer-compression-level; this saves significantly on file transfer (xfer), i.e. upload, time

$ python3.9 ./gompress.py --network polygon --subnet-tag public --xfer-compression-level 1 myfile.raw
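
the trade-off behind the flag, illustrated with python's lzma module (a stand-in for the idea; gompress's internals may differ): a low preset compresses quickly yet still cuts upload bytes substantially:

import lzma
import time

with open("myfile.raw", "rb") as f:  # the file from the example above
    raw = f.read()

t0 = time.time()
light = lzma.compress(raw, preset=1)  # light, fast preset
print(f"{len(raw)} -> {len(light)} bytes in {time.time() - t0:.1f}s")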

clone gc__filterms into the project root directory

it just works -- simply set the environment variables:

$ export GNPROVIDER_BL=fascinated-system
$ export FILTERMSVERBOSE=1
$ python3.9 ./gompress.py myfilelarge.raw --network polygon --subnet-tag public

COMMENTS

to get extreme compression on extreme sizes, consider being selective about high-performance nodes on the mainnet. you may find such nodes via my gc__listoffers application [1]. you may also incorporate my gc__filterms by cloning it into, or linking to it from, the project root directory [2].

expect gompress to evolve with golem and to become more performant accordingly, e.g. with improved networking. gompress is, however, continually being optimized within current parameters. stay tuned.

--min-cpu-threads currently might steer tasks toward more modern cpus, but it will not improve timing or compression ratio, as the work is optimally divided to leverage single cores (regardless of how many a provider actually has) on the golem network. partitioning across several nodes, as opposed to one node with several cores, is more efficient because upload streams can run simultaneously instead of waiting on a single stream before starting (sketched below).

for very large files, utilizing 2 cores may help, and this will be a future optimization; it is not relevant at the moment, though, because 1 core is optimal for pieces of ≤ 64 MiB (the current division model).
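
a hypothetical illustration of the simultaneous-streams point above (upload_piece is an assumed coroutine, not a yapapi or gompress API): pieces destined for different providers upload concurrently instead of queuing behind one stream:

import asyncio

async def upload_piece(provider, piece):
    ...  # stream one <= 64 MiB piece to one provider

async def send_all(assignments):
    # assignments: [(provider, piece), ...]; every upload is in flight
    # at once rather than waiting on a single stream to a single node
    await asyncio.gather(*(upload_piece(p, piece) for p, piece in assignments))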