Group files in sets of same total size
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
COPYING
README.md
packo.py

README.md

Please consider using fpart from martymac. Inspired by packo, fpart is a much more powerful tool, written in C.

Packo

packo is a simple program from a very specific need : I needed to transfer multiple terabytes of data using rsync. In my environment (file server, connectivity, etc...) I can use multiple rsync in parallel to improve the throughput but didn't want to think about what data each rsync should transfer, that's why i wrote this program. it may be of interest to you.

It gathers file size information recursively given a path and splits the whole list in sets of approximatively equal size using a greedy algorithm with a hack. Oups, i should say "heuristic".

By default, the program just output what it can do. For example, i want to build 5 sets of files from /usr/share/ :

$ python packo.py /usr/share/doc 5
11202/11202, 220.7 MBytes (Memsize: 1.1 MBytes)

Pack 0: 44.1 MBytes / 2239 files
Pack 1: 44.1 MBytes / 2240 files
Pack 2: 44.1 MBytes / 2239 files
Pack 3: 44.1 MBytes / 2240 files
Pack 4: 44.1 MBytes / 2244 files

You can see that there are 11202 files here and the process requires approximatively 1.1 MBytes of memory to run. I print this information because my implementation is simple : i'm building the full files list in memory. It shouldn't be a concern for most usage. In this example, each set of files is about 44 MBytes.

If a want to write the list of files, i just add a filename name :

$ python packo.py /usr/share/doc 5 mylistoffiles
11202/11202, 220.7 MBytes (Memsize: 1.1 MBytes)

Pack 0: 44.1 MBytes / 2239 files
Pack 1: 44.1 MBytes / 2240 files
Pack 2: 44.1 MBytes / 2239 files
Pack 3: 44.1 MBytes / 2240 files
Pack 4: 44.1 MBytes / 2244 files

It will write 5 files that can be used with the rsync '--files-from' option :

$ head  mylistoffiles*
==> mylistoffiles0 <==
/usr/share/doc/ipython/manual/ipython.pdf.gz
/usr/share/doc/gir1.2-gtk-3.0/changelog.gz
...

==> mylistoffiles1 <==
/usr/share/doc/chromium-inspector/copyright
/usr/share/doc/libgtk-3-0/changelog.gz
...

==> mylistoffiles2 <==
/usr/share/doc/chromium/copyright
/usr/share/doc/libgtk-3-bin/changelog.gz
...

==> mylistoffiles3 <==
/usr/share/doc/chromium-browser/copyright
/usr/share/doc/libgail-3-0/changelog.gz
...

==> mylistoffiles4 <==
/usr/share/doc/xserver-common/changelog.gz
/usr/share/doc/python2.6/html/genindex-all.html
...

If you try to relaunch the command line, the program will exit to prevent overwriting :

$ python packo.py /usr/share/doc 5 mylistoffiles
'mylistoffiles0' already exists, aborting.

Nothing magic or very interesting here :)

Just for fun :

$ python packo.py /usr 42 out
112950/112950, 3.7 Gbytes (Memsize: 12.0 MBytes)

Pack 0: 134.2 MBytes / 1 files
Pack 1: 88.0 MBytes / 2592 files
Pack 2: 88.0 MBytes / 2612 files
Pack 3: 88.0 MBytes / 2655 files
Pack 4: 88.0 MBytes / 2655 files
Pack 5: 88.0 MBytes / 2655 files
Pack 6: 88.0 MBytes / 2655 files
Pack 7: 88.0 MBytes / 2655 files
Pack 8: 88.0 MBytes / 6802 files
...
Pack 41: 88.0 MBytes / 2657 files

$ cat out0
/usr/share/icons/oxygen/icon-theme.cache
$ ls -h /usr/share/icons/oxygen/icon-theme.cache
$ ls -lh /usr/share/icons/oxygen/icon-theme.cache
-rw-r--r-- 1 root root 135M Sep 30 20:54 /usr/share/icons/oxygen/icon-theme.cache

Dependencies

python>=2.6 (including python 3)