Hi there. I love this project and I am considering it as a solution for my stuff/DataHoarding/whatever :)
However, I would like any alternative to keeping files directly on my disk to perform not much worse than the local disk when reading and writing large files.
Analysis
I have been playing a bit with perkeep and I observe a slowdown of roughly 100x (with the index disabled) to almost 400x (default config) compared to plain writes to disk.
This is on a vanilla Ubuntu installation (ext4 on LVM, M.2 NVMe disk, 64 GB RAM):
(WARNING: this deletes your perkeep config and data; consider running it in a throwaway container, e.g. docker run --rm -it ubuntu ...)
# next line only needed for ubuntu minimal container...:
which wget || ( echo "installing dependencies..." ; apt-get update >&/dev/null && apt-get install -y sudo wget jq moreutils bc >&/dev/null ) ; su ubuntu
# ====================== SETUP ======================
unset CAMLI_CONFIG_DIR
cd ~ ; rm -rf perkeep-test ; mkdir perkeep-test ; cd perkeep-test
wget --no-clobber https://github.com/perkeep/perkeep/releases/download/v0.12/perkeep-linux-amd64.tar.gz --output-document=/tmp/perkeep-linux-amd64.tar.gz
tar -xvzf /tmp/perkeep-linux-amd64.tar.gz
# WARNING: delete/reset your users perkeep files (data & config)
rm -rf ~/var/perkeep ~/.config/perkeep
# generate the explicit default config and (OPTIONAL) copy it here
fuser -k 3179/tcp || true # kill everything listening on port 3179
./perkeepd &
pid=$!
sleep 1 ; curl http://localhost:3179/debug/config > ./server-config-lowlevel.json # same as pk dumpconfig ?
kill -9 $pid
./pk-put init
cp --verbose ~/.config/perkeep/* .
# make config explicit (this is OPTIONAL, doesn't change the results)
export CAMLI_CONFIG_DIR="$(realpath .)"
#jq '.identitySecretRing = "'$CAMLI_CONFIG_DIR'/identity-secring.gpg"' server-config.json | sponge server-config.json # optional, if you also move
jq '.runIndex = false' server-config.json | sponge server-config.json # OPTIONAL disable index: set "runIndex": false in server-config.json
# ====================== THROUGHPUT COMPARISON TEST HARNESS ======================
./perkeepd &> perkeepd.log &
pid=$! # remember this server's PID so the cleanup step at the end kills the right process
# tail -f perkeepd.log # reveals how busy this is...
# 1 GiB of random data
FILESIZE=$((1024*1024*1024))
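dd if=/dev/urandom of=./testfile bs=1M count=$((FILESIZE / 1024 / 1024)) iflag=fullblock # 1 MiB blocks: a single 1 GiB read from /dev/urandom can return short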
# "$@" is the command to time
function time_and_compute_throughput() {
echo >&2 "=== Running: $* ==="
START=$(date +%s.%N)
"$@"
END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)
THROUGHPUT=$(echo "scale=2; $FILESIZE / $DIFF / (1024*1024)" | bc)
echo >&2 "Time taken: $DIFF seconds"
echo >&2 "Throughput: $THROUGHPUT MiB/s"
}
# ====================== THROUGHPUT COMPARISON ======================
time_and_compute_throughput cp --verbose ./testfile ./testfile-copy-on-disk
time_and_compute_throughput cat < ./testfile > ./testfile-copy-via-cat
time_and_compute_throughput ./pk-put file ./testfile
# Cleanup
kill -9 $pid
Results
with the index (default config)
=== Running: cp --verbose ./testfile ./testfile-copy-on-disk ===
'./testfile' -> './testfile-copy-on-disk'
Time taken: .521222847 seconds
Throughput: 1964.61 MiB/s
=== Running: cat ===
Time taken: .513238355 seconds
Throughput: 1995.17 MiB/s
=== Running: ./pk-put file ./testfile ===
sha224-466b4b94a022042a0e25aa293818b10256ac221c49729e950ca5b991
Time taken: 199.234109343 seconds
Throughput: 5.13 MiB/s
with runIndex = false
=== Running: ./pk-put file ./testfile ===
sha224-b32fd3dd0811bad5a219b68a93ce9186ce09ffcc17bc0fc9002b99a6
Time taken: 58.304873106 seconds
Throughput: 17.56 MiB/s
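To put the numbers above in relation to each other, here is a small sketch that computes the slowdown factors from the reported throughput values. The helper name `slowdown` and the variable names are only for illustration; it uses awk instead of bc so it runs without extra packages.

```bash
#!/usr/bin/env bash
# Throughput figures (MiB/s) copied from the results above.
CP_MIBS=1964.61
PK_INDEX_MIBS=5.13
PK_NOINDEX_MIBS=17.56

# slowdown BASELINE MEASURED: prints baseline / measured with one decimal.
slowdown() {
  awk -v base="$1" -v measured="$2" 'BEGIN { printf "%.1f\n", base / measured }'
}

echo "pk-put with index:    $(slowdown "$CP_MIBS" "$PK_INDEX_MIBS")x slower than cp"
echo "pk-put without index: $(slowdown "$CP_MIBS" "$PK_NOINDEX_MIBS")x slower than cp"
```

With the figures above this comes out at about 383x for the default config and about 112x with runIndex = false.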
Is this an unfair comparison because I might not actually be hitting the disk with the non-perkeep tests? (I am seeing activity in the system monitor, though.)
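One way to check this would be to include the flush to disk in the timed baseline, so that it does not only measure writes into the page cache. The following is a sketch under assumptions, not part of the measurements above: it uses GNU dd with conv=fsync, which syncs the output file before dd exits, and a hypothetical 16 MiB scratch file in a temp directory instead of the 1 GiB test file.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch files, much smaller than the 1 GiB test file above.
workdir=$(mktemp -d)
src="$workdir/testfile"
dst="$workdir/testfile-copy-fsync"
size_mib=16

dd if=/dev/urandom of="$src" bs=1M count="$size_mib" iflag=fullblock status=none

start=$(date +%s.%N)
# conv=fsync makes dd flush the output file to disk before it exits,
# so the elapsed time includes the write-back and not just the page cache.
dd if="$src" of="$dst" bs=1M conv=fsync status=none
end=$(date +%s.%N)

elapsed=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.6f", e - s }')
# Guard against a zero elapsed time on very coarse clocks.
throughput=$(awk -v n="$size_mib" -v t="$elapsed" 'BEGIN { printf "%.2f", (t > 0) ? n / t : 0 }')

echo "Time taken: $elapsed seconds"
echo "Throughput: $throughput MiB/s"
# Remove "$workdir" when done.
```

If the throughput measured this way is close to the cp and cat numbers, the baseline was already hitting the disk. If it is much lower, the gap to pk-put is smaller than the raw numbers suggest.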
Am I looking at the fastest possible implementation of this? I know I could of course use a different host filesystem (maybe even tmpfs/ramfs) so that the filesystem does not bottleneck perkeep, but my dataset will need a few TB.
What's the best blobserver config/implementation for the perkeepd server when it comes to raw throughput?
I am eager to contribute, probably with a blobserver implementation in C, but first I would like to know what the best existing option is that I am up against :)