Throughput 100x lower than M.2 SSD...! #1763

@Masterxilo

Hi there. I love this project and am considering it as a solution for my stuff/DataHoarding/whatever :)

However, I need any alternative to keeping files directly on my disk to perform not much worse than the local disk when it comes to reading and writing large files.

Analysis

I have been playing a bit with Perkeep and I observe a slowdown of more than 100x compared to plain writes to disk.

This is on a vanilla Ubuntu installation (ext4 on LVM, M.2 NVMe disk, 64 GB RAM):

(WARNING: this deletes your Perkeep config and data; consider running it in a throwaway container, e.g. docker run --rm -it ubuntu ...)

# next line only needed for ubuntu minimal container...:
which wget || ( echo "installing dependencies..." ; apt-get update >&/dev/null && apt-get install -y sudo wget jq moreutils bc >&/dev/null ) ; su ubuntu

# ====================== SETUP ======================
unset CAMLI_CONFIG_DIR
cd ~ ; rm -rf  perkeep-test ; mkdir perkeep-test ; cd perkeep-test


wget --no-clobber https://github.com/perkeep/perkeep/releases/download/v0.12/perkeep-linux-amd64.tar.gz --output-document=/tmp/perkeep-linux-amd64.tar.gz
tar -xvzf /tmp/perkeep-linux-amd64.tar.gz

# WARNING: delete/reset your users perkeep files (data & config)
rm -rf ~/var/perkeep ~/.config/perkeep

# generate explicit default config and OPTIONAL: copy it here
fuser -k 3179/tcp || true # kill everything listening on port 3179
./perkeepd &
pid=$!
sleep 1 ; curl http://localhost:3179/debug/config > ./server-config-lowlevel.json # same as pk dumpconfig ?
kill -9 $pid
./pk-put init
cp --verbose ~/.config/perkeep/* .

# make config explicit (this is OPTIONAL, doesn't change the results)
export CAMLI_CONFIG_DIR="$(realpath .)"
#jq '.identitySecretRing = "'$CAMLI_CONFIG_DIR'/identity-secring.gpg"' server-config.json | sponge server-config.json # optional, if you also move
jq '.runIndex = false' server-config.json | sponge server-config.json # OPTIONAL disable index: set "runIndex": false in server-config.json

# ====================== THROUGHPUT COMPARISON TEST HARNESS ======================
./perkeepd &> perkeepd.log &
pid=$! # remember the PID of this server instance so the cleanup step kills the right process
# tail -f perkeepd.log # reveals how busy this is...

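# 1 GiB of random data, written in 1 MiB blocks; iflag=fullblock (GNU dd) guards against short reads from /dev/urandom
FILESIZE=$((1024*1024*1024))
dd if=/dev/urandom of=./testfile bs=1M count=$((FILESIZE / 1024 / 1024)) iflag=fullblock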

# "$@" is the command to time
function time_and_compute_throughput() {
    echo >&2 "=== Running: $* ==="
    START=$(date +%s.%N)
    "$@"
    END=$(date +%s.%N)
    DIFF=$(echo "$END - $START" | bc)
    THROUGHPUT=$(echo "scale=2; $FILESIZE / $DIFF / (1024*1024)" | bc)
    echo >&2 "Time taken: $DIFF seconds"
    echo >&2 "Throughput: $THROUGHPUT MiB/s"
}

# ====================== THROUGHPUT COMPARISON ======================
time_and_compute_throughput cp --verbose ./testfile ./testfile-copy-on-disk
time_and_compute_throughput cat < ./testfile > ./testfile-copy-via-cat
time_and_compute_throughput ./pk-put file ./testfile

# Cleanup
kill -9 $pid

Results

with the index (default config)

=== Running: cp --verbose ./testfile ./testfile-copy-on-disk ===
'./testfile' -> './testfile-copy-on-disk'
Time taken: .521222847 seconds
Throughput: 1964.61 MiB/s
=== Running: cat ===
Time taken: .513238355 seconds
Throughput: 1995.17 MiB/s
=== Running: ./pk-put file ./testfile ===
sha224-466b4b94a022042a0e25aa293818b10256ac221c49729e950ca5b991
Time taken: 199.234109343 seconds
Throughput: 5.13 MiB/s

with runIndex = false

=== Running: ./pk-put file ./testfile ===
sha224-b32fd3dd0811bad5a219b68a93ce9186ce09ffcc17bc0fc9002b99a6
Time taken: 58.304873106 seconds
Throughput: 17.56 MiB/s
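
To put the figures above in proportion, the slowdown factors relative to the plain cp baseline can be computed directly from the reported throughputs. This is only a small sketch: the helper name slowdown_factor is made up for illustration, and it uses awk (which rounds) rather than the bc call from the harness (which truncates).

```shell
# Hypothetical helper: ratio of baseline throughput to measured throughput.
slowdown_factor() {
    # $1 = baseline MiB/s, $2 = measured MiB/s; result rounded to one decimal place
    awk -v base="$1" -v measured="$2" 'BEGIN { printf "%.1f\n", base / measured }'
}

slowdown_factor 1964.61 5.13    # cp vs. pk-put with the index (default config)
slowdown_factor 1964.61 17.56   # cp vs. pk-put with runIndex = false
```

With these inputs the factors come out at roughly 383x with the index and roughly 112x with the index disabled.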

Is this an unfair comparison, because the non-Perkeep tests might be served by the page cache rather than actually hitting the disk? (I do see disk activity in the system monitor, though.)
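
One way to rule out caching as a factor in the baseline is to force the copy to be flushed to the device before the clock stops. The sketch below is an assumption about how such a baseline could look, not something that was run for the results above. It relies on GNU dd's conv=fsync and status=none options; the function name copy_with_fsync and the output file name are made up for illustration.

```shell
# Hypothetical baseline: copy a file and return only after the written data
# has been flushed to storage (conv=fsync makes dd call fsync() before exiting).
copy_with_fsync() {
    # $1 = source file, $2 = destination file
    dd if="$1" of="$2" bs=1M conv=fsync status=none
}

# Usage with the harness from the script above:
#   time_and_compute_throughput copy_with_fsync ./testfile ./testfile-copy-fsync
#
# To make the read side cold as well, the page cache can be dropped first (needs root):
#   sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
```

If the fsync'd baseline is still in the GiB/s range, the gap to pk-put is not explained by caching alone.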

Am I looking at the fastest possible configuration here? I know I could use a different host filesystem (maybe even tmpfs/ramfs) so that the filesystem does not bottleneck Perkeep, but my dataset will need a few TB, so keeping it in RAM is not an option.

What's the best blobserver config/implementation for the perkeepd server when it comes to raw throughput?

I am eager to contribute, probably a blobserver implementation in C, but I would like to know what's the best I am up against :)
