
[QUESTION] Very disappointing performance, is this expected? #3037

Open
liyimeng opened this issue Sep 20, 2021 · 55 comments
Labels
area/performance System, volume performance area/v1-data-engine v1 data engine (iSCSI tgt) kind/question Please use `discussion` to ask questions instead
Comments

@liyimeng

Question

I ran a fio test with Longhorn 1.1.2 and the result is kind of disappointing. On the native SSD, 4K random write reaches IOPS=42.5k, BW=166MiB/s, while on Longhorn I only get about IOPS=6k, BW=16MiB/s.

Is this the expected Longhorn performance?

Environment:

  • Longhorn version: 1.1.2
  • Kubernetes version: 1.21
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 40
    • Memory per node: 512 GB
    • Disk type: SATA SSD
    • Network bandwidth and latency between the nodes: 10GbE x 2 with link aggregation, jumbo frames enabled
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): bare metal

Additional context

SSD test (not the raw disk; the disk is already formatted as ext4 and mounted)

fio -direct=1 -iodepth 16 -thread -rw=randwrite -ioengine=libaio -numjobs=16 -runtime=300 -group_reporting -name=4K-100Write100random -fsync=0 -bs=4k -end-fsync=1 -size=2g
4K-100Write100random: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.27
Starting 16 threads
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
4K-100Write100random: Laying out IO file (1 file / 2048MiB)
^Cbs: 16 (f=16): [w(16)][23.7%][w=171MiB/s][w=43.8k IOPS][eta 02m:31s]
fio: terminating on signal 2

4K-100Write100random: (groupid=0, jobs=16): err= 0: pid=79351: Mon Sep 20 13:07:50 2021
  write: IOPS=42.5k, BW=166MiB/s (174MB/s)(7817MiB/47039msec); 0 zone resets
    slat (usec): min=4, max=183526, avg=15.49, stdev=438.11
    clat (usec): min=140, max=286711, avg=6000.97, stdev=2266.19
     lat (usec): min=161, max=286722, avg=6016.56, stdev=2344.46
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    6], 10.00th=[    6], 20.00th=[    6],
     | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    6], 80.00th=[    6], 90.00th=[    7], 95.00th=[    7],
     | 99.00th=[    7], 99.50th=[   11], 99.90th=[   28], 99.95th=[   38],
     | 99.99th=[  142]
   bw (  KiB/s): min=122128, max=179328, per=100.00%, avg=170262.37, stdev=611.52, samples=1504
   iops        : min=30532, max=44832, avg=42565.56, stdev=152.88, samples=1504
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.03%, 10=99.40%, 20=0.36%, 50=0.17%
  lat (msec)   : 100=0.02%, 250=0.02%, 500=0.01%
  cpu          : usr=0.46%, sys=3.22%, ctx=1278629, majf=0, minf=3097
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2001151,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=166MiB/s (174MB/s), 166MiB/s-166MiB/s (174MB/s-174MB/s), io=7817MiB (8197MB), run=47039-47039msec

Disk stats (read/write):
  sdd: ios=20/1997114, merge=0/85763, ticks=123/11832844, in_queue=8067712, util=99.85%

Longhorn with 3 replicas, testing directly against the Longhorn volume device


fio -direct=1 -filename=/dev/longhorn/testssd   -iodepth 16 -thread -rw=randwrite -ioengine=libaio  -numjobs=16 -runtime=300 -group_reporting -name=4K-100Write100random -fsync=1 -bs=4k  -end-fsync=1 -size=2g
4K-100Write100random: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.27
Starting 16 threads
^Cbs: 16 (f=16): [w(16)][15.0%][w=24.3MiB/s][w=6230 IOPS][eta 04m:15s]]
fio: terminating on signal 2
Jobs: 16 (f=0): [f(16)][100.0%][w=24.5MiB/s][w=6263 IOPS][eta 00m:00s]
4K-100Write100random: (groupid=0, jobs=16): err= 0: pid=10354: Mon Sep 20 13:10:33 2021
  write: IOPS=6167, BW=24.1MiB/s (25.3MB/s)(1105MiB/45870msec); 0 zone resets
    slat (nsec): min=1572, max=184830, avg=6255.05, stdev=3755.88
    clat (usec): min=783, max=86783, avg=38909.39, stdev=2512.99
     lat (usec): min=805, max=86790, avg=38915.77, stdev=2513.02
    clat percentiles (usec):
     |  1.00th=[34341],  5.00th=[35390], 10.00th=[35914], 20.00th=[36963],
     | 30.00th=[38011], 40.00th=[38536], 50.00th=[39060], 60.00th=[39584],
     | 70.00th=[40109], 80.00th=[40633], 90.00th=[41157], 95.00th=[42206],
     | 99.00th=[46924], 99.50th=[47449], 99.90th=[48497], 99.95th=[56361],
     | 99.99th=[73925]
   bw (  KiB/s): min=22046, max=27008, per=100.00%, avg=24673.82, stdev=64.38, samples=1456
   iops        : min= 5504, max= 6752, avg=6168.37, stdev=16.11, samples=1456
  lat (usec)   : 1000=0.01%
  lat (msec)   : 4=0.01%, 10=0.02%, 20=0.03%, 50=99.87%, 100=0.06%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=22, max=11380, avg=100.76, stdev=77.14
    sync percentiles (nsec):
     |  1.00th=[   46],  5.00th=[   47], 10.00th=[   48], 20.00th=[   54],
     | 30.00th=[   61], 40.00th=[   73], 50.00th=[   86], 60.00th=[  103],
     | 70.00th=[  119], 80.00th=[  133], 90.00th=[  169], 95.00th=[  209],
     | 99.00th=[  302], 99.50th=[  338], 99.90th=[  506], 99.95th=[  604],
     | 99.99th=[ 1128]
  cpu          : usr=0.15%, sys=0.60%, ctx=307774, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=199.8%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,282917,0,282693 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=24.1MiB/s (25.3MB/s), 24.1MiB/s-24.1MiB/s (25.3MB/s-25.3MB/s), io=1105MiB (1159MB), run=45870-45870msec

Disk stats (read/write):
  sdk: ios=0/562755, merge=0/0, ticks=0/1748630, in_queue=296688, util=99.84%

Longhorn with 1 replica and data locality (to exclude potential negative network impact)

fio -direct=1 -filename=/dev/longhorn/testlhlocal    -iodepth 16 -thread -rw=randwrite -ioengine=libaio  -numjobs=16 -runtime=300 -group_reporting -name=4K-100Write100random -fsync=1 -bs=4k  -end-fsync=1 -size=2g
4K-100Write100random: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.27
Starting 16 threads
^Cbs: 16 (f=16): [w(16)][18.3%][w=27.6MiB/s][w=7060 IOPS][eta 04m:05s]
fio: terminating on signal 2

4K-100Write100random: (groupid=0, jobs=16): err= 0: pid=30445: Mon Sep 20 13:15:09 2021
  write: IOPS=6922, BW=27.0MiB/s (28.4MB/s)(1493MiB/55193msec); 0 zone resets
    slat (usec): min=2, max=543, avg= 5.85, stdev= 3.61
    clat (usec): min=728, max=96857, avg=34666.00, stdev=2460.70
     lat (usec): min=832, max=96881, avg=34671.97, stdev=2461.14
    clat percentiles (usec):
     |  1.00th=[29230],  5.00th=[32637], 10.00th=[32900], 20.00th=[33162],
     | 30.00th=[33424], 40.00th=[33817], 50.00th=[33817], 60.00th=[34341],
     | 70.00th=[35390], 80.00th=[35914], 90.00th=[38011], 95.00th=[39584],
     | 99.00th=[42206], 99.50th=[42730], 99.90th=[46924], 99.95th=[48497],
     | 99.99th=[72877]
   bw (  KiB/s): min=22146, max=28856, per=99.98%, avg=27686.71, stdev=78.03, samples=1760
   iops        : min= 5530, max= 7214, avg=6921.60, stdev=19.53, samples=1760
  lat (usec)   : 750=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.02%, 50=99.91%
  lat (msec)   : 100=0.05%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=22, max=18331, avg=103.58, stdev=83.05
    sync percentiles (nsec):
     |  1.00th=[   47],  5.00th=[   50], 10.00th=[   55], 20.00th=[   63],
     | 30.00th=[   71], 40.00th=[   78], 50.00th=[   87], 60.00th=[  105],
     | 70.00th=[  113], 80.00th=[  129], 90.00th=[  171], 95.00th=[  209],
     | 99.00th=[  290], 99.50th=[  322], 99.90th=[  474], 99.95th=[  572],
     | 99.99th=[  956]
  cpu          : usr=0.15%, sys=0.60%, ctx=414609, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=199.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,382093,0,381869 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=27.0MiB/s (28.4MB/s), 27.0MiB/s-27.0MiB/s (28.4MB/s-28.4MB/s), io=1493MiB (1565MB), run=55193-55193msec

Disk stats (read/write):
  sdl: ios=0/762806, merge=0/1, ticks=0/2094405, in_queue=92300, util=99.86%

@liyimeng liyimeng added the kind/question Please use `discussion` to ask questions instead label Sep 20, 2021
@liyimeng
Author

According to https://longhorn.io/blog/performance-scalability-report-aug-2020/ , the bandwidth should be expected to be close to that of the native disk, but my result seems about 10 times lower. What could be wrong?

@yasker
Member

yasker commented Sep 20, 2021

Can you try https://github.com/yasker/kbench ? In your fio job, you're testing bandwidth with a 4k block size, which is mostly used for IOPS tests (since the block size is small). Also, with 16 jobs running at the same time, I think CPU might become a point of contention.
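For anyone following along, kbench runs as a Kubernetes Job against a PVC. A rough sketch of the usual deployment, assuming the manifest path from the kbench README (verify it against the repository before applying):

# Assumed manifest path; check https://github.com/yasker/kbench for the current one.
kubectl apply -f https://raw.githubusercontent.com/yasker/kbench/main/deploy/fio.yaml
# Follow the benchmark output; the Job is labeled kbench=fio, as seen in the logs below.
kubectl logs -l kbench=fio -f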

@liyimeng
Author

@yasker Thanks for your attention!
Yes, I was actually targeting IOPS. I ran kbench as you suggested; here is the outcome.

kubectl logs   -f  -l kbench=fio
TEST_FILE: /volume/test
TEST_OUTPUT_PREFIX: test_device
TEST_SIZE: 30G
Benchmarking iops.fio into test_device-iops.json
Benchmarking bandwidth.fio into test_device-bandwidth.json
Benchmarking latency.fio into test_device-latency.json

=====================
FIO Benchmark Summary
For: test_device
SIZE: 30G
QUICK MODE: DISABLED
=====================
IOPS (Read/Write)
        Random:           28,832 / 5,730
    Sequential:          51,415 / 11,329
  CPU Idleness:                      86%

Bandwidth in KiB/sec (Read/Write)
        Random:      1,034,002 / 278,604
    Sequential:      1,047,018 / 221,776
  CPU Idleness:                      80%

Latency in ns (Read/Write)
        Random:      303,098 / 1,029,192
    Sequential:        371,862 / 934,854
  CPU Idleness:                      93%

I was expecting better IOPS. My SSD reaches 50K+ IOPS raw, and 47K IOPS when formatted as ext4.

Have I done something wrong?

@yasker
Member

yasker commented Sep 20, 2021

@liyimeng I think your latest result looks valid. There is performance overhead in using Longhorn, though Longhorn should still be better than most other software-defined distributed storage solutions out there due to its simple architecture. The only thing I don't quite understand is the IOPS discrepancy between the random RW and sequential RW. You can try the comparison mode in kbench, using the local-path provisioner vs Longhorn, to get a relative result and see what the overhead is.

In general, the read performed very well IMO. Your write latency is slightly higher than expected though it's hard for me to tell why ATM.
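For reference, a minimal sketch of the storage side of such a comparison run, assuming the kbench comparison job mounts two volumes (the /volume1 and /volume2 paths that show up in the logs below); the PVC names and sizes are illustrative, and the actual manifest shipped with kbench should be used as the starting point:

# Hypothetical PVC pair for a kbench comparison run.
# One volume comes from the local-path provisioner, the other from Longhorn,
# so the job can benchmark both and report the relative overhead.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kbench-local-path
spec:
  storageClassName: local-path
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 33Gi   # leave headroom above the 30G test size
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kbench-longhorn
spec:
  storageClassName: longhorn
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 33Gi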

@liyimeng
Author

@yasker Hi, here come the results of local-path vs Longhorn. Local-path can easily maintain 40K+ write IOPS, while Longhorn shows a huge drop.

kubectl logs   -f  -l kbench=fio
TEST_FILE: /volume1/test
TEST_OUTPUT_PREFIX: Local-Path
TEST_SIZE: 30G
Benchmarking iops.fio into Local-Path-iops.json
Benchmarking bandwidth.fio into Local-Path-bandwidth.json
Benchmarking latency.fio into Local-Path-latency.json
TEST_FILE: /volume2/test
TEST_OUTPUT_PREFIX: Longhorn
TEST_SIZE: 30G
Benchmarking iops.fio into Longhorn-iops.json
Benchmarking bandwidth.fio into Longhorn-bandwidth.json
Benchmarking latency.fio into Longhorn-latency.json

================================
FIO Benchmark Comparsion Summary
For: Local-Path vs Longhorn
SIZE: 30G
QUICK MODE: DISABLED
================================
                              Local-Path   vs                 Longhorn    :              Change
IOPS (Read/Write)
        Random:          58,550 / 40,665   vs           29,420 / 6,897    :   -49.75% / -83.04%
    Sequential:          61,937 / 41,463   vs          46,028 / 13,470    :   -25.69% / -67.51%
  CPU Idleness:                      96%   vs                      85%    :                -11%

Bandwidth in KiB/sec (Read/Write)
        Random:        406,288 / 262,054   vs        493,542 / 337,472    :     21.48% / 28.78%
    Sequential:        485,302 / 307,509   vs        593,200 / 366,985    :     22.23% / 19.34%
  CPU Idleness:                      97%   vs                      85%    :                -12%

Latency in ns (Read/Write)
        Random:        131,945 / 129,217   vs        392,504 / 619,945    :   197.48% / 379.77%
    Sequential:        141,914 / 114,732   vs        394,481 / 615,687    :   177.97% / 436.63%
  CPU Idleness:                      95%   vs                      92%    :                 -3%

What confuses me even more is that Longhorn achieves higher write bandwidth! I actually posted another result earlier, which showed a similar outcome, but I then found out that Longhorn was running on a different model of SSD. So I deleted it and re-tested, making sure both local-path and Longhorn run on the same model of SSD.

@liyimeng
Author

Repeating the same test seems to just confirm the previous result.

kubectl logs   -f  -l kbench=fio
TEST_FILE: /volume1/test
TEST_OUTPUT_PREFIX: Local-Path
TEST_SIZE: 30G
Benchmarking iops.fio into Local-Path-iops.json
Benchmarking bandwidth.fio into Local-Path-bandwidth.json
Benchmarking latency.fio into Local-Path-latency.json
TEST_FILE: /volume2/test
TEST_OUTPUT_PREFIX: Longhorn
TEST_SIZE: 30G
Benchmarking iops.fio into Longhorn-iops.json
Benchmarking bandwidth.fio into Longhorn-bandwidth.json
Benchmarking latency.fio into Longhorn-latency.json

================================
FIO Benchmark Comparsion Summary
For: Local-Path vs Longhorn
SIZE: 30G
QUICK MODE: DISABLED
================================
                              Local-Path   vs                 Longhorn    :              Change
IOPS (Read/Write)
        Random:          58,386 / 43,299   vs           27,357 / 6,709    :   -53.14% / -84.51%
    Sequential:          62,129 / 42,527   vs          51,477 / 13,376    :   -17.14% / -68.55%
  CPU Idleness:                      97%   vs                      86%    :                -11%

Bandwidth in KiB/sec (Read/Write)
        Random:        414,700 / 333,020   vs        495,104 / 340,691    :      19.39% / 2.30%
    Sequential:        490,799 / 360,981   vs        598,450 / 369,216    :      21.93% / 2.28%
  CPU Idleness:                      95%   vs                      86%    :                 -9%

Latency in ns (Read/Write)
        Random:        131,863 / 122,037   vs        393,911 / 648,344    :   198.73% / 431.27%
    Sequential:        141,507 / 111,135   vs        391,161 / 639,665    :   176.43% / 475.57%
  CPU Idleness:                      96%   vs                      92%    :                 -4%

@yasker
Member

yasker commented Sep 21, 2021

@liyimeng The higher write bandwidth is probably just due to fluctuations in the test. Increasing the test size might help.

The latest result is consistent with what we've observed and is expected. The main reason for the IOPS drop is the latency increase, which comes from Longhorn adding additional layers on top of native disks for HA, snapshots, and other mechanisms.

It's very hard to achieve near-native performance in terms of IOPS or latency. Longhorn is already one of the fastest SDS solutions out there (with a similar functionality set). We're working on a prototype engine which is able to do so, but it will likely need a couple of years to reach feature parity with the current mature Longhorn engine.

@liyimeng
Author

liyimeng commented Sep 21, 2021

@yasker Thanks a lot! I understand the added latency is unfortunately unavoidable when more work needs to be done. Is it possible to increase the IOPS by introducing some kind of parallelism? I guess in the current implementation the underlying disk still has spare bandwidth. The Mayastor folks say they have borrowed some ideas from the NVMe implementation, which significantly improves IOPS. Is that something Longhorn can try? I know little about storage, but their numbers look tempting.

BTW, jumbo frames don't seem to help, even though they usually improve iSCSI performance.

@yasker
Member

yasker commented Sep 21, 2021

@liyimeng IOs are already happening in parallel. The new prototype Longhorn is building is based on SPDK (which is also used by Mayastor and other storage vendors), which should provide near-native performance. However, it's going to be hard to further optimize the current Longhorn engine. You might be able to increase performance a bit more by raising the iSCSI queue depth, but that will result in more CPU consumption as well.
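As an illustration only: raising the per-session limits of the open-iscsi initiator is the generic way to experiment with iSCSI queue depth. This is standard open-iscsi tuning, not a documented Longhorn setting, and the values below are placeholders, so treat it as a sketch:

# /etc/iscsi/iscsid.conf (excerpt) -- illustrative values, not a Longhorn recommendation
node.session.cmds_max = 256      # maximum outstanding commands per session
node.session.queue_depth = 64    # per-LUN queue depth

New sessions pick up these values; existing sessions typically need to be re-created, and higher queue depths trade extra CPU for throughput, as noted above.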

@jenting jenting added this to New in Community Issue Review via automation Sep 21, 2021
@jenting jenting moved this from New to In progress in Community Issue Review Sep 22, 2021
@liyimeng
Author

@yasker I am still not getting the full picture here. Take the example output above: the latency is about 600 us, i.e. 0.6 ms. For a single write thread that is about 1,600 IOPS, yet 32 threads did not end up at 50K (1,600 x 32) IOPS, but at 6K. Why does parallelism drop off that much, even though the writes are not contending for the same file?

@yasker
Member

yasker commented Sep 30, 2021

There are some inefficiencies in the data path, like CPU contention, context switches, memory copies, the efficiency of the protocol, etc. In general, it won't be 1,600 x 32 even if you have 32 threads. Also, Longhorn uses 16 threads for each volume (instead of 32).
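A back-of-the-envelope check with the numbers from the 3-replica fio run earlier in this thread (plain Little's law, nothing Longhorn-specific) illustrates the point: 16 jobs at iodepth 16 keep about 256 I/Os in flight, and the measured completion latency in that run was roughly 38.9 ms, so

\text{IOPS} \approx \frac{\text{I/Os in flight}}{\text{mean latency}} = \frac{16 \times 16}{0.0389\,\text{s}} \approx 6580

which matches the measured ~6.1k IOPS. The backend saturates around that rate, so queuing more requests mostly inflates latency (from the ~0.6 ms seen at low queue depth up to ~39 ms) instead of adding throughput.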

@liyimeng
Author

Thanks for sharing the insight! @yasker We will have to wait for the next-generation Longhorn engine then :D

@shuo-wu
Contributor

shuo-wu commented Oct 7, 2021

Actually, there is a setting named Disable Revision Counter. By design, it can increase IOPS performance a little bit. But the risk is that there may be data loss/inconsistency when all replicas crash simultaneously and auto salvage is triggered.

Notice that it is a relatively dangerous setting in terms of HA. That's why we did not enable it by default. If you are interested, you can give it a quick try.

@shuo-wu shuo-wu moved this from In progress to Pending user response in Community Issue Review Oct 7, 2021
@liyimeng
Author

@shuo-wu I tried to disable the revision counter, but in the UI it says:

Disable Revision Counter:
Required.
This setting is only for volumes created by UI. 
....

How can that be applied to a dynamically provisioned volume like the one kbench is using?

@shuo-wu
Contributor

shuo-wu commented Oct 26, 2021

@liyimeng Maybe you can create such a volume with a PV/PVC in the UI first, then modify the kbench deployment YAML so that it uses the existing Longhorn volume for testing.
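If you would rather keep dynamic provisioning, a sketch of a dedicated StorageClass for the test could look like the following; the revisionCounterDisabled parameter is taken from Longhorn's example StorageClass, but double-check that your Longhorn version supports it before relying on it:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-no-revcounter        # illustrative name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  revisionCounterDisabled: "true"     # assumed parameter; verify against the Longhorn docs

Pointing the kbench PVC at this class would let the dynamically provisioned volume be created with the revision counter disabled.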

@liyimeng
Author

Thanks, I will see if I get a chance to do so and will report back if I make it.

@liyimeng
Author

liyimeng commented Nov 6, 2021

@shuo-wu The numbers were not impressive, but since there was too much noise in the cluster, I discarded it as an unfair test. I see QingStor presenting their solution NeonIO; it is really impressive. If #3202 is ready, will Longhorn catch up by any chance?

@liyimeng
Author

Can this be re-opened? It is time to look at performance for Longhorn, I guess?

@innobead innobead reopened this Dec 17, 2021
Community Issue Review automation moved this from Pending user response to New Dec 17, 2021
@longhorn longhorn deleted a comment from github-actions bot Dec 17, 2021
@longhorn longhorn deleted a comment from github-actions bot Dec 17, 2021
@innobead innobead removed the stale label Dec 17, 2021
@joshimoo joshimoo added the area/performance System, volume performance label Dec 18, 2021
@innobead innobead modified the milestones: Planning, Backlog Mar 21, 2022
@LarsBingBong

Another day - another test.


The test result with hardware version "ESXi 7.0 U2 and later" (VM version 19) on the worker nodes, on the VMware HCI, was:

TEST_FILE: /volume1/test
TEST_OUTPUT_PREFIX: Local-Path
TEST_SIZE: 39G
Benchmarking iops.fio into Local-Path-iops.json
Benchmarking bandwidth.fio into Local-Path-bandwidth.json
Benchmarking latency.fio into Local-Path-latency.json
TEST_FILE: /volume2/test
TEST_OUTPUT_PREFIX: Longhorn
TEST_SIZE: 39G
Benchmarking iops.fio into Longhorn-iops.json
Benchmarking bandwidth.fio into Longhorn-bandwidth.json
Benchmarking latency.fio into Longhorn-latency.json

================================
FIO Benchmark Comparsion Summary
For: Local-Path vs Longhorn
CPU Idleness Profiling: disabled
Size: 39G
Quick Mode: disabled
================================
                              Local-Path   vs                 Longhorn    :              Change
IOPS (Read/Write)
        Random:           47,687 / 9,780   vs              6,935 / 258    :   -85.46% / -97.36%
    Sequential:          39,378 / 12,350   vs             10,995 / 578    :   -72.08% / -95.32%

Bandwidth in KiB/sec (Read/Write)
        Random:      1,075,840 / 250,161   vs         411,294 / 28,939    :   -61.77% / -88.43%
    Sequential:      1,111,645 / 319,936   vs         566,433 / 70,555    :   -49.05% / -77.95%

Latency in ns (Read/Write)
        Random:      488,575 / 2,559,169   vs    1,951,327 / 9,963,974    :   299.39% / 289.34%
    Sequential:      451,978 / 2,287,137   vs    1,439,435 / 7,084,941    :   218.47% / 209.77%

So no change

@LarsBingBong

Just quickly the CPU idleness ...

[screenshot of CPU idleness metrics omitted]

Clearly NOT the issue.

@yasker
Member

yasker commented Mar 21, 2022

Hi @LarsBingBong , thanks for the benchmarking.

In general, we don't recommend running Longhorn on top of another software-defined storage layer. The performance characteristics can be very hard to determine in that case. Though I know many users are running Longhorn on top of VMware vSAN, so we will look into that. cc @joshimoo

@LarsBingBong

Thank you very much for chiming in @yasker - much appreciated. I can imagine you have a very busy schedule. Okay, I wasn't aware of the "we don't recommend running Longhorn on top of another software-defined storage" guidance, but I can see there being a potential issue there.

Would love for that to work better performance wise. So, I'm happy that the Longhorn team will give it a look. Much appreciated indeed!
Especially as I don't really see any alternative to Longhorn. OSS and all that nice jazz.

@LarsBingBong

LarsBingBong commented Mar 21, 2022

The results were a bit better when I tried without the Cilium CNI, using Flannel instead.

[screenshot of the benchmark results with Flannel omitted]

But, still - not the smoking gun.


I'll stop my testing for now and wait for the Longhorn team to give this a look - ;-) @joshimoo

@liyimeng
Author

@LarsBingBong I don't remember if the fio test runs with fsync. If not, vSAN might cheat by caching writes locally, while Longhorn, by the nature of its design, never caches writes or reads locally. If you really want to see what Longhorn can truly achieve, better to go with raw disks (see the fio sketch just below).

But again, even in my raw-disk testing, Longhorn sees a significant drop in performance. @yasker Good to see you are still around :D. Looking forward to seeing Longhorn come with a new engine update.
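One way to check whether a write cache underneath is flattering the numbers is to re-run the same 4K random-write job with per-I/O syncs, mirroring the commands earlier in this thread (the target path is a placeholder):

# Same job as above; -fsync=1 forces a flush after every write, which defeats
# write-back caching in the layer underneath the filesystem.
fio -direct=1 -filename=/path/to/testfile -iodepth 16 -thread -rw=randwrite -ioengine=libaio -numjobs=16 -runtime=300 -group_reporting -name=4K-100Write100random -fsync=1 -bs=4k -end-fsync=1 -size=2g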

@LarsBingBong

@joshimoo just kindly asking whether you have plans to look at this? Thank you very much. The Longhorn product and all you do to make it prosper are greatly appreciated.

@joshimoo
Contributor

@LarsBingBong thanks for all the tests; we are very much looking into general IO performance enhancements.

@keithalucas is currently working on the SPDK frontend tracked at #3202
Some of that code can be found below, but it is not ready for end-user use yet.
https://github.com/longhorn/longhorn-spdk

We are also looking at smaller optimizations to the existing longhorn-engine to lower IO latency, which should improve overall IO performance.

@fzyzcjy

fzyzcjy commented Apr 26, 2022

Hi, is there any estimate of when it will be released? Without this, HDDs are almost unusable.

@liyimeng
Author

We are also looking at smaller optimizations to the existing longhorn-engine to lower IO latency, which should improve overall IO performance.

@joshimoo is this something coming shortly?

@gp187

gp187 commented May 16, 2022

This is still a huge issue. Any updates?

@thiscantbeserious

Seems like @keithalucas left Rancher mid-year and there hasn't been any work done on this since then.

So I guess SPDK is on hold then; is there anyone from Rancher still working on it?

@joshimoo
Contributor

joshimoo commented Nov 3, 2022

cc @innobead. @DamiaSan is currently looking into this.

@liyimeng
Author

liyimeng commented Nov 5, 2022

@joshimoo It has been more than 3 years since the Longhorn team started to look into SPDK. Is it technically achievable within the Longhorn framework?

@lictw

lictw commented Jan 19, 2023

While setting up a new cluster I performed a bunch of performance tests. I gathered the information into a Google Sheet and am posting it here. All tests were performed via kbench; all nodes have a solid 1 GiB/s local network and the same SSD drives. The cluster is vanilla K3s with a single master and Flannel networking.

Conclusions:

  1. The number of nodes running Longhorn doesn't affect performance; only the number of replicas and data locality matter.
  2. IOPS drops heavily even with a single replica on the same node, and doesn't change significantly as replicas increase.
  3. Write bandwidth drops as replicas increase, as it should, but read bandwidth decreases too, while it should stay stable or even increase thanks to parallelism across replicas.

So, Longhorn is really feature-rich, easy to use and stable, but it comes at a very high performance cost. I believe and hope that someday this price will be lower, so that my database migrations (a large number of operations, very dependent on latency) in testing environments take 2-3 times longer, not ~10 times like now.

@jazZcarabazZ


Hi, did you try the new data locality feature from the latest 1.4.0 release (the Data Locality setting, Strict Local)?

@lictw

lictw commented Jan 19, 2023

Hello! I'm currently using Longhorn 1.3.2; I will check the new release. Data locality was of course tested: there is a Same Node column in the table for data locality, indicating whether the node running the workload has a replica or not. Is Strict Local a new 1.4.0 feature?

@larssb

larssb commented Jan 19, 2023

Yes @lictw - Strict Local is a new v1.4.0 only feature.
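For anyone who wants to try it, a sketch of a strict-local StorageClass (values follow the Longhorn v1.4 documentation; strict-local requires a single replica, so treat this as a starting point and verify against the docs for your version):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-strict-local         # illustrative name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"               # strict-local only works with one replica
  dataLocality: "strict-local"
  staleReplicaTimeout: "30"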

@liyimeng
Author

According to the benchmark in #3957, it is still far from ideal.

@lictw

lictw commented Jan 19, 2023

Why do I see IOPS drop 5-6x compared to local-path in my tests, even with 1 replica and 1 node (so best-effort), when in your tests it's only about 2x? What could the reason be? It's a performance node with 2 x AMD EPYC 7282 (64 cores in total) with no other load during the tests, and a RAID 1 SSD with LVM under /var/lib/longhorn; I can't understand what's wrong there. K3s version is 1.21.14; could that matter?

@derekbit
Member


@lictw Suggest providing your steps, scripts and disk model. Thank you.

@lictw

lictw commented Jan 23, 2023

I used kbench with a 10 GiB test file; the StorageClass parameters were:

parameters:
  dataLocality: disabled
  fstype: xfs
  numberOfReplicas: "1"
  staleReplicaTimeout: "30"

Steps:

  1. Install K3s v1.21.14 (single node).
  2. Install Longhorn v1.3.2 (CSI components single replica).
  3. Start testing.

Disks are 2x INTEL SSDSC2KB96 960GB, RAID 1 with LVM, with an LV for /var/lib/longhorn. Thanks in advance.

@derekbit
Member

derekbit commented Jan 24, 2023 via email

@lictw

lictw commented Jan 24, 2023

local-path is the first row (the baseline) in the table; all other results are compared to it. I will test ext4; I use xfs because of the lost+found directory.

@safts

safts commented May 17, 2024

I'm seeing similar performance. I've used kbench to test 2 combinations:

  • local-path vs longhorn (data locality: best-effort, 3 replicas)
  • local-path vs strict-local (data locality: strict-local)

My nodes have 1 Gbit networking, so that should be a bottleneck in the Longhorn case. However, it shouldn't be in the strict-local case, and that's pretty much as bad.

TEST_FILE: /volume1/test
TEST_OUTPUT_PREFIX: Local-Path
TEST_SIZE: 30G
MODE: full
Benchmarking iops.fio into Local-Path-iops.json
Benchmarking bandwidth.fio into Local-Path-bandwidth.json
Benchmarking latency.fio into Local-Path-latency.json
TEST_FILE: /volume2/test
TEST_OUTPUT_PREFIX: Longhorn
TEST_SIZE: 30G
MODE: full
Benchmarking iops.fio into Longhorn-iops.json
Benchmarking bandwidth.fio into Longhorn-bandwidth.json
Benchmarking latency.fio into Longhorn-latency.json

================================
FIO Benchmark Comparsion Summary
For: Local-Path vs Longhorn
CPU Idleness Profiling: enabled
Size: 30G
Quick Mode: disabled
================================
                              Local-Path   vs                 Longhorn    :              Change
IOPS (Read/Write)
        Random:         102,215 / 99,646   vs            5,102 / 2,869    :   -95.01% / -97.12%
    Sequential:          26,542 / 96,409   vs            7,388 / 4,575    :   -72.16% / -95.25%
  CPU Idleness:                      72%   vs                      43%    :                -29%

Bandwidth in KiB/sec (Read/Write)
        Random:        438,565 / 433,142   vs         103,426 / 53,543    :   -76.42% / -87.64%
    Sequential:        438,188 / 420,605   vs         101,684 / 54,092    :   -76.79% / -87.14%
  CPU Idleness:                      90%   vs                      38%    :                -52%

Latency in ns (Read/Write)
        Random:          92,931 / 38,282   vs    1,297,516 / 1,661,621    : 1296.21% / 4240.48%
    Sequential:          43,091 / 39,181   vs    1,358,046 / 1,824,472    : 3051.58% / 4556.52%
  CPU Idleness:                      81%   vs                      81%    :                  0%
TEST_FILE: /volume1/test
TEST_OUTPUT_PREFIX: Local-Path
TEST_SIZE: 30G
MODE: full
Benchmarking iops.fio into Local-Path-iops.json
Benchmarking bandwidth.fio into Local-Path-bandwidth.json
Benchmarking latency.fio into Local-Path-latency.json
TEST_FILE: /volume2/test
TEST_OUTPUT_PREFIX: Strict-Local
TEST_SIZE: 30G
MODE: full
Benchmarking iops.fio into Strict-Local-iops.json
Benchmarking bandwidth.fio into Strict-Local-bandwidth.json
Benchmarking latency.fio into Strict-Local-latency.json

================================
FIO Benchmark Comparsion Summary
For: Local-Path vs Strict-Local
CPU Idleness Profiling: enabled
Size: 30G
Quick Mode: disabled
================================
                              Local-Path   vs             Strict-Local    :              Change
IOPS (Read/Write)
        Random:        104,861 / 103,289   vs            4,849 / 4,969    :   -95.38% / -95.19%
    Sequential:         97,949 / 105,513   vs            7,838 / 7,193    :   -92.00% / -93.18%
  CPU Idleness:                      70%   vs                      49%    :                -21%

Bandwidth in KiB/sec (Read/Write)
        Random:        440,566 / 435,960   vs        124,951 / 124,133    :   -71.64% / -71.53%
    Sequential:        441,136 / 436,143   vs        128,719 / 127,085    :   -70.82% / -70.86%
  CPU Idleness:                      91%   vs                      32%    :                -59%

Latency in ns (Read/Write)
        Random:          84,898 / 36,534   vs        876,757 / 823,783    :  932.72% / 2154.84%
    Sequential:          32,588 / 34,382   vs        850,450 / 910,298    : 2509.70% / 2547.60%
  CPU Idleness:                      80%   vs                      77%    :                 -3%

I am wondering what could be going on. I would expect strict-local to be significantly better (also tbh I would expect the best-effort locality option to use the replica on the node being benchmarked at least for reads, which I'm not sure is happening).
