Epic: Data Path Performance #1541
@yasker do you have any benchmark numbers to share? The only thing I found on the Internet is https://itnext.io/state-of-persistent-storage-in-k8s-a-benchmark-77a96bb1ac29, and it suggests Longhorn's performance is not impressive.
@liyimeng We're still working on optimizing the performance, but that article is not an apples-to-apples comparison: it pits Longhorn (which is crash-consistent and syncs writes to multiple replicas) against solutions that are either cached (e.g. without O_DIRECT), async (Piraeus), or unreplicated (a single replica). A more valid comparison for that picture looks more like this: Nonetheless, Longhorn is not the fastest in the group, and we're aiming to change that.
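To illustrate the "apples-to-apples" point above, here is a hypothetical fio job file (the path and job name are made up for illustration) that removes the two unfair advantages mentioned: `direct=1` bypasses the page cache so cached backends don't win by serving reads from memory, and `fsync=1` makes every write durable, which is closer to Longhorn's crash-consistent behaviour than an async setup:

```ini
; Hypothetical job for a fairer comparison across storage backends.
[randwrite-durable]
filename=/mnt/testvol/bench.img  ; assumed mount point of the PVC under test
size=10g
rw=randwrite
bs=4k
iodepth=64
ioengine=libaio
direct=1        ; bypass the page cache (O_DIRECT)
fsync=1         ; fsync after every write, forcing durability
runtime=60
time_based=1
```

Running the same job file against each backend keeps the consistency semantics constant, so the remaining differences are attributable to the storage layer itself.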
@yasker thanks for sharing the info and the awesome work!
We already have SPDK (which is what Mayastor uses for its frontend) on our roadmap for the v1.2 release.
@yasker super exciting 👍
@yasker do you have metrics comparing Longhorn vs. Ceph performance? We had almost decided to use Longhorn, but after reading the Longhorn blog post about the IOPS hit one has to take compared to a bare-metal disk, we started exploring other options.
bump |
Seems there's been some competitor performance testing; it will clearly be skewed toward their product. Of note, though, is the conclusion section, where Longhorn misses a trick by not doing local caching and reads from memory.
Hi @yasker,

We are currently evaluating whether we can switch to a distributed storage solution for our Rancher vSphere clusters, to replace the native vSphere storage (currently using the in-tree provisioner, as vSphere is < v7), which at times is quite flaky. Some of the deployments we have are unfortunately very I/O sensitive, so I am trying to figure out how much of a decline in performance we will have. This is why I am running performance tests comparing local-path/vsphere (which have pretty much identical performance) with Longhorn v1.2 and rook-ceph v1.7.2.

To be honest, I don't understand the results I am getting: they are very bad on the distributed storage side (for both Longhorn and Ceph), so maybe I am doing something wrong? For benchmarking I mostly relied on the https://github.com/yasker/kbench test, but also ran some general iperf / dd steps to establish a baseline.

BASELINE:

iperf over 60s:

`dd if=/dev/zero of=here bs=1G count=1 oflag=direct`

From those results I presumed that the default 30 GB PVC of kbench should be more than sufficient to avoid caching effects (5 Gbit * 25 = ~15.6 GB).

kbench standalone results for local-path / vsphere in-tree kbench (average of 5 runs) are as follows:
I then compared vsphere/longhorn, vsphere/ceph, and longhorn/ceph, where both distributed systems were configured to keep 3 replicas spread over multiple physical hosts (worst-case bandwidth scenario):
I also did some standalone kbench tests for each, but they more or less confirmed the results above.
I am somewhat baffled by these results. Shouldn't performance drastically increase if I take away the distribution part? Obviously write performance improved a little, but it is nowhere near local-path / vsphere bandwidth. As I'm no expert on storage, I didn't try optimizing anything in the Longhorn / Ceph setups and just went with the defaults. So maybe it's simply a configuration issue, but I wouldn't know where to begin. If anyone can provide input on what to try or do differently, I would greatly appreciate it.
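The dd baseline described above can be reproduced at a smaller scale. This is a sketch, not the exact original command: the original used `bs=1G count=1 oflag=direct`, while `conv=fdatasync` (shown here) flushes data to disk at the end instead, which also works on filesystems where O_DIRECT is unsupported (e.g. tmpfs):

```shell
# Scaled-down write-throughput baseline: 64 x 1 MiB sequential writes,
# flushed to stable storage before dd reports its timing.
out=$(mktemp)
dd if=/dev/zero of="$out" bs=1M count=64 conv=fdatasync 2>/dev/null
wc -c < "$out"   # 64 MiB = 67108864 bytes written
```

For the distributed backends, the interesting comparison is running the same command inside a pod against the PVC mount path rather than against the node's local disk.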
Hi @h0lk, can you describe your vSphere storage setup? E.g. SSD vs. spinning disk, how many disks per node, and how much memory each node has. Also, it seems the vSphere storage is running on a different network (or is it non-HA?), since it can exceed the maximum possible bandwidth of 125 MB/s.
As for suggestions: if you can get a 10G network and more CPUs, I think the results will improve a lot, especially on the bandwidth side. But I would not expect it to be on the same level as a native disk in terms of IOPS; matching the bandwidth is possible. You can take a look at https://longhorn.io/blog/performance-scalability-report-aug-2020/ for more information on our last benchmark result. We will update it with v1.2.0 soon.
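The 125 MB/s figure above is just the payload ceiling of a 1 Gbit/s link (1000 / 8). As a hypothetical back-of-envelope (the link speed and replica count below are illustrative assumptions, not measurements), replication tightens that ceiling further, since each client write fans out to every replica over the same NIC:

```shell
# Assumed numbers for illustration: 1 Gbit/s NIC, 3 replicas.
link_mbit=1000
replicas=3
# Payload ceiling of the NIC in MB/s.
max_mb_per_s=$(( link_mbit / 8 ))
# Worst case: every write crosses the wire once per replica.
per_volume=$(( max_mb_per_s / replicas ))
echo "NIC ceiling: ${max_mb_per_s} MB/s, worst-case per-volume write ceiling: ~${per_volume} MB/s"
```

This is why a 10G network helps so much on the bandwidth side: the same arithmetic with `link_mbit=10000` raises the worst-case per-volume ceiling by an order of magnitude.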
Did you update the performance report for v1.2? |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This serves as the general thread for performance-related discussion in Longhorn.