
[FEATURE] Local volume for distributed data workloads #3957

Closed
innobead opened this issue May 10, 2022 · 21 comments
Labels
  • area/csi: CSI related like control/node driver, sidecars
  • area/v1-data-engine: v1 data engine (iSCSI tgt)
  • component/longhorn-instance-manager: Longhorn instance manager (interface between control and data plane)
  • component/longhorn-manager: Longhorn manager (control plane)
  • highlight: Important feature/issue to highlight
  • kind/feature: Feature request, new feature
  • priority/0: Must be fixed in this release (managed by PO)
  • require/doc: Require updating the longhorn.io documentation
  • require/lep: Require adding/updating enhancement proposal

Milestone
v1.4.0

innobead (Member) commented May 10, 2022

Is your feature request related to a problem? Please describe

Longhorn is a highly available, replica-based storage system. That is good for fault tolerance, read performance, data protection, etc., but on the other hand it also incurs extra costs, such as requiring more disk space for replication.

In some cases, especially distributed data workloads (StatefulSets) like databases (e.g. Cassandra, Kafka), the applications already handle their own data replication and sharding, so we should provide a better-suited volume type for these use cases while still supporting existing volume functionality like snapshotting and backup/restore.

Describe the solution you'd like

  • Extend Data Locality with a strict (enforced) mode that requires the one replica to be local, on the same node as the workload
  • Connect to the local replica via a local socket file instead of a TCP connection (a sketch of the idea follows below)
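
The following is a minimal Go sketch of the second bullet, assuming a hypothetical replicaAddress descriptor and socket path; it is not Longhorn's actual engine code, only an illustration of preferring a Unix domain socket over TCP when the replica sits on the same node.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// replicaAddress is a hypothetical descriptor of where a replica listens.
type replicaAddress struct {
	Local      bool   // true when the replica runs on the engine's node
	SocketPath string // hypothetical path, e.g. /var/run/longhorn/replica.sock
	TCPAddr    string // e.g. 10.42.0.5:10010
}

// dialReplica prefers the local Unix domain socket and falls back to TCP.
// Skipping the TCP/IP stack for the local replica is where the latency and
// IOPS gain of the strict locality mode is expected to come from.
func dialReplica(r replicaAddress) (net.Conn, error) {
	if r.Local && r.SocketPath != "" {
		return net.DialTimeout("unix", r.SocketPath, 5*time.Second)
	}
	return net.DialTimeout("tcp", r.TCPAddr, 5*time.Second)
}

func main() {
	conn, err := dialReplica(replicaAddress{Local: true, SocketPath: "/tmp/replica.sock"})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected over", conn.RemoteAddr().Network())
}
```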

Describe alternatives you've considered

N/A

Additional context

#1965

innobead added the kind/feature, priority/0, component/longhorn-manager, and highlight labels on May 10, 2022
innobead added this to the v1.4.0 milestone on May 10, 2022
innobead added the require/lep label on May 10, 2022
joshimoo added the component/longhorn-instance-manager, area/csi, and area/v1-data-engine labels on May 11, 2022
derekbit (Member) commented Jun 15, 2022

According to the benchmarking results, if the volume and its single replica are on the same node, latency and IOPS improve significantly.

[benchmark result image]

In addition, TCP can be replaced with a Unix domain socket to gain more performance in this case.

derekbit (Member) commented:

Replacing the TCP connection between the engine and the single replica with a Unix domain socket:

  • IOPS and latency are improved.
  • The bandwidth numbers are saturated, so the improvement is not visible there.

[benchmark result image]

joshimoo (Contributor) commented:

@derekbit good job on the evaluation :)

innobead (Member, Author) commented:

@derekbit as we discussed, let's add this to 1.4.0.

Bessonov commented:

This is great news! May I ask what brings IO down in comparison to the baseline?

innobead (Member, Author) commented:

This is great news! May I ask what brings IO down in comparison to the baseline?

It's more about the improvement in latency. Basically, there are three primary factors:

  • Data locality
  • Data transfer over Unix socket instead of TCP stack
  • No remote replication

longhorn-io-github-bot commented Nov 24, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

#4918

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

longhorn/longhorn-manager#1562
longhorn/longhorn-engine#771

  • Which areas/issues might this PR have potential impacts on?
    Area: data path, performance
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at lep: add local-volume #4928

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

Bessonov commented:

It's more about the improvement in latency. Basically, there are three primary factors:

  • Data locality
  • Data transfer over Unix socket instead of TCP stack
  • No remote replication

Thanks for your answer. Probably my question was misleading. I was asking about the difference between local-path-provisioner and Longhorn with data locality, a Unix socket, and without replication. The difference between 90,531 and 47,667, and also between 77,799 and 21,158, is still huge.

innobead (Member, Author) commented Nov 24, 2022

Thanks for your answer. Probably my question was misleading. I was asking about the difference between local-path-provisioner and Longhorn with data locality, a Unix socket, and without replication. The difference between 90,531 and 47,667, and also between 77,799 and 21,158, is still huge.

I see.

The Longhorn local volume is not meant to achieve performance similar to the local-path-provisioner you mentioned. It is still based on the existing data path, with the changes above ensuring strict data locality between the engine and the replica, which gains some IO performance compared with volumes using best-effort or disabled locality.

derekbit (Member) commented:

Performance update

[performance benchmark image]

michaelandrepearce commented:

Latency is still roughly 500% of local path.

derekbit (Member) commented Nov 24, 2022

@michaelandrepearce @Bessonov

The local volume's data path has not changed much in this improvement, in order to preserve existing functionality such as snapshotting, backup, and restore.

We will continue improving the local volume, e.g. with pass-through, to squeeze out more performance. However, the performance difference will still be significant after these improvements because of the existing data path design.

michaelandrepearce commented Nov 24, 2022

Sure, 150% or so is fine to get the overlay, but 500% on latency is a bit too much to buy; it makes it unusable for local-path use cases, where you want fast, low-latency disk access for systems that take care of replication themselves (Cassandra, Redpanda, Postgres, Chronicle stores). It's also quite a reduction in available IOPS: looking at the stats above, write drops from 97k down to 28k at best. This isn't to dismiss the work that has been done here; I think it's a great step in the right direction. It's just that, realistically, to fit local-PV use cases the performance is somewhat off at the moment. Did anything get done with SPDK? I think in the last discussions the idea was that it could help reduce some of that.

michaelandrepearce commented Nov 24, 2022

What I raised in #1965, which has been closed in favor of this issue, was, and I quote, a use case for local PVs akin to OpenEBS's local PV offerings (e.g. their lvm-localpv or hostpath-localpv) or MinIO's DirectPV, without needing to switch vendor, and while still having some unified management, e.g. UI, monitoring, backup.

"Using K8s native LocalPV's are useful as no network-based storage can keep up with baremetal in write IOPS/latency/throughput, when using NVME/Optane disks. Giving Direct I/O: Near-zero disk performance overhead"

derekbit (Member) commented Nov 25, 2022

Need to update the upgrade test image, because new options are added.
cc @longhorn/qa

chriscchien (Contributor) commented Nov 28, 2022

Hi @derekbit

When I attached a strict-local volume from one node to another node, the volume looped between attaching and detaching, and the only active button was Delete. Is this behavior expected? Thank you.

derekbit (Member) commented:

the volume looped between attaching and detaching

This is expected, because a strict-local volume must satisfy:

  1. a single replica
  2. the engine and the replica are on the same node

So, if you attach the volume to another node, the attach-detach loop happens because there is no replica on that node.
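
To make the constraint above concrete, here is a hypothetical Go sketch of the check; the field and function names are illustrative, not Longhorn's actual API. A strict-local volume has exactly one replica, and attaching it anywhere but the replica's node can never succeed.

```go
package main

import (
	"errors"
	"fmt"
)

// volume is an illustrative, simplified view of a Longhorn volume.
type volume struct {
	DataLocality     string // "disabled", "best-effort", or "strict-local"
	NumberOfReplicas int
	ReplicaNodeID    string // node hosting the single replica
}

// canAttach reports whether attaching the volume to nodeID can ever succeed.
// For a strict-local volume the engine and the replica must share a node, so
// attaching anywhere else only yields the attach/detach loop described above.
func canAttach(v volume, nodeID string) error {
	if v.DataLocality != "strict-local" {
		return nil
	}
	if v.NumberOfReplicas != 1 {
		return errors.New("strict-local volume must have exactly one replica")
	}
	if v.ReplicaNodeID != nodeID {
		return fmt.Errorf("replica is on %q, cannot attach to %q", v.ReplicaNodeID, nodeID)
	}
	return nil
}

func main() {
	v := volume{DataLocality: "strict-local", NumberOfReplicas: 1, ReplicaNodeID: "node-1"}
	fmt.Println(canAttach(v, "node-2")) // rejected: the replica lives on node-1
	fmt.Println(canAttach(v, "node-1")) // <nil>: allowed
}
```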

chriscchien (Contributor) commented:

Verified in Longhorn master 400b8c with the test steps.
Result: Pass

  1. Can successfully create a local volume with numberOfReplicas=1 and dataLocality=strict-local
  2. The webhook rejected the following cases when the volume is created or attached (a sketch of these rules follows below):
    • A local volume with dataLocality=strict-local but numberOfReplicas>1
    • Updating an attached local volume's numberOfReplicas to a value greater than one
    • Updating an attached local volume's dataLocality to disabled or best-effort
  3. The volume and the restored volume can be used by a workload and the data stays consistent (tested with a Deployment with nodeName set)
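
For reference, a small Go sketch of the admission rules exercised above, under the assumption that the webhook sees the old and new volume specs; this is illustrative only and not the actual longhorn-manager webhook code.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeSpec is an illustrative, simplified volume spec.
type volumeSpec struct {
	DataLocality     string // "disabled", "best-effort", or "strict-local"
	NumberOfReplicas int
	Attached         bool
}

// validateCreate rejects a strict-local volume with more than one replica.
func validateCreate(v volumeSpec) error {
	if v.DataLocality == "strict-local" && v.NumberOfReplicas > 1 {
		return errors.New("strict-local volume must use numberOfReplicas=1")
	}
	return nil
}

// validateUpdate rejects changes that would break an attached strict-local
// volume: raising the replica count or relaxing the locality mode.
func validateUpdate(oldSpec, newSpec volumeSpec) error {
	if !oldSpec.Attached || oldSpec.DataLocality != "strict-local" {
		return nil
	}
	if newSpec.NumberOfReplicas > 1 {
		return errors.New("cannot set numberOfReplicas>1 on an attached strict-local volume")
	}
	if newSpec.DataLocality != "strict-local" {
		return errors.New("cannot change dataLocality of an attached strict-local volume")
	}
	return nil
}

func main() {
	fmt.Println(validateCreate(volumeSpec{DataLocality: "strict-local", NumberOfReplicas: 3}))
	attached := volumeSpec{DataLocality: "strict-local", NumberOfReplicas: 1, Attached: true}
	fmt.Println(validateUpdate(attached, volumeSpec{DataLocality: "best-effort", NumberOfReplicas: 1, Attached: true}))
}
```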

innobead (Member, Author) commented:

the volume looped between attaching and detaching

This is expected, because a strict-local volume must satisfy:

  1. a single replica
  2. the engine and the replica are on the same node

So, if you attach the volume to another node, the attach-detach loop happens because there is no replica on that node.

After discussing with @derekbit, let's see if we need a validation hook to avoid this unnecessary (though intended) reconciling in this situation.

innobead (Member, Author) commented Nov 30, 2022

chriscchien (Contributor) commented:

Verified the test case "Node restart/down scenario" with Pod Deletion Policy When Node is Down set to delete-both-statefulset-and-deployment-pod on Longhorn master 066dde with a strict-local volume.
Result: Pass

With Pod Deletion Policy When Node is Down set to delete-both-statefulset-and-deployment-pod, after powering off the volume-attached node for 10 minutes and then powering it back up, the Deployment pod is eventually recreated and attached to the node the local volume is attached to, and the data stays consistent.
