[FEATURE] Support multi-network K8s clusters (storage network) #2285

Closed
janeczku opened this issue Mar 2, 2021 · 17 comments
Labels:
  • area/install-uninstall-upgrade: Install, Uninstall or Upgrade related
  • area/storage-network: Storage network for control plane or data plane
  • area/ui: UI related like UI or CLI
  • highlight: Important feature/issue to highlight
  • investigation-needed: Need to identify the case before estimating and starting the development
  • kind/feature: Feature request, new feature
  • priority/0: Must be fixed in this release (managed by PO)
  • require/lep: Require adding/updating enhancement proposal

janeczku (Contributor) commented Mar 2, 2021

Is your feature request related to a problem? Please describe.

A typical bare-metal server/network architecture involves servers being attached to multiple networks, including a dedicated storage fabric that provides accelerated throughput and IO separation from the data plane.

Longhorn does not support segregated networks and uses the default CNI network for all storage traffic. That makes it impossible to guarantee the network resources available for storage operations, because the same network is shared with all in-cluster application traffic.

Describe the solution you'd like

(Diagram: Multi-Network-Storage)

The canonical approach to providing multi-network capabilities in a K8s cluster is a CNI multiplexer such as Multus, which provides the means to attach pods to additional underlay networks.

Longhorn should integrate with Multus.
That could be implemented by exposing a global setting that lets the admin specify the Multus network to be used for all storage-related traffic (e.g. engine -> replica). Longhorn would then configure its pods with the Multus-specific annotation, e.g. k8s.v1.cni.cncf.io/networks: storage-net.

The setup and configuration of the Multus network (e.g. creation of the NetworkAttachmentDefinition) would be out of scope for Longhorn.
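To make the proposed flow concrete, here is a minimal sketch (not part of the proposal itself) of what the admin-created attachment and the resulting pod annotation could look like. The macvlan/whereabouts configuration, the eth1 master interface, the subnet, and the pod name are illustrative assumptions only:

```yaml
# Hypothetical NetworkAttachmentDefinition created by the cluster admin
# (its setup is out of Longhorn's scope, as noted above).
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-net
  namespace: longhorn-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
---
# Longhorn would then add the Multus annotation to its storage-related pods,
# so Multus attaches them to the storage network in addition to the default CNI network.
apiVersion: v1
kind: Pod
metadata:
  name: storage-pod-example   # illustrative name only
  namespace: longhorn-system
  annotations:
    k8s.v1.cni.cncf.io/networks: storage-net
spec:
  containers:
    - name: example
      image: busybox
      command: ["sleep", "infinity"]
```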

Describe alternatives you've considered
N/A

@janeczku janeczku changed the title [FEATURE] Longhorn should support dedicated storage network [FEATURE] Longhorn should support multi-network K8s clusters Mar 2, 2021
@janeczku janeczku changed the title [FEATURE] Longhorn should support multi-network K8s clusters [ENHANCEMENT] Longhorn should support multi-network K8s clusters Mar 2, 2021
@joshimoo joshimoo added the kind/feature Feature request, new feature label Mar 2, 2021
@yasker yasker added this to the v1.1.2 milestone Mar 2, 2021
@yasker yasker changed the title [ENHANCEMENT] Longhorn should support multi-network K8s clusters [ENHANCEMENT] Longhorn should support multi-network K8s clusters (storage network) Apr 25, 2021
@innobead innobead added area/install-uninstall-upgrade Install, Uninstall or Upgrade related highlight Important feature/issue to highlight labels Apr 26, 2021
@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@innobead innobead added reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one priority/1 Highly recommended to fix in this release (managed by PO) labels May 12, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 May 19, 2021
@innobead innobead removed the reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one label May 19, 2021
@yasker yasker added the reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one label May 24, 2021
@yasker yasker modified the milestones: v1.3.0, v1.2.0 May 24, 2021
yasker (Member) commented May 24, 2021

Considering moving this back to v1.2.0.

@innobead innobead added investigation-needed Need to identify the case before estimating and starting the development and removed reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one labels May 25, 2021
@yasker yasker modified the milestones: v1.2.0, v1.3.0 May 26, 2021
yasker (Member) commented May 26, 2021

Move out to v1.3.0.

@innobead (Member)

cc @jenting

jenting (Contributor) commented Jul 19, 2021

Interested.

As far as I know, Harvester uses Multus, so this would benefit the Harvester project.

Also, it would make sure the Longhorn engine <-> replica network throughput is not affected by other applications.

@c3y1huang (Contributor)

For the longhorn-engine/replica part, the data server and sync-agent server are the data plane. Having the setting allows the user to specify which network interface carries the data plane. The longhorn-engine gets the network interface IPs from the Pod and creates the data and sync-agent servers bound to that interface's IP address. (We need a way to report an error if the network interface is not found.) Changing the setting requires restarting the longhorn-engine Pods to recreate the data and sync-agent servers (we need to consider whether the volume has to be detached or not).

The affected components are the replica process, the engine process, and iscsiadm. Currently, the challenge is how to overcome the fact that the Multus CNI network is not visible on the host network, where iscsiadm needs to use it.
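For reference, one way the interface IP can be discovered is the status annotation that Multus writes on the pod. A hedged sketch of what that could look like follows; the interface name lhnet1, network names, and addresses are illustrative assumptions, not Longhorn's actual values:

```yaml
# Hypothetical pod metadata excerpt. Multus records the attached networks and the
# IPs it assigned in the network-status annotation (older Multus releases used
# k8s.v1.cni.cncf.io/networks-status). The engine/replica processes, or the manager
# starting them, could parse this to pick the storage-network IP for the data and
# sync-agent servers.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "storage-net", "interface": "lhnet1"}]'
    k8s.v1.cni.cncf.io/network-status: |
      [
        {"name": "cbr0", "interface": "eth0", "ips": ["10.42.0.15"], "default": true},
        {"name": "longhorn-system/storage-net", "interface": "lhnet1", "ips": ["192.168.100.10"], "default": false}
      ]
```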

innobead (Member) commented Nov 15, 2021

@joshimoo @keithalucas @derekbit, any concerns about what @c3y1huang described? I don't think we need to deal with the communication between the iSCSI client and server (longhorn engine) over the storage network; what we actually care about is the following:

  • data traffic between engine and replicas
  • data traffic between replicas

@keithalucas

The iSCSI initiator daemon (iscsid) runs outside of a pod, on the node. iscsiadm instructs iscsid to establish a connection with the tgt process, which also runs on the node but within a pod (the engine instance manager pod). The iscsid-to-tgt communication is local to that node and shouldn't be transported over the network at all, so we don't need to configure it to use the storage network interface.

I agree that we only need to care about using the storage network interface for the traffic between the longhorn engine and the longhorn replicas during normal read and write operations, and for the traffic between replicas during syncing.

@c3y1huang c3y1huang added the require/lep Require adding/updating enhancement proposal label Nov 16, 2021
shuo-wu (Contributor) commented Dec 22, 2021

The iscsid to tgt communication is local to that node and shouldn't be transported over the network at all

Sorry, I didn't follow this statement. If I understand correctly, the client and the server are always on the same node but in different network namespaces, and the connection is always a TCP connection. Whether we need to switch to the storage network interface depends entirely on how we define the storage network: is it for data transmission only, or for the whole storage system? There is no extra overhead for the switch.

After checking the LEP, I would consider the network to be for the whole storage system. In that case I prefer to use only the storage network IP in the engine (including when launching the tgt target). Otherwise, it would be confusing which IP we should use when invoking the APIs in the engine process.

c3y1huang (Contributor) commented Jan 12, 2022

Moved out of the sprint to work on the dependent issue #2821, as discussed in #3277 (comment).

c3y1huang (Contributor) commented Mar 10, 2022

Move this out of the sprint.
#3546 (comment)

innobead (Member) commented May 3, 2022

cc @smallteeths

longhorn-io-github-bot commented Jun 1, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: Storage network deployment update #3990
    The PR for the chart change is at: Storage network deployment update #3990

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc.) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area manager, instance-manager, backing-image-manager
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at Add LEP for Storage Network Through gRPC Proxy #3277

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at:

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at storage-network: do not reset setting longhorn-tests#984
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at storage-network: manual setup and test longhorn-tests#980

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

yangchiu (Member) commented Jun 6, 2022

Verified passed on v1.3.0-rc2 by following the manual test plan.

Created eth0 (10.0.1.0/24) and eth1 (10.0.2.0/24) for each instance, and let the pods communicate with each other on eth1 using Multus.
Remember to disable the source/destination check, otherwise the volume replicas would keep failing.
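For anyone reproducing this, the released feature is driven by a Longhorn setting that points at the Multus attachment. A hedged sketch of what enabling it could look like, assuming the setting is named storage-network and takes a <namespace>/<NetworkAttachmentDefinition> value (verify the exact name, CR shape, and API version against the Longhorn documentation for the release in use):

```yaml
# Hypothetical example of pointing Longhorn at the Multus attachment used in the
# verification above (eth1, subnet 10.0.2.0/24). The Setting CR shape and the
# storage-network name are assumptions; consult the Longhorn docs.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: storage-network
  namespace: longhorn-system
value: longhorn-system/storage-net
```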

@yangchiu yangchiu closed this as completed Jun 6, 2022
@innobead innobead changed the title [FEATURE] Longhorn should support multi-network K8s clusters (storage network) [FEATURE] Support multi-network K8s clusters (storage network) Jun 15, 2022