[FEATURE] Support multi-network K8s clusters (storage network) #2285

Closed
janeczku opened this issue Mar 2, 2021 · 17 comments
Labels:
  • area/install-uninstall-upgrade: Install, Uninstall or Upgrade related
  • area/storage-network: Storage network for control plane or data plane
  • area/ui: UI related like UI or CLI
  • highlight: Important feature/issue to highlight
  • investigation-needed: Need to identify the case before estimating and starting the development
  • kind/feature: Feature request, new feature
  • priority/0: Must be fixed in this release (managed by PO)
  • require/lep: Require adding/updating enhancement proposal

janeczku (Contributor) commented Mar 2, 2021

Is your feature request related to a problem? Please describe.

A typical bare-metal server/network architecture involves servers being attached to multiple networks, including a dedicated storage fabric that provides accelerated throughput and IO separation from the data plane.

Longhorn does not support segregated networks and uses the default CNI network for all storage traffic. That makes it impossible to guarantee the network resources available for storage operations, because the same network is shared with all in-cluster application traffic.

Describe the solution you'd like

(Diagram: Multi-Network-Storage)

The canonical approach to providing multi-network capabilities in a K8s cluster is a CNI multiplexer such as Multus, which provides the means to attach pods to additional underlay networks.

Longhorn should integrate with Multus.
That could be implemented by exposing a global setting that lets the admin specify the Multus network to be used for all storage-related traffic (e.g. engine -> replica). Longhorn would then configure its pods with the Multus-specific annotation, e.g. k8s.v1.cni.cncf.io/networks: storage-net.

The setup and configuration of the Multus network (e.g. creation of the NetworkAttachmentDefinition) would be out of scope for Longhorn.
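To make the proposed flow concrete, here is a minimal sketch (not part of the proposal itself) of what the admin-created attachment and the resulting pod annotation could look like. The macvlan/whereabouts configuration, the eth1 master interface, the subnet, and the pod name are illustrative assumptions only:

```yaml
# Hypothetical NetworkAttachmentDefinition created by the cluster admin
# (its setup is out of Longhorn's scope, as noted above).
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-net
  namespace: longhorn-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
---
# Longhorn would then add the Multus annotation to its storage-related pods,
# so Multus attaches them to the storage network in addition to the default CNI network.
apiVersion: v1
kind: Pod
metadata:
  name: storage-pod-example   # illustrative name only
  namespace: longhorn-system
  annotations:
    k8s.v1.cni.cncf.io/networks: storage-net
spec:
  containers:
    - name: example
      image: busybox
      command: ["sleep", "infinity"]
```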

Describe alternatives you've considered
N/A

@janeczku janeczku changed the title [FEATURE] Longhorn should support dedicated storage network [FEATURE] Longhorn should support multi-network K8s clusters Mar 2, 2021
@janeczku janeczku changed the title [FEATURE] Longhorn should support multi-network K8s clusters [ENHANCEMENT] Longhorn should support multi-network K8s clusters Mar 2, 2021
@joshimoo joshimoo added the kind/feature Feature request, new feature label Mar 2, 2021
@yasker yasker added this to the v1.1.2 milestone Mar 2, 2021
@yasker yasker changed the title [ENHANCEMENT] Longhorn should support multi-network K8s clusters [ENHANCEMENT] Longhorn should support multi-network K8s clusters (storage network) Apr 25, 2021
@innobead innobead added area/install-uninstall-upgrade Install, Uninstall or Upgrade related highlight Important feature/issue to highlight labels Apr 26, 2021
@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@innobead innobead added reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one priority/1 Highly recommended to fix in this release (managed by PO) labels May 12, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 May 19, 2021
@innobead innobead removed the reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one label May 19, 2021
@yasker yasker added the reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one label May 24, 2021
@yasker yasker modified the milestones: v1.3.0, v1.2.0 May 24, 2021
yasker (Member) commented May 24, 2021

Considering moving this back to v1.2.0.

@innobead innobead added investigation-needed Need to identify the case before estimating and starting the development and removed reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one labels May 25, 2021
@yasker yasker modified the milestones: v1.2.0, v1.3.0 May 26, 2021
yasker (Member) commented May 26, 2021

Move out to v1.3.0.

@innobead (Member)

cc @jenting

jenting (Contributor) commented Jul 19, 2021

Interested.

As far as I know, Harvester uses Multus, so this would benefit the Harvester project.

Also, it would make sure the Longhorn engine <-> replica network throughput is not affected by other applications.

@c3y1huang (Contributor)

For the longhorn-engine/replica part, the data server and sync-agent server are the data plane. Having the setting allows the user to specify which network interface carries the data plane. The longhorn-engine gets the network interface IPs from the Pod and creates the data and sync-agent servers bound to that interface's IP address. (We need a way to report an error if the network interface is not found.) Changing the setting requires restarting the longhorn-engine Pods to recreate the data and sync-agent servers (we need to consider whether the volume has to be detached or not).

The affected components are the replica process, the engine process, and iscsiadm. Currently, the challenge is how to overcome the fact that the Multus CNI network is not visible on the host network, where iscsiadm needs to use it.
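For reference, one way the interface IP can be discovered is the status annotation that Multus writes on the pod. A hedged sketch of what that could look like follows; the interface name lhnet1, network names, and addresses are illustrative assumptions, not Longhorn's actual values:

```yaml
# Hypothetical pod metadata excerpt. Multus records the attached networks and the
# IPs it assigned in the network-status annotation (older Multus releases used
# k8s.v1.cni.cncf.io/networks-status). The engine/replica processes, or the manager
# starting them, could parse this to pick the storage-network IP for the data and
# sync-agent servers.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "storage-net", "interface": "lhnet1"}]'
    k8s.v1.cni.cncf.io/network-status: |
      [
        {"name": "cbr0", "interface": "eth0", "ips": ["10.42.0.15"], "default": true},
        {"name": "longhorn-system/storage-net", "interface": "lhnet1", "ips": ["192.168.100.10"], "default": false}
      ]
```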

innobead (Member) commented Nov 15, 2021

@joshimoo @keithalucas @derekbit, any concerns about what @c3y1huang described? I don't think we need to deal with the communication between the iSCSI client and server (longhorn engine) over the storage network; what we actually care about is the following:

  • data traffic between engine and replicas
  • data traffic between replicas

@keithalucas

The iSCSI initiator daemon (iscsid) runs outside of a pod, on the node. iscsiadm instructs iscsid to establish a connection with the tgt process, which also runs on the node but within a pod (the engine instance manager pod). The iscsid-to-tgt communication is local to that node and shouldn't be transported over the network at all, so we don't need to configure it to use the storage network interface.

I agree that we only need to care about using the storage network interface for the traffic between the longhorn engine and the longhorn replicas during normal read and write operations, and for the traffic between replicas during syncing.

@c3y1huang c3y1huang added the require/lep Require adding/updating enhancement proposal label Nov 16, 2021
shuo-wu (Contributor) commented Dec 22, 2021

The iscsid to tgt communication is local to that node and shouldn't be transported over the network at all

Sorry, I didn't follow this statement. If I understand correctly, the client and the server are always on the same node but in different network namespaces, and the connection is always a TCP connection. Whether we need to switch to the storage network interface depends entirely on how we define the storage network: is it for data transmission only, or for the whole storage system? There is no extra overhead for the switch.

After checking the LEP, I would consider the network to be for the whole storage system. In that case I prefer to use only the storage network IP in the engine (including when launching the tgt target). Otherwise, it would be confusing which IP we should use when invoking the APIs in the engine process.

c3y1huang (Contributor) commented Jan 12, 2022

Moved out of the sprint to work on the dependent issue #2821, as discussed in #3277 (comment).

c3y1huang (Contributor) commented Mar 10, 2022

Move this out of the sprint.
#3546 (comment)

innobead (Member) commented May 3, 2022

cc @smallteeths

longhorn-io-github-bot commented Jun 1, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: Storage network deployment update #3990
    The PR for the chart change is at: Storage network deployment update #3990

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc.) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area manager, instance-manager, backing-image-manager
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at Add LEP for Storage Network Through gRPC Proxy #3277

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at:

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at storage-network: do not reset setting longhorn-tests#984
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at storage-network: manual setup and test longhorn-tests#980

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

yangchiu (Member) commented Jun 6, 2022

Verified passed on v1.3.0-rc2 by following the manual test plan.

Created eth0 (10.0.1.0/24) and eth1 (10.0.2.0/24) for each instance, and let the pods communicate with each other on eth1 using Multus.
Remember to disable the source/destination check, otherwise the volume replicas would keep failing.
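For anyone reproducing this, the released feature is driven by a Longhorn setting that points at the Multus attachment. A hedged sketch of what enabling it could look like, assuming the setting is named storage-network and takes a <namespace>/<NetworkAttachmentDefinition> value (verify the exact name, CR shape, and API version against the Longhorn documentation for the release in use):

```yaml
# Hypothetical example of pointing Longhorn at the Multus attachment used in the
# verification above (eth1, subnet 10.0.2.0/24). The Setting CR shape and the
# storage-network name are assumptions; consult the Longhorn docs.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: storage-network
  namespace: longhorn-system
value: longhorn-system/storage-net
```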

@yangchiu yangchiu closed this as completed Jun 6, 2022
@innobead innobead changed the title [FEATURE] Longhorn should support multi-network K8s clusters (storage network) [FEATURE] Support multi-network K8s clusters (storage network) Jun 15, 2022